相关链接
目录
1.前言
- 需求:数仓使用的hive,上游业务库之一是mysql数据库。数据接入(使用自研平台)时发现字段中文名有乱码,要求在接入前先检查所有字段中文为乱码的表,整理出来,要求上游整改完毕后再接入。
- 思路:使用正则表达式
- mysql库使用正则过滤乱码字段
- 元数据接入hive后,使用hive正则过滤乱码字段
2.mysql
- mysql 建表 + 造数据
drop table account;
CREATE TABLE IF NOT EXISTS account (
uid int(11) DEFAULT NULL COMMENT '主键`',
uname varchar(10) DEFAULT NULL COMMENT '层级',
money int(11) DEFAULT NULL COMMENT '金额'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO account (uid, uname, money) VALUES
(1, 'asA123', 123),
(2, '中文字符', 123),
(3, 'abc', 123),
(4, 'ABC', 123),
(5, '12345', 123),
(6, 'ç»„ç»‡ç»“æž„ç‰ˆæœ¬ç¼–å · ', 123),
(7, '层级 ', 123);
2.1 取出所有乱码数据
- 这里要取出来uname这一列中存在乱码的数据,其中id=6和id=7的存在乱码情况
- 其中HEX是为了返回十六进制值的字符串表示形式,mysql不支持正则
\u4e00-\u9fa5
来过滤中文,需要转成十六进制后再通过正则匹配
-- 使用正则表达式
select
*
from
account t1
where
1=1
and hex(t1.uname) not REGEXP 'e[4-9][0-9a-f]{4}'
and t1.uname not REGEXP '[a-z]|[0-9]|[-]'
and t1.uname <> ''
- 查询结果
2.2 取出库中所有乱码的字段中文注释
- 这里的sql是从文章 SQL获取 MySQL 映射 Teradata 表结构 中改造的,修改了一下where条件
SELECT
'MySQL' AS "DB_TYPE",
T2.TABLE_SCHEMA AS "库名",
T2.TABLE_NAME AS "表名",
T2.TABLE_COMMENT AS "表中文名",
T1.ORDINAL_POSITION AS "序号",
T1.COLUMN_NAME AS "字段名",
T1.COLUMN_COMMENT AS "字段中文",
T1.COLUMN_TYPE AS SOURCE_COMBINE,
CASE
WHEN T1.IS_NULLABLE = 'YES'
THEN 'N'
WHEN T1.IS_NULLABLE = 'NO'
THEN 'Y'
ELSE
NULL
END AS "非空",
CASE
WHEN T1.COLUMN_KEY='PRI'
THEN 'Y'
ELSE
NULL
END AS "主键",
CASE
WHEN T2.TABLE_TYPE = 'base table'
THEN "表"
WHEN T2.TABLE_TYPE = 'view'
THEN "视图"
WHEN T2.TABLE_TYPE = 'system view'
THEN "MySQL系统表"
ELSE NULL
END AS "表/视图",
T2.TABLE_ROWS AS "数据量",
T1.DATA_TYPE AS "DATA_TYPE",
T1.CHARACTER_OCTET_LENGTH AS "字节数",
T1.CHARACTER_MAXIMUM_LENGTH AS "长度",
T1.NUMERIC_PRECISION AS "精度",
T1.NUMERIC_SCALE AS "标度"
FROM
information_Schema.`COLUMNS` T1
LEFT JOIN
information_schema.TABLES T2 ON T1.TABLE_NAME = T2.TABLE_NAME AND T1.TABLE_SCHEMA = T2.TABLE_SCHEMA
WHERE
1=1
and hex(t1.COLUMN_COMMENT) not REGEXP 'e[4-9][0-9a-f]{4}'
and t1.COLUMN_COMMENT not REGEXP '[a-z]|[0-9]|[-]'
and t1.COLUMN_COMMENT <> ''
ORDER BY
T2.TABLE_SCHEMA, -- 库名
T2.TABLE_NAME, -- 表名
T1.ORDINAL_POSITION -- 字段序号
- 查询结果
3.hive
- t1,t2,t3,t4 存放的是接入服务的元数据信息,接入服务都是通过此服务的元数据信息进行建表的
select
t4.`host` as `主机`,
t4.`port` as `端口`,
t3.database_name as `库名`,
t2.table_name as `表名`,
t1.column_name as `字段名`,
t1.column_comment as `字段中文`
from
stg.stg_databus_t_columns t1
left join stg.stg_t_tables t2 on t1.table_id = t2.id
left join stg.stg_t_databases t3 on t2.database_id = t3.id
left join stg.stg_t_source_database_info t4 on t3.source_database_id = t4.id
where
t1.pt='20211031000000'
and t1.column_comment not REGEXP '[\\s\\w\\d\u4e00-\u9fa5]|[-]'
and t1.column_comment <> ''
- 查询结果(hive只能在平台上查询)
21/11/01
M