Data Cleaning: Region Dimension

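1. Data import

The task is to import the AA_GXJSQYDC2019 data from the sample spreadsheet into the Hive data warehouse, and to import the region dimension table into the warehouse as well.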

(1) Rename the files, convert them to UTF-8, and upload them to the local filesystem of the Hive host
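Hive's default text SerDe reads files as UTF-8, so a file saved in another encoding typically shows garbled Chinese text after loading. The conversion step could look roughly like the sketch below. It assumes the source file is GBK-encoded (the original does not say which encoding it uses) and that iconv is installed; to_utf8 and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: convert SRC from a given encoding to UTF-8, writing DEST.
# The default source encoding (GBK) is an assumption; pass the real one as $3.
to_utf8() {
  iconv -f "${3:-GBK}" -t UTF-8 "$1" > "$2"
}

# Usage (file names are placeholders):
#   to_utf8 AA_GXJSQYDC2019.csv aa_2019.csv GBK
```

If iconv reports an illegal input sequence, the assumed source encoding is probably wrong.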


(2) Create the table aa_2019 in Hive

create table aa_2019(
  ID String,
  QA04 String,
  QA05 String,
  QA07 String,
  QA15 String,
  QA19 String,
  Hangye String,
  QB03 String,
  QB03ONE String,
  QB03TWO String,
  QB03_1 String,
  QB06 String,
  QB16 String,
  QB16V String,
  Gaoxin String,
  QB16_1 String,
  QB16_1V String,
  QC02 String,
  QC05_0 String,
  QC24 String,
  QC40 String,
  QD01 String,
  QD28 String,
  QJ09 String,
  QJ20 String,
  QJ55 String,
  QJ74 String,
  Diyu String,
  SYEAR String
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

 

Load the local file into Hive:

 

load data local inpath '/kkb/install/apache-hive-3.1.2-bin/testdate/aa_2019.csv' into table aa_2019;

Check that the data loaded correctly:

select * from aa_2019 limit 10;
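With Hive's default delimited text format, malformed rows are generally not rejected: missing trailing fields come back as NULL and extra fields are ignored. A field count on the CSV before loading can therefore catch problems that a look at the first ten rows would miss. A minimal sketch, assuming no field contains an embedded or quoted comma; field_count_mismatches is a hypothetical helper name.

```shell
# Hypothetical helper: report every line of FILE whose comma-separated
# field count differs from EXPECTED. Assumes no quoted or embedded commas.
field_count_mismatches() {
  awk -F',' -v n="$2" 'NF != n { print "line " FNR ": " NF " fields" }' "$1"
}

# Usage: aa_2019 is declared with 29 columns, diyu with 2.
#   field_count_mismatches aa_2019.csv 29
#   field_count_mismatches diyu.csv 2
```

No output means every line has the expected number of fields.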

(3) Create the table diyu in Hive

create table diyu(
  dm String,
  dmms String
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

 

Load the local file into Hive:

 

load data local inpath '/kkb/install/apache-hive-3.1.2-bin/testdate/diyu.csv' into table diyu;


Check that the data loaded correctly:

select * from diyu limit 10;


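2. Data cleaning

Clean the region dimension field according to the standard dimension table.

(1) Skip the first row (the CSV header) of the diyu table. The property below does not delete anything from the file; it tells Hive to ignore the first line when reading the table: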

 

alter table diyu set TBLPROPERTIES ('skip.header.line.count'='1');
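An alternative is to drop the header line from the CSV before it is loaded, which leaves the table definition untouched and works the same way for any other file that carries a header. A minimal sketch, assuming each file has exactly one header line; strip_header is a hypothetical helper name.

```shell
# Hypothetical helper: copy SRC to DEST without its first line.
# Assumes the file has exactly one header line.
strip_header() {
  tail -n +2 "$1" > "$2"
}

# Usage (file names are placeholders):
#   strip_header diyu.csv diyu_noheader.csv
```

Use one approach or the other for a given table. Applying both would also discard the first data row.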

 

(2) Create the table aa_19 to hold the data after the region dimension has been cleaned:

 

create table aa_19(
  ID String, QA04 String, QA05 String,
  QA07 String, QA15 String, QA19 String,
  Hangye String, QB03 String, QB03ONE String,
  QB03TWO String, QB03_1 String, QB06 String,
  QB16 String, QB16V String, Gaoxin String,
  QB16_1 String, QB16_1V String, QC02 String,
  QC05_0 String, QC24 String, QC40 String,
  QD01 String, QD28 String, QJ09 String,
  QJ20 String, QJ55 String, QJ74 String,
  Diyu String, SYEAR String
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;


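(3) Clean the data by joining aa_2019 to diyu on the region code and writing the code followed by its description into the Diyu column: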

insert into table aa_19
select
  aa_2019.ID as ID, aa_2019.QA04 as QA04, aa_2019.QA05 as QA05,
  aa_2019.QA07 as QA07, aa_2019.QA15 as QA15, aa_2019.QA19 as QA19,
  aa_2019.Hangye as Hangye, aa_2019.QB03 as QB03, aa_2019.QB03ONE as QB03ONE,
  aa_2019.QB03TWO as QB03TWO, aa_2019.QB03_1 as QB03_1, aa_2019.QB06 as QB06,
  aa_2019.QB16 as QB16, aa_2019.QB16V as QB16V, aa_2019.Gaoxin as Gaoxin,
  aa_2019.QB16_1 as QB16_1, aa_2019.QB16_1V as QB16_1V, aa_2019.QC02 as QC02,
  aa_2019.QC05_0 as QC05_0, aa_2019.QC24 as QC24, aa_2019.QC40 as QC40,
  aa_2019.QD01 as QD01, aa_2019.QD28 as QD28, aa_2019.QJ09 as QJ09,
  aa_2019.QJ20 as QJ20, aa_2019.QJ55 as QJ55, aa_2019.QJ74 as QJ74,
  -- Diyu = region code followed by its description from the dimension table
  concat(aa_2019.QA19, diyu.dmms) as Diyu,
  aa_2019.SYEAR as SYEAR
from aa_2019
join diyu on (aa_2019.QA19 = diyu.dm);
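Because this is an inner join, any aa_2019 row whose QA19 code has no match in diyu.dm is silently left out of aa_19. A rough pre-check on the CSV files can list such codes before the insert runs. The sketch below assumes QA19 is the sixth comma-separated column (as in the table definition), that dm is the first column of the dimension file, that the dimension file is not empty, and that no field contains an embedded comma; unmatched_codes is a hypothetical helper name.

```shell
# Hypothetical helper: list the distinct values in column 6 (QA19) of FACT
# that do not appear in column 1 (dm) of DIM.
# Arguments: DIM FACT. Assumes plain comma-separated files, DIM non-empty.
unmatched_codes() {
  awk -F',' '
    NR == FNR { known[$1] = 1; next }   # first file: remember dimension codes
    !($6 in known) { print $6 }         # second file: emit codes with no match
  ' "$1" "$2" | sort -u
}

# Usage (file names are placeholders):
#   unmatched_codes diyu.csv aa_2019.csv
```

If the fact file has a header line, its column name shows up in the output unless the dimension file has a matching header; that entry can be ignored.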

(4) Cleaning result:

select * from aa_19 limit 10;


3. Data export

(1) Create the target table in MySQL:

create table aa_19(
  ID varchar(255),
  QA04 varchar(255),
  QA05 varchar(255),
  QA07 varchar(255),
  QA15 varchar(255),
  QA19 varchar(255),
  Hangye varchar(255),
  QB03 varchar(255),
  QB03ONE varchar(255),
  QB03TWO varchar(255),
  QB03_1 varchar(255),
  QB06 varchar(255),
  QB16 varchar(255),
  QB16V varchar(255),
  Gaoxin varchar(255),
  QB16_1 varchar(255),
  QB16_1V varchar(255),
  QC02 varchar(255),
  QC05_0 varchar(255),
  QC24 varchar(255),
  QC40 varchar(255),
  QD01 varchar(255),
  QD28 varchar(255),
  QJ09 varchar(255),
  QJ20 varchar(255),
  QJ55 varchar(255),
  QJ74 varchar(255),
  Diyu varchar(255),
  SYEAR varchar(255)
);

 

(2) Export the Hive table to MySQL with Sqoop:

bin/sqoop export \
--connect "jdbc:mysql://node01:3306/hive2?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password wyhhxx \
--table aa_19 \
--num-mappers 1 \
--export-dir /user/hive/warehouse/aa_19 \
--input-fields-terminated-by ","

 

(3) Export result:



4. Data visualization
