不知道该选取什么字段
dwd dim层 orc snappy
前言
和前面的看板分析的方法一致。
讲义给出的叫做:需求文档
如图,我们要做的是:根据需求的要求,整理出指标和维度
BI看板展示:
需求分析
2.1 校区报名柱状图
说明:统计期内,全部报名客户中,各校区报名人数分布。
展现:柱状图
条件:年、月,校区
维度:天区间,按查询条件来定
指标:报名人数
粒度:天/线上线下/校区
数据来源:客户管理系统的customer_relationship、itcast_clazz报名课程表
2.2 学科报名柱状图
说明:统计期内,全部报名客户中,各学科报名人数分布。
展现:柱状图
条件:年、月,学科
维度:天区间,按查询条件来定
指标:报名人数
粒度:天/线上线下/学科
数据来源:客户管理系统的customer_relationship、itcast_clazz报名课程表
2.3 总报名量
说明:统计期内,已经缴费的报名客户总量。
展现:数值。
条件:年、月
维度:年、月
指标:报名客户总量
粒度:天
数据来源:客户管理系统的customer_relationship表
2.4 线上报名量
说明:总报名量中来源渠道为线*客渠道的报名总量
展现:线状图。
条件:年、月
维度:天区间,按查询条件来定
指标:报名客户总量
粒度:天
数据来源:客户管理系统的customer_relationship表
2.5 意向用户报名转化率
说明:统计期内,新增的意向客户中报名的客户占比。全部报名人数 / 全部新增的意向人数
展现:线状图。双轴:全部报名人数、报名转化率。
条件:年、月
维度:天/线上线下
指标:报名转化率=全部报名人数/全部新增的意向人数
粒度:天
数据来源:客户管理系统的customer_relationship表
2.6 有效线索报名转化率
说明:线上报名量 / 线上有效线索量,与上一个指标类似,此处的线索量需要排除已申诉数据。
展现:线状图。双轴:线上报名人数、线上报名转化率。
条件:年、月
维度:天/线上线下
指标:线上报名转化率=线上报名人数/线上有效线索量
粒度:天
数据来源:客户管理系统的customer_relationship表、customer_clue表、customer_appeal表
2.7 日报名趋势图
说明:统计期内,每天报名人数的趋势图。
展现:线状图。
条件:年、月
维度:天/线上线下
指标:报名人数
粒度:天
数据来源:客户管理系统的customer_relationship表
2.8 校区学科的报名学员TOP
说明:统计期内,全部报名学员中,校区学科排行榜,topN。A校区b学科第一,B校区a学科第二等等。
展现:柱状图
条件:年、月,校区,学科,数据量N
维度:天/线上线下
指标:报名学员人数
粒度:各校区各学科的报名人数和
数据来源:客户管理系统的customer_relationship表、itcast_clazz表
2.9 来源渠道占比
说明:统计期内,全部报名学员中,不同来源渠道的报名学员占比情况。
展现:饼状图
条件:年、月
维度:天/线上线下/来源渠道
指标:比值
数据来源:客户管理系统的customer_relationship表
2.10 咨询中心报名贡献
说明:统计期内,全部报名学员中,各咨询中心的报名学员人数占比情况。
展现:饼状图
条件:年、月,咨询中心
维度:天/线上线下/咨询中心
指标:报名学员人数
粒度:天/报名学员人数
数据来源:客户管理系统的customer_relationship表、employee表、scrm_department表
需求总结
指标
报名用户量,
意向客户量,
有效线索量
维度
时间:
年月日时
空间:
地区
线上线下
属性:
渠道
学科
涉及表和字段
事实表:
customer_relationship表
维度表:
itcast_clazz表
employee、scrm_department表
关联条件
id
date_day
注意事项
建模模板提示:建模要考虑到全量和增量。
特别是增量,以哪个字段来作为增量依据。
各个层的分区、分桶如何做来兼容增量采集。
ODS DIM 分析
ODS表
几个表:
4个表:
意向用户表
报名课程表
员工表
员工部门表
存储格式:
orc
压缩格式?
zlib
是否分区、分桶:
需要分桶
是否索引
需要
布隆索引
一个表不用添加
DIM层表
当前看板是否需要采集表到DIM层
需要
如有,如何存储,是否压缩,是否分区、是否分桶,是否索引?
orc存储
snappy压缩
起始时间分区
布隆索引
DWD层
ODS -> DWD 执行数据抽取、转换、清洗
抽取哪些字段?id
int COMMENT ‘客户关系id’,customer_id
int COMMENT ‘所属客户id’,origin_type
STRING COMMENT ‘数据来源’,payment_time
STRING COMMENT ‘支付状态变动时间字符串’,payment_time_hour
STRING COMMENT ‘支付状态变动小时’,itcast_clazz_id
int COMMENT ‘报名课程id’,creator
int COMMENT ‘创建人id’,origin_type_stat
STRING COMMENT ‘数据来源:0.线下;1.线上’
哪些字段需要转换?id
int COMMENT ‘客户关系id’,customer_id
int COMMENT ‘所属客户id’,origin_type
STRING COMMENT ‘数据来源’,payment_time
STRING COMMENT ‘支付状态变动时间字符串’,payment_time_hour
STRING COMMENT ‘支付状态变动小时’,itcast_clazz_id
int COMMENT ‘报名课程id’,creator
int COMMENT ‘创建人id’,origin_type_stat
STRING COMMENT ‘数据来源:0.线下;1.线上’
哪些数据需要清洗?id
int COMMENT ‘客户关系id’,customer_id
int COMMENT ‘所属客户id’,origin_type
STRING COMMENT ‘数据来源’,payment_time
STRING COMMENT ‘支付状态变动时间字符串’,payment_time_hour
STRING COMMENT ‘支付状态变动小时’,itcast_clazz_id
int COMMENT ‘报名课程id’,creator
int COMMENT ‘创建人id’,origin_type_stat
STRING COMMENT ‘数据来源:0.线下;1.线上’
DWM层
DWM层做 预聚合和维度退化
是否需要DWM层?
是
是否做维度退化?
是
customer_id 报名客户id
`itcast_school_id` '学校id',
itcast_school_name
‘学校namne’,itcast_subject_id
‘学科id’,itcast_subject_name
‘学科name’,tdepart_id
‘咨询中心id’,origin_type
‘来源渠道’,
‘origin_type_stat’ ‘数据来源:0.线下;1.线上’,payment_time_hour
‘最后更新时间’
DWS层
几个表?
1个
如何设计表?
signup_num
‘报名人数’,itcast_school_id
‘学校id’,itcast_school_name
‘学校namne’,itcast_subject_id
‘学科id’,itcast_subject_name
‘学科name’,tdepart_id
‘咨询中心id’,tdepart_name
‘咨询中心name’,origin_type
‘来源渠道’,
origin_type_stat ‘数据来源:0.线下;1.线上’,payment_time_hour
‘最后更新时间’,
groupType ‘业务分组类型:1:校区、学科组合分组;2:来源渠道分组;3:咨询中心分组;4:所有’,
time_type ‘聚合时间类型:1、按小时聚合;2、按天聚合;3、按周聚合;4、按月聚合;5、按年聚合。’
建表语句
ODS建表语句
ODS层的customer_relationship表可以直接复用。
itcast_clazz表是维表,放在维表层。
DIM建表语句
itcast_clazz报名课程表
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_dimen.itcast_clazz (
id int COMMENT ‘ems课程id(非自增)’,
create_date_time STRING COMMENT ‘创建时间’,
update_date_time STRING COMMENT ‘最后更新时间’,
deleted STRING COMMENT ‘是否被删除(禁用)’,
itcast_school_id STRING COMMENT ‘ems校区ID’,
itcast_school_name STRING COMMENT ‘ems校区名称’,
itcast_subject_id STRING COMMENT ‘ems学科ID’,
itcast_subject_name STRING COMMENT ‘ems学科名称’,
itcast_brand STRING COMMENT ‘ems品牌’,
clazz_type_state STRING COMMENT ‘班级类型状态’,
clazz_type_name STRING COMMENT ‘班级类型名称’,
teaching_mode STRING COMMENT ‘授课模式’,
start_time STRING COMMENT ‘开班时间’,
end_time STRING COMMENT ‘毕业时间’,
comment STRING COMMENT ‘备注’,
detail STRING COMMENT ‘详情(比如:27期)’,
uncertain STRING COMMENT ‘待定班(0:否,1:是)’,
tenant int COMMENT ‘租户’,
ends_time STRING COMMENT ‘有效时间’)
comment ‘班级信息表’
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
stored as orc
TBLPROPERTIES (‘orc.compress’=‘SNAPPY’,‘orc.create.index’=‘true’,‘orc.bloom.filter.columns’=‘id’);
DWD建表语句
customer_relationship_dwd表
CREATE TABLE IF NOT EXISTS itcast_dwd.customer_relationship_dwd (id
int COMMENT ‘客户关系id’,customer_id
int COMMENT ‘所属客户id’,origin_type
STRING COMMENT ‘数据来源’,payment_time
STRING COMMENT ‘支付状态变动时间字符串’,payment_time_hour
STRING COMMENT ‘支付状态变动小时’,itcast_clazz_id
int COMMENT ‘报名课程id’,creator
int COMMENT ‘创建人id’,origin_type_stat
STRING COMMENT ‘数据来源:0.线下;1.线上’ )
comment ‘客户关系表’
PARTITIONED BY (payment_time_year String, payment_time_month String, payment_time_day String)
CLUSTERED BY(id) sorted by(id) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’
stored as orc
TBLPROPERTIES (‘orc.compress’=‘SNAPPY’,‘orc.create.index’=‘true’,‘orc.bloom.filter.columns’=‘itcast_clazz_id,creator’);
DWM建表语句
报名数据中间表
itcast_dwm.customer_signup_dwm表
CREATE TABLE IF NOT EXISTS itcast_dwm.customer_signup_dwm (customer_id
int COMMENT ‘报名客户id’,itcast_school_id
STRING COMMENT ‘学校id’,itcast_school_name
STRING COMMENT ‘学校namne’,itcast_subject_id
STRING COMMENT ‘学科id’,itcast_subject_name
STRING COMMENT ‘学科name’,tdepart_id
int COMMENT ‘咨询中心id’,tdepart_name
STRING COMMENT ‘咨询中心name’,origin_type
STRING COMMENT ‘来源渠道’,
origin_type_stat STRING COMMENT ‘数据来源:0.线下;1.线上’,payment_time_hour
STRING COMMENT ‘最后更新时间’)
comment ‘报名人数统计中间表’
PARTITIONED BY (payment_time_year String, payment_time_month String, payment_time_day String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
stored as orc
TBLPROPERTIES (‘orc.compress’=‘SNAPPY’);
DWS建表语句
报名数据宽表
itcast_dws.customer_signup_dws表
drop table itcast_dws.customer_signup_dws;
create table if not exists itcast_dws.customer_signup_dws (signup_num
int COMMENT ‘报名人数’,itcast_school_id
STRING COMMENT ‘学校id’,itcast_school_name
STRING COMMENT ‘学校namne’,itcast_subject_id
STRING COMMENT ‘学科id’,itcast_subject_name
STRING COMMENT ‘学科name’,tdepart_id
int COMMENT ‘咨询中心id’,tdepart_name
STRING COMMENT ‘咨询中心name’,origin_type
STRING COMMENT ‘来源渠道’,
origin_type_stat STRING COMMENT ‘数据来源:0.线下;1.线上’,payment_time_hour
STRING COMMENT ‘最后更新时间’,
groupType STRING COMMENT ‘业务分组类型:1:校区、学科组合分组;2:来源渠道分组;3:咨询中心分组;4:所有’,
time_type STRING COMMENT ‘聚合时间类型:1、按小时聚合;2、按天聚合;3、按周聚合;4、按月聚合;5、按年聚合。’
)
comment ‘报名人数统计业务表’
PARTITIONED BY (payment_time_year String, payment_time_month String,payment_time_day String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
stored as orc
TBLPROPERTIES (‘orc.compress’=‘SNAPPY’);
全量采集执行脚本
业务数据库到ODS和DIM的SQOOP执行脚本
sqoop import
–connect jdbc:mysql://192.168.52.150:3306/scrm
–username root
–password 123456
–query ‘select *,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as starts_time,date_format(“9999-12-31”,"%Y-%m-%d") as ends_time from itcast_clazz where $CONDITIONS’
–hcatalog-database itcast_dimen
–hcatalog-table itcast_clazz
-m 100
–split-by id
ODS到DWD的HQL语句
–分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
–hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
–写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
–分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
–并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
–矢量化查询
set hive.vectorized.execution.enabled=true;
–读取零拷贝
set hive.exec.orc.zerocopy=true;
INSERT INTO itcast_dwd.customer_relationship_dwd PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
id,
customer_id,
nvl(origin_type, -1) origin_type,
payment_time,
substr(payment_time, 12, 2) as payment_time_hour,
nvl(itcast_clazz_id, -1) itcast_clazz_id,
nvl(creator, -1) creator,
if(origin_type=‘NETSERVICE’ or origin_type=‘PRESIGNUP’, ‘1’, ‘0’) origin_type_stat,
substr(payment_time, 1, 4) payment_time_year,
substr(payment_time, 6, 2) payment_time_month,
substr(payment_time, 9, 2) payment_time_day
from itcast_ods.customer_relationship
WHERE deleted = 0 AND customer_id IS NOT NULL AND payment_state=‘PAID’;
DWD到DWM的HQL语句
–分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
–hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
–写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
–分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
–并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
–矢量化查询
set hive.vectorized.execution.enabled=true;
–关联优化器
set hive.optimize.correlation=true;
–读取零拷贝
set hive.exec.orc.zerocopy=true;
–join数据倾斜
set hive.optimize.skewjoin=true;
– set hive.skewjoin.key=100000;
set hive.optimize.skewjoin.compiletime=true;
set hive.optimize.union.remove=true;
INSERT into itcast_dwm.customer_signup_dwm PARTITION (payment_time_year, payment_time_month, payment_time_day)
SELECT
dwd.customer_id,
clazz.itcast_school_id,
clazz.itcast_school_name,
clazz.itcast_subject_id,
clazz.itcast_subject_name,
e.tdepart_id,
dept.name as tdepart_name,
dwd.origin_type,
dwd.origin_type_stat,
dwd.payment_time_hour,
dwd.payment_time_year,
dwd.payment_time_month,
dwd.payment_time_day
FROM itcast_dwd.customer_relationship_dwd dwd
LEFT JOIN itcast_dimen.itcast_clazz clazz on dwd.itcast_clazz_id=clazz.id
LEFT JOIN itcast_dimen.employee e on dwd.creator=e.id
LEFT JOIN itcast_dimen.scrm_department dept on e.tdepart_id=dept.id;
DWM到DWS的HQL语句
```–*校区、学科组合分组–**小时*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
count(dwm.customer_id) as signup_num,
dwm.itcast_school_id,
dwm.itcast_school_name,
dwm.itcast_subject_id,
dwm.itcast_subject_name,
-1 as tdepart_id,
’-1’ tdepart_name,
’-1’ origin_type,
dwm.origin_type_stat,
dwm.payment_time_hour,
‘1’ grouptype,
‘1’ as time_type,
dwm.payment_time_year,
dwm.payment_time_month,
dwm.payment_time_day
from itcast_dwm.customer_signup_dwm dwm
GROUP BY dwm.origin_type_stat,
dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day, dwm.payment_time_hour,
dwm.itcast_school_id, dwm.itcast_school_name, dwm.itcast_subject_id, dwm.itcast_subject_name;`
*--**天*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
*count*(dwm.customer_id) as signup_num,
dwm.itcast_school_id,
dwm.itcast_school_name,
dwm.itcast_subject_id,
dwm.itcast_subject_name,
-1 as tdepart_id,
'-1' tdepart_name,
'-1' origin_type,
dwm.origin_type_stat,
'-1' payment_time_hour,
'1' grouptype,
'2' as time_type,
dwm.payment_time_year,
dwm.payment_time_month,
dwm.payment_time_day
from itcast_dwm.customer_signup_dwm dwm
GROUP BY dwm.origin_type_stat,
dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day,
dwm.itcast_school_id, dwm.itcast_school_name, dwm.itcast_subject_id, dwm.itcast_subject_name;
*--**月*
INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
*count*(dwm.customer_id) as signup_num,
dwm.itcast_school_id,
dwm.itcast_school_name,
dwm.itcast_subject_id,
dwm.itcast_subject_name,
-1 as tdepart_id,
'-1' tdepart_name,
'-1' origin_type,
dwm.origin_type_stat,
'-1' payment_time_hour,
'1' grouptype,
'3' as time_type,
dwm.payment_time_year,
dwm.payment_time_month,
dwm.payment_time_day
from itcast_dwm.customer_signup_dwm dwm
GROUP BY dwm.origin_type_stat,
dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day,
dwm.itcast_school_id, dwm.itcast_school_name, dwm.itcast_subject_id, dwm.itcast_subject_name;
`*--**年*`
`INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)`
`SELECT`
`*count*(dwm.customer_id) as signup_num,`
`dwm.itcast_school_id,`
`dwm.itcast_school_name,`
`dwm.itcast_subject_id,`
`dwm.itcast_subject_name,`
`-1 as tdepart_id,`
`'-1' tdepart_name,`
`'-1' origin_type,`
`dwm.origin_type_stat,`
`'-1' payment_time_hour,`
`'1' grouptype,`
`'4' as time_type,`
`dwm.payment_time_year,`
`dwm.payment_time_month,`
`dwm.payment_time_day`
`from itcast_dwm.customer_signup_dwm dwm`
`GROUP BY dwm.origin_type_stat,`
`dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day,`
`dwm.itcast_school_id, dwm.itcast_school_name, dwm.itcast_subject_id, dwm.itcast_subject_name;`
`*--**来源渠道分组**--**小时*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)`
`SELECT`
`*count*(dwm.customer_id) as signup_num,`
`'-1' itcast_school_id,`
`'-1' itcast_school_name,`
`'-1' itcast_subject_id,`
`'-1' itcast_subject_name,`
`-1 as tdepart_id,`
`'-1' tdepart_name,`
`dwm.origin_type,`
`'1' as time_type,`
`dwm.origin_type_stat,`
`dwm.payment_time_hour,`
`'2' grouptype,`
`dwm.payment_time_year,`
`dwm.payment_time_month,`
`dwm.payment_time_day`
`from itcast_dwm.customer_signup_dwm dwm`
`GROUP BY dwm.origin_type_stat,`
`dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day, dwm.payment_time_hour,`
`**dwm.origin_type**;`
`*--**天、月、年省略*`
`*--**咨询中心分组**--**小时*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)`
`SELECT`
`*count*(dwm.customer_id) as signup_num,`
`'-1' itcast_school_id,`
`'-1' itcast_school_name,`
`'-1' itcast_subject_id,`
`'-1' itcast_subject_name,`
`dwm.tdepart_id,`
`dwm.tdepart_name,`
`'-1' origin_type,`
`dwm.origin_type_stat,`
`dwm.payment_time_hour,`
`'3' grouptype,`
`'1' as time_type,`
`dwm.payment_time_year,`
`dwm.payment_time_month,`
`dwm.payment_time_day`
`from itcast_dwm.customer_signup_dwm dwm`
`GROUP BY dwm.origin_type_stat,`
`dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day, dwm.payment_time_hour,`
`**dwm.tdepart_id, dwm.tdepart_name**;`
`*--**天、月、年省略*`
`*--**总数分组**--**小时*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)`
`SELECT`
`*count*(dwm.customer_id) as signup_num,`
`'-1' itcast_school_id,`
`'-1' itcast_school_name,`
`'-1' itcast_subject_id,`
`'-1' itcast_subject_name,`
`-1 as tdepart_id,`
`'-1' tdepart_name,`
`'-1' origin_type,`
`dwm.origin_type_stat,`
`dwm.payment_time_hour,`
`'4' grouptype,`
`'1' as time_type,`
`dwm.payment_time_year,`
`dwm.payment_time_month,`
`dwm.payment_time_day`
`from itcast_dwm.customer_signup_dwm dwm`
`GROUP BY dwm.origin_type_stat,`
`dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day, dwm.payment_time_hour;`
`*--**天、月、年省略**……*`
### 全量导出脚本
CREATE TABLE customer_signup_app (
signup_num int(11) COMMENT '报名人数',
itcast_school_id varchar(32) COMMENT '学校id',
itcast_school_name varchar(32) COMMENT '学校name',
itcast_subject_id varchar(32) COMMENT '学科id',
itcast_subject_name varchar(32) COMMENT '学科name',
tdepart_id int(11) COMMENT '咨询中心id',
tdepart_name varchar(32) COMMENT '咨询中心name',
origin_type varchar(32) COMMENT '来源渠道',
origin_type_stat varchar(32) COMMENT '数据来源:0.线下;1.线上',
payment_time_month varchar(32) COMMENT '月信息',
payment_time_day varchar(32) COMMENT '日信息',
payment_time_hour varchar(32) COMMENT '最后更新时间',
groupType varchar(32) COMMENT '统计分组类型:1:校区、学科组合分组;2:来源渠道分组;3:咨询中心分组;4:所有',
time_type varchar(32) COMMENT '聚合时间类型:1、按小时聚合(;2、按天聚合;3、按周聚合;4、按月聚合;5、按年聚合。',
payment_time_year varchar(32) COMMENT '年信息'
);
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table customer_signup_app \
--hcatalog-database itcast_dws \
--hcatalog-table customer_signup_dws \
-m 100
## 增量采集执行脚本
*`--**分区*SET hive.exec.dynamic.partition=true;`
`SET hive.exec.dynamic.partition.mode=nonstrict;`
`set hive.exec.max.dynamic.partitions.pernode=10000;`
`set hive.exec.max.dynamic.partitions=100000;`
`set hive.exec.max.created.files=150000;`
`*--hive**压缩*set hive.exec.compress.intermediate=true;`
`set hive.exec.compress.output=true;`
`*--**写入时压缩生效*set hive.exec.orc.compression.strategy=COMPRESSION;`
`*--**分桶*set hive.enforce.bucketing=true;`
`set hive.enforce.sorting=true;`
`set hive.optimize.bucketmapjoin = true;`
`set hive.auto.convert.sortmerge.join=true;`
`set hive.auto.convert.sortmerge.join.noconditionaltask=true;`
`*--**并行执行*set hive.exec.parallel=true;`
`set hive.exec.parallel.thread.number=8;*--**矢量化查询*set hive.vectorized.execution.enabled=true;`
`*--**关联优化器*set hive.optimize.correlation=true;`
`*--**读取零拷贝*set hive.exec.orc.zerocopy=true;`
`INSERT INTO itcast_dwd.customer_relationship_dwd PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
id,
deleted,
customer_id,
*nvl*(origin_type, -1) origin_type,
payment_state,
payment_time,
*substr*(payment_time, 12, 2) as payment_time_hour,
*nvl*(itcast_clazz_id, -1) itcast_clazz_id,
*nvl*(creator, -1) creator,
*if*(origin_type='NETSERVICE' or origin_type='PRESIGNUP', '1', '0') origin_type_stat,
*substr*(payment_time, 1, 4) payment_time_year,
*substr*(payment_time, 6, 2) payment_time_month,
*substr*(payment_time, 9, 2) payment_time_day
from itcast_ods.customer_relationship
WHERE deleted = 0 AND customer_id IS NOT NULL AND payment_state='PAID'
and *substr*(payment_time, 1, 10) >='2037-02-01';
### 业务数据库到ODS和DIM的SQOOP执行脚本
### ODS到DWD的SQOOP脚本
``\#! /bin/bash````
````````
`\#上个月1日`
`Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)`
````````
`out_put=$(${HIVE_HOME} -S -e "`
`--分区`
`SET hive.exec.dynamic.partition=true;`
`SET hive.exec.dynamic.partition.mode=nonstrict;`
`set hive.exec.max.dynamic.partitions.pernode=10000;`
`set hive.exec.max.dynamic.partitions=100000;`
`set hive.exec.max.created.files=150000;`
`--hive压缩`
`set hive.exec.compress.intermediate=true;`
`set hive.exec.compress.output=true;`
`--写入时压缩生效`
`set hive.exec.orc.compression.strategy=COMPRESSION;`
`--分桶`
`set hive.enforce.bucketing=true;`
`set hive.enforce.sorting=true;`
`set hive.optimize.bucketmapjoin = true;`
`set hive.auto.convert.sortmerge.join=true;`
`set hive.auto.convert.sortmerge.join.noconditionaltask=true;`
`--并行执行`
`set hive.exec.parallel=true;`
`set hive.exec.parallel.thread.number=8;`
`--矢量化查询`
`set hive.vectorized.execution.enabled=true;`
`--关联优化器`
`set hive.optimize.correlation=true;`
`--读取零拷贝`
`set hive.exec.orc.zerocopy=true;`
````````
`INSERT INTO itcast_dwd.customer_relationship_dwd PARTITION(payment_time_year, payment_time_month, payment_time_day)`
`SELECT`
`id,`
`deleted,`
`customer_id,`
`nvl(origin_type, -1) origin_type,`
`payment_state,`
`payment_time,`
`substr(payment_time, 12, 2) as payment_time_hour,`
`nvl(itcast_clazz_id, -1) itcast_clazz_id,`
`nvl(creator, -1) creator,`
`if(origin_type='NETSERVICE' or origin_type='PRESIGNUP', '1', '0') origin_type_stat,`
`substr(payment_time, 1, 4) payment_time_year,`
`substr(payment_time, 6, 2) payment_time_month,`
`substr(payment_time, 9, 2) payment_time_day`
`from itcast_ods.customer_relationship`
`WHERE deleted = 0 AND customer_id IS NOT NULL AND payment_state='PAID'`
`and substr(payment_time, 1, 10) >='${Last_Month_DATE}';`
`")`
### DWD到DWM的SQOOP脚本
### SQL
*--**分区*SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
*--hive**压缩*set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
*--**写入时压缩生效*set hive.exec.orc.compression.strategy=COMPRESSION;
*--**分桶*set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
*--**并行执行*set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;*--**矢量化查询*set hive.vectorized.execution.enabled=true;
*--**关联优化器*set hive.optimize.correlation=true;
*--**读取零拷贝*set hive.exec.orc.zerocopy=true;
*--join**数据倾斜*set hive.optimize.skewjoin=true;
*-- set hive.skewjoin.key=100000;*set hive.optimize.skewjoin.compiletime=true;
set hive.optimize.union.remove=true;
INSERT into itcast_dwm.customer_signup_dwm PARTITION (payment_time_year, payment_time_month, payment_time_day)
SELECT
dwd.customer_id,
clazz.itcast_school_id,
clazz.itcast_school_name,
clazz.itcast_subject_id,
clazz.itcast_subject_name,
e.tdepart_id,
dept.name as tdepart_name,
dwd.origin_type,
dwd.origin_type_stat,
dwd.payment_time_hour,
dwd.payment_time_year,
dwd.payment_time_month,
dwd.payment_time_day
FROM itcast_dwd.customer_relationship_dwd dwd
LEFT JOIN itcast_dimen.itcast_clazz clazz on dwd.itcast_clazz_id=clazz.id
LEFT JOIN itcast_dimen.employee e on dwd.creator=e.id
LEFT JOIN itcast_dimen.scrm_department dept on e.tdepart_id=dept.id
where *substr*(dwd.payment_time, 1, 10) >= '2037-02-01' ;
### Shell脚本
\#! /bin/bash
HIVE_HOME=/usr/bin/hive
Last_Month=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)
${HIVE_HOME} -S -e "
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
--并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
--矢量化查询
set hive.vectorized.execution.enabled=true;
--关联优化器
set hive.optimize.correlation=true;
--读取零拷贝
set hive.exec.orc.zerocopy=true;
--join数据倾斜
set hive.optimize.skewjoin=true;
-- set hive.skewjoin.key=100000;
set hive.optimize.skewjoin.compiletime=true;
set hive.optimize.union.remove=true;
INSERT into itcast_dwm.customer_signup_dwm PARTITION (payment_time_year, payment_time_month, payment_time_day)
SELECT
dwd.customer_id,
clazz.itcast_school_id,
clazz.itcast_school_name,
clazz.itcast_subject_id,
clazz.itcast_subject_name,
e.tdepart_id,
dept.name as tdepart_name,
dwd.origin_type,
dwd.origin_type_stat,
dwd.payment_time_hour,
dwd.payment_time_year,
dwd.payment_time_month,
dwd.payment_time_day
FROM itcast_dwd.customer_relationship_dwd dwd
LEFT JOIN itcast_dimen.itcast_clazz clazz on dwd.itcast_clazz_id=clazz.id
LEFT JOIN itcast_dimen.employee e on dwd.creator=e.id
LEFT JOIN itcast_dimen.scrm_department dept on e.tdepart_id=dept.id
where substr(dwd.payment_time, 1, 10) >= '$Last_Month' ;
"
`有如DWM就写,没有就略`
### DWM到DWS的SQOOP脚本
### SQL
*--**分区*SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
*--hive**压缩*set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
*--**写入时压缩生效*set hive.exec.orc.compression.strategy=COMPRESSION;
*--**分桶*set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
*--**并行执行*set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
*--**矢量化查询*set hive.vectorized.execution.enabled=true;
*--**关联优化器*set hive.optimize.correlation=true;
*--**读取零拷贝*set hive.exec.orc.zerocopy=true;
*--join**数据倾斜*set hive.optimize.skewjoin=true;
*-- set hive.skewjoin.key=100000;*set hive.optimize.skewjoin.compiletime=true;
set hive.optimize.union.remove=true;
*-- group**倾斜*set hive.groupby.skewindata=true;
*--**校区、学科组合分组**--**小时*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
*count*(dwm.customer_id) as signup_num,
dwm.itcast_school_id,
dwm.itcast_school_name,
dwm.itcast_subject_id,
dwm.itcast_subject_name,
-1 as tdepart_id,
'-1' tdepart_name,
'-1' origin_type,
dwm.origin_type_stat,
dwm.payment_time_hour,
'1' grouptype,
'1' as time_type,
dwm.payment_time_year,
dwm.payment_time_month,
dwm.payment_time_day
from itcast_dwm.customer_signup_dwm dwm
where *CONCAT_WS*('-',dwm.payment_time_year,dwm.payment_time_month,dwm.payment_time_day) >= '2037-02-01'
GROUP BY dwm.origin_type_stat,
dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day, dwm.payment_time_hour,
dwm.itcast_school_id, dwm.itcast_school_name, dwm.itcast_subject_id, dwm.itcast_subject_name;
*--**天*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
*count*(dwm.customer_id) as signup_num,
dwm.itcast_school_id,
dwm.itcast_school_name,
dwm.itcast_subject_id,
dwm.itcast_subject_name,
-1 as tdepart_id,
'-1' tdepart_name,
'-1' origin_type,
dwm.origin_type_stat,
'-1' payment_time_hour,
'1' grouptype,
'2' as time_type,
dwm.payment_time_year,
dwm.payment_time_month,
dwm.payment_time_day
from itcast_dwm.customer_signup_dwm dwm
where *CONCAT_WS*('-',dwm.payment_time_year,dwm.payment_time_month,dwm.payment_time_day) >= '2037-02-01'
GROUP BY dwm.origin_type_stat,
dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day,
dwm.itcast_school_id, dwm.itcast_school_name, dwm.itcast_subject_id, dwm.itcast_subject_name;
*--**月、年省略**--**来源渠道分组**--**小时*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
*count*(dwm.customer_id) as signup_num,
'-1' itcast_school_id,
'-1' itcast_school_name,
'-1' itcast_subject_id,
'-1' itcast_subject_name,
-1 as tdepart_id,
'-1' tdepart_name,
dwm.origin_type,
dwm.origin_type_stat,
dwm.payment_time_hour,
'2' grouptype,
'1' as time_type,
dwm.payment_time_year,
dwm.payment_time_month,
dwm.payment_time_day
from itcast_dwm.customer_signup_dwm dwm
where *CONCAT_WS*('-',dwm.payment_time_year,dwm.payment_time_month,dwm.payment_time_day) >= '2037-02-01'
GROUP BY dwm.origin_type_stat,
dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day, dwm.payment_time_hour,
dwm.origin_type;
*--**天、月、年省略**--**咨询中心分组**--**小时*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
*count*(dwm.customer_id) as signup_num,
'-1' itcast_school_id,
'-1' itcast_school_name,
'-1' itcast_subject_id,
'-1' itcast_subject_name,
dwm.tdepart_id,
dwm.tdepart_name,
'-1' origin_type,
dwm.origin_type_stat,
dwm.payment_time_hour,
'3' grouptype,
'1' as time_type,
dwm.payment_time_year,
dwm.payment_time_month,
dwm.payment_time_day
from itcast_dwm.customer_signup_dwm dwm
where *CONCAT_WS*('-',dwm.payment_time_year,dwm.payment_time_month,dwm.payment_time_day) >= '2037-02-01'
GROUP BY dwm.origin_type_stat,
dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day, dwm.payment_time_hour,
dwm.tdepart_id, dwm.tdepart_name;
*--**天、月、年省略**--**总数分组**--**小时*INSERT INTO itcast_dws.customer_signup_dws PARTITION(payment_time_year, payment_time_month, payment_time_day)
SELECT
*count*(dwm.customer_id) as signup_num,
'-1' itcast_school_id,
'-1' itcast_school_name,
'-1' itcast_subject_id,
'-1' itcast_subject_name,
-1 as tdepart_id,
'-1' tdepart_name,
'-1' origin_type,
dwm.origin_type_stat,
dwm.payment_time_hour,
'4' grouptype,
'1' as time_type,
dwm.payment_time_year,
dwm.payment_time_month,
dwm.payment_time_day
from itcast_dwm.customer_signup_dwm dwm
where *CONCAT_WS*('-',dwm.payment_time_year,dwm.payment_time_month,dwm.payment_time_day) >= '2037-02-01'
GROUP BY dwm.origin_type_stat,
dwm.payment_time_year, dwm.payment_time_month, dwm.payment_time_day, dwm.payment_time_hour;
*--**天、月、年省略**……*
### Shell脚本
\`#! /bin/bash`
`HIVE_HOME=/usr/bin/hive`
``
`Last_Month=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)`
`${HIVE_HOME} -S -e "`
`--分区`
`SET hive.exec.dynamic.partition=true;`
`SET hive.exec.dynamic.partition.mode=nonstrict;`
`set hive.exec.max.dynamic.partitions.pernode=10000;`
`set hive.exec.max.dynamic.partitions=100000;`
`set hive.exec.max.created.files=150000;`
`--hive压缩`
`set hive.exec.compress.intermediate=true;`
`set hive.exec.compress.output=true;`
`--写入时压缩生效`
`set hive.exec.orc.compression.strategy=COMPRESSION;`
`--分桶`
`set hive.enforce.bucketing=true;`
`set hive.enforce.sorting=true;`
`set hive.optimize.bucketmapjoin = true;`
`set hive.auto.convert.sortmerge.join=true;`
`set hive.auto.convert.sortmerge.join.noconditionaltask=true;`
`--并行执行`
`set hive.exec.parallel=true;`
`set hive.exec.parallel.thread.number=8;`
`--矢量化查询`
`set hive.vectorized.execution.enabled=true;`
`--关联优化器`
`set hive.optimize.correlation=true;`
`--读取零拷贝`
`set hive.exec.orc.zerocopy=true;`
`--join数据倾斜`
`set hive.optimize.skewjoin=true;`
`-- set hive.skewjoin.key=100000;`
`set hive.optimize.skewjoin.compiletime=true;`
`set hive.optimize.union.remove=true;`
``
`INSERT into itcast_dwm.customer_signup_dwm PARTITION (payment_time_year, payment_time_month, payment_time_day)`
`SELECT`
`dwd.customer_id,`
`clazz.itcast_school_id,`
`clazz.itcast_school_name,`
`clazz.itcast_subject_id,`
`clazz.itcast_subject_name,`
`e.tdepart_id,`
`dept.name as tdepart_name,`
`dwd.origin_type,`
`dwd.origin_type_stat,`
`dwd.payment_time_hour,`
`dwd.payment_time_year,`
`dwd.payment_time_month,`
`dwd.payment_time_day`
`FROM itcast_dwd.customer_relationship_dwd dwd`
`LEFT JOIN itcast_dimen.itcast_clazz clazz on dwd.itcast_clazz_id=clazz.id`
`LEFT JOIN itcast_dimen.employee e on dwd.creator=e.id`
`LEFT JOIN itcast_dimen.scrm_department dept on e.tdepart_id=dept.id`
`where substr(dwd.payment_time, 1, 10) >= '$Last_Month' ;`
`"`
## **导出数据**
##### **sqoop语句**
### **shell脚本**
# 最后的总结
报名用户主题看板:
报名用户主题看板,该看板有十个需求。
我们先来进行维度和指标的分析,共有维度包括年月,线上线下。
校区报名柱状图,就是将不同校区的报名用户数据进行统计。指标是报名用户量,维度则是校区加上共有维度。同理,学科报名柱状图是学科+共有维度。校区学科报名学员TOP是统计各个校区各个学科的报名学员数量。在共有维度上加了学科和校区。
总报名量是统计所有报名用户数据。指标是报名用户量,维度是共有维度。线上报名量指的是总报名量的线上数据。
意向用户报名转化率,就等于全部报名人数/全部新增的意向人数。指标和维度之前的数据都有,可以直接复用。同理有效线索报名转化率也是能够直接拿来用。
剩下的日报名趋势图、来源渠道和咨询中心,指标都是报名人数,维度则是在共有维度基础上加上了天维度、来源渠道和咨询中心。
由此我们可以得出结论,这十个需求的共同指标是报名用户量、意向用户量和有效线索量。而意向用户量和有效线索量可以复用前面的看板数据。
接下来是建模分析,
首先在ODS层原始数据包括有customer_relationship(报名信息),itcast_clazz(报名后的校区和学科信息),employee(内部员工信息),scrm_department(部门信息)。
其次是在DWD层对数据进行清洗,抽取,转换,所以我们在DWD层清洗保留客户表中不为空的,且是已支付的数据,并且转换获得线上线下及年月日等字段。
再次是DWM层,在DWD层基础上,关联校区,学科和咨询中心表,来获取想要的字段。
最后DWS层按产品的属性维度进行统计,得到统计宽表,产品属性维度包括:校区,学科组合分组,来源渠道,咨询中心。
以上就是我的报名用户看板的看板分析,谢谢大家。