知行教育_访问咨询主题-增量采集

文章目录

访问咨询主题-增量采集

注意点:
业务系统中,数据表的后缀是 年_月 ,这就表明随着时间推移,被
采集的表的后缀是动态变化的。
我们的脚本也要做到这一点
这个功能要做成自动化的脚本,脚本能够每天定时执行

增量采集,一天执行一次
当天采集昨日数据

业务数据库 - > ODS的流程

业务系统中,数据表的后缀是 年_月 ,这就表明随着时间推移,被
采集的表的后缀是动态变化的。
我们的增量采集是按天来的,采集的数据能够带上按天过滤的条件

#!/bin/bash

# 需求:能够手动传入参数指定采集的日期。  如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
    DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
    DATESTR=$1
fi

# Define var
SQOOP_HOME=/export/servers/sqoop
JDBCSTR="jdbc:mysql://192.168.10.10:3306/nev"
MYSQL_USERNAME=root
YEARSTR=`date -d "-1 day" +%Y`
MONTHSTR=`date -d "-1 day" +%m`
HIVE_DB=itcast_ods
MAP_NUMBER=3


echo "执行采集web_chat_ems_${YEARSTR}_${MONTHSTR}的任务......"

$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
  id,create_date_time,session_id,sid,create_time,
  seo_source,seo_keywords,ip,area,country,province,
  city,origin_channel,user AS user_match,manual_time,
  begin_time,end_time,last_customer_msg_time_stamp,
  last_agent_msg_time_stamp,reply_msg_count,msg_count,
  browser_name,os_info, CURRENT_DATE() AS start_time
FROM  web_chat_ems_${YEARSTR}_${MONTHSTR}
WHERE create_time >= '${DATESTR} 00:00:00' AND create_time <= '${DATESTR} 23:59:59' and \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_ems \
-m $MAP_NUMBER \
--split-by id



wait

echo "执行采集web_chat_text_ems_${YEARSTR}_${MONTHSTR}的任务......"
$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
id,referrer,from_url,landing_page_url,
url_title,platform_description,
other_params,history, CURRENT_DATE() AS start_time
FROM web_chat_text_ems_${YEARSTR}_${MONTHSTR}
WHERE 1=1 AND \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_text_ems \
-m $MAP_NUMBER \
--split-by id

配置ODS -> DWD的流程

# 配置从ODS层到DWD层的脚本

# 需求:能够手动传入参数指定采集的日期。  如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
    DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
    DATESTR=$1
fi

HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`

# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
    QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
    QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
    QUARTERSTR=3
else
    QUARTERSTR=4
fi


echo "即将执行:${DATESTR} 这一天的ODS -> DWD的过程......"

$HIVE_HOME -e "
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
INSERT INTO TABLE itcast_dwd.visit_consult_dwd partition(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    w1.session_id,
    w1.sid,
    UNIX_TIMESTAMP(w1.create_time, 'yyyy-MM-dd HH:mm:ss') AS create_time,
    w1.seo_source,
    w1.ip,
    w1.area,
    w1.country,
    w1.province,
    w1.city,
    CAST(IF (w1.msg_count IS NULL, 0, w1.msg_count) AS INT) AS msg_count,
    w1.origin_channel,
    w2.referrer,
    w2.from_url,
    w2.landing_page_url,
    w2.url_title,
    w2.platform_description,
    w2.other_params,
    w2.history,
    SUBSTRING(w1.create_time, 12, 2) AS hourinfo,
    SUBSTRING(w1.create_time, 1, 4) AS yearinfo,
    QUARTER(w1.create_time) AS quarterinfo,
    SUBSTRING(w1.create_time, 6, 2) AS monthinfo,
    SUBSTRING(w1.create_time, 9, 2) AS dayinfo
FROM (SELECT * FROM itcast_ods.web_chat_ems WHERE start_time='${DATESTR}') AS w1 INNER JOIN itcast_ods.web_chat_text_ems AS w2
ON w1.id = w2.id;"

DWD -> DWS 流程

问题:分析针对增量数据的不同做法以及优缺点

我们DWS中是包含了年、季度、月等维度的结果数据

由于新增了一天的数据,当前年、当前季度、当前月的结果数据就失效了

需要重新计算

问题在于:DWS表用的失效数据如何处理。

问题解决方式

方式1

删除过期数据

优点:

  • 对BI友好,没有历史数据的混淆,直接取就是最新的
  • 表中的数据清晰

缺点:

  • 执行复杂
  • 破坏了数仓设计中尽量不删除的原则

方式2:

增加新列,表名当前数据计算的时间

用的时候,取最新时间的即可

优点:

  • 表中存储了历史结果的变更
  • 不会执行删除,不破坏数仓的原则

缺点:

  • 对BI分析来说,需要先过滤最新的数据(稍微不友好)
  • 修改表结构(全量操作要重来)

方式3:

增加新表

每一天的结果都生成一个表(一天一个表)

优点:

  • 每个表都是清晰的,指向了具体某天的结果
  • 通过多个表的方式也记录的历史结果的变更

缺点:

  • 数据冗余过多(只要业务需要,冗余不是事)
  • 对BI不友好(换一天,换一个表,如果BI不支持动态规则配置自动变更表,就要手动变更)

这三种方式,没有绝对的谁好谁坏

合适的需求下,选择合适的方式即可。

过期数据的删除

# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='当前月', dayinfo='-1');

示例脚本

# DWD层数据聚合到DWS层的操作脚本(增量)



# 需求:能够手动传入参数指定采集的日期。  如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
    DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
    DATESTR=$1
fi

HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`

# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
    QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
    QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
    QUARTERSTR=3
else
    QUARTERSTR=4
fi


echo "先删除过期的${YEARSTR}年, ${QUARTERSTR}季度, ${MONTHSTR}月数据"
$HIVE_HOME -e "
# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='${MONTHSTR}', dayinfo='-1');"


echo "执行新数据的计算,计算的日期是:${DATESTR}"
$HIVE_HOME -e "
/*
指标:访问量
维度:时间(年、季度、月、天、小时)、来源渠道、受访页面、搜索渠道、区域维度
*/
-- 时间维度表开发
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    yearinfo AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '5' AS time_type,
    yearinfo,
    '-1' AS quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo;
-- 统计某年某季度的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '4' AS time_type,
    yearinfo,
    quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo;
-- 统计某年某季度某月的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '3' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo;
-- 统计某年某季度某月某日的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '2' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo;
-- 统计某年某季度某月某日某小时的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '1' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;




-- 按照时间和其它维度汇合进行聚合
-- 按照时间和国家进行组合计算
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    yearinfo AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '5' AS time_type,
    yearinfo,
    '-1' AS quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo,country;
-- 统计某年某季度某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '4' AS time_type,
    yearinfo,
    quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo, country;
-- 统计某年某季度某月某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '3' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, country;
-- 统计某年某季度某月某天某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '2' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, country;
-- 统计某年某季度某月某天某小时某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '1' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo, country;
"

DWS -> MySQL

如果DWS用的是方式1,MySQL中也要删除过期

导出脚本执行之前需要在MySQL中删除即可

mysql -uroot -p123456 scrm_bi -e "
# 删除过期年
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='-1' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期季度
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期月
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='${MONTHSTR}' AND dayinfo='-1';
"

sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root --password 123456 \
--table itcast_visit \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values ${YEARSTR} \
-m 1
上一篇:厚积薄发打卡Day38 :[itcast] GoF23通俗易懂的设计模式之 <组合模式>


下一篇:Scrapy框架入门