拉链表设计

一、创建拉链表

  • 创建表
    CREATE TABLE `ods_login`(                          |
    |   `uid` string,                                    |
    |   `name` string,                                   |
    |   `phone` string,                                  |
    |   `date` string)                                   |
    | PARTITIONED BY (                                   |
    |   `dt` string)                                     |
    | ROW FORMAT SERDE                                   |
    |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
    | WITH SERDEPROPERTIES (                             |
    |   'field.delim'=',',                               |
    |   'serialization.format'=',')                      |
    | STORED AS INPUTFORMAT                              |
    |   'org.apache.hadoop.mapred.TextInputFormat'       |
    | OUTPUTFORMAT                                       |
    |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
    | LOCATION                                           |
    |   'hdfs://namenode:8020/user/hive/warehouse/ods.db/ods_login' |
    | TBLPROPERTIES (                                    |
    |   'transient_lastDdlTime'='1645972509')        

     

  • 测试数据

    -- 2022-02-24.dat
    0001,mike,1398836123,2022-02-24
    0002,lucy,1396542122,2022-02-24
    0003,jack,1398836145,2022-02-24
    0004,baran,1398836178,2022-02-24
    
    -- 2022-02-25.dat, 增量数据
    0001,mike,1398836241,2022-02-25
    0004,baran,1398836756,2022-02-25
    0005,xinyi,1398836549,2022-02-25

     

  • 创建dz表

    +----------------------------------------------------+
    | CREATE TABLE `dim_user_dz`(                        |
    |   `uid` string,                                    |
    |   `name` string,                                   |
    |   `phone` string,                                  |
    |   `date` string,                                   |
    |   `start_date` string, --开始日期                            |
    |   `end_date` string)   --结束日期                            |
    | PARTITIONED BY (                                   |
    |   `dt` string)                                     |
    | ROW FORMAT SERDE                                   |
    |   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  |
    | STORED AS INPUTFORMAT                              |
    |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  |
    | OUTPUTFORMAT                                       |
    |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
    | LOCATION                                           |
    |   'hdfs://namenode:8020/user/hive/warehouse/ods.db/dim_user_dz' |
    | TBLPROPERTIES (                                    |
    |   'transient_lastDdlTime'='1645973732')            |
    +----------------------------------------------------+
  • 拉链表创建思路

    1)假如首日是2022-02-24,首先将数据从ods层加载到dim层,分区日期和结束日期都为9999-00-00

    2)第二日2022-02-25,一部分用户新增变化,需要把新增的和变化的装载到dim层,分区结束日期是9999分区,但要注意9999分区有一部分过期数据(过期理解为数据发生了变化后,变化前的数据是过期数据)需要装载到变化前一日即5月1日分区(过期的用户数据分区),就是将过期数据装载到前一日分区,有效数据是9999分区;

    3)第三日2022-02-26,和25日一样,新增数据放入9999分区,过期数据放入前一日分区;

  

  • SQL语句实现
    -- 首日加载
    INSERT INTO dim_user_dz PARTITION(dt='9999-00-00')
    SELECT 
        uid,
        name,
        phone,
        `date`,
        '2022-02-24',
        '9999-00-00'
    FROM ods_login
    WHERE dt = '2022-02-24';
    
    -- 每日装载
    -- 首先找出9999的数据与增量数据做外连接, 这里创建一张临时表
    CREATE TABLE tmp_20220225 AS 
    SELECT *
    FROM (
    SELECT 
        uid as old_uid, 
        name as old_name, 
        phone as old_phone, 
        `date` as old_date, 
        start_date as old_start_date, 
        end_date as old_end_date
    FROM dim_user_dz
    WHERE dt = '9999-00-00'
    ) t1 LEFT OUTER JOIN (
    SELECT uid, name, phone, `date`,'2022-02-25' as start_date,'9999-00-00' as end_date
    FROM ods_login
    WHERE dt = '2022-02-25'
    ) t2
    ON t1.old_uid = t2.uid;
    
    -- 获取全量最新数据和过期数据,动态分区写入
    set hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE dim_user_dz PARTITION(dt)
    SELECT 
        nvl(uid, old_uid) as uid,
        nvl(name, old_name) as name,
        nvl(phone, old_phone) as phone,
        nvl(`date`, old_date) as `date`,
        nvl(start_date, old_start_date) as start_date,
        nvl(end_date, old_end_date) as end_date,
        nvl(end_date, old_end_date) as dt
    FROM tmp_20220225
    UNION ALL
    SELECT 
        old_uid as uid,
        old_name as name,
        old_phone as phone,
        old_date as `date`,
        old_start_date as start_date,
        cast(date_sub('2022-02-25', 1) as string) as end_date,
        cast(date_sub('2022-02-25', 1) as string) as dt
    FROM tmp_20220225
    WHERE uid is not null and old_uid is not null;

     

        

 

 

 

二、拉链表回滚

上一篇:map差异对比工具类


下一篇:C++string类写时拷贝