Partitioning and Bucketing in Hive

2024-03-29 12:44:16

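1. Hive partitions

Partitions and partitioned tables: a partitioned table is one whose partition columns are declared when the table is created. Physically, each partition is a subdirectory under the table's directory on HDFS. When a query names the partitions it needs, Hive reads only those partitions instead of scanning the whole table, which makes the query faster. Hive supports two ways of populating partitions: static and dynamic.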

1) Static partitions

First, create the partitioned table:

create table students_pt
(
id bigint,
name string,
age int,
gender string,
clazz string
)
PARTITIONED BY(pt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Next, create a plain (unpartitioned) table to act as the data source:

create table  students (
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
row format delimited fields terminated by ',';

Load data into students from a local file with load data:

load data local inpath '/usr/local/soft/datapackage/students.txt' into table students;

The students table now contains data.

Next, insert its rows into a partition, giving the partition value explicitly:

insert into table students_pt partition(pt='20210413') select * from students;

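The result is a partitioned table.

When you query the table, pt appears as an extra last column. It can be used in a where clause to filter, so Hive reads only the matching partitions and avoids a full table scan. The HDFS web UI also shows that Hive has created a subdirectory (here pt=20210413) under the table's directory, and that subdirectory holds the partition's data.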
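As a minimal sketch of how the partition column is used, the statements below list the partitions of students_pt and then read a single partition. The selected columns are only illustrative; the partition value 20210413 is the one inserted above.

```sql
-- list the partitions that exist for the table
show partitions students_pt;

-- filtering on the partition column lets Hive read only that partition's directory
select id, name, clazz
from students_pt
where pt = '20210413';
```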

2) Dynamic partitions

Dynamic partitioning has to be enabled explicitly for the session:

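-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;

-- nonstrict mode: every partition column may be dynamic (strict mode requires at least one static partition)
set hive.exec.dynamic.partition.mode=nonstrict;

-- allow each mapper or reducer node to create up to 1000 dynamic partitions
set hive.exec.max.dynamic.partitions.pernode=1000;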

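Create a source table that carries the partition value as an ordinary last column (pt), and load data into it. If the students table from the static-partition example still exists, drop it first, because this definition reuses the name.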

create table  students (
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    pt string
)
row format delimited fields terminated by ',';

Create the dynamically partitioned table:

-- create the partitioned table that the dynamic insert will fill
create table students_dt_p
(
id bigint,
name string,
age int,
gender string,
clazz string
)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

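Insert data into the dynamically partitioned table. Hive takes the dynamic partition value from the last column of the select list by position, not by name, so the source table's pt column supplies the value for dt:

insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,pt from students;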

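The result is a partitioned table in which rows that share the same dt value are stored in the same partition.

Hive creates the partitions automatically, without any partition value being written into the insert statement.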
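One way to check the outcome, sketched here under the assumption that the insert above has completed, is to list the partitions Hive created and count the rows in each:

```sql
-- one entry per distinct partition value found in the source data
show partitions students_dt_p;

-- row count per partition
select dt, count(*) as row_count
from students_dt_p
group by dt;
```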

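2. Hive buckets

When rows are inserted into a bucketed table, Hive hashes the column named in clustered by and takes the hash modulo the number of buckets, so the data is split into that many files. Rows with the same column value always land in the same bucket. This helps spread the data evenly, makes it easy to sample the data, and can speed up joins on the bucketing column. A well-chosen bucketing column and bucket count also reduce data skew. The number given in into N buckets determines how many buckets are created.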
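The hash-and-modulo rule can be previewed with an ordinary query. This is only a sketch: it assumes the bucket number is (hash(clazz) & 2147483647) % 12, which corresponds to Hive's older default bucketing behaviour. Newer Hive versions may use a different hash function for bucketed tables, so the numbers here may not match the files Hive actually writes. The aliases bucket_id and row_count are illustrative names.

```sql
-- estimate how many rows would fall into each of 12 buckets when bucketing by clazz
select bucket_id, count(*) as row_count
from (
  -- mask the sign bit so the hash is non-negative, then take it modulo the bucket count
  select (hash(clazz) & 2147483647) % 12 as bucket_id
  from students
) t
group by bucket_id
order by bucket_id;
```

A very uneven row_count across buckets suggests that the chosen column or bucket count will not distribute the data well.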

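-- enforce bucketing on insert (needed on Hive 1.x; Hive 2.0 and later always enforce bucketing and removed this property, so skip this line there)
set hive.enforce.bucketing=true;

Create the bucketed table: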

create table students_buks
(
id bigint,
name string,
gender string,
age int,
clazz string
)
clustered by (clazz) into 12 buckets
row format delimited fields terminated by ',';

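Load data into the bucketed table. The column list is written out because students_buks declares gender before age, and insert ... select matches columns by position rather than by name. students here is the source table from the dynamic-partition example, which has a gender column:

insert into students_buks select id, name, gender, age, clazz from students;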
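Sampling is one of the benefits of bucketing mentioned above. As a small sketch, tablesample can read a single bucket of students_buks; the choice of bucket 1 is arbitrary.

```sql
-- read roughly one twelfth of the data by sampling bucket 1 of the 12 buckets on clazz
select *
from students_buks tablesample(bucket 1 out of 12 on clazz);
```

Because the table is already clustered by clazz into 12 buckets, Hive can serve this sample from one bucket file rather than scanning the whole table.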
