类型转换
可以使用CAST操作显示进行数据类型转换 例如CAST('1' AS INT)将把字符串'1' 转换成整数1;如果强制类型转换失败,如执行CAST('X' AS INT),表达式返回空值 NULL。
0: jdbc:hive2://hadoop101:10000> select '1'+2, cast('1'as int) + 2; +------+------+--+ | _c0 | _c1 | +------+------+--+ | 3.0 | 3 |
对于Hive的String类型相当于数据库的varchar类型,该类型是一个可变的字符串,不过它不能声明其中最多能存储多少个字符,理论上它可以存储2GB的字符数。
集合数据类型
Hive有三种复杂数据类型ARRAY、MAP 和 STRUCT。ARRAY和MAP与Java中的Array和Map类似,而STRUCT与C语言中的Struct类似,它封装了一个命名字段集合,复杂数据类型允许任意层次的嵌套。
[kris@hadoop101 datas]$ vim test.txt songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:16, chao yang_beijing hive (default)> create table test( > name string, > friends array<string>, > children map<string, int>, > address struct<street:string, city:string>) > row format delimited fields terminated by ',' > collection items terminated by '_' > map keys terminated by ':' > lines terminated by '\n'; OK Time taken: 0.249 seconds hive (default)> load data local inpath '/opt/module/datas/test.txt/' into table test; Loading data to table default.test Table default.test stats: [numFiles=1, totalSize=145] OK Time taken: 1.365 seconds 0: jdbc:hive2://hadoop101:10000> select * from test; 0: jdbc:hive2://hadoop101:10000> select friends[1], children['xiao song'], address.city from test where name="songsong"; +-------+------+----------+--+ | _c0 | _c1 | city | +-------+------+----------+--+ | lili | 18 | beijing | +-------+------+----------+--+ 1 row selected (0.321 seconds)
DDL数据定义
创建数据库
创建一个数据库,数据库在HDFS上的默认存储路径是/user/hive/warehouse/*.db。
修改
用户可以使用ALTER DATABASE命令为某个数据库的DBPROPERTIES设置键-值对属性值,来描述这个数据库的属性信息。数据库的其他元数据信息都是不可更改的,包括数据库名和数据库所在的目录位置。
① 创建数据库
0: jdbc:hive2://hadoop101:10000> create database if not exists db_hive; 避免要创建的数据库已经存在错误,增加if not exists判断。(标准写法) No rows affected (0.032 seconds) 0: jdbc:hive2://hadoop101:10000> create database if not exists db_hive2 location '/db_hive2.db'; 指定数据库在HDFS上存放的位置
② 修改数据库
hive (db_hive)> alter database db_hive set dbproperties('createtime'='20190215'); OK Time taken: 0.031 seconds
③ 查看数据库| 切换数据库 use xx; hive (db_hive)> desc database extended db_hive; 显示数据库详细信息; 也可以去掉extended即显示数据库信息; OK db_name comment location owner_name owner_type parameters db_hive hdfs://hadoop101:9000/user/hive/warehouse/db_hive.db kris USER {createtime=20190215} Time taken: 0.016 seconds, Fetched: 1 row(s) ④ 删除数据库 hive (db_hive)> drop database db_hive2; hive (db_hive)> drop database if exists db_hive2; hive (db_hive)> drop database db_hive cascade; ##若数据库不为空,则强制删除用cascade;
创建表
hive (default)> create table if not exists student2( > id int, name string) > row format delimited fields terminated by '\t' > stored as textfile > location '/user/hive/warehouse/student2'; OK
管理表| 内部表
管理表,有时也被称为内部表。因为这种表,Hive会(或多或少地)控制着数据的生命周期。Hive默认情况下会将这些表的数据存储在由配置项hive.metastore.warehouse.dir(例如,/user/hive/warehouse)所定义的目录的子目录下。 当我们删除一个管理表时,Hive也会删除这个表中数据。管理表不适合和其他工具共享数据。
外部表,Hive并非认为其完全拥有这份数据。删除该表并不会删除掉这份数据,不过描述表的元数据信息会被删除掉。
使用场景:每天将收集到的网站日志定期流入HDFS文本文件。在外部表(原始日志表)的基础上做大量的统计分析,用到的中间表、结果表使用内部表存储,数据通过SELECT+INSERT进入内部表。
① 普通创建表
hive (default)> create table if not exists student3 as select id, name from student; hive (default)> create table if not exists student4 like student; //根据已经存在的表机构创建表 hive (default)> desc formatted student2; #查询表的类型;查看格式化数据 OK col_name data_type comment ② 外部表 hive (default)> dfs -mkdir /student; hive (default)> dfs -put /opt/module/datas/student.txt /student; hive (default)> create external table stu_external( //创建外部表 id int, name string) row format delimited fields terminated by '\t' location '/student';
0: jdbc:hive2://hadoop101:10000> select * from stu_external; 0: jdbc:hive2://hadoop101:10000> desc formatted stu_external; Table Type: | EXTERNAL_TABLE 0: jdbc:hive2://hadoop101:10000> drop table stu_external; 外部表删除后,hdfs中的数据还在,但是metadata中stu_external的元数据已被删除
③ 内部表和外部表的互相转换 desc formatted student2; Table Type: | MANAGED_TABLE 0: jdbc:hive2://hadoop101:10000> alter table student2 set tblproperties('EXTERNAL'='TRUE'); Table Type: | EXTERNAL_TABLE 0: jdbc:hive2://hadoop101:10000> alter table student2 set tblproperties('EXTERNAL'='FALSE');
分区表
分区表实际上就是对应一个HDFS文件系统上的独立的文件夹,该文件夹下是该分区所有的数据文件。Hive中的分区就是分目录,把一个大的数据集根据业务需要分割成小的数据集。在查询时通过WHERE子句中的表达式选择查询所需要的指定的分区,这样的查询效率会提高很多。
① 创建分区表
hive (default)> create table dept_partition( > deptno int, dname string, loc string) > partitioned by (month string) > row format delimited fields terminated by '\t'; OK
加载数据 hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709'); Loading data to table default.dept_partition partition (month=201709) Partition default.dept_partition{month=201709} stats: [numFiles=1, numRows=0, totalSize=71, rawDataSize=0] OK load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708'); load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201707'); ② 单分区查询 0: jdbc:hive2://hadoop101:10000> select * from dept_partition where month='201708'; +------------------------+-----------------------+---------------------+-----------------------+--+ | dept_partition.deptno | dept_partition.dname | dept_partition.loc | dept_partition.month | +------------------------+-----------------------+---------------------+-----------------------+--+ | 10 | ACCOUNTING | 1700 | 201708 | | 20 | RESEARCH | 1800 | 201708 | | 30 | SALES | 1900 | 201708 | | 40 | OPERATIONS | 1700 | 201708 | +------------------------+-----------------------+---------------------+-----------------------+-- 多分区联合查询 0: jdbc:hive2://hadoop101:10000> select * from dept_partition where month='201707' 0: jdbc:hive2://hadoop101:10000> union 0: jdbc:hive2://hadoop101:10000> select * from dept_partition where month='201708' 0: jdbc:hive2://hadoop101:10000> union 0: jdbc:hive2://hadoop101:10000> select * from dept_partition where month='201709';
③ 增加分区| 增加单个、增加多个分区 0: jdbc:hive2://hadoop101:10000> alter table dept_partition add partition(month='201705') partition(month='201704');
④ 删除分区| 单个、删多个用,连接 0: jdbc:hive2://hadoop101:10000> alter table dept_partition drop partition(month='201705'), partition(month='201706');
⑤ 查看分区有多少分区 0: jdbc:hive2://hadoop101:10000> show partitions dept_partition; +---------------+--+ | partition | +---------------+--+ | month=201707 | | month=201708 | | month=201709 | +---------------+--+
⑥ 查看分区表结构 0: jdbc:hive2://hadoop101:10000> desc formatted dept_partition;
⑦ 创建二级分区
hive (default)> create table dept_partition2(
deptno int, dname string, loc string)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';
加载数据到二级分区 0: jdbc:hive2://hadoop101:10000> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition2 partition(month='201709', day='13'); 0: jdbc:hive2://hadoop101:10000> select * from dept_partition2 where month='201709' and day='13'; 查看分区数据
把数据直接上传到分区目录上,让分区表和数据产生关联的三种方式
方式一:上传数据后修复
0: jdbc:hive2://hadoop101:10000> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12; 0: jdbc:hive2://hadoop101:10000> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12; 0: jdbc:hive2://hadoop101:10000> msck repair table dept_partition2; //修复下才能查到数据 No rows affected (0.15 seconds) 0: jdbc:hive2://hadoop101:10000> select * from dept_partition2 where month='201709' and day='12'; alter table dept_partition2 drop partition(month='201709', day='11'); 删除 方式二:上传数据后添加分区 0: jdbc:hive2://hadoop101:10000> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=11; 不能加引号 0: jdbc:hive2://hadoop101:10000> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11; 0: jdbc:hive2://hadoop101:10000> alter table dept_partition2 add partition(month='201709', day='11'); 0: jdbc:hive2://hadoop101:10000> select * from dept_partition2 where month='201709' and day='11'; 方式三:创建文件夹后load数据到分区 0: jdbc:hive2://hadoop101:10000> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month='201709'/day='10'; 0: jdbc:hive2://hadoop101:10000> load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201709',day='10'); 0: jdbc:hive2://hadoop101:10000> select * from dept_partition2 where month='201709' and day='10';
修改表
重命名表 jdbc:hive2://hadoop101:10000> alter table teacher rename to new_teacher;
添加列 0: jdbc:hive2://hadoop101:10000> alter table dept_partition add columns(deptdesc string);
更新列 0: jdbc:hive2://hadoop101:10000> alter table dept_partition change column deptdesc desc int; No rows affected (0.112 seconds) 0: jdbc:hive2://hadoop101:10000> desc dept_partition;
替换列 0: jdbc:hive2://hadoop101:10000> alter table dept_partition replace columns(deptid int, name string, loc string);
删除表 0: jdbc:hive2://hadoop101:10000> drop table new_teacher;
DML数据操作
数据导入
向表中装载数据(Load)
① 向表中装载数据:
0: jdbc:hive2://hadoop101:10000> create table student(id int, name string) row format delimited fields terminated by '\t'; 0: jdbc:hive2://hadoop101:10000> load data local inpath '/opt/module/datas/student.txt' into table default.student; 加载本地文件到hive 0: jdbc:hive2://hadoop101:10000> dfs -mkdir -p /user/kris/hive; 0: jdbc:hive2://hadoop101:10000> dfs -put /opt/module/datas/student.txt /user/kris/hive; 0: jdbc:hive2://hadoop101:10000> load data inpath '/user/kris/hive/student.txt' into table default.student; //移动hdfs上的文件;加载HDFS上的数据 0: jdbc:hive2://hadoop101:10000> load data inpath '/user/kris/hive/student.txt' overwrite into table default.student; 加载数据覆盖表中已有的数据
② 通过查询语句向表中插入数据Insert create table student(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t'; 创建一张分区表 0: jdbc:hive2://hadoop101:10000> insert into table student partition(month='201902') values (1, "kris"), (2, "egon"); 插入数据 -rwxrwxr-x kris supergroup 14 B 2019/2/15 下午7:16:26 3 128 MB 000000_0 根据单张表查询结果来插入insert into是追加数据的方式插入表或分区,原有数据不会被删除;
insert overwrite是会覆盖表或分区中已有数据;
0: jdbc:hive2://hadoop101:10000> insert overwrite table student partition(month="201905") select id,name from student where month='201902'; 在原本基础上追加 0: jdbc:hive2://hadoop101:10000> select * from student; +-------------+---------------+----------------+--+ | student.id | student.name | student.month | +-------------+---------------+----------------+--+ | 1 | kris | 201902 | | 2 | egon | 201902 | | 1 | kris | 201905 | | 2 | egon | 201905 | +-------------+---------------+----------------+--+ 多表查询结果插入 hive (default)> from student insert overwrite table student partition(month="201904") > select id, name where month="201905" > insert overwrite table student partition(month="201903") > select id, name where month="201905"; 0: jdbc:hive2://hadoop101:10000> select * from student; +-------------+---------------+----------------+--+ | student.id | student.name | student.month | +-------------+---------------+----------------+--+ | 1 | kris | 201902 | | 2 | egon | 201902 | | 1 | kris | 201903 | | 2 | egon | 201903 | | 1 | kris | 201904 | | 2 | egon | 201904 | | 1 | kris | 201905 | | 2 | egon | 201905 | +-------------+---------------+----------------+-
③ 查询语句中创建并加载数据 AS Select
create table if not exists student3 as select id, name from student; create table if not exists student4 like student;
④ 创建表时通过Location指定加载数据路径 0: jdbc:hive2://hadoop101:10000> create external table if not exists stu(id int, name string) row format delimited fields terminated by '\t' location '/student'; No rows affected (0.093 seconds) 0: jdbc:hive2://hadoop101:10000> select * from stu; ⑤ Import数据到指定Hive表中 create table student22( id int, name string) partitioned by (month string) row format delimited fields terminated by '\t'; import table student22 partition(month='201904') from //student22必须要有分区才能导入成功 '/user/hive/warehouse/export/student';
数据导出
① Insert导出
将输出文件导出到本地/opt/module/datas/export/student中; 0: jdbc:hive2://hadoop101:10000> insert overwrite local directory '/opt/module/datas/export/student' select * from student; 结果格式化导出到本地
hive (default)> insert overwrite local directory '/opt/module/datas/export/student1' > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student; 结果导出到HDFS hive (default)> insert overwrite directory '/user/kris/student2' > row format delimited fields terminated by '\t' > select * from student; ② Hadoop命令导出到本地 hive (default)> dfs -get /user/hive/warehouse/student/month=201902/000000_0 /opt/module/datas/export/student3.txt; [kris@hadoop101 export]$ cat student3.txt 1 kris 2 egon [kris@hadoop101 export]$ pwd /opt/module/datas/export ③ Shell命令导出 [kris@hadoop101 hive]$ bin/hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt ④ Export导出到HDFS上 hive (default)> export table default.student to '/user/hive/warehouse/export/student';
⑤ Sqoop导出
清除表中数据(Truncate)
注意:Truncate只能删除管理表,不能删除外部表中数据
hive (default)> truncate table student;