Foreword
P.S. There is a lot of practical material below…
As we all know, Hive tables come in basically two types: managed (internal) tables and external tables. Two questions come up all the time in interviews:
1. What is the difference between Hive managed tables and external tables?
2. When should you use a managed table, and when an external table?
Definitions from the official documentation:
Managed tables
A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /user/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.
Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
External tables
An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.
Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
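The metadata refresh mentioned above can be sketched as follows. This is an illustrative example, not from the original post: the table name sales_log is hypothetical, standing in for an external partitioned table whose partition directories are written by a collection job outside of Hive.

```sql
-- New partition directories were added to the table's HDFS location
-- by an external process; the metastore does not know about them yet.
MSCK REPAIR TABLE sales_log;

-- Hive scans the table's location, finds partition directories missing
-- from the metastore, and registers them so they become queryable.
SHOW PARTITIONS sales_log;
```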
When creating an external table, you generally specify the data file storage path yourself at the end of the statement: location '/AUTO/PATH'.
A managed table needs no explicit location; it defaults to /user/hive/warehouse,
which is configurable in hive-site.xml:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/hive/warehouse</value>
</property>
Managed table data is managed by Hive itself; external table data is managed on HDFS, by processes outside of Hive.
DROP TABLE:
Managed table: both the metadata and the data files are deleted.
External table: only the metadata is deleted; the data files remain on HDFS. You can then recreate the table and query the data directly.
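The drop-and-recreate behavior of an external table can be sketched like this (the table name ext_demo and the path /data/ext_demo are illustrative, not from the original post):

```sql
-- Dropping an external table removes only the metastore entry;
-- the files under /data/ext_demo stay on HDFS.
DROP TABLE ext_demo;

-- Recreating the table with the same schema and location makes the
-- surviving files queryable again immediately.
CREATE EXTERNAL TABLE ext_demo(id string, name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/ext_demo';

SELECT * FROM ext_demo LIMIT 10;
```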
LOAD DATA:
Loading data that is already on HDFS moves the files into the table's directory, similar to the mv command.
2. Usage scenarios
Again from the official documentation:
ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed tables
DROP deletes data for managed tables while it only deletes metadata for external ones
ACID/Transactional only works for managed tables
Query Results Caching only works for managed tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
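As an example of the first points, ACID/transactional tables must be managed tables. A minimal sketch of creating one (the table name txn_demo is hypothetical; the TBLPROPERTIES flag is the standard one from the Hive documentation):

```sql
-- Transactional tables must be managed, stored as ORC, and (in older
-- Hive versions) bucketed; the tblproperties flag enables ACID.
CREATE TABLE txn_demo(id int, name string)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```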
In short:
For the NG (Nginx) logs and event-tracking logs collected every day, external tables are recommended for storage: the log data is ingested in real time by collection programs, and recovering it after an accidental delete would be very painful. External tables also make it easier to share data.
For business data extracted from source systems, either table type works; even if it is accidentally deleted, it is quick to re-extract. If you need tight control over both the data and the metadata, a managed table is still recommended.
Intermediate tables and result tables used in statistical analysis can be managed tables, since that data does not need to be shared. Moreover, for result partition tables we often only need to keep the last 3 days of data, and with an external table, dropping a partition would not delete the underlying data.
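The 3-day retention described above can be sketched with a partition drop on a managed table (the partition value is illustrative; the table matches the partitioned-table example below):

```sql
-- On a managed table, dropping the partition deletes both the
-- metastore entry and the data files under time=20201201/.
-- On an external table, only the metastore entry would be removed.
ALTER TABLE partition_test DROP IF EXISTS PARTITION (time='20201201');
```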
Managed (internal) table
create table test(
  user_id string,
  user_name string,
  hobby array<string>,
  scores map<string,int>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
External table
create external table external_test(
  user_id string,
  user_name string,
  hobby array<string>,
  scores map<string,int>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
location '/warehouse/db01.db';
Partitioned table
create table partition_test(
  user_id string,
  user_name string,
  hobby array<string>,
  scores map<string,int>
)
partitioned by (time string)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
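Loading a file into one partition of this table might look like the following (the file path is illustrative):

```sql
-- The partition column "time" is not stored in the data file; it is
-- supplied in the PARTITION clause and becomes a directory level on
-- HDFS, e.g. .../partition_test/time=20201201/
LOAD DATA INPATH '/tmp/part_20201201.txt'
INTO TABLE partition_test PARTITION (time='20201201');
```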
Bucketed table
create table bucket_test(
  user_id string,
  user_name string,
  hobby array<string>,
  scores map<string,int>
)
clustered by (user_name) sorted by (user_name) into 3 buckets
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
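Note that LOAD DATA only moves files and does not hash rows into buckets, so a bucketed table is normally populated with INSERT ... SELECT. A sketch, reading from the managed table test defined earlier:

```sql
-- In Hive 1.x this flag makes the reducer count match the bucket
-- count; from Hive 2.x onward bucketing is enforced by default.
SET hive.enforce.bucketing = true;

-- Rows are hashed on user_name into 3 buckets (3 files).
INSERT OVERWRITE TABLE bucket_test
SELECT user_id, user_name, hobby, scores FROM test;
```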
Loading data
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
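Two concrete uses of this syntax (file paths are illustrative):

```sql
-- With LOCAL: copies the file from the local filesystem of the
-- client/HiveServer host into the table directory.
LOAD DATA LOCAL INPATH '/home/hadoop/data.txt' INTO TABLE test;

-- Without LOCAL: moves (not copies) the HDFS file into the table
-- directory; OVERWRITE first clears the existing contents.
LOAD DATA INPATH '/tmp/data.txt' OVERWRITE INTO TABLE test;
```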
Original post: https://blog.csdn.net/liuge36/article/details/111425996