Apache Iceberg is an open table format for large analytic datasets. The Iceberg connector allows querying data stored in files written in the Iceberg format, as defined in the Iceberg table specification.
7.1. Requirements
(1) Network access from the Trino coordinator and workers to the distributed object storage.
(2) Access to a Hive metastore service (HMS), AWS Glue catalog, JDBC catalog, REST catalog, or Nessie server.
(3) Data files stored in the ORC or Parquet (default) file format, on a supported file system.
7.2. Configuration
Create a properties file named iceberg.properties in /etc/trino/catalog; the file name (without the extension) becomes the catalog name, here iceberg.
cd /etc/trino/catalog
vi iceberg.properties
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://192.168.80.131:9083
Use an IP address for the metastore host here, not a hostname.
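The iceberg.catalog.type value selects which of the metastore backends from the requirements list is used. As a sketch, catalog files for two of the other supported backends look like the following; the URI is a placeholder, not part of this deployment, and each backend goes in its own .properties file:

```properties
# Glue catalog (AWS credentials must be available on the nodes)
connector.name=iceberg
iceberg.catalog.type=glue
```

```properties
# REST catalog
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://example.net:8181
```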
7.3. Preparing Data in Iceberg
## Log in to the Hive client
hive
## Create a database
create database iceberg_db;
use iceberg_db;
## Create an Iceberg table
SET iceberg.catalog.iceberg_hive.type=hive;
SET iceberg.catalog.iceberg_hive.uri=thrift://192.168.80.131:9083;
SET iceberg.catalog.iceberg_hive.clients=10;
SET iceberg.catalog.iceberg_hive.warehouse=hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive;
CREATE TABLE iceberg_test001 (
id int,
name string,
birthday date,
create_time timestamp
)
PARTITIONED BY(provincial string,ds string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES('iceberg.catalog'='iceberg_hive');
## Insert data
INSERT INTO iceberg_test001 values
(10001, 'a10001', '2020-01-01', current_timestamp(),'99999999','20240226'),
(10002, 'a10002', '2012-04-18', current_timestamp(),'99999999','20240226'),
(10003, 'a10003', '2015-11-03', current_timestamp(),'99999999','20240226'),
(10004, 'a10004', '2013-08-27', current_timestamp(),'99999999','20240226');
## Query the data
select * from iceberg_test001;
7.4. Logging In with the Client
Restart the Trino service:
service trino stop ; service trino start ;
or
service trino restart ;
Log in with the client:
./trino-cli --server localhost:8080
7.5. Working with Iceberg
7.5.1. Operations on Tables Created by Trino
(1) Show all catalogs (the names of the .properties files in the catalog directory)
show catalogs;
(2) Create a schema
CREATE SCHEMA iceberg.iceberg_hive
WITH (location='hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive/');
(3) Create a table
use iceberg.iceberg_hive;
CREATE TABLE example_test01 (
id INTEGER,
name VARCHAR,
birthday DATE,
create_time TIMESTAMP,
provincial VARCHAR,
ds VARCHAR
)
WITH (
format = 'PARQUET',
partitioning = ARRAY['provincial', 'ds'],
sorted_by = ARRAY['id']
);
(4) Insert data
insert into example_test01 values
(10001, 'a10001', cast('2020-01-01' as date), current_timestamp, '99999999', '20240226'),
(10002, 'a10002', cast('2012-04-18' as date), current_timestamp, '99999999', '20240226'),
(10003, 'a10003', cast('2015-11-03' as date), current_timestamp, '99999999', '20240226'),
(10004, 'a10004', cast('2013-08-27' as date), current_timestamp, '99999999', '20240226');
select * from example_test01;
(5) Update data
update example_test01 set name = 'XXXXXX' where id = 10004 ;
select * from example_test01 where id = 10004 ;
(6) Delete data
delete from example_test01 where id = 10004 ;
select * from example_test01 ;
(7) Show all tables
show tables;
(8) Show the table structure and the CREATE TABLE statement
desc example_test01 ;
show create table example_test01;
Test whether this table's data can be read from Hive:
select * from example_test01;
As the result shows, a table created through Trino cannot be queried from Hive.
Why?
Trino and Hive register Iceberg tables in the metastore with different serialization rules (ROW FORMAT SERDE) and storage handlers (STORED BY).
-- Iceberg table as created through Trino:
CREATE EXTERNAL TABLE `example_test01`(
`id` int,
`name` string,
`birthday` date,
`create_time` timestamp,
`provincial` string,
`ds` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.FileOutputFormat'
LOCATION 'hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive/example_test01-50b32dc5ec9247498ea7fb35d5f526bb'
TBLPROPERTIES (
'metadata_location'='hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive/example_test01-50b32dc5ec9247498ea7fb35d5f526bb/metadata/00004-cad950c2-57be-4550-9add-c363c70ef484.metadata.json',
'previous_metadata_location'='hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive/example_test01-50b32dc5ec9247498ea7fb35d5f526bb/metadata/00003-a6d3f156-0f4a-4450-a70b-d02bb1a75e97.metadata.json',
'table_type'='ICEBERG',
'transient_lastDdlTime'='1709000455');
-- Iceberg table as created through Hive:
CREATE TABLE `iceberg_test001`(
`id` int COMMENT 'from deserializer',
`name` string COMMENT 'from deserializer',
`birthday` date COMMENT 'from deserializer',
`create_time` timestamp COMMENT 'from deserializer',
`provincial` string COMMENT 'from deserializer',
`ds` string COMMENT 'from deserializer')
ROW FORMAT SERDE 'org.apache.iceberg.mr.hive.HiveIcebergSerDe'
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://bdc01/user/hive/warehouse/iceberg_db.db/iceberg_test001'
TBLPROPERTIES (
'bucketing_version'='2',
'current-schema'='{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"id","required":false,"type":"int"},{"id":2,"name":"name","required":false,"type":"string"},{"id":3,"name":"birthday","required":false,"type":"date"},{"id":4,"name":"create_time","required":false,"type":"timestamp"},{"id":5,"name":"provincial","required":false,"type":"string"},{"id":6,"name":"ds","required":false,"type":"string"}]}',
'current-snapshot-id'='2017496957845643908',
'current-snapshot-summary'='{"added-data-files":"1","added-records":"4","added-files-size":"1953","changed-partition-count":"1","total-records":"4","total-files-size":"1953","total-data-files":"1","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}',
'current-snapshot-timestamp-ms'='1708941864398',
'default-partition-spec'='{"spec-id":0,"fields":[{"name":"provincial","transform":"identity","source-id":5,"field-id":1000},{"name":"ds","transform":"identity","source-id":6,"field-id":1001}]}',
'engine.hive.enabled'='true',
'external.table.purge'='TRUE',
'iceberg.catalog'='iceberg_hive',
'last_modified_by'='root',
'last_modified_time'='1708941860',
'metadata_location'='hdfs://bdc01/user/hive/warehouse/iceberg_db.db/iceberg_test001/metadata/00001-29819742-3acb-46c0-8234-d16a350f132e.metadata.json',
'previous_metadata_location'='hdfs://bdc01/user/hive/warehouse/iceberg_db.db/iceberg_test001/metadata/00000-4cf17717-07ae-44c2-9079-9b2c8f3e2503.metadata.json',
'snapshot-count'='1',
'table_type'='ICEBERG',
'transient_lastDdlTime'='1708941860',
'uuid'='164870ac-489f-4434-a34c-3ed50340ed34')
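Note that in both CREATE TABLE statements the table state lives in plain JSON metadata files ('metadata_location', 'current-schema'); what differs between the two engines is only how the table is registered in the metastore. As an illustration that this metadata is engine-neutral, a minimal Python sketch parses the 'current-schema' value from the Hive output above and recovers the column list:

```python
import json

# The 'current-schema' property value, copied from the Hive
# SHOW CREATE TABLE output above.
current_schema = (
    '{"type":"struct","schema-id":0,"fields":['
    '{"id":1,"name":"id","required":false,"type":"int"},'
    '{"id":2,"name":"name","required":false,"type":"string"},'
    '{"id":3,"name":"birthday","required":false,"type":"date"},'
    '{"id":4,"name":"create_time","required":false,"type":"timestamp"},'
    '{"id":5,"name":"provincial","required":false,"type":"string"},'
    '{"id":6,"name":"ds","required":false,"type":"string"}]}'
)

schema = json.loads(current_schema)
# Each field carries a stable id, a name, and a type, independent
# of whichever engine wrote the table.
fields = [(f["name"], f["type"]) for f in schema["fields"]]
for name, typ in fields:
    print(f"{name}: {typ}")
```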
7.5.2. Operations on Existing Iceberg Tables
(1) Show all schemas in the iceberg catalog (each schema corresponds to a database)
show schemas from iceberg;
If the connection fails, check the log at /var/log/trino/server.log.
(2) Show all tables in the database
show tables from iceberg.iceberg_db;
(3) Show the table structure
desc iceberg.iceberg_db.iceberg_test001;
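Besides desc, the Trino Iceberg connector exposes hidden metadata tables (for example "$snapshots", "$history", and "$files") for each Iceberg table, which can help when inspecting table state. A sketch, assuming the catalog and schema names from this section and that the table is reachable:

-- Snapshot history of the table
SELECT snapshot_id, committed_at, operation
FROM iceberg.iceberg_db."iceberg_test001$snapshots";

-- Data files currently in the table
SELECT file_path, record_count, file_size_in_bytes
FROM iceberg.iceberg_db."iceberg_test001$files";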
An Iceberg table created through Hive cannot be accessed from Trino. Why?