Trino 436 Tutorial (Personally Tested, Detailed) - 7. Iceberg Connector

Apache Iceberg is an open table format for large analytic datasets. The Iceberg connector allows querying data stored in files written in the Iceberg format, as defined in the Iceberg table spec.

7.1. Requirements

(1) Network access from the Trino coordinator and workers to the distributed object storage.

(2) Access to a Hive metastore service (HMS), an AWS Glue catalog, a JDBC catalog, a REST catalog, or a Nessie server.

(3) Data files stored in the ORC or Parquet (default) file format, on a supported file system.

7.2. Configuration

Create a properties file named iceberg.properties in /etc/trino/catalog:

cd  /etc/trino/catalog

vi  iceberg.properties

connector.name=iceberg

iceberg.catalog.type=hive_metastore

hive.metastore.uri=thrift://192.168.80.131:9083

Use an IP address for the host here, not a hostname.
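One likely reason: name resolution can differ between nodes, so the coordinator may resolve a hostname that some worker cannot. A quick sanity check to run on every node (a sketch; `getent` is available on most Linux distributions, and `my-metastore-host` is a hypothetical placeholder):

```shell
# A resolvable name prints an address line; an unresolvable one prints nothing.
getent hosts localhost
# If this fails on any node, fall back to the IP address in hive.metastore.uri:
getent hosts my-metastore-host || echo "unresolvable: use the IP address instead"
```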

7.3. Preparing Data in Iceberg

## Log in to the Hive client

hive

## Create a database

create database iceberg_db ;

use iceberg_db;

## Create an Iceberg table

SET iceberg.catalog.iceberg_hive.type=hive;

SET iceberg.catalog.iceberg_hive.uri=thrift://192.168.80.131:9083;

SET iceberg.catalog.iceberg_hive.clients=10;

SET iceberg.catalog.iceberg_hive.warehouse=hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive;

CREATE TABLE iceberg_test001 (

id    int,

name  string,

birthday  date,

create_time  timestamp

)

PARTITIONED BY(provincial string,ds string)

STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'

TBLPROPERTIES('iceberg.catalog'='iceberg_hive');

## Insert data

INSERT INTO iceberg_test001  values

(10001, 'a10001', '2020-01-01', current_timestamp(),'99999999','20240226'),

(10002, 'a10002', '2012-04-18', current_timestamp(),'99999999','20240226'),

(10003, 'a10003', '2015-11-03', current_timestamp(),'99999999','20240226'),

(10004, 'a10004', '2013-08-27', current_timestamp(),'99999999','20240226');

## Query the data

select * from iceberg_test001;

7.4. Logging In to the Client

Restart the Trino service:

service trino stop ; service trino start ;

or

service trino restart ;

Command to log in to the client:

./trino-cli --server localhost:8080

7.5. Working with Iceberg

7.5.1. Operating on Tables Created by Trino

(1) Show all catalogs (the names of the properties files under the catalog directory)

show catalogs;
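As noted, each catalog name is simply a `.properties` file name under `/etc/trino/catalog`, minus the extension. A minimal sketch of that mapping, using a temporary directory to stand in for the catalog directory:

```shell
dir=$(mktemp -d)                       # stand-in for /etc/trino/catalog
touch "$dir/iceberg.properties" "$dir/hive.properties"
# The catalog names Trino would expose, one per line:
ls "$dir" | sed 's/\.properties$//'    # prints: hive, iceberg
rm -rf "$dir"
```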

(2) Create a schema

CREATE  SCHEMA  iceberg.iceberg_hive

WITH (location='hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive/');

(3) Create a table

use iceberg.iceberg_hive ;

CREATE TABLE  example_test01 (

id  INTEGER,

name  VARCHAR,

birthday  DATE,

create_time  TIMESTAMP,

provincial  VARCHAR,

ds  VARCHAR

)

WITH (

    format = 'PARQUET',

    partitioning = ARRAY['provincial', 'ds'],

    sorted_by = ARRAY['id']

);

(4) Insert data

insert into example_test01 values

(10001, 'a10001',cast( '2020-01-01' as date), current_timestamp,'99999999','20240226'),

(10002, 'a10002', cast('2012-04-18' as date), current_timestamp,'99999999','20240226'),

(10003, 'a10003',cast( '2015-11-03' as date), current_timestamp,'99999999','20240226'),

(10004, 'a10004',cast( '2013-08-27' as date), current_timestamp,'99999999','20240226');

select * from example_test01 ;

(5) Update data

update example_test01 set name = 'XXXXXX' where id = 10004 ;

select * from example_test01 where id = 10004 ;

(6) Delete data

delete from example_test01 where id = 10004 ;

select * from example_test01 ;
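Note that Iceberg implements row-level UPDATE and DELETE by writing delete files rather than rewriting data files, so tables with many such changes accumulate small files. Trino's Iceberg connector can compact them with the `optimize` table procedure (a sketch of the documented command):

```sql
-- Rewrite small data files and apply accumulated delete files:
ALTER TABLE example_test01 EXECUTE optimize;
```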

(7) Show all tables

show tables;

(8) Show the table structure and the CREATE TABLE statement

desc example_test01 ;

show create table example_test01;
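Beyond DESC and SHOW CREATE TABLE, Trino's Iceberg connector exposes hidden metadata tables that give a deeper view of table state (a sketch of documented connector features):

```sql
-- Snapshot history of the table:
SELECT * FROM "example_test01$snapshots";
-- Data files and their partition values:
SELECT * FROM "example_test01$files";
-- Partition-level row counts and statistics:
SELECT * FROM "example_test01$partitions";
```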

Test whether the table's data can be read in Hive:

select * from example_test01 ;

As you can see, a table created through Trino cannot be queried in Hive.

Why?

 

Trino and Hive create Iceberg tables with different serialization rules (ROW FORMAT SERDE) and storage handlers (STORED BY).

-- The Iceberg table as created by Trino:

CREATE EXTERNAL TABLE `example_test01`(

  `id` int,

  `name` string,

  `birthday` date,

  `create_time` timestamp,

  `provincial` string,

  `ds` string)

ROW FORMAT SERDE  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

STORED AS INPUTFORMAT  'org.apache.hadoop.mapred.FileInputFormat'

OUTPUTFORMAT  'org.apache.hadoop.mapred.FileOutputFormat'

LOCATION  'hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive/example_test01-50b32dc5ec9247498ea7fb35d5f526bb'

TBLPROPERTIES (

  'metadata_location'='hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive/example_test01-50b32dc5ec9247498ea7fb35d5f526bb/metadata/00004-cad950c2-57be-4550-9add-c363c70ef484.metadata.json',

  'previous_metadata_location'='hdfs://192.168.80.131:8020/data/warehouse/iceberg-hive/example_test01-50b32dc5ec9247498ea7fb35d5f526bb/metadata/00003-a6d3f156-0f4a-4450-a70b-d02bb1a75e97.metadata.json',

  'table_type'='ICEBERG',

  'transient_lastDdlTime'='1709000455');

 

-- The Iceberg table as created by Hive:

CREATE TABLE `iceberg_test001`(

  `id` int COMMENT 'from deserializer',

  `name` string COMMENT 'from deserializer',

  `birthday` date COMMENT 'from deserializer',

  `create_time` timestamp COMMENT 'from deserializer',

  `provincial` string COMMENT 'from deserializer',

  `ds` string COMMENT 'from deserializer')

ROW FORMAT SERDE  'org.apache.iceberg.mr.hive.HiveIcebergSerDe'

STORED BY  'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'

LOCATION  'hdfs://bdc01/user/hive/warehouse/iceberg_db.db/iceberg_test001'

TBLPROPERTIES (

  'bucketing_version'='2',

  'current-schema'='{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"id","required":false,"type":"int"},{"id":2,"name":"name","required":false,"type":"string"},{"id":3,"name":"birthday","required":false,"type":"date"},{"id":4,"name":"create_time","required":false,"type":"timestamp"},{"id":5,"name":"provincial","required":false,"type":"string"},{"id":6,"name":"ds","required":false,"type":"string"}]}',

  'current-snapshot-id'='2017496957845643908',

  'current-snapshot-summary'='{"added-data-files":"1","added-records":"4","added-files-size":"1953","changed-partition-count":"1","total-records":"4","total-files-size":"1953","total-data-files":"1","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}',

  'current-snapshot-timestamp-ms'='1708941864398',

  'default-partition-spec'='{"spec-id":0,"fields":[{"name":"provincial","transform":"identity","source-id":5,"field-id":1000},{"name":"ds","transform":"identity","source-id":6,"field-id":1001}]}',

  'engine.hive.enabled'='true',

  'external.table.purge'='TRUE',

  'iceberg.catalog'='iceberg_hive',

  'last_modified_by'='root',

  'last_modified_time'='1708941860',

  'metadata_location'='hdfs://bdc01/user/hive/warehouse/iceberg_db.db/iceberg_test001/metadata/00001-29819742-3acb-46c0-8234-d16a350f132e.metadata.json',

  'previous_metadata_location'='hdfs://bdc01/user/hive/warehouse/iceberg_db.db/iceberg_test001/metadata/00000-4cf17717-07ae-44c2-9079-9b2c8f3e2503.metadata.json',

  'snapshot-count'='1',

  'table_type'='ICEBERG',

  'transient_lastDdlTime'='1708941860',

  'uuid'='164870ac-489f-4434-a34c-3ed50340ed34')

7.5.2. Operating on Existing Iceberg Tables

(1) Show all schemas under the iceberg catalog (each corresponds to a database)

show schemas from iceberg;

If the connection fails, check the log at /var/log/trino/server.log.

(2) Show all tables in the database

show tables from iceberg.iceberg_db;

(3) Show the table structure

desc iceberg.iceberg_db.iceberg_test001;

An Iceberg table created through Hive cannot be accessed from Trino. Why?
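A likely explanation is that the Hive table was registered through a separately named Iceberg catalog (note TBLPROPERTIES 'iceberg.catalog'='iceberg_hive' in its DDL), so its metastore entry does not line up with what Trino's hive_metastore catalog expects. One documented workaround is to register the table's existing metadata with Trino's register_table procedure. This is a sketch: it assumes iceberg.register-table-procedure.enabled=true has been set in iceberg.properties, and the target table name is illustrative; the location is taken from the Hive DDL above.

```sql
CALL iceberg.system.register_table(
  schema_name    => 'iceberg_db',
  table_name     => 'iceberg_test001_registered',
  table_location => 'hdfs://bdc01/user/hive/warehouse/iceberg_db.db/iceberg_test001'
);
```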

 
