注意,以下文档可能有问题,以实际测试为准。因为版本6语法有变化。
https://www.cnblogs.com/kingle-study/p/10552097.html
书籍
《Greenplum企业应用实战》
外部表简介
Greenplum 在数据加载上有一个明显的优势,就是支持数据的并发加载,gpfdisk是并发加载的工具,数据库中对应的就是外部表
所谓外部表,就是在数据库中只有表定义、没有数据,数据都存放在数据库之外的数据文件。greenplum可以对一个外部表执行正常的DML操作,当读取数据的时候,数据库从数据文件中加载数据。外部表支持在segment上并发地告诉从gpfdist导入数据,由于是从segment上导入数据,所以效率很高。
结构图:
外部表需要指定gpfdist的IP和端口,还要有详细的目录地址,文件名支持通配符匹配。可以编写多个gpfdist地址,但是总数不能超过总的segment数量,否则会报错。
GPDB提供两种外部表:可读外部表用于数据装载、可写外部表用于数据卸载。外部表可基于文件、亦可基于WEB,这两种都能实现可读、可写。
当一个查询使用一个常规的外部表,该外部表被认为是可重读的,因为在该查询期间数据是静态的。而对于WEB外部表,数据是不可重读的,因为在该查询的执行期间数据可能会发生变化。
可写外部表用以从数据库表中选择记录并输出到文件、命名管道或其他可执行程序。 比如,可以从GPDB中卸载数据并发送到一个可执行程序,该程序连接到其他数据库或者ETL工具并装载数据到其他地方。 可写外部表还可以用于输出到GPDB的并行MapReduce计算。
可写外部表被定义后,数据即可从数据库表中被选择并插入到该可写外部表。 可写外部表只允许INSERT操作 – SELECT、 UPDATE、 DELETE或TRUNCATE是不允许的。可写外部表输出数据到一个可执行程序,该程序要能够接受流输入数据。
在创建外部表的时候,可以指定分隔符、err表、指定允许出错的数据条数,以及源文件的编码等信息。
语法
CREATE [READABLE] EXTERNAL [TEMPORARY | TEMP] TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION (‘file://seghost[:port]/path/file‘ [, ...])
| (‘gpfdist://filehost[:port]/file_pattern[#transform]‘
| (‘gpfdists://filehost[:port]/file_pattern[#transform]‘
[, ...])
FORMAT ‘TEXT‘
[( [HEADER]
[DELIMITER [AS] ‘delimiter‘ | ‘OFF‘]
[NULL [AS] ‘null string‘]
[ESCAPE [AS] ‘escape‘ | ‘OFF‘]
[NEWLINE [ AS ] ‘LF‘ | ‘CR‘ | ‘CRLF‘]
[FILL MISSING FIELDS] )]
| ‘CSV‘
[( [HEADER]
[QUOTE [AS] ‘quote‘]
[DELIMITER [AS] ‘delimiter‘]
[NULL [AS] ‘null string‘]
[FORCE NOT NULL column [, ...]]
[ESCAPE [AS] ‘escape‘]
[NEWLINE [ AS ] ‘LF‘ | ‘CR‘ | ‘CRLF‘]
[FILL MISSING FIELDS] )]
| ‘CUSTOM‘ (Formatter=<formatter specifications>)
[ ENCODING ‘encoding‘ ]
[ [LOG ERRORS] SEGMENT REJECT LIMIT count
[ROWS | PERCENT] ]
CREATE [READABLE] EXTERNAL WEB [TEMPORARY | TEMP] TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION (‘http://webhost[:port]/path/file‘ [, ...])
| EXECUTE ‘command‘ [ON ALL
| MASTER
| number_of_segments
| HOST [‘segment_hostname‘]
| SEGMENT segment_id ]
FORMAT ‘TEXT‘
[( [HEADER]
[DELIMITER [AS] ‘delimiter‘ | ‘OFF‘]
[NULL [AS] ‘null string‘]
[ESCAPE [AS] ‘escape‘ | ‘OFF‘]
[NEWLINE [ AS ] ‘LF‘ | ‘CR‘ | ‘CRLF‘]
[FILL MISSING FIELDS] )]
| ‘CSV‘
[( [HEADER]
[QUOTE [AS] ‘quote‘]
[DELIMITER [AS] ‘delimiter‘]
[NULL [AS] ‘null string‘]
[FORCE NOT NULL column [, ...]]
[ESCAPE [AS] ‘escape‘]
[NEWLINE [ AS ] ‘LF‘ | ‘CR‘ | ‘CRLF‘]
[FILL MISSING FIELDS] )]
| ‘CUSTOM‘ (Formatter=<formatter specifications>)
[ ENCODING ‘encoding‘ ]
[ [LOG ERRORS] SEGMENT REJECT LIMIT count
[ROWS | PERCENT] ]
CREATE WRITABLE EXTERNAL [TEMPORARY | TEMP] TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION(‘gpfdist://outputhost[:port]/filename[#transform]‘
| (‘gpfdists://outputhost[:port]/file_pattern[#transform]‘
[, ...])
FORMAT ‘TEXT‘
[( [DELIMITER [AS] ‘delimiter‘]
[NULL [AS] ‘null string‘]
[ESCAPE [AS] ‘escape‘ | ‘OFF‘] )]
| ‘CSV‘
[([QUOTE [AS] ‘quote‘]
[DELIMITER [AS] ‘delimiter‘]
[NULL [AS] ‘null string‘]
[FORCE QUOTE column [, ...]] ]
[ESCAPE [AS] ‘escape‘] )]
| ‘CUSTOM‘ (Formatter=<formatter specifications>)
[ ENCODING ‘write_encoding‘ ]
[ DISTRIBUTED BY (column [opclass], [ ... ] ) | DISTRIBUTED RANDOMLY ]
CREATE WRITABLE EXTERNAL WEB [TEMPORARY | TEMP] TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
EXECUTE ‘command‘ [ON ALL]
FORMAT ‘TEXT‘
[( [DELIMITER [AS] ‘delimiter‘]
[NULL [AS] ‘null string‘]
[ESCAPE [AS] ‘escape‘ | ‘OFF‘] )]
| ‘CSV‘
[([QUOTE [AS] ‘quote‘]
[DELIMITER [AS] ‘delimiter‘]
[NULL [AS] ‘null string‘]
[FORCE QUOTE column [, ...]] ]
[ESCAPE [AS] ‘escape‘] )]
| ‘CUSTOM‘ (Formatter=<formatter specifications>)
[ ENCODING ‘write_encoding‘ ]
[ DISTRIBUTED BY (column [opclass], [ ... ] ) | DISTRIBUTED RANDOMLY ]
testDB=#
创建外部表
语法
gpfdist [-d directory] [-p http_port] [-l log_file] [-t timeout]
[-S] [-w time] [-v | -V] [-s] [-m max_length] [--ssl certificate_path]
gpfdist -? | --help
gpfdist --version
启动进程
[gpadmin@mdw ~]$ nohup gpfdist -d /home/gpadmin/script/ -p 8081 -l /home/gpadmin/script/gpfdist.log &
[1] 32971
[gpadmin@mdw ~]$ nohup: ignoring input and appending output to ‘nohup.out’
[gpadmin@mdw ~]$
[gpadmin@mdw ~]$
[gpadmin@mdw ~]$
[gpadmin@mdw ~]$
[gpadmin@mdw ~]$ ll
total 75448
drwxrwxr-x 2 gpadmin gpadmin 4096 Apr 14 22:51 gpAdminLogs
drwxrwxr-x 2 gpadmin gpadmin 41 Feb 16 16:27 gpconfig
drwxr-xr-x 2 gpadmin gpadmin 195 Feb 23 21:43 greenplum-cc-web-6.4.0-gp6-rhel7-x86_64
-rw-r--r-- 1 gpadmin gpadmin 77241507 Feb 17 23:01 greenplum-cc-web-6.4.0-gp6-rhel7-x86_64.zip
-rw-rw-r-- 1 gpadmin gpadmin 2733 Feb 23 20:49 initGreenplum
drwxrwxr-x 2 gpadmin gpadmin 61 Mar 15 16:33 l
-rw------- 1 gpadmin gpadmin 583 Apr 15 01:15 nohup.out
-rw-rw-r-- 1 gpadmin gpadmin 84 Apr 14 22:51 recov
drwxrwxr-x 2 gpadmin gpadmin 25 Apr 15 01:15 script
[gpadmin@mdw ~]$ jobs
[1]+ Running nohup gpfdist -d /home/gpadmin/script/ -p 8081 -l /home/gpadmin/script/gpfdist.log &
[gpadmin@mdw ~]$ cd script/
[gpadmin@mdw script]$ ls
gpfdist.log
[gpadmin@mdw script]$ more gpfdist.log
[gpadmin@mdw script]$ ss -lntup|grep 8081
tcp LISTEN 0 256 [::]:8081 [::]:* users:(("gpfdist",pid=32971,fd=6))
[gpadmin@mdw script]$
[gpadmin@mdw script]$
[gpadmin@mdw script]$ vim test.txt
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
San Francisco,Sept,156,6846.34
Paris,Nov,159,7134.56
San Francisco,Jan,113,5397.89
Prague,Dec,333,9894.77
Bangalore,Jul,271,8320.55
Beijing,Dec,100,4248.41
创建外部表
create external table public.test
(
country varchar(128),
name varchar(128),
id int,
sale varchar(128)
)
location (‘gpfdist://10.10.10.101:8081/test.txt‘)
format ‘text‘
(delimiter ‘,‘ null as ‘‘ escape ‘off‘)
encoding ‘utf8‘
log errors segment reject limit 3 rows;
--- location 文件所在位置,可以直接是本地路径、gpfdist地址、gpfdists地址、gphdfs地址。
--- format 文本类型
--- delimiter 分隔符
--- encoding 编码
--- log errors into 错误数据表,记录错误数据,会自动创建。一般都是tablename_err格式,例如t1_err。
--- segment reject limit 错误数据的条数/百分比(rows/percent),超过设置值会报错。最小值是2。用来确保数据的完整性。
testDB=# create external table public.test
testDB-# (
testDB(# country varchar(128),
testDB(# name varchar(128),
testDB(# id int,
testDB(# sale varchar(128)
testDB(# )
testDB-# location (‘gpfdist://10.10.10.101:8081/test.txt‘)
testDB-# format ‘text‘
testDB-# (delimiter ‘,‘ null as ‘‘ escape ‘off‘)
testDB-# encoding ‘utf8‘
testDB-# log errors segment reject limit 3 rows;
CREATE EXTERNAL TABLE
testDB=#
testDB=#
testDB=# select * from public.test;
country | name | id | sale
---------------+------+-----+----------
Prague | Jan | 101 | 4875.33
Rome | Mar | 87 | 1557.39
Bangalore | May | 317 | 8936.99
Beijing | Jul | 411 | 11600.67
San Francisco | Sept | 156 | 6846.34
Paris | Nov | 159 | 7134.56
San Francisco | Jan | 113 | 5397.89
Prague | Dec | 333 | 9894.77
Bangalore | Jul | 271 | 8320.55
Beijing | Dec | 100 | 4248.41
(10 rows)
数据装载
表结构必须存在
testDB=# create table public.test_in
testDB-# (
testDB(# country varchar(128),
testDB(# name varchar(128),
testDB(# id int,
testDB(# sale varchar(128)
testDB(# )
testDB-# ;
NOTICE: Table doesn‘t have ‘DISTRIBUTED BY‘ clause -- Using column named ‘country‘ as the Greenplum Database data distribution key for this table.
HINT: The ‘DISTRIBUTED BY‘ clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
testDB=# insert into test_in select * from public.test;
INSERT 0 10
testDB=#
如果表结构不存在,则会插入不进去
testDB=# insert into testx001 select * from public.test;
ERROR: relation "testx001" does not exist
LINE 1: insert into testx001 select * from public.test;