Notes on Common Phoenix Operations

Common Apache Phoenix Operations

Basics

1. Key technical points of Phoenix

a. Phoenix compiles SQL into HBase scans and wraps the results as a JDBC ResultSet.

b. Table metadata is stored in HBase tables (the system tables).

c. Coprocessors and custom filters keep queries efficient: small queries return in milliseconds, and queries over millions of rows return in seconds.

· Coprocessors perform operations on the server side, minimizing client/server data transfer.

· Custom filters prune data as close to the source as possible.

In addition, to minimize startup costs, Phoenix uses native HBase APIs rather than going through the MapReduce framework.

2. JDBC connection URL

jdbc:phoenix [ :<zookeeper quorum> [ :<port number> ] [ :<root node> ] ]

For example:

Connection conn = DriverManager.getConnection("jdbc:phoenix:server1,server2:2181");

The three URL components correspond to the properties hbase.zookeeper.quorum, hbase.zookeeper.property.clientPort, and zookeeper.znode.parent.
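The URL grammar above can be sketched as a small helper. This is only an illustration: the class name `PhoenixUrlSketch` and the method `buildUrl` are hypothetical and not part of Phoenix, and the sketch assumes that an omitted optional part is simply left out of the URL.

```java
// Hypothetical helper (not part of Phoenix): assembles
// jdbc:phoenix[:<zookeeper quorum>[:<port number>][:<root node>]]
class PhoenixUrlSketch {

    // quorum, port and rootNode may each be null, meaning "omit this part".
    static String buildUrl(String quorum, Integer port, String rootNode) {
        StringBuilder url = new StringBuilder("jdbc:phoenix");
        if (quorum == null) {
            return url.toString(); // fall back to the settings in hbase-site.xml
        }
        url.append(':').append(quorum);      // hbase.zookeeper.quorum
        if (port != null) {
            url.append(':').append(port);    // hbase.zookeeper.property.clientPort
        }
        if (rootNode != null) {
            url.append(':').append(rootNode); // zookeeper.znode.parent
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("server1,server2", 2181, null));
        System.out.println(buildUrl("server1,server2", 2181, "/hbase"));
    }
}
```

The resulting string would then be passed to `DriverManager.getConnection`, as in the example above.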

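3. Features Phoenix does not support

a. Full transactions: currently only TRANSACTION_READ_COMMITTED is supported; other isolation levels are not.

b. Relational operators: UNION, INTERSECT, MINUS.

c. Miscellaneous built-in functions.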

4. Mapping to an existing HBase table

Use CREATE TABLE or CREATE VIEW.

Differences:

a. CREATE TABLE enables KEEP_DELETED_CELLS = true; CREATE VIEW does not.

b. CREATE TABLE can add column families that do not yet exist in HBase; CREATE VIEW cannot.

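Salting

Because HBase stores data sorted by row key, a table whose row keys increase monotonically tends to create a hotspot on a single RegionServer. Salting mitigates this problem.

create table H3 (id varchar not null primary key, cf1.a varchar, cf2.b varchar) SALT_BUCKETS=20;

SALT_BUCKETS can only be set when the table is created; it cannot be changed afterwards:

alter table h1 set salt_buckets=10;
Error: ERROR 1024 (42Y83): Salt bucket number may only be specified when creating a table. tableName=H1

Things to keep in mind after salting:

a. A sequential scan may not return rows in natural key order, so a sequential scan with a LIMIT clause can return different rows than it would on an unsalted table.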

b. Split points: If no split points are specified for the table, the salted table is pre-split on salt byte boundaries to ensure load distribution among region servers even during the initial phase of the table. Users who provide split points manually need to include a salt byte in the split points they provide.

c. Row key ordering: Pre-splitting also ensures that all entries in a region start with the same salt byte and are therefore stored in sorted order. When doing a parallel scan across all region servers, Phoenix can take advantage of this property to perform a merge sort on the client side. The resulting scan is still returned sequentially, as if it came from a normal table.

In effect, salting rewrites the row key by adding a prefix:

new_row_key = (++index % BUCKETS_NUMBER) + original_key

Data is stored in BUCKETS_NUMBER buckets, every row in a bucket shares the same prefix, and a query runs against all buckets in parallel.
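The prefixing step can be illustrated with a minimal sketch. It rests on two assumptions: the salt byte is derived from a hash of the key rather than from a running index (so that a reader can recompute the bucket from the key alone), and `Arrays.hashCode` stands in for whatever hash Phoenix actually uses. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of row-key salting; not Phoenix's own code.
class SaltedKeySketch {

    // Salt byte in [0, buckets). Arrays.hashCode is a stand-in hash (assumption).
    // For buckets > 128 the value should be read as an unsigned byte.
    static byte saltByte(byte[] rowKey, int buckets) {
        if (buckets < 1 || buckets > 256) {
            throw new IllegalArgumentException("buckets must fit in one byte");
        }
        return (byte) Math.floorMod(Arrays.hashCode(rowKey), buckets);
    }

    // new_row_key = salt byte + original_key
    static byte[] salt(byte[] rowKey, int buckets) {
        byte[] salted = new byte[rowKey.length + 1];
        salted[0] = saltByte(rowKey, buckets);
        System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
        return salted;
    }

    public static void main(String[] args) {
        // Monotonically increasing keys end up spread over several buckets.
        for (int i = 1; i <= 5; i++) {
            byte[] key = String.format("row-%04d", i).getBytes(StandardCharsets.UTF_8);
            System.out.println(new String(key, StandardCharsets.UTF_8)
                    + " -> bucket " + salt(key, 20)[0]);
        }
    }
}
```

Because the prefix depends only on the key, a point lookup can recompute it, while a range scan has to visit every bucket and merge the results.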

Common ways to improve performance

1. Salting:

Salting spreads data across multiple regions, which improves read and write performance.

CREATE TABLE TEST (HOST VARCHAR NOT NULL PRIMARY KEY, DESCRIPTION VARCHAR) SALT_BUCKETS=42

With 16 region servers of 4 CPU cores each, set SALT_BUCKETS to a value between 32 and 64.

In other words, if the cluster has N CPU cores in total, SALT_BUCKETS should be between 0.5N and N.
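The sizing rule of thumb is simple arithmetic, sketched below. The class and method names are hypothetical, and the upper cap of 256 reflects my understanding that the salt is a single byte; treat it as an assumption.

```java
// Hypothetical helper: applies the 0.5N..N rule of thumb for SALT_BUCKETS.
class SaltBucketSizing {

    static final int MAX_BUCKETS = 256; // assumption: the salt is a single byte

    // Returns {lower, upper} for a cluster with the given size.
    static int[] recommendedRange(int regionServers, int coresPerServer) {
        int totalCores = regionServers * coresPerServer;          // N
        int lower = Math.min(MAX_BUCKETS, Math.max(1, totalCores / 2)); // 0.5N
        int upper = Math.min(MAX_BUCKETS, Math.max(1, totalCores));     // N
        return new int[] { lower, upper };
    }

    public static void main(String[] args) {
        int[] range = recommendedRange(16, 4);
        System.out.println("SALT_BUCKETS between " + range[0] + " and " + range[1]);
    }
}
```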

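2. Pre-splitting

If you prefer not to partition by salting, you can specify the split points manually. This avoids adding an extra byte and leaves the row key order unchanged. For example: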

CREATE TABLE TEST (HOST VARCHAR NOT NULL PRIMARY KEY, DESCRIPTION VARCHAR) SPLIT ON ('CS','EU','NA')

3. Use multiple column families

CREATE TABLE TEST (MYKEY VARCHAR NOT NULL PRIMARY KEY, A.COL1 VARCHAR, A.COL2 VARCHAR, B.COL3 VARCHAR)

4. Use compression

CREATE TABLE TEST (HOST VARCHAR NOT NULL PRIMARY KEY, DESCRIPTION VARCHAR) COMPRESSION='GZ'

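5. Use secondary indexes

See the other posts.

6. Tune the cluster

See the other posts.

7. Tune Phoenix parameters

Using an arbitrary timestamp

Set the "CurrentSCN" property in the connection Properties.

ts is a long.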

Properties props = new Properties();

props.setProperty(PhoenixRuntime.CURRENT_SCN_ATTRIB, Long.toString(ts));

Connection conn = DriverManager.getConnection(myUrl, props);

conn.createStatement().execute("UPSERT INTO myTable VALUES ('a')");

conn.commit();

This is roughly equivalent to the following HBase pseudocode:

myTable.put(Bytes.toBytes('a'),ts);

Index basics

1. Immutable index

Original text: Immutable indexing targets use cases that are write once, append only; this is common in time-series data, where you log once, but read multiple times. In this case, the indexing is managed entirely on the client - either we successfully write all the primary and index data or we return a failure to the client. Since once written, rows are never updated, no incremental index maintenance is required, making them perform very well. This reduces the overhead of secondary indexing at write time. However, keep in mind that immutable indexing is only applicable in a limited set of use cases.

One restriction of immutable indexes is that rows from the data table may not be deleted. Instead, the only way to delete rows is to drop the entire data table.

In short: immutable indexes suit data that is written once and only appended to, never modified, such as time-series data. Because rows are written once and never updated, no additional index maintenance is needed, so performance is very good.

Example:

CREATE TABLE my_table (k VARCHAR PRIMARY KEY, v1 VARCHAR, v2 VARCHAR , v3 VARCHAR) IMMUTABLE_ROWS=true;

CREATE INDEX my_index ON my_table (v2 DESC, v1) INCLUDE (v3);

2. Global indexing (mutable)

Original text: Global indexing targets read-heavy, low-write use cases. With global indexes, all the performance penalties for indexes occur at write time. We intercept the data table updates on write (DELETE, UPSERT VALUES and UPSERT SELECT), build the index update and then send any necessary updates to all interested index tables. At read time, Phoenix will select the index table to use that will produce the fastest query time and directly scan it just like any other HBase table. Note, however, if a column is referenced in a query that isn’t part of the index, the index will not be used for that query.

In short: global indexes suit workloads with heavy reads and light writes. Their cost is paid at write time: every write to the data table also updates all of its index tables. At read time, Phoenix scans the index table directly, just like any other table. If a query references a column that is not in the index table, the global index is not used.

CREATE TABLE my_table (k VARCHAR PRIMARY KEY, v1 VARCHAR, v2 VARCHAR , v3 VARCHAR);

CREATE INDEX my_index ON my_table (v2 DESC, v1) INCLUDE (v3);

3. Local indexing (mutable)

Original text: Local indexing targets write-heavy, space-constrained use cases. With local indexes, index data and table data co-reside on the same server, so there is no network overhead during writes and reads. Local indexes can be used even when the query isn’t fully covered, i.e. Phoenix automatically retrieves the columns not in the index through point gets against the data table. Unlike global indexes, all local index data of a table is stored in a separate shared table. At read time when the local index is used, every region must be examined for the data, as the exact region location of index data cannot be predetermined, which incurs some overhead.

In short: local indexes suit write-heavy, space-constrained workloads. Index data and table data live on the same server, so reads and writes incur no extra network overhead. A local index can be used even when the query references a column that is not in the index. Unlike global indexes, all local index data for a table is stored in a single shared table. At read time every region must be checked, because the exact region holding the index data cannot be determined in advance, which adds some overhead.

Example:

CREATE TABLE my_table (k VARCHAR PRIMARY KEY, v1 VARCHAR, v2 VARCHAR , v3 VARCHAR);

CREATE LOCAL INDEX my_index ON my_table (v2 DESC, v1) INCLUDE (v3);

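4. When a secondary index is not used

Create the table: create table usertable (id varchar primary key, firstname varchar, lastname varchar);

Create a global index: create index idx_name on usertable (firstname);

Query: select id, firstname, lastname from usertable where firstname = 'foo';

This query does not use idx_name, because lastname is not part of the index. For the index to be used, it must also cover lastname:

create index idx_name on usertable (firstname) include (lastname);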
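The coverage rule for global indexes can be modelled as a simple set check. This is a simplified sketch rather than Phoenix's optimizer: it assumes an index row exposes the indexed columns, the INCLUDE columns, and the data table's primary key columns, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of "is this query covered by the global index?"
class CoveredIndexSketch {

    static boolean covers(Set<String> pkColumns, List<String> indexed,
                          List<String> included, Collection<String> referenced) {
        // Columns available from the index table alone (assumption: PK columns
        // are always carried in the index row key).
        Set<String> available = new HashSet<>(pkColumns);
        available.addAll(indexed);
        available.addAll(included);
        return available.containsAll(referenced);
    }

    public static void main(String[] args) {
        Set<String> pk = Collections.singleton("id");
        List<String> query = Arrays.asList("id", "firstname", "lastname");

        // create index idx_name on usertable (firstname)
        System.out.println(covers(pk, Arrays.asList("firstname"),
                Collections.<String>emptyList(), query));

        // create index idx_name on usertable (firstname) include (lastname)
        System.out.println(covers(pk, Arrays.asList("firstname"),
                Arrays.asList("lastname"), query));
    }
}
```

In this model the first index fails the check because lastname is missing, and the second passes once lastname is included.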

5. When the primary key index is not used

Create the table: CREATE TABLE TEST (pk1 char(1) not null, pk2 char(1) not null, pk3 char(1) not null, non_pk varchar CONSTRAINT PK PRIMARY KEY(pk1, pk2, pk3));

A query that does not use the index, because it skips the leading key column pk1: select * from test where pk2='x' and pk3='y'

A query that uses the index: select * from test where pk1='x' and pk2='y'
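The two queries differ in whether the filtered columns form a leading prefix of the composite primary key. A minimal sketch of that check follows. It is a simplified model that considers equality predicates only, and the names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: how many leading primary key columns does a query constrain?
class PkPrefixSketch {

    static int leadingPrefixLength(List<String> pkColumns, Set<String> equalityColumns) {
        int length = 0;
        for (String column : pkColumns) {
            if (!equalityColumns.contains(column)) {
                break; // the first unconstrained key column ends the usable prefix
            }
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        List<String> pk = Arrays.asList("pk1", "pk2", "pk3");
        // where pk2='x' and pk3='y': no usable prefix
        System.out.println(leadingPrefixLength(pk, new HashSet<>(Arrays.asList("pk2", "pk3"))));
        // where pk1='x' and pk2='y': the prefix (pk1, pk2) narrows the scan
        System.out.println(leadingPrefixLength(pk, new HashSet<>(Arrays.asList("pk1", "pk2"))));
    }
}
```

A prefix length of zero corresponds to the first query, which cannot narrow the scan by row key; a positive length corresponds to the second.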

Subqueries

1. IN and NOT IN subqueries

SELECT ItemName

FROM Items

WHERE ItemID IN

(SELECT ItemID

FROM Orders

WHERE Date >= to_date('2013/09/02'));

2. EXISTS and NOT EXISTS subqueries

SELECT ItemName

FROM Items i

WHERE EXISTS

(SELECT *

FROM Orders

WHERE Date >= to_date('2013/09/02')

AND ItemID = i.ItemID);

3. Semi-joins and anti-joins

See Join.

4. Comparison operators

SELECT ID, Name

FROM Contest

WHERE Score >

(SELECT avg(Score)

FROM Contest)

ORDER BY Score DESC;

5. ANY/SOME/ALL operators

SELECT OrderID

FROM Orders

WHERE quantity >= ANY

(SELECT max(quantity)

FROM Orders

GROUP BY ItemID);

6. Correlated subqueries

SELECT PatentID, Title

FROM Patents p

WHERE FileDate <= ALL

(SELECT FileDate

FROM Patents

WHERE Region = p.Region);

7. Multi-level nesting

SELECT ItemID, ItemName

FROM Items i

WHERE NOT EXISTS

(SELECT *

FROM Orders

WHERE CustomerID IN

(SELECT CustomerID

FROM Customers

WHERE Country = 'Belgium')

AND Quantity < 1000

AND ItemID = i.ItemID)

OR ItemID != ALL

(SELECT ItemID

FROM Orders

WHERE CustomerID IN

(SELECT CustomerID

FROM Customers

WHERE Country = 'Germany')

AND Quantity < 2000);

8. Derived tables

SELECT m, count(*)

FROM

(SELECT max(x) m

FROM a1

GROUP BY name) AS t

GROUP BY m

ORDER BY count(*) DESC;

Multi-tenancy

Create a multi-tenant table:

CREATE TABLE base.event (tenant_id VARCHAR, event_type CHAR(1), created_date DATE, event_id BIGINT)

MULTI_TENANT=true;

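Connect as a specific tenant:

Properties props = new Properties();

props.setProperty("TenantId", "Acme");

Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost", props);

On a tenant-specific connection, the following statements create views that belong only to that tenant: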

CREATE VIEW acme.event AS

SELECT * FROM base.event;

CREATE VIEW acme.login_event AS

SELECT * FROM base.event

WHERE event_type='L';

ARRAY type

Create a table:

CREATE TABLE regions (

region_name VARCHAR NOT NULL,

zips VARCHAR ARRAY[10],

CONSTRAINT pk PRIMARY KEY (region_name));

Insert:

UPSERT INTO regions(region_name,zips)

VALUES('SF Bay Area',ARRAY['94115','94030','94125']);

Select:

SELECT zips[1] FROM regions WHERE region_name = 'SF Bay Area';

SELECT region_name FROM regions WHERE zips[1] = '94030' OR zips[2] = '94030' OR zips[3] = '94030';


Paged queries

Combine ORDER BY, a row value comparison (>), and LIMIT:

SELECT title, author, isbn, description

FROM library

WHERE published_date > 2010

AND (title, author, isbn) > (?, ?, ?)

ORDER BY title, author, isbn

LIMIT 20
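The comparison `(title, author, isbn) > (?, ?, ?)` is lexicographic: the client passes the last row of the previous page and receives the rows that sort strictly after it. The in-memory sketch below mimics that behaviour. The class and method names are hypothetical, and `String.compareTo` is only an approximation of how the database orders VARCHAR values.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory model of keyset paging over (title, author, isbn).
class KeysetPagingSketch {

    // Lexicographic tuple comparison, mirroring the row value comparison.
    static final Comparator<String[]> ROW_ORDER = (x, y) -> {
        for (int i = 0; i < x.length; i++) {
            int c = x[i].compareTo(y[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    };

    // last == null requests the first page; otherwise rows strictly after 'last'.
    static List<String[]> nextPage(List<String[]> rows, String[] last, int limit) {
        return rows.stream()
                .filter(r -> last == null || ROW_ORDER.compare(r, last) > 0)
                .sorted(ROW_ORDER)   // ORDER BY title, author, isbn
                .limit(limit)        // LIMIT n
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"B", "a", "3"},
                new String[] {"A", "y", "2"},
                new String[] {"A", "x", "1"});
        String[] last = null;
        List<String[]> page;
        while (!(page = nextPage(rows, last, 2)).isEmpty()) {
            page.forEach(r -> System.out.println(Arrays.toString(r)));
            last = page.get(page.size() - 1); // bind these values to the ? placeholders
        }
    }
}
```

Unlike OFFSET-style paging, each page starts from a key rather than a row count, so the cost of a page does not grow with how deep into the result set it is.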

Skip Scan

SELECT * from T

WHERE ((KEY1 >='a' AND KEY1 <= 'b') OR (KEY1 > 'c' AND KEY1 <= 'e'))

AND KEY2 IN (1, 2)

The List<List<KeyRange>> for SkipScanFilter for the above query would be [ [ [ a - b ], [ d - e ] ], [ 1, 2 ] ], where [ [ a - b ], [ d - e ] ] is the range for KEY1 and [ 1, 2 ] are the keys for KEY2.

Tracing

Supported on Hadoop 2 only.

Configure hadoop-metrics2-phoenix.properties:

# Sample from all the sources every 10 seconds

*.period=10

# Write Traces to Phoenix

##########################

# ensure that we receive traces on the server

phoenix.sink.tracing.class=org.apache.phoenix.trace.PhoenixMetricsSink

# Tell the sink where to write the metrics

phoenix.sink.tracing.writer-class=org.apache.phoenix.trace.PhoenixTableMetricsWriter

# Only handle traces with a context of "tracing"

phoenix.sink.tracing.context=tracing

Configure hadoop-metrics2-hbase.properties:

# ensure that we receive traces on the server

hbase.sink.tracing.class=org.apache.phoenix.trace.PhoenixMetricsSink

# Tell the sink where to write the metrics

hbase.sink.tracing.writer-class=org.apache.phoenix.trace.PhoenixTableMetricsWriter

# Only handle traces with a context of "tracing"

hbase.sink.tracing.context=tracing

Configure hbase-site.xml:

<configuration>

<property>

<name>phoenix.trace.frequency</name>

<value>always</value>

</property>

<property>

<name>phoenix.trace.statsTableName</name>

<value><your custom tracing table name></value>

</property>

</configuration>

The tracing table is initialized via the ddl:

CREATE TABLE SYSTEM.TRACING_STATS (

trace_id BIGINT NOT NULL,

parent_id BIGINT NOT NULL,

span_id BIGINT NOT NULL,

description VARCHAR,

start_time BIGINT,

end_time BIGINT,

hostname VARCHAR,

tags.count SMALLINT,

annotations.count SMALLINT,

CONSTRAINT pk PRIMARY KEY (trace_id, parent_id, span_id));

Statistics collection

Collecting statistics helps improve query performance.

Command:

UPDATE STATISTICS my_table

which is equivalent to

UPDATE STATISTICS my_table ALL

To collect statistics for indexes only, or for columns only:

UPDATE STATISTICS my_table INDEX

UPDATE STATISTICS my_table COLUMNS

Configuration parameters

phoenix.stats.guidepost.width: default 104857600 (100 MB)

phoenix.stats.guidepost.per.region

phoenix.stats.updateFrequency: default 900000 (15 mins)

phoenix.stats.minUpdateFrequency: default 7.5 mins

phoenix.stats.useCurrentTime: default true

Differences between mutable and immutable tables

Create the two tables:

create table my_mutable (id varchar not null primary key, cf1.a varchar , cf1.b varchar, cf2.c varchar, cf2.d varchar) ;

create table my_immutable (id varchar not null primary key, cf1.a varchar , cf1.b varchar, cf2.c varchar, cf2.d varchar) immutable_rows=true ;

Create an index on each:

create index index_my_mutable on my_mutable(a,c) include (b,d);

create index index_my_immutable on my_immutable(a,c) include (b,d);

Insert data into each:

upsert into my_mutable values ('1000001','a1','b1','c1','d1');
upsert into my_mutable values ('1000001','a2','b2','c2','d2');
upsert into my_mutable values ('1000001','a3','b3','c3','d3');

upsert into my_immutable values ('1000001','a1','b1','c1','d1');
upsert into my_immutable values ('1000001','a2','b2','c2','d2');
upsert into my_immutable values ('1000001','a3','b3','c3','d3');

View the data:

select * from my_mutable ;

| ID | A | B | C | D |

| 1000001 | a3 | b3 | c3 | d3 |

select * from my_immutable ;

| ID | A | B | C | D |

| 1000001 | a1 | b1 | c1 | d1 |
| 1000001 | a2 | b2 | c2 | d2 |
| 1000001 | a3 | b3 | c3 | d3 |


select * from index_my_mutable ;

| CF1:A | CF2:C | :ID | CF1:B | CF2:D |

| a3 | c3 | 1000001 | b3 | d3 |

select * from index_my_immutable ;

| CF1:A | CF2:C | :ID | CF1:B | CF2:D |

| a1 | c1 | 1000001 | b1 | d1 |
| a2 | c2 | 1000001 | b2 | d2 |
| a3 | c3 | 1000001 | b3 | d3 |


Global and local indexes

1. Create the tables:

create table immutable_local (id varchar not null primary key, cf1.a varchar, cf1.b varchar, cf2.c varchar, cf2.d varchar ) immutable_rows=true;

create table immutable_global (id varchar not null primary key, cf1.a varchar, cf1.b varchar, cf2.c varchar, cf2.d varchar ) immutable_rows=true;

create table mutable_local (id varchar not null primary key, cf1.a varchar, cf1.b varchar, cf2.c varchar, cf2.d varchar ) immutable_rows=false;

create table mutable_global (id varchar not null primary key, cf1.a varchar, cf1.b varchar, cf2.c varchar, cf2.d varchar ) immutable_rows=false;

2. Create the indexes:

create index index_mutable_global on mutable_global(a,c) include(b); -- succeeds

create local index index_mutable_local on mutable_local(a,c) include(b); -- succeeds

create index index_immutable_global on immutable_global(a,c) include(b); -- succeeds

create local index index_immutable_local on immutable_local(a,c) include(b); -- fails: a local index cannot be created on an immutable table

Error: ERROR 1048 (43A04): Local indexes aren't allowed on tables with immutable rows. tableName=INDEX_IMMUTABLE_LOCAL (state=43A04,code=1048)

3. Insert data

upsert into mutable_global values ('100001','a1','b1','c1','d1');

upsert into mutable_global values ('100002','a2','b2','c2','d2');

upsert into mutable_local values ('100001','a1','b1','c1','d1');

upsert into mutable_local values ('100002','a2','b2','c2','d2');

upsert into immutable_global values ('100001','a1','b1','c1','d1');

upsert into immutable_global values ('100002','a2','b2','c2','d2');

4. Test queries

Each query selects column d, which is not included in the index.

a. The immutable table is read with a full table scan; the index is not used

explain select a,b,c,d from immutable_global where a='a1';

CLIENT PARALLEL 1-WAY FULL SCAN OVER IMMUTABLE_GLOBAL
SERVER FILTER BY CF1.A = 'a1'

b. The mutable table is read with a full table scan; the global index is not used

explain select a,b,c,d from mutable_global where a='a1';
CLIENT PARALLEL 1-WAY FULL SCAN OVER MUTABLE_GLOBAL
SERVER FILTER BY CF1.A = 'a1'

c. The mutable table uses the local index

explain select a,b,c,d from mutable_local where a='a1';

CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1']
CLIENT MERGE SORT

5. Local index details

Index definition: create local index index_mutable_local on mutable_local(a,c) include(b);

How the index is used:

a. Partial use of the index: filtering on the first indexed column

explain select a,b,c,d from mutable_local where a='a1' ;

CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1']
CLIENT MERGE SORT

b. Partial use of the index: filtering on the second indexed column

explain select a,b,c,d from mutable_local where c='c1' ;

CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768]
SERVER FILTER BY C = 'c1'
CLIENT MERGE SORT

c. Partial use of the index: filtering on an INCLUDE column

explain select a,b,c,d from mutable_local where b='b1' ;
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768]
SERVER FILTER BY CF1.B = 'b1'
CLIENT MERGE SORT

d. Filtering on all indexed columns

explain select a,b,c,d from mutable_local where a='a1' and c='c1' ;
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1','c1']
CLIENT MERGE SORT

Swapping the order of a and c in the WHERE clause makes no difference; Phoenix optimizes it automatically.

explain select a,b,c,d from mutable_local where c='c1' and a='a1' ;
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1','c1']
CLIENT MERGE SORT

e. Filtering on every column in the index

explain select a,b,c,d from mutable_local where c='c1' and b='b1' and a='a1' ;
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1','c1']
SERVER FILTER BY CF1.B = 'b1'
CLIENT MERGE SORT

f. Filtering on a column that is not in the index

explain select a,b,c,d from mutable_local where a='a1' and d='d1';

CLIENT PARALLEL 1-WAY FULL SCAN OVER MUTABLE_LOCAL
SERVER FILTER BY (CF1.A = 'a1' AND CF2.D = 'd1')

g. Do not use * in SELECT; * causes a full table scan

explain select a,b,c,d from MUTABLE_LOCAL where a='a';
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a']
CLIENT MERGE SORT

explain select * from MUTABLE_LOCAL where a='a';
CLIENT PARALLEL 1-WAY FULL SCAN OVER MUTABLE_LOCAL
SERVER FILTER BY CF1.A = 'a'
