(一)问题背景
在使用impdp进行数据导入的时候,往往在导入表和索引的统计信息的时候,速度非常慢,因此我在使用impdp进行导入时,会使用exclude=table_statistics排除表的统计信息,从而加快导入速度,之后再手动收集统计信息。
图.impdp导入数据的时导入统计信息速度非常慢
导入语句如下:
impdp user/password directory=DUMPDIR dumpfile=TEST01.dmp logfile=TEST01.log remap_schema=TEST_USER:TEST_USER123 exclude=table_statistics
手动收集统计信息语句如下:
EXEC dbms_stats.gather_table_stats(ownname => ‘LIJIAMAN‘,tabname => ‘TEST01‘);
最近使用以上方法将数据还原到测试环境后,发现与生产环境执行计划存在偏差,本来应该走全表扫描的,却走了索引范围扫描。经过确认,是由于列的直方图统计信息未收集引发的执行计划偏差。
(二)列的直方图统计信息
什么是列的直方图统计信息呢?在Oracle数据库中,Oracle默认列上的值是在最小值与最大值之间均分布的,当在计算cardinatity时,会以均匀分布的方式计算,但是在实际生活中某些场景下数据并非均匀分布。举个列子,某公司有员工10000人,表A的列COL1记录员工的绩效(分别是:A、B、C、D,A最好,D最差),那么可能A占了15%,B占了60,C占了20%,D占了5%。很明显在该场景下数据并非均匀分布,假如以均匀分布的方式去统计员工的绩效,可能会导致执行计划失准。
当列的数据分布不均匀的时候,就需要统计列上的数据分布情况,从而走出正确的执行计划,列的直方图统计信息就是记录列上的数据分布情况的。
(三)异常模拟
STEP1:创建测试表test01
create table test01 (id number, name varchar2(10) ); create index idx_test01_id on test01(id);
向test01中插入测试数据
begin insert into test01 values(1,‘a‘); for i in 1..10 loop insert into test01 values(2,‘b‘); end loop; for i in 1..100 loop insert into test01 values(3,‘c‘); end loop; for i in 1..1000 loop insert into test01 values(4,‘d‘); end loop; commit; end;
查看数据分布情况:
SQL> SELECT ID,NAME,COUNT(*) FROM test01 GROUP BY ID,NAME ORDER BY COUNT(*); ID NAME COUNT(*) ---------- ---------- ---------- 1 a 1 2 b 10 3 c 100 4 d 1000
STEP2:收集统计信息,因为上面查询过id列,故在收集统计信息的时候,会收集直方图的统计信息
EXEC dbms_stats.gather_table_stats(ownname => ‘LIJIAMAN‘,tabname => ‘TEST01‘);
查看是否已经收集了直方图信息,发现id列上已经收集
SQL> SELECT a.OWNER,a.TABLE_NAME,a.COLUMN_NAME,a.LOW_VALUE,a.HIGH_VALUE,a.NUM_BUCKETS,a.HISTOGRAM 2 FROM dba_tab_columns a 3 WHERE a.OWNER = ‘LIJIAMAN‘ AND a.TABLE_NAME = ‘TEST01‘; OWNER TABLE_NAME COLUMN_NAME LOW_VALUE HIGH_VALUE NUM_BUCKETS HISTOGRAM --------- ----------- ------------ ------------ ------------ ----------- --------------- LIJIAMAN TEST01 ID C102 C105 4 FREQUENCY LIJIAMAN TEST01 NAME 61 64 1 NONE
查看直方图,已经将id列的4个值放入了4个bucket中:
SQL> SELECT * FROM dba_tab_histograms a WHERE a.OWNER = ‘LIJIAMAN‘ AND a.TABLE_NAME = ‘TEST01‘; OWNER TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE ----------- ------------ ------------- --------------- -------------- ---------------------- LIJIAMAN TEST01 ID 1 1 LIJIAMAN TEST01 ID 11 2 LIJIAMAN TEST01 ID 111 3 LIJIAMAN TEST01 ID 1111 4 LIJIAMAN TEST01 NAME 0 5.036527952778 LIJIAMAN TEST01 NAME 1 5.192296858534
STEP3:查看id=1和id=4的执行计划,当id=1时,走索引范围扫描,当id=4时,走全表扫描
id列存在直方图统计信息,当id=1时,走索引范围扫描 | id列存在直方图统计信息,当id=4时,走全表扫描 |
SELECT * FROM test01 WHERE ID=1 Plan Hash Value : 1151852672 ---------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ---------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 5 | 2 | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID | TEST01 | 1 | 5 | 2 | 00:00:01 | | * 2 | INDEX RANGE SCAN | IDX_TEST01_ID | 1 | | 1 | 00:00:01 | ---------------------------------------------------------------------------------------- Predicate Information (identified by operation id): ------------------------------------------ * 2 - access("ID"=1) |
SELECT * FROM test01 WHERE ID=4 Plan Hash Value : 262542483 ----------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ----------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1000 | 5000 | 3 | 00:00:01 | | * 1 | TABLE ACCESS FULL | TEST01 | 1000 | 5000 | 3 | 00:00:01 | ----------------------------------------------------------------------- Predicate Information (identified by operation id): ------------------------------------------ * 1 - filter("ID"=4) |
STEP4:接下来模拟数据迁移,排除统计信息
导出表test01
expdp lijiaman/lijiaman directory=DUMPDIR tables=LIJIAMAN.TEST01 dumpfile =test01.dmp
删除原来的表:
SQL> drop table test01; Table dropped
再次导入表,排除统计信息:
impdp lijiaman/lijiaman directory=DUMPDIR dumpfile =test01.dmp exclude=table_statistics
查看表的统计信息,不存在统计信息:
SQL> SELECT a.OWNER,a.TABLE_NAME,a.COLUMN_NAME,a.LOW_VALUE,a.HIGH_VALUE,a.NUM_BUCKETS,a.HISTOGRAM 2 FROM dba_tab_columns a 3 WHERE a.OWNER = ‘LIJIAMAN‘ AND a.TABLE_NAME = ‘TEST01‘; OWNER TABLE_NAME COLUMN_NAME LOW_VALUE HIGH_VALUE NUM_BUCKETS HISTOGRAM -------------- --------------- --------------- ------------ ------------ ----------- --------------- LIJIAMAN TEST01 ID NONE LIJIAMAN TEST01 NAME NONE
STEP5:手动收集统计信息
EXEC dbms_stats.gather_table_stats(ownname => ‘LIJIAMAN‘,tabname => ‘TEST01‘);
发现统计信息已经收集,但是不存在直方图的统计信息
SQL> SELECT a.OWNER,a.TABLE_NAME,a.COLUMN_NAME,a.LOW_VALUE,a.HIGH_VALUE,a.NUM_BUCKETS,a.HISTOGRAM 2 FROM dba_tab_columns a 3 WHERE a.OWNER = ‘LIJIAMAN‘ AND a.TABLE_NAME = ‘TEST01‘; OWNER TABLE_NAME COLUMN_NAME LOW_VALUE HIGH_VALUE NUM_BUCKETS HISTOGRAM --------- ----------- ----------- ----------- ----------- ----------- --------------- LIJIAMAN TEST01 ID C102 C105 1 NONE LIJIAMAN TEST01 NAME 61 64 1 NONE
STEP6:再次查看id=1和id=4的执行计划,当id=1或id=4时,都走索引范围扫描
id列未收集直方图统计信息,当id=1时,走索引范围扫描 | id列未收集直方图统计信息,当id=4时,走索引范围扫描 |
SELECT * FROM test01 WHERE ID=1 Plan Hash Value : 1151852672 ---------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ---------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 278 | 1390 | 2 | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID | TEST01 | 278 | 1390 | 2 | 00:00:01 | | * 2 | INDEX RANGE SCAN | IDX_TEST01_ID | 278 | | 1 | 00:00:01 | ---------------------------------------------------------------------------------------- Predicate Information (identified by operation id): ------------------------------------------ * 2 - access("ID"=1) |
SELECT * FROM test01 WHERE ID=4 Plan Hash Value : 1151852672 ---------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ---------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 278 | 1390 | 2 | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID | TEST01 | 278 | 1390 | 2 | 00:00:01 | | * 2 | INDEX RANGE SCAN | IDX_TEST01_ID | 278 | | 1 | 00:00:01 | ---------------------------------------------------------------------------------------- Predicate Information (identified by operation id): ------------------------------------------ * 2 - access("ID"=4) |
STEP7:再次收集统计信息,因为使用过了id列作为查询条件,故再次收集统计信息时,会收集id列的直方图信息:
EXEC dbms_stats.gather_table_stats(ownname => ‘LIJIAMAN‘,tabname => ‘TEST01‘);
可以看到,此时已经收集了id列的直方图统计信息:
SQL> SELECT a.OWNER,a.TABLE_NAME,a.COLUMN_NAME,a.LOW_VALUE,a.HIGH_VALUE,a.NUM_BUCKETS,a.HISTOGRAM 2 FROM dba_tab_columns a 3 WHERE a.OWNER = ‘LIJIAMAN‘ AND a.TABLE_NAME = ‘TEST01‘; OWNER TABLE_NAME COLUMN_NAME LOW_VALUE HIGH_VALUE NUM_BUCKETS HISTOGRAM ------------------------------ ------------------------------ ------------------------------ ------------- ------------- ----------- --------------- LIJIAMAN TEST01 ID C102 C105 4 FREQUENCY LIJIAMAN TEST01 NAME 61 64 1 NONE
执行计划已经按照我们想要的方式走:
id列重新收集直方图统计信息,当id=1时,走索引范围扫描 | id列重新收集直方图统计信息,当id=4时,走全表扫描 |
SELECT * FROM test01 WHERE ID=1 Plan Hash Value : 1151852672 ---------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ---------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 5 | 2 | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID | TEST01 | 1 | 5 | 2 | 00:00:01 | | * 2 | INDEX RANGE SCAN | IDX_TEST01_ID | 1 | | 1 | 00:00:01 | ---------------------------------------------------------------------------------------- Predicate Information (identified by operation id): ------------------------------------------ * 2 - access("ID"=1) |
SELECT * FROM test01 WHERE ID=4 Plan Hash Value : 262542483 ----------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ----------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1000 | 5000 | 3 | 00:00:01 | | * 1 | TABLE ACCESS FULL | TEST01 | 1000 | 5000 | 3 | 00:00:01 | ----------------------------------------------------------------------- Predicate Information (identified by operation id): ------------------------------------------ * 1 - filter("ID"=4) |
(四)总结
在使用expdp/impdp进行导出/导入数据的时,统计信息是非常重要的,对于大部分统计信息,我们可以在导入结束之后收集获得。但是对于列的直方图统计信息,Oracle默认收集的方式是auto,即Oracle会根据用户对列的使用情况进行判断是否收集直方图统计信息,然而数据刚迁移完成,在表还未使用的情况下收集统计信息,往往收集不到列的直方图信息,这就造成了执行计划异常,这种情况通常在下一次收集统计信息之后会有所改变。
参考文档:
DBMS_STATS With METHOD_OPT =>‘..SIZE auto‘ May Not Collect Histograms (Doc ID 557594.1)