对于Oracle 数据泵expdp,impdp是一种逻辑导出导入迁移数据的一个工具,是服务端的工具,常见于DBA人员使用,用于数据迁移。从A库迁移至B库,或者从A用户迁移至B用户等。
那么有个疑问?
在Oracle 11.2.0.4的版本,对一个表,使用并行参数,是否真实的起用了并行?假设并行为2,是否真的分2个进程,一个进程负责导出一半的数据???
1.测试导出两个不同的表,使用并行2
$ expdp scott/tiger directory=dump dumpfile=D%U.dmp parallel= cluster=n tables=dept,emp
Total estimation using BLOCKS method: KB
. . exported "SCOTT"."DEPT" 5.929 KB rows
. . exported "SCOTT"."EMP" 8.562 KB rows
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
/home/oracle/D01.dmp
/home/oracle/D02.dmp
这两个文件,一个大一个小,根本不是大小相等的文件格式。 并且dump文件是二进制文件,无法观察里面具体存放什么信息,因此直接读取dump文件观察这条路不通。
2.阅读MOS文档
Parallel Capabilities of Oracle Data Pump (Doc ID 365459.1) For every export operation, Data Pump estimates how much disk space each table data object in the export job will consume (in bytes).
This is done whether or not the user uses the ESTIMATE parameter. The estimate is printed in the log file and displayed on the client's
standard output device. The estimate is for table row data only; it does not include metadata. This estimate is used to determine how many
PX processes will be applied to the table data object, if any.
对于每个导出的表来说,会估算导出大小,只包含表的行记录对应预估的大小。并且以此评估真正使用并行时,需要使用多少个进程? The columns of the tables are examined to determine if direct path, external tables, or both methods can be used. For direct path,
the parallel number for the table data object is always one since direct path does not support parallel unload of a table data object.
PX processes are only used with external tables.
Oracle导出两种方式,直接路径读 or 外部表,直接路径读并行一直1,外部表才允许并行导出。 If the external tables method is chosen, Data Pump will determine the maximum number of PX processes that can work on a table data object.
It does this by dividing the estimated size of the table data object by MB and rounding the result down. If the result is zero or one,
then PX processes are not used to unload the table.
如果选择了外部表方法,则数据泵将确定可在表数据对象上运行的PX进程的最大数量。它通过将表数据对象的估计大小除以250 MB并将结果四舍五入来实现。如果结果为零或一,则不使用PX进程卸载表。 If a job is not big enough to make use of the maximum parallel number, then the user will not see the maximum number of active workers and
Parallel Execution Processes. For example, if there is one MB table, and it has been determined that external tables will be used,
there will be one worker for the metadata, one worker for the data, and three PX processes. As mentioned above, the worker process for
the data acts as the coordinator for the PX processes and does not count toward the parallel total. So, if a user specifies PARALLEL = ,
the degree of parallelism is actually four. The user will only see one active worker in the STATUS display. Data Pump is working optimally;
the job is too small for the specified degree of parallelism. 这段话不太理解,有点懵,后续测试下。
3.模拟一个800M的非分区表,并行使用10个并行导出,结果如何?
疑问? 表导出,是根据表统计信息估算大小? 还是根据DBA_SEGMENTS 估算?
SQL> select sum(bytes)// from dba_segments where owner='SCOTT' and segment_name='A';
SUM(BYTES)//
--------------------
824
SQL> select owner,table_name,NUM_ROWS,BLOCKS*8/1024,SAMPLE_SIZE from dba_tables where owner='SCOTT' and table_name='A';
OWNER TABLE_NAME NUM_ROWS BLOCKS*8/1024 SAMPLE_SIZE
---------- ------------------------------ ---------- ------------- -----------
SCOTT A
导出
$ expdp scott/tiger directory=dump dumpfile=D%U.dmp parallel= tables=a
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: MB
Processing object type TABLE_EXPORT/TABLE/TABLE
. . exported "SCOTT"."A" 708.3 MB rows
******************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
/u01/dump/D01.dmp
/u01/dump/D02.dmp
/u01/dump/D03.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug :: elapsed ::
$ ls -lrt D*.dmp
-rw------- oracle oinstall Aug : D03.dmp
-rw------- oracle oinstall Aug : D02.dmp
-rw------- oracle oinstall Aug : D01.dmp $expdp \'/ as sysdba\' attach=SYS_EXPORT_TABLE_01
Job: SYS_EXPORT_TABLE_01
Owner: SCOTT
Operation: EXPORT
Creator Privs: TRUE
GUID: AC156DF4AC940F1DE053453CA8C0F1FA
Start Time: Wednesday, August, ::
Mode: TABLE
Instance: tt
Max Parallelism:
EXPORT Job Parameters:
Parameter Name Parameter Value:
CLIENT_COMMAND scott/******** directory=dump dumpfile=D%U.dmp parallel=10 tables=a
State: EXECUTING
Bytes Processed: 0
Current Parallelism: 10
Job Error Count: 0
Dump File: /u01/dump/D01.dmp
bytes written: 4,096
Dump File: /u01/dump/D%u.dmp Worker 1 Status:
Process Name: DW00
State: EXECUTING
Object Schema: SCOTT
Object Name: A
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 1
Worker Parallelism: 1 Worker 2 Status:
Process Name: DW01
State: EXECUTING
Object Schema: SCOTT
Object Name: A
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 1
Worker Parallelism: 3 自问自答:数据泵这个值说是预估,实际上是直接使用DBA_SEGMENTS里面的数据。
Total estimation using BLOCKS method: 824 MB 并且此时在结合MOS最后一段话的理解,并行参数使用10,但是实际上相当于使用了两个并行主进程,一个进程负责导出元数据,导出完成元数据,在帮忙导出一点数据;
另一个进程主要负责导输出,根据算法,出现3个辅助进程,加快导出! 824/250=3 四舍五入。
这也侧面验证了对于非分区表,实际上并行导出,并不会多出几个进程并行。
疑问又来了??? 那么对于单表的导出,使用并行参数是否真的快? 还是慢?
并行10 ,11秒执行。
$ expdp scott/tiger directory=dump dumpfile=D1%U.dmp tables=a
******************************************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
/u01/dump/D101.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:12:40 2020 elapsed 0 00:00:15
15s的时间,与11s相比,差距并不大。 因此对于单表来说,其实并行并不会真正意义上明细加快速度。
4.模拟分区表,800M数据,使用并行参数导出时间如何?
create table RANGE_PART_TAB(id number,
deal_date date, area_code number, contents varchar2())
partition by range(deal_date)
(
partition p1 values less than(to_date('2020-02-01','yyyy-mm-dd')),
partition p2 values less than(to_date('2020-03-01','yyyy-mm-dd')),
partition p3 values less than(to_date('2020-04-01','yyyy-mm-dd')),
partition p4 values less than(to_date('2020-05-01','yyyy-mm-dd')),
partition p5 values less than(to_date('2020-06-01','yyyy-mm-dd')),
partition p6 values less than(to_date('2020-07-01','yyyy-mm-dd')),
partition p7 values less than(to_date('2020-08-01','yyyy-mm-dd')),
partition p8 values less than(to_date('2020-09-01','yyyy-mm-dd')),
partition p9 values less than(to_date('2020-10-01','yyyy-mm-dd')),
partition p10 values less than(to_date('2020-11-01','yyyy-mm-dd'))); insert into range_part_tab (id,deal_date,area_code,contents)
select rownum,
to_date(to_char(to_date('','yyyymmdd'),'J')+ trunc(dbms_random.value(,)),'J'),
ceil(dbms_random.value(,)),
rpad('*',,'*')
from dual
connect by rownum <= ;
SQL> r 多次自插入数据
* insert into range_part_tab select * from range_part_tab
rows created.
SQL> commit;
SQL> select sum(bytes)// from dba_segments where owner='SCOTT' and segment_name='RANGE_PART_TAB';
SUM(BYTES)//
-------------------- SQL> select PARTITION_NAME,sum(bytes)// from dba_segments where owner='SCOTT' and segment_name='RANGE_PART_TAB' group by
PARTITION_NAME order by ; PARTITION_NAME SUM(BYTES)//
------------------------------ --------------------
P1
P10
P2
P3
P4
P5
P6
P7
P8
P9
rows selected.
[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D2%U.dmp tables=RANGE_PART_TAB parallel= Worker Status:
Process Name: DW00
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects:
Total Objects:
Completed Rows: ,
Completed Bytes: ,,
Percent Done:
Worker Parallelism: Worker Status:
Process Name: DW01
State: WORK WAITING
******************************************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
/u01/dump/D201.dmp
/u01/dump/D202.dmp
/u01/dump/D203.dmp
/u01/dump/D204.dmp
/u01/dump/D205.dmp
/u01/dump/D206.dmp
/u01/dump/D207.dmp
/u01/dump/D208.dmp
/u01/dump/D209.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug :: elapsed :: [oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D3%U.dmp tables=RANGE_PART_TAB
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug :: elapsed :: !!! 测试了一个寂寞,根本不是想象中的十个不同的主进程,每个人负责一个分区,而还是一个主进程,串行化导出每个分区! 很傻很天真。 ??? 难道是对于每个Segments 大小 250M的限制??? 对测试分区表扩大几倍。 666 果然是250MB的阈值,当单个分区SEGMENTS大于250M,才真正的开始了并行的作用,导出存在10个主进程导出。
SQL> select PARTITION_NAME,sum(bytes)/1024/1024 from dba_segments where owner='SCOTT' and segment_name='RANGE_PART_TAB' group by PARTITION_NAME;
PARTITION_NAME SUM(BYTES)/1024/1024
------------------------------ --------------------
P7 309
P4 304
P1 312
P2 288
P3 304
P6 296
P10 264
P9 300
P8 312
P5 312
10 rows selected.
[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D4%U.dmp tables=RANGE_PART_TAB parallel=10
Worker 1 Status:
Process Name: DW00
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
Worker 2 Status:
Process Name: DW01
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
Worker 3 Status:
Process Name: DW02
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
Worker 4 Status:
Process Name: DW03
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
Worker 5 Status:
Process Name: DW04
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
Worker 6 Status:
Process Name: DW05
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
Worker 7 Status:
Process Name: DW06
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
Worker 8 Status:
Process Name: DW07
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
Worker 9 Status:
Process Name: DW08
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
Worker 10 Status:
Process Name: DW09
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
******************************************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
/u01/dump/D401.dmp
/u01/dump/D402.dmp
/u01/dump/D403.dmp
/u01/dump/D404.dmp
/u01/dump/D405.dmp
/u01/dump/D406.dmp
/u01/dump/D407.dmp
/u01/dump/D408.dmp
/u01/dump/D409.dmp
/u01/dump/D410.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:37:20 2020 elapsed 0 00:00:40
[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D5%U.dmp tables=RANGE_PART_TAB
测试下时间对比。
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:39:14 2020 elapsed 0 00:01:21
并行10,非并行时间 40s:80s=1:2 时间上减少了很多。数据越大,效率越明显。
总结:1.对于单个表而言,表小于250M,并行无任何意义,都是1;
2.对单个表来说,并行度2 ,基本上都是第一个进程导出元数据,第二个导出数据,第一个导出元数据的进程导出后,数据未导完毕,会继续往dump灌数据,但是几乎没有并行效果;
3.对于非分区表来说,表很大,并行开很多,仅仅只是导出时辅助进程多几个,效率这块未验证,但是初步对比,没啥效果;
4.对于分区表来说,表很大,单个分区大于250M,并行很多才有用,分区都很小,可以忽略等同于无并行。 数据量越大,并行才能真正使用。