Related concepts:
- **Motion:** besides the usual database operations (table scans, joins, and so on), Greenplum has an additional operator type called a motion. A motion moves tuples between segments.
- **Slice:** to achieve maximum parallelism during query execution, Greenplum divides the work of a query plan into slices. A slice is a portion of the plan that can be processed independently. The plan generates a slice for each motion, with one slice on each side of the motion.
- **Gang:** the processes that belong to the same slice but run on different segments are called a gang.
Test environment:
- Greenplum version: 6
- Cluster: a single master with no standby master, and two primary segments with no mirrors.
Test log:
test=# select * from test;
DEBUG1: 00000: Message type Q received by from libpq, len = 20
DEBUG3: 00000: StartTransactionCommand
DEBUG3: 00000: StartTransaction
LOG: 00000: statement: select * from test;
LOCATION: exec_simple_query, postgres.c:1639
[OPT]: Using default search strategy
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..431.00 rows=1 width=8)
-> Seq Scan on test (cost=0.00..431.00 rows=1 width=8)
DEBUG1: 00000: GPORCA produced plan
LOG: 00000: plan:
DETAIL: {PLANNEDSTMT
:commandType 1
:planTree
{MOTION
:motionID 1
:motionType 1
:nMotionNodes 1
:nInitPlans 0
:lefttree
{SEQSCAN
:flow
{FLOW
:flotype 0
}
:nMotionNodes 0
:nInitPlans 0
}
}
:rtable (
{RTE
:eref
{ALIAS
:aliasname test
:colnames ("id" "age")
}
}
)
:utilityStmt <>
:subplans <>
}
Slice 1 on seg0
DEBUG1: 00000: Message type M received by from libpq, len = 457 (seg0 slice1 192.168.106.132:7000 pid=43071)
DEBUG3: 00000: StartTransactionCommand (seg0 slice1 192.168.106.132:7000 pid=43071)
DEBUG3: 00000: StartTransaction (seg0 slice1 192.168.106.132:7000 pid=43071)
DEBUG3: 00000: CommitTransactionCommand (seg0 slice1 192.168.106.132:7000 pid=43071)
DEBUG3: 00000: CommitTransaction (seg0 slice1 192.168.106.132:7000 pid=43071)
Slice 1 on seg1
DEBUG1: 00000: Message type M received by from libpq, len = 457 (seg1 slice1 192.168.106.133:7000 pid=43218)
DEBUG3: 00000: StartTransactionCommand (seg1 slice1 192.168.106.133:7000 pid=43218)
DEBUG3: 00000: StartTransaction (seg1 slice1 192.168.106.133:7000 pid=43218)
DEBUG3: 00000: CommitTransactionCommand (seg1 slice1 192.168.106.133:7000 pid=43218)
DEBUG3: 00000: CommitTransaction (seg1 slice1 192.168.106.133:7000 pid=43218)
master
DEBUG3: 00000: CommitTransactionCommand
DEBUG3: 00000: CommitTransaction
The log above has been cleaned up: lines that are irrelevant, or that we do not care about here, have been removed.
It is the debug output of executing select * from test;, and it shows the following:
- (1) The execution plan:
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..431.00 rows=1 width=8)
-> Seq Scan on test (cost=0.00..431.00 rows=1 width=8)
- (2) The plan tree:
The log shows how the plan text maps onto the plan tree: Gather Motion --> MOTION, Seq Scan --> SEQSCAN.
- (3) Slice0: the root slice, which runs on the master. Its gang type is GANGTYPE_UNALLOCATED (/* a root slice executed by the qDisp */ in the GangType enum), the type used for the slice on the root node (the master).
- (4) Slice1: runs on seg0 and seg1, as shown in the log above.
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..431.00 rows=1 width=8)
Code analysis:
void
PostgresMain(int argc, char *argv[], const char *dbname, const char *username)
{
......
for (;;)
{
......
switch (firstchar)
{
case 'Q': /* simple query */
{
......
else
exec_simple_query(query_string);
send_ready_for_query = true;
}
break;
}
}
}
Log excerpt:
DEBUG1: 00000: Message type Q received by from libpq, len = 20
......
LOCATION: exec_simple_query, postgres.c:1639
- From the log and the code above we can see that this is the entry point where the master executes SQL: when the first character of the message is 'Q', a statement is to be executed, and exec_simple_query is the entry function for running the SQL string.
- We skip SQL parsing and plan generation here and jump straight to slice initialization and the steps after it, since the focus of this article is the slices and gangs of a parallel plan.
Slice-related structures:
/*
 * Slice 0 is the root slice of plan as a whole.
 * Slices 1 through nMotion are motion slices with a sending motion at
 * the root of the slice.
 * Slices nMotion+1 and on are root slices of initPlans.
 */
typedef struct SliceTable
{
NodeTag type;
int nMotions; /* The number Motion nodes in the entire plan */
int nInitPlans; /* The number of initplan slices allocated */
int localSlice; /* Index of the slice to execute. */
List *slices; /* List of slices */
int instrument_options; /* OR of InstrumentOption flags */
uint32 ic_instance_id;
} SliceTable;
As the comment explains, slices fall into three categories:
- the root slice, which lives on the master and has index 0;
- motion slices;
- initPlan slices.
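Putting together the fields that the rest of this article touches, a simplified sketch of the Slice node itself looks roughly like this (reconstructed from the code excerpts below; the real Greenplum 6 definition contains more members):

typedef struct Slice
{
	NodeTag		type;
	int			sliceIndex;			/* index of this slice in SliceTable->slices */
	int			rootIndex;			/* index of the root slice of this slice's tree */
	int			parentIndex;		/* index of the parent (receiving) slice */
	List	   *children;			/* child slice indexes (list of int) */
	GangType	gangType;			/* GANGTYPE_UNALLOCATED, GANGTYPE_PRIMARY_READER, ... */
	List	   *segments;			/* segments this slice is dispatched to */
	struct Gang *primaryGang;		/* gang allocated for this slice (NULL for the QD slice) */
	List	   *primaryProcesses;	/* CdbProcess list; for slice0 this is the QD itself */
} Slice;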
Segment configuration table:
template1=# select * from gp_segment_configuration;
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+------+----------+---------+-----------------------
1 | -1 | p | p | n | u | 5432 | mdw | mdw | /data/master/gpseg-1
2 | 0 | p | p | n | u | 7000 | sdw1 | sdw1 | /data1/primary/gpseg0
3 | 1 | p | p | n | u | 7000 | sdw2 | sdw2 | /data1/primary/gpseg1
(3 rows)
This table shows that the master's row has content = -1, which matches MASTER_CONTENT_ID in gp_segment_config.h. The master node initializes GpIdentity.segindex = MASTER_CONTENT_ID, so whenever segindex is -1 we can conclude that we are on the master node.
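As a small illustration of that convention, a check like the following (a hypothetical helper, not a function from the Greenplum source) is all that is needed to decide whether the current process runs on the master:

/* Hypothetical helper: the master (QD) initializes GpIdentity.segindex to
 * MASTER_CONTENT_ID (-1), so a segindex of -1 identifies the master node. */
static inline bool
running_on_master(void)
{
	return GpIdentity.segindex == MASTER_CONTENT_ID;
}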
Call stack for slice creation:
void
InitSliceTable(EState *estate, int nMotions, int nSubplans)
{
SliceTable *table;
Slice *slice;
int i,
n;
MemoryContext oldcontext;
n = 1 + nMotions + nSubplans;
table = makeNode(SliceTable);
table->nMotions = nMotions;
table->nInitPlans = nSubplans;
......
for (i = 0; i < n; i++)
{
slice = makeNode(Slice);
slice->sliceIndex = i;
......
slice->gangType = GANGTYPE_UNALLOCATED;
......
table->slices = lappend(table->slices, slice);
}
estate->es_sliceTable = table;
......
}
The log shows:
:nMotionNodes 1
:nInitPlans 0
So InitSliceTable creates two slices, with indexes 0 and 1, corresponding to slice0 and slice1.
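So, right after InitSliceTable returns, the slice table should look roughly like this (a sketch of the state, not actual program output):

/*
 * SliceTable after InitSliceTable(estate, nMotions = 1, nSubplans = 0):
 *   nMotions   = 1
 *   nInitPlans = 0
 *   slices[0]: sliceIndex = 0, gangType = GANGTYPE_UNALLOCATED   (root slice)
 *   slices[1]: sliceIndex = 1, gangType = GANGTYPE_UNALLOCATED   (refined later by FillSliceTable)
 */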
PlannedStmt-related log information:
DETAIL: {PLANNEDSTMT
......
:intoClause <>
:copyIntoClause <>
:refreshClause <>
......
}
Call stack for slice initialization:
static void
FillSliceTable(EState *estate, PlannedStmt *stmt)
{
FillSliceTable_cxt cxt;
SliceTable *sliceTable = estate->es_sliceTable;
if (!sliceTable)
return;
cxt.prefix.node = (Node *) stmt;
cxt.estate = estate;
cxt.currentSliceId = 0;
if (stmt->intoClause != NULL || stmt->copyIntoClause != NULL || stmt->refreshClause)
{
......
}
/*
* NOTE: We depend on plan_tree_walker() to recurse into subplans of
* SubPlan nodes.
*/
FillSliceTable_walker((Node *) stmt->planTree, &cxt);
}
The log above shows that the condition if (stmt->intoClause != NULL || stmt->copyIntoClause != NULL || stmt->refreshClause) is not satisfied (all three clauses are <> in the log), so we focus on the function below.
/* ModifyTable node - apply rows produced by subplan(s) to result table(s) */
typedef struct ModifyTable
{
......
CmdType operation; /* INSERT, UPDATE, or DELETE */
......
} ModifyTable;
As the comment indicates, this struct represents an operation that modifies a table.
static bool FillSliceTable_walker(Node *node, void *context)
{
if (IsA(node, ModifyTable))
{
......
}
/* A DML node is the same as a ModifyTable node, in ORCA plans. */
if (IsA(node, DML))
{
......
}
if (IsA(node, Motion))
{
......
/* Top node of subplan should have a Flow node. */
Insist(motion->plan.lefttree && motion->plan.lefttree->flow);
sendFlow = motion->plan.lefttree->flow;
/* Look up the sending gang's slice table entry. */
sendSlice = (Slice *) list_nth(sliceTable->slices, motion->motionID);
/* Look up the receiving (parent) gang's slice table entry. */
recvSlice = (Slice *)list_nth(sliceTable->slices, parentSliceIndex);
/* Sending slice become a children of recv slice */
recvSlice->children = lappend_int(recvSlice->children, sendSlice->sliceIndex);
sendSlice->parentIndex = parentSliceIndex;
sendSlice->rootIndex = recvSlice->rootIndex;
/* The gang beneath a Motion will be a reader. */
sendSlice->gangType = GANGTYPE_PRIMARY_READER;
if (sendFlow->flotype != FLOW_SINGLETON) /* log shows :flotype 0; FLOW_SINGLETON is 1, so this branch is taken */
{
sendSlice->gangType = GANGTYPE_PRIMARY_READER;
/*
* If the PLAN is generated by ORCA, We assume that they
* distpatch on all segments.
*/
if (stmt->planGen == PLANGEN_PLANNER) /* log shows :planGen 1; PLANGEN_PLANNER is 0, so the else branch below is taken */
FillSliceGangInfo(sendSlice, sendFlow->numsegments);
else
FillSliceGangInfo(sendSlice, getgpsegmentCount());
}
else
{
......
}
......
/* recurse into children */
cxt->currentSliceId = motion->motionID;
result = plan_tree_walker(node, FillSliceTable_walker, cxt);
cxt->currentSliceId = parentSliceIndex;
return result;
}
if (IsA(node, SubPlan))
{
......
}
return plan_tree_walker(node, FillSliceTable_walker, cxt);
}
Related structures:
typedef enum FlowType
{
FLOW_UNDEFINED, /* used prior to calculation of type of derived flow */
FLOW_SINGLETON, /* flow has single stream */
FLOW_REPLICATED, /* flow is replicated across IOPs */
FLOW_PARTITIONED, /* flow is partitioned across IOPs */
} FlowType;
typedef enum PlanGenerator
{
PLANGEN_PLANNER, /* plan produced by the planner*/
PLANGEN_OPTIMIZER, /* plan produced by the optimizer*/
} PlanGenerator;
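As a reading aid, the integer values in the log excerpt below map back to these enums as follows (assuming the usual zero-based C enum numbering):

/*
 * :flotype 0  ->  FLOW_UNDEFINED     (not FLOW_SINGLETON, which is 1,
 *                                     so the "not a singleton flow" branch is taken)
 * :planGen 1  ->  PLANGEN_OPTIMIZER  (not PLANGEN_PLANNER, which is 0,
 *                                     so FillSliceGangInfo(sendSlice, getgpsegmentCount()) is used)
 */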
Related log information:
DETAIL: {PLANNEDSTMT
:commandType 1
:planGen 1
:planTree
{MOTION
:motionID 1
:nMotionNodes 1
:nInitPlans 0
:lefttree
{SEQSCAN......}
:flow
{FLOW
:flotype 0
:req_move 0
:locustype 0
:segindex 0
:numsegments 1
:hashExprs <>
:hashOpfamilies <>
:flow_before_req_move <>
}
}
:rtable (
{RTE
:eref
{ALIAS
:aliasname test
:colnames ("id" "age")
}
}
)
:utilityStmt <>
:subplans <>
}
FillSliceTable_walker has four branches:
- if (IsA(node, ModifyTable))
- if (IsA(node, DML))
- if (IsA(node, Motion))
- if (IsA(node, SubPlan))
FillSliceTable_walker is called by FillSliceTable as follows:
FillSliceTable_walker((Node *) stmt->planTree, &cxt);
- As we can see, stmt is a PlannedStmt object, corresponding to the PLANNEDSTMT keyword in the log, and in the log its planTree is a Motion (MOTION). The flow therefore takes the Motion branch.
The third branch (Motion):
- FillSliceTable_walker is called by FillSliceTable, which sets cxt.currentSliceId = 0; so inside FillSliceTable_walker, int parentSliceIndex = cxt->currentSliceId; makes parentSliceIndex 0.
- The log shows that motion->motionID is 1 (:motionID 1), so sendSlice is slice1 from the log and recvSlice is slice0.
This branch does four things:
- it sets recvSlice to slice0 and appends sendSlice (slice1) as its child; from the earlier logic we know slice0's gang type stays GANGTYPE_UNALLOCATED;
- it sets slice1's gang type to GANGTYPE_PRIMARY_READER;
- it arranges for slice1 to be dispatched to all segments (FillSliceGangInfo with getgpsegmentCount(), since the plan was generated by ORCA);
- it calls plan_tree_walker recursively on slice1's subtree (details to be covered separately).
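Putting this together, after FillSliceTable has run, the slice table for select * from test should look roughly like this (a sketch, not actual output):

/*
 * slice0: sliceIndex = 0, gangType = GANGTYPE_UNALLOCATED,     children = [1]   (runs on the QD)
 * slice1: sliceIndex = 1, gangType = GANGTYPE_PRIMARY_READER,  parentIndex = 0, children = [],
 *         dispatched to all primary segments (seg0 and seg1 in this cluster)
 */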
Context variable initialization:
static int
BackendStartup(Port *port)
{
pid = fork_process();
if (pid == 0) /* child */
{
......
MyProcPid = getpid(); /* reset MyProcPid */
......
}
}
So MyProcPid holds the pid of the current, forked, backend child process.
Gang creation call stack:
Code:
void
AssignGangs(CdbDispatcherState *ds, QueryDesc *queryDesc)
{
......
InventorySliceTree(ds, sliceTable->slices, rootIdx);
}
This function delegates the actual work to InventorySliceTree.
void
InventorySliceTree(CdbDispatcherState *ds, List *slices, int sliceIndex)
{
ListCell *cell;
int childIndex;
Slice *slice = list_nth(slices, sliceIndex);
if (slice->gangType == GANGTYPE_UNALLOCATED)
{
slice->primaryGang = NULL;
slice->primaryProcesses = getCdbProcessesForQD(true);
}
else
{
Assert(slice->segments != NIL);
slice->primaryGang = AllocateGang(ds, slice->gangType, slice->segments);
setupCdbProcessList(slice);
}
foreach(cell, slice->children)
{
childIndex = lfirst_int(cell);
InventorySliceTree(ds, slices, childIndex);
}
}
- From the earlier analysis, slice0's gangType is GANGTYPE_UNALLOCATED, so only slice0's primaryProcesses is set, while slice1 takes the else branch. Finally, the function walks the current slice's children and calls InventorySliceTree recursively: for slice0 the foreach recurses into its only child, slice1; slice1 itself has no children, so its foreach does nothing (see the sketch below).
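A sketch of the resulting call sequence for our two-slice plan:

/*
 * InventorySliceTree(ds, slices, 0)                 -- slice0: GANGTYPE_UNALLOCATED
 *   primaryGang      = NULL
 *   primaryProcesses = getCdbProcessesForQD(true)   -- the QD itself, no gang
 *   foreach child of slice0 (children = [1]):
 *     InventorySliceTree(ds, slices, 1)             -- slice1: GANGTYPE_PRIMARY_READER
 *       primaryGang = AllocateGang(ds, GANGTYPE_PRIMARY_READER, slice1->segments)
 *       setupCdbProcessList(slice1)
 *       foreach child of slice1: (none)
 */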
Handling of slice0:
/*
* getCdbProcessForQD: Manufacture a CdbProcess representing the QD,
* as if it were a worker from the executor factory.
*
* NOTE: Does not support multiple (mirrored) QDs.
*/
List *
getCdbProcessesForQD(int isPrimary)
{
CdbComponentDatabaseInfo *qdinfo;
CdbProcess *proc;
Assert(Gp_role == GP_ROLE_DISPATCH);
qdinfo = cdbcomponent_getComponentInfo(MASTER_CONTENT_ID);
proc = makeNode(CdbProcess);
......
proc->pid = MyProcPid;
......
list = lappend(list, proc);
return list;
}
- From the implementation above we can see that primaryProcesses is set to the current process. In other words, this function assigns the process that executes the slice on the master, which is simply the current dispatch process. (For a description of Gp_role, see my other article: greenplum-QD&QE启动流程.) So no gang is allocated for slice0.
Handling of slice1:
/*
* Creates a new gang by logging on a session to each segDB involved.
*
* elog ERROR or return a non-NULL gang.
*/
Gang *
AllocateGang(CdbDispatcherState *ds, GangType type, List *segments)
{
MemoryContext oldContext;
SegmentType segmentType;
Gang *newGang = NULL;
int i;
......
if (Gp_role != GP_ROLE_DISPATCH)
{
elog(FATAL, "dispatch process called with role %d", Gp_role);
}
if (type == GANGTYPE_PRIMARY_WRITER)
segmentType = SEGMENTTYPE_EXPLICT_WRITER;
/* for extended query like cursor, must specify a reader */
else if (ds->isExtendedQuery)
segmentType = SEGMENTTYPE_EXPLICT_READER;
else
segmentType = SEGMENTTYPE_ANY;
......
newGang = cdbgang_createGang(segments, segmentType);
newGang->allocated = true;
newGang->type = type;
/*
* Push to the head of the allocated list, later in
* cdbdisp_destroyDispatcherState() we should recycle them from the head to
* restore the original order of the idle gangs.
*/
ds->allocatedGangs = lcons(newGang, ds->allocatedGangs);
ds->largestGangSize = Max(ds->largestGangSize, newGang->size);
if (type == GANGTYPE_PRIMARY_WRITER)
{
/*
* set "whoami" for utility statement. non-utility statement will
* overwrite it in function getCdbProcessList.
*/
for (i = 0; i < newGang->size; i++)
cdbconn_setQEIdentifier(newGang->db_descriptors[i], -1);
}
return newGang;
}
From the earlier logic we know slice1's type is GANGTYPE_PRIMARY_READER; it is not a writer gang and this is not an extended query, so segmentType here is SEGMENTTYPE_ANY.
Call chain for creating the libpq connections:
Code:
Gang *
cdbgang_createGang_async(List *segments, SegmentType segmentType)
{
Gang *newGangDefinition;
newGangDefinition = NULL;
/* allocate and initialize a gang structure */
......
newGangDefinition = buildGangDefinition(segments, segmentType);
CurrentGangCreating = newGangDefinition;
totalSegs = getgpsegmentCount();
size = list_length(segments);
......
PG_TRY();
{
for (i = 0; i < size; i++)
{
......
segdbDesc = newGangDefinition->db_descriptors[i];
ret = build_gpqeid_param(gpqeid, sizeof(gpqeid),
segdbDesc->isWriter,
segdbDesc->identifier,
segdbDesc->segment_database_info->hostSegs,
totalSegs * 2);
......
cdbconn_doConnectStart(segdbDesc, gpqeid, options);
pollingStatus[i] = PGRES_POLLING_WRITING;
}
for (;;)
{......}
......
return newGangDefinition;
}
Here we skip the details of the network connection and handshake and focus on the gang-related parts:
- buildGangDefinition creates a SegmentDatabaseDescriptor for every segment in the gang; it can be thought of as an object representing a segment database.
- cdbconn_doConnectStart connects to the database represented by each SegmentDatabaseDescriptor, i.e. the database on each segment; from the earlier analysis we know that in the current scenario all segments are connected. Each connection spawns a QE process; for the QE startup flow see: greenplum-QD&QE启动流程.
From this we can derive the network topology of the current test scenario:
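A rough sketch of that topology, based on the host and port information in the logs above:

                 master (QD, dispatch process)
                   /                      \
            libpq / gang              libpq / gang
                 /                          \
   QE on seg0 (slice1)             QE on seg1 (slice1)
   192.168.106.132:7000            192.168.106.133:7000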
So a slice can be seen as the data structure that manages a gang, while a gang is the data structure that manages the work of the distributed processes.