Related concepts:
- **Motion:** besides the usual database operations (table scans, joins, and so on), Greenplum has an additional operator type called a motion. A motion moves tuples between segments.
- **Slice:** to achieve maximum parallelism during query execution, Greenplum divides the work of a query plan into slices. A slice is a portion of the plan that can be processed independently. The plan generates a slice for each motion, with one slice on each side of the motion.
- **Gang:** the processes that belong to the same slice but run on different segments are called a gang.
Test environment:
- Greenplum version: 6
- Cluster: a single master with no standby master, and two primary segments with no mirrors.
Test log:
test=# select * from test;
DEBUG1: 00000: Message type Q received by from libpq, len = 20
DEBUG3: 00000: StartTransactionCommand
DEBUG3: 00000: StartTransaction
LOG: 00000: statement: select * from test;
LOCATION: exec_simple_query, postgres.c:1639
[OPT]: Using default search strategy
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..431.00 rows=1 width=8)
-> Seq Scan on test (cost=0.00..431.00 rows=1 width=8)
DEBUG1: 00000: GPORCA produced plan
LOG: 00000: plan:
DETAIL: {PLANNEDSTMT
:commandType 1
:planTree
{MOTION
:motionID 1
:motionType 1
:nMotionNodes 1
:nInitPlans 0
:lefttree
{SEQSCAN
:flow
{FLOW
:flotype 0
}
:nMotionNodes 0
:nInitPlans 0
}
}
:rtable (
{RTE
:eref
{ALIAS
:aliasname test
:colnames ("id" "age")
}
}
)
:utilityStmt <>
:subplans <>
}
Slice 1 on seg0
DEBUG1: 00000: Message type M received by from libpq, len = 457 (seg0 slice1 192.168.106.132:7000 pid=43071)
DEBUG3: 00000: StartTransactionCommand (seg0 slice1 192.168.106.132:7000 pid=43071)
DEBUG3: 00000: StartTransaction (seg0 slice1 192.168.106.132:7000 pid=43071)
DEBUG3: 00000: CommitTransactionCommand (seg0 slice1 192.168.106.132:7000 pid=43071)
DEBUG3: 00000: CommitTransaction (seg0 slice1 192.168.106.132:7000 pid=43071)
Slice 1 on seg1
DEBUG1: 00000: Message type M received by from libpq, len = 457 (seg1 slice1 192.168.106.133:7000 pid=43218)
DEBUG3: 00000: StartTransactionCommand (seg1 slice1 192.168.106.133:7000 pid=43218)
DEBUG3: 00000: StartTransaction (seg1 slice1 192.168.106.133:7000 pid=43218)
DEBUG3: 00000: CommitTransactionCommand (seg1 slice1 192.168.106.133:7000 pid=43218)
DEBUG3: 00000: CommitTransaction (seg1 slice1 192.168.106.133:7000 pid=43218)
master
DEBUG3: 00000: CommitTransactionCommand
DEBUG3: 00000: CommitTransaction
The log above has been cleaned up: lines that are irrelevant, or that we do not care about here, have been removed.
It is the debug output of executing select * from test;, and it shows the following:
- (1) The execution plan:
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..431.00 rows=1 width=8)
-> Seq Scan on test (cost=0.00..431.00 rows=1 width=8)
- (2) The plan tree:
The log shows how the plan text maps onto the plan tree: Gather Motion --> MOTION, Seq Scan --> SEQSCAN.
- (3) Slice0: the root slice, which runs on the master. Its gang type is GANGTYPE_UNALLOCATED (/* a root slice executed by the qDisp */ in the GangType enum), the type used for the slice on the root node (the master).
- (4) Slice1: runs on seg0 and seg1, as shown in the log above.
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..431.00 rows=1 width=8)
Code analysis:
void
PostgresMain(int argc, char *argv[], const char *dbname, const char *username)
{
......
for (;;)
{
......
switch (firstchar)
{
case 'Q': /* simple query */
{
......
else
exec_simple_query(query_string);
send_ready_for_query = true;
}
break;
}
}
}
Log excerpt:
DEBUG1: 00000: Message type Q received by from libpq, len = 20
......
LOCATION: exec_simple_query, postgres.c:1639
- From the log and the code above we can see that this is the entry point where the master executes SQL: when the first character of the message is 'Q', a statement is to be executed, and exec_simple_query is the entry function for running the SQL string.
- We skip SQL parsing and plan generation here and jump straight to slice initialization and the steps after it, since the focus of this article is the slices and gangs of a parallel plan.
Slice-related structures:
/*
 * Slice 0 is the root slice of plan as a whole.
 * Slices 1 through nMotion are motion slices with a sending motion at
 * the root of the slice.
 * Slices nMotion+1 and on are root slices of initPlans.
 */
typedef struct SliceTable
{
NodeTag type;
int nMotions; /* The number Motion nodes in the entire plan */
int nInitPlans; /* The number of initplan slices allocated */
int localSlice; /* Index of the slice to execute. */
List *slices; /* List of slices */
int instrument_options; /* OR of InstrumentOption flags */
uint32 ic_instance_id;
} SliceTable;
As the comment explains, slices fall into three categories:
- the root slice, which lives on the master and has index 0;
- motion slices;
- initPlan slices.
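Putting together the fields that the rest of this article touches, a simplified sketch of the Slice node itself looks roughly like this (reconstructed from the code excerpts below; the real Greenplum 6 definition contains more members):

typedef struct Slice
{
	NodeTag		type;
	int			sliceIndex;			/* index of this slice in SliceTable->slices */
	int			rootIndex;			/* index of the root slice of this slice's tree */
	int			parentIndex;		/* index of the parent (receiving) slice */
	List	   *children;			/* child slice indexes (list of int) */
	GangType	gangType;			/* GANGTYPE_UNALLOCATED, GANGTYPE_PRIMARY_READER, ... */
	List	   *segments;			/* segments this slice is dispatched to */
	struct Gang *primaryGang;		/* gang allocated for this slice (NULL for the QD slice) */
	List	   *primaryProcesses;	/* CdbProcess list; for slice0 this is the QD itself */
} Slice;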
Segment configuration table:
template1=# select * from gp_segment_configuration;
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+------+----------+---------+-----------------------
1 | -1 | p | p | n | u | 5432 | mdw | mdw | /data/master/gpseg-1
2 | 0 | p | p | n | u | 7000 | sdw1 | sdw1 | /data1/primary/gpseg0
3 | 1 | p | p | n | u | 7000 | sdw2 | sdw2 | /data1/primary/gpseg1
(3 rows)
This table shows that the master's row has content = -1, which matches MASTER_CONTENT_ID in gp_segment_config.h. The master node initializes GpIdentity.segindex = MASTER_CONTENT_ID, so whenever segindex is -1 we can conclude that we are on the master node.
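As a small illustration of that convention, a check like the following (a hypothetical helper, not a function from the Greenplum source) is all that is needed to decide whether the current process runs on the master:

/* Hypothetical helper: the master (QD) initializes GpIdentity.segindex to
 * MASTER_CONTENT_ID (-1), so a segindex of -1 identifies the master node. */
static inline bool
running_on_master(void)
{
	return GpIdentity.segindex == MASTER_CONTENT_ID;
}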
Call stack for slice creation:
void
InitSliceTable(EState *estate, int nMotions, int nSubplans)
{
SliceTable *table;
Slice *slice;
int i,
n;
MemoryContext oldcontext;
n = 1 + nMotions + nSubplans;
table = makeNode(SliceTable);
table->nMotions = nMotions;
table->nInitPlans = nSubplans;
......
for (i = 0; i < n; i++)
{
slice = makeNode(Slice);
slice->sliceIndex = i;
......
slice->gangType = GANGTYPE_UNALLOCATED;
......
table->slices = lappend(table->slices, slice);
}
estate->es_sliceTable = table;
......
}
The log shows:
:nMotionNodes 1
:nInitPlans 0
So InitSliceTable creates two slices, with indexes 0 and 1, corresponding to slice0 and slice1.
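So, right after InitSliceTable returns, the slice table should look roughly like this (a sketch of the state, not actual program output):

/*
 * SliceTable after InitSliceTable(estate, nMotions = 1, nSubplans = 0):
 *   nMotions   = 1
 *   nInitPlans = 0
 *   slices[0]: sliceIndex = 0, gangType = GANGTYPE_UNALLOCATED   (root slice)
 *   slices[1]: sliceIndex = 1, gangType = GANGTYPE_UNALLOCATED   (refined later by FillSliceTable)
 */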
PlannedStmt-related log information:
DETAIL: {PLANNEDSTMT
......
:intoClause <>
:copyIntoClause <>
:refreshClause <>
......
}
Call stack for slice initialization:
static void
FillSliceTable(EState *estate, PlannedStmt *stmt)
{
FillSliceTable_cxt cxt;
SliceTable *sliceTable = estate->es_sliceTable;
if (!sliceTable)
return;
cxt.prefix.node = (Node *) stmt;
cxt.estate = estate;
cxt.currentSliceId = 0;
if (stmt->intoClause != NULL || stmt->copyIntoClause != NULL || stmt->refreshClause)
{
......
}
/*
* NOTE: We depend on plan_tree_walker() to recurse into subplans of
* SubPlan nodes.
*/
FillSliceTable_walker((Node *) stmt->planTree, &cxt);
}
The log above shows that the condition if (stmt->intoClause != NULL || stmt->copyIntoClause != NULL || stmt->refreshClause) is not satisfied (all three clauses are <> in the log), so we focus on the function below.
/* ModifyTable node - apply rows produced by subplan(s) to result table(s) */
typedef struct ModifyTable
{
......
CmdType operation; /* INSERT, UPDATE, or DELETE */
......
} ModifyTable;
As the comment indicates, this struct represents an operation that modifies a table.
static bool FillSliceTable_walker(Node *node, void *context)
{
if (IsA(node, ModifyTable))
{
......
}
/* A DML node is the same as a ModifyTable node, in ORCA plans. */
if (IsA(node, DML))
{
......
}
if (IsA(node, Motion))
{
......
/* Top node of subplan should have a Flow node. */
Insist(motion->plan.lefttree && motion->plan.lefttree->flow);
sendFlow = motion->plan.lefttree->flow;
/* Look up the sending gang's slice table entry. */
sendSlice = (Slice *) list_nth(sliceTable->slices, motion->motionID);
/* Look up the receiving (parent) gang's slice table entry. */
recvSlice = (Slice *)list_nth(sliceTable->slices, parentSliceIndex);
/* Sending slice become a children of recv slice */
recvSlice->children = lappend_int(recvSlice->children, sendSlice->sliceIndex);
sendSlice->parentIndex = parentSliceIndex;
sendSlice->rootIndex = recvSlice->rootIndex;
/* The gang beneath a Motion will be a reader. */
sendSlice->gangType = GANGTYPE_PRIMARY_READER;
if (sendFlow->flotype != FLOW_SINGLETON) /* log shows :flotype 0; FLOW_SINGLETON is 1, so this branch is taken */
{
sendSlice->gangType = GANGTYPE_PRIMARY_READER;
/*
* If the PLAN is generated by ORCA, We assume that they
* distpatch on all segments.
*/
if (stmt->planGen == PLANGEN_PLANNER) /* log shows :planGen 1; PLANGEN_PLANNER is 0, so the else branch below is taken */
FillSliceGangInfo(sendSlice, sendFlow->numsegments);
else
FillSliceGangInfo(sendSlice, getgpsegmentCount());
}
else
{
......
}
......
/* recurse into children */
cxt->currentSliceId = motion->motionID;
result = plan_tree_walker(node, FillSliceTable_walker, cxt);
cxt->currentSliceId = parentSliceIndex;
return result;
}
if (IsA(node, SubPlan))
{
......
}
return plan_tree_walker(node, FillSliceTable_walker, cxt);
}
Related structures:
typedef enum FlowType
{
FLOW_UNDEFINED, /* used prior to calculation of type of derived flow */
FLOW_SINGLETON, /* flow has single stream */
FLOW_REPLICATED, /* flow is replicated across IOPs */
FLOW_PARTITIONED, /* flow is partitioned across IOPs */
} FlowType;
typedef enum PlanGenerator
{
PLANGEN_PLANNER, /* plan produced by the planner*/
PLANGEN_OPTIMIZER, /* plan produced by the optimizer*/
} PlanGenerator;
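As a reading aid, the integer values in the log excerpt below map back to these enums as follows (assuming the usual zero-based C enum numbering):

/*
 * :flotype 0  ->  FLOW_UNDEFINED     (not FLOW_SINGLETON, which is 1,
 *                                     so the "not a singleton flow" branch is taken)
 * :planGen 1  ->  PLANGEN_OPTIMIZER  (not PLANGEN_PLANNER, which is 0,
 *                                     so FillSliceGangInfo(sendSlice, getgpsegmentCount()) is used)
 */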
Related log information:
DETAIL: {PLANNEDSTMT
:commandType 1
:planGen 1
:planTree
{MOTION
:motionID 1
:nMotionNodes 1
:nInitPlans 0
:lefttree
{SEQSCAN......}
:flow
{FLOW
:flotype 0
:req_move 0
:locustype 0
:segindex 0
:numsegments 1
:hashExprs <>
:hashOpfamilies <>
:flow_before_req_move <>
}
}
:rtable (
{RTE
:eref
{ALIAS
:aliasname test
:colnames ("id" "age")
}
}
)
:utilityStmt <>
:subplans <>
}
FillSliceTable_walker has four branches:
- if (IsA(node, ModifyTable))
- if (IsA(node, DML))
- if (IsA(node, Motion))
- if (IsA(node, SubPlan))
FillSliceTable_walker is called by FillSliceTable as follows:
FillSliceTable_walker((Node *) stmt->planTree, &cxt);
- As we can see, stmt is a PlannedStmt object, corresponding to the PLANNEDSTMT keyword in the log, and in the log its planTree is a Motion (MOTION). The flow therefore takes the Motion branch.
The third branch (Motion):
- FillSliceTable_walker is called by FillSliceTable, which sets cxt.currentSliceId = 0; so inside FillSliceTable_walker, int parentSliceIndex = cxt->currentSliceId; makes parentSliceIndex 0.
- The log shows that motion->motionID is 1 (:motionID 1), so sendSlice is slice1 from the log and recvSlice is slice0.
This branch does four things:
- it sets recvSlice to slice0 and appends sendSlice (slice1) as its child; from the earlier logic we know slice0's gang type stays GANGTYPE_UNALLOCATED;
- it sets slice1's gang type to GANGTYPE_PRIMARY_READER;
- it arranges for slice1 to be dispatched to all segments (FillSliceGangInfo with getgpsegmentCount(), since the plan was generated by ORCA);
- it calls plan_tree_walker recursively on slice1's subtree (details to be covered separately).
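Putting this together, after FillSliceTable has run, the slice table for select * from test should look roughly like this (a sketch, not actual output):

/*
 * slice0: sliceIndex = 0, gangType = GANGTYPE_UNALLOCATED,     children = [1]   (runs on the QD)
 * slice1: sliceIndex = 1, gangType = GANGTYPE_PRIMARY_READER,  parentIndex = 0, children = [],
 *         dispatched to all primary segments (seg0 and seg1 in this cluster)
 */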
Context variable initialization:
static int
BackendStartup(Port *port)
{
pid = fork_process();
if (pid == 0) /* child */
{
......
MyProcPid = getpid(); /* reset MyProcPid */
......
}
}
So MyProcPid holds the pid of the current, forked, backend child process.
Gang creation call stack:
Code:
void
AssignGangs(CdbDispatcherState *ds, QueryDesc *queryDesc)
{
......
InventorySliceTree(ds, sliceTable->slices, rootIdx);
}
This function delegates the actual work to InventorySliceTree.
void
InventorySliceTree(CdbDispatcherState *ds, List *slices, int sliceIndex)
{
ListCell *cell;
int childIndex;
Slice *slice = list_nth(slices, sliceIndex);
if (slice->gangType == GANGTYPE_UNALLOCATED)
{
slice->primaryGang = NULL;
slice->primaryProcesses = getCdbProcessesForQD(true);
}
else
{
Assert(slice->segments != NIL);
slice->primaryGang = AllocateGang(ds, slice->gangType, slice->segments);
setupCdbProcessList(slice);
}
foreach(cell, slice->children)
{
childIndex = lfirst_int(cell);
InventorySliceTree(ds, slices, childIndex);
}
}
- From the earlier analysis, slice0's gangType is GANGTYPE_UNALLOCATED, so only slice0's primaryProcesses is set, while slice1 takes the else branch. Finally, the function walks the current slice's children and calls InventorySliceTree recursively: for slice0 the foreach recurses into its only child, slice1; slice1 itself has no children, so its foreach does nothing (see the sketch below).
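A sketch of the resulting call sequence for our two-slice plan:

/*
 * InventorySliceTree(ds, slices, 0)                 -- slice0: GANGTYPE_UNALLOCATED
 *   primaryGang      = NULL
 *   primaryProcesses = getCdbProcessesForQD(true)   -- the QD itself, no gang
 *   foreach child of slice0 (children = [1]):
 *     InventorySliceTree(ds, slices, 1)             -- slice1: GANGTYPE_PRIMARY_READER
 *       primaryGang = AllocateGang(ds, GANGTYPE_PRIMARY_READER, slice1->segments)
 *       setupCdbProcessList(slice1)
 *       foreach child of slice1: (none)
 */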
Handling of slice0:
/*
* getCdbProcessForQD: Manufacture a CdbProcess representing the QD,
* as if it were a worker from the executor factory.
*
* NOTE: Does not support multiple (mirrored) QDs.
*/
List *
getCdbProcessesForQD(int isPrimary)
{
CdbComponentDatabaseInfo *qdinfo;
CdbProcess *proc;
Assert(Gp_role == GP_ROLE_DISPATCH);
qdinfo = cdbcomponent_getComponentInfo(MASTER_CONTENT_ID);
proc = makeNode(CdbProcess);
......
proc->pid = MyProcPid;
......
list = lappend(list, proc);
return list;
}
- From the implementation above we can see that primaryProcesses is set to the current process. In other words, this function assigns the process that executes the slice on the master, which is simply the current dispatch process. (For a description of Gp_role, see my other article: greenplum-QD&QE启动流程.) So no gang is allocated for slice0.
Handling of slice1:
/*
* Creates a new gang by logging on a session to each segDB involved.
*
* elog ERROR or return a non-NULL gang.
*/
Gang *
AllocateGang(CdbDispatcherState *ds, GangType type, List *segments)
{
MemoryContext oldContext;
SegmentType segmentType;
Gang *newGang = NULL;
int i;
......
if (Gp_role != GP_ROLE_DISPATCH)
{
elog(FATAL, "dispatch process called with role %d", Gp_role);
}
if (type == GANGTYPE_PRIMARY_WRITER)
segmentType = SEGMENTTYPE_EXPLICT_WRITER;
/* for extended query like cursor, must specify a reader */
else if (ds->isExtendedQuery)
segmentType = SEGMENTTYPE_EXPLICT_READER;
else
segmentType = SEGMENTTYPE_ANY;
......
newGang = cdbgang_createGang(segments, segmentType);
newGang->allocated = true;
newGang->type = type;
/*
* Push to the head of the allocated list, later in
* cdbdisp_destroyDispatcherState() we should recycle them from the head to
* restore the original order of the idle gangs.
*/
ds->allocatedGangs = lcons(newGang, ds->allocatedGangs);
ds->largestGangSize = Max(ds->largestGangSize, newGang->size);
if (type == GANGTYPE_PRIMARY_WRITER)
{
/*
* set "whoami" for utility statement. non-utility statement will
* overwrite it in function getCdbProcessList.
*/
for (i = 0; i < newGang->size; i++)
cdbconn_setQEIdentifier(newGang->db_descriptors[i], -1);
}
return newGang;
}
From the earlier logic we know slice1's type is GANGTYPE_PRIMARY_READER; it is not a writer gang and this is not an extended query, so segmentType here is SEGMENTTYPE_ANY.
Call chain for creating the libpq connections:
Code:
Gang *
cdbgang_createGang_async(List *segments, SegmentType segmentType)
{
Gang *newGangDefinition;
newGangDefinition = NULL;
/* allocate and initialize a gang structure */
......
newGangDefinition = buildGangDefinition(segments, segmentType);
CurrentGangCreating = newGangDefinition;
totalSegs = getgpsegmentCount();
size = list_length(segments);
......
PG_TRY();
{
for (i = 0; i < size; i++)
{
......
segdbDesc = newGangDefinition->db_descriptors[i];
ret = build_gpqeid_param(gpqeid, sizeof(gpqeid),
segdbDesc->isWriter,
segdbDesc->identifier,
segdbDesc->segment_database_info->hostSegs,
totalSegs * 2);
......
cdbconn_doConnectStart(segdbDesc, gpqeid, options);
pollingStatus[i] = PGRES_POLLING_WRITING;
}
for (;;)
{......}
......
return newGangDefinition;
}
Here we skip the details of the network connection and handshake and focus on the gang-related parts:
- buildGangDefinition creates a SegmentDatabaseDescriptor for every segment in the gang; it can be thought of as an object representing a segment database.
- cdbconn_doConnectStart connects to the database represented by each SegmentDatabaseDescriptor, i.e. the database on each segment; from the earlier analysis we know that in the current scenario all segments are connected. Each connection spawns a QE process; for the QE startup flow see: greenplum-QD&QE启动流程.
From this we can derive the network topology of the current test scenario:
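A rough sketch of that topology, based on the host and port information in the logs above:

                 master (QD, dispatch process)
                   /                      \
            libpq / gang              libpq / gang
                 /                          \
   QE on seg0 (slice1)             QE on seg1 (slice1)
   192.168.106.132:7000            192.168.106.133:7000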
So a slice can be seen as the data structure that manages a gang, while a gang is the data structure that manages the work of the distributed processes.