I have been writing some code using the PETSc library and now I want to change part of it to run in parallel. Most of the things I want to parallelize are the matrix initialization and the parts where I generate and compute a large number of values. My problem is that, for some reason, every part of the code is run as many times as the number of cores I use.

This is just a simple example code with which I am testing PETSc and MPI:
#include <petscksp.h>
#include <ctime>
#include <iostream>
#include <string>

using namespace std;

int main(int argc, char** argv)
{
    time_t rawtime;
    time(&rawtime);
    string sta = ctime(&rawtime);
    cout << "Solving began..." << endl;

    PetscInitialize(&argc, &argv, 0, 0);

    Mat            A;        /* linear system matrix */
    PetscInt       i,j,Ii,J,Istart,Iend,m = 120000,n = 3,its;
    PetscErrorCode ierr;
    PetscBool      flg = PETSC_FALSE;
    PetscScalar    v;
#if defined(PETSC_USE_LOG)
    PetscLogStage  stage;
#endif

    /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
         Compute the matrix and right-hand-side vector that define
         the linear system, Ax = b.
       - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
    /*
       Create parallel matrix, specifying only its global dimensions.
       When using MatCreate(), the matrix format can be specified at
       runtime. Also, the parallel partitioning of the matrix is
       determined by PETSc at runtime.

       Performance tuning note: For problems of substantial size,
       preallocation of matrix memory is crucial for attaining good
       performance. See the matrix chapter of the users manual for details.
    */
    ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
    ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m,n);CHKERRQ(ierr);
    ierr = MatSetFromOptions(A);CHKERRQ(ierr);
    ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
    ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
    ierr = MatSetUp(A);CHKERRQ(ierr);

    /*
       Currently, all PETSc parallel matrix formats are partitioned by
       contiguous chunks of rows across the processors. Determine which
       rows of the matrix are locally owned.
    */
    ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);

    /*
       Set matrix elements for the 2-D, five-point stencil in parallel.
        - Each processor needs to insert only elements that it owns
          locally (but any non-local elements will be sent to the
          appropriate processor during matrix assembly).
        - Always specify global rows and columns of matrix entries.

       Note: this uses the less common natural ordering that orders first
       all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
       instead of J = I +- m as you might expect. The more standard ordering
       would first do all variables for y = h, then y = 2h etc.
    */
    PetscMPIInt rank;   // processor rank
    PetscMPIInt size;   // size of communicator
    MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
    MPI_Comm_size(PETSC_COMM_WORLD,&size);

    cout << "Rank = " << rank << endl;
    cout << "Size = " << size << endl;

    cout << "Generating 2D-Array" << endl;
    double temp2D[120000][3];
    for (Ii=Istart; Ii<Iend; Ii++) {
        for (J=0; J<n; J++) {
            temp2D[Ii][J] = 1;
        }
    }
    cout << "Processor " << rank << " set values : " << Istart << " - " << Iend << " into 2D-Array" << endl;

    v = -1.0;
    for (Ii=Istart; Ii<Iend; Ii++) {
        for (J=0; J<n; J++) {
            ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
        }
    }
    cout << "Ii = " << Ii << " processor " << rank << " and it owns: " << Istart << " - " << Iend << endl;

    /*
       Assemble matrix, using the 2-step process:
         MatAssemblyBegin(), MatAssemblyEnd()
       Computations can be done while messages are in transition
       by placing code between these two statements.
    */
    ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    PetscFinalize();
    cout << "No more MPI" << endl;
    return 0;
}
My real program has several different .cpp files. I initialize MPI in the main program, which calls a function in another .cpp file where I have implemented the same kind of matrix filling, but all the couts the program does before filling the matrix are printed as many times as the number of cores I use.

I can run my test program with mpiexec -n 4 test and it runs successfully, but for some reason I have to run my real program as mpiexec -n 4 ./myprog.

The output of my test program is as follows:
Solving began...
Solving began...
Solving began...
Solving began...
Rank = 0
Size = 4
Generating 2D-Array
Processor 0 set values : 0 - 30000 into 2D-Array
Rank = 2
Size = 4
Generating 2D-Array
Processor 2 set values : 60000 - 90000 into 2D-Array
Rank = 3
Size = 4
Generating 2D-Array
Processor 3 set values : 90000 - 120000 into 2D-Array
Rank = 1
Size = 4
Generating 2D-Array
Processor 1 set values : 30000 - 60000 into 2D-Array
Ii = 30000 processor 0 and it owns: 0 - 30000
Ii = 90000 processor 2 and it owns: 60000 - 90000
Ii = 120000 processor 3 and it owns: 90000 - 120000
Ii = 60000 processor 1 and it owns: 30000 - 60000
no more MPI
no more MPI
no more MPI
no more MPI
Edit after two comments:

My goal is to run this on a small cluster with 20 nodes and 2 cores per node. Later it should run on a supercomputer, so MPI is definitely the way I need to go. I am currently testing it on two different machines, one of which has 1 processor / 4 cores and the second 4 processors / 16 cores.
Solution:
MPI is an implementation of the SPMD/MPMD model (Single Program Multiple Data / Multiple Programs Multiple Data). An MPI job consists of processes running simultaneously that exchange messages with one another in order to cooperate on solving a problem. You cannot run only parts of the code in parallel. You can only have parts of the code that do not communicate with each other but still execute concurrently. You should use mpirun or mpiexec to start your application in parallel mode.
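Because of that model, every rank executes the whole of main(), which is why startup messages such as "Solving began..." appear once per process. A minimal sketch of the usual idiom for printing something only once, assuming the same PETSc-style setup as in the question (the message text is only illustrative):

#include <petscsys.h>
#include <iostream>

int main(int argc, char** argv)
{
    PetscInitialize(&argc, &argv, 0, 0);

    PetscMPIInt rank;
    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

    // Every process reaches this point, so guard output that should appear only once.
    if (rank == 0) {
        std::cout << "Solving began..." << std::endl;
    }

    // Equivalent PETSc idiom: PetscPrintf prints only from the first rank of the communicator.
    PetscPrintf(PETSC_COMM_WORLD, "Solving began...\n");

    PetscFinalize();
    return 0;
}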
If you only want to make part of the code parallel and can live with the limitation of running the code on a single machine only, then what you need is OpenMP rather than MPI. Alternatively, you could use low-level POSIX threads programming; according to the PETSc website, PETSc supports pthreads. Since OpenMP is built on top of pthreads, using PETSc together with OpenMP might be possible.
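To illustrate the difference, here is a minimal OpenMP sketch, unrelated to the PETSc API: the whole program runs as a single process, and only the marked loop is split across threads on one machine. The loop body is a made-up stand-in for the value-generation part of the question.

#include <omp.h>
#include <iostream>
#include <vector>

int main()
{
    const int n = 120000;
    std::vector<double> values(n);

    // Only this loop runs in parallel; the rest of the program stays serial.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        values[i] = 1.0;   // placeholder for the expensive value computation
    }

    std::cout << "Computed " << values.size() << " values, "
              << omp_get_max_threads() << " threads available" << std::endl;
    return 0;
}

Built with an OpenMP-enabled compiler (for example g++ -fopenmp), the couts outside the parallel region execute only once, which is the behaviour the question expects.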