-
问题:矩阵相乘一系列优化。下面我们以
c[2000][2000] = a[2000][2000] * b[2000][2000]
为例来进行优化。 -
环境:CPU intel 9600k 6核心6线程 3.7GHz
多线程
To optimize the performance of this program, we have to use the advantage of multi-cores. We will create a thread for each row in a matrix that does the multiplication in parellel and reduce the processing time.
Let us create a Thread class that implements Runnable Interface.
RowMultiplyWorker.class
:
public class RowMultiplyWorker implements Runnable /* extends Thread */ {
public RowMultiplyWorker(int[][] result, int[][] matrix1, int[][] matrix2, int row) {
this.result = result;
this.matrix1 = matrix1;
this.matrix2 = matrix2;
this.row = row;
}
@Override
public void run() {
for (int i = 0; i < matrix2[0].length; i++) {
result[row][i] = 0;
for (int j = 0; j < matrix1[row].length; j++) {
result[row][i] += matrix1[row][j] * matrix2[j][i];
}
}
}
private int[][] matrix1;
private int[][] matrix2;
private final int[][] result;
private final int row;
}
Next, create a class to create 10 threads at a time because if we create 2000 threads for 2000 x 2000 matrix then the application gets to hang up. So we will be using the 10 threads as a group and let them complete then again initiate the next 10 threads untill complete each row multiplication.
ParallelThreadCreator
:
import java.util.List;
import java.util.ArrayList;
public class ParallelThreadCreator {
// creating 10 threads and waiting for them to complete then again repeat steps.
public static void multiply(int[][] matrix1, int[][] matrix2, int[][] result) {
List<Thread> threads = new ArrayList<>();
int rows1 = matrix1.length;
for (int i = 0; i < rows1; i++) {
// 方式一:直接创建线程(RowMultiplyWorker implements Runnable)
RowMultiplyWorker task = new RowMultiplyWorker(result, matrix1, matrix2, i);
Thread thread = new Thread(task);
thread.start();
threads.add(thread);
// // 方式二:以子类的方式创建线程(RowMultiplyWorker extends Thread)
// Thread thread = new RowMultiplyWorker(result, matrix1, matrix2, i);
// thread.start();
// threads.add(thread);
if (threads.size() % 10 == 0) {
waitForThreads(threads);
}
}
}
private static void waitForThreads(List<Thread> threads) {
for (Thread thread : threads) {
try {
thread.join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
threads.clear();
}
}
Let us create the main class to test the time taking using this approach.
import java.util.Random;
public class Main {
public static void main(String[] args) {
// 初始化a,b数组
init();
// 无任何优化
int t1 = get1();
System.out.println("单线程串行:" + t1 + "毫秒,约" + t1 / 1000 + "秒");
System.out.println("==================================================");
// 利用 Cache:改变循环的嵌套顺序
int t2 = get2();
System.out.println("单线程串行(利用 Cache:改变循环的嵌套顺序):" + t2 + "毫秒,约" + t2 / 1000 + "秒");
// 检查get2()计算的d数组,结果是否正确
if (check(c, d)) System.out.println("正确");
else System.out.println("错误");
System.out.println("==================================================");
// 多线程
long start = System.currentTimeMillis();
ParallelThreadCreator.multiply(a, b, e);
long end = System.currentTimeMillis();
int t3 = (int) (end - start);
System.out.println("多线程并行:" + t3 + "毫秒,约" + t3 / 1000 + "秒");
// 检查多线程计算的e数组,结果是否正确
if (check(c, e)) System.out.println("正确");
else System.out.println("错误");
}
// 返回一次矩阵乘法运算所需时间(ms)
private static int get1() {
long start = System.currentTimeMillis();
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
c[i][j] += a[i][k] * b[k][j];
long end = System.currentTimeMillis();
return (int) (end - start);
}
// 返回一次矩阵乘法运算所需时间(ms)无任何优化
private static int get2() {
long start = System.currentTimeMillis();
for (int i = 0; i < N; i++)
for (int k = 0; k < N; k++)
for (int j = 0; j < N; j++)
d[i][j] += a[i][k] * b[k][j];
long end = System.currentTimeMillis();
return (int) (end - start);
}
private static void init() {
// a,b矩阵赋值 [1, 100] 范围内的随机数
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++) {
// 元素值为[1, 100]的随机数
a[i][j] = rd.nextInt(seed) + 1;
b[i][j] = rd.nextInt(seed) + 1;
}
}
private static boolean check(int[][] matrix1, int[][] matrix2) {
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
if (matrix1[i][j] != matrix2[i][j])
return false;
return true;
}
private static final int N = 2000, seed = 100;
private static int[][] a = new int[N][N];
private static int[][] b = new int[N][N];
private static int[][] c = new int[N][N]; // 无任何优化计算的结果
private static int[][] d = new int[N][N]; // 利用 Cache:改变循环的嵌套顺序计算的结果
private static int[][] e = new int[N][N]; // 多线程计算的结果
private static Random rd = new Random();
}
In this article, we have seen how to build the concurrent application for Matrix Multiplication. Now compare the time taken for 2000 x 2000 sized matrix is 65 seconds in the normal program and when we applied the multiple threads then it took only 17 seconds to complete this.
Still we can improve parallel version using Runtime.getRuntime().availableProcessors(). This returns the threads available to use and this count can be instead of 10 threads.
So, It is better to use the multiple threads efficiently for large scale applications.
实验结果
因为我们是把N
行分组,然后把该组交由多线程计算,所以组内行数必须能整除N
。否则最后会有几行没有计算。
-
当我们设置矩阵维度
N = 2000
,组内行数(线程数量)设置为20
时,结果如下:单线程串行:48698毫秒,约48秒 ================================================== 单线程串行(利用 Cache:改变循环的嵌套顺序):5859毫秒,约5秒 正确 ================================================== 多线程并行:8210毫秒,约8秒 正确
-
当我们设置矩阵维度
N = 2000
,组内行数(线程数量)设置为10
时,结果如下:单线程串行:48904毫秒,约48秒 ================================================== 单线程串行(利用 Cache:改变循环的嵌套顺序):5864毫秒,约5秒 正确 ================================================== 多线程并行:8887毫秒,约8秒 正确
-
当我们设置矩阵维度
N = 2000
,组内行数(线程数量)设置为8
时,结果如下:单线程串行:48263毫秒,约48秒 ================================================== 单线程串行(利用 Cache:改变循环的嵌套顺序):5871毫秒,约5秒 正确 ================================================== 多线程并行:8934毫秒,约8秒 正确
-
当我们设置矩阵维度
N = 2000
,组内行数(线程数量)设置为4
时,结果如下:单线程串行:49045毫秒,约49秒 ================================================== 单线程串行(利用 Cache:改变循环的嵌套顺序):6003毫秒,约6秒 正确 ================================================== 多线程并行:11993毫秒,约11秒 正确
-
当我们设置矩阵维度
N = 2000
,组内行数设置为2
时,结果如下:单线程串行:50119毫秒,约50秒 ================================================== 单线程串行(利用 Cache:改变循环的嵌套顺序):5861毫秒,约5秒 正确 ================================================== 多线程并行:24061毫秒,约24秒 正确
-
当我们设置矩阵维度
N = 2000
,组内行数(线程数量)设置为1
时,结果如下:单线程串行:51972毫秒,约51秒 ================================================== 单线程串行(利用 Cache:改变循环的嵌套顺序):6111毫秒,约6秒 正确 ================================================== 多线程并行:49762毫秒,约49秒 正确
Preference
-
[What is the difference between concurrency and parallelism?](https://*.com/questions/1050222/what-is-the-difference-between-concurrency-and-parallelism)
-
Matrix Multiplication with Java Threads - Optimized Code (Parallel)
-
Languages and Compilers for Parallel Computing: 11th International
-
[Any suggestions for a program or small project to learn about concurrency in Java?