Eigen library with C++11 multithreading


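A piece of code computes such a model repeatedly for a number of trials Ntrials (each one independent of the others, but using the same input data) in order to finally pick the best solution (the one that maximizes the total likelihood of the model). This concept can be generalized to many other clustering algorithms (e.g. k-means).

I would like to parallelize, with C++11 multithreading, the part of the code that has to be repeated Ntrials times, such that each thread executes one trial.

A code example, assuming an input Eigen::ArrayXXd sample of size (Ndimensions x Npoints), could look like this: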

    double bestTotalModelProbability = 0;
    Eigen::ArrayXd clusterIndicesFromSample(Npoints);

    for (int i = 0; i < Ntrials; i++) {
        double totalModelProbability = computeGaussianMixtureModel(sample);

        // Check if this trial is better than the best one so far.
        // If so, update the results (cluster index for each point
        // in the sample) and keep them.

        if (totalModelProbability > bestTotalModelProbability) {
            bestTotalModelProbability = totalModelProbability;
            clusterIndicesFromSample = obtainClusterMembership(sample);
        }
    }

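where I pass sample by reference (Eigen::Ref), and not sample itself, to both functions computeGaussianMixtureModel() and obtainClusterMembership().

My code is heavily based on Eigen arrays, and the N-dimensional problems I deal with typically have 10-100 dimensions and 500-1000 different sample points. I have been looking for examples that create a multithreaded version of this code using Eigen arrays and C++11's std::thread, but I could not find anything, and I am struggling even to build a simple example that manipulates Eigen arrays.

I am not even sure that Eigen can be used within std::thread in C++11.
I am working on Mac OS X with clang as the compiler, on a CPU with 6 cores (12 threads).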



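I must admit that my experience with the Eigen library is very limited. (I once used it to decompose 3×3 rotation matrices into Euler angles, which the Eigen library solves very elegantly in a general way.)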





#include <cstdint>
#include <cstdlib>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <limits>
#include <thread>
#include <vector>

// a sample function to process a certain amount of data:
// counts the elements of data[0..n) that lie in [begin, end)
template <typename T>
size_t countFrequency(
  size_t n, const T data[], const T &begin, const T &end)
{
  size_t result = 0;
  for (size_t i = 0; i < n; ++i) result += data[i] >= begin && data[i] < end;
  return result;
}

typedef std::uint16_t Value;
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::microseconds MuSecs;
typedef decltype(std::chrono::duration_cast<MuSecs>(Clock::now() - Clock::now())) Time;

// elapsed time since t0 in microseconds
Time duration(const Clock::time_point &t0)
{
  return std::chrono::duration_cast<MuSecs>(Clock::now() - t0);
}

std::vector<Time> makeTest()
{
  const Value SizeGroup = 4, NGroups = 10000, N = SizeGroup * NGroups;
  // hardware_concurrency() may return 0 if the value cannot be determined
  const size_t NThreads = std::thread::hardware_concurrency()
    ? std::thread::hardware_concurrency() : 1;
  // make a test sample
  std::vector<Value> sample(N);
  for (Value &value : sample) value = (Value)rand();
  // prepare result vectors
  std::vector<size_t> results4[4] = {
    std::vector<size_t>(NGroups, 0),
    std::vector<size_t>(NGroups, 0),
    std::vector<size_t>(NGroups, 0),
    std::vector<size_t>(NGroups, 0)
  };
  // make test
  std::vector<Time> times{
    [&]() { // single threading
      // make a copy of test sample
      std::vector<Value> data(sample);
      std::vector<size_t> &results = results4[0];
      // remember start time
      const Clock::time_point t0 = Clock::now();
      // do experiment single-threaded
      for (size_t i = 0; i < NGroups; ++i) {
        results[i] = countFrequency(data.size(), data.data(),
          (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
      }
      // done
      return duration(t0);
    }(),
    [&]() { // multi-threading - naive approach (one thread per group)
      // make a copy of test sample
      std::vector<Value> data(sample);
      std::vector<size_t> &results = results4[1];
      // remember start time
      const Clock::time_point t0 = Clock::now();
      // do experiment multi-threaded
      std::vector<std::thread> threads(NThreads);
      for (Value i = 0; i < NGroups;) {
        size_t nT = 0;
        for (; nT < NThreads && i < NGroups; ++nT, ++i) {
          threads[nT] = std::thread(
            [i, &results, &data, SizeGroup]() {
              size_t result = countFrequency(data.size(), data.data(),
                (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
              results[i] = result;
            });
        }
        for (size_t iT = 0; iT < nT; ++iT) threads[iT].join();
      }
      // done
      return duration(t0);
    }(),
    [&]() { // multi-threading - interleaved
      // make a copy of test sample
      std::vector<Value> data(sample);
      std::vector<size_t> &results = results4[2];
      // remember start time
      const Clock::time_point t0 = Clock::now();
      // do experiment multi-threaded
      std::vector<std::thread> threads(NThreads);
      for (Value iT = 0; iT < NThreads; ++iT) {
        threads[iT] = std::thread(
          [iT, &results, &data, NGroups, SizeGroup, NThreads]() {
            for (Value i = iT; i < NGroups; i += NThreads) {
              size_t result = countFrequency(data.size(), data.data(),
                (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
              results[i] = result;
            }
          });
      }
      for (std::thread &threadI : threads) threadI.join();
      // done
      return duration(t0);
    }(),
    [&]() { // multi-threading - grouped
      // make a copy of test sample
      std::vector<Value> data(sample);
      std::vector<size_t> &results = results4[3];
      // remember start time
      const Clock::time_point t0 = Clock::now();
      // do experiment multi-threaded
      std::vector<std::thread> threads(NThreads);
      for (Value iT = 0; iT < NThreads; ++iT) {
        threads[iT] = std::thread(
          [iT, &results, &data, NGroups, SizeGroup, NThreads]() {
            for (Value i = iT * NGroups / NThreads,
              iN = (iT + 1) * NGroups / NThreads; i < iN; ++i) {
              size_t result = countFrequency(data.size(), data.data(),
                (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
              results[i] = result;
            }
          });
      }
      for (std::thread &threadI : threads) threadI.join();
      // done
      return duration(t0);
    }()
  };
  // check results (must be equal for any kind of computation)
  const unsigned nResults = sizeof results4 / sizeof *results4;
  for (unsigned i = 1; i < nResults; ++i) {
    size_t nErrors = 0;
    for (Value j = 0; j < NGroups; ++j) {
      if (results4[0][j] != results4[i][j]) {
        ++nErrors;
#ifdef _DEBUG
        std::cerr
          << "results4[0][" << j << "]: " << results4[0][j]
          << " != results4[" << i << "][" << j << "]: " << results4[i][j]
          << "!\n";
#endif // _DEBUG
      }
    }
    if (nErrors) std::cerr << nErrors << " errors in results4[" << i << "]!\n";
  }
  // done
  return times;
}

int main()
{
  std::cout << "std::thread::hardware_concurrency(): "
    << std::thread::hardware_concurrency() << '\n';
  // heat up
  std::cout << "Heat up...\n";
  for (unsigned i = 0; i < 3; ++i) makeTest();
  // repeat NTrials times
  const unsigned NTrials = 10;
  std::cout << "Measuring " << NTrials << " runs...\n"
    << "   "
    << " | " << std::setw(10) << "Single"
    << " | " << std::setw(10) << "Multi 1"
    << " | " << std::setw(10) << "Multi 2"
    << " | " << std::setw(10) << "Multi 3"
    << '\n';
  std::vector<double> sumTimes;
  for (unsigned i = 0; i < NTrials; ++i) {
    std::vector<Time> times = makeTest();
    std::cout << std::setw(2) << (i + 1) << ".";
    for (const Time &time : times) {
      std::cout << " | " << std::setw(10) << time.count();
    }
    std::cout << '\n';
    sumTimes.resize(times.size(), 0.0);
    for (size_t j = 0; j < times.size(); ++j) sumTimes[j] += times[j].count();
  }
  std::cout << "Average Values:\n   ";
  for (const double &sumTime : sumTimes) {
    std::cout << " | "
      << std::setw(10) << std::fixed << std::setprecision(1)
      << sumTime / NTrials;
  }
  std::cout << '\n';
  std::cout << "Ratio:\n   ";
  for (const double &sumTime : sumTimes) {
    std::cout << " | "
      << std::setw(10) << std::fixed << std::setprecision(3)
      << sumTime / sumTimes.front();
  }
  std::cout << '\n';
  // done
  return 0;
}

Compiled and tested on cygwin64 on Windows 10:

$ g++ --version
g++ (GCC) 7.3.0

$ g++ -std=c++11 -O2 -o test-multi-threading test-multi-threading.cc

$ ./test-multi-threading
std::thread::hardware_concurrency(): 8
Heat up...
Measuring 10 runs...
    |     Single |    Multi 1 |    Multi 2 |    Multi 3
 1. |     384008 |    1052937 |     130662 |     138411
 2. |     386500 |    1103281 |     133030 |     132576
 3. |     382968 |    1078988 |     137123 |     137780
 4. |     395158 |    1120752 |     138731 |     138650
 5. |     385870 |    1105885 |     144825 |     129405
 6. |     366724 |    1071788 |     137684 |     130289
 7. |     352204 |    1104191 |     133675 |     130505
 8. |     331679 |    1072299 |     135476 |     138257
 9. |     373416 |    1053881 |     138467 |     137613
10. |     370872 |    1096424 |     136810 |     147960
Average Values:
    |   372939.9 |  1086042.6 |   136648.3 |   136144.6
Ratio:
    |      1.000 |      2.912 |      0.366 |      0.365

I did the same on coliru.com. (I had to reduce the heat-up cycles and the sample size, as I exceeded the time limit with the original values.):

g++ (GCC) 8.1.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO

std::thread::hardware_concurrency(): 4
Heat up...
Measuring 10 runs...
    |     Single |    Multi 1 |    Multi 2 |    Multi 3
 1. |     224684 |     297729 |      48334 |      39016
 2. |     146232 |     337222 |      66308 |      59994
 3. |     195750 |     344056 |      61383 |      63172
 4. |     198629 |     317719 |      62695 |      50413
 5. |     149125 |     356471 |      61447 |      57487
 6. |     155355 |     322185 |      50254 |      35214
 7. |     140269 |     316224 |      61482 |      53889
 8. |     154454 |     334814 |      58382 |      53796
 9. |     177426 |     340723 |      62195 |      54352
10. |     151951 |     331772 |      61802 |      46727
Average Values:
    |   169387.5 |   329891.5 |    59428.2 |    51406.0
Ratio:
    |      1.000 |      1.948 |      0.351 |      0.303
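
Applied to the best-of-Ntrials loop from the question, the "grouped" scheme (Multi 3) might be sketched as below. This is only a sketch under assumptions: `TrialResult`, `bestOfTrials` and `runTrial` are hypothetical names, and `runTrial` stands in for computeGaussianMixtureModel(), which is assumed to be safe to call concurrently (e.g. each trial uses its own random state). Each thread keeps its own best result, and the reduction happens after join(), so no mutex is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <limits>
#include <thread>
#include <vector>

// Outcome of one trial: its score and the index of the trial that produced it.
// (A real version would also carry the cluster memberships.)
struct TrialResult {
  double score = -std::numeric_limits<double>::infinity();
  std::size_t trial = 0;
};

// Runs nTrials independent trials on up to nThreads threads and returns
// the best one. runTrial is a hypothetical stand-in for the model fit.
TrialResult bestOfTrials(std::size_t nTrials, std::size_t nThreads,
                         const std::function<double(std::size_t)> &runTrial)
{
  // never more threads than trials, and at least one thread
  nThreads = std::max<std::size_t>(1, std::min(nThreads, nTrials));
  std::vector<TrialResult> best(nThreads); // one slot per thread, no locking
  std::vector<std::thread> threads;
  threads.reserve(nThreads);
  for (std::size_t iT = 0; iT < nThreads; ++iT) {
    threads.emplace_back([iT, nTrials, nThreads, &best, &runTrial]() {
      // contiguous range of trials for this thread
      const std::size_t i0 = iT * nTrials / nThreads;
      const std::size_t iN = (iT + 1) * nTrials / nThreads;
      TrialResult local; // accumulate locally, publish once at the end
      for (std::size_t i = i0; i < iN; ++i) {
        const double score = runTrial(i);
        if (score > local.score) { local.score = score; local.trial = i; }
      }
      best[iT] = local;
    });
  }
  for (std::thread &t : threads) t.join();
  // after join(), the per-thread results can be read safely
  TrialResult result;
  for (const TrialResult &r : best) {
    if (r.score > result.score) result = r;
  }
  return result;
}
```

A call such as `bestOfTrials(Ntrials, std::thread::hardware_concurrency(), runTrial)` would then replace the sequential loop. Note that the sketch starts from minus infinity rather than 0 as the initial best score, which is a deliberate deviation from the question's code.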

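Concerning 3., I struggled a bit with whether this is legal, i.e. whether it is guaranteed that data written in the threads appears correctly in the main thread after joining. (That something seems to work fine is deceptive in general, and especially deceptive where multi-threading is concerned.)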


> std::thread::thread()

The completion of the invocation of the constructor synchronizes-with (as defined in std::memory_order) the beginning of the invocation of the copy of f on the new thread of execution.

> std::thread::join()

The completion of the thread identified by *this synchronizes with the corresponding successful return from join().
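
Read together, the two guarantees quoted above suggest that plain (non-atomic) writes made inside a thread can be read by the joining thread once join() has returned. A minimal sketch of that pattern, with the hypothetical helper name `fillSquares`, in which every thread writes only to its own elements:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread writes plain int values into its own (interleaved) slots.
// No two threads touch the same element, so there is no data race.
std::vector<int> fillSquares(std::size_t n, std::size_t nThreads)
{
  std::vector<int> out(n, 0);
  if (nThreads == 0) nThreads = 1;
  std::vector<std::thread> threads;
  threads.reserve(nThreads);
  for (std::size_t iT = 0; iT < nThreads; ++iT) {
    threads.emplace_back([iT, n, nThreads, &out]() {
      for (std::size_t i = iT; i < n; i += nThreads) {
        out[i] = static_cast<int>(i * i);
      }
    });
  }
  // the completion of each thread synchronizes with the return of its join()
  for (std::thread &t : threads) t.join();
  return out; // all writes happened before the joins returned
}
```

This relies on distinct elements being distinct memory locations, which holds for `std::vector<int>` but not for `std::vector<bool>`.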

On Stack Overflow, I found the following related Q/As:

> Does relaxed memory order effect can be extended to after performing-thread’s life?
> Are memory fences required here?
> Is there an implicit memory barrier with synchronized-with relationship on thread::join?




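Another approach might be to use a thread pool to overcome this. I googled a bit and found, e.g., Jakob Progsch's ThreadPool on GitHub. However, I guess that with a thread pool the locking issue is back "in the game".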
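
To illustrate why locking comes back with a pool, here is a deliberately minimal sketch. It is not the API of the ThreadPool project mentioned above; `SimplePool` and its members are hypothetical names. The shared task queue has to be protected by a mutex, and workers wait on a condition variable.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size pool: workers pull tasks from a mutex-protected queue.
class SimplePool {
public:
  explicit SimplePool(std::size_t nThreads)
  {
    if (nThreads == 0) nThreads = 1;
    for (std::size_t i = 0; i < nThreads; ++i) {
      workers_.emplace_back([this]() { workerLoop(); });
    }
  }
  // Lets the workers drain the queue, then joins them.
  ~SimplePool()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    wakeUp_.notify_all();
    for (std::thread &w : workers_) w.join();
  }
  SimplePool(const SimplePool &) = delete;
  SimplePool &operator=(const SimplePool &) = delete;

  void submit(std::function<void()> task)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_); // every submit takes the lock
      tasks_.push(std::move(task));
    }
    wakeUp_.notify_one();
  }

private:
  void workerLoop()
  {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        wakeUp_.wait(lock, [this]() { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) return; // only reached when stopping_ is set
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task(); // run the task outside the lock
    }
  }

  std::mutex mutex_;
  std::condition_variable wakeUp_;
  std::queue<std::function<void()>> tasks_;
  bool stopping_ = false;
  std::vector<std::thread> workers_;
};
```

The lock is only held while a task is enqueued or dequeued, so for coarse tasks such as one model fit per task the contention should be small. The tasks themselves can still write to separate result slots without further locking.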



Eigen and multi-threading – Using Eigen in a multi-threaded application

In the case your own application is multithreaded, and multiple threads make calls to Eigen, then you have to initialize Eigen by calling the following routine before creating the threads: Eigen::initParallel()



With Eigen 3.3, and a fully C++11 compliant compiler (i.e., thread-safe static local variable initialization), then calling initParallel() is optional.


Note that all functions generating random matrices are not re-entrant nor thread-safe. Those include DenseBase::Random(), and DenseBase::setRandom() despite a call to Eigen::initParallel(). This is because these functions are based on std::rand which is not re-entrant. For thread-safe random generator, we recommend the use of boost::random or c++11 random feature.
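
Following the recommendation quoted above, one way to get random numbers in each thread without shared state is to give every thread its own C++11 engine. The sketch below uses only the standard `<random>` header; `randomBlocks` is a hypothetical name, and the range [-1, 1) and the seeding scheme (base seed plus thread index) are arbitrary choices for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Fills one block of values per thread. Every thread owns its engine,
// seeded from baseSeed and the thread index, so no generator state is shared.
std::vector<std::vector<double>> randomBlocks(
  std::size_t nThreads, std::size_t nPerThread, std::uint32_t baseSeed)
{
  std::vector<std::vector<double>> blocks(nThreads);
  std::vector<std::thread> threads;
  threads.reserve(nThreads);
  for (std::size_t iT = 0; iT < nThreads; ++iT) {
    threads.emplace_back([iT, nPerThread, baseSeed, &blocks]() {
      std::seed_seq seq{baseSeed, static_cast<std::uint32_t>(iT)};
      std::mt19937 engine(seq); // thread-local engine
      std::uniform_real_distribution<double> dist(-1.0, 1.0);
      std::vector<double> &block = blocks[iT]; // this thread's own block
      block.reserve(nPerThread);
      for (std::size_t i = 0; i < nPerThread; ++i) {
        block.push_back(dist(engine));
      }
    });
  }
  for (std::thread &t : threads) t.join();
  return blocks;
}
```

The values generated this way could then be copied into Eigen arrays inside each thread instead of calling the random-matrix functions named in the warning.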


