
I have some serial code that does a matrix-vector multiply, with the matrix and vector represented as std::vector<std::vector<double>> and std::vector<double>, respectively:

#include <numeric>
#include <vector>

void mat_vec_mult(const std::vector<std::vector<double>> &mat, const std::vector<double> &vec,
                  std::vector<std::vector<double>> *result, size_t beg, size_t end) {
  // multiply a matrix by a pre-transposed column vector; returns a column vector
  for (auto i = beg; i < end; i++) {
    (*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(), vec.begin(), 0.0)};
  }
}

I would like to parallelize it using OpenMP, which I am trying to learn. From here, I got to the following:

void mat_vec_mult_parallel(const std::vector<std::vector<double>> &mat, const std::vector<double> &vec,
                           std::vector<std::vector<double>> *result, size_t beg, size_t end) {
  // multiply a matrix by a pre-transposed column vector; returns a column vector
  #pragma omp parallel
  {
    #pragma omp for nowait
    for (auto i = beg; i < end; i++) {
      (*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(), vec.begin(), 0.0)};
    }
  }
}

This approach has not resulted in any speedup; I would appreciate any help in choosing the correct OpenMP directives.

  • The approach looks correct. What is the size of the matrix? There is an overhead for creating threads that might kill performance for small matrices. – Gilles-Philippe Paillé Apr 11 '19 at 01:49
  • Hi @Gilles-PhilippePaillé, I tried with a 20,000 x 20,000 matrix and saw no improvement. – Luciano Apr 11 '19 at 01:51
  • What compiler are you using? There are some compilation flags that need to be added for OpenMP to be activated; otherwise, the code is executed serially as usual. – Gilles-Philippe Paillé Apr 11 '19 at 01:53
  • I am using `g++-8 (Homebrew GCC 8.3.0) 8.3.0` and compiling with `g++-8 -std=c++11 ${file}.cpp -fopenmp`. – Luciano Apr 11 '19 at 01:55
  • What do you get if you print `omp_get_num_threads()`? (Include omp.h; see the sketch after these comments.) – Gilles-Philippe Paillé Apr 11 '19 at 02:19
  • Check out cache locality. That may be the issue. `vector<vector<double>>` may have very scattered memory. – doug Apr 11 '19 at 02:21
  • @Gilles-PhilippePaillé it prints `8`. @doug how can I improve cache locality? – Luciano Apr 11 '19 at 02:33
  • You would need an OpenMP which supports pinning, e.g. OMP_PLACES=cores. If the matrix is very large, you would need to process it in blocks that fit in L1 cache. If it is too small, you will have difficulty measuring performance, excessive threading overhead, and premature cache eviction. – tim18 Apr 11 '19 at 09:36
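
As a side note on the `omp_get_num_threads()` check above: that function returns 1 when called outside a parallel region, so a minimal sketch to verify that OpenMP is really active (assuming the same `g++-8 ... -fopenmp` build from the comments) would be:

#include <cstdio>
#include <omp.h>

int main() {
  // Outside a parallel region the runtime always reports a team of 1.
  std::printf("outside parallel: %d thread(s)\n", omp_get_num_threads());

  #pragma omp parallel
  {
    // One thread reports the actual team size (8 on the machine from the comments).
    #pragma omp single
    std::printf("inside parallel:  %d thread(s)\n", omp_get_num_threads());
  }
}

If it also reports 1 inside the parallel region, the program is effectively running on a single thread (for example because OMP_NUM_THREADS=1 is set).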

1 Answer


There are several things that could explain why you don't see any performance improvement. The most likely ones are these:

  1. You didn't activate OpenMP support at your compiler's level. From the comments, this seems not to be the case for you, so it can be ruled out. I'm still mentioning it because it is such a common mistake that the reminder is worth making.
  2. The way you measure your time: beware of CPU time vs. elapsed time. See this answer for example to see how to properly measure the elapsed time, as this is the time you want to see decreasing (a timing sketch follows this list).
  3. The fact that your code is memory bound: normally, matrix-matrix multiplication is the type of code that shines at exploiting CPU power. However, that doesn't happen by magic: the code has to be tuned towards that goal, and one of the first tuning techniques to apply is tiling / cache blocking. The aim is to maximize data (re)use while it sits in cache, instead of fetching it from main memory again and again. From what I can see in your code, the algorithm does exactly the opposite: it streams data from memory for processing and completely ignores the reuse potential. So you're memory bound, and in this case, sorry, but OpenMP can't help you much. See this answer for example to see why. (A layout sketch follows after the closing paragraph below.)
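
For point 2, here is a minimal sketch of how the elapsed (wall-clock) time could be measured around the question's mat_vec_mult_parallel with omp_get_wtime(); the matrix size here is an arbitrary example, smaller than the 20,000 x 20,000 mentioned in the comments:

#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>
#include <omp.h>

// mat_vec_mult_parallel exactly as in the question.
void mat_vec_mult_parallel(const std::vector<std::vector<double>> &mat, const std::vector<double> &vec,
                           std::vector<std::vector<double>> *result, size_t beg, size_t end) {
  #pragma omp parallel
  {
    #pragma omp for nowait
    for (auto i = beg; i < end; i++) {
      (*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(), vec.begin(), 0.0)};
    }
  }
}

int main() {
  const size_t n = 5000;  // roughly 200 MB of matrix data; adjust as needed
  std::vector<std::vector<double>> mat(n, std::vector<double>(n, 1.0));
  std::vector<double> vec(n, 1.0);
  std::vector<std::vector<double>> result(n);

  double t0 = omp_get_wtime();               // wall-clock time, not CPU time
  mat_vec_mult_parallel(mat, vec, &result, 0, n);
  double t1 = omp_get_wtime();
  std::printf("elapsed: %.3f s\n", t1 - t0);
}

Timers that report CPU time sum the time spent by all threads, so that number will not go down (and may even go up) when the code is parallelized; omp_get_wtime() reports wall-clock time, which is what you want to see decreasing.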

These are not the only reasons that could explain a lack of scalability, but with the limited information you give, I think they are the most likely culprits.
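
As an illustration of the cache-locality point from the comments (my own sketch, not code from the question or the answer): storing the matrix in one contiguous row-major std::vector<double>, so that element (i, j) lives at mat[i * ncols + j], already removes the pointer-chasing between the rows of a vector<vector<double>>; proper tiling / cache blocking would go further than this:

#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical flat layout: row i occupies mat[i * ncols] .. mat[(i + 1) * ncols - 1].
void mat_vec_mult_flat(const std::vector<double> &mat, const std::vector<double> &vec,
                       std::vector<double> *result, size_t nrows, size_t ncols) {
  #pragma omp parallel for
  for (size_t i = 0; i < nrows; i++) {
    // Each row is one contiguous stream, so the hardware prefetcher can keep up.
    (*result)[i] = std::inner_product(mat.begin() + i * ncols,
                                      mat.begin() + (i + 1) * ncols,
                                      vec.begin(), 0.0);
  }
}

Even then, a matrix-vector product reads O(n^2) data to perform O(n^2) flops, so it stays memory bound: the speedup from adding threads will saturate at the available memory bandwidth rather than scale with the core count.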

Gilles