I have some serial code that does a matrix-vector multiply with matrices represented as std::vector<std::vector<double>> and std::vector<double>, respectively:
void mat_vec_mult(const std::vector<std::vector<double>> &mat, const std::vector<double> &vec,
std::vector<std::vector<double>> *result, size_t beg, size_t end) {
// multiply a matrix by a pre-transposed column vector; returns a column vector
for (auto i = beg; i < end; i++) {
(*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(), vec.begin(), 0.0)};
}
}
I would like to parallelize it using OpenMP, which I am trying to learn. From here, I got to the following:
void mat_vec_mult_parallel(const std::vector<std::vector<double>> &mat, const std::vector<double> &vec,
std::vector<std::vector<double>> *result, size_t beg, size_t end) {
// multiply a matrix by a pre-transposed column vector; returns a column vector
#pragma omp parallel
{
#pragma omp for nowait
for (auto i = beg; i < end; i++) {
(*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(), vec.begin(), 0.0)};
}
}
}
This approach has not resulted in any speedup; I would appreciate any help in choosing the correct OpenMP directives.