
Filling histograms in parallel with OpenMP (array reduction) without using a critical section


牛魔王的故事 2019-11-19 11:06:07

I want to fill histograms in parallel using OpenMP. I have come up with two ways of doing this with OpenMP in C/C++.

The first method, proccess_data_v1, creates a private histogram hist_private for each thread, fills the private histograms in parallel, and then sums them into the shared histogram hist inside a critical section.

The second method, proccess_data_v2, creates a shared array of histograms whose size equals the number of threads, fills this array in parallel, and then sums it into the shared histogram hist in parallel.

The second method seems better to me because it avoids the critical section and sums the histograms in parallel. However, it requires knowing the number of threads and calling omp_get_thread_num(), which I generally try to avoid. Is there a better way to implement the second method without referencing thread numbers and without using a shared array whose size equals the number of threads?

void proccess_data_v1(float *data, int *hist, const int n, const int nbins, float max) {
    #pragma omp parallel
    {
        int *hist_private = new int[nbins];
        for(int i=0; i<nbins; i++) hist_private[i] = 0;
        #pragma omp for nowait
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(hist_private, nbins, max, x);
        }
        #pragma omp critical
        {
            for(int i=0; i<nbins; i++) {
                hist[i] += hist_private[i];
            }
        }
        delete[] hist_private;
    }
}

void proccess_data_v2(float *data, int *hist, const int n, const int nbins, float max) {
    const int nthreads = 8;
    omp_set_num_threads(nthreads);
    int *hista = new int[nbins*nthreads];

    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        for(int i=0; i<nbins; i++) hista[nbins*ithread+i] = 0;
        #pragma omp for
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(&hista[nbins*ithread], nbins, max, x);
        }

        #pragma omp for
        for(int i=0; i<nbins; i++) {
            for(int t=0; t<nthreads; t++) {
                hist[i] += hista[nbins*t + i];
            }
        }
    }
    delete[] hista;
}


3 Answers

慕妹3242003

Contributed 1824 experience points, received more than 6 likes

I created an improved method, called process_data_v3:

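#include <omp.h>
#include <xmmintrin.h>  // _mm_malloc / _mm_free (GCC and Clang; MSVC declares them in <malloc.h>)

// Rounds x down to a multiple of s; s must be a power of two.
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

// reconstruct_data() and fill_hist() are the helpers used in the question.
void process_data_v3(float *data, int *hist, const int n, const int nbins, float max) {
    int* hista;
    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();

        int lda = ROUND_DOWN(nbins+1023, 1024);  // 1024 ints = 4096 bytes -> round up to a multiple of the page size
        #pragma omp single
        hista = (int*)_mm_malloc(lda*sizeof(int)*nthreads, 4096);  // align memory to the page size
        // implicit barrier after single: hista is now visible to every thread

        // each thread zeroes (and so first touches) its own partial histogram
        for(int i=0; i<nbins; i++) hista[lda*ithread+i] = 0;
        #pragma omp for
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(&hista[lda*ithread], nbins, max, x);
        }

        // sum the partial histograms, with the bins split across threads
        #pragma omp for
        for(int i=0; i<nbins; i++) {
            for(int t=0; t<nthreads; t++) {
                hist[i] += hista[lda*t + i];
            }
        }
    }
    _mm_free(hista);
}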
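As an alternative to managing per-thread buffers by hand, OpenMP 4.5 and later allow a reduction over an array section in C/C++, which avoids both omp_get_thread_num() and an explicit critical section. The following is only a minimal sketch under that assumption (the compiler must support OpenMP 4.5). The function name process_data_reduction and the body of fill_hist are hypothetical stand-ins, since the question does not show the original helpers.

```cpp
#include <cassert>

// Hypothetical stand-in for the question's fill_hist():
// maps x in [0, max) to one of nbins equally wide bins and clamps outliers.
static void fill_hist(int *hist, int nbins, float max, float x) {
    assert(nbins > 0 && max > 0.0f);
    int bin = static_cast<int>(x / max * nbins);
    if (bin < 0) bin = 0;
    if (bin >= nbins) bin = nbins - 1;
    hist[bin]++;
}

// Hypothetical name; sketch of an array-section reduction (OpenMP 4.5+).
void process_data_reduction(const float *data, int *hist, int n, int nbins, float max) {
    // Each thread works on a private, zero-initialised copy of hist[0..nbins).
    // The runtime adds all private copies into the original hist at the end.
    #pragma omp parallel for reduction(+ : hist[:nbins])
    for (int i = 0; i < n; i++) {
        fill_hist(hist, nbins, max, data[i]);
    }
}
```

If the code is built without OpenMP support, the pragma is ignored and the loop runs serially with the same result. How the private copies are allocated and combined is left to the implementation; some implementations may place them on each thread's stack, so very large values of nbins might need a larger thread stack size.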

Posted 2019-11-19

慕絲7291255

Contributed 1859 experience points, received more than 6 likes

You could allocate the big array inside the parallel region, where you can query the actual number of threads being used:

int *hista;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();

    #pragma omp single
    hista = new int[nbins*nthreads];

    ...
}
delete[] hista;

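For better performance, I would recommend rounding the size of each thread's chunk in hista up to a multiple of the system's memory page size, even if this leaves gaps between the partial histograms. That way you prevent both false sharing and remote memory accesses on NUMA systems (although not in the final reduction phase).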
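The rounding itself is simple arithmetic. As a rough sketch, the helper below (padded_stride is a hypothetical name, and the 4096-byte default page size is an assumption that should be checked on the target system) computes how many ints to reserve per thread so that each partial histogram covers a whole number of pages:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of ints to reserve per thread so that every
// partial histogram occupies a whole number of pages.
// page_bytes is assumed to be a power of two (4096 is common, not universal).
std::size_t padded_stride(std::size_t nbins, std::size_t page_bytes = 4096) {
    const std::size_t ints_per_page = page_bytes / sizeof(int);
    // The bit trick below only works for a power-of-two granularity.
    assert(ints_per_page > 0 && (ints_per_page & (ints_per_page - 1)) == 0);
    // Round nbins up to the next multiple of ints_per_page.
    return (nbins + ints_per_page - 1) & ~(ints_per_page - 1);
}
```

With 4-byte ints and 4096-byte pages, 1000 bins are padded to 1024 ints per thread and 1025 bins to 2048. Thread t then uses the range starting at hista + t * padded_stride(nbins). The padding only separates threads onto different pages if the base pointer is itself page-aligned.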

Posted 2019-11-19

POPMUISE

Contributed 1765 experience points, received more than 5 likes

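It really depends on the memory manager in use. For example, on some distributions glibc is configured to use per-thread arenas, so each thread gets its own heap space. Larger allocations are typically implemented as anonymous mmaps and therefore always get fresh pages. But it does not matter which thread allocated the memory. What matters is which thread first touches each particular page: the current NUMA policy on Linux is "first touch", meaning that a physical memory page comes from the NUMA node on which the code that first touched that page is running.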
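To illustrate the first-touch point, here is a minimal sketch in which a single thread allocates the buffer but every thread zeroes its own chunk, so that under a first-touch policy those pages would be placed near the thread that later fills them. The function name alloc_first_touch and its parameters are hypothetical. Plain malloc is used only to keep the sketch portable; for chunks to map onto whole pages, a page-aligned allocation and a page-sized stride would be needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: allocates one chunk of `stride` ints per thread and lets
// each thread zero (first touch) the first nbins ints of its own chunk.
// Returns the buffer (release it with std::free) and reports the team size.
int *alloc_first_touch(int nbins, int stride, int *nthreads_out) {
    assert(nbins > 0 && stride >= nbins && nthreads_out != nullptr);
    int *hista = nullptr;
    int nthreads = 1;
    #pragma omp parallel
    {
#ifdef _OPENMP
        const int ithread = omp_get_thread_num();
#else
        const int ithread = 0;  // serial fallback when built without OpenMP
#endif
        #pragma omp single
        {
#ifdef _OPENMP
            nthreads = omp_get_num_threads();
#endif
            // Which thread allocates is irrelevant for page placement.
            hista = static_cast<int *>(
                std::malloc(sizeof(int) * static_cast<std::size_t>(stride) * nthreads));
        }  // implicit barrier: hista and nthreads are visible to all threads

        // First touch: each thread writes to its own chunk before anyone else does.
        if (hista != nullptr) {
            for (int i = 0; i < nbins; i++) hista[stride * ithread + i] = 0;
        }
    }
    *nthreads_out = nthreads;
    return hista;
}
```

Whether this actually improves locality depends on the system's NUMA policy and on threads staying on the same node, for example through thread pinning.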

Posted 2019-11-19