Chapter 26 introduces a trio of useful new pseudocode keywords: spawn, sync, and parallel for. We have seen a few examples of how these simple constructs allow us to form powerful parallel implementations of common algorithms.
Here is the pseudocode for the parallelized version of merge sort from chapter 26.
MERGE-SORT'(A, p, r)
  if p < r
    q = (p + r)/2
    spawn MERGE-SORT'(A, p, q)
    MERGE-SORT'(A, q+1, r)
    sync
    P-MERGE(A, p, q, r)

P-MERGE(T, p1, r1, p2, r2, A, p3)
  n1 = r1 - p1 + 1
  n2 = r2 - p2 + 1
  if n1 < n2
    exchange p1 with p2
    exchange r1 with r2
    exchange n1 with n2
  if n1 == 0
    return
  else
    q1 = (p1 + r1)/2
    q2 = BINARY-SEARCH(T[q1], T, p2, r2)
    q3 = p3 + (q1 - p1) + (q2 - p2)
    A[q3] = T[q1]
    spawn P-MERGE(T, p1, q1-1, p2, q2-1, A, p3)
    P-MERGE(T, q1+1, r1, q2, r2, A, q3+1)
    sync
In the programming assignment below I am going to ask you to write a C++ program that implements another parallel algorithm. The notes below cover the basics of parallel programming in C++.
OpenMP is a system for writing parallel programs in C++. OpenMP combines a set of compiler extensions with a runtime library. The compiler extensions allow you to add parallel programming features to your program through the use of compiler directives, specifically pragmas.
The source code for a C++ version of the parallel merge sort algorithm is in the merge.cpp file provided with this assignment.
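Since the full listing is long, here is a rough sketch of its overall shape. This is my own reconstruction, not the file's exact contents: the spawn/sync structure of MERGE-SORT' maps onto OpenMP sections, with a simple sequential merge standing in for P-MERGE.

#include <omp.h>

// Sketch: recursive merge sort whose two halves run as OpenMP sections.
// tmp is scratch space the same size as A. Nested parallel regions may be
// serialized unless nested parallelism is enabled in the OpenMP runtime.
void pMergeSort(int A[], int tmp[], int p, int r) {
    if (p < r) {
        int q = (p + r) / 2;
        #pragma omp parallel sections shared(A, tmp) firstprivate(p, q, r)
        {
            #pragma omp section
            pMergeSort(A, tmp, p, q);      // plays the role of spawn
            #pragma omp section
            pMergeSort(A, tmp, q + 1, r);
        }                                  // implicit sync at the closing brace
        // Merge the sorted halves A[p..q] and A[q+1..r] through tmp.
        for (int i = p; i <= r; i++)
            tmp[i] = A[i];
        int i = p, j = q + 1, k = p;
        while (i <= q && j <= r)
            A[k++] = (tmp[i] <= tmp[j]) ? tmp[i++] : tmp[j++];
        while (i <= q)
            A[k++] = tmp[i++];
        while (j <= r)
            A[k++] = tmp[j++];
    }
}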
The OpenMP pragmas embedded in this code are fairly easy to use.
To set up a parallel for loop we use a construct like this:
#pragma omp parallel for
for(i = 0; i < n; i++)
    A[i] = 0;
This is simply a special omp pragma placed before the start of the for loop.
To set up a spawn/sync arrangement analogous to
x = spawn f(i)
y = f(i-1)
sync
we use the OpenMP construct
#pragma omp parallel sections
{
    #pragma omp section
    x = f(i);
    #pragma omp section
    y = f(i-1);
}
The combination of the parallel sections pragma and the curly braces sets up an implicit sync that takes place after the right curly brace.
When you use a parallel for or a section to move some code to another thread running on a different core, you will want to take some care to distinguish variables that are shared across all threads from variables that are meant to be used exclusively in a single thread. You make this distinction by adding shared and private clauses to your OpenMP pragmas. If you look back at the code for parallel merge sort above you will see that in each case I was careful to indicate which variables are shared and which are private.
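For example, here is a generic sketch (not taken from merge.cpp) in which two arrays are shared by every thread while a scratch variable must be private so that the threads do not overwrite each other's copy:

int temp;  // scratch variable reused by every iteration
#pragma omp parallel for shared(A, B, n) private(temp)
for (int i = 0; i < n; i++) {  // the loop index is automatically private
    temp = A[i];               // without private(temp) this would be a data race
    A[i] = B[i];
    B[i] = temp;
}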
Many modern development environments offer support for OpenMP.
In Visual Studio you can turn on OpenMP support by right-clicking on your project in the Solution Explorer and selecting Properties. Go to C/C++ > Language and find the Open MP Support option. Set that option to Yes to enable OpenMP in your project.
On the Mac the situation is more complicated. The problem is that the C++ compiler Xcode uses does not support OpenMP. As an alternative, we can use the Homebrew package manager to install the gcc compiler tools and then use gcc with Visual Studio Code to build and debug our OpenMP programs. Here are instructions on how to do that:
brew install gcc
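Once gcc is installed you can compile and run an OpenMP program from the Visual Studio Code terminal. The exact command depends on which gcc version Homebrew installed (check with brew info gcc); it will look something like:

g++-14 -fopenmp -O2 merge.cpp -o merge
./merge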
The merge.cpp file I provided above contains code for an implementation of the quicksort algorithm:
int partition(int A[], int p, int r) {
    // Median of three: pick the median of the first, middle, and last
    // elements as the pivot to avoid worst-case behavior on sorted input.
    int a = p;
    int b = (p + r) / 2;
    int c = r;
    int use = c;
    if (A[a] > A[b] && A[a] < A[c])
        use = a;
    else if (A[a] < A[b] && A[a] > A[c])
        use = a;
    else if (A[b] < A[a] && A[b] > A[c])
        use = b;
    else if (A[b] > A[a] && A[b] < A[c])
        use = b;
    if (use != c) {
        // Move the chosen pivot to the end of the range.
        int pivot = A[use];
        A[use] = A[r];
        A[r] = pivot;
    }
    int x = A[r];
    int i = p - 1;
    for (int j = p; j < r; j++) {
        if (A[j] <= x) {
            i++;
            int temp = A[i];
            A[i] = A[j];
            A[j] = temp;
        }
    }
    int temp = A[i+1];
    A[i+1] = A[r];
    A[r] = temp;
    return i + 1;
}

void insertion(int A[], int p, int q) {
    for (int n = p; n < q; n++) {
        int k = n;
        int x = A[k+1];
        while (k >= p && A[k] > x) {
            A[k+1] = A[k];
            k--;
        }
        A[k+1] = x;
    }
}

void quicksort(int A[], int p, int q) {
    if (q - p < 100) {
        // Use insertion sort for small chunks
        insertion(A, p, q);
    } else {
        int r = partition(A, p, q);
        quicksort(A, p, r-1);
        quicksort(A, r+1, q);
    }
}
In the programming assignment below you are going to construct a parallel version of quicksort.
The first change we can make to speed up quicksort with OpenMP is to take advantage of an obvious opportunity in the quicksort function. Since the two recursive calls to quicksort operate on disjoint parts of the array, they are independent of each other, and we can use OpenMP to run those two recursive calls in parallel.
void naive_parallel_quicksort(int A[], int p, int q, int n_threads) {
    if (q - p < 100) {
        // Use insertion sort for small chunks. This test has to come
        // first so that tiny or empty ranges never reach partition.
        insertion(A, p, q);
    } else if (n_threads >= 2) {
        int r = partition(A, p, q);
        int n = n_threads / 2;
        #pragma omp task shared(A)
        naive_parallel_quicksort(A, p, r-1, n);
        #pragma omp task shared(A)
        naive_parallel_quicksort(A, r+1, q, n);
    } else {
        int r = partition(A, p, q);
        quicksort(A, p, r-1);
        quicksort(A, r+1, q);
    }
}
Here I am using a different OpenMP construct than I used in the merge sort example: we have replaced OpenMP parallel sections with OpenMP tasks. When you use a task architecture in OpenMP, the library sets up a pool of worker threads and then passes tasks to those threads each time the program encounters a task pragma.
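For comparison, the spawn/sync fragment shown earlier could be written with tasks instead of sections. In this sketch (using the same f, x, y, and i as before), taskwait plays the role of sync:

#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp task shared(x)
        x = f(i);               // the spawn: runs as a task on another thread
        y = f(i - 1);           // the parent thread keeps working meanwhile
        #pragma omp taskwait    // the sync: wait for the task to complete
    }
}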
Here is the main function for a test program that runs both the original non-parallel and the parallel versions of quicksort and times them. You should see a slight improvement in the run time for the simple parallel version of quicksort.
#include <iostream>
#include <ctime>
#include <stdlib.h>
#include <omp.h>

#define SIZE 500000000

int main(int argc, const char * argv[]) {
    std::cout << "Preparing unsorted array." << std::endl;
    int* B = new int[SIZE];
    for(int i = 0; i < SIZE; i++)
        B[i] = rand() % SIZE;
    int* A = new int[SIZE];
    for(int i = 0; i < SIZE; i++)
        A[i] = B[i];

    std::cout << "Starting sequential sort." << std::endl;
    time_t now = time(NULL);
    clock_t start = clock();
    quicksort(A, 0, SIZE-1);
    time_t later = time(NULL);
    clock_t end = clock();
    std::cout << "Quicksort took " << (later-now) << " seconds." << std::endl;
    std::cout << "CPU time was " << (end-start)/CLOCKS_PER_SEC << " seconds." << std::endl;

    // Use the largest power of two that does not exceed the thread count.
    int n = omp_get_max_threads();
    int k = 2;
    while(2*k <= n)
        k *= 2;
    for(int i = 0; i < SIZE; i++)
        A[i] = B[i];
    std::cout << "Naive parallel using " << k << " threads out of a maximum of " << n << "." << std::endl;
    now = time(NULL);
    start = clock();
    #pragma omp parallel
    {
        // Only one thread starts the recursion; the tasks it creates
        // are picked up by the rest of the pool.
        #pragma omp single
        naive_parallel_quicksort(A, 0, SIZE-1, k);
        #pragma omp taskwait
    }
    later = time(NULL);
    end = clock();
    std::cout << "Naive parallel quicksort took " << (later-now) << " seconds." << std::endl;
    std::cout << "CPU time was " << (end-start)/CLOCKS_PER_SEC << " seconds." << std::endl;

    delete[] A;
    delete[] B;  // free the reference copy as well
    return 0;
}
Because we are using a task architecture for this program, the pragmas needed to launch the parallel sorting process look a little different: tasks can only be created inside a parallel region, and the single directive ensures that just one thread makes the initial recursive call while the rest of the pool stands by to execute the tasks it creates.
We can further improve the performance of the parallel quicksort by also taking advantage of parallelism in its partition function. Here is the outline of a strategy for making a parallel partition.
Divide the region A[p..q-1] into n_threads equal sized subregions, and then partition each of those subregions around the pivot in parallel. Run your program with this more parallel quicksort and the original quicksort and time them both. You should see a noticeable improvement in the run time of the parallel version.
To manage the subblocks you should set up a simple class:
class pBlock {
public:
    int start;
    int mid;
    int end;
};
When you divide A[p..q-1] into subblocks you should create a pBlock object for each of the subblocks you set up. At the start, each pBlock will have its start and end fields set. Then, to do the partitioning of a subblock, pass a reference to its pBlock, along with the array and the pivot, to a function that will partition the subblock. That function should set the value of the mid field, which specifies the location of the first number in the subblock that is greater than the pivot.
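For example, the partitioning function might look something like this. This is only a sketch: the name partitionBlock and its exact signature are my own choices, not part of the provided code.

// Sketch: partition A[block.start..block.end] in place around pivot.
// Elements <= pivot are moved to the front of the subblock.
void partitionBlock(int A[], int pivot, pBlock& block) {
    int i = block.start - 1;
    for (int j = block.start; j <= block.end; j++) {
        if (A[j] <= pivot) {
            i++;
            int temp = A[i];
            A[i] = A[j];
            A[j] = temp;
        }
    }
    block.mid = i + 1;  // index of the first element greater than the pivot
}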
To set up a parallel loop that works with the array A and our pool of worker threads, use the following pragma before the for loop:
#pragma omp taskloop shared(A)
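For instance, combined with the hypothetical partitionBlock function sketched above and an array of pBlock objects (called blocks here for illustration), all of the subblocks could be partitioned in parallel like this. Note that taskloop, like task, has to run inside a parallel region, for example under a single directive as in the main function above.

#pragma omp taskloop shared(A, blocks)
for (int t = 0; t < n_threads; t++)
    partitionBlock(A, pivot, blocks[t]);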