OpenMP

Chapter 26 introduces a trio of useful new pseudocode keywords: spawn, sync, and parallel for. We have seen a few examples of how these simple constructs allow us to build powerful parallel implementations of common algorithms.

Here is the pseudocode for the parallelized version of merge sort from chapter 26.

MERGE-SORT'(A,p,r)
  if p < r
    q = (p+r)/2
    spawn MERGE-SORT'(A,p,q)
    MERGE-SORT'(A,q+1,r)
    sync
    copy A[p..r] into a new array T[p..r]
    P-MERGE(T,p,q,q+1,r,A,p)

P-MERGE(T,p1,r1,p2,r2,A,p3)
  n1 = r1 - p1 + 1
  n2 = r2 - p2 + 1
  if n1 < n2
    exchange p1 with p2
    exchange r1 with r2
    exchange n1 with n2
  if n1 == 0
    return
  else
    q1 = (p1+r1)/2
    q2 = BINARY-SEARCH(T[q1],T,p2,r2)
    q3 = p3 + (q1 - p1) + (q2 - p2)
    A[q3] = T[q1]
    spawn P-MERGE(T,p1,q1-1,p2,q2-1,A,p3)
    P-MERGE(T,q1+1,r1,q2,r2,A,q3+1)
    sync

In the programming assignment below I am going to ask you to write a C++ program that implements another parallel algorithm. The notes below will show the basics of parallel programming in C++.

OpenMP is a system for writing parallel programs in C++. OpenMP is a combination of a set of compiler extensions and an OpenMP runtime library. The compiler extensions allow you to implement parallel programming features in your program through the use of compiler directives, specifically pragmas.

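Here is a minimal example (not part of the assignment) showing both halves at work: the parallel pragma is a compiler directive that starts a team of threads, while omp_get_thread_num is an ordinary library call from the OpenMP runtime.

#include <iostream>
#include <omp.h>

int main()
{
#pragma omp parallel
    {
        int id = omp_get_thread_num();
        // The critical pragma keeps the output lines from interleaving.
#pragma omp critical
        std::cout << "Hello from thread " << id << std::endl;
    }
    return 0;
}
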
The merge.cpp file provided with these notes contains the source code for a C++ version of the parallel merge sort algorithm.

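In case you do not have merge.cpp in front of you, here is a rough sketch of how the pseudocode translates into C++ with OpenMP. The function names here are mine and the real file may differ in its details; note also that for the recursive sections to actually run on separate threads, OpenMP's nested parallelism has to be enabled.

int binary_search(int x, int T[], int low, int high)
{
    // Return the first index in T[low..high] whose value is >= x,
    // or high+1 if there is none (the BINARY-SEARCH of the pseudocode).
    high = high + 1;
    if(high < low) high = low;
    while(low < high) {
        int mid = (low + high)/2;
        if(x <= T[mid])
            high = mid;
        else
            low = mid + 1;
    }
    return high;
}

void p_merge(int T[], int p1, int r1, int p2, int r2, int A[], int p3)
{
    int n1 = r1 - p1 + 1;
    int n2 = r2 - p2 + 1;
    if(n1 < n2) {
        // Make the first range the longer of the two.
        int temp;
        temp = p1; p1 = p2; p2 = temp;
        temp = r1; r1 = r2; r2 = temp;
        temp = n1; n1 = n2; n2 = temp;
    }
    if(n1 == 0)
        return;
    int q1 = (p1 + r1)/2;
    int q2 = binary_search(T[q1], T, p2, r2);
    int q3 = p3 + (q1 - p1) + (q2 - p2);
    A[q3] = T[q1];
#pragma omp parallel sections shared(T, A)
    {
#pragma omp section
        p_merge(T, p1, q1 - 1, p2, q2 - 1, A, p3);
#pragma omp section
        p_merge(T, q1 + 1, r1, q2, r2, A, q3 + 1);
    }
    // The implicit barrier at the closing brace plays the role of sync.
}

void merge_sort(int A[], int p, int r)
{
    if(p < r) {
        int q = (p + r)/2;
#pragma omp parallel sections shared(A)
        {
#pragma omp section
            merge_sort(A, p, q);
#pragma omp section
            merge_sort(A, q + 1, r);
        }
        int n = r - p + 1;
        int* T = new int[n];
        for(int i = 0; i < n; i++)  // Copy A[p..r] into scratch storage.
            T[i] = A[p + i];
        p_merge(T, 0, q - p, q - p + 1, n - 1, A, p);
        delete[] T;
    }
}
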
The OpenMP pragmas embedded in this code are fairly easy to use.

To set up a parallel for loop we use a construct like this:

#pragma omp parallel for
for(int i = 0;i < n;i++)
  A[i] = 0;

This is simply a special omp pragma placed before the start of the for loop; the loop's iterations are then divided up among the available threads.

To set up a spawn/sync arrangement analogous to

x = spawn f(i)
y = f(i-1)
sync

we use the OpenMP construct

#pragma omp parallel sections
{
#pragma omp section
x = f(i);
#pragma omp section
y = f(i-1);
}

The combination of the parallel sections pragma and the curly braces sets up an implicit sync: no thread proceeds past the right curly brace until both sections have finished.

When you use a parallel for or a section to move some code to another thread running on a different core, you will want to take some care to distinguish variables that are shared across all threads from variables that are meant to be used exclusively in a single thread. You make this distinction by adding shared and private variable clauses to your OpenMP pragmas. If you look back at the code for parallel merge sort above you will see that in each case I was careful to indicate which variables are shared and which are private.
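
For example, in this sketch (the function and variable names are made up for illustration) every thread reads and writes the one shared array A, while each thread gets its own copy of the scratch variable temp. The index variable of a parallel for loop is private automatically.

void fill_squares(int A[], int n)
{
    int temp;
#pragma omp parallel for shared(A, n) private(temp)
    for(int i = 0; i < n; i++) {
        temp = i*i;   // Each thread has its own temp...
        A[i] = temp;  // ...but they all write into the shared array A.
    }
}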

Turning on OpenMP in your own compiler

Many modern development environments offer support for OpenMP.

In Visual Studio you can turn on OpenMP support by right-clicking on your project in the solution explorer and selecting Properties... Go to C/C++/Language and find the option for OpenMP support. Set that option to Yes to enable OpenMP in your project.

On the Mac the situation is more complicated: the C++ compiler that Xcode uses does not support OpenMP out of the box. As an alternative, we can use the Homebrew package manager to install the gcc compiler tools and then use gcc with Visual Studio Code to build and debug our OpenMP programs. Here are instructions on how to do that:

  1. Go to the Homebrew home page to get instructions on how to install the Homebrew package manager for OS X.
  2. Open a terminal window and type the command brew install gcc
  3. Next, if you don't already have it installed, go to the download page for Visual Studio Code to download that software.
  4. Start up Visual Studio Code and open the extensions manager by pressing the command-shift-X key combination. Locate and install the C/C++ Extension Pack.
  5. Select Add Folder to Workspace... from the file menu and select the folder where your source code for the parallel merge sort example is located.
  6. With the merge.cpp file open select Start Debugging from the Run menu. Visual Studio Code will prompt you to select an option for a compiler and a debugger to use: select the option g++ build and debug active file.
  7. Initially the program will not compile because the version of g++ that Visual Studio Code tries to use does not support OpenMP. To fix this, open the tasks.json file in the .vscode folder. In the "command" section, change "/usr/bin/g++" to "g++-13", and in the "args" section add an additional argument "-fopenmp" before the "-g" argument. Save your changes to the tasks.json file.
  8. Go back to the merge.cpp file in the editor and select Start Debugging from the Run menu again. At this point your program should compile and run successfully.
  9. Visual Studio Code has a full set of debugging tools built in. You can set breakpoints in your code, view the value of variables, and step through your code line by line just as you would in Xcode.
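
If you want to double-check your toolchain outside of Visual Studio Code, you can also compile and run the program directly from the terminal (assuming, as in step 7, that Homebrew installed GCC 13 so the compiler is named g++-13):

g++-13 -fopenmp -g merge.cpp -o merge
./merge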

Parallel Quicksort

The merge.cpp file I provided above contains code for an implementation of the quicksort algorithm:

int partition(int A[],int p,int r)
{
    // Median of three
    int a = p;
    int b = (p+r)/2;
    int c = r;
    int use = c;
    if(A[a] > A[b] && A[a] < A[c])
        use = a;
    else if(A[a] < A[b] && A[a] > A[c])
        use = a;
    else if(A[b] < A[a] && A[b] > A[c])
        use = b;
    else if(A[b] > A[a] && A[b] < A[c])
        use = b;
    if(use != c) {
        // Move the chosen median into the pivot position A[r].
        int pivot = A[use];
        A[use] = A[r];
        A[r] = pivot;
    }
    // Standard Lomuto partition with A[r] as the pivot.
    int x = A[r];
    int i = p-1;
    for(int j = p;j < r;j++)
    {
        if(A[j] <= x) 
        {
            i++;
            int temp = A[i];
            A[i] = A[j];
            A[j] = temp;
        }
    }
    int temp = A[i+1];
    A[i+1] = A[r];
    A[r] = temp;
    return i+1;
}

void insertion(int A[],int p,int q) {
    // Insertion sort on A[p..q], inclusive.
    for(int n = p;n < q;n++) {
        int k = n;
        int x = A[k+1];
        while(k >= p && A[k] > x) {
            A[k+1] = A[k];
            k--;
        }
        A[k+1] = x;
    }
}

void quicksort(int A[],int p,int q) {
    if(q - p < 100) {
        // Use insertion sort for small chunks
        insertion(A,p,q);
    } else {
        int r = partition(A,p,q);
        quicksort(A,p,r-1);
        quicksort(A,r+1,q);
    }
}

In the programming assignment below you are going to construct a parallel version of quicksort.

Programming Assignment

The first change we can make to speed quicksort up under OpenMP is to take advantage of an obvious opportunity in the quicksort function. Since the two recursive calls to quicksort operate independently of each other, we can use OpenMP to run those two recursive calls in parallel.

void naive_parallel_quicksort(int A[],int p,int q,int n_threads) {
    if(q - p < 100) {
        // Use insertion sort for small chunks. Checking this case first
        // also keeps us from calling partition on the empty or tiny
        // ranges that a lopsided partition can produce.
        insertion(A,p,q);
    } else if(n_threads >= 2) {
        int r = partition(A,p,q);
        int n = n_threads/2;
#pragma omp task shared(A)
        naive_parallel_quicksort(A,p,r-1,n);
#pragma omp task shared(A)
        naive_parallel_quicksort(A,r+1,q,n);
    } else {
        int r = partition(A,p,q);
        quicksort(A,p,r-1);
        quicksort(A,r+1,q);
    }
}

Here I am using a different OpenMP construct than I used in the merge sort example: we have replaced parallel sections with OpenMP tasks. When you use a task architecture in OpenMP, the library sets up a pool of worker threads and then passes tasks to those threads each time the program encounters a task pragma.

Here is the main function for a test program that runs both the original non-parallel and the parallel versions of quicksort and times them. You should see a slight improvement in the run time for the simple parallel version of quicksort.

#include <iostream>
#include <ctime>
#include <stdlib.h>
#include <omp.h>

int main (int argc, const char * argv[])
{
#define SIZE 500000000
    
    std::cout << "Preparing unsorted array." << std::endl;
    int* B = new int[SIZE];
    for(int i = 0;i < SIZE;i++)
        B[i] = rand() % SIZE;
    int* A = new int[SIZE];
    for(int i = 0;i < SIZE;i++)
        A[i] = B[i];
    
    std::cout << "Starting sequential sort." << std::endl;
    time_t now = time(NULL);
    clock_t start = clock();
    quicksort(A,0,SIZE-1);
    time_t later = time(NULL);
    clock_t end = clock();
    std::cout << "Quicksort took " << (later-now) << " seconds." << std::endl;
    std::cout << "CPU time was " << (end-start)/CLOCKS_PER_SEC << " seconds." << std::endl;
    
    int n = omp_get_max_threads();
    int k = 2;
    while(2*k <= n)
        k *= 2;
    
    for(int i = 0;i < SIZE;i++)
        A[i] = B[i];

    std::cout << "Naive parallel using " << k << " threads out of a maximum of " << n << "." << std::endl;
    
    now = time(NULL);
    start = clock();
#pragma omp parallel
{
    // Only one thread makes the top-level call; the tasks it creates
    // are executed by the whole team of threads.
#pragma omp single
    naive_parallel_quicksort(A,0,SIZE-1,k);
#pragma omp taskwait
}
    later = time(NULL);
    end = clock();
    std::cout << "Naive parallel quicksort took " << (later-now) << " seconds." << std::endl;
    std::cout << "CPU time was " << (end-start)/CLOCKS_PER_SEC << " seconds." << std::endl;

    delete[] A;
    delete[] B;
    return 0;
}

Because we are using a task architecture for this program, the pragmas needed to launch the parallel sorting process look a little different. The parallel pragma starts a team of worker threads, and the single pragma ensures that only one of those threads makes the initial call to naive_parallel_quicksort; the tasks that call creates are then shared out across the whole team. The implicit barrier at the end of the parallel region guarantees that every task has finished before we record the ending time.

We can further improve the performance of the parallel quicksort by also taking advantage of parallelism in its partition function. Here is the outline of a strategy for making a parallel partition.

  1. Start by breaking A[p..q-1] into n_threads equal-sized subregions.
  2. Use a parallel for loop to run a partitioning process on each of the subregions in parallel. This organizes each subregion into a block of numbers that are less than or equal to the pivot followed by a block of numbers that are greater than the pivot.
  3. Run a process that combines neighboring subblocks. To combine a pair of subblocks we need to swap the large numbers at the end of the first subblock with the small numbers at the front of the second subblock. You can use a parallel for loop to do these swaps.
  4. Each pass of this process merges pairs of neighboring blocks into larger blocks. Repeat until only a single block running from p to q-1 remains.

Run your program with this more fully parallel quicksort and the original quicksort and time them both. You should see a noticeable improvement in the run time of the parallel version.

Helpful hints

To manage the subblocks you should set up a simple class:

class pBlock {
public:
    int start;  // index of the first element in the subblock
    int mid;    // index of the first element greater than the pivot
    int end;    // index of the last element in the subblock
};

When you divide A[p..q-1] into subblocks you should create a pBlock object for each of the subblocks you set up. At the start, each pBlock will have its start and end fields set. Then, to partition a subblock, pass a reference to its pBlock, along with the array and the pivot, to a function that partitions the subblock. That function should set the value of the mid field, which specifies the location of the first number in the subblock that is greater than the pivot.
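
As a concrete illustration, here is one possible shape for that helper. The name partition_block is made up for this sketch, and your version may differ:

void partition_block(int A[], pBlock &block, int pivot)
{
    // Lomuto-style pass over A[block.start..block.end]: move everything
    // less than or equal to the pivot to the front of the subblock.
    int i = block.start - 1;
    for(int j = block.start; j <= block.end; j++) {
        if(A[j] <= pivot) {
            i++;
            int temp = A[i];
            A[i] = A[j];
            A[j] = temp;
        }
    }
    block.mid = i + 1;  // first element greater than the pivot
}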

To set up a parallel for loop that works with the array A and our worker threads use the following pragma before the for loop:

#pragma omp taskloop shared(A)
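
For example, the first phase of the parallel partition might drive the helper sketched above with a loop like this, where blocks and n_blocks are hypothetical names for an array of pBlock objects and its length. Keep in mind that a taskloop, like a task, only runs in parallel when it is encountered inside a parallel region (for instance inside the parallel/single pattern used in the main function above):

#pragma omp taskloop shared(A, blocks)
for(int b = 0;b < n_blocks;b++)
    partition_block(A, blocks[b], pivot);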