Thread conflicts and race conditions

Threads give a program the ability to carry out more than one process at the same time. Very often when programs make use of threads the threads will be working with shared data. When multiple threads attempt to work with the same piece of data at the same time there is the potential for conflict between the threads. In these notes I will explain how threads may come into conflict and introduce a commonly used regulatory mechanism that programs can use to prevent thread conflicts.

An example

Here is the code for a simple program that illustrates the potential for thread conflict.

#include <stdio.h>
#include <pthread.h>

void* count(void* arg) {
    int* counter = (int*) arg;
    for(int n = 0;n < 10000;n++)
        (*counter)++;
}

int main() {
    pthread_t t1, t2;
    int counter = 0;

    pthread_create(&t1,NULL,count,&counter);
    pthread_create(&t2,NULL,count,&counter);
    pthread_join(t1,NULL);
    pthread_join(t2,NULL);

    printf("Counter is %d\n",counter);
    return 0;
}

The program launches two threads, each of which uses the same thread function. That thread function attempts to increment the shared counter 10000 times. We launch the threads as usual with a call to pthread_create() and then call pthread_join() for each thread. The pthread_join() function will cause main() to wait until the given thread is done running. That way we give both threads a chance to finish their work before we try to print the value of the counter variable.

In this program we have a shared resource, the counter variable in main(). We then create two threads whose thread functions will each receive a pointer to that shared variable. Both of the threads run a loop that attempts to increment the counter 10000 times. After both threads have stopped running we would expect the counter variable to have a final value of 20000. Instead, the final result for the counter variable is almost never 20000. The final value on each run will vary just about every time we run the program, and the final value will almost never be the correct value of 20000.

What is going on here? The first thing to understand is that both threads are running the same code, and that both thread functions will have a pointer to the same counter variable. The potential for conflict arises when both functions try to manipulate the counter variable at the same time. The only place in the code where this happens is in the statement that increments the counter.

Part of what is going on here is that the statement that increments the counter appears to be an atomic operation that can be completed in a single step. In fact, even a simple increment is not an operation that can be executed by the CPU as a single step. To do an increment of the counter variable, the CPU will have to move the current value of the counter variable to a register, increment that register by one, and then move the new value back to the counter variable. What appears to be a single step is actually three. Once an operation takes more than one instruction to complete, we open the possibility of what is known as a race condition. In a race condition two or more threads are trying to complete the same sequence of instructions. Something the threads have to contend with is the fact that the operating system often stops threads at an arbitrary point and sets them aside to give other threads a chance to run. If one of the threads that is trying to increment the counter gets interrupted part way through the increment and gets to complete increment much later, this can result in that thread putting the wrong value back into the counter.

Making operations atomic via the use of mutexs

The solution for race conditions and many other thread conflicts is to use an additional regulatory mechanism to prevent more than one thread from accessing a critical section of code at the same time.

Here is a second version of the program that demonstrates how to do this:

#include <stdio.h>
#include <pthread.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void* count(void* arg) {
    int* counter = (int*) arg;
    for(int n = 0;n < 10000;n++) {
        pthread_mutex_lock(&mutex);
        (*counter)++;
        pthread_mutex_unlock(&mutex);
    }
}

int main() {
    pthread_t t1, t2;
    int counter = 0;

    pthread_create(&t1,NULL,count,&counter);
    pthread_create(&t2,NULL,count,&counter);
    pthread_join(t1,NULL);
    pthread_join(t2,NULL);

    printf("Counter is %d\n",counter);
    return 0;
}

The solution here is to introduce a mutex. The term "mutex" is short for "mutual exclusion". A mutex allows one thread to lock down a section of code so that as long as that thread is inside the protected section no other thread will be able to enter.

The code here shows how to initialize a mutex. Since the mutex needs to be shared by multiple threads, we declare the mutex variable to have type static, which effectively makes it a shared global resource that all the threads can see and use.

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

To use a mutex, a thread starts by locking the mutex via a call to pthread_mutex_lock(). Locking the mutex then affects calls to that same function made by other threads. If one thread calls pthread_mutex_lock() any subsequent call to that same function made by other threads will block, causing the other threads to stop running at that spot until the situation changes. After the first thread has completed its work it will call pthread_mutex_unlock() to unlock the mutex and leave the protected section of code. As soon as the mutex gets unlocked the operating system will pick one of the threads that is currently blocked at the pthread_mutex_lock() call and allow it to procede. That in turn locks the mutex again while the second thread is running through the critical section.

This regulatory mechanism solves the problem caused by the original race condition. This version of the program will always do the right thing. When the program prints the final value for the counter it will always have the correct value of 20000.