Thread Synchronization in User Mode
August 15, 2011 Leave a comment
Threads need to communicate with each other in two basic situations:
- When you have multiple threads accessing a shared resource in such a way that the resource does not become corrupt.
- When one thread needs to notify one or more other threads that a specific task has been completed.
Atomic Access: The Interlocked Family of Functions
A big part of thread synchronization has to do with atomic access—a thread’s ability to access a resource with the guarantee that no other thread will access that same resource at the same time.
Consider the following:
// Define a global variable.
long g_x = 0;
DWORD WINAPI ThreadFunc1(PVOID pvParam) {
g_x++;
return(0);
}
DWORD WINAPI ThreadFunc2(PVOID pvParam) {
g_x++;
return(0);
}
We create two threads: one thread executes ThreadFunc1, and the other thread executes ThreadFunc2.
If one thread executes this code followed by another thread, here is what effectively executes:
MOV EAX, [g_x] ; Thread 1: Move 0 into a register. INC EAX ; Thread 1: Increment the register to 1. MOV [g_x], EAX ; Thread 1: Store 1 back in g_x. MOV EAX, [g_x] ; Thread 2: Move 1 into a register. INC EAX ; Thread 2: Increment the register to 2. MOV [g_x], EAX ; Thread 2: Store 2 back in g_x.
Windows is a preemptive, multithreaded environment. So a thread can be switched away from at any time and another thread might continue executing at any time.
MOV EAX, [g_x] ; Thread 1: Move 0 into a register. INC EAX ; Thread 1: Increment the register to 1. MOV EAX, [g_x] ; Thread 2: Move 0 into a register. INC EAX ; Thread 2: Increment the register to 1. MOV [g_x], EAX ; Thread 2: Store 1 back in g_x. MOV [g_x], EAX ; Thread 1: Store 1 back in g_x.
To solve the problem just presented we need to guarantee that the incrementing of the value is done atomically—that is, without interruption. The interlocked family of functions provides the solution we need. All the functions manipulate a value atomically. Take a look at InterlockedExchangeAdd and its sibling InterlockedExchangeAdd64 that works on LONGLONG values:
No thread should ever attempt to modify the shared variable by using simple C++ statements:
// The long variable shared by many threads LONG g_x; ... // Incorrect way to increment the long g_x++; ... // Correct way to increment the long InterlockedExchangeAdd(&g_x, 1);
You must also ensure that the variable addresses that you pass to these functions are properly aligned or the functions might fail. The C run-time library offers an _aligned_malloc function that you can use to allocate a block of memory that is properly aligned.
InterlockedExchange is extremely useful when you implement a spinlock.
// Global variable indicating whether a shared resource is in use or not
BOOL g_fResourceInUse = FALSE; ...
void Func1() {
// Wait to access the resource.
while (InterlockedExchange (&g_fResourceInUse, TRUE) == TRUE)
Sleep(0);
// Access the resource.
...
// We no longer need to access the resource.
InterlockedExchange(&g_fResourceInUse, FALSE);
}
- This code assumes that all threads using the spinlock run at the same priority level. You might also want to disable thread priority boosting.
- You should ensure that the lock variable and the data that the lock protects are maintained in different cache lines.
- You should avoid using spinlocks on single-CPU machines. If a thread is spinning, it’s wasting precious CPU time, which prevents the other thread from changing the value.
You have access to a series of functions that allow you to easily manipulate a stack called an Interlocked Singly Linked List. Each operation, such as pushing or popping an element, is assured to be executed in an atomic way.
Cache Lines
If you want to build a high-performance application that runs on multiprocessor machines, you must be aware of CPU cache lines. When a CPU reads a byte from memory, it does not just fetch the single byte; it fetches enough bytes to fill a cache line. Cache lines consist of 32 (for older CPUs), 64, or even 128 bytes (depending on the CPU), and they are always aligned on 32-byte, 64-byte, or 128-byte boundaries, respectively. Cache lines exist to improve performance. Usually, an application manipulates a set of adjacent bytes. If these bytes are in the cache, the CPU does not have to access the memory bus, which requires much more time.
However, cache lines make memory updates more difficult in a multiprocessor environment, as you can see in this example:
- CPU1 reads a byte, causing this byte and its adjacent bytes to be read into CPU1′s cache line.
- CPU2 reads the same byte, which causes the same bytes in step 1 to be read into CPU2′s cache line.
- CPU1 changes the byte in memory, causing the byte to be written to CPU1′s cache line. But the information is not yet written to RAM.
- CPU2 reads the same byte again. Because this byte was already in CPU2′s cache line, it doesn’t have to access memory. But CPU2 will not see the new value of the byte in memory.
What all this means is that you should group your application’s data together in cache line—size chunks and on cache-line boundaries. The goal is to make sure that different CPUs access different memory addresses separated by at least a cache-line boundary. Also, you should separate your read-only data (or infrequently read data) from read-write data. And you should group together pieces of data that are accessed around the same time.
struct CUSTINFO {
DWORD dwCustomerID; // Mostly read-only
int nBalanceDue; // Read-write
wchar_t szName[100]; // Mostly read-only
FILETIME ftLastOrderDate; // Read-write
};
you can use the C/C++ compiler's __declspec(align(#)) directive to control field alignment. Here is an improved version of this structure:
#define CACHE_ALIGN 64
// Force each structure to be in a different cache line.
struct __declspec(align(CACHE_ALIGN)) CUSTINFO {
DWORD dwCustomerID; // Mostly read-only
wchar_t szName[100]; // Mostly read-only
// Force the following members to be in a different cache line.
__declspec(align(CACHE_ALIGN))
int nBalanceDue; // Read-write
FILETIME ftLastOrderDate; // Read-write
};
It is best for data to be always accessed by a single thread (function parameters and local variables are the easiest way to ensure this) or for the data to be always accessed by a single CPU (using thread affinity). If you do either of these, you avoid cache-line issues entirely.
Critical Sections
A critical section is a small section of code that requires exclusive access to some shared resource before the code can execute. This is a way to have several lines of code "atomically" manipulate a resource. By atomic, I mean that the code knows that no other thread will access the resource. Of course, the system can still preempt your thread and schedule other threads. However, it will not schedule any other threads that want to access the same resource until your thread leaves the critical section.
Here is some problematic code that demonstrates what happens without the use of a critical section:
const int COUNT = 1000;
int g_nSum = 0;
DWORD WINAPI FirstThread(PVOID pvParam) {
g_nSum = 0;
for (int n = 1; n <= COUNT; n++) {
g_nSum += n;
}
return(g_nSum);
}
DWORD WINAPI SecondThread(PVOID pvParam) {
g_nSum = 0;
for (int n = 1; n <= COUNT; n++) {
g_nSum += n;
}
return(g_nSum);
}
Let’s correct the code using a critical section:
const int COUNT = 10;
int g_nSum = 0;
CRITICAL_SECTION g_cs;
DWORD WINAPI FirstThread(PVOID pvParam) {
EnterCriticalSection(&g_cs);
g_nSum = 0;
for (int n = 1; n <= COUNT; n++) {
g_nSum += n;
}
LeaveCriticalSection(&g_cs);
return(g_nSum);
}
DWORD WINAPI SecondThread(PVOID pvParam) {
EnterCriticalSection(&g_cs);
g_nSum = 0;
for (int n = 1; n <= COUNT; n++) {
g_nSum += n;
}
LeaveCriticalSection(&g_cs);
return(g_nSum);
}
The great thing about critical sections is that they are easy to use and they use the interlocked functions internally, so they execute quickly. The major disadvantage of critical sections is that you cannot use them to synchronize threads in multiple processes.
To use critical sections:
- All threads that want to access the resource must know the address of the CRITICAL_SECTION structure that protects the resource.
- The members within the CRITICAL_SECTION structure be initialized before any threads attempt to access the protected resource. The structure is initialized via a call to VOID InitializeCriticalSection(PCRITICAL_SECTION pcs);
- When you know that your process’ threads will no longer attempt to access the shared resource, you should clean up the CRITICAL_SECTION structure by calling this function: VOID DeleteCriticalSection(PCRITICAL_SECTION pcs);
- When you write code that touches a shared resource, you must prefix that code with a call to: VOID EnterCriticalSection(PCRITICAL_SECTION pcs);
- At the end of your code that touches the shared resource, you must call this function: VOID LeaveCriticalSection(PCRITICAL_SECTION pcs);
Critical Sections and Spin Locks
When a thread attempts to enter a critical section owned by another thread, the calling thread is placed immediately into a wait state. This means that the thread must transition from user mode to kernel mode (about 1000 CPU cycles). This transition is very expensive. On a multiprocessor machine, the thread that currently owns the resource might execute on a different processor and might relinquish control of the resource shortly. In fact, the thread that owns the resource might release it before the other thread has completed executing its transition into kernel mode. If this happens, a lot of CPU time is wasted.
To improve the performance of critical sections, Microsoft has incorporated spinlocks into them. So when EnterCriticalSection is called, it loops using a spinlock to try to acquire the resource some number of times. Only if all the attempts fail does the thread transition to kernel mode to enter a wait state.
To use a spinlock with a critical section, you should initialize the critical section by calling this function:
BOOL InitializeCriticalSectionAndSpinCount( PCRITICAL_SECTION pcs, DWORD dwSpinCount);
Slim Reader-Writer Locks
An SRWLock has the same purpose as a simple critical section: to protect a single resource against access made by different threads. However, unlike a critical section, an SRWLock allows you to distinguish between threads that simply want to read the value of the resource (the readers) and other threads that are trying to update this value (the writers). It should be possible for all readers to access the shared resource at the same time because there is no risk of data corruption if you only read the value of a resource. The need for synchronization begins when a writer thread wants to update the resource. In that case, the access should be exclusive: no other thread, neither a reader nor a writer, should be allowed to access the resource. This is exactly what an SRWLock allows you to do in your code and in a very explicit way.
| VS | SRW Owner | |
| Request Owner | Reader | Writer |
| Reader | Allow | Block |
| Writer | Block | Block |
As we see from the table, that SRWLocks are very suitable when Readers are more than Writers.
This article is a very good one to understand SRWLocks http://blogs.msdn.com/b/matt_pietrek/archive/2006/10/19/slim-reader-writer-locks.aspx
To use SRWLocks:
- First, you allocate an SRWLOCK structure and initialize it with the InitializeSRWLock function: VOID InitializeSRWLock(PSRWLOCK SRWLock);
- For readers:
- Thread can try to acquire an exclusive access to the resource protected by the SRWLock by calling AcquireSRWLockExclusive with the address of the SRWLOCK object as its parameter: VOID AcquireSRWLockExclusive(PSRWLOCK SRWLock);
- When the resource has been updated, the lock is released by calling ReleaseSRWLockExclusive with the address of the SRWLOCK object as its parameter: VOID ReleaseSRWLockExclusive(PSRWLOCK SRWLock);
- For writers:
- the same two-step scenario occurs but with the following two new functions: VOID AcquireSRWLockShared(PSRWLOCK SRWLock); VOID ReleaseSRWLockShared(PSRWLOCK SRWLock);
If you want to get the best performance in an application, you should try to use nonshared data first and then use volatile reads, volatile writes, interlocked APIs, SRWLocks, critical sections. And if all of these won’t work for your situation, then and only then, use kernel objects.
Condition Variables
You have seen that an SRWLock is used when you want to allow producer and consumer threads access to the same resource either in exclusive or shared mode. In these kinds of situations, if there is nothing to consume for a reader thread, it should release the lock and wait until there is something new produced by a writer thread. If the data structure used to receive the items produced by a writer thread becomes full, the lock should also be released and the writer thread put to sleep until reader threads have emptied the data structure.
Condition Variables are used in scenarios where a thread has to atomically release a lock on a resource and blocks until a condition is met through the SleepConditionVariableCS or SleepConditionVariableSRW functions.
A thread blocked inside these Sleep* functions is awakened when WakeConditionVariable or WakeAllConditionVariable is called by another thread that detects that the right condition is satisfied, such as the presence of an element to consume for a reader thread or enough room to insert a produced element for a writer thread.
This article solves the well known consumer/producer problem using condition variables with critical sections.
API Table