Thread Synchronization with Kernel Objects

Content

  1. Introduction
  2. Wait Functions
  3. Event Kernel Objects
  4. Waitable Timer Kernel Objects
  5. Semaphore Kernel Objects
  6. Mutex Kernel Objects
  7. Mutexes vs Critical Sections

Introduction

Although user-mode thread synchronization mechanisms offer great performance, they do have limitations, such as:

  • You can use critical sections to place a thread in a wait state, but you can use them only to synchronize threads contained within a single process
  • You can easily get into deadlock situations with critical sections because you cannot specify a timeout value while waiting to enter the critical section.

The drawback of using Kernel Objects is their performance, the transition from user-mode to kernel-mode is costly: it takes about 200 CPU cycles on the x86 platform for an empty system call—and this, of course, does not include the execution of the kernel-mode code that actually implements the function your thread is calling. But what takes several orders of magnitude more is the overhead of scheduling a new thread with all the cache flushes/misses it entails. Here we’re talking about tens of thousands of cycles.

The following kernel objects can be in a signaled or nonsignaled state:

  • Processes
  • Threads
  • Jobs
  • File and console standard input/output/error streams
  • Events
  • Waitable timers
  • Semaphores
  • Mutexes

Wait Functions

DWORD dw = WaitForSingleObject(hProcess, 5000);
switch (dw) {

   case WAIT_OBJECT_0:
     // The process terminated.
     break;

   case WAIT_TIMEOUT:
      // The process did not terminate within 5000 milliseconds.
      break;

   case WAIT_FAILED:
      // Bad call to function (invalid handle?)
      break;
}

The preceding code tells the system that the calling thread should not be schedulable until either the specified process has terminated or 5000 milliseconds have expired, whichever comes first. So this call returns in less than 5000 milliseconds if the process terminates, and it returns in about 5000 milliseconds if the process hasn’t terminated. Note that you can pass 0 for the dwMilliseconds parameter. If you do this, WaitForSingleObject always returns immediately, even if the wait condition hasn’t been satisfied.

HANDLE3];
h[0] = hProcess1;
h[1] = hProcess2;
h[2] = hProcess3;
DWORD dw = WaitForMultipleObjects(3, h, FALSE, 5000);
switch (dw) {
   case WAIT_FAILED:
      // Bad call to function (invalid handle?)
      break;

   case WAIT_TIMEOUT:
      // None of the objects became signaled within 5000 milliseconds.
      break;

   case WAIT_OBJECT_0 + 0:
      // The process identified by h[0] (hProcess1) terminated.
      break;

   case WAIT_OBJECT_0 + 1:
      // The process identified by h[1] (hProcess2) terminated.
      break;

   case WAIT_OBJECT_0 + 2:
      // The process identified by h[2] (hProcess3) terminated.
      break;
}

Successful Wait Side Effects

For some kernel objects, a successful call to WaitForSingleObject or WaitForMultiple-Objects actually alters the state of the object. A successful call is one in which the function sees that the object was signaled and returns a value relative to WAIT_OBJECT_0. A call is unsuccessful if the function returns WAIT_TIMEOUT or WAIT_FAILED. Objects never have their state altered for unsuccessful calls.

Let’s look at an example. Two threads call WaitForMultipleObjects in exactly the same way:

HANDLE h[2];
h[0] = hAutoResetEvent1;   // Initially nonsignaled
h[1] = hAutoResetEvent2;   // Initially nonsignaled
WaitForMultipleObjects(2, h, TRUE, INFINITE);

When WaitForMultipleObjects is called, both event objects are nonsignaled; this forces both threads to enter a wait state. Then the hAutoResetEvent1 object becomes signaled. Both threads see that the event has become signaled, but neither can wake up because the hAutoResetEvent2 object is still nonsignaled. Because neither thread has successfully waited yet, no side effect happens to the hAutoResetEvent1 object.

Next, the hAutoResetEvent2 object becomes signaled. At this point, one of the two threads detects that both objects it is waiting for have become signaled. The wait is successful, both event objects are set to the nonsignaled state, and the thread is schedulable. But what about the other thread? It continues to wait until it sees that both event objects are signaled. Even though it originally detected that hAutoResetEvent1 was signaled, it now sees this object as nonsignaled.

If multiple threads wait for a single kernel object, which thread does the system decide to wake up when the object becomes signaled?  "The algorithm is fair." which means that if multiple threads are waiting, each should get its own chance to wake up each time the object becomes signaled.

Event Kernel Objects

Events signal that an operation has completed. There are two different types of event objects: manual-reset events and auto-reset events. When a manual-reset event is signaled, all threads waiting on the event become schedulable. When an auto-reset event is signaled, only one of the threads waiting on the event becomes schedulable.

Once an event is created, you control its state directly. When you call SetEvent, you change the event to the signaled state:

BOOL SetEvent(HANDLE hEvent);

When you call ResetEvent, you change the event to the nonsignaled state:

BOOL ResetEvent(HANDLE hEvent);

It’s that easy.

an auto-reset event is automatically reset to the nonsignaled state when a thread successfully waits on the object.

Waitable Timer Kernel Objects

Waitable timers are kernel objects that signal themselves at a certain time or at regular intervals. They are most commonly used to have some operation performed at a certain time.

Waitable timer objects are always created in the nonsignaled state. You must call the SetWaitable-Timer function to tell the timer when you want it to become signaled.

The following code sets a timer to go off for the first time on January 1, 2008, at 1:00 P.M., and then to go off every six hours after that:

// Declare our local variables.
HANDLE hTimer;
SYSTEMTIME st;
FILETIME ftLocal, ftUTC;
LARGE_INTEGER liUTC;

// Create an auto-reset timer.
hTimer = CreateWaitableTimer(NULL, FALSE, NULL);

// First signaling is at January 1, 2008, at 1:00 P.M. (local time).
st.wYear         = 2008; // Year
st.wMonth        = 1;    // January
st.wDayOfWeek    = 0;    // Ignored
st.wDay          = 1;    // The first of the month
st.wHour         = 13;   // 1PM
st.wMinute       = 0;    // 0 minutes into the hour
st.wSecond       = 0;    // 0 seconds into the minute
st.wMilliseconds = 0;    // 0 milliseconds into the second

SystemTimeToFileTime(&st, &ftLocal);

// Convert local time to UTC time.
LocalFileTimeToFileTime(&ftLocal, &ftUTC);
// Convert FILETIME to LARGE_INTEGER because of different alignment.
liUTC.LowPart  = ftUTC.dwLowDateTime;
liUTC.HighPart = ftUTC.dwHighDateTime;

// Set the timer.
SetWaitableTimer(hTimer, &liUTC, 6 * 60 * 60 * 1000,
   NULL, NULL, FALSE); ...

Instead of setting an absolute time that the timer should first go off, you can have the timer go off at a time relative to calling SetWaitableTimer. You simply pass a negative value in the pDueTime parameter. The value you pass must be in 100-nanosecond intervals. Because we don’t normally think in intervals of 100 nanoseconds, you might find this useful: 1 second = 1,000 milliseconds = 1,000,000 microseconds = 10,000,000 100-nanoseconds.

The following code sets a timer to initially go off 5 seconds after the call to SetWaitableTimer:

// Declare our local variables.
HANDLE hTimer;
LARGE_INTEGER li;

// Create an auto-reset timer.
hTimer = CreateWaitableTimer(NULL, FALSE, NULL);

// Set the timer to go off 5 seconds after calling SetWaitableTimer.
// Timer unit is 100 nanoseconds.
const int nTimerUnitsPerSecond = 10000000;

// Negate the time so that SetWaitableTimer knows we
// want relative time instead of absolute time.
li.QuadPart = -(5 * nTimerUnitsPerSecond);

// Set the timer.
SetWaitableTimer(hTimer, &li, 6 * 60 * 60 * 1000,
   NULL, NULL, FALSE); ...

Waitable Timers vs User Timers

The biggest difference is that User timers require a lot of additional user interface infrastructure in your application, which makes them more resource intensive. Also, waitable timers are kernel objects, which means that they can be shared by multiple threads and are securable.

  • User timers generate WM_TIMER messages that come back to the thread that called SetTimer (for callback timers) or the thread that created the window (for window-based timers). So only one thread is notified when a User timer goes off. Multiple threads, on the other hand, can wait on waitable timers, and several threads can be scheduled if the timer is a manual-reset timer.
  • With waitable timers, you’re more likely to be notified when the time actually expires. The WM_TIMER messages are always the lowest-priority messages and are retrieved when no other messages are in a thread’s queue.

Semaphore Kernel Objects

Semaphore kernel objects are used for resource counting. They contain a usage count, as all kernel objects do, but they also contain two additional signed 32-bit values: a maximum resource count and a current resource count. The maximum resource count identifies the maximum number of resources that the semaphore can control; the current resource count indicates the number of these resources that are currently available.

The rules for a semaphore are as follows:

  • If the current resource count is greater than 0, the semaphore is signaled.
  • If the current resource count is 0, the semaphore is nonsignaled.
  • The system never allows the current resource count to be negative.
  • The current resource count can never be greater than the maximum resource count.

A thread gains access to a resource by calling a wait function, passing the handle of the semaphore guarding the resource. Internally, the wait function checks the semaphore’s current resource count and if its value is greater than 0 (the semaphore is signaled), the counter is decremented by 1 and the calling thread remains schedulable.

Unfortunately, there is just no way to get the current resource count of a semaphore without altering it.

Mutex Kernel Objects

Mutex kernel objects ensure that a thread has mutual exclusive access to a single resource. A mutex object contains a usage count, thread ID, and recursion counter. Mutexes behave identically to critical sections. However, mutexes are kernel objects, while critical sections are user-mode synchronization objects.

This means that mutexes are slower than critical sections. But it also means that threads in different processes can access a single mutex, and it means that a thread can specify a timeout value while waiting to gain access to a resource.

The rules for a mutex are as follows:

  • If the thread ID is 0 (an invalid thread ID), the mutex is not owned by any thread and is signaled.
  • If the thread ID is a nonzero value, a thread owns the mutex and the mutex is nonsignaled.

A thread gains access to the shared resource by calling a wait function, passing the handle of the mutex guarding the resource. Internally, the wait function checks the thread ID to see if it is 0 (the mutex is signaled). If the thread ID is 0, the thread ID is set to the calling thread’s ID, the recursion counter is set to 1, and the calling thread remains schedulable.

Every time a thread successfully waits on a mutex, the object’s recursion counter is incremented. The only way the recursion counter can have a value greater than 1 is if the thread waits on the same mutex multiple times.

Abandonment Issues

So if a thread owning a mutex terminates (using ExitThread, TerminateThread, ExitProcess, or TerminateProcess) before releasing the mutex, the system considers the mutex to be abandoned— the thread that owns it can never release it because the thread has died.

Because the system keeps track of all mutex and thread kernel objects, it knows exactly when mutexes become abandoned. When a mutex becomes abandoned, the system automatically resets the mutex object’s thread ID to 0 and its recursion counter to 0. Then the system checks to see whether any threads are currently waiting for the mutex.

This is the same as before except that the wait function does not return the usual WAIT_OBJECT_0 value to the thread. Instead, the wait function returns the special value of WAIT_ABANDONED. This special return value (which applies only to mutex objects) indicates that the mutex the thread was waiting on was owned by another thread that was terminated before it finished using the shared resource. The newly scheduled thread has no idea what state the resource is currently in—the resource might be totally corrupt.

In real life, most applications never check explicitly for the WAIT_ABANDONED return value because a thread is rarely just terminated. (This whole discussion provides another great example of why you should never call the TerminateThread function.)

Mutex vs Critical Section

Characteristic

Mutex

Critical Section

Mode Kernel Mode User Mode
Performance Slow Fast
Can be used across process boundaries? Yes No
Declaration HANDLE hMutex; CRITICAL_SECTION cs;
Initialization hMutext = CreateMutex(NULL, FALSE, NULL); InitializeCriticalSection(&s);
Cleanup CloseHandle(hMutext); DeleteCriticalSection(&cs);
Infinite Wait WaitForSingleObject(hMutex, INFINITE); EnterCriticalSection(&cs);
0 Wait WaitForSingleObject(hMutex, 0); TryEnterCriticalSection(&cs);
Arbitrary Wait WaitForSingleObject(hMutex, dwMilliseconds); N/A
Release ReleaseMutext(hMutext); LeaveCriticalSection(&cs);
Can be waited on with other kernel objects? Yes (e.g WaitForMultipleObjects or similar) No

 

API Table

Function

Description

DWORD WaitForSingleObject(
   HANDLE hObject,
   DWORD dwMilliseconds);

When a thread calls this function, the first parameter, hObject, identifies a kernel object that supports being signaled/nonsignaled. (Any object mentioned in the list on the previous page works just great.) The second parameter, dwMilliseconds, allows the thread to indicate how long it is willing to wait for the object to become signaled.

The following function call tells the system that the calling thread wants to wait until the process identified by the hProcess handle terminates:

WaitForSingleObject(hProcess, INFINITE);

The second parameter tells the system that the calling thread is willing to wait forever (an infinite amount of time) or until this process terminates.

DWORD WaitForMultipleObjects(
   DWORD dwCount,
   CONST HANDLE* phObjects,
   BOOL bWaitAll,
   DWORD dwMilliseconds);

The following function, WaitForMultipleObjects, is similar to WaitForSingleObject except that it allows the calling thread to check the signaled state of several kernel objects simultaneously.

The dwCount parameter indicates the number of kernel objects you want the function to check. This value must be between 1 and MAXIMUM_WAIT_OBJECTS (defined as 64 in the WinNT.h header file). The phObjects parameter is a pointer to an array of kernel object handles.

HANDLE CreateEvent(
   PSECURITY_ATTRIBUTES psa,
   BOOL bManualReset,
   BOOL bInitialState,
   PCTSTR pszName);

creates an event kernel object.

The bManualReset parameter is a Boolean value that tells the system whether to create a manualreset event (TRUE) or an auto-reset event (FALSE). The bInitialState parameter indicates whether the event should be initialized to signaled (TRUE) or nonsignaled (FALSE).

HANDLE CreateEventEx(
   PSECURITY_ATTRIBUTES psa,
   PCTSTR pszName,
   DWORD dwFlags,
   DWORD dwDesiredAccess);

allows you to create or open a potentially existing event requesting reduced access

HANDLE OpenEvent(
   DWORD dwDesiredAccess,
   BOOL bInherit,
   PCTSTR pszName);

Threads in other processes can gain access to the object by calling CreateEvent using the same value passed in the pszName parameter;

BOOL SetEvent(HANDLE hEvent);

change the event to the signaled state

BOOL ResetEvent(HANDLE hEvent);

change the event to the nonsignaled state

BOOL PulseEvent(HANDLE hEvent);

makes an event signaled and then immediately nonsignaled; it’s just like calling Set-Event immediately followed by ResetEvent. If you call PulseEvent on a manual-reset event, any and all threads waiting on the event when it is pulsed are schedulable. If you call PulseEvent on an auto-reset event, only one waiting thread becomes schedulable. If no threads are waiting on the event when it is pulsed, there is no effect.

HANDLE CreateWaitableTimer(
   PSECURITY_ATTRIBUTES psa,
   BOOL bManualReset,
   PCTSTR pszName);

create a waitable timer.

the bManualReset parameter indicates a manual-reset or auto-reset timer. When a manual-reset timer is signaled, all threads waiting on the timer become schedulable. When an auto-reset timer is signaled, only one waiting thread becomes schedulable.

HANDLE OpenWaitableTimer(
   DWORD dwDesiredAccess,
   BOOL bInheritHandle,
   PCTSTR pszName);

process can obtain its own process-relative handle to an existing waitable timer by calling

BOOL SetWaitableTimer(
   HANDLE hTimer,
   const LARGE_INTEGER *pDueTime,
   LONG lPeriod,
   PTIMERAPCROUTINE pfnCompletionRoutine,
   PVOID pvArgToCompletionRoutine,
   BOOL bResume);

tell the timer when you want it to become signaled.

Obviously, the hTimer parameter indicates the timer that you want to set. The next two parameters, pDueTime and lPeriod, are used together. The pDueTime parameter indicates when the timer should go off for the first time, and the lPeriod parameter indicates how frequently the timer should go off after that

BOOL CancelWaitableTimer(HANDLE hTimer);

This simple function takes the handle of a timer and cancels it so that the timer never goes off unless there is a subsequent call to SetWaitableTimer to reset the timer.

HANDLE CreateSemaphore(
   PSECURITY_ATTRIBUTE psa,
   LONG lInitialCount,
   LONG lMaximumCount,
   PCTSTR pszName);

creates a semaphore kernel object

HANDLE CreateSemaphoreEx(
   PSECURITY_ATTRIBUTES psa,
   LONG lInitialCount,
   LONG lMaximumCount,
   PCTSTR pszName,
   DWORD dwFlags,
   DWORD dwDesiredAccess);

Same as CreateSemaphore, but you can use the following function to directly provide access rights in the dwDesiredAccess parameter. Notice that the dwFlags is reserved and should be set to 0.

HANDLE OpenSemaphore(
   DWORD dwDesiredAccess,
   BOOL bInheritHandle,
   PCTSTR pszName);

another process can obtain its own process-relative handle to an existing semaphore.

BOOL ReleaseSemaphore(                
   HANDLE hSemaphore,
   LONG lReleaseCount,
   PLONG plPreviousCount);

A thread increments a semaphore’s current resource count.

This function simply adds the value in lReleaseCount to the semaphore’s current resource count.

HANDLE CreateMutex(
   PSECURITY_ATTRIBUTES psa,
   BOOL bInitialOwner,
   PCTSTR pszName);

To use a mutex, one process must first create the mutex.

HANDLE CreateMutexEx(
   PSECURITY_ATTRIBUTES psa,
   PCTSTR pszName,
   DWORD dwFlags,
   DWORD dwDesiredAccess);

You can also use the following function to directly provide access rights in the dwDesiredAccess parameter. The dwFlags parameter replaces the bInitialOwned parameter of CreateMutex: 0 means FALSE, and CREATE_MUTEX_ INITIAL_OWNER is equivalent to TRUE.

HANDLE OpenMutex(
   DWORD dwDesiredAccess,
   BOOL bInheritHandle,
   PCTSTR pszName);

another process can obtain its own process-relative handle to an existing mutex

BOOL ReleaseMutex(HANDLE hMutex);

When the thread that currently has access to the resource no longer needs its access, it must release the mutex

References

Windows® via C/C++, Fifth Edition

Thread Synchronization in User Mode

Threads need to communicate with each other in two basic situations:

  • When you have multiple threads accessing a shared resource in such a way that the resource does not become corrupt.
  • When one thread needs to notify one or more other threads that a specific task has been completed.

Atomic Access: The Interlocked Family of Functions

A big part of thread synchronization has to do with atomic access—a thread’s ability to access a resource with the guarantee that no other thread will access that same resource at the same time.

Consider the following:

// Define a global variable.
long g_x = 0;

DWORD WINAPI ThreadFunc1(PVOID pvParam) {
   g_x++;
   return(0);
}

DWORD WINAPI ThreadFunc2(PVOID pvParam) {
   g_x++;
   return(0);
}

We create two threads: one thread executes ThreadFunc1, and the other thread executes ThreadFunc2.

If one thread executes this code followed by another thread, here is what effectively executes:

MOV EAX, [g_x]       ; Thread 1: Move 0 into a register.
INC EAX              ; Thread 1: Increment the register to 1.
MOV [g_x], EAX       ; Thread 1: Store 1 back in g_x.

MOV EAX, [g_x]       ; Thread 2: Move 1 into a register.
INC EAX              ; Thread 2: Increment the register to 2.
MOV [g_x], EAX       ; Thread 2: Store 2 back in g_x.

Windows is a preemptive, multithreaded environment. So a thread can be switched away from at any time and another thread might continue executing at any time.

MOV EAX, [g_x]       ; Thread 1: Move 0 into a register.
INC EAX              ; Thread 1: Increment the register to 1.

MOV EAX, [g_x]       ; Thread 2: Move 0 into a register.
INC EAX              ; Thread 2: Increment the register to 1.
MOV [g_x], EAX       ; Thread 2: Store 1 back in g_x.

MOV [g_x], EAX       ; Thread 1: Store 1 back in g_x.

To solve the problem just presented we need to guarantee that the incrementing of the value is done atomically—that is, without interruption. The interlocked family of functions provides the solution we need. All the functions manipulate a value atomically. Take a look at InterlockedExchangeAdd and its sibling InterlockedExchangeAdd64 that works on LONGLONG values:

No thread should ever attempt to modify the shared variable by using simple C++ statements:

// The long variable shared by many threads
LONG g_x; ...

// Incorrect way to increment the long
g_x++; ...

// Correct way to increment the long
InterlockedExchangeAdd(&g_x, 1);

You must also ensure that the variable addresses that you pass to these functions are properly aligned or the functions might fail. The C run-time library offers an _aligned_malloc function that you can use to allocate a block of memory that is properly aligned.

InterlockedExchange is extremely useful when you implement a spinlock.

// Global variable indicating whether a shared resource is in use or not
BOOL g_fResourceInUse = FALSE; ...
void Func1() {
   // Wait to access the resource.
   while (InterlockedExchange (&g_fResourceInUse, TRUE) == TRUE)
      Sleep(0);

   // Access the resource.
   ...

   // We no longer need to access the resource.
   InterlockedExchange(&g_fResourceInUse, FALSE);
}
  • This code assumes that all threads using the spinlock run at the same priority level. You might also want to disable thread priority boosting.
  • You should ensure that the lock variable and the data that the lock protects are maintained in different cache lines.
  • You should avoid using spinlocks on single-CPU machines. If a thread is spinning, it’s wasting precious CPU time, which prevents the other thread from changing the value.

You have access to a series of functions that allow you to easily manipulate a stack called an Interlocked Singly Linked List. Each operation, such as pushing or popping an element, is assured to be executed in an atomic way.

Cache Lines

If you want to build a high-performance application that runs on multiprocessor machines, you must be aware of CPU cache lines. When a CPU reads a byte from memory, it does not just fetch the single byte; it fetches enough bytes to fill a cache line. Cache lines consist of 32 (for older CPUs), 64, or even 128 bytes (depending on the CPU), and they are always aligned on 32-byte, 64-byte, or 128-byte boundaries, respectively. Cache lines exist to improve performance. Usually, an application manipulates a set of adjacent bytes. If these bytes are in the cache, the CPU does not have to access the memory bus, which requires much more time.

However, cache lines make memory updates more difficult in a multiprocessor environment, as you can see in this example:

  • CPU1 reads a byte, causing this byte and its adjacent bytes to be read into CPU1′s cache line.
  • CPU2 reads the same byte, which causes the same bytes in step 1 to be read into CPU2′s cache line.
  • CPU1 changes the byte in memory, causing the byte to be written to CPU1′s cache line. But the information is not yet written to RAM.
  • CPU2 reads the same byte again. Because this byte was already in CPU2′s cache line, it doesn’t have to access memory. But CPU2 will not see the new value of the byte in memory.

What all this means is that you should group your application’s data together in cache line—size chunks and on cache-line boundaries. The goal is to make sure that different CPUs access different memory addresses separated by at least a cache-line boundary. Also, you should separate your read-only data (or infrequently read data) from read-write data. And you should group together pieces of data that are accessed around the same time.

struct CUSTINFO {
   DWORD    dwCustomerID;     // Mostly read-only
   int      nBalanceDue;      // Read-write
   wchar_t  szName[100];      // Mostly read-only
   FILETIME ftLastOrderDate;  // Read-write
};
you can use the C/C++ compiler's __declspec(align(#)) directive to control field alignment. Here is an improved version of this structure:
#define CACHE_ALIGN 64

// Force each structure to be in a different cache line.
struct __declspec(align(CACHE_ALIGN)) CUSTINFO {
   DWORD    dwCustomerID;     // Mostly read-only
   wchar_t  szName[100];      // Mostly read-only

   // Force the following members to be in a different cache line.
   __declspec(align(CACHE_ALIGN))
   int nBalanceDue;           // Read-write
   FILETIME ftLastOrderDate;  // Read-write
};

It is best for data to be always accessed by a single thread (function parameters and local variables are the easiest way to ensure this) or for the data to be always accessed by a single CPU (using thread affinity). If you do either of these, you avoid cache-line issues entirely.

Critical Sections

A critical section is a small section of code that requires exclusive access to some shared resource before the code can execute. This is a way to have several lines of code "atomically" manipulate a resource. By atomic, I mean that the code knows that no other thread will access the resource. Of course, the system can still preempt your thread and schedule other threads. However, it will not schedule any other threads that want to access the same resource until your thread leaves the critical section.

Here is some problematic code that demonstrates what happens without the use of a critical section:

const int COUNT = 1000;
int g_nSum = 0;

DWORD WINAPI FirstThread(PVOID pvParam) {
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   return(g_nSum);
}


DWORD WINAPI SecondThread(PVOID pvParam) {
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   return(g_nSum);
}

Let’s correct the code using a critical section:

const int COUNT = 10;
int g_nSum = 0;
CRITICAL_SECTION g_cs;

DWORD WINAPI FirstThread(PVOID pvParam) {
   EnterCriticalSection(&g_cs);
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   LeaveCriticalSection(&g_cs);
   return(g_nSum);
}


DWORD WINAPI SecondThread(PVOID pvParam) {
   EnterCriticalSection(&g_cs);
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   LeaveCriticalSection(&g_cs);
   return(g_nSum);
}

The great thing about critical sections is that they are easy to use and they use the interlocked functions internally, so they execute quickly. The major disadvantage of critical sections is that you cannot use them to synchronize threads in multiple processes.

To use critical sections:

  • All threads that want to access the resource must know the address of the CRITICAL_SECTION structure that protects the resource.
  • The members within the CRITICAL_SECTION structure be initialized before any threads attempt to access the protected resource. The structure is initialized via a call to VOID InitializeCriticalSection(PCRITICAL_SECTION pcs);
  • When you know that your process’ threads will no longer attempt to access the shared resource, you should clean up the CRITICAL_SECTION structure by calling this function: VOID DeleteCriticalSection(PCRITICAL_SECTION pcs);
  • When you write code that touches a shared resource, you must prefix that code with a call to: VOID EnterCriticalSection(PCRITICAL_SECTION pcs);
  • At the end of your code that touches the shared resource, you must call this function: VOID LeaveCriticalSection(PCRITICAL_SECTION pcs);

Critical Sections and Spin Locks

When a thread attempts to enter a critical section owned by another thread, the calling thread is placed immediately into a wait state. This means that the thread must transition from user mode to kernel mode (about 1000 CPU cycles). This transition is very expensive. On a multiprocessor machine, the thread that currently owns the resource might execute on a different processor and might relinquish control of the resource shortly. In fact, the thread that owns the resource might release it before the other thread has completed executing its transition into kernel mode. If this happens, a lot of CPU time is wasted.

To improve the performance of critical sections, Microsoft has incorporated spinlocks into them. So when EnterCriticalSection is called, it loops using a spinlock to try to acquire the resource some number of times. Only if all the attempts fail does the thread transition to kernel mode to enter a wait state.

To use a spinlock with a critical section, you should initialize the critical section by calling this function:

BOOL InitializeCriticalSectionAndSpinCount(
   PCRITICAL_SECTION pcs,
   DWORD dwSpinCount);

Slim Reader-Writer Locks

An SRWLock has the same purpose as a simple critical section: to protect a single resource against access made by different threads. However, unlike a critical section, an SRWLock allows you to distinguish between threads that simply want to read the value of the resource (the readers) and other threads that are trying to update this value (the writers). It should be possible for all readers to access the shared resource at the same time because there is no risk of data corruption if you only read the value of a resource. The need for synchronization begins when a writer thread wants to update the resource. In that case, the access should be exclusive: no other thread, neither a reader nor a writer, should be allowed to access the resource. This is exactly what an SRWLock allows you to do in your code and in a very explicit way.

VS SRW Owner  
Request Owner Reader Writer
Reader Allow Block
Writer Block Block

 

As we see from the table, that SRWLocks are very suitable when Readers are more than Writers.

This article is a very good one to understand SRWLocks http://blogs.msdn.com/b/matt_pietrek/archive/2006/10/19/slim-reader-writer-locks.aspx

To use SRWLocks:

  1. First, you allocate an SRWLOCK structure and initialize it with the InitializeSRWLock function: VOID InitializeSRWLock(PSRWLOCK SRWLock);
  2. For readers:
    1. Thread can try to acquire an exclusive access to the resource protected by the SRWLock by calling AcquireSRWLockExclusive with the address of the SRWLOCK object as its parameter: VOID AcquireSRWLockExclusive(PSRWLOCK SRWLock);
    2. When the resource has been updated, the lock is released by calling ReleaseSRWLockExclusive with the address of the SRWLOCK object as its parameter: VOID ReleaseSRWLockExclusive(PSRWLOCK SRWLock);
  3. For writers:
    1. the same two-step scenario occurs but with the following two new functions: VOID AcquireSRWLockShared(PSRWLOCK SRWLock); VOID ReleaseSRWLockShared(PSRWLOCK SRWLock);

If you want to get the best performance in an application, you should try to use nonshared data first and then use volatile reads, volatile writes, interlocked APIs, SRWLocks, critical sections. And if all of these won’t work for your situation, then and only then, use kernel objects.

Condition Variables

You have seen that an SRWLock is used when you want to allow producer and consumer threads access to the same resource either in exclusive or shared mode. In these kinds of situations, if there is nothing to consume for a reader thread, it should release the lock and wait until there is something new produced by a writer thread. If the data structure used to receive the items produced by a writer thread becomes full, the lock should also be released and the writer thread put to sleep until reader threads have emptied the data structure.

Condition Variables are used in scenarios where a thread has to atomically release a lock on a resource and blocks until a condition is met through the SleepConditionVariableCS or SleepConditionVariableSRW functions.

A thread blocked inside these Sleep* functions is awakened when WakeConditionVariable or WakeAllConditionVariable is called by another thread that detects that the right condition is satisfied, such as the presence of an element to consume for a reader thread or enough room to insert a produced element for a writer thread.

This article solves the well known consumer/producer problem using condition variables with critical sections.

API Table

Function

Description

LONG InterlockedExchangeAdd(
   PLONG volatile plAddend,
   LONG lIncrement);
LONGLONG InterlockedExchangeAdd64(
   PLONGLONG volatile pllAddend,
   LONGLONG llIncrement);

Performs an atomic addition of two 32-bit values.

To operate on 64-bit values, used InterlockedExchangeAdd64

void * _aligned_malloc(size_t size, size_t alignment);

Used to allocate a block of memory that is properly aligned.

The size argument identifies the number of bytes you want to allocate, and the alignment argument indicates the byte boundary that you want the block aligned on. The value you pass for the alignment argument must be an integer power of 2.

LONG InterlockedExchange(
   PLONG volatile plTarget,
   LONG lValue);
LONGLONG InterlockedExchange64(
   PLONGLONG volatile plTarget,
   LONGLONG lValue);
PVOID InterlockedExchangePointer(
   PVOID* volatile ppvTarget,
   PVOID pvValue);

Replace the current value whose address is passed in the first parameter with a value passed in the second parameter.

For a 32-bit application, both functions replace a 32-bit value with another 32-bit value. But for a 64-bit application, InterlockedExchange replaces a 32-bit value while InterlockedExchangePointer replaces a 64-bit value. Both functions return the original value.

PVOID InterlockedCompareExchange(
   PLONG plDestination,
   LONG lExchange,
   LONG lComparand);
PVOID InterlockedCompareExchangePointer(
   PVOID* ppvDestination,
   PVOID pvExchange,
   PVOID pvComparand);

These two functions perform an atomic test and set operation: for a 32-bit application, both functions operate on 32-bit values, but in a 64-bit application, InterlockedCompareExchange operates on 32-bit values while InterlockedCompareExchangePointer operates on 64-bit values. In pseudocode, here is what happens:

LONG InterlockedIncrement(PLONG plAddend);
 
LONG InterlockedDecrement(PLONG plAddend);

These two functions perform atomic increment or decrement

VOID InitializeCriticalSection(PCRITICAL_SECTION pcs);

This function initializes the members of a CRITICAL_SECTION structure (pointed to by pcs).

VOID DeleteCriticalSection(PCRITICAL_SECTION pcs);

Resets the member variables inside the structure. Naturally, you should not delete a critical section if any threads are still using it.

VOID EnterCriticalSection(PCRITICAL_SECTION pcs);

When you write code that touches a shared resource u should prefix this code with this function.

BOOL TryEnterCriticalSection(PCRITICAL_SECTION pcs);

TryEnterCriticalSection never allows the calling thread to enter a wait state. Instead, its return value indicates whether the calling thread was able to gain access to the resource. So if TryEnterCriticalSection sees that the resource is being accessed by another thread, it returns FALSE. In all other cases, it returns TRUE.

VOID LeaveCriticalSection(PCRITICAL_SECTION pcs);

Call this function at the end of your code that touches the shared resource.

BOOL InitializeCriticalSectionAndSpinCount(
   PCRITICAL_SECTION pcs,
   DWORD dwSpinCount);

To use a spinlock with a critical section.

DWORD SetCriticalSectionSpinCount(
   PCRITICAL_SECTION pcs,
   DWORD dwSpinCount);

To change a critical section’s spin count.

BOOL SleepConditionVariableCS(
   PCONDITION_VARIABLE pConditionVariable,
   PCRITICAL_SECTION pCriticalSection,
   DWORD dwMilliseconds);

Sleeps on the specified condition variable and releases the specified critical section as an atomic operation.

BOOL SleepConditionVariableSRW(
   PCONDITION_VARIABLE pConditionVariable,
   PSRWLOCK pSRWLock,
   DWORD dwMilliseconds,
   ULONG Flags);

Sleeps on the specified condition variable and releases the specified SRW lock as an atomic operation.

VOID WakeConditionVariable(
   PCONDITION_VARIABLE ConditionVariable);

Wakes a single thread waiting on the specified condition variable.

VOID WakeAllConditionVariable(
   PCONDITION_VARIABLE ConditionVariable);

Wakes all threads waiting on the specified condition variable.

VOID InitializeSRWLock(PSRWLOCK SRWLock);

Initialize an SRW lock.

VOID AcquireSRWLockExclusive(PSRWLOCK SRWLock);

Acquires an SRW lock in exclusive mode.

VOID ReleaseSRWLockExclusive(PSRWLOCK SRWLock);

Releases an SRW lock that was opened in exclusive mode.

VOID AcquireSRWLockShared(PSRWLOCK SRWLock);

Acquires an SRW lock in shared mode.

VOID ReleaseSRWLockShared(PSRWLOCK SRWLock);

Releases an SRW lock that was opened in shared mode.

 

References

Windows® via C/C++, Fifth Edition

Follow

Get every new post delivered to your Inbox.