Windows Memory Architecture

Content

  1. How a Virtual Address Space Is Partitioned
  2. Regions in an Address Space
  3. Committing Physical Storage Within a Region
  4. Physical Storage and the Paging File
  5. Page Protection Attributes
  6. The Importance of Data Alignment

Introduction

Every process is given its very own virtual address space. For 32-bit processes, this address space is 4 GB because a 32-bit pointer can have any value from 0x00000000 through 0xFFFFFFFF.

Because each address space is private, Process A can have a data structure stored in its address space at address 0x12345678, while Process B can have a totally different data structure stored in its own address space at the very same address, 0x12345678. When threads running in Process A access memory at address 0x12345678, these threads are accessing Process A’s data structure. When threads running in Process B access memory at address 0x12345678, these threads are accessing Process B’s data structure. Threads running in Process A cannot access the data structure in Process B’s address space, and vice versa.

This address space is simply a range of memory addresses. Physical storage needs to be assigned or mapped to portions of the address space before you can successfully access data without raising access violations.

How a Virtual Address Space Is Partitioned

Each process’ virtual address space is split into partitions. The address space is partitioned based on the underlying implementation of the operating system.

The NULL-pointer assignment partition, from 0x00000000 through 0x0000FFFF inclusive, is set aside to help programmers catch NULL-pointer assignments. If a thread in your process attempts to read from or write to a memory address in this partition, an access violation is raised.

The user-mode partition is where the process’ private address space resides. The usable address range and approximate size of the user-mode partition depend on the CPU architecture.

The kernel-mode partition is where the operating system’s code resides. The code for thread scheduling, memory management, file systems support, networking support, and all device drivers is loaded in this partition. Everything residing in this partition is shared among all processes. Although this partition is just above the user-mode partition in every process, all code and data in this partition are completely protected. If your application code attempts to read from or write to a memory address in this partition, your thread raises an access violation.

Regions in an Address Space

When a process is created and given its address space, the bulk of this usable address space is free, or unallocated. To use portions of this address space, you must allocate regions within it by calling VirtualAlloc. The act of allocating a region is called reserving.

Whenever you reserve a region of address space:

  • The system ensures that the region begins on an allocation granularity boundary. All the CPU platforms use the same allocation granularity of 64 KB—that is, allocation requests are rounded to a 64-KB boundary.
  • The system ensures that the size of the region is a multiple of the system’s page size. A page is the unit of memory that the system uses in managing memory. The page size depends on the CPU: x86 and x64 systems use a 4-KB page size, but the IA-64 uses an 8-KB page size.

If you attempt to reserve a 10-KB region of address space, the system will automatically round up your request and reserve a region whose size is a multiple of the page size. This means that on x86 and x64 systems, the system will reserve a region that is 12 KB.
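The rounding rules above come down to simple arithmetic, sketched here in portable C. This only illustrates the calculation; it is not how Windows implements it. The helper names round_up and round_down and the two constants are assumptions made for this example. On a real system you would query the page size and allocation granularity with GetSystemInfo rather than hard-coding them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86/x64 values from the text above; hypothetical constants. */
#define PAGE_SIZE_BYTES   ((uintptr_t)4 * 1024)    /* 4-KB page */
#define ALLOC_GRANULARITY ((uintptr_t)64 * 1024)   /* 64-KB allocation granularity */

/* Round value up to the next multiple of unit (unit must be a power of two). */
uintptr_t round_up(uintptr_t value, uintptr_t unit) {
    assert(unit != 0 && (unit & (unit - 1)) == 0);
    return (value + unit - 1) & ~(unit - 1);
}

/* Round value down to the previous multiple of unit (unit must be a power of two). */
uintptr_t round_down(uintptr_t value, uintptr_t unit) {
    assert(unit != 0 && (unit & (unit - 1)) == 0);
    return value & ~(unit - 1);
}
```

With these helpers, round_up(10 * 1024, PAGE_SIZE_BYTES) yields 12 KB, matching the example above, and round_down(0x12345678, ALLOC_GRANULARITY) yields 0x12340000, the 64-KB boundary at or below that address.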

When your program’s algorithms no longer need to access a reserved region of address space, the region should be freed. This process is called releasing the region of address space and is accomplished by calling the VirtualFree function.

Committing Physical Storage Within a Region

To use a reserved region of address space, you must allocate physical storage and then map this storage to the reserved region. This process is called committing physical storage. Physical storage is always committed in pages. To commit physical storage to a reserved region, you again call the VirtualAlloc function.

When your program’s algorithms no longer need to access committed physical storage in the reserved region, the physical storage should be freed. This process is called decommitting the physical storage and is accomplished by calling the VirtualFree function.

Physical Storage and the Paging File

To make more memory available than the installed RAM, Windows backs virtual memory with a file on the hard disk. This file is typically called a paging file, and it contains the virtual memory that is available to all processes.

When an application commits physical storage to a region of address space by calling the VirtualAlloc function, space is actually allocated from a file on the hard disk. The size of the system’s paging file is the most important factor in determining how much physical storage is available to applications; the amount of RAM you have has very little effect.

When a thread in your process attempts to access a block of data in the process’ address space, one of two things can happen.

In the first possibility, the data that the thread is attempting to access is in RAM. In this case, the CPU maps the data’s virtual memory address to the physical address in memory, and then the desired access is performed.

In the second possibility, the data that the thread is attempting to access is not in RAM but is contained somewhere in the paging file. In this case, the attempted access is called a page fault, and the CPU notifies the operating system of the attempted access. The operating system then locates a free page of memory in RAM; if a free page cannot be found, the system must free one. If a page has not been modified, the system can simply free the page. But if the system needs to free a page that was modified, it must first copy the page from RAM to the paging file. Next the system goes to the paging file, locates the block of data that needs to be accessed, and loads the data into the free page of memory. The operating system then updates its table indicating that the data’s virtual memory address now maps to the appropriate physical memory address in RAM. The CPU now retries the instruction that generated the initial page fault, but this time the CPU is able to map the virtual memory address to a physical RAM address and access the block of data.

The more often the system needs to copy pages of memory to the paging file and vice versa, the more your hard disk thrashes and the slower the system runs. (Thrashing means that the operating system spends all its time swapping pages in and out of memory instead of running programs.)

When you invoke an application, the system opens the application’s .exe file and determines the size of the application’s code and data. Then the system reserves a region of address space and notes that the physical storage associated with this region is the .exe file itself. That’s right—instead of allocating space from the paging file, the system uses the actual contents, or image, of the .exe file as the program’s reserved region of address space. This, of course, makes loading an application very fast and allows the size of the paging file to remain small.

When a program’s file image (that is, an .exe or a DLL file) on the hard disk is used as the physical storage for a region of address space, it is called a memory-mapped file. When an .exe or a DLL is loaded, the system automatically reserves a region of address space and maps the file’s image to this region.

Page Protection Attributes

Individual pages of allocated physical storage can be assigned different protection attributes.

Some malware applications write code into areas of memory intended for data (such as a thread’s stack) and then the application executes the malicious code. Windows’ Data Execution Prevention (DEP) feature provides protection against this type of malware attack. With DEP enabled, the operating system uses the PAGE_EXECUTE_* protections only on regions of memory that are intended to have code execute; other protections (typically PAGE_READWRITE) are used for regions of memory intended to have data in them (such as thread stacks and the application’s heaps).

Windows supports a mechanism that allows two or more processes to share a single block of storage. So if 10 instances of Notepad are running, all instances share the application’s code and data pages.

When an .exe or a .dll module is mapped into an address space, the system calculates how many pages are writable. (Usually, the pages containing code are marked as PAGE_EXECUTE_READ while the pages containing data are marked PAGE_READWRITE.) Then the system allocates storage from the paging file to accommodate these writable pages. This paging file storage is not used unless the module’s writable pages are actually written to.

When a thread in one process attempts to write to a shared block, the system intervenes and performs the following steps:

  1. The system finds a free page of memory in RAM.

  2. The system copies the contents of the page that the thread is attempting to modify (in the image) to the free page found in step 1. This free page will be assigned either PAGE_READWRITE or PAGE_EXECUTE_READWRITE protection. The original page’s protection and data do not change at all.

  3. The system then updates the process’ page tables so that the accessed virtual address now translates to the new page of RAM.

After the system has performed these steps, the process can access its own private instance of this page of storage.
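The three steps above can be modeled in a few lines of portable C. This is a toy model for intuition only, not how the memory manager is implemented. ToyMapping, toy_write, and TOY_PAGE_SIZE are names invented for this illustration, and malloc stands in for "find a free page of RAM".

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096   /* assumed page size for the model */

/* Hypothetical view of one page as seen by one process. */
typedef struct ToyMapping {
    unsigned char *page;   /* the page this process currently sees */
    int isPrivate;         /* 0: still the shared original, 1: private copy */
} ToyMapping;

/* Write one byte through a mapping. The first write to a shared page
   triggers the copy. Returns 0 on success, -1 on failure. */
int toy_write(ToyMapping *m, size_t offset, unsigned char value) {
    if (offset >= TOY_PAGE_SIZE) {
        return -1;
    }
    if (!m->isPrivate) {
        /* Step 1: find a free page. */
        unsigned char *copy = (unsigned char *)malloc(TOY_PAGE_SIZE);
        if (copy == NULL) {
            return -1;
        }
        /* Step 2: copy the original contents; the original is left untouched. */
        memcpy(copy, m->page, TOY_PAGE_SIZE);
        /* Step 3: retarget this process' mapping to the new page. */
        m->page = copy;
        m->isPrivate = 1;
    }
    m->page[offset] = value;
    return 0;
}
```

If two ToyMapping values start out pointing at the same shared buffer and only one of them calls toy_write, the writer ends up with its own copy while the other mapping still sees the original, unmodified contents.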

A memory block is a set of contiguous pages that all have the same protection attributes and that are all backed by the same type of physical storage.

Protection attributes are given to a region for the sake of efficiency only, and they are always overridden by protection attributes assigned to physical storage.

A block’s protection attributes override the protection attributes of the region that contains the block.

The Importance of Data Alignment

Data alignment is not so much a part of the operating system’s memory architecture as it is a part of the CPU’s architecture.

CPUs operate most efficiently when they access properly aligned data. Data is aligned when the memory address of the data modulo the data’s size is 0. For example, a WORD value should always start on an address that is evenly divisible by 2, a DWORD value should always start on an address that is evenly divisible by 4, and so on. When the CPU attempts to read a data value that is not properly aligned, the CPU will do one of two things. It will either raise an exception or perform multiple, aligned memory accesses to read the full misaligned data value.

Here is some code that accesses misaligned data:

VOID SomeFunc(PVOID pvDataBuffer) {

   // The first byte in the buffer is some byte of information
   char c = * (PBYTE) pvDataBuffer;

   // Increment past the first byte in the buffer
   pvDataBuffer = (PVOID)((PBYTE) pvDataBuffer + 1);

   // Bytes 2-5 contain a double-word value
   DWORD dw = * (DWORD *) pvDataBuffer;

   // The line above raises a data misalignment exception on some CPUs
   ...
}

Obviously, if the CPU performs multiple memory accesses, the performance of your application is hampered. At best, it will take the system twice as long to access a misaligned value as it will to access an aligned value—but the access time could be even worse! To get the best performance for your application, you’ll want to write your code so that the data is properly aligned.
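One portable way to respect the alignment rule is to test addresses explicitly and to copy possibly misaligned bytes into an aligned local variable instead of dereferencing a cast pointer. The sketch below assumes standard C only; is_aligned and read_u32_unaligned are hypothetical helper names, not Windows APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when addr is a multiple of size (the rule described above). */
int is_aligned(const void *addr, size_t size) {
    assert(size != 0);
    return ((uintptr_t)addr % size) == 0;
}

/* Reads a 32-bit value from any address, aligned or not. memcpy copies the
   bytes into an aligned local, so no misaligned 32-bit load is requested. */
uint32_t read_u32_unaligned(const void *p) {
    uint32_t value;
    memcpy(&value, p, sizeof value);
    return value;
}
```

In a function like SomeFunc above, the double-word at offset 1 could be fetched with read_u32_unaligned((const unsigned char *)pvDataBuffer + 1), which avoids the misaligned dereference.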

References

Windows® via C/C++, Fifth Edition

I/O Completion Ports

Content

  1. Introduction
  2. Creating an I/O Completion Port
  3. Associating a Device with an I/O Completion Port
  4. How the I/O Completion Port Manages the Thread Pool
  5. Simulating Completed I/O Requests

Introduction

Service application architecture can be one of the following models:

  • Serial model: A single thread waits for a client to make a request (usually over the network). When the request comes in, the thread wakes and handles the client’s request.

  • Concurrent model: A single thread waits for a client request and then creates a new thread to handle the request. While the new thread is handling the client’s request, the original thread loops back around and waits for another client request. When the client’s request has been completely processed, the thread that handled it dies.

The problem with the serial model is that it does not handle multiple, simultaneous requests well. If two clients make requests at the same time, only one can be processed at a time; the second request must wait for the first request to finish processing. A service that is designed using the serial approach cannot take advantage of multiprocessor machines. Obviously, the serial model is good only for the simplest of server applications, in which few client requests are made and requests can be handled very quickly. A Ping server is a good example of a serial server.

Because of the limitations in the serial model, the concurrent model is extremely popular. In the concurrent model, a thread is created to handle each client request. The advantage is that the thread waiting for incoming requests has very little work to do. Most of the time, this thread is sleeping. When a client request comes in, the thread wakes, creates a new thread to handle the request, and then waits for another client request. This means that incoming client requests are handled expediently. Also, because each client request gets its own thread, the server application scales well and can easily take advantage of multiprocessor machines.

Service applications using the concurrent model were implemented using Windows. The Windows team noticed that application performance was not as high as desired. In particular, the team noticed that handling many simultaneous client requests meant that many threads were running in the system concurrently. Because all these threads were runnable (not suspended and waiting for something to happen), Microsoft realized that the Windows kernel spent too much time context switching between the running threads, and the threads were not getting as much CPU time to do their work. To make Windows an awesome server environment, Microsoft needed to address this problem. The result is the I/O completion port kernel object.

Creating an I/O Completion Port

The theory behind I/O Completion Ports states the following:

  1. The number of threads running concurrently must have an upper bound. That is, 500 simultaneous client requests must not result in 500 runnable threads. It makes sense to set the upper bound equal to the number of CPUs on the machine.
  2. I/O completion ports were designed to work with a pool of threads. A pool of threads is created when the application initializes, and these threads hang around for the duration of the application. How many threads should be in the pool? As a rule of thumb, take the number of CPUs on the host machine and multiply it by 2. So on a dual-processor machine, you should create a pool of four threads.

The following helper creates a new completion port with the given concurrency limit:

HANDLE CreateNewCompletionPort(DWORD dwNumberOfConcurrentThreads) {

   return(CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0,
      dwNumberOfConcurrentThreads));
}

Associating a Device with an I/O Completion Port

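A second call to CreateIoCompletionPort associates a device with an existing completion port. You pass the device handle, the port handle, and a completion key that has meaning to you; the concurrency parameter is ignored when an existing port is specified:

BOOL AssociateDeviceWithCompletionPort(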
   HANDLE hCompletionPort, HANDLE hDevice, DWORD dwCompletionKey) {

   HANDLE h = CreateIoCompletionPort(hDevice, hCompletionPort, dwCompletionKey, 0);
   return(h == hCompletionPort);
}

I/O Completion Port Internal Data Structures

Device List: contains a set of device handles associated with the completion port.

I/O Completion Queue (FIFO): When an asynchronous I/O request for a device completes, the system checks to see whether the device is associated with a completion port and, if it is, the system appends the completed I/O request entry to the end of the completion port’s I/O completion queue.

Waiting Thread Queue (LIFO): As each thread in the thread pool calls GetQueuedCompletionStatus, the ID of the calling thread is placed in this waiting thread queue, enabling the I/O completion port kernel object to always know which threads are currently waiting to handle completed I/O requests. Despite its name, this queue behaves like a stack: threads are woken in last-in, first-out order. When an entry appears in the port’s I/O completion queue, the completion port wakes one of the threads in the waiting thread queue. This thread gets the pieces of information that make up a completed I/O entry.

Released Thread List and Paused Thread List: When a completion port wakes a thread, the completion port places the thread’s ID in the released thread list. This allows the completion port to remember which threads it awakened and to monitor the execution of these threads. If a released thread calls any function that places the thread in a wait state, the completion port detects this and updates its internal data structures by moving the thread’s ID from the released thread list to the paused thread list.

All the threads in the pool should execute the same function. Typically, this thread function performs some sort of initialization and then enters a loop that should terminate when the service process is instructed to stop. Inside the loop, the thread puts itself to sleep waiting for device I/O requests to complete to the completion port.

How the I/O Completion Port Manages the Thread Pool

The goal of the completion port is to keep as many entries in the released thread list as are specified by the concurrent number of threads value used when creating the completion port. If a released thread enters a wait state for any reason, the released thread list shrinks and the completion port releases another waiting thread. If a paused thread wakes, it leaves the paused thread list and reenters the released thread list. This means that the released thread list can now have more entries in it than are allowed by the maximum concurrency value.

Once a thread calls GetQueuedCompletionStatus, the thread is "assigned" to the specified completion port. The system assumes that all assigned threads are doing work on behalf of the completion port. The completion port wakes threads from the pool only if the number of running assigned threads is less than the completion port’s maximum concurrency value.

You can break the thread/completion port assignment in one of three ways:

  1. Have the thread exit.
  2. Have the thread call GetQueuedCompletionStatus, passing the handle of a different I/O completion port.
  3. Destroy the I/O completion port that the thread is currently assigned to.

Let’s tie all of this together now. Say that we are again running on a machine with two CPUs. We create a completion port that allows no more than two threads to wake concurrently, and we create four threads that are waiting for completed I/O requests. If three completed I/O requests get queued to the port, only two threads are awakened to process the requests, reducing the number of runnable threads and saving context-switching time. Now if one of the running threads calls Sleep, WaitForSingleObject, WaitForMultipleObjects, SignalObjectAndWait, a synchronous I/O call, or any function that would cause the thread not to be runnable, the I/O completion port would detect this and wake a third thread immediately. The goal of the completion port is to keep the CPUs saturated with work.

Eventually, the first thread will become runnable again. When this happens, the number of runnable threads will be higher than the number of CPUs in the system. However, the completion port again is aware of this and will not allow any additional threads to wake up until the number of threads drops below the number of CPUs. The I/O completion port architecture presumes that the number of runnable threads will stay above the maximum for only a short time and will die down quickly as the threads loop around and again call GetQueuedCompletionStatus. This explains why the thread pool should contain more threads than the concurrent thread count set in the completion port.
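The scenario just described can be traced with a small toy model in portable C. It is a simplified sketch for intuition only and makes no claim about how the kernel object is really implemented. ToyPort and the toy_* functions are invented for this illustration, and thread IDs and I/O entries are plain integers.

```c
#include <assert.h>

#define TOY_CAPACITY 16   /* enough for this illustration; no wraparound */

typedef struct ToyPort {
    int maxConcurrency;          /* value given when the port is created */
    int entries[TOY_CAPACITY];   /* I/O completion queue, first-in first-out */
    int head, tail;
    int waiting[TOY_CAPACITY];   /* waiting thread IDs, last-in first-out */
    int waitingCount;
    int releasedCount;           /* size of the released thread list */
} ToyPort;

void toy_init(ToyPort *p, int maxConcurrency) {
    p->maxConcurrency = maxConcurrency;
    p->head = p->tail = 0;
    p->waitingCount = 0;
    p->releasedCount = 0;
}

/* Models a pool thread calling GetQueuedCompletionStatus and starting to wait. */
void toy_thread_waits(ToyPort *p, int threadId) {
    assert(p->waitingCount < TOY_CAPACITY);
    p->waiting[p->waitingCount++] = threadId;
}

/* Models a completed I/O request being appended to the completion queue. */
void toy_io_completes(ToyPort *p, int entry) {
    assert(p->tail < TOY_CAPACITY);
    p->entries[p->tail++] = entry;
}

/* Wakes one thread if an entry is queued, a thread is waiting, and the
   released list is below the concurrency limit.
   Returns the woken thread ID, or -1 if no thread may be released. */
int toy_release_one(ToyPort *p, int *entry) {
    if (p->head == p->tail || p->waitingCount == 0 ||
        p->releasedCount >= p->maxConcurrency) {
        return -1;
    }
    *entry = p->entries[p->head++];
    p->releasedCount++;
    return p->waiting[--p->waitingCount];
}

/* Models a released thread entering a wait state (moving to the paused list). */
void toy_thread_pauses(ToyPort *p) {
    assert(p->releasedCount > 0);
    p->releasedCount--;
}
```

With a limit of 2, four waiting threads, and three queued entries, the first two calls to toy_release_one each wake a thread and the third returns -1. After one call to toy_thread_pauses, a further call to toy_release_one wakes a third thread, mirroring the behavior described above.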

Simulating Completed I/O Requests

I/O completion ports do not have to be used with device I/O at all. They can also be used for inter-thread communication.

The PostQueuedCompletionStatus function is incredibly useful—it gives you a way to communicate with all the threads in your pool. For example, when the user terminates a service application, you want all the threads to exit cleanly. But if the threads are waiting on the completion port and no I/O requests are coming in, the threads can’t wake up. By calling PostQueuedCompletionStatus once for each thread in the pool, each thread can wake up, examine the values returned from GetQueuedCompletionStatus, see that the application is terminating, and clean up and exit appropriately.

You must be careful when using a thread termination technique like the one I just described. My example works because the threads in the pool are dying and not calling GetQueuedCompletionStatus again. However, if you want to notify each of the pool’s threads of something and have them loop back around to call GetQueuedCompletionStatus again, you will have a problem because the threads wake up in a LIFO order. So you will have to employ some additional thread synchronization in your application to ensure that each pool thread gets the opportunity to see its simulated I/O entry. Without this additional thread synchronization, one thread might see the same notification several times.

API Table

HANDLE CreateIoCompletionPort(
   HANDLE    hFile,
   HANDLE    hExistingCompletionPort,
   ULONG_PTR CompletionKey,
   DWORD     dwNumberOfConcurrentThreads);

This function performs two different tasks: it creates an I/O completion port, and it associates a device with an I/O completion port.

The dwNumberOfConcurrentThreads parameter tells the I/O completion port the maximum number of threads that should be runnable at the same time. If you pass 0 for the dwNumberOfConcurrentThreads parameter, the completion port defaults to allowing as many concurrent threads as there are CPUs on the host machine.

BOOL GetQueuedCompletionStatus(
   HANDLE       hCompletionPort,
   PDWORD       pdwNumberOfBytesTransferred,
   PULONG_PTR   pCompletionKey,
   OVERLAPPED** ppOverlapped,
   DWORD        dwMilliseconds);

This function puts the calling thread to sleep, waiting for device I/O requests to complete to the completion port.

BOOL GetQueuedCompletionStatusEx(
  HANDLE hCompletionPort,
  LPOVERLAPPED_ENTRY pCompletionPortEntries,
  ULONG ulCount,
  PULONG pulNumEntriesRemoved,
  DWORD dwMilliseconds,
  BOOL bAlertable);

Available in Windows Vista and later, GetQueuedCompletionStatusEx retrieves the results of several I/O requests at the same time. If you expect a large number of I/O requests to be constantly submitted, this lets you avoid multiplying the number of threads waiting on the completion port and incurring the increasing cost of the corresponding context switches.

BOOL PostQueuedCompletionStatus(
   HANDLE      hCompletionPort,
   DWORD       dwNumBytes,
   ULONG_PTR   CompletionKey,
   OVERLAPPED* pOverlapped);

This function appends a completed I/O notification to an I/O completion port’s queue. The first parameter, hCompletionPort, identifies the completion port that you want to queue the entry for. The remaining three parameters—dwNumBytes, CompletionKey, and pOverlapped—indicate the values that should be returned by a thread’s call to GetQueuedCompletionStatus. When a thread pulls a simulated entry from the I/O completion queue, GetQueuedCompletionStatus returns TRUE, indicating a successfully executed I/O request.

References

Windows® via C/C++, Fifth Edition

Synchronous and Asynchronous Device I/O

Content

  1. Introduction
  2. Synchronous I/O
  3. Basics of Asynchronous Device I/O
  4. Receiving Completed I/O Request Notifications

Introduction

A scalable application handles a large number of concurrent operations as efficiently as it handles a small number of concurrent operations.

One of the strengths of Windows is the sheer number of devices that it supports. In the context of this discussion, we define a device to be anything that allows communication. The table below lists some devices and their most common uses.

[Table: devices and their most common uses]

To perform any type of I/O, you must first open the desired device and get a handle to it. The way you get the handle to a device depends on the particular device. The table below lists various devices and the functions you should call to open them.

[Table: devices and the functions used to open them]

Synchronous I/O is what most developers are used to. When you read data from a file, your thread is suspended, waiting for the information to be read. Once the information has been read, the thread regains control and continues executing.

Because device I/O is slow when compared with most other operations, you might want to consider communicating with some devices asynchronously. Here’s how it works: Basically, you call a function to tell the operating system to read or write data, but instead of waiting for the I/O to complete, your call returns immediately, and the operating system completes the I/O on your behalf using its own threads. When the operating system has finished performing your requested I/O, you can be notified. Asynchronous I/O is the key to creating high-performance, scalable, responsive, and robust applications.

Most Windows functions that return a handle return NULL when the function fails. However, CreateFile returns INVALID_HANDLE_VALUE (defined as -1) instead. The following check is therefore incorrect:

HANDLE hFile = CreateFile(...);
if (hFile == NULL) {
   // We’ll never get in here
} else {
   // File might or might not be created OK
}

Here’s the correct way to check for an invalid file handle:

HANDLE hFile = CreateFile(...);
if (hFile == INVALID_HANDLE_VALUE) {
   // File not created
} else {
   // File created OK
}

Synchronous I/O Cancellation

Functions that do synchronous I/O are easy to use, but they block any other operations from occurring on the thread that issued the I/O until the request is completed. A great example of this is a CreateFile operation. When a user performs mouse and keyboard input, window messages are inserted into a queue that is associated with the thread that created the window that the input is destined for. If that thread is stuck inside a call to CreateFile, waiting for CreateFile to return, the window messages are not getting processed and all the windows created by the thread are frozen. The most common reason why applications hang is that their threads are stuck waiting for synchronous I/O operations to complete!

To build a responsive application, you should try to perform asynchronous I/O operations as much as possible. This typically also allows you to use very few threads in your application, thereby saving resources (such as thread kernel objects and stacks). Also, it is usually easy to offer your users the ability to cancel an operation when you initiate it asynchronously.

In Windows Vista and later, the following function allows you to cancel a pending synchronous I/O request for a given thread:

BOOL CancelSynchronousIo(HANDLE hThread);

The hThread parameter is a handle of the thread that is suspended waiting for the synchronous I/O request to complete. This handle must have been created with the THREAD_TERMINATE access. If this is not the case, CancelSynchronousIo fails and GetLastError returns ERROR_ACCESS_DENIED. When you create the thread yourself by using CreateThread or _beginthreadex, the returned handle has THREAD_ALL_ACCESS, which includes THREAD_TERMINATE access.

If the specified thread was suspended waiting for a synchronous I/O operation to complete, CancelSynchronousIo wakes the suspended thread and the operation it was trying to perform returns failure; calling GetLastError returns ERROR_OPERATION_ABORTED. Also, CancelSynchronousIo returns TRUE to its caller.

Note that the thread calling CancelSynchronousIo doesn’t really know where the thread that called the synchronous operation is. The thread could have been pre-empted and it has yet to actually communicate with the device; it could be suspended, waiting for the device to respond; or the device could have just responded, and the thread is in the process of returning from its call. If CancelSynchronousIo is called when the specified thread is not actually suspended waiting for the device to respond, CancelSynchronousIo returns FALSE and GetLastError returns ERROR_NOT_FOUND.

Basics of Asynchronous Device I/O

Compared to most other operations carried out by a computer, device I/O is one of the slowest and most unpredictable. The CPU performs arithmetic operations and even paints the screen much faster than it reads data from or writes data to a file or across a network. However, using asynchronous device I/O enables you to better use resources and thus create more efficient applications.

To access a device asynchronously, you must first open the device by calling CreateFile, specifying the FILE_FLAG_OVERLAPPED flag in the dwFlagsAndAttributes parameter. This flag notifies the system that you intend to access the device asynchronously.

The OVERLAPPED Structure

When performing asynchronous device I/O, you must pass the address of an initialized OVERLAPPED structure via the pOverlapped parameter of the I/O function (for example, ReadFile or WriteFile).

typedef struct _OVERLAPPED {
   ULONG_PTR Internal;     // [out] Error code
   ULONG_PTR InternalHigh; // [out] Number of bytes transferred
   DWORD     Offset;       // [in]  Low 32-bit file offset
   DWORD     OffsetHigh;   // [in]  High 32-bit file offset
   HANDLE    hEvent;       // [in]  Event handle or data
} OVERLAPPED;

  • Offset and OffsetHigh: When a file is being accessed, these members indicate the 64-bit offset in the file where you want the I/O operation to begin. Recall that each file kernel object has a file pointer associated with it. When issuing a synchronous I/O request, the system knows to start accessing the file at the location identified by the file pointer. After the operation is complete, the system updates the file pointer automatically so that the next operation can pick up where the last operation left off.

    When performing asynchronous I/O, this file pointer is ignored by the system. Imagine what would happen if your code placed two asynchronous calls to ReadFile (for the same file kernel object). In this scenario, the system wouldn’t know where to start reading for the second call to ReadFile. You probably wouldn’t want to start reading the file at the same location used by the first call to ReadFile. You might want to start the second read at the byte in the file that followed the last byte that was read by the first call to ReadFile. To avoid the confusion of multiple asynchronous calls to the same object, all asynchronous I/O requests must specify the starting file offset in the OVERLAPPED structure.

    Note that the Offset and OffsetHigh members are not ignored for nonfile devices—you must initialize both members to 0 or the I/O request will fail and GetLastError will return ERROR_INVALID_PARAMETER.

  • hEvent: This member is used by one of the four methods available for receiving I/O completion notifications. When using the alertable I/O notification method, this member can be used for your own purposes.

  • Internal: This member holds the processed I/O’s error code. As soon as you issue an asynchronous I/O request, the device driver sets Internal to STATUS_PENDING, indicating that no error has occurred because the operation has not started.

  • InternalHigh: When an asynchronous I/O request completes, this member holds the number of bytes transferred.

To pass more useful contextual information along with the OVERLAPPED structure, you can extend it, for example by making it the first member of a larger structure of your own.
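Here is a minimal sketch of two points from this section: storing a 64-bit starting offset in the Offset and OffsetHigh members, and extending the structure with per-request context. To keep the sketch compilable without windows.h, it uses a stand-in type, OverlappedStandIn, that mirrors the simplified layout shown above; real code would use OVERLAPPED itself. IoRequest, init_request, and request_from_overlapped are hypothetical names introduced for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in mirroring the simplified OVERLAPPED layout (an assumption made so
   that this sketch is portable; it is not the real Windows type). */
typedef struct OverlappedStandIn {
    uintptr_t Internal;
    uintptr_t InternalHigh;
    uint32_t  Offset;       /* low 32 bits of the file offset */
    uint32_t  OffsetHigh;   /* high 32 bits of the file offset */
    void     *hEvent;
} OverlappedStandIn;

/* Hypothetical per-request context wrapped around the OVERLAPPED part. */
typedef struct IoRequest {
    OverlappedStandIn ov;   /* kept as the first member */
    int   requestId;        /* hypothetical: identifies the request */
    char *buffer;           /* hypothetical: data buffer used by the request */
} IoRequest;

/* Zero the whole request, then store the 64-bit offset in its two halves. */
void init_request(IoRequest *req, int id, char *buffer, uint64_t fileOffset) {
    memset(req, 0, sizeof *req);
    req->ov.Offset     = (uint32_t)(fileOffset & 0xFFFFFFFFu);
    req->ov.OffsetHigh = (uint32_t)(fileOffset >> 32);
    req->requestId = id;
    req->buffer = buffer;
}

/* Recover the enclosing request from the OVERLAPPED pointer that a
   completion notification hands back. */
IoRequest *request_from_overlapped(OverlappedStandIn *ov) {
    return (IoRequest *)((char *)ov - offsetof(IoRequest, ov));
}
```

On Windows, the CONTAINING_RECORD macro serves the same purpose as the offsetof arithmetic in request_from_overlapped. Because the request structure is referenced by the pending I/O, it must stay alive and unmoved until the request completes.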

Asynchronous Device I/O Caveats

  • The device driver doesn’t have to process queued I/O requests in a first-in first-out (FIFO) fashion.
  • When attempting to queue an asynchronous I/O request, the device driver might choose to process the request synchronously. This can occur if you’re reading from a file and the system checks whether the data you want is already in the system’s cache. If the data is available, your I/O request is not queued to the device driver; instead, the system copies the data from the cache to your buffer, and the I/O operation is complete.
  • The data buffer and OVERLAPPED structure used to issue the asynchronous I/O request must not be moved or destroyed until the I/O request has completed.

Canceling Queued Device I/O Requests

  • You can call CancelIo to cancel all I/O requests queued by the calling thread for the specified handle: BOOL CancelIo(HANDLE hFile);
  • You can cancel all queued I/O requests, regardless of which thread queued the request, by closing the handle to a device itself.
  • When a thread dies, the system automatically cancels all I/O requests issued by the thread.
  • If you need to cancel a single, specific I/O request submitted on a given file handle, you can call CancelIoEx: BOOL CancelIoEx(HANDLE hFile, LPOVERLAPPED pOverlapped); With CancelIoEx, you can cancel pending I/O requests issued by a thread other than the calling thread. This function marks as canceled all I/O requests that are pending on hFile and associated with the given pOverlapped parameter. Because each outstanding I/O request should have its own OVERLAPPED structure, each call to CancelIoEx should cancel just one outstanding request. However, if the pOverlapped parameter is NULL, CancelIoEx cancels all outstanding I/O requests for the specified hFile.

Receiving Completed I/O Request Notifications

  1. Signaling a device kernel object: Not useful for performing multiple simultaneous I/O requests against a single device. Allows one thread to issue an I/O request and another thread to process it.
  2. Signaling an event kernel object: Allows multiple simultaneous I/O requests against a single device. Allows one thread to issue an I/O request and another thread to process it.
  3. Using alertable I/O: Allows multiple simultaneous I/O requests against a single device. The thread that issued an I/O request must also process it.
  4. Using I/O completion ports: Allows multiple simultaneous I/O requests against a single device. Allows one thread to issue an I/O request and another thread to process it. This technique is highly scalable and has the most flexibility.

1. Signaling a Device Kernel Object

A thread can determine whether an asynchronous I/O request has completed by calling either WaitForSingleObject or WaitForMultipleObjects. Here is a simple example:

HANDLE hFile = CreateFile(..., FILE_FLAG_OVERLAPPED, ...);
BYTE bBuffer[100];
OVERLAPPED o = { 0 };
o.Offset = 345;

BOOL bReadDone = ReadFile(hFile, bBuffer, 100, NULL, &o);
DWORD dwError = GetLastError();

if (!bReadDone && (dwError == ERROR_IO_PENDING)) {
   // The I/O is being performed asynchronously; wait for it to complete
   WaitForSingleObject(hFile, INFINITE);
   bReadDone = TRUE;
}

if (bReadDone) {
   // o.Internal contains the I/O error
   // o.InternalHigh contains the number of bytes transferred
   // bBuffer contains the read data
} else {
   // An error occurred; see dwError
}

2. Signaling an Event Kernel Object

The following code demonstrates this approach:

HANDLE hFile = CreateFile(..., FILE_FLAG_OVERLAPPED, ...);

BYTE bReadBuffer[10];
OVERLAPPED oRead = { 0 };
oRead.Offset = 0;
oRead.hEvent = CreateEvent(...);
ReadFile(hFile, bReadBuffer, 10, NULL, &oRead);

BYTE bWriteBuffer[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
OVERLAPPED oWrite = { 0 };
oWrite.Offset = 10;
oWrite.hEvent = CreateEvent(...);
WriteFile(hFile, bWriteBuffer, _countof(bWriteBuffer), NULL, &oWrite);
...

HANDLE h[2];
h[0] = oRead.hEvent;
h[1] = oWrite.hEvent;
DWORD dw = WaitForMultipleObjects(2, h, FALSE, INFINITE);
switch (dw - WAIT_OBJECT_0) {
   case 0:   // Read completed
      break;

   case 1:   // Write completed
      break;
}

3. Alertable I/O

Whenever a thread is created, the system also creates a queue that is associated with the thread. This queue is called the asynchronous procedure call (APC) queue. When issuing an I/O request, you can tell the device driver to append an entry to the calling thread’s APC queue. To have completed I/O notifications queued to your thread’s APC queue, you call the ReadFileEx and WriteFileEx functions.

The *Ex functions also require that you pass the address of a callback function, called a completion routine. This routine must have the following prototype: VOID WINAPI CompletionRoutine(DWORD dwError, DWORD dwNumBytes, OVERLAPPED* po);

When you issue an asynchronous I/O request with ReadFileEx and WriteFileEx, the functions pass the address of this function to the device driver. When the device driver has completed the I/O request, it appends an entry in the issuing thread’s APC queue. This entry contains the address of the completion routine function and the address of the OVERLAPPED structure used to initiate the I/O request.

When the thread is in an alertable state (discussed shortly), the system examines its APC queue and, for every entry in the queue, the system calls the completion function, passing it the I/O error code, the number of bytes transferred, and the address of the OVERLAPPED structure.

To process entries in your thread’s APC queue, the thread must put itself in an alertable state. This simply means that your thread has reached a position in its execution where it can handle being interrupted.

The Bad and the Good of Alertable I/O

  • Callback functions: Alertable I/O requires that you create callback functions, which makes implementing your code much more difficult. These callback functions typically don’t have enough contextual information about a particular problem to guide you, so you end up placing a lot of information in global variables.

  • Threading issues: The real problem with alertable I/O is that the thread issuing the I/O request must also handle the completion notification. If a thread issues several requests, that thread must respond to each request’s completion notification, even if other threads are sitting completely idle. Because there is no load balancing, the application doesn’t scale well.

Both of these problems are severe, so using alertable I/O for device I/O is strongly discouraged.

API Table

DWORD GetFileType(HANDLE hDevice);

If you have a handle to a device, you can find out what type of device it is by calling GetFileType, which returns one of the following values:

· FILE_TYPE_UNKNOWN: The type of the specified file is unknown.

· FILE_TYPE_DISK: The specified file is a disk file.

· FILE_TYPE_CHAR: The specified file is a character file, typically an LPT device or a console.

· FILE_TYPE_PIPE: The specified file is either a named pipe or an anonymous pipe.

HANDLE CreateFile(
   PCTSTR pszName,
   DWORD dwDesiredAccess,
   DWORD dwShareMode,
   PSECURITY_ATTRIBUTES psa,
   DWORD dwCreationDisposition,
   DWORD dwFlagsAndAttributes,
   HANDLE hFileTemplate);

CreateFile creates and opens disk files, but don’t let the name fool you: it opens lots of other devices as well.

BOOL WINAPI GetDiskFreeSpace(
  __in   LPCTSTR lpRootPathName,
  __out  LPDWORD lpSectorsPerCluster,
  __out  LPDWORD lpBytesPerSector,
  __out  LPDWORD lpNumberOfFreeClusters,
  __out  LPDWORD lpTotalNumberOfClusters
);

Retrieves information about the specified disk, including the number of bytes per sector.

BOOL GetFileSizeEx(
   HANDLE         hFile,
   PLARGE_INTEGER pliFileSize);

Retrieves the file’s size. The first parameter, hFile, is the handle of an opened file, and the pliFileSize parameter is the address of a LARGE_INTEGER union.

BOOL SetFilePointerEx(
   HANDLE         hFile,
   LARGE_INTEGER  liDistanceToMove,
   PLARGE_INTEGER pliNewFilePointer,
   DWORD          dwMoveMethod);

If you need to access a file randomly, you will need to alter the file pointer associated with the file’s kernel object.

The hFile parameter identifies the file kernel object whose file pointer you want to change. The liDistanceToMove parameter tells the system by how many bytes you want to move the pointer. The number you specify is added to the current value of the file’s pointer, so a negative number has the effect of stepping backward in the file. The last parameter of SetFilePointerEx, dwMoveMethod, tells SetFilePointerEx how to interpret the liDistanceToMove parameter.

BOOL SetEndOfFile(HANDLE hFile);
 

This SetEndOfFile function truncates or extends a file’s size to the size indicated by the file object’s file pointer. For example, if you wanted to force a file to be 1024 bytes long, you’d use SetEndOfFile this way:

HANDLE hFile = CreateFile(...);
LARGE_INTEGER liDistanceToMove;
liDistanceToMove.QuadPart = 1024;
SetFilePointerEx(hFile, liDistanceToMove, NULL, FILE_BEGIN);
SetEndOfFile(hFile);
CloseHandle(hFile);

BOOL ReadFile(
   HANDLE      hFile,
   PVOID       pvBuffer,
   DWORD       nNumBytesToRead,
   PDWORD      pdwNumBytes,
   OVERLAPPED* pOverlapped);
 
BOOL WriteFile(
   HANDLE      hFile,
   CONST VOID  *pvBuffer,
   DWORD       nNumBytesToWrite,
   PDWORD      pdwNumBytes,
   OVERLAPPED* pOverlapped);

The hFile parameter identifies the handle of the device you want to access. For synchronous I/O, you must not specify the FILE_FLAG_OVERLAPPED flag when opening the device, or the system will think that you want to perform asynchronous I/O with it. The pvBuffer parameter points to the buffer to which the device’s data should be read or to the buffer containing the data that should be written to the device. The nNumBytesToRead and nNumBytesToWrite parameters tell ReadFile and WriteFile how many bytes to read from the device and how many bytes to write to the device, respectively.

The pdwNumBytes parameters indicate the address of a DWORD that the functions fill with the number of bytes successfully transmitted to and from the device. The last parameter, pOverlapped, should be NULL when performing synchronous I/O. You’ll examine this parameter in more detail shortly when asynchronous I/O is discussed.

Both ReadFile and WriteFile return TRUE if successful. By the way, ReadFile can be called only for devices that were opened with the GENERIC_READ flag. Likewise, WriteFile can be called only when the device is opened with the GENERIC_WRITE flag.

BOOL FlushFileBuffers(HANDLE hFile);
 

Call FlushFileBuffers if you want to force the system to write cached data to the device.

The FlushFileBuffers function forces all the buffered data associated with a device that is identified by the hFile parameter to be written. For this to work, the device has to be opened with the GENERIC_WRITE flag. If the function is successful, TRUE is returned.

BOOL SetFileCompletionNotificationModes(HANDLE hFile, UCHAR uFlags);

To improve performance slightly, you can tell Windows not to signal the file object when the operation completes.

The hFile parameter identifies a file handle, and the uFlags parameter indicates how Windows should modify its normal behavior with respect to completing an I/O operation. If you pass the FILE_SKIP_SET_EVENT_ON_HANDLE flag, Windows will not signal the file handle when operations on the file complete.

References

Windows® via C/C++, Fifth Edition

Thread Synchronization with Kernel Objects

Content

  1. Introduction
  2. Wait Functions
  3. Event Kernel Objects
  4. Waitable Timer Kernel Objects
  5. Semaphore Kernel Objects
  6. Mutex Kernel Objects
  7. Mutexes vs Critical Sections

Introduction

Although user-mode thread synchronization mechanisms offer great performance, they do have limitations, such as:

  • You can use critical sections to place a thread in a wait state, but you can use them only to synchronize threads contained within a single process
  • You can easily get into deadlock situations with critical sections because you cannot specify a timeout value while waiting to enter the critical section.

Kernel objects overcome these limitations, but their drawback is performance. The transition from user mode to kernel mode is costly: an empty system call takes about 200 CPU cycles on the x86 platform, and this does not include the execution of the kernel-mode code that actually implements the function your thread is calling. What costs several orders of magnitude more is the overhead of scheduling a new thread, with all the cache flushes and misses it entails. Here we’re talking about tens of thousands of cycles.

The following kernel objects can be in a signaled or nonsignaled state:

  • Processes
  • Threads
  • Jobs
  • File and console standard input/output/error streams
  • Events
  • Waitable timers
  • Semaphores
  • Mutexes

Wait Functions

DWORD dw = WaitForSingleObject(hProcess, 5000);
switch (dw) {

   case WAIT_OBJECT_0:
     // The process terminated.
     break;

   case WAIT_TIMEOUT:
      // The process did not terminate within 5000 milliseconds.
      break;

   case WAIT_FAILED:
      // Bad call to function (invalid handle?)
      break;
}

The preceding code tells the system that the calling thread should not be schedulable until either the specified process has terminated or 5000 milliseconds have expired, whichever comes first. So this call returns in less than 5000 milliseconds if the process terminates, and it returns in about 5000 milliseconds if the process hasn’t terminated. Note that you can pass 0 for the dwMilliseconds parameter. If you do this, WaitForSingleObject always returns immediately, even if the wait condition hasn’t been satisfied.

HANDLE h[3];
h[0] = hProcess1;
h[1] = hProcess2;
h[2] = hProcess3;
DWORD dw = WaitForMultipleObjects(3, h, FALSE, 5000);
switch (dw) {
   case WAIT_FAILED:
      // Bad call to function (invalid handle?)
      break;

   case WAIT_TIMEOUT:
      // None of the objects became signaled within 5000 milliseconds.
      break;

   case WAIT_OBJECT_0 + 0:
      // The process identified by h[0] (hProcess1) terminated.
      break;

   case WAIT_OBJECT_0 + 1:
      // The process identified by h[1] (hProcess2) terminated.
      break;

   case WAIT_OBJECT_0 + 2:
      // The process identified by h[2] (hProcess3) terminated.
      break;
}
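The switch statement above can also be expressed as a small decoding helper. The sketch below is portable C++ rather than Windows code: the three constants mirror the values of WAIT_OBJECT_0, WAIT_TIMEOUT, and WAIT_FAILED from the Windows SDK headers, while WaitOutcome, WaitResult, and DecodeWait are hypothetical names introduced for illustration. It assumes the wait was made with bWaitAll set to FALSE, so that a successful return value identifies one handle in the array.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the Windows SDK headers so the sketch compiles anywhere.
constexpr std::uint32_t kWaitObject0 = 0x00000000u;  // WAIT_OBJECT_0
constexpr std::uint32_t kWaitTimeout = 0x00000102u;  // WAIT_TIMEOUT
constexpr std::uint32_t kWaitFailed  = 0xFFFFFFFFu;  // WAIT_FAILED

enum class WaitOutcome { Signaled, TimedOut, Failed, Other };

struct WaitResult {
    WaitOutcome   outcome;
    std::uint32_t index;  // meaningful only when outcome == Signaled
};

// Hypothetical helper: classify the value returned by a wait on `count` handles.
inline WaitResult DecodeWait(std::uint32_t dw, std::uint32_t count) {
    if (dw == kWaitFailed)  return {WaitOutcome::Failed, 0};
    if (dw == kWaitTimeout) return {WaitOutcome::TimedOut, 0};
    if (dw - kWaitObject0 < count) {
        // The offset from WAIT_OBJECT_0 is the index of the signaled handle.
        return {WaitOutcome::Signaled, dw - kWaitObject0};
    }
    return {WaitOutcome::Other, 0};  // any other code, e.g. an abandoned mutex
}
```

Checking for the failure and timeout codes before computing the index keeps the subtraction from being misread as a handle index.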

Successful Wait Side Effects

For some kernel objects, a successful call to WaitForSingleObject or WaitForMultipleObjects actually alters the state of the object. A successful call is one in which the function sees that the object was signaled and returns a value relative to WAIT_OBJECT_0. A call is unsuccessful if the function returns WAIT_TIMEOUT or WAIT_FAILED. Objects never have their state altered for unsuccessful calls.

Let’s look at an example. Two threads call WaitForMultipleObjects in exactly the same way:

HANDLE h[2];
h[0] = hAutoResetEvent1;   // Initially nonsignaled
h[1] = hAutoResetEvent2;   // Initially nonsignaled
WaitForMultipleObjects(2, h, TRUE, INFINITE);

When WaitForMultipleObjects is called, both event objects are nonsignaled; this forces both threads to enter a wait state. Then the hAutoResetEvent1 object becomes signaled. Both threads see that the event has become signaled, but neither can wake up because the hAutoResetEvent2 object is still nonsignaled. Because neither thread has successfully waited yet, no side effect happens to the hAutoResetEvent1 object.

Next, the hAutoResetEvent2 object becomes signaled. At this point, one of the two threads detects that both objects it is waiting for have become signaled. The wait is successful, both event objects are set to the nonsignaled state, and the thread is schedulable. But what about the other thread? It continues to wait until it sees that both event objects are signaled. Even though it originally detected that hAutoResetEvent1 was signaled, it now sees this object as nonsignaled.
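The scenario just described can be replayed with a toy model. This is not Windows code and makes no claim about how the kernel is implemented: WaitAllModel and its members are hypothetical names, and the model only encodes the rule stated above, namely that a wait-all succeeds when both events are signaled and that the successful wait resets both auto-reset events.

```cpp
#include <cassert>

// Toy model of two threads waiting for BOTH of two auto-reset events.
struct WaitAllModel {
    bool event1   = false;  // hAutoResetEvent1, initially nonsignaled
    bool event2   = false;  // hAutoResetEvent2, initially nonsignaled
    int  waiters  = 2;      // threads currently blocked in the wait-all call
    int  released = 0;      // threads whose wait has succeeded

    // Re-evaluate the waiters after a state change. A waiter is released only
    // when both events are signaled, and its successful wait resets both.
    void Evaluate() {
        while (waiters > 0 && event1 && event2) {
            event1 = false;  // side effect of the successful wait
            event2 = false;
            --waiters;
            ++released;
        }
    }

    void Signal1() { event1 = true; Evaluate(); }
    void Signal2() { event2 = true; Evaluate(); }
};
```

Signaling only the first event releases nobody and leaves that event signaled. Signaling the second then releases exactly one waiter and leaves both events nonsignaled, so the remaining waiter keeps waiting.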

If multiple threads wait for a single kernel object, which thread does the system decide to wake up when the object becomes signaled? The answer is "the algorithm is fair": if multiple threads are waiting, each should get its own chance to wake up each time the object becomes signaled.

Event Kernel Objects

Events signal that an operation has completed. There are two different types of event objects: manual-reset events and auto-reset events. When a manual-reset event is signaled, all threads waiting on the event become schedulable. When an auto-reset event is signaled, only one of the threads waiting on the event becomes schedulable.

Once an event is created, you control its state directly. When you call SetEvent, you change the event to the signaled state:

BOOL SetEvent(HANDLE hEvent);

When you call ResetEvent, you change the event to the nonsignaled state:

BOOL ResetEvent(HANDLE hEvent);

It’s that easy.

An auto-reset event is automatically reset to the nonsignaled state when a thread successfully waits on the object.

Waitable Timer Kernel Objects

Waitable timers are kernel objects that signal themselves at a certain time or at regular intervals. They are most commonly used to have some operation performed at a certain time.

Waitable timer objects are always created in the nonsignaled state. You must call the SetWaitableTimer function to tell the timer when you want it to become signaled.

The following code sets a timer to go off for the first time on January 1, 2008, at 1:00 P.M., and then to go off every six hours after that:

// Declare our local variables.
HANDLE hTimer;
SYSTEMTIME st;
FILETIME ftLocal, ftUTC;
LARGE_INTEGER liUTC;

// Create an auto-reset timer.
hTimer = CreateWaitableTimer(NULL, FALSE, NULL);

// First signaling is at January 1, 2008, at 1:00 P.M. (local time).
st.wYear         = 2008; // Year
st.wMonth        = 1;    // January
st.wDayOfWeek    = 0;    // Ignored
st.wDay          = 1;    // The first of the month
st.wHour         = 13;   // 1PM
st.wMinute       = 0;    // 0 minutes into the hour
st.wSecond       = 0;    // 0 seconds into the minute
st.wMilliseconds = 0;    // 0 milliseconds into the second

SystemTimeToFileTime(&st, &ftLocal);

// Convert local time to UTC time.
LocalFileTimeToFileTime(&ftLocal, &ftUTC);
// Convert FILETIME to LARGE_INTEGER because of different alignment.
liUTC.LowPart  = ftUTC.dwLowDateTime;
liUTC.HighPart = ftUTC.dwHighDateTime;

// Set the timer.
SetWaitableTimer(hTimer, &liUTC, 6 * 60 * 60 * 1000,
   NULL, NULL, FALSE); ...

Instead of setting an absolute time that the timer should first go off, you can have the timer go off at a time relative to calling SetWaitableTimer. You simply pass a negative value in the pDueTime parameter. The value you pass must be in 100-nanosecond intervals. Because we don’t normally think in intervals of 100 nanoseconds, you might find this useful: 1 second = 1,000 milliseconds = 1,000,000 microseconds = 10,000,000 100-nanosecond intervals.

The following code sets a timer to initially go off 5 seconds after the call to SetWaitableTimer:

// Declare our local variables.
HANDLE hTimer;
LARGE_INTEGER li;

// Create an auto-reset timer.
hTimer = CreateWaitableTimer(NULL, FALSE, NULL);

// Set the timer to go off 5 seconds after calling SetWaitableTimer.
// Timer unit is 100 nanoseconds.
const int nTimerUnitsPerSecond = 10000000;

// Negate the time so that SetWaitableTimer knows we
// want relative time instead of absolute time.
li.QuadPart = -(5 * nTimerUnitsPerSecond);

// Set the timer.
SetWaitableTimer(hTimer, &li, 6 * 60 * 60 * 1000,
   NULL, NULL, FALSE); ...
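The unit arithmetic in the two timer examples above can be isolated into two small helpers. This is a portable sketch, not part of the Windows API: RelativeDueTime and PeriodMilliseconds are hypothetical names, and the results are the values you would store in the LARGE_INTEGER's QuadPart and pass as lPeriod, respectively.

```cpp
#include <cassert>
#include <cstdint>

// Number of 100-nanosecond intervals in one second.
constexpr std::int64_t kTimerUnitsPerSecond = 10000000;

// Due time for "N seconds from now": the value is negated so that it is
// interpreted as relative rather than absolute time.
constexpr std::int64_t RelativeDueTime(std::int64_t seconds) {
    return -(seconds * kTimerUnitsPerSecond);
}

// Period expressed in milliseconds for an interval given in hours.
constexpr std::int32_t PeriodMilliseconds(std::int32_t hours) {
    return hours * 60 * 60 * 1000;
}
```

With these helpers, the 5-second example corresponds to RelativeDueTime(5), which is -50,000,000, and the six-hour period corresponds to PeriodMilliseconds(6), which is 21,600,000.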

Waitable Timers vs User Timers

The biggest difference is that User timers require a lot of additional user interface infrastructure in your application, which makes them more resource intensive. Also, waitable timers are kernel objects, which means that they can be shared by multiple threads and are securable.

  • User timers generate WM_TIMER messages that come back to the thread that called SetTimer (for callback timers) or the thread that created the window (for window-based timers). So only one thread is notified when a User timer goes off. Multiple threads, on the other hand, can wait on waitable timers, and several threads can be scheduled if the timer is a manual-reset timer.
  • With waitable timers, you’re more likely to be notified when the time actually expires. The WM_TIMER messages are always the lowest-priority messages and are retrieved when no other messages are in a thread’s queue.

Semaphore Kernel Objects

Semaphore kernel objects are used for resource counting. They contain a usage count, as all kernel objects do, but they also contain two additional signed 32-bit values: a maximum resource count and a current resource count. The maximum resource count identifies the maximum number of resources that the semaphore can control; the current resource count indicates the number of these resources that are currently available.

The rules for a semaphore are as follows:

  • If the current resource count is greater than 0, the semaphore is signaled.
  • If the current resource count is 0, the semaphore is nonsignaled.
  • The system never allows the current resource count to be negative.
  • The current resource count can never be greater than the maximum resource count.

A thread gains access to a resource by calling a wait function, passing the handle of the semaphore guarding the resource. Internally, the wait function checks the semaphore’s current resource count and if its value is greater than 0 (the semaphore is signaled), the counter is decremented by 1 and the calling thread remains schedulable.

Unfortunately, there is just no way to get the current resource count of a semaphore without altering it.
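The semaphore rules listed above can be captured in a small single-threaded model. This is an illustrative sketch, not the Windows implementation: SemaphoreModel and its methods are hypothetical names, TryWait stands in for a wait with a zero timeout, and Release mimics ReleaseSemaphore by refusing a release that would push the count above the maximum. The Count accessor exists only so the model can be inspected; as noted above, the real API offers no such query.

```cpp
#include <cassert>

// Single-threaded model of a semaphore's resource counting rules.
class SemaphoreModel {
public:
    SemaphoreModel(long initialCount, long maximumCount)
        : count_(initialCount), max_(maximumCount) {}

    // Signaled exactly when the current resource count is greater than 0.
    bool IsSignaled() const { return count_ > 0; }

    // Models a zero-timeout wait: succeeds and decrements the count when the
    // semaphore is signaled; otherwise fails and leaves the count unchanged.
    bool TryWait() {
        if (count_ > 0) {
            --count_;
            return true;
        }
        return false;
    }

    // Models a release: adds releaseCount to the current count, but fails
    // without changing anything if the maximum would be exceeded.
    bool Release(long releaseCount, long* previousCount = nullptr) {
        if (releaseCount <= 0 || count_ + releaseCount > max_) return false;
        if (previousCount) *previousCount = count_;
        count_ += releaseCount;
        return true;
    }

    // For inspecting the model only; not available on a real semaphore.
    long Count() const { return count_; }

private:
    long count_;  // current resource count, never negative
    long max_;    // maximum resource count
};
```

Because TryWait only decrements a positive count and Release rejects anything above the maximum, the count stays between 0 and the maximum at all times.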

Mutex Kernel Objects

Mutex kernel objects ensure that a thread has mutually exclusive access to a single resource. A mutex object contains a usage count, thread ID, and recursion counter. Mutexes behave identically to critical sections. However, mutexes are kernel objects, while critical sections are user-mode synchronization objects.

This means that mutexes are slower than critical sections. But it also means that threads in different processes can access a single mutex, and it means that a thread can specify a timeout value while waiting to gain access to a resource.

The rules for a mutex are as follows:

  • If the thread ID is 0 (an invalid thread ID), the mutex is not owned by any thread and is signaled.
  • If the thread ID is a nonzero value, a thread owns the mutex and the mutex is nonsignaled.

A thread gains access to the shared resource by calling a wait function, passing the handle of the mutex guarding the resource. Internally, the wait function checks the thread ID to see if it is 0 (the mutex is signaled). If the thread ID is 0, the thread ID is set to the calling thread’s ID, the recursion counter is set to 1, and the calling thread remains schedulable.

Every time a thread successfully waits on a mutex, the object’s recursion counter is incremented. The only way the recursion counter can have a value greater than 1 is if the thread waits on the same mutex multiple times.
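The ownership and recursion rules above can be sketched as a single-threaded model. It is illustrative only: MutexModel and its methods are hypothetical names, thread IDs are plain nonzero integers supplied by the caller, TryWait stands in for a zero-timeout wait, and Release stands in for ReleaseMutex, which fails when called by a thread that does not own the mutex.

```cpp
#include <cassert>

// Single-threaded model of a mutex's thread ID and recursion counter.
class MutexModel {
public:
    // Signaled exactly when no thread owns the mutex (thread ID is 0).
    bool IsSignaled() const { return owner_ == 0; }

    // Models a zero-timeout wait made by the thread with the given nonzero ID.
    bool TryWait(unsigned threadId) {
        if (owner_ == 0) {         // unowned: the caller becomes the owner
            owner_ = threadId;
            recursion_ = 1;
            return true;
        }
        if (owner_ == threadId) {  // the owner waits again: recursion grows
            ++recursion_;
            return true;
        }
        return false;              // owned by another thread
    }

    // Models a release: only the owner may release, and the mutex becomes
    // unowned once the recursion counter drops back to 0.
    bool Release(unsigned threadId) {
        if (owner_ == 0 || owner_ != threadId) return false;
        if (--recursion_ == 0) owner_ = 0;
        return true;
    }

    unsigned Owner() const     { return owner_; }
    unsigned Recursion() const { return recursion_; }

private:
    unsigned owner_     = 0;  // 0 means "not owned"
    unsigned recursion_ = 0;
};
```

A thread that has waited on the mutex twice must release it twice before another thread's wait can succeed.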

Abandonment Issues

If a thread owning a mutex terminates (using ExitThread, TerminateThread, ExitProcess, or TerminateProcess) before releasing the mutex, the system considers the mutex to be abandoned: the thread that owns it can never release it because the thread has died.

Because the system keeps track of all mutex and thread kernel objects, it knows exactly when mutexes become abandoned. When a mutex becomes abandoned, the system automatically resets the mutex object’s thread ID to 0 and its recursion counter to 0. Then the system checks to see whether any threads are currently waiting for the mutex. If so, it selects a waiting thread, sets the thread ID to that thread’s ID, sets the recursion counter to 1, and makes the thread schedulable.

This is the same as before except that the wait function does not return the usual WAIT_OBJECT_0 value to the thread. Instead, the wait function returns the special value of WAIT_ABANDONED. This special return value (which applies only to mutex objects) indicates that the mutex the thread was waiting on was owned by another thread that was terminated before it finished using the shared resource. The newly scheduled thread has no idea what state the resource is currently in—the resource might be totally corrupt.

In real life, most applications never check explicitly for the WAIT_ABANDONED return value because a thread is rarely just terminated. (This whole discussion provides another great example of why you should never call the TerminateThread function.)

Mutex vs Critical Section

Characteristic                              | Mutex                                           | Critical Section
--------------------------------------------|-------------------------------------------------|--------------------------------
Mode                                        | Kernel mode                                     | User mode
Performance                                 | Slow                                            | Fast
Can be used across process boundaries?      | Yes                                             | No
Declaration                                 | HANDLE hMutex;                                  | CRITICAL_SECTION cs;
Initialization                              | hMutex = CreateMutex(NULL, FALSE, NULL);        | InitializeCriticalSection(&cs);
Cleanup                                     | CloseHandle(hMutex);                            | DeleteCriticalSection(&cs);
Infinite wait                               | WaitForSingleObject(hMutex, INFINITE);          | EnterCriticalSection(&cs);
0 wait                                      | WaitForSingleObject(hMutex, 0);                 | TryEnterCriticalSection(&cs);
Arbitrary wait                              | WaitForSingleObject(hMutex, dwMilliseconds);    | N/A
Release                                     | ReleaseMutex(hMutex);                           | LeaveCriticalSection(&cs);
Can be waited on with other kernel objects? | Yes (e.g., WaitForMultipleObjects or similar)   | No

 

API Table

Function

Description

DWORD WaitForSingleObject(
   HANDLE hObject,
   DWORD dwMilliseconds);

When a thread calls this function, the first parameter, hObject, identifies a kernel object that supports being signaled/nonsignaled. (Any object in the list given in the Introduction works.) The second parameter, dwMilliseconds, allows the thread to indicate how long it is willing to wait for the object to become signaled.

The following function call tells the system that the calling thread wants to wait until the process identified by the hProcess handle terminates:

WaitForSingleObject(hProcess, INFINITE);

The second parameter tells the system that the calling thread is willing to wait forever (an infinite amount of time) or until this process terminates.

DWORD WaitForMultipleObjects(
   DWORD dwCount,
   CONST HANDLE* phObjects,
   BOOL bWaitAll,
   DWORD dwMilliseconds);

WaitForMultipleObjects is similar to WaitForSingleObject except that it allows the calling thread to check the signaled state of several kernel objects simultaneously.

The dwCount parameter indicates the number of kernel objects you want the function to check. This value must be between 1 and MAXIMUM_WAIT_OBJECTS (defined as 64 in the WinNT.h header file). The phObjects parameter is a pointer to an array of kernel object handles.

HANDLE CreateEvent(
   PSECURITY_ATTRIBUTES psa,
   BOOL bManualReset,
   BOOL bInitialState,
   PCTSTR pszName);

Creates an event kernel object.

The bManualReset parameter is a Boolean value that tells the system whether to create a manual-reset event (TRUE) or an auto-reset event (FALSE). The bInitialState parameter indicates whether the event should be initialized to signaled (TRUE) or nonsignaled (FALSE).

HANDLE CreateEventEx(
   PSECURITY_ATTRIBUTES psa,
   PCTSTR pszName,
   DWORD dwFlags,
   DWORD dwDesiredAccess);

Allows you to create an event, or open one that already exists, while requesting reduced access.

HANDLE OpenEvent(
   DWORD dwDesiredAccess,
   BOOL bInherit,
   PCTSTR pszName);

Threads in other processes can gain access to an existing event by calling OpenEvent (or CreateEvent) with the same value in the pszName parameter.

BOOL SetEvent(HANDLE hEvent);

Changes the event to the signaled state.

BOOL ResetEvent(HANDLE hEvent);

Changes the event to the nonsignaled state.

BOOL PulseEvent(HANDLE hEvent);

Makes an event signaled and then immediately nonsignaled; it’s just like calling SetEvent immediately followed by ResetEvent. If you call PulseEvent on a manual-reset event, any and all threads waiting on the event when it is pulsed are schedulable. If you call PulseEvent on an auto-reset event, only one waiting thread becomes schedulable. If no threads are waiting on the event when it is pulsed, there is no effect.

HANDLE CreateWaitableTimer(
   PSECURITY_ATTRIBUTES psa,
   BOOL bManualReset,
   PCTSTR pszName);

Creates a waitable timer.

The bManualReset parameter indicates a manual-reset or auto-reset timer. When a manual-reset timer is signaled, all threads waiting on the timer become schedulable. When an auto-reset timer is signaled, only one waiting thread becomes schedulable.

HANDLE OpenWaitableTimer(
   DWORD dwDesiredAccess,
   BOOL bInheritHandle,
   PCTSTR pszName);

A process can obtain its own process-relative handle to an existing waitable timer by calling OpenWaitableTimer.

BOOL SetWaitableTimer(
   HANDLE hTimer,
   const LARGE_INTEGER *pDueTime,
   LONG lPeriod,
   PTIMERAPCROUTINE pfnCompletionRoutine,
   PVOID pvArgToCompletionRoutine,
   BOOL bResume);

Tells the timer when you want it to become signaled.

The hTimer parameter indicates the timer that you want to set. The next two parameters, pDueTime and lPeriod, are used together. The pDueTime parameter indicates when the timer should go off for the first time, and the lPeriod parameter indicates how frequently, in milliseconds, the timer should go off after that.

BOOL CancelWaitableTimer(HANDLE hTimer);

This simple function takes the handle of a timer and cancels it so that the timer never goes off unless there is a subsequent call to SetWaitableTimer to reset the timer.

HANDLE CreateSemaphore(
   PSECURITY_ATTRIBUTES psa,
   LONG lInitialCount,
   LONG lMaximumCount,
   PCTSTR pszName);

Creates a semaphore kernel object.

HANDLE CreateSemaphoreEx(
   PSECURITY_ATTRIBUTES psa,
   LONG lInitialCount,
   LONG lMaximumCount,
   PCTSTR pszName,
   DWORD dwFlags,
   DWORD dwDesiredAccess);

Same as CreateSemaphore, but lets you provide access rights directly in the dwDesiredAccess parameter. The dwFlags parameter is reserved and should be set to 0.

HANDLE OpenSemaphore(
   DWORD dwDesiredAccess,
   BOOL bInheritHandle,
   PCTSTR pszName);

Another process can call OpenSemaphore to obtain its own process-relative handle to an existing semaphore.

BOOL ReleaseSemaphore(                
   HANDLE hSemaphore,
   LONG lReleaseCount,
   PLONG plPreviousCount);

A thread calls ReleaseSemaphore to increment a semaphore’s current resource count.

This function simply adds the value in lReleaseCount to the semaphore’s current resource count.

HANDLE CreateMutex(
   PSECURITY_ATTRIBUTES psa,
   BOOL bInitialOwner,
   PCTSTR pszName);

To use a mutex, one process must first create the mutex.

HANDLE CreateMutexEx(
   PSECURITY_ATTRIBUTES psa,
   PCTSTR pszName,
   DWORD dwFlags,
   DWORD dwDesiredAccess);

CreateMutexEx lets you provide access rights directly in the dwDesiredAccess parameter. The dwFlags parameter replaces the bInitialOwner parameter of CreateMutex: 0 means FALSE, and CREATE_MUTEX_INITIAL_OWNER is equivalent to TRUE.

HANDLE OpenMutex(
   DWORD dwDesiredAccess,
   BOOL bInheritHandle,
   PCTSTR pszName);

Another process can call OpenMutex to obtain its own process-relative handle to an existing mutex.

BOOL ReleaseMutex(HANDLE hMutex);

When the thread that currently has access to the resource no longer needs its access, it must release the mutex by calling ReleaseMutex.

References

Windows® via C/C++, Fifth Edition

Thread Synchronization in User Mode

Threads need to communicate with each other in two basic situations:

  • When you have multiple threads accessing a shared resource in such a way that the resource does not become corrupt.
  • When one thread needs to notify one or more other threads that a specific task has been completed.

Atomic Access: The Interlocked Family of Functions

A big part of thread synchronization has to do with atomic access—a thread’s ability to access a resource with the guarantee that no other thread will access that same resource at the same time.

Consider the following:

// Define a global variable.
long g_x = 0;

DWORD WINAPI ThreadFunc1(PVOID pvParam) {
   g_x++;
   return(0);
}

DWORD WINAPI ThreadFunc2(PVOID pvParam) {
   g_x++;
   return(0);
}

We create two threads: one thread executes ThreadFunc1, and the other thread executes ThreadFunc2.

If one thread executes this code followed by another thread, here is what effectively executes:

MOV EAX, [g_x]       ; Thread 1: Move 0 into a register.
INC EAX              ; Thread 1: Increment the register to 1.
MOV [g_x], EAX       ; Thread 1: Store 1 back in g_x.

MOV EAX, [g_x]       ; Thread 2: Move 1 into a register.
INC EAX              ; Thread 2: Increment the register to 2.
MOV [g_x], EAX       ; Thread 2: Store 2 back in g_x.

Windows is a preemptive, multithreaded environment, so a thread can be switched away from at any time and another thread might continue executing at any time. The preceding code might therefore execute as follows, leaving g_x with the value 1 instead of 2:

MOV EAX, [g_x]       ; Thread 1: Move 0 into a register.
INC EAX              ; Thread 1: Increment the register to 1.

MOV EAX, [g_x]       ; Thread 2: Move 0 into a register.
INC EAX              ; Thread 2: Increment the register to 1.
MOV [g_x], EAX       ; Thread 2: Store 1 back in g_x.

MOV [g_x], EAX       ; Thread 1: Store 1 back in g_x.

To solve the problem just presented, we need to guarantee that the incrementing of the value is done atomically, that is, without interruption. The interlocked family of functions provides the solution we need. All the functions manipulate a value atomically. Take a look at InterlockedExchangeAdd and its sibling InterlockedExchangeAdd64, which works on LONGLONG values; their signatures appear in the API table at the end of this section.

No thread should ever attempt to modify the shared variable by using simple C++ statements:

// The long variable shared by many threads
LONG g_x; ...

// Incorrect way to increment the long
g_x++; ...

// Correct way to increment the long
InterlockedExchangeAdd(&g_x, 1);
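
The same idea can be tried out in portable C++. The sketch below is only an illustration: it swaps std::atomic and std::thread in for the Win32 calls, and the helper name CountWithTwoThreads is hypothetical. Two threads each add 1 to a shared counter, and because every addition is an atomic read-modify-write, no increment is lost.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: two threads each add 1 to a shared counter
// `perThread` times; the final total is returned.
long CountWithTwoThreads(long perThread) {
    assert(perThread >= 0);
    std::atomic<long> counter{0};   // plays the role of g_x

    auto worker = [&counter, perThread] {
        for (long i = 0; i < perThread; ++i) {
            // Atomic read-modify-write, comparable in spirit to
            // InterlockedExchangeAdd(&g_x, 1).
            counter.fetch_add(1);
        }
    };

    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return counter.load();
}
```

With a plain long and g_x++ in place of the atomic, the total could come out lower than twice perThread.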

You must also ensure that the variable addresses that you pass to these functions are properly aligned or the functions might fail. The C run-time library offers an _aligned_malloc function that you can use to allocate a block of memory that is properly aligned.

InterlockedExchange is extremely useful when you implement a spinlock.

// Global variable indicating whether a shared resource is in use or not.
// Declared as LONG because InterlockedExchange operates on a LONG.
LONG g_fResourceInUse = FALSE; ...
void Func1() {
   // Wait to access the resource.
   while (InterlockedExchange (&g_fResourceInUse, TRUE) == TRUE)
      Sleep(0);

   // Access the resource.
   ...

   // We no longer need to access the resource.
   InterlockedExchange(&g_fResourceInUse, FALSE);
}

Keep the following points in mind when you use this technique:

  • This code assumes that all threads using the spinlock run at the same priority level. You might also want to disable thread priority boosting.
  • You should ensure that the lock variable and the data that the lock protects are maintained in different cache lines.
  • You should avoid using spinlocks on single-CPU machines. If a thread is spinning, it’s wasting precious CPU time, which prevents the other thread from changing the value.
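
For readers who want to experiment outside Windows, here is a minimal sketch of the same spinlock pattern in standard C++. It is an assumption-laden stand-in, not the Win32 code: std::atomic<bool>::exchange takes the place of InterlockedExchange, std::this_thread::yield is used where the original calls Sleep(0), and the names SpinLock and CountUnderSpinLock are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical spinlock built on an atomic exchange.
class SpinLock {
public:
    void Lock() {
        // exchange() stores true and returns the previous value; a previous
        // value of true means another thread still holds the lock.
        while (inUse_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();   // similar in intent to Sleep(0)
        }
    }
    void Unlock() { inUse_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> inUse_{false};
};

// Two threads increment a plain (non-atomic) counter under the lock.
long CountUnderSpinLock(long perThread) {
    assert(perThread >= 0);
    SpinLock lock;
    long counter = 0;
    auto worker = [&] {
        for (long i = 0; i < perThread; ++i) {
            lock.Lock();
            ++counter;
            lock.Unlock();
        }
    };
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return counter;
}
```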

You have access to a series of functions that allow you to easily manipulate a stack called an Interlocked Singly Linked List. Each operation, such as pushing or popping an element, is assured to be executed in an atomic way.

Cache Lines

If you want to build a high-performance application that runs on multiprocessor machines, you must be aware of CPU cache lines. When a CPU reads a byte from memory, it does not just fetch the single byte; it fetches enough bytes to fill a cache line. Cache lines consist of 32 (for older CPUs), 64, or even 128 bytes (depending on the CPU), and they are always aligned on 32-byte, 64-byte, or 128-byte boundaries, respectively. Cache lines exist to improve performance. Usually, an application manipulates a set of adjacent bytes. If these bytes are in the cache, the CPU does not have to access the memory bus, which requires much more time.

However, cache lines make memory updates more difficult in a multiprocessor environment, as you can see in this example:

  1. CPU1 reads a byte, causing this byte and its adjacent bytes to be read into CPU1’s cache line.
  2. CPU2 reads the same byte, which causes the same bytes as in step 1 to be read into CPU2’s cache line.
  3. CPU1 changes the byte in memory, causing the byte to be written to CPU1’s cache line. But the information is not yet written to RAM.
  4. CPU2 reads the same byte again. Because this byte was already in CPU2’s cache line, it doesn’t have to access memory. But CPU2 will not see the new value of the byte in memory.

What all this means is that you should group your application’s data together in cache-line-size chunks and on cache-line boundaries. The goal is to make sure that different CPUs access different memory addresses separated by at least a cache-line boundary. Also, you should separate your read-only data (or infrequently read data) from read-write data. And you should group together pieces of data that are accessed around the same time. Here is an example of a poorly designed data structure:

struct CUSTINFO {
   DWORD    dwCustomerID;     // Mostly read-only
   int      nBalanceDue;      // Read-write
   wchar_t  szName[100];      // Mostly read-only
   FILETIME ftLastOrderDate;  // Read-write
};

You can use the C/C++ compiler's __declspec(align(#)) directive to control field alignment. Here is an improved version of this structure:

#define CACHE_ALIGN 64

// Force each structure to be in a different cache line.
struct __declspec(align(CACHE_ALIGN)) CUSTINFO {
   DWORD    dwCustomerID;     // Mostly read-only
   wchar_t  szName[100];      // Mostly read-only

   // Force the following members to be in a different cache line.
   __declspec(align(CACHE_ALIGN))
   int nBalanceDue;           // Read-write
   FILETIME ftLastOrderDate;  // Read-write
};
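
The layout can be checked at compile time. The following sketch uses the standard alignas specifier rather than __declspec(align(#)), assumes a 64-byte cache line, and substitutes fixed-width integer types for DWORD and FILETIME; the names CustInfo, kCacheAlign, and SharesCacheLine are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheAlign = 64;   // assumed cache-line size

struct alignas(kCacheAlign) CustInfo {
    std::uint32_t customerId;             // mostly read-only
    wchar_t       name[100];              // mostly read-only

    // Start the read-write members on a fresh cache line.
    alignas(kCacheAlign) std::int32_t balanceDue;   // read-write
    std::uint64_t lastOrderDate;          // read-write (stand-in for FILETIME)
};

// True if two byte offsets fall within the same cache-line-sized chunk.
constexpr bool SharesCacheLine(std::size_t a, std::size_t b) {
    return a / kCacheAlign == b / kCacheAlign;
}

static_assert(alignof(CustInfo) == kCacheAlign,
              "each structure starts on a cache-line boundary");
static_assert(offsetof(CustInfo, balanceDue) % kCacheAlign == 0,
              "read-write members start on their own cache line");
static_assert(sizeof(CustInfo) % kCacheAlign == 0,
              "array elements never share a cache line");
```

Because the checks are static_asserts, a layout that violates them fails to compile rather than silently hurting performance.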

It is best for data to be always accessed by a single thread (function parameters and local variables are the easiest way to ensure this) or for the data to be always accessed by a single CPU (using thread affinity). If you do either of these, you avoid cache-line issues entirely.

Critical Sections

A critical section is a small section of code that requires exclusive access to some shared resource before the code can execute. This is a way to have several lines of code "atomically" manipulate a resource. By atomic, I mean that the code knows that no other thread will access the resource. Of course, the system can still preempt your thread and schedule other threads. However, it will not schedule any other threads that want to access the same resource until your thread leaves the critical section.

Here is some problematic code that demonstrates what happens without the use of a critical section:

const int COUNT = 1000;
int g_nSum = 0;

DWORD WINAPI FirstThread(PVOID pvParam) {
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   return(g_nSum);
}


DWORD WINAPI SecondThread(PVOID pvParam) {
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   return(g_nSum);
}

Let’s correct the code using a critical section:

const int COUNT = 1000;
int g_nSum = 0;
CRITICAL_SECTION g_cs;

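DWORD WINAPI FirstThread(PVOID pvParam) {
   EnterCriticalSection(&g_cs);
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   // Copy the result while still holding the lock; another thread could
   // reset g_nSum as soon as the critical section is released.
   int nResult = g_nSum;
   LeaveCriticalSection(&g_cs);
   return(nResult);
}


DWORD WINAPI SecondThread(PVOID pvParam) {
   EnterCriticalSection(&g_cs);
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   // Copy the result while still holding the lock (see FirstThread).
   int nResult = g_nSum;
   LeaveCriticalSection(&g_cs);
   return(nResult);
}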
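
A portable way to experiment with this pattern is sketched below. It is not the Win32 code: std::mutex and std::lock_guard stand in for CRITICAL_SECTION and the Enter/Leave calls, and the names SumUnderLock and BothThreadsSeeFullSum are hypothetical. The result is copied out while the lock is still held, so neither thread can observe a half-finished sum.

```cpp
#include <mutex>
#include <thread>

const int kCount = 1000;
int g_sum = 0;            // shared, like g_nSum
std::mutex g_sumLock;     // plays the role of g_cs

// Computes 1 + 2 + ... + kCount in the shared variable while holding the lock.
int SumUnderLock() {
    std::lock_guard<std::mutex> guard(g_sumLock);   // "enter"
    g_sum = 0;
    for (int n = 1; n <= kCount; ++n) {
        g_sum += n;
    }
    // The return value is copied before guard's destructor unlocks ("leave").
    return g_sum;
}

// Hypothetical driver: runs the computation on two threads at once.
bool BothThreadsSeeFullSum() {
    int first = 0;
    int second = 0;
    std::thread t1([&first] { first = SumUnderLock(); });
    std::thread t2([&second] { second = SumUnderLock(); });
    t1.join();
    t2.join();
    return first == kCount * (kCount + 1) / 2 && second == first;
}
```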

The great thing about critical sections is that they are easy to use and they use the interlocked functions internally, so they execute quickly. The major disadvantage of critical sections is that you cannot use them to synchronize threads in multiple processes.

To use critical sections:

  • All threads that want to access the resource must know the address of the CRITICAL_SECTION structure that protects the resource.
  • The members within the CRITICAL_SECTION structure must be initialized before any threads attempt to access the protected resource. The structure is initialized via a call to VOID InitializeCriticalSection(PCRITICAL_SECTION pcs);
  • When you know that your process’ threads will no longer attempt to access the shared resource, you should clean up the CRITICAL_SECTION structure by calling this function: VOID DeleteCriticalSection(PCRITICAL_SECTION pcs);
  • When you write code that touches a shared resource, you must prefix that code with a call to: VOID EnterCriticalSection(PCRITICAL_SECTION pcs);
  • At the end of your code that touches the shared resource, you must call this function: VOID LeaveCriticalSection(PCRITICAL_SECTION pcs);

Critical Sections and Spin Locks

When a thread attempts to enter a critical section owned by another thread, the calling thread is placed immediately into a wait state. This means that the thread must transition from user mode to kernel mode (about 1000 CPU cycles). This transition is very expensive. On a multiprocessor machine, the thread that currently owns the resource might execute on a different processor and might relinquish control of the resource shortly. In fact, the thread that owns the resource might release it before the other thread has completed executing its transition into kernel mode. If this happens, a lot of CPU time is wasted.

To improve the performance of critical sections, Microsoft has incorporated spinlocks into them. So when EnterCriticalSection is called, it loops using a spinlock to try to acquire the resource some number of times. Only if all the attempts fail does the thread transition to kernel mode to enter a wait state.

To use a spinlock with a critical section, you should initialize the critical section by calling this function:

BOOL InitializeCriticalSectionAndSpinCount(
   PCRITICAL_SECTION pcs,
   DWORD dwSpinCount);
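
The spin-then-wait idea can be illustrated with a small standard C++ sketch. This is only a model of the strategy described above, not how the critical section is actually implemented: std::mutex::try_lock stands in for the spinning attempts, std::mutex::lock for the transition to a wait state, and the helper names LockWithSpin and CountWithSpinThenBlock are hypothetical.

```cpp
#include <mutex>
#include <thread>

// Hypothetical helper: try the lock up to spinCount times without blocking,
// then fall back to a blocking acquire.
void LockWithSpin(std::mutex& m, unsigned spinCount) {
    for (unsigned i = 0; i < spinCount; ++i) {
        if (m.try_lock()) {
            return;          // acquired while spinning
        }
    }
    m.lock();                // give up spinning and wait
}

// Two threads increment a plain counter, acquiring the lock with LockWithSpin.
long CountWithSpinThenBlock(long perThread, unsigned spinCount) {
    std::mutex m;
    long counter = 0;
    auto worker = [&] {
        for (long i = 0; i < perThread; ++i) {
            LockWithSpin(m, spinCount);
            ++counter;
            m.unlock();
        }
    };
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return counter;
}
```

A spin count of 0 degenerates to a plain blocking lock, which is the sensible choice on a single-CPU machine.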

Slim Reader-Writer Locks

An SRWLock has the same purpose as a simple critical section: to protect a single resource against access made by different threads. However, unlike a critical section, an SRWLock allows you to distinguish between threads that simply want to read the value of the resource (the readers) and other threads that are trying to update this value (the writers). It should be possible for all readers to access the shared resource at the same time because there is no risk of data corruption if you only read the value of a resource. The need for synchronization begins when a writer thread wants to update the resource. In that case, the access should be exclusive: no other thread, neither a reader nor a writer, should be allowed to access the resource. This is exactly what an SRWLock allows you to do in your code and in a very explicit way.

Requesting thread    Lock owned by a reader    Lock owned by a writer
Reader               Allow                     Block
Writer               Block                     Block


As the table shows, SRWLocks are well suited to resources that are read more often than they are written.

For a good introduction to SRWLocks, see this article: http://blogs.msdn.com/b/matt_pietrek/archive/2006/10/19/slim-reader-writer-locks.aspx

To use SRWLocks:

  1. First, you allocate an SRWLOCK structure and initialize it with the InitializeSRWLock function: VOID InitializeSRWLock(PSRWLOCK SRWLock);
  2. For writers:
    1. A thread can try to acquire exclusive access to the resource protected by the SRWLock by calling AcquireSRWLockExclusive with the address of the SRWLOCK object as its parameter: VOID AcquireSRWLockExclusive(PSRWLOCK SRWLock);
    2. When the resource has been updated, the lock is released by calling ReleaseSRWLockExclusive with the address of the SRWLOCK object as its parameter: VOID ReleaseSRWLockExclusive(PSRWLOCK SRWLock);
  3. For readers:
    1. The same two-step scenario occurs, but with the following two functions: VOID AcquireSRWLockShared(PSRWLOCK SRWLock); VOID ReleaseSRWLockShared(PSRWLOCK SRWLock);
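
The reader/writer distinction can be sketched in portable C++17. The code below is an analogue, not the SRWLock API: std::shared_mutex stands in for SRWLOCK, std::shared_lock for the shared acquire/release pair, and std::unique_lock for the exclusive pair. The names SharedValue and RunReadersAndWriters are hypothetical.

```cpp
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

class SharedValue {
public:
    int Read() const {
        // Shared mode: any number of readers may hold the lock together.
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return value_;
    }
    void Add(int delta) {
        // Exclusive mode: one writer, and no readers, at a time.
        std::unique_lock<std::shared_mutex> lock(mutex_);
        value_ += delta;
    }

private:
    mutable std::shared_mutex mutex_;
    int value_ = 0;
};

// Hypothetical driver: writers add 1 repeatedly while readers poll the value.
int RunReadersAndWriters(int writers, int addsPerWriter, int readers) {
    SharedValue shared;
    std::vector<std::thread> threads;
    for (int w = 0; w < writers; ++w) {
        threads.emplace_back([&shared, addsPerWriter] {
            for (int i = 0; i < addsPerWriter; ++i) shared.Add(1);
        });
    }
    for (int r = 0; r < readers; ++r) {
        threads.emplace_back([&shared] {
            for (int i = 0; i < 1000; ++i) (void)shared.Read();
        });
    }
    for (auto& t : threads) t.join();
    return shared.Read();
}
```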

If you want to get the best performance in an application, you should try to use nonshared data first and then use volatile reads, volatile writes, interlocked APIs, SRWLocks, critical sections. And if all of these won’t work for your situation, then and only then, use kernel objects.

Condition Variables

You have seen that an SRWLock is used when you want to allow producer and consumer threads access to the same resource either in exclusive or shared mode. In these kinds of situations, if there is nothing to consume for a reader thread, it should release the lock and wait until there is something new produced by a writer thread. If the data structure used to receive the items produced by a writer thread becomes full, the lock should also be released and the writer thread put to sleep until reader threads have emptied the data structure.

Condition variables are used in scenarios where a thread has to atomically release a lock on a resource and block until a condition is met. The thread does this by calling SleepConditionVariableCS or SleepConditionVariableSRW.

A thread blocked inside these Sleep* functions is awakened when WakeConditionVariable or WakeAllConditionVariable is called by another thread that detects that the right condition is satisfied, such as the presence of an element to consume for a reader thread or enough room to insert a produced element for a writer thread.

This article solves the well known consumer/producer problem using condition variables with critical sections.
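
As a rough, portable illustration of the producer/consumer pattern, the sketch below uses std::condition_variable with std::mutex instead of the Win32 condition variable and critical section. The wait call plays the part of SleepConditionVariableCS (it releases the lock while blocked and reacquires it on wake-up), and notify_one plays the part of WakeConditionVariable. The names BoundedQueue and ProduceAndConsume, and the capacity of 4, are assumptions made for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void Push(int item) {
        std::unique_lock<std::mutex> lock(mutex_);
        // Sleep (with the lock released) until there is room.
        notFull_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(item);
        notEmpty_.notify_one();   // wake one waiting consumer
    }

    int Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        // Sleep (with the lock released) until there is something to consume.
        notEmpty_.wait(lock, [this] { return !items_.empty(); });
        int item = items_.front();
        items_.pop();
        notFull_.notify_one();    // wake one waiting producer
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable notFull_;
    std::condition_variable notEmpty_;
    std::queue<int> items_;
    std::size_t capacity_;
};

// One producer pushes 1..itemCount; one consumer sums what it pops.
long ProduceAndConsume(int itemCount) {
    BoundedQueue queue(4);
    long total = 0;
    std::thread producer([&] {
        for (int i = 1; i <= itemCount; ++i) queue.Push(i);
    });
    std::thread consumer([&] {
        for (int i = 0; i < itemCount; ++i) total += queue.Pop();
    });
    producer.join();
    consumer.join();
    return total;
}
```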

API Table

Function

Description

LONG InterlockedExchangeAdd(
   PLONG volatile plAddend,
   LONG lIncrement);
LONGLONG InterlockedExchangeAdd64(
   PLONGLONG volatile pllAddend,
   LONGLONG llIncrement);

Performs an atomic addition of two 32-bit values.

To operate on 64-bit values, use InterlockedExchangeAdd64.

void * _aligned_malloc(size_t size, size_t alignment);

Used to allocate a block of memory that is properly aligned.

The size argument identifies the number of bytes you want to allocate, and the alignment argument indicates the byte boundary that you want the block aligned on. The value you pass for the alignment argument must be an integer power of 2.

LONG InterlockedExchange(
   PLONG volatile plTarget,
   LONG lValue);
LONGLONG InterlockedExchange64(
   PLONGLONG volatile plTarget,
   LONGLONG lValue);
PVOID InterlockedExchangePointer(
   PVOID* volatile ppvTarget,
   PVOID pvValue);

Replace the current value whose address is passed in the first parameter with a value passed in the second parameter.

For a 32-bit application, InterlockedExchange and InterlockedExchangePointer both replace a 32-bit value with another 32-bit value. But for a 64-bit application, InterlockedExchange replaces a 32-bit value while InterlockedExchangePointer replaces a 64-bit value. InterlockedExchange64 always operates on a 64-bit value. All of these functions return the original value.

LONG InterlockedCompareExchange(
   PLONG plDestination,
   LONG lExchange,
   LONG lComparand);
PVOID InterlockedCompareExchangePointer(
   PVOID* ppvDestination,
   PVOID pvExchange,
   PVOID pvComparand);

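These two functions perform an atomic test-and-set operation: for a 32-bit application, both functions operate on 32-bit values, but in a 64-bit application, InterlockedCompareExchange operates on 32-bit values while InterlockedCompareExchangePointer operates on 64-bit values. Each function compares the current value at the destination with the comparand. If the two are equal, the destination is changed to the exchange value; otherwise, the destination is left unchanged. In both cases, the function returns the original value of the destination.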

LONG InterlockedIncrement(PLONG plAddend);
 
LONG InterlockedDecrement(PLONG plAddend);

These two functions atomically increment or decrement a value by 1.

VOID InitializeCriticalSection(PCRITICAL_SECTION pcs);

This function initializes the members of a CRITICAL_SECTION structure (pointed to by pcs).

VOID DeleteCriticalSection(PCRITICAL_SECTION pcs);

Resets the member variables inside the structure. Naturally, you should not delete a critical section if any threads are still using it.

VOID EnterCriticalSection(PCRITICAL_SECTION pcs);

When you write code that touches a shared resource, you should prefix that code with a call to this function.

BOOL TryEnterCriticalSection(PCRITICAL_SECTION pcs);

TryEnterCriticalSection never allows the calling thread to enter a wait state. Instead, its return value indicates whether the calling thread was able to gain access to the resource. So if TryEnterCriticalSection sees that the resource is being accessed by another thread, it returns FALSE. In all other cases, it returns TRUE.

VOID LeaveCriticalSection(PCRITICAL_SECTION pcs);

Call this function at the end of your code that touches the shared resource.

BOOL InitializeCriticalSectionAndSpinCount(
   PCRITICAL_SECTION pcs,
   DWORD dwSpinCount);

Initializes a critical section and sets the spin count used when a thread tries to enter it.

DWORD SetCriticalSectionSpinCount(
   PCRITICAL_SECTION pcs,
   DWORD dwSpinCount);

Changes a critical section’s spin count.

BOOL SleepConditionVariableCS(
   PCONDITION_VARIABLE pConditionVariable,
   PCRITICAL_SECTION pCriticalSection,
   DWORD dwMilliseconds);

Sleeps on the specified condition variable and releases the specified critical section as an atomic operation.

BOOL SleepConditionVariableSRW(
   PCONDITION_VARIABLE pConditionVariable,
   PSRWLOCK pSRWLock,
   DWORD dwMilliseconds,
   ULONG Flags);

Sleeps on the specified condition variable and releases the specified SRW lock as an atomic operation.

VOID WakeConditionVariable(
   PCONDITION_VARIABLE ConditionVariable);

Wakes a single thread waiting on the specified condition variable.

VOID WakeAllConditionVariable(
   PCONDITION_VARIABLE ConditionVariable);

Wakes all threads waiting on the specified condition variable.

VOID InitializeSRWLock(PSRWLOCK SRWLock);

Initializes an SRW lock.

VOID AcquireSRWLockExclusive(PSRWLOCK SRWLock);

Acquires an SRW lock in exclusive mode.

VOID ReleaseSRWLockExclusive(PSRWLOCK SRWLock);

Releases an SRW lock that was opened in exclusive mode.

VOID AcquireSRWLockShared(PSRWLOCK SRWLock);

Acquires an SRW lock in shared mode.

VOID ReleaseSRWLockShared(PSRWLOCK SRWLock);

Releases an SRW lock that was opened in shared mode.
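
The compare-and-exchange behavior of InterlockedCompareExchange listed in the table can be sketched with std::atomic. This is a standard C++ analogue written for illustration; the wrapper name CompareExchange is hypothetical, and compare_exchange_strong is used in place of the Win32 call.

```cpp
#include <atomic>

// Stores `exchange` only if the current value equals `comparand`, and
// returns the value that was in `destination` before the call.
long CompareExchange(std::atomic<long>& destination, long exchange,
                     long comparand) {
    long expected = comparand;
    // On failure, compare_exchange_strong writes the value it found into
    // `expected`; on success, `expected` still equals the original value.
    destination.compare_exchange_strong(expected, exchange);
    return expected;
}
```

A caller can tell whether the swap happened by comparing the returned value with the comparand it passed in.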

 

References

Windows® via C/C++, Fifth Edition

Thread Scheduling, Priorities and Affinities

Windows is called a preemptive multithreaded operating system because a thread can be stopped at any time and another thread can be scheduled.

Suspending and Resuming a Thread

Creating a thread in the suspended state allows you to alter the thread’s environment before the thread has a chance to execute any code. Once you alter the thread’s environment, you must make the thread schedulable. You do this by calling ResumeThread and passing it the thread handle returned by the call to CreateThread (or the thread handle from the structure pointed to by the ppiProcInfo parameter passed to CreateProcess):

DWORD ResumeThread(HANDLE hThread);

A single thread can be suspended several times. If a thread is suspended three times, it must be resumed three times before it is eligible for assignment to a CPU. In addition to using the CREATE_SUSPENDED flag when you create a thread, you can suspend a thread by calling SuspendThread:

DWORD SuspendThread(HANDLE hThread);
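
The suspend-count rule can be modeled with a few lines of C++. The class below is a toy model, not a real thread and not part of the Windows API; the name SuspendCountModel is hypothetical. It mirrors the documented convention that both SuspendThread and ResumeThread return the previous suspend count.

```cpp
// Toy model of a thread's suspend count.
class SuspendCountModel {
public:
    // Like SuspendThread: bumps the count and returns the previous value.
    unsigned Suspend() { return count_++; }

    // Like ResumeThread: returns the previous value and lowers the count,
    // leaving a count of zero unchanged.
    unsigned Resume() {
        unsigned previous = count_;
        if (count_ > 0) {
            --count_;
        }
        return previous;
    }

    // The thread is eligible for a CPU only when the count is zero.
    bool IsSchedulable() const { return count_ == 0; }

private:
    unsigned count_ = 0;
};
```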

SuspendThread is asynchronous with respect to kernel-mode execution, but user-mode execution does not occur until the thread is resumed.

In real life, an application must be careful when it calls SuspendThread because you have no idea what the thread might be doing when you attempt to suspend it.

SuspendThread is safe only if you know exactly what the target thread is (or might be) doing and you take extreme measures to avoid problems or deadlocks caused by suspending the thread.

Note: The concept of suspending or resuming a process doesn’t exist for Windows because processes are never scheduled CPU time.

Sleeping

A thread can also tell the system that it does not want to be schedulable for a certain amount of time. This is accomplished by calling Sleep:

VOID Sleep(DWORD dwMilliseconds);

There are a few important things to notice about Sleep:

  • Calling Sleep allows the thread to voluntarily give up the remainder of its time slice.
  • The system makes the thread not schedulable for approximately the number of milliseconds specified. That’s right—if you tell the system you want to sleep for 100 milliseconds, you will sleep approximately that long, but possibly several seconds or minutes more.
  • You can call Sleep and pass INFINITE for the dwMilliseconds parameter. This tells the system to never schedule the thread.
  • You can pass 0 to Sleep. This tells the system that the calling thread relinquishes the remainder of its time slice, and it forces the system to schedule another thread.

Windows is not a real-time operating system. Your thread will probably wake up at the right time, but whether it does depends on what else is going on in the system.
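
To see how long a sleep really takes, you can measure it. The sketch below uses the standard std::this_thread::sleep_for and std::chrono::steady_clock rather than the Win32 Sleep function; the helper name MeasureSleepMs is hypothetical. sleep_for is specified to block for at least the requested duration, and the measured time shows how much longer the thread actually stayed off the CPU.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Returns how long a requested sleep actually took, in whole milliseconds.
long long MeasureSleepMs(int requestedMs) {
    assert(requestedMs >= 0);
    auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(requestedMs));
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed)
        .count();
}
```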

Switching to Another Thread

The system offers a function called SwitchToThread that allows another schedulable thread to run if one exists:

BOOL SwitchToThread();

When you call this function, the system checks to see whether there is a thread that is being starved of CPU time. If no thread is starving, SwitchToThread returns immediately. If there is a starving thread, SwitchToThread schedules that thread (which might have a lower priority than the thread calling SwitchToThread). The starving thread is allowed to run for one time quantum and then the system scheduler operates as usual.

This function allows a thread that wants a resource to force a lower-priority thread that might currently own the resource to relinquish the resource. If no other thread can run when SwitchToThread is called, the function returns FALSE; otherwise, it returns a nonzero value.

A Thread’s Execution Times

Sometimes you want to time how long it takes a thread to perform a particular task. What many people do is write code similar to the following, taking advantage of the new GetTickCount64 function:

// Get the current time (start time).
ULONGLONG qwStartTime = GetTickCount64();

// Perform complex algorithm here.

// Subtract start time from current time to get duration.
ULONGLONG qwElapsedTime = GetTickCount64() - qwStartTime;

This code makes a simple assumption: it won’t be interrupted. However, in a preemptive operating system, you never know when your thread will be scheduled CPU time. When CPU time is taken away from your thread, it becomes more difficult to time how long it takes your thread to perform various tasks. What we need is a function that returns the amount of CPU time that the thread has received. Fortunately, Windows has offered such a function since before Windows Vista; it is called GetThreadTimes:

BOOL GetThreadTimes(
   HANDLE hThread,
   PFILETIME pftCreationTime,
   PFILETIME pftExitTime,
   PFILETIME pftKernelTime,
   PFILETIME pftUserTime);

Using this function, you can determine the amount of time needed to execute a complex algorithm by using code such as the following.

__int64 FileTimeToQuadWord (PFILETIME pft) {
   return(Int64ShllMod32(pft->dwHighDateTime, 32) | pft->dwLowDateTime);
}

void PerformLongOperation () {

   FILETIME ftKernelTimeStart, ftKernelTimeEnd;
   FILETIME ftUserTimeStart,   ftUserTimeEnd;
   FILETIME ftDummy;
   __int64 qwKernelTimeElapsed, qwUserTimeElapsed,
      qwTotalTimeElapsed;

   // Get starting times.
   GetThreadTimes(GetCurrentThread(), &ftDummy, &ftDummy,
      &ftKernelTimeStart, &ftUserTimeStart);

   // Perform complex algorithm here.

   // Get ending times.
   GetThreadTimes(GetCurrentThread(), &ftDummy, &ftDummy,
      &ftKernelTimeEnd, &ftUserTimeEnd);

   // Get the elapsed kernel and user times by converting the start
   // and end times from FILETIMEs to quad words, and then subtract
   // the start times from the end times.
   qwKernelTimeElapsed = FileTimeToQuadWord(&ftKernelTimeEnd) -
      FileTimeToQuadWord(&ftKernelTimeStart);

   qwUserTimeElapsed = FileTimeToQuadWord(&ftUserTimeEnd) -
      FileTimeToQuadWord(&ftUserTimeStart);

   // Get total time duration by adding the kernel and user times.
   qwTotalTimeElapsed = qwKernelTimeElapsed + qwUserTimeElapsed;

   // The total elapsed time is in qwTotalTimeElapsed.
}
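
The conversion step can be examined in isolation. The sketch below is a portable stand-in: FileTimeParts is a hypothetical struct that mimics the two 32-bit halves of FILETIME, and the helper names ToQuadWord and ElapsedMilliseconds are assumptions. It relies on the fact that a FILETIME counts 100-nanosecond intervals, so 10,000 intervals make one millisecond.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the FILETIME layout: two 32-bit halves of a 64-bit count
// of 100-nanosecond intervals.
struct FileTimeParts {
    std::uint32_t low;
    std::uint32_t high;
};

// Combines the halves into one 64-bit value.
std::uint64_t ToQuadWord(const FileTimeParts& ft) {
    return (static_cast<std::uint64_t>(ft.high) << 32) | ft.low;
}

// Difference between two times, converted to whole milliseconds.
std::uint64_t ElapsedMilliseconds(const FileTimeParts& start,
                                  const FileTimeParts& end) {
    assert(ToQuadWord(end) >= ToQuadWord(start));
    return (ToQuadWord(end) - ToQuadWord(start)) / 10000;
}
```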

Thread Context

The CONTEXT structure allows the system to remember a thread’s state so that the thread can pick up where it left off the next time it has a CPU to run on.

Windows actually lets you look inside a thread’s kernel object and grab its current set of CPU registers. To do this, you simply call GetThreadContext:

BOOL GetThreadContext(
   HANDLE hThread,
   PCONTEXT pContext);

You should call SuspendThread before calling GetThreadContext; otherwise, the thread might be scheduled and the thread’s context might be different from what you get back.

It’s amazing how much power Windows offers the developer! But, if you think that’s cool, you’re gonna love this: Windows lets you change the members in the CONTEXT structure and then place the new register values back into the thread’s kernel object by calling SetThreadContext:

BOOL SetThreadContext(
   HANDLE hThread,
   CONST CONTEXT *pContext);

Again, the thread whose context you’re changing should be suspended first or the results will be unpredictable.

Before calling SetThreadContext, you must initialize the ContextFlags member of CONTEXT again, as shown here:

CONTEXT Context;

// Stop the thread from running.
SuspendThread(hThread);

// Get the thread's context registers.
Context.ContextFlags = CONTEXT_CONTROL;
GetThreadContext(hThread, &Context);

// Make the instruction pointer point to the address of your choice.
// Here I've arbitrarily set the address instruction pointer to
// 0x00010000.
Context.Eip = 0x00010000;

// Set the thread's registers to reflect the changed values.
// It's not really necessary to reset the ContextFlags member
// because it was set earlier.
Context.ContextFlags = CONTEXT_CONTROL;
SetThreadContext(hThread, &Context);

// Resuming the thread will cause it to begin execution
// at address 0x00010000.
ResumeThread(hThread);

This will probably cause an access violation in the remote thread; the unhandled exception message box will be presented to the user, and the remote process will be terminated. That’s right—the remote process will be terminated, not your process. You will have successfully crashed another process while yours continues to execute just fine!

Thread Priorities

Every thread is assigned a priority number ranging from 0 (the lowest) to 31 (the highest). When the system decides which thread to assign to a CPU, it examines the priority 31 threads first and schedules them in a round-robin fashion. If a priority 31 thread is schedulable, it is assigned to a CPU. At the end of this thread’s time slice, the system checks to see whether there is another priority 31 thread that can run; if so, it allows that thread to be assigned to a CPU.

Starvation occurs when higher-priority threads use so much CPU time that they prevent lower-priority threads from executing.

Higher-priority threads always preempt lower-priority threads, regardless of what the lower-priority threads are executing.

When the system boots, it creates a special thread called the zero page thread. This thread is assigned priority 0 and is the only thread in the entire system that runs at priority 0. The zero page thread is responsible for zeroing any free pages of RAM in the system when there are no other threads that need to perform work.

Application developers never work with priority levels. Instead, the system maps the process’ priority class and a thread’s relative priority to a priority level. It is precisely this mapping that Microsoft does not want to commit to. In fact, this mapping has changed between versions of the system.

In general, a thread with a high priority level should not be schedulable most of the time. When the thread has something to do, it quickly gets CPU time. At this point, the thread should execute as few CPU instructions as possible and go back to sleep, waiting to be schedulable again. In contrast, a thread with a low priority level can remain schedulable and execute a lot of CPU instructions to do its work. If you follow these rules, the entire operating system will be responsive to its users.

To create a thread with an idle relative thread priority, you execute code similar to the following:

DWORD dwThreadID;
HANDLE hThread = CreateThread(NULL, 0, ThreadFunc, NULL,
   CREATE_SUSPENDED, &dwThreadID);
SetThreadPriority(hThread, THREAD_PRIORITY_IDLE);
ResumeThread(hThread);
CloseHandle(hThread);

Dynamically Boosting Thread Priority Levels

The system determines the thread’s priority level by combining a thread’s relative priority with the priority class of the thread’s process. This is sometimes referred to as the thread’s base priority level. Occasionally, the system boosts the priority level of a thread—usually in response to some I/O event such as a window message or a disk read.

For example, a thread with a normal thread priority in a high-priority class process has a base priority level of 13. If the user presses a key, the system places a WM_KEYDOWN message in the thread’s queue. Because a message has appeared in the thread’s queue, the thread is schedulable. In addition, the keyboard device driver can tell the system to temporarily boost the thread’s level. So the thread might be boosted by 2 and have a current priority level of 15.

The thread is scheduled for one time slice at priority 15. Once that time slice expires, the system drops the thread’s priority by 1 to 14 for the next time slice. The thread’s third time slice is executed with a priority level of 13. Any additional time slices required by the thread are executed at priority level 13, the thread’s base priority level.

Another situation causes the system to dynamically boost a thread’s priority level. Imagine a priority 4 thread that is ready to run but cannot because a priority 8 thread is constantly schedulable. In this scenario, the priority 4 thread is being starved of CPU time. When the system detects that a thread has been starved of CPU time for about three to four seconds, it dynamically boosts the starving thread’s priority to 15 and allows that thread to run for twice its time quantum. When the double time quantum expires, the thread’s priority immediately returns to its base priority.
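
The decay of an I/O-related boost, as described in the keyboard example above, can be written out as a small calculation. This is a model of the description only, not of the actual scheduler; the function name BoostedPriorities is hypothetical, and it assumes the boost drops by one level per time slice until the base level is reached.

```cpp
#include <cassert>
#include <vector>

// Priority level used for each of the next `slices` time slices, assuming a
// boost that decays by one level per slice down to the base level.
std::vector<int> BoostedPriorities(int basePriority, int boost, int slices) {
    assert(boost >= 0 && slices >= 0);
    std::vector<int> levels;
    int current = basePriority + boost;
    for (int i = 0; i < slices; ++i) {
        levels.push_back(current);
        if (current > basePriority) {
            --current;
        }
    }
    return levels;
}
```

For a base level of 13 and a boost of 2, the first four slices run at 15, 14, 13, and 13, matching the example in the text.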

API Table

Function

Description

DWORD ResumeThread(HANDLE hThread);

Resumes the specified thread, making it schedulable again once it has been resumed as many times as it was suspended.

DWORD SuspendThread(HANDLE hThread);

Suspends the specified thread.

VOID Sleep(DWORD dwMilliseconds);

Tells the system that the calling thread does not want to be schedulable for the specified amount of time.

BOOL SwitchToThread();

Allows another schedulable thread to run if one exists.

BOOL GetThreadTimes(
   HANDLE hThread,
   PFILETIME pftCreationTime,
   PFILETIME pftExitTime,
   PFILETIME pftKernelTime,
   PFILETIME pftUserTime);

Returns the amount of CPU time that the thread has received.

BOOL GetThreadContext(
   HANDLE hThread,
   PCONTEXT pContext);

Lets you look inside a thread’s kernel object and grab its current set of CPU registers.

BOOL SetPriorityClass(
   HANDLE hProcess,
   DWORD fdwPriority);

Changes a process’ priority class. Once a child process is running, it can call this function to change its own priority class.

DWORD GetPriorityClass(HANDLE hProcess);

Queries the priority class of a process.

BOOL SetThreadPriority(
   HANDLE hThread,
   int nPriority);

Sets a thread’s relative priority.

int GetThreadPriority(HANDLE hThread);

Returns a thread’s relative priority.

BOOL SetProcessPriorityBoost(
   HANDLE hProcess,
   BOOL bDisablePriorityBoost);

Tells the system to enable or disable priority boosting for all threads within a process.

BOOL GetProcessPriorityBoost(
   HANDLE hProcess,
   PBOOL pbDisablePriorityBoost);

Determines whether priority boosting is enabled or disabled for a process.

BOOL SetThreadPriorityBoost(
   HANDLE hThread,
   BOOL bDisablePriorityBoost);

Enables or disables priority boosting for an individual thread.

BOOL GetThreadPriorityBoost(
   HANDLE hThread,
   PBOOL pbDisablePriorityBoost);

Determines whether priority boosting is enabled or disabled for a thread.

BOOL SetProcessAffinityMask(
   HANDLE hProcess,
   DWORD_PTR dwProcessAffinityMask);

Limits the threads in a single process to run on a subset of the available CPUs.

BOOL GetProcessAffinityMask(
   HANDLE hProcess,
   PDWORD_PTR pdwProcessAffinityMask,
   PDWORD_PTR pdwSystemAffinityMask);

Returns a process’ affinity mask.

DWORD_PTR SetThreadAffinityMask(
   HANDLE hThread,
   DWORD_PTR dwThreadAffinityMask);

Sets the affinity mask for an individual thread.

DWORD SetThreadIdealProcessor(
   HANDLE hThread,
   DWORD dwIdealProcessor);

Sets an ideal CPU for a thread. Use this function when you want a thread to run on a particular CPU but still allow it to migrate to another CPU if one is available.

References

Windows® via C/C++, Fifth Edition

Thread Basics

A thread consists of two components:

  1. A kernel object that the operating system uses to manage the thread. The kernel object is also where the system keeps statistical information about the thread.
  2. A thread stack that maintains all the function parameters and local variables required as the thread executes code.

Threads are always created in the context of some process and live their entire life within that process. What this really means is that the thread executes code and manipulates data within its process’ address space. So if you have two or more threads running in the context of a single process, the threads share a single address space. The threads can execute the same code and manipulate the same data. Threads can also share kernel object handles because the handle table exists for each process, not each thread.

Your First Thread Function

Every thread must have an entry-point function where it begins execution. We already discussed this entry-point function for your primary thread: _tmain or _tWinMain. If you want to create a secondary thread in your process, it must also have an entry-point function, which should look something like this:

DWORD WINAPI ThreadFunc(PVOID pvParam) {
   DWORD dwResult = 0;
   ...
   return(dwResult);
}

The CreateThread Function

If you want to create one or more secondary threads, you simply have an already running thread call CreateThread.

HANDLE CreateThread(
   PSECURITY_ATTRIBUTES psa,
   DWORD cbStackSize,
   PTHREAD_START_ROUTINE pfnStartAddr,
   PVOID pvParam,
   DWORD dwCreateFlags,
   PDWORD pdwThreadID);

The CreateThread function is the Windows function that creates a thread. However, if you are writing C/C++ code, you should never call CreateThread. Instead, you should use the Microsoft C++ run-time library function _beginthreadex.

Thread Stack Size

The cbStackSize parameter specifies how much address space the thread can use for its own stack. Every thread owns its own stack. When CreateProcess starts a process, it internally calls CreateThread to initialize the process’ primary thread. For the cbStackSize parameter, CreateProcess uses a value stored inside the executable file. You can control this value using the linker’s /STACK switch:

/STACK:[reserve][,commit]

The reserve argument sets the amount of address space the system should reserve for the thread’s stack. The default is 1 MB. The commit argument specifies the amount of physical storage that should be initially committed to the stack’s reserved region.

When you call CreateThread, passing a value other than 0 causes the function to reserve and commit all storage for the thread’s stack. The amount of reserved space is either the amount specified by the /STACK linker switch or the value of cbStackSize, whichever is larger. If you pass 0 to the cbStackSize parameter, CreateThread reserves a region and commits the amount of storage indicated by the /STACK linker switch information embedded in the .exe file by the linker.

Thread Termination

A thread can be terminated in four ways:

  1. The thread function returns. (This is highly recommended.)
  2. The thread kills itself by calling the ExitThread function. (Avoid this method.)
  3. A thread in the same process or in another one calls the TerminateThread function. (Avoid this method.)
  4. The process containing the thread terminates. (Avoid this method.)

The Thread Function Returns

You should always design your thread functions so that they return when you want the thread to terminate. This is the only way to guarantee that all your thread’s resources are cleaned up properly.

Having your thread function return ensures the following:

  • All C++ objects created in your thread function will be destroyed properly via their destructors.
  • The operating system will properly free the memory used by the thread’s stack.
  • The system will set the thread’s exit code (maintained in the thread’s kernel object) to your thread function’s return value.
  • The system will decrement the usage count of the thread’s kernel object.

When a thread dies by returning or calling ExitThread, the stack for the thread is destroyed. However, if TerminateThread is used, the system does not destroy the thread’s stack until the process that owned the thread terminates.

If several threads run concurrently in your application, you need to explicitly handle how each one stops before the main thread returns. Otherwise, all other running threads will die abruptly and silently.

When a Thread Terminates

The following actions occur when a thread terminates:

  1. All User object handles owned by the thread are freed. In Windows, most objects are owned by the process containing the thread that creates the objects. However, a thread owns two User objects: windows and hooks. When a thread dies, the system automatically destroys any windows and uninstalls any hooks that were created or installed by the thread. Other objects are destroyed only when the owning process terminates.
  2. The thread’s exit code changes from STILL_ACTIVE to the code passed to ExitThread or TerminateThread.
  3. The state of the thread kernel object becomes signaled.
  4. If the thread is the last active thread in the process, the system considers the process terminated as well.
  5. The thread kernel object’s usage count is decremented by 1.

Working with C/C++ Run Time Libraries

To create a new thread, you must not call the operating system’s CreateThread function—you must call the C/C++ run-time library function _beginthreadex:

unsigned long _beginthreadex(
   void *security,
   unsigned stack_size,
   unsigned (*start_address)(void *),
   void *arglist,
   unsigned initflag,
   unsigned *thrdaddr);

The _beginthreadex function has the same parameter list as the CreateThread function, but the parameter names and types are not exactly the same.

If you really want to forcibly kill your thread, you can have it call _endthreadex (instead of ExitThread).

The C/C++ run-time library also places synchronization primitives around certain functions. For example, if two threads simultaneously call malloc, the heap can become corrupted. The C/C++ run-time library prevents two threads from allocating memory from the heap at the same time. It does this by making the second thread wait until the first has returned from malloc. Then the second thread is allowed to enter. Obviously, all this additional work affects the performance of the multithreaded version of the C/C++ run-time library.

Gaining a Sense of One’s Own Identity

Windows offers functions that make it easy for a thread to refer to its process kernel object or to its own thread kernel object:

HANDLE GetCurrentProcess();
HANDLE GetCurrentThread();

The following functions allow a thread to query its process’ unique ID or its own unique ID:

DWORD GetCurrentProcessId();
DWORD GetCurrentThreadId();

Converting a Pseudohandle to a Real Handle

Usually, you use the DuplicateHandle function to create a new process-relative handle from a kernel object handle that is relative to another process. However, you can also use it in an unusual way to convert a pseudohandle to a real handle:

DWORD WINAPI ChildThread(PVOID pvParam);   // Forward declaration

DWORD WINAPI ParentThread(PVOID pvParam) {
   HANDLE hThreadParent;

   DuplicateHandle(
      GetCurrentProcess(),     // Handle of process that thread
                               // pseudohandle is relative to

      GetCurrentThread(),      // Parent thread's pseudohandle
      GetCurrentProcess(),     // Handle of process that the new, real,
                               // thread handle is relative to

      &hThreadParent,          // Will receive the new, real, handle
                               // identifying the parent thread
      0,                       // Ignored due to DUPLICATE_SAME_ACCESS
      FALSE,                   // New thread handle is not inheritable
      DUPLICATE_SAME_ACCESS);  // New thread handle has same
                               // access as pseudohandle

   CreateThread(NULL, 0, ChildThread, (PVOID) hThreadParent, 0, NULL);
   // Function continues...
}

DWORD WINAPI ChildThread(PVOID pvParam) {
   HANDLE hThreadParent = (HANDLE) pvParam;
   FILETIME ftCreationTime, ftExitTime, ftKernelTime, ftUserTime;
   GetThreadTimes(hThreadParent,
      &ftCreationTime, &ftExitTime, &ftKernelTime, &ftUserTime);
   CloseHandle(hThreadParent);
   // Function continues...
}

Now when the parent thread executes, it converts the ambiguous pseudohandle identifying the parent thread to a new, real handle that unambiguously identifies the parent thread, and it passes this real handle to CreateThread. When the child thread starts executing, its pvParam parameter contains the real thread handle. Any calls to functions passing this handle will affect the parent thread, not the child thread.

Because DuplicateHandle increments the usage count of the specified kernel object, it is important to decrement the object’s usage count by passing the target handle to CloseHandle when you finish using the duplicated object handle.

API Table

Function

Description

HANDLE CreateThread(
   PSECURITY_ATTRIBUTES psa,
   DWORD cbStackSize,
   PTHREAD_START_ROUTINE pfnStartAddr,
   PVOID pvParam,
   DWORD dwCreateFlags,
   PDWORD pdwThreadID);

Creates a secondary thread. Any thread that is already running can call it.

VOID ExitThread(DWORD dwExitCode);

Forces the calling thread to terminate.

BOOL TerminateThread(
   HANDLE hThread,
   DWORD dwExitCode);

Unlike ExitThread, which always kills the calling thread, TerminateThread can kill any thread.

BOOL GetExitCodeThread(
   HANDLE hThread,
   PDWORD pdwExitCode);

Check whether the thread identified by hThread has terminated and, if it has, determine its exit code.

The exit code value is returned in the DWORD pointed to by pdwExitCode. If the thread hasn’t terminated when GetExitCodeThread is called, the function fills the DWORD with the STILL_ACTIVE identifier (defined as 0x103). If the function is successful, TRUE is returned.

References

Windows® via C/C++, Fifth Edition

Processes Cont’d – Process Termination

A process can be terminated in four ways:

  1. The primary thread’s entry-point function returns. (This is highly recommended.)
  2. One thread in the process calls the ExitProcess function. (Avoid this method.)
  3. A thread in another process calls the TerminateProcess function. (Avoid this method.)
  4. All the threads in the process just die on their own. (This hardly ever happens.)

This section discusses all four methods and describes what actually happens when a process ends.

The Primary Thread’s Entry-Point Function Returns

You should always design an application so that its process terminates only when your primary thread’s entry-point function returns. This is the only way to guarantee that all your primary thread’s resources are cleaned up properly.

Having your primary thread’s entry-point function return ensures the following:

  • Any C++ objects created by this thread will be destroyed properly using their destructors.
  • The operating system will properly free the memory used by the thread’s stack.
  • The system will set the process’ exit code (maintained in the process kernel object) to your entry-point function’s return value.
  • The system will decrement the process kernel object’s usage count.

The ExitProcess Function

A process terminates when one of the threads in the process calls ExitProcess:

VOID ExitProcess(UINT fuExitCode);

This function terminates the process and sets the exit code of the process to fuExitCode. ExitProcess doesn’t return a value because the process has terminated. If you include any code following the call to ExitProcess, that code will never execute.

When your primary thread’s entry-point function (WinMain, wWinMain, main, or wmain) returns, it returns to the C/C++ run-time startup code, which properly cleans up all the C run-time resources used by the process. After the C run-time resources have been freed, the C run-time startup code explicitly calls ExitProcess, passing it the value returned from your entry-point function.

By simply allowing the primary thread’s entry-point function to return, the C/C++ run time can perform its cleanup and properly destruct all C++ objects. By the way, this discussion does not apply only to C++ objects. The C/C++ run time does many things on behalf of your process; it is best to allow the run time to clean it up properly.

Making explicit calls to ExitProcess and ExitThread is a common problem that causes an application to not clean itself up properly. In the case of ExitThread, the process continues to run but can leak memory or other resources.

The TerminateProcess Function

A call to TerminateProcess also ends a process:

BOOL TerminateProcess(
   HANDLE hProcess,
   UINT fuExitCode);

This function is different from ExitProcess in one major way: any thread can call TerminateProcess to terminate another process or its own process. The hProcess parameter identifies the handle of the process to be terminated. When the process terminates, its exit code becomes the value you passed as the fuExitCode parameter.

You should use TerminateProcess only if you can’t force a process to exit by using another method. The process being terminated is given absolutely no notification that it is dying—the application cannot clean up properly and cannot prevent itself from being killed (except by normal security mechanisms). For example, the process cannot flush any information it might have in memory out to disk.

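Even though the terminated process gets no chance to clean up after itself, the operating system does clean up completely: all of the process’ memory is freed, any open files are closed, the usage counts of its kernel objects are decremented, and its User and GDI objects are destroyed. A process will leak absolutely nothing once it has terminated. I hope that this is clear.

The TerminateProcess function is asynchronous: it tells the system that you want the process to terminate, but the process is not guaranteed to be dead by the time the function returns. If you need to know for sure that the process has terminated, call WaitForSingleObject, passing the handle of the process.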

When All the Threads in the Process Die

If all the threads in a process die (either because they’ve all called ExitThread or because they’ve been terminated with TerminateThread), the operating system assumes that there is no reason to keep the process’ address space around. This is a fair assumption because there are no more threads executing any code in the address space. When the system detects that no threads are running any more, it terminates the process. When this happens, the process’ exit code is set to the same exit code as the last thread that died.

When a Process Terminates

When a process terminates, the following actions are set in motion:

  1. Any remaining threads in the process are terminated.
  2. All the User and GDI objects allocated by the process are freed, and all the kernel objects are closed. (These kernel objects are destroyed if no other process has open handles to them. However, the kernel objects are not destroyed if other processes do have open handles to them.)
  3. The process’ exit code changes from STILL_ACTIVE to the code passed to ExitProcess or TerminateProcess.
  4. The process kernel object’s status becomes signaled. This is why other threads in the system can suspend themselves until the process is terminated.
  5. The process kernel object’s usage count is decremented by 1.

Note that a process’ kernel object always lives at least as long as the process itself. However, the process kernel object might live well beyond its process. When a process terminates, the system automatically decrements the usage count of its kernel object. If the count goes to 0, no other process has an open handle to the object and the object is destroyed when the process is destroyed.

Once again, let me remind you that you should tell the system when you are no longer interested in a process’ statistical data by calling CloseHandle. If the process has already terminated, CloseHandle will decrement the count on the kernel object and free it.

References

Windows® via C/C++, Fifth Edition

Processes Cont’d – Beauty of CreateProcess Function

The CreateProcess Function

You create a process with the CreateProcess function:

BOOL CreateProcess(
   PCTSTR pszApplicationName,
   PTSTR pszCommandLine,
   PSECURITY_ATTRIBUTES psaProcess,
   PSECURITY_ATTRIBUTES psaThread,
   BOOL bInheritHandles,
   DWORD fdwCreate,
   PVOID pvEnvironment,
   PCTSTR pszCurDir,
   PSTARTUPINFO psiStartInfo,
   PPROCESS_INFORMATION ppiProcInfo);

When a thread calls CreateProcess:

  1. The system creates a process kernel object with an initial usage count of 1. This process kernel object is not the process itself but a small data structure that the operating system uses to manage the process.
  2. The system then creates a virtual address space for the new process and loads the code and data for the executable file and any required DLLs into the process’ address space.
  3. The system then creates a thread kernel object (with a usage count of 1) for the new process’ primary thread. Like the process kernel object, the thread kernel object is a small data structure that the operating system uses to manage the thread.
  4. This primary thread begins by executing the application entry point set by the linker as the C/C++ run-time startup code, which eventually calls your WinMain, wWinMain, main, or wmain function.
  5. If the system successfully creates the new process and primary thread, CreateProcess returns TRUE.

pszApplicationName and pszCommandLine

The pszApplicationName and pszCommandLine parameters specify the name of the executable file the new process will use and the command-line string that will be passed to the new process, respectively.

Notice that the pszCommandLine parameter is prototyped as a PTSTR. This means that CreateProcess expects that you are passing the address of a non-constant string. Internally, CreateProcess actually does modify the command-line string that you pass to it. But before CreateProcess returns, it restores the string to its original form.
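
Because CreateProcess writes into the command-line buffer, a string literal, which the compiler may place in read-only memory, is not a safe argument. A minimal sketch of one way to prepare a writable buffer follows. The helper name MakeWritableCommandLine is hypothetical, and the CreateProcess call itself is left out so that the sketch compiles on any platform.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: copies a command line into a writable,
// zero-terminated buffer owned by the caller.
std::vector<wchar_t> MakeWritableCommandLine(const std::wstring& commandLine) {
    std::vector<wchar_t> buffer(commandLine.begin(), commandLine.end());
    buffer.push_back(L'\0');  // the callee expects a zero-terminated string
    assert(buffer.size() == commandLine.size() + 1);
    return buffer;
}
```

The caller would keep the returned vector alive for the duration of the call and pass its data() pointer as the pszCommandLine argument.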

CreateProcess also searches for the executable in the following order:

  1. The directory containing the .exe file of the calling process
  2. The current directory of the calling process
  3. The Windows system directory—that is, the System32 subfolder as returned by GetSystemDirectory
  4. The Windows directory
  5. The directories listed in the PATH environment variable

All of this happens as long as the pszApplicationName parameter is NULL (which should be the case more than 99 percent of the time). Instead of passing NULL, you can pass the address to a string containing the name of the executable file you want to run in the pszApplicationName parameter. Note that you must specify the file’s extension; the system will not automatically assume that the filename has an .exe extension. CreateProcess assumes that the file is in the current directory unless a path precedes the filename. If the file can’t be found in the current directory, CreateProcess doesn’t look for the file in any other directory—it simply fails.

psaProcess, psaThread, and bInheritHandles

To create a new process, the system must create a process kernel object and a thread kernel object (for the process’ primary thread). Because these are kernel objects, the parent process gets the opportunity to associate security attributes with these two objects. You use the psaProcess and psaThread parameters to specify the desired security for the process object and the thread object, respectively. You can pass NULL for these parameters, in which case the system gives these objects default security descriptors. Or you can allocate and initialize two SECURITY_ATTRIBUTES structures to create and assign your own security privileges to the process and thread objects.

Another reason to use SECURITY_ATTRIBUTES structures for the psaProcess and psaThread parameters is if you want either of these two object handles to be inheritable by any child processes spawned in the future by this parent process.

fdwCreate

The fdwCreate parameter identifies flags that affect how the new process is created. You can specify multiple flags if you combine them with the bitwise OR operator.

The fdwCreate parameter also allows you to specify a priority class. However, you don’t have to do this, and for most applications you shouldn’t—the system will assign a default priority class to the new process.
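
As a small illustration of combining flags, the sketch below ORs creation flags together with an optional priority class. The constants are local copies of the Windows SDK values for CREATE_SUSPENDED, CREATE_NEW_CONSOLE, NORMAL_PRIORITY_CLASS, and IDLE_PRIORITY_CLASS, repeated here only so that the example compiles without windows.h. The helper names BuildCreateFlags and HasFlag are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Local copies of SDK values; real code should include <windows.h>
// and use the SDK's own definitions instead.
constexpr std::uint32_t kCreateSuspended     = 0x00000004;  // CREATE_SUSPENDED
constexpr std::uint32_t kCreateNewConsole    = 0x00000010;  // CREATE_NEW_CONSOLE
constexpr std::uint32_t kNormalPriorityClass = 0x00000020;  // NORMAL_PRIORITY_CLASS
constexpr std::uint32_t kIdlePriorityClass   = 0x00000040;  // IDLE_PRIORITY_CLASS

// Combines creation flags and an optional priority class with bitwise OR.
// Passing 0 for priorityClass leaves the choice to the system.
constexpr std::uint32_t BuildCreateFlags(bool suspended, bool newConsole,
                                         std::uint32_t priorityClass = 0) {
    return (suspended ? kCreateSuspended : 0u)
         | (newConsole ? kCreateNewConsole : 0u)
         | priorityClass;
}

// Returns true if every bit of 'flag' is set in 'flags'.
constexpr bool HasFlag(std::uint32_t flags, std::uint32_t flag) {
    return (flags & flag) == flag;
}
```

The resulting value is what would be passed as fdwCreate.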

pvEnvironment

The pvEnvironment parameter points to a block of memory that contains environment strings that the new process will use. Most of the time, NULL is passed for this parameter, causing the child process to inherit the set of environment strings that its parent is using.

pszCurDir

The pszCurDir parameter allows the parent process to set the child process’ current drive and directory. If this parameter is NULL, the new process’ working directory will be the same as that of the application spawning the new process. If this parameter is not NULL, pszCurDir must point to a zero-terminated string containing the desired working drive and directory. Notice that you must specify a drive letter in the path.

psiStartInfo

The psiStartInfo parameter points to either a STARTUPINFO or a STARTUPINFOEX structure:

typedef struct _STARTUPINFO {
   DWORD cb;
   PSTR lpReserved;
   PSTR lpDesktop;
   PSTR lpTitle;
   DWORD dwX;
   DWORD dwY;
   DWORD dwXSize;
   DWORD dwYSize;
   DWORD dwXCountChars;
   DWORD dwYCountChars;
   DWORD dwFillAttribute;
   DWORD dwFlags;
   WORD wShowWindow;
   WORD cbReserved2;
   PBYTE lpReserved2;
   HANDLE hStdInput;
   HANDLE hStdOutput;
   HANDLE hStdError;
} STARTUPINFO, *LPSTARTUPINFO;

typedef struct _STARTUPINFOEX {
    STARTUPINFO StartupInfo;
    struct _PROC_THREAD_ATTRIBUTE_LIST *lpAttributeList;
} STARTUPINFOEX, *LPSTARTUPINFOEX;

Windows uses the members of this structure when it creates the new process. Most applications will want the spawned application simply to use default values. At a minimum, you should initialize all the members in this structure to zero and then set the cb member to the size of the structure:

STARTUPINFO si = { sizeof(si) };
CreateProcess(..., &si, ...);

If you fail to zero the contents of the structure, the members will contain whatever garbage is on the calling thread’s stack.
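
The one-line initializer works because of a C++ language rule: when an aggregate is initialized with fewer values than it has members, the remaining members are zero-initialized. The sketch below shows this with DemoStartupInfo, a hypothetical, much smaller stand-in for STARTUPINFO, so that it compiles without the Windows headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in that mimics the shape of STARTUPINFO:
// a leading size member followed by other fields.
struct DemoStartupInfo {
    std::uint32_t cb;           // size of the structure, in bytes
    const char*   lpTitle;
    std::uint32_t dwFlags;
    std::uint16_t wShowWindow;
};

DemoStartupInfo MakeDefaultStartupInfo() {
    // Only the first member is given a value; aggregate initialization
    // zeroes every member that is not listed.
    DemoStartupInfo si = { sizeof(si) };
    return si;
}
```

Declaring the structure without any initializer (DemoStartupInfo si;) would leave every member, including cb, with an indeterminate value.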

Now, if you want to initialize some of the members of the structure, you simply do so before the call to CreateProcess.

The dwFlags member contains a set of flags that modify how the child process is to be created. Most of the flags simply tell CreateProcess whether other members of the STARTUPINFO structure contain useful information or whether some of the members should be ignored.

ppiProcInfo

The ppiProcInfo parameter points to a PROCESS_INFORMATION structure that you must allocate; CreateProcess initializes the members of this structure before it returns. The structure appears as follows:

typedef struct _PROCESS_INFORMATION {
   HANDLE hProcess;
   HANDLE hThread;
   DWORD dwProcessId;
   DWORD dwThreadId;
} PROCESS_INFORMATION;

As already mentioned, creating a new process causes the system to create a process kernel object and a thread kernel object. At creation time, the system gives each object an initial usage count of 1. Then, just before CreateProcess returns, the function opens the process object and the thread object with full access, and it places the process-relative handles for each in the hProcess and hThread members of the PROCESS_INFORMATION structure. When CreateProcess opens these objects internally, the usage count for each becomes 2.

This means that before the system can free the process object, the process must terminate (decrementing the usage count by 1) and the parent process must call CloseHandle (decrementing the usage count again by 1, making it 0). Similarly, to free the thread object, the thread must terminate and the parent process must close the handle to the thread object.
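
The counting can be followed with a toy model. KernelObjectModel below is a hypothetical class written only for illustration. Real kernel objects are managed by the system, and their usage counts are not directly visible to application code.

```cpp
#include <cassert>

// Toy model of a kernel object's usage count; not a Windows structure.
class KernelObjectModel {
public:
    // Creation sets the count to 1, and the handle that CreateProcess
    // returns to the parent raises it to 2, as the text describes.
    KernelObjectModel() : usageCount_(2) {}

    void OnTerminated()  { Release(); }  // the process (or thread) ends
    void OnCloseHandle() { Release(); }  // the parent calls CloseHandle

    int  UsageCount() const { return usageCount_; }
    bool Destroyed()  const { return usageCount_ == 0; }

private:
    void Release() {
        assert(usageCount_ > 0);  // releasing a destroyed object is a bug
        --usageCount_;
    }
    int usageCount_;
};
```

In this model the two events can arrive in either order, and the object is destroyed only after both have happened. A parent that never closes its handles keeps the object alive after the child has exited.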

When a process kernel object is created, the system assigns the object a unique identifier; no other process kernel object in the system will have the same ID number. The same is true for thread kernel objects: when a thread kernel object is created, it is assigned a unique, systemwide ID number. Process IDs and thread IDs share the same number pool, so it is impossible for a process and a thread to have the same ID. In addition, an object is never assigned an ID of 0.

Notice that Windows Task Manager associates a process ID of 0 with the "System Idle Process." However, there is really no such thing as the "System Idle Process." Task Manager creates this fictitious process as a placeholder for the Idle thread that runs when nothing else is running. The number of threads in the System Idle Process is always equal to the number of CPUs in the machine. As such, it always represents the percentage of CPU usage that is not being used by real processes.

You can discover the ID of the current process by using GetCurrentProcessId and the ID of the running thread by calling GetCurrentThreadId. You can also get the ID of a process given its handle by using GetProcessId and the ID of a thread given its handle by using GetThreadId. Last but not least, from a thread handle, you can determine the ID of its owning process by calling GetProcessIdOfThread.

References

Windows® via C/C++, Fifth Edition

Processes

What is a process?

A process is usually defined as an instance of a running program and consists of two components:

  • A kernel object that the operating system uses to manage the process. The kernel object is also where the system keeps statistical information about the process.
  • An address space that contains all the executable or dynamic-link library (DLL) module’s code and data. It also contains dynamic memory allocations such as thread stacks and heap allocations.

For a process to accomplish anything, it must have a thread that runs in its context; this thread is responsible for executing the code contained in the process’ address space. In fact, a single process might contain several threads, all of them executing code "simultaneously" in the process’ address space. To do this, each thread has its own set of CPU registers and its own stack. Each process has at least one thread that executes code in the process’ address space. When a process is created, the system automatically creates its first thread, called the primary thread. This thread can then create additional threads, and these can in turn create even more threads. If there were no threads executing code in the process’ address space, there would be no reason for the process to continue to exist, and the system would automatically destroy the process and its address space.

For all of these threads to run, the operating system schedules some CPU time for each thread. It creates the illusion that all the threads run concurrently by offering time slices (called quantums) to the threads in a round-robin fashion.

Your First Windows Application

Windows supports two types of applications: those based on a graphical user interface (GUI) and those based on a console user interface (CUI). A GUI-based application has a graphical front end.

When you use Microsoft Visual Studio to create an application project, the integrated environment sets up various linker switches so that the linker embeds the proper type of subsystem in the resulting executable. This linker switch is /SUBSYSTEM:CONSOLE for CUI applications and /SUBSYSTEM:WINDOWS for GUI applications. When the user runs an application, the operating system’s loader looks inside the executable image’s header and grabs this subsystem value.

Your Windows application must have an entry-point function that is called when the application starts running. As a C/C++ developer, there are two possible entry-point functions you can use:

int WINAPI _tWinMain(
   HINSTANCE hInstanceExe,
   HINSTANCE,
   PTSTR pszCmdLine,
   int nCmdShow);

int _tmain(
   int argc,
   TCHAR *argv[],
   TCHAR *envp[]);

     

Notice that the exact symbol depends on whether you are using Unicode strings or not. The table below tells you which entry point to implement in your source code and when.

image

The linker is responsible for choosing the proper C/C++ run-time startup function when it links your executable.

All of the C/C++ run-time startup functions do basically the same thing. The difference is in whether they process ANSI or Unicode strings and which entry-point function they call after they initialize the C run-time library. Here’s a summary of what the startup functions found in the crtexe.c file do:

  1. Retrieve a pointer to the new process’ full command line.
  2. Retrieve a pointer to the new process’ environment variables.
  3. Initialize the C/C++ run time’s global variables. Your code can access these variables if you include StdLib.h.
  4. Initialize the heap used by the C run-time memory allocation functions (malloc and calloc) and other low-level input/output routines.
  5. Call constructors for all global and static C++ class objects.

After all of this initialization, the C/C++ startup function calls your application’s entry-point function. If you wrote a _tWinMain function with _UNICODE defined, it is called as follows:

GetStartupInfo(&StartupInfo);
int nMainRetVal = wWinMain((HINSTANCE)&__ImageBase, NULL, pszCommandLineUnicode,
   (StartupInfo.dwFlags & STARTF_USESHOWWINDOW)
      ? StartupInfo.wShowWindow : SW_SHOWDEFAULT);

And it is called as follows without _UNICODE defined:

GetStartupInfo(&StartupInfo);
int nMainRetVal = WinMain((HINSTANCE)&__ImageBase, NULL, pszCommandLineAnsi,
   (StartupInfo.dwFlags & STARTF_USESHOWWINDOW)
      ? StartupInfo.wShowWindow : SW_SHOWDEFAULT);

Notice that __ImageBase is a linker-defined pseudo-variable that shows where the executable file is mapped into the application’s memory.

When your entry-point function returns, the startup function calls the C run-time exit function, passing it your return value (nMainRetVal). The exit function does the following:

  1. It calls any functions registered by calls to the _onexit function.
  2. It calls destructors for all global and static C++ class objects.
  3. In DEBUG builds, leaks in the C/C++ run-time memory management are listed by a call to the _CrtDumpMemoryLeaks function if the _CRTDBG_LEAK_CHECK_DF flag has been set.
  4. It calls the operating system’s ExitProcess function, passing it nMainRetVal. This causes the operating system to kill your process and set its exit code.

If you wrote a _tmain function, it is called as follows when _UNICODE is defined:

int nMainRetVal = wmain(argc, argv, envp);

And it is called as follows when _UNICODE is not defined:

int nMainRetVal = main(argc, argv, envp);

Process Entry-point Parameters

A Process’ Instance Handle

Every executable or DLL file loaded into a process’ address space is assigned a unique instance handle. Your executable file’s instance is passed as (w)WinMain’s first parameter, hInstanceExe. The handle’s value is typically needed for calls that load resources.

The actual value of (w)WinMain’s hInstanceExe parameter is the base memory address where the system loaded the executable file’s image into the process’ address space. For example, if the system opens the executable file and loads its contents at address 0x00400000, (w)WinMain’s hInstanceExe parameter has a value of 0x00400000.

The GetModuleHandle function, shown next, returns the handle/base address where an executable or DLL file is loaded in the process’ address space:

HMODULE GetModuleHandle(PCTSTR pszModule);

A Process’ Previous Instance Handle

As noted earlier, the C/C++ run-time startup code always passes NULL to (w)WinMain’s hPrevInstance parameter. This parameter was used in 16-bit Windows and remains a parameter to (w)WinMain solely to ease porting of 16-bit Windows applications. You should never reference this parameter inside your code.

A Process’ Command Line

When a new process is created, it is passed a command line. The command line is almost never blank; at the very least, the name of the executable file used to create the new process is the first token on the command line.

Following the example of the C run time, you can also obtain a pointer to your process’ complete command line by calling the GetCommandLine function:

PTSTR GetCommandLine();

This function returns a pointer to a buffer containing the full command line, including the full pathname of the executed file. Be aware that GetCommandLine always returns the address of the same buffer. This is another reason why you should not write into pszCmdLine: it points to the same buffer, and after you modify it, there is no way for you to know what the original command line was.

The CommandLineToArgvW function, declared in ShellAPI.h and exported by Shell32.dll, splits any Unicode string into its separate tokens:

PWSTR* CommandLineToArgvW(
   PWSTR pszCmdLine,
   int* pNumArgs);

As the W at the end of the function name implies, this function exists in a Unicode version only.

A Process’ Environment Variables

Every process has an environment block associated with it. An environment block is a block of memory allocated within the process’ address space that contains a set of strings with the following appearance, where '\0' means the null character:

=::=::\ ...
VarName1=VarValue1\0
VarName2=VarValue2\0
VarName3=VarValue3\0 ...
VarNameX=VarValueX\0
\0

The first part of each string is the name of an environment variable. This is followed by an equal sign, which is followed by the value you want to assign to the variable.
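
The layout described above (a run of null-terminated NAME=VALUE strings, closed by one extra null character) can be walked with plain pointer arithmetic. The following is a minimal portable C sketch that operates on a hand-built char buffer rather than a real environment block; on Windows the block itself would come from GetEnvironmentStrings. The helper names env_block_count and env_block_find are hypothetical, and the name comparison is case-sensitive purely to keep the sketch short.

```c
#include <stddef.h>
#include <string.h>

/* Count the strings in a block that ends with an empty string. */
size_t env_block_count(const char *block)
{
    size_t count = 0;
    for (const char *p = block; *p != '\0'; p += strlen(p) + 1)
        ++count;
    return count;
}

/* Return a pointer to the value stored for `name`, or NULL if the
   block has no such variable. Hypothetical helper for illustration;
   the comparison here is case-sensitive. */
const char *env_block_find(const char *block, const char *name)
{
    size_t name_len = strlen(name);
    for (const char *p = block; *p != '\0'; p += strlen(p) + 1) {
        /* Start the search for '=' at the second character so that
           names which themselves begin with '=' (such as "=C:")
           are split at the right place. */
        const char *eq = strchr(p + 1, '=');
        if (eq != NULL && (size_t)(eq - p) == name_len &&
            strncmp(p, name, name_len) == 0)
            return eq + 1;
    }
    return NULL;
}
```

Given a block built as "Path=C:\\Windows\0TEMP=C:\\Temp\0" (the string literal supplies the final null), env_block_count returns 2 and env_block_find(block, "TEMP") points at C:\Temp.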

When a user logs on to Windows, the system creates the shell process and associates a set of environment strings with it. The system obtains the initial set of environment strings by examining two keys in the registry.

The first key contains the list of all environment variables that apply to the system:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\
   Session Manager\Environment

The second key contains the list of all environment variables that apply to the user currently logged on:

HKEY_CURRENT_USER\Environment

Normally, a child process inherits a set of environment variables that are the same as those of its parent process. However, the parent process can control what environment variables a child inherits, as you’ll see later when we discuss the CreateProcess function. By inherit, I mean that the child process gets its own copy of the parent’s environment block; the child and parent do not share the same block. This means that a child process can add, delete, or modify a variable in its block and the change will not be reflected in the parent’s block.

If you want to use environment variables, there are a few functions that your applications can call. The GetEnvironmentVariable function allows you to determine the existence and value of an environment variable:

DWORD GetEnvironmentVariable(
   PCTSTR pszName,
   PTSTR pszValue,
   DWORD cchValue);

Many strings contain replaceable strings within them. For example, you can find this string somewhere in the registry:

%USERPROFILE%\Documents

The portion that appears in between percent signs (%) indicates a replaceable string. In this case, the value of the environment variable, USERPROFILE, should be placed in the string. On my machine, the value of my USERPROFILE environment variable is

C:\Users\jrichter

So, after performing the string replacement, the resulting string becomes

C:\Users\jrichter\Documents

Because this type of string replacement is common, Windows offers the ExpandEnvironmentStrings function:

DWORD ExpandEnvironmentStrings(
   PCTSTR pszSrc,
   PTSTR pszDst,
   DWORD chSize);
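
To make the replacement rule concrete, here is a minimal portable C sketch of %NAME% expansion. It is not the Windows implementation: the variable lookup is a hypothetical callback (demo_lookup knows only the USERPROFILE value from the example above), and the choices to copy an unknown %NAME% through unchanged and to return the required size including the terminator are assumptions of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup callback: returns the value of the variable
   whose name is the first `len` characters of `name`, or NULL. */
typedef const char *(*lookup_fn)(const char *name, size_t len);

/* Demo lookup that knows a single variable (value taken from the text). */
const char *demo_lookup(const char *name, size_t len)
{
    if (len == 11 && strncmp(name, "USERPROFILE", 11) == 0)
        return "C:\\Users\\jrichter";
    return NULL;
}

/* Copy n characters to dst at *pos if they fit (leaving room for the
   terminator); always advance *pos so the total size is known. */
static void emit(char *dst, size_t cap, size_t *pos, const char *s, size_t n)
{
    if (dst != NULL && *pos + n < cap)
        memcpy(dst + *pos, s, n);
    *pos += n;
}

/* Expand %NAME% sequences in src. Returns the number of characters
   needed for the result, including the terminating null. If that is
   larger than cap, the contents of dst are not usable. */
size_t expand_percent_vars(const char *src, char *dst, size_t cap,
                           lookup_fn lookup)
{
    size_t pos = 0;
    while (*src != '\0') {
        /* A candidate name runs from one '%' to the next '%'. */
        const char *end = (*src == '%') ? strchr(src + 1, '%') : NULL;
        const char *val = (end != NULL)
            ? lookup(src + 1, (size_t)(end - src - 1)) : NULL;
        if (val != NULL) {
            emit(dst, cap, &pos, val, strlen(val));
            src = end + 1;              /* skip past the closing '%' */
        } else {
            emit(dst, cap, &pos, src, 1);   /* copy literally */
            ++src;
        }
    }
    if (dst != NULL && pos < cap)
        dst[pos] = '\0';
    return pos + 1;
}
```

Calling expand_percent_vars("%USERPROFILE%\\Documents", out, sizeof out, demo_lookup) with a 64-character buffer leaves C:\Users\jrichter\Documents in out. Passing NULL and 0 for the buffer only measures the required size.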

Finally, you can use the SetEnvironmentVariable function to add a variable, delete a variable, or modify a variable’s value:

BOOL SetEnvironmentVariable(
   PCTSTR pszName,
   PCTSTR pszValue);

You should always use these functions for manipulating your process’ environment block.

Process Related

A Process’ Error Mode

Associated with each process is a set of flags that tells the system how the process should respond to serious errors, which include disk media failures, unhandled exceptions, file-find failures, and data misalignment. A process can tell the system how to handle each of these errors by calling the SetErrorMode function:

UINT SetErrorMode(UINT fuErrorMode);

The fuErrorMode parameter is a combination of any of the flags shown in the table below, bitwise ORed together.

image

A Process’ Current Drive and Directory

When full pathnames are not supplied, the various Windows functions look for files and directories in the current directory of the current drive. For example, if a thread in a process calls CreateFile to open a file (without specifying a full pathname), the system looks for the file in the current drive and directory.

The system keeps track of a process’ current drive and directory internally. Because this information is maintained on a per-process basis, a thread in the process that changes the current drive or directory changes this information for all the threads in the process.

A thread can obtain and set its process’ current drive and directory by calling the following two functions:

DWORD GetCurrentDirectory(
   DWORD cchCurDir,
   PTSTR pszCurDir);
BOOL SetCurrentDirectory(PCTSTR pszCurDir);

If the buffer you provide is not large enough, GetCurrentDirectory returns the number of characters required to store this folder, including the terminating '\0' character, and copies nothing into the provided buffer, which can be set to NULL in that case. When the call is successful, the length of the string in characters is returned, without counting the terminating '\0' character.
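
The asymmetric return value (required size including the terminator on failure, length excluding it on success) leads naturally to a two-call pattern: ask for the size, allocate, then fetch. The sketch below shows that pattern in portable C. To keep it buildable anywhere, it does not call GetCurrentDirectory; fake_get_current_directory is a hypothetical stand-in that only mimics the contract described above and always reports a fixed directory.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in that follows the return-value contract
   described above; the directory it reports is fixed. */
size_t fake_get_current_directory(size_t cch, char *buf)
{
    static const char dir[] = "C:\\Utility\\Bin";
    size_t len = sizeof dir - 1;
    if (buf == NULL || cch < len + 1)
        return len + 1;        /* too small: required size WITH '\0' */
    memcpy(buf, dir, len + 1);
    return len;                /* success: length WITHOUT '\0' */
}

/* Two-call pattern: query the size, allocate, then fetch.
   The caller frees the result. Returns NULL on failure. */
char *current_directory_dup(void)
{
    size_t needed = fake_get_current_directory(0, NULL);
    char *buf = malloc(needed);
    if (buf == NULL)
        return NULL;

    size_t written = fake_get_current_directory(needed, buf);
    if (written >= needed) {
        /* With a real API another thread could have changed the
           directory between the two calls, so the second call can
           still report "too small". Treat that as a failure here. */
        free(buf);
        return NULL;
    }
    return buf;
}
```

The check after the second call cannot trigger with the stand-in, but it mirrors what a real caller would need because the current directory is shared by all threads in the process.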

A Process’ Current Directories

The system keeps track of the process’ current drive and directory, but it does not keep track of the current directory for every drive. However, there is some operating system support for handling current directories for multiple drives. This support is offered via the process’ environment strings. For example, a process can have two environment variables, as shown here:

=C:=C:\Utility\Bin
=D:=D:\Program Files

These variables indicate that the process’ current directory for drive C is \Utility\Bin and that its current directory for drive D is \Program Files.

If you call a function, passing a drive-qualified name indicating a drive that is not the current drive, the system looks in the process’ environment block for the variable associated with the specified drive letter. If the variable for the drive exists, the system uses the variable’s value as the current directory. If the variable does not exist, the system assumes that the current directory for the specified drive is its root directory.

For example, if your process’ current directory is C:\Utility\Bin and you call CreateFile to open D:ReadMe.Txt, the system looks up the environment variable =D:. Because the =D: variable exists, the system attempts to open the ReadMe.Txt file from the D:\Program Files directory. If the =D: variable did not exist, the system would attempt to open the ReadMe.Txt file from the root directory of drive D. The Windows file functions never add or change a drive-letter environment variable—they only read the variables.
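
The lookup rule in this example can be sketched in portable C as follows. This is only an illustration of the rule as described, not the system's code: it works on a hand-built environment block, assumes the name always has the form X:rest, matches the drive letter case-sensitively, and the function names drive_dir and resolve_drive_relative are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Find the value of the "=X:" variable for `drive` in a block of
   null-terminated strings that ends with an empty string.
   Returns NULL if the variable is not present. */
static const char *drive_dir(const char *block, char drive)
{
    for (const char *p = block; *p != '\0'; p += strlen(p) + 1) {
        /* Short-circuit evaluation keeps every index inside the string. */
        if (p[0] == '=' && p[1] == drive && p[2] == ':' && p[3] == '=')
            return p + 4;
    }
    return NULL;
}

/* Resolve a drive-relative name such as "D:ReadMe.Txt" to a full path.
   Assumes `name` has the form "X:rest". Returns the snprintf result. */
int resolve_drive_relative(const char *block, const char *name,
                           char *out, size_t cap)
{
    const char *dir = drive_dir(block, name[0]);
    if (dir == NULL)   /* no variable: fall back to the root directory */
        return snprintf(out, cap, "%c:\\%s", name[0], name + 2);

    /* Avoid a doubled backslash when the stored directory is a root. */
    size_t n = strlen(dir);
    const char *sep = (n > 0 && dir[n - 1] == '\\') ? "" : "\\";
    return snprintf(out, cap, "%s%s%s", dir, sep, name + 2);
}
```

With a block containing =C:=C:\Utility\Bin and =D:=D:\Program Files, resolving D:ReadMe.Txt yields D:\Program Files\ReadMe.Txt, while E:ReadMe.Txt falls back to E:\ReadMe.Txt because no =E: variable exists.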

If a parent process creates an environment block that it wants to pass to a child process, the child’s environment block does not automatically inherit the parent process’ current directories. Instead, the child process’ current directories default to the root directory of every drive. If you want the child process to inherit the parent’s current directories, the parent process must create these drive-letter environment variables and add them to the environment block before spawning the child process. The parent process can obtain its current directories by calling GetFullPathName:

DWORD GetFullPathName(
   PCTSTR pszFile,
   DWORD cchPath,
   PTSTR pszPath,
   PTSTR *ppszFilePart);

For example, to get the current directory for drive C, you call GetFullPathName as follows:

TCHAR szCurDir[MAX_PATH];
DWORD cchLength = GetFullPathName(TEXT("C:"), MAX_PATH, szCurDir, NULL);

Because a process’ environment variables must be kept in alphabetical order, the drive-letter environment variables usually must be placed at the beginning of the environment block.

The System Version

Frequently, an application needs to determine which version of Windows the user is running. The GetVersionEx function returns this information:

BOOL GetVersionEx(POSVERSIONINFOEX pVersionInformation);

This function requires you to allocate an OSVERSIONINFOEX structure in your application and pass the structure’s address to GetVersionEx. The OSVERSIONINFOEX structure is shown here:

typedef struct {
   DWORD dwOSVersionInfoSize;
   DWORD dwMajorVersion;
   DWORD dwMinorVersion;
   DWORD dwBuildNumber;
   DWORD dwPlatformId;
   TCHAR szCSDVersion[128];
   WORD  wServicePackMajor;
   WORD  wServicePackMinor;
   WORD  wSuiteMask;
   BYTE  wProductType;
   BYTE  wReserved;
} OSVERSIONINFOEX, *POSVERSIONINFOEX;

The OSVERSIONINFOEX structure has been available since Windows 2000. Earlier versions of Windows use the older OSVERSIONINFO structure, which does not have the service pack, suite mask, product type, and reserved members.

References

Windows® via C/C++, Fifth Edition
