lhmouse/mcfgthread

The MCF Gthread Library

MCF Gthread is a threading support library for Windows 7 and above that implements the gthread interface set. GCC uses this interface internally to synchronize the initialization of local static objects, and libstdc++ uses it to provide C++11 threading facilities.

Warning

This project uses some undocumented NT system calls and is not guaranteed to work on every Windows version. The author gives no warranty for this project. Use it at your own risk.

Benchmark Results

This is the result of a benchmark program on Windows 11 Insider Preview (dev channel, build 26300.7760) on an Intel i9-14900K desktop processor (8 P-cores + 16 E-cores, 32 threads):

[Image: benchmark_result_win11_26300_i9_14900k]

This is the result on Windows 11 Insider Preview (beta channel, build 26220.7755) on a Qualcomm Snapdragon 8cx Gen 3 mobile processor (8 cores, no hyper-threading):

[Image: benchmark_result_win11_26220_snapdragon_8cxg3]

This is the result on Windows 7 SP1 on an Intel i7-7700 desktop processor (4 cores, 8 threads):

[Image: benchmark_result_win7_sp1_i7_7700]

This is the result on Wine 9.0 on Linux Mint 22 (kernel 6.11 low-latency) on an Intel i7-1165G7 mobile processor (4 cores, 8 threads):

[Image: benchmark_result_wine90_linux_i7_1165g7]

How to Build

Compiling natively can be done in MSYS2. We take the UCRT64 shell as an example. Others are similar. Clang shells are also supported.

pacman -S --noconfirm mingw-w64-ucrt-x86_64-{{headers,crt,tools}-git,gcc,binutils,meson}
meson setup build_debug
cd build_debug
ninja test

Cross-compiling from Debian, Ubuntu or Linux Mint is supported. In order to run tests, Wine is required.

sudo apt-get install -y --no-install-recommends mingw-w64-{x86-64-dev,tools}  \
        {gcc,g++,binutils}-mingw-w64-x86-64 meson wine wine-binfmt
meson setup --cross-file cross/gcc.x86_64-w64-mingw32 build_debug
cd build_debug
ninja test

Tip

In order for __cxa_atexit() (and the non-standard __cxa_at_quick_exit()) to conform to the Itanium C++ ABI, 1) a process must call __cxa_finalize(NULL) when exiting, and 2) a DLL must call __cxa_finalize(&__dso_handle) when it is unloaded dynamically. This requires a modified CRT. If you don't have the modified CRT, you may still get standard compliance by 1) calling __MCF_exit() instead of exit() from your program, and 2) calling __cxa_finalize(&__dso_handle) followed by fflush(NULL) upon receipt of DLL_PROCESS_DETACH in your DllMain().

Implementation Details

The condition variable

A condition variable is implemented as an atomic counter of threads that are currently waiting on it. Initially the counter is zero, which means no thread is waiting.

When a thread is about to start waiting on a condition variable, it increments the counter and suspends itself using the global keyed event, passing the address of the condition variable as the key. Another thread may read the counter to tell how many threads it will have to wake up (note this has to be atomic), and release them from the global keyed event, also passing the address of the condition variable as the key.
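The counter bookkeeping described above can be sketched in portable C11. This is an illustration under stated assumptions, not mcfgthread's actual code: all `sketch_` names are hypothetical, and the global keyed event is replaced by a per-object `wakeups` counter that a waiter polls, because the NT keyed-event calls are not available outside Windows. A full implementation would also unlock the associated mutex after registering as a waiter and relock it after waking; that part is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_size_t nwaiters;  /* threads currently waiting; zero means none */
    atomic_size_t wakeups;   /* ASSUMPTION: stand-in for the global keyed event */
} sketch_cond;

/* Stand-in for NtWaitForKeyedEvent(): consume one pending wakeup.
 * The real call sleeps in the kernel; this placeholder just polls. */
static void sketch_park(atomic_size_t *wakeups)
{
    for (;;) {
        size_t avail = atomic_load(wakeups);
        if (avail != 0 && atomic_compare_exchange_weak(wakeups, &avail, avail - 1))
            return;
    }
}

/* Stand-in for calling NtReleaseKeyedEvent() `n` times. */
static void sketch_unpark(atomic_size_t *wakeups, size_t n)
{
    atomic_fetch_add(wakeups, n);
}

/* Register as a waiter first, then go to sleep. */
void sketch_cond_wait(sketch_cond *cv)
{
    atomic_fetch_add(&cv->nwaiters, 1);
    sketch_park(&cv->wakeups);
}

/* Wake up to `max` waiters; returns how many were actually released.
 * Reading the counter and subtracting from it is one atomic step (a CAS
 * loop), so two signalling threads never claim the same waiter. */
size_t sketch_cond_signal_some(sketch_cond *cv, size_t max)
{
    size_t old = atomic_load(&cv->nwaiters);
    size_t n;
    do {
        n = (old < max) ? old : max;
        if (n == 0)
            return 0;  /* nobody to wake */
    } while (!atomic_compare_exchange_weak(&cv->nwaiters, &old, old - n));

    assert(n <= old);  /* never release more threads than were registered */
    sketch_unpark(&cv->wakeups, n);
    return n;
}
```

Because each signaller removes exactly the waiters it is going to release, the number of releases always matches the number of threads that have committed to sleeping.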

The primitive mutex

A primitive mutex is just a condition variable with a boolean bit, which designates whether the mutex is LOCKED. A mutex is initialized to all-bits-zero, which means it is unlocked and no thread is waiting.

When a thread wishes to lock a mutex, it checks whether the LOCKED bit is clear. If so, it sets the LOCKED bit and returns, having taken ownership of the mutex. If the LOCKED bit has been set by another thread, it goes to wait on the condition variable. When the owning thread unlocks the mutex, it clears the LOCKED bit and wakes up at most one thread waiting on the condition variable, if any.
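A minimal sketch of these lock and unlock transitions follows, assuming a single word whose bit 0 is LOCKED and whose remaining bits count waiters. The layout, the `sketch_` names and the polling `wakeups` counter (a stand-in for the keyed event) are assumptions made for illustration and are not taken from the library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_LOCKED  ((uintptr_t) 1)  /* bit 0: the mutex is owned */
#define SKETCH_WAITER  ((uintptr_t) 2)  /* one unit of the waiter counter */

typedef struct {
    atomic_uintptr_t word;     /* all-zero: unlocked, no thread waiting */
    atomic_uintptr_t wakeups;  /* ASSUMPTION: stand-in for the keyed event */
} sketch_mutex;

/* Stand-in for NtWaitForKeyedEvent(); the real call sleeps, this one polls. */
static void sketch_park(atomic_uintptr_t *wakeups)
{
    for (;;) {
        uintptr_t avail = atomic_load(wakeups);
        if (avail != 0 && atomic_compare_exchange_weak(wakeups, &avail, avail - 1))
            return;
    }
}

void sketch_mutex_lock(sketch_mutex *m)
{
    for (;;) {
        uintptr_t old = atomic_load(&m->word);
        uintptr_t desired;
        do {
            /* Either take the lock, or register as one more waiter. */
            desired = (old & SKETCH_LOCKED) ? old + SKETCH_WAITER
                                            : old | SKETCH_LOCKED;
        } while (!atomic_compare_exchange_weak(&m->word, &old, desired));

        if (!(old & SKETCH_LOCKED))
            return;  /* the bit was clear and we set it: we own the mutex */

        sketch_park(&m->wakeups);  /* sleep until an unlock wakes us, then retry */
    }
}

/* Returns true if a waiting thread was woken. */
bool sketch_mutex_unlock(sketch_mutex *m)
{
    uintptr_t old = atomic_load(&m->word);
    uintptr_t desired;
    do {
        assert(old & SKETCH_LOCKED);   /* only an owner may unlock */
        desired = old & ~SKETCH_LOCKED;
        if (desired != 0)
            desired -= SKETCH_WAITER;  /* we will wake exactly one waiter */
    } while (!atomic_compare_exchange_weak(&m->word, &old, desired));

    if (old == SKETCH_LOCKED)
        return false;                  /* nobody was waiting */
    atomic_fetch_add(&m->wakeups, 1);  /* stand-in for NtReleaseKeyedEvent() */
    return true;
}
```

Keeping the LOCKED bit and the waiter count in one word means a thread decides "take the lock" or "become a waiter" in a single compare-and-swap, so an unlock cannot slip in between the check and the registration.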

The 'real' mutex

In reality, critical sections are fairly small. If a thread fails to lock a mutex, it might be able to do so soon, and we don't want it to give up its time slice, because a syscall would be overkill. Therefore, it is reasonable for a thread to perform some spinning (busy waiting) before it actually decides to sleep.

This could, however, lead to severe problems in case of heavy contention. When there are hundreds of threads attempting to lock the same mutex, the system scheduler has no idea whether they are spinning or not. As it is likely that a lot of threads will eventually give up spinning and make a syscall to sleep, we are wasting a lot of CPU time and aggravating the situation.

mcfgthread solves this issue by encoding a spin failure counter in each mutex. If a thread gives up spinning because it couldn't lock the mutex within a given number of iterations, the spin failure counter is incremented. If a thread locks a mutex successfully while it is spinning, the spin failure counter is decremented. This counter provides a heuristic way to determine how heavily a mutex is contended. If there have been many spin failures, newcomers will not attempt to spin, but will make a syscall to sleep on the mutex directly.
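The heuristic can be sketched as follows. The bit layout, the 4-bit saturating counter, the threshold of 8 and the spin limit of 1000 are all hypothetical values picked for this sketch; the source does not state the real ones. As before, the `sketch_` names are made up and a polled `wakeups` counter stands in for the keyed event.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout and tuning values, chosen for this sketch only. */
#define SK_LOCKED      ((uintptr_t) 0x01)  /* bit 0 */
#define SK_NFAIL_ONE   ((uintptr_t) 0x02)  /* bits 1-4: spin failure counter */
#define SK_NFAIL_MASK  ((uintptr_t) 0x1E)
#define SK_WAITER_ONE  ((uintptr_t) 0x20)  /* bits 5 and up: sleeping threads */
#define SK_NFAIL_BUSY  (8 * SK_NFAIL_ONE)  /* at or above this, do not spin */
#define SK_SPIN_LIMIT  1000

typedef struct {
    atomic_uintptr_t word;
    atomic_uintptr_t wakeups;  /* ASSUMPTION: stand-in for the keyed event */
} sketch_mutex;

/* Newcomers spin only if few recent spinners have failed. */
bool sketch_should_spin(uintptr_t word)
{
    return (word & SK_NFAIL_MASK) < SK_NFAIL_BUSY;
}

/* Spin phase. Returns true if the lock was acquired while spinning. */
bool sketch_mutex_spin(sketch_mutex *m)
{
    uintptr_t old, desired;
    for (int i = 0; i < SK_SPIN_LIMIT; i++) {
        old = atomic_load(&m->word);
        if (old & SK_LOCKED)
            continue;
        desired = old | SK_LOCKED;
        if (desired & SK_NFAIL_MASK)
            desired -= SK_NFAIL_ONE;  /* a spin succeeded: decrement */
        if (atomic_compare_exchange_strong(&m->word, &old, desired))
            return true;
    }
    /* Gave up: record one spin failure, saturating at the maximum. */
    old = atomic_load(&m->word);
    while ((old & SK_NFAIL_MASK) != SK_NFAIL_MASK
           && !atomic_compare_exchange_weak(&m->word, &old, old + SK_NFAIL_ONE))
        ;
    return false;
}

void sketch_mutex_lock(sketch_mutex *m)
{
    if (sketch_should_spin(atomic_load(&m->word)) && sketch_mutex_spin(m))
        return;
    for (;;) {  /* fall back to the primitive mutex path */
        uintptr_t old = atomic_load(&m->word);
        uintptr_t desired;
        do {
            desired = (old & SK_LOCKED) ? old + SK_WAITER_ONE : old | SK_LOCKED;
        } while (!atomic_compare_exchange_weak(&m->word, &old, desired));
        if (!(old & SK_LOCKED))
            return;
        for (;;) {  /* stand-in for NtWaitForKeyedEvent(): poll for a wakeup */
            uintptr_t w = atomic_load(&m->wakeups);
            if (w != 0 && atomic_compare_exchange_weak(&m->wakeups, &w, w - 1))
                break;
        }
    }
}

void sketch_mutex_unlock(sketch_mutex *m)
{
    uintptr_t old = atomic_load(&m->word);
    uintptr_t desired;
    do {
        assert(old & SK_LOCKED);
        desired = old & ~SK_LOCKED;
        if (desired >= SK_WAITER_ONE)
            desired -= SK_WAITER_ONE;      /* claim one sleeper to wake */
    } while (!atomic_compare_exchange_weak(&m->word, &old, desired));
    if (old >= SK_WAITER_ONE)
        atomic_fetch_add(&m->wakeups, 1);  /* stand-in for NtReleaseKeyedEvent() */
}
```

The counter lives in the same word as the LOCKED bit, so a successful spin can take the lock and lower the counter in one compare-and-swap, and a thread arriving at a heavily contended mutex can skip spinning after a single load.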

The once-initialization flag

A once-initialization flag contains a READY byte (the first byte, as the Itanium ABI specifies) which indicates whether initialization has completed. The other bytes are used as a primitive mutex.

A thread that sees the READY byte set to non-zero knows initialization has been done, so it will return immediately. A thread that sees the READY byte set to zero will lock the bundled primitive mutex, and shall perform initialization thereafter. If initialization fails, it unlocks the primitive mutex without setting the READY byte, so the next thread that locks the primitive mutex will perform initialization. If initialization is successful, it sets the READY byte and unlocks the primitive mutex, releasing all threads that are waiting on it. (Do you remember that a primitive mutex actually contains a condition variable?)
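The three outcomes described above (already initialized, initialization failed, initialization succeeded) can be sketched as three operations on one word. The `sketch_` names and the bit layout are hypothetical; READY is placed in the low byte, which is the first byte in memory only on little-endian targets, and the keyed event is again replaced by a polled `wakeups` counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SK_READY   ((uintptr_t) 0x001)  /* low byte: initialization completed */
#define SK_LOCKED  ((uintptr_t) 0x100)  /* primitive mutex: LOCKED bit */
#define SK_WAITER  ((uintptr_t) 0x200)  /* primitive mutex: one waiting thread */

typedef struct {
    atomic_uintptr_t word;     /* all-zero: not ready, unlocked, no waiters */
    atomic_uintptr_t wakeups;  /* ASSUMPTION: stand-in for the keyed event */
} sketch_once;

/* Returns 0 if initialization is already done. Returns 1 if the caller now
 * holds the bundled mutex and must perform initialization itself. */
int sketch_once_wait(sketch_once *o)
{
    for (;;) {
        uintptr_t old = atomic_load(&o->word);
        uintptr_t desired;
        do {
            if (old & SK_READY)
                return 0;
            desired = (old & SK_LOCKED) ? old + SK_WAITER : old | SK_LOCKED;
        } while (!atomic_compare_exchange_weak(&o->word, &old, desired));

        if (!(old & SK_LOCKED))
            return 1;

        for (;;) {  /* stand-in for NtWaitForKeyedEvent(): poll for a wakeup */
            uintptr_t w = atomic_load(&o->wakeups);
            if (w != 0 && atomic_compare_exchange_weak(&o->wakeups, &w, w - 1))
                break;
        }
    }
}

/* Initialization failed: unlock without READY and wake at most one waiter,
 * which will then retry the initialization. */
void sketch_once_abort(sketch_once *o)
{
    uintptr_t old = atomic_load(&o->word);
    uintptr_t desired;
    do {
        assert(old & SK_LOCKED);
        desired = old & ~SK_LOCKED;
        if (desired >= SK_WAITER)
            desired -= SK_WAITER;
    } while (!atomic_compare_exchange_weak(&o->word, &old, desired));
    if (old >= SK_WAITER)
        atomic_fetch_add(&o->wakeups, 1);
}

/* Initialization succeeded: set READY, unlock, and wake every waiter. */
void sketch_once_release(sketch_once *o)
{
    uintptr_t old = atomic_exchange(&o->word, SK_READY);
    assert(old & SK_LOCKED);
    atomic_fetch_add(&o->wakeups, old / SK_WAITER);  /* old / SK_WAITER = waiter count */
}
```

A caller would run its initializer only when `sketch_once_wait()` returns 1, then finish with `sketch_once_release()` on success or `sketch_once_abort()` on failure. Releasing all waiters at once works because the waiter count sits in the same word as the READY byte, which is the property the closing remark in the text alludes to.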

List of Imported Functions

| Function | DLL | Category |
|---|---|---|
| BaseGetNamedObjectDirectory | KERNEL32 | Undocumented |
| CreateThread | KERNEL32 | Windows API |
| DecodePointer | KERNEL32, NTDLL | Windows API |
| EncodePointer | KERNEL32, NTDLL | Windows API |
| ExitThread | KERNEL32 | Windows API |
| FormatMessageW | KERNEL32 | Windows API |
| GetCurrentProcessId | KERNEL32 | Windows API |
| GetLastError | KERNEL32 | Windows API |
| GetModuleFileNameW | KERNEL32 | Windows API |
| GetModuleHandleExW | KERNEL32 | Windows API |
| GetProcAddress | KERNEL32 | Windows API |
| GetProcessHeap | KERNEL32 | Windows API |
| GetSystemInfo | KERNEL32 | Windows API |
| GetSystemTimeAsFileTime | KERNEL32 | Windows API |
| GetThreadPriority | KERNEL32 | Windows API |
| GetTickCount64 | KERNEL32 | Windows API |
| HeapAlloc | KERNEL32 | Windows API |
| HeapFree | KERNEL32 | Windows API |
| HeapReAlloc | KERNEL32 | Windows API |
| HeapSetInformation | KERNEL32 | Windows API |
| HeapSize | KERNEL32 | Windows API |
| NtClose | NTDLL | Windows Driver API |
| NtCreateSection | NTDLL | Windows Driver API |
| NtDelayExecution | NTDLL | Undocumented |
| NtDuplicateObject | NTDLL | Windows Driver API |
| NtMapViewOfSection | NTDLL | Windows Driver API |
| NtRaiseHardError | NTDLL | Undocumented |
| NtReleaseKeyedEvent | NTDLL | Undocumented |
| NtUnmapViewOfSection | NTDLL | Windows Driver API |
| NtWaitForKeyedEvent | NTDLL | Undocumented |
| NtWaitForSingleObject | NTDLL | Windows Driver API |
| QueryPerformanceCounter | KERNEL32 | Windows API |
| QueryPerformanceFrequency | KERNEL32 | Windows API |
| QueryUnbiasedInterruptTime | KERNEL32 | Windows API |
| RaiseFailFastException | KERNEL32 | Windows API |
| RtlDllShutdownInProgress | NTDLL | Windows API |
| RtlFillMemory | NTDLL | Windows Driver API |
| RtlMoveMemory | NTDLL | Windows Driver API |
| RtlNtStatusToDosError | NTDLL | Windows Driver API |
| RtlZeroMemory | NTDLL | Windows Driver API |
| SetConsoleCtrlHandler | KERNEL32 | Windows API |
| SetLastError | KERNEL32 | Windows API |
| SetThreadPriority | KERNEL32 | Windows API |
| SwitchToThread | KERNEL32 | Windows API |
| TerminateProcess | KERNEL32 | Windows API |
| TlsAlloc | KERNEL32 | Windows API |
| TlsGetValue | KERNEL32 | Windows API |
| TlsGetValue2 | KERNEL32 | Windows API |
| TlsSetValue | KERNEL32 | Windows API |
| VirtualProtect | KERNEL32 | Windows API |

About

Cornerstone of the MOST efficient std::thread on Windows for mingw-w64
