An implementation of the C11 `<stdatomic.h>` interface

1. Implemented library features
- 1.1. Type, constants and function interfaces
- 1.2. Type generic functions
2. The <stdatomic.h> header file
3. The implementation
4. Performance considerations
- 4.1. Benchmarks
- 4.2. Code inspection
  - 4.2.1. Lower range of thread numbers
5. Installation
- 5.1. Musl
  - 5.1.1. direct integration
  - 5.1.2. building externally
- 5.2. Other C libraries
6. Terms
- 6.1. Copyright and License
- 6.2. Distribution

Jens Gustedt

The implementation of the C11 atomic interface typically sits between the implementation of the core language by the C compiler and the implementation of the C library. It needs compiler support for the individual atomic operations and library supports for the cases where no low-level atomic instruction is available and a lock must be taken.

This implementation builds entirely on the two gcc ABIs for atomics. It doesn't even attempt to go down to assembly level by itself.
We provide all function interfaces that the two gcc ABIs and the C standard need.
For compilers that don't offer the direct language support for atomics this provides a syntactically reduced but fully functional approach to atomic operations.

The sources of this library can be found at https://gforge.inria.fr/projects/stdatomic

A short description of the new futex-lock algorithm that is implemented for the Linux operating system has been accepted for SAC'16. A long version of that article can be found here: https://hal.inria.fr/hal-01236734

1 Implemented library features

We distinguish between the implementation of library functions and the <stdatomic.h> header file.

The latter already contains a lot of intelligence, because it has to provide type generic interfaces. This is more involved than usual C library header files.

The header file is optional, the created function interface should be compatible with the header files that gcc and clang may provide.

But you should be careful with gcc's header file. Up to now (Dec 2015) that header file has an unfixed bug: addition and subtraction of atomic types only does the increments in bytes, and not in elements of the base type of the pointer. Avoid their header, as long as this bug isn't fixed. If you are interested you can follow their bug at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64843

1.1 Type, constants and function interfaces

These are the types and proper functions that are foreseen by the standard:

atomic_flag and its four functions
the memory_order enumeration type
fences
object-like macros to test for lock-freeness and similar things
typedef for atomic integer and pointer types.

All of these are provided in forms that are compatible with gcc and clang.

1.2 Type generic functions

These are all implemented as macros, and should in many cases result in optimized inlined assembler instructions, and not in library calls. Library calls are only needed as fall back, when there is no reasonable instruction set available.

This implementation uses predefined macros of the form

__GCC_HAVE_SYNC_COMPARE_AND_SWAP_X

where X can be one of 1, 2, 4, 8 or 16. All versions of gcc and clang since at least ten years implement these macros and the underlying operations consistently.

If that macro exists, we suppose that the compiler is able to synthesize all corresponding memory functions for types of size X and all necessary arithmetic function for integer types of that size, as lock-free (= stateless) instructions.

This doesn't mean that there are direct assembler instruction for all these operations. They can well be implemented as an unbounded loop that uses a compare_and_swap (CAS) primitive for atomic exchange. Gcc typically does this for the less common atomic arithmetic instructions such as atomic_fetch_and, for example. Lock-free doesn't mean a bounded number of instructions.

Relevant for C11 are 9 operations

fetch_add for integer and pointer addition, returning the previous value
fetch_sub for integer and pointer subtraction, returning the previous value
fetch_or for bitwise or, returning the previous value
fetch_and for bitwise and, returning the previous value
fetch_xor for bitwise xor, returning the previous value
load for an atomic load operation
store for an atomic store operation
exchange for an atomic exchange operation, equivalent to a store that returns the previous value
compare_exchange for an atomic compare and exchange operation, equivalent to a conditional store that also saves the previous value, and returns false or true according to the success of the condition and the possible store operation. Two semantically different interfaces are provided. A strong version that returns false and true as described, and a weak version that may fail eventually, even if the condition of equality was fulfilled.

According to recent precision given by the C standards committee the store, exchange and compare_exchange functions have memory semantics, that is they are done as-if using memcpy and memcmp not as-if using equality and assignment operations. This distinction is important for two cases

If the corresponding type has padding bits or bytes, compare_exchange can fail even if the abstract value of two objects would compare equal. This is because the padding, if different for the objects, enters into the comparison.
Floating point NaN seen as values always compares false, even if two NaN are compared. When compared as bit pattern through memcmp they could compare equal.

These operations in C11 generally have two variants, one that uses sequential consistency and another one where a more detailed consistency can be requested through additional parameters. So in total we have to provide 20 different type generic functional interfaces.

For the operations that cannot be mapped to built-in compiler support the compilers inserts calls to external functions. The names for these functions are typically composed of the operation and prefixed either by __sync_ (the older gcc ABI) or __atomic_ (the newer gcc ABI). The names of these calls can be suffixed by _X for X as above if this concerns an operation on a type of the corresponding width.

All external functions that the gcc ABI's require are provided.

1.2.1 The `__atomic_` ABI

is already close to the C11 call interface as given above.

In addition to the more or less obvious operands, the built-in functions take one or two additional parameters that reflect an eventual requirement for the memory_order of the operation. So the functions represent the C11 "explicit" features such as atomic_fetch_add_explicit. The non-explicit versions have to be mapped to the explicit version by providing memory_order_seq_cst as parameter(s).

Observe that the built-in functions only foresee one interface compare_exchange.

The distinction between weak and strong versions of these built-in functions are ruled through an additional parameter, not through a different function interface.
The function symbol fall-back __atomic_compare_exchange confusingly has a different semantic and prototype than the built-in function. It misses the parameter to chose between the "weak" and the "strong" version, and solely corresponds to the C11 operation

atomic_compare_exchange_strong_explicit

As said, load, store and compare_exchange operations have memory semantics. The implementation may use = or == operators in some places for optimization, but it then does so with objects of uintXX_t, so every bit is accounted for.

Function call interfaces for the arithmetic operations are only generated if we can suppose that an integer type for the corresponding size exists. We can reasonably assume that there are always types uint8_t, uint16_t, uint32_t and uint64_t, so the variants for 1, 2, 4 and 8 can always be generated.

For a 128 bit type these are only generated if __SIZEOF_INT128__ or __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 exist. If so, we assume that __uint128_t is such an integer type and known to the compiler.

Arithmetic operations can safely use these uintXX_t types internally, since the standard imposes two's complement representation for signed atomic types and also enforces that atomic operations may not produce traps on overflow.

Additionally to the operations that have generic function interfaces in the C11 standard, gcc additionally implements six other built-ins, namely

__atomic_add_fetch for integer or pointer addition, returning the updated value
__atomic_sub_fetch for integer or pointer subtraction, returning the updated value
__atomic_or_fetch for bitwise or, returning the updated value
__atomic_and_fetch for bitwise and, returning the updated value
__atomic_xor_fetch for bitwise xor, returning the updated value
__atomic_fetch_nand for bitwise nand (x = ~(x & v)), returning the previous value
__atomic_nand_fetch for bitwise nand (x = ~(x & v)), returning the updated value

For the completeness of the library interface we supply analogous functions with the _X suffix for these. They might be called by the compiler if the user code uses assign and add or similar operators on atomic integers. The __atomic_add_fetch and __atomic_sub_fetch functions may also eventually be used by the compiler to implement an atomic prefix increment or decrement operation (++x and --x). This would e.g happen if x is an object of type __int128_t and the platform doesn't implement lock-free atomics for types of size 16.

1.2.2 Clang's `__c11_atomic` built-ins

Clang has gone a different path for the built-ins that implement C11 atomics, prefixed with __c11_atomic. These are a directly feature equivalent to the C11 generic functions that have memory_order arguments (_explicit suffix).

For the cases that no atomic instructions can be synthesized, clang falls back to the same external calls as described for gcc's __atomic ABI.

1.2.3 The `__sync` ABI

It dates back long before the C11 atomic interface had been designed and thus cannot be directly conforming to it. It has basically the same built-ins for arithmetic types as above, only that

The functions are named a bit differently.
They only implement sequential consistency.
There are no load, store or exchange features.
The nand operations changed their meaning from version 4.4 onward. Therefore this operation cannot be used portably in an environment that might use different versions of compilers. So we don't implement these function interfaces and we deprecate the use of this built-in.

Additionally this interface also implements a test_and_set functionality that is used to implement the atomic_flag functions. This built-in is documented to have acquire-release consistency. If used with sequential consistency, an additional fence is inserted to ensure that.

These features are sufficient to provide a decent implementation of C11 atomics.

1.2.4 The lock-full fallback functions

In absence of proper architecture support, all fallbacks (for the three built-in families) with _X suffix use the ones without suffix underneath. These external interfaces receive the size of the data type as an additional, leading parameter:

__atomic_load
__atomic_store
__atomic_exchange
__atomic_compare_exchange

They have pure memory semantics and their basic operations are memcpy and memcmp for load, store and comparison.

These functions cannot be called directly from within your code, because the compiler cannot distinguish them from the gcc built-ins, and they have different prototypes than these.

We implement these functions as critical sections that are protected with a lock, similar to a mutex. This implementations uses a table of locks and a hash function to choose one of the entries that only depends on the address of the atomic object.

At the moment, this implementation has several address-hash functions that can be chosen a library-compile time. Any function that mixes the bits of the address should perform reasonably well.

More important for performance is the choice of the lock. Such a lock can be relatively simple, since C11 atomics that are not lock-free don't have to be asynchronous signal safe.

There are several possibilities, in order of preference:

An OS specific light-weighted lock with non-active waits. The integration into musl uses Linux' futex underneath to do an efficient wait. If by coincidence these are called in an un-threaded process, they are close to non-ops.
C11's mtx_t type has an shallow interface that should allow it to be implemented a bit simpler and efficient than OS specific mutexes that implement a lot of functionality. This solution should be portable to all platforms that implement this part of C11. In a relatively near future these could be all POSIX and Windows platforms. This approach has the disadvantage that a table of mtx_t must be initialized at process startup because mtx_t doesn't guarantee static initialization.
POSIX' pthread_mutex_t is a little less portable, but allows for static initialization.
A spinlock similar to atomic_flag. Such an approach is portable to all platforms that implement atomics and allows for static initialization. This is the only choice when compiled without OS or library support.

The wait functionality is an active wait, that burns CPU cycles and memory bandwidth. In many circumstances this should do well, the critical sections that are protected by this are nice and small.

2 The `<stdatomic.h>` header file

2.1 Full C11 support

Versions of gcc and clang that fully implement the C11 atomics interface will not need a special header file but can use their own that is shipped with the compiler:

gcc starting with version 4.9
clang starting with version 3.6

This full support of atomics allows to use atomic objects just as other objects it whatever operations the base type supports.

These default operations on atomics use sequential consistency. That is, each such an operation will enforce a full memory transfer and the perceived effect is as if all these operations, even if issued in different threads, have been done one after another. Thus, thread parallelism can only play between such operations:

atomics operations are expensive

The functional interfaces with different memory_order arguments (_explicit suffix to the name) that we described above may be used to milder the memory effect that atomic operations have. The possible gain of such different memory consistency models are very architecture dependent. E.g on the x86 platforms they offer almost no advantage, whereas on ARM platforms acquire/release semantics may bring some noticeable gain.

But beware that this gain is bought with a sensible complexification of the code. Only use this if the atomic operations are a measurable performance bottleneck and you already have reduced the number of these operations to a minimum.

2.2 Partial C11 atomics support

A series of compiler versions offers partial atomics support that already implements most of the C11 semantic:

gcc versions 4.7 and 4.8
clang versions 3.2 to 3.5

These versions provide the built-in functions as described above but lack full compiler support for atomic types and operations.

With the <stdatomic.h> header that we supply for these compilers, application code can use the functional interfaces. A macro _Atomic(T) is provided that can be used to issue emulated declarations of atomic types that should be forward compatible to platforms with complete C11 atomics support. Example:

// global variables
_Atomic(size_t) thread_inside_count = ATOMIC_VAR_INIT(0);
_Atomic(size_t) thread_total_count = ATOMIC_VAR_INIT(1);

int my_thread_function(void* arg) {
   atomic_fetch_add(&thread_inside_count, 1);
   atomic_fetch_add(&thread_total_count, 1);

   // do something complicated here

   // at the end
   atomic_fetch_sub(&thread_inside_count, 1);
}

Underneath such emulated atomic objects are implemented as arrays of volatile base type of size 1. This has the following sought effects:

They can't be assigned to.
They evaluate to a pointer in almost any context.
Operations with them cannot be reordered by the compiler.

So you should be relatively safe from programming errors that would access such objects without passing through the type generic atomic functions. The compiler will error out on improper usage of such atomic objects, but the diagnostics may be a bit crude.

2.2.1 Issues

Since this approach may reinterpret data through pointer casts, it could potentially be dangerous. So let us discuss the possible issues.

The generic fallbacks for memory access only use memcpy and memcmp to access the data itself. So the access of the data is within the constraints of the standard.
The generic fallbacks for memory access ensure that their arguments have compatible base types (if a pointer is passed in) or are assignment compatible with the base type of the atomic (if a value is passed in). So data that is copied across can never be misinterpreted as being of a wrong type because the two target types are compatible.
The specialized functions with _X suffix may reinterpret their data as the corresponding uintXX_t for the size. Copying or comparing such data is always guaranteed to use all bits, so in that sense it is equivalent to memcpy and memcmp.
The arithmetic operations that are executed then are operations on an unsigned integer type that has no padding bits. This arithmetic is compatible for all integer types that have no padding bits and, for the signed types, are represented with two's complement.
An emulated atomic with this approach is implemented as an array to the base type, and so in the user code the base type of the object remains visible to the compiler. As a consequence this approach has no effect on the aliasing rules, the compiler always has complete information about the type of each object.

The only potential problem for our approach that remains is alignment. Since the stub functions that are provided may use casts to uintXX_t of "atomic" objects you have to ensure that these objects are at least aligned as these types would be. This should not be a problem, if the base type is an integer type, too. Integer types with same size should have the same alignment.

If you encounter problems with a user defined type that has a size that is a small power of two you could force alignment

_Alignas(sizeof(toto)) _Atomic(toto) toto1;
__attribute__((__aligned__(sizeof(toto)))) _Atomic(toto) toto2;

with whatever of the two constructs works for you.

I am currently struggling to provide a version of the _Atomic(T) macro that ensures that automatically. It seems to be possible but produces a lot of noise for function parameters that are pointers to atomics.

2.3 Basic atomics support

Even older versions of gcc and clang implement the __sync built-in functions and can thereby made to accept the same <stdatomic.h> header as discussed above. Since, as their names indicate, these built-ins only have fully synchronizing versions, they will not be able to take advantage of the different consistency models. But implementing atomics with stronger consistency than required, here sequential consistency, only, is conforming to the C standard.

3 The implementation

3.1 Requirements

3.1.1 Compilers

You should be able to compile this implementation with any version of modern gcc and clang. (Versions are hard to tell, gcc should work for 4.1) The quality of the resulting binary will depend on the implementation of atomic support by the compiler.

There are three different implementations, for modern clang and gcc, and one for those compilers that only support the __sync_ built-ins. They are only tested with clang and gcc, but might work with other compilers that implement one of the sets of built-ins and is otherwise compatible to some gcc extensions:

compound expressions with ({ })
__typeof__
__attribute__((__unused__))
__builtin_choose_expr for the __sync version as a precursor of C11's _Generic
#pragma redefine_extname to rename the external symbols that are produced

If aligment happens to be an issue you might also need

__attribute__((__aligned__(something)))
__alignof__

or the equivalent C11 features _Alignas and _Alignof.

There are some heuristics in place to decide at compile time which case applies, namely __clang__ to detect clang, __ATOMIC_... macros to detect the C11 versions of the built-ins.

3.1.2 OS or C library support

The library may work with different lock constructs, as described above, that is a futex based support for Linux, C11's mtx_t, POSIX' pthread_mutex_t, and active spinning as a last resort. You may find a description of the algorithms and some performance figures in the article https://hal.inria.fr/hal-01236734

If you would like to see support for other OS or runtime environments, don't hesitate to contact me if you'd like to work on implementing and integrating this.

3.2 Caveats

3.2.1 Symbol renaming

There is one important difficulty when compiling this. The original __atomic library interface was developed with C++ in mind and not C. Therefore it freely uses function overloading for the built-ins versus the library interface. Since we also use the library functions as fallbacks in the implementation of some of the _X variants this naming scheme is not supportable with a C compiler.

We get away with it by using internal names, prefixed with __impl_ for all functions. Then a gcc extension is used to map that internal name to an external name, e.g

#pragma redefine_extname __impl_load __atomic_load

If your compiler doesn't support this feature, you'd have to use an external tool such as objcopy to achieve the same.

3.2.2 Support of 16 byte atomic instructions

The main difference for modern processors that is relevant here is if it supports 16 byte atomic instructions or not. There is no difficulty to detect this at compile time, but if the library is used with code that is compiled with a different compiler or just different compiler options, incompatible binary code may be produced.

My plan is to freeze that feature at compile time of the library and reflect the capacity in the <stdatomic.h> that is provided. This then may result in code that is a bit less optimized than it could, but that is compatible.

If the library is not compiled with direct 16 byte support the application may not use it, and thus use a memory implementation for such operations.
If the library is compiled with direct 16 byte support but the application compiler doesn't support it, the user code should fallback to library calls, but which in turn use the atomic instructions. So such a variant would have a call overhead and would not be able to inline the atomics in the user binary.

I already have a working implementation of such a safety feature in some other code, so this is feasible. But for the moment this is not yet done, here. Be careful when using this preliminary version.

3.3 Leftovers

There are some leftovers that will hopefully disappear.

There are several hash functions and a instrumentation infrastructure for the hashes. I didn't have enough test cases yet to see what would be best, here.

3.4 Instrumentation and testing

3.4.1 Instrumentation

There is optional instrumentation for the lock functions. Switching it on changes overall performance substantially, and thus I'd expect a noticeable effect by the observation principle. These counters can give qualitative information about what happens, but you shouldn't take the figures verbally. Also these counters are only protected if you test the library with only one lock, using atomics for these counters themselves would have a strong performance impact and the resulting statistics would basically be worthless.

You can switch the instrumentation of the code on by defining the symbol BENCH at compile time. A function atomic_summary can be used at the end of all operations to print the collected data to stderr.

3.4.2 Code injection

To test the behavior of the locking algorithm you may inject a function call just after the acquisition of the lock. Thereby you can e.g force the thread that obtains the lock to be descheduled, and test the worst-case behavior of the locking algorithm.

This feature is switched on by defining the macro ATOMIC_INJECT at compile time. By that you have a thread local variable atomic_faulty and a function interface atomic_inject at your disposal, namely atomic_inject is called iff atomic_faulty is true for the calling thread.

There is a "weak" version of atomic_inject that does nothing. It can be overwritten by a specific version that you provide yourself. E.g for the benchmarks using Modular C in the article that we mentionned above, slow path of the algorithm is stressed by simply calling thrd_yield.

The variable atomic_faulty can be used to switch the code injection on and off, such that you may experiment with different probabilities of failure.

4 Performance considerations

4.1 Benchmarks

I have run a long series of benchmarks to validate the approach. The code for the benchmark is at the moment integrated in p11 with comes with Modular C, see Cmod. To compile it you'd need

a C11 compliant library that has C11 threads, I only know of musl, or an implementation of POSIX' threads that can be used to emulate C11 threads.
a C11 compiler that also has gcc extension. I tested with gcc and clang.
Cmod
P99, my old macro library. This one could probably avoided, it is just needed for some parts of p11.

The test in p11 is called p11#test#lifo. It is based on a stack implementation (Last In First Out) that uses an atomic pair of items for the head to avoid the ABA problem.

Please refer to the article for some results of the benchmarks.

4.2 Code inspection

4.2.1 Lower range of thread numbers

For this application the performance in the lower range of is largely dominated by the fast path, that is by a very small number of assembler instructions that constitute the good case, when a thread doesn't encounter congestion. On a x86_64 machine, our implementation of the four different categories result in the following memory instructions:

	lock	unlock
spinlock	`cmpxchgl`	`movl`
futex	`cmpxchgl`	`lock addl`
mutex	`cmpxchgl`	`movl`, `xchg`
musl	`xchg`	`movl`, `mov`, `lock orl`, `mov`

The spinlock and futex implementation here have very similar performance, because they have a minimal number memory instructions.

Musl's internal lock implementation actually looses for the unlock. It has four different memory instructions. Two of them originate from the internal macro a_store, which needs a synchronization of the mov instruction to avoid reordering on the processor. It results in two instructions:

mov eax, (%rdi)
lock orl (%rsp)

We observed an improvement whe a_store is implemented directly with on atomic instruction, e.g.

xchg %eax, (%rdi)

Such a change could perhaps be integrated into musl at a later stage.

The mutex implementations have two memory instructions for the unlock functions. One movl from memory to CPU for a waiters counter, and one xchg to manipulate the lock itself.

Our implementation attempts to combine the two instructions for unlock into one: on the fast path we only need one atomic addition. By that we are better than the mutex, we save one movl instruction for the waiters counter. We may be a bit worse than the spinlock, because that only has a write to memory to perform, and doesn't need information from memory to be returned to the CPU.

5 Installation

As said above it will be important that your compiled library and your user code agree on the model of the atomics that they have for 16-byte data types. Be careful, compiler options that change the processor model can change this characteristic, e.g if you compile with gcc and the option -march=native.

If your compiler (gcc or clang) already has a working stdatomic.h file in place, you have nothing to do from that part. If not, you should install all files of the form

atomic_.....h

in an include directory where your compiler can find it when you compile application code.

5.1 Musl

5.1.1 direct integration

If you have the sources of musl the easiest is to integrate the library directly into the libc. To achieve that just do

make MUSL=your/path/to/musl musl

This copies all necessary code to a subdirectory of your musl path. Then just compile and install musl as you would do usually.

By default this chooses the futex implementation of the generic lock function.

5.1.2 building externally

If you want to compile a standalone libstdatomic.a library archive file, you first need to compile one object file that encapsulates all system calls to futex. This can be done with the same command as above

make MUSL=your/path/to/musl musl

and then

make MUSL=your/path/to/musl libstdatomic.a

This will compile the one file inside the musl src hierarchy and then assemble all to one "standalone" library.

5.2 Other C libraries

Up to now I only tested with glibc as other C library. If you have experience with other libraries please let me know.

For the moment the futex generic lock only works with musl, so you have to chose another one when musl is not available. You choose the version through a define of ATOMIC_GENERIC_LOCK:

ATOMIC_GENERIC_LOCK_PTHREAD: uses a pthread_mutex_t for the lock. It will be the choice for most, unless you have a C library that already implements C11 threads, then you'd use
ATOMIC_GENERIC_LOCK_MTX: and a mtx_t.
ATOMIC_GENERIC_LOCK_CMPXCHG: is the last resort if you have neither of the two above. This implements just a spinlock. It is only suited for applications that don't have a strong congestion on any atomic operation. If there is a lot of congestion, the application will suffer dramatically.

With this choice, you may compile the library with a simple make

make CONFIG='-DATOMIC_GENERIC_LOCK=ATOMIC_GENERIC_LOCK_PTHREAD' libatomic.a

6 Terms

6.1 Copyright and License

The stdatomic library as a whole is licensed under the following standard MIT license:

----------------------------------------------------------------------
Copyright © 2015-2017 Jens Gustedt

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
----------------------------------------------------------------------

6.2 Distribution

This work is distributed at http://stdatomic.gforge.inria.fr/.

An implementation of the C11 <stdatomic.h> interface

Table of Contents