Disclaimer: This article is a personal opinion and does not represent a change of position by Arm on this matter.

Introduction

Whilst discussing a compiler bug I reported with Qualcomm and Arm engineers, the question arose of what the semantics of relaxed atomics should be.

Architectures like Arm AArch64 have atomic instructions with fully defined behaviours. C/C++ atomic operations have a memory order parameter that can be specified as relaxed. A relaxed operation is still atomic but imposes no constraint on memory ordering.

Unfortunately, relaxed memory order does not fully define the semantics of these operations as they are implemented in compilers. This has been discussed at length in academia, and several attempts have been made to define memory models for C/C++. Compilers, which map C/C++ semantics to instructions, must decide which semantics to implement. Our testing, using the Téléchat compiler testing tool, shows that current compiler implementations are wrong when compared to recent C/C++ models (such as that of Lahav et al.).

This post summarizes my findings on the topic, condensed for the wider LLVM/GCC community with the intention of fixing the implementation of relaxed atomics as it is used in practice.

I propose that we:

  1. Implement Hans Boehm’s branch-after-load proposal, guarded behind a flag such as -morder-relax-atomics.
  2. Test the completed implementation using the Téléchat compiler testing tool.

After a decade of academic work, the compiler community has all the ingredients needed to justify this proposal and test compiler fixes for relaxed atomics in practice.

The post is broken down as follows:

  1. Background - Skip this if you are familiar with relaxed atomics in C/C++.
  2. Current Implementation - as it is implemented in GCC and LLVM today.
  3. Relaxed Atomics Bugs in Practice - I found thousands of bug instances in code generated by today’s compilers.
  4. Proposed Fixes - in the literature and discussions with engineers.
  5. The Branch-After-Load Proposal - and why we should pick this one.
  6. Questions - from engineers who have seen this proposal.
  7. Acknowledgements - those who kindly provided feedback on this proposal.

Note: This is a live document; if you are expecting changes that you do not see, please clear your browser cache. Last updated: 3rd November 2023.

UPDATE 27/10/23: LLVM discourse discussion here

UPDATE 30/10/23: LLVM post mentioned in LLVM Weekly

Background

C/C++ provides a set of atomic types and operations for use in the execution of concurrent programs. Concurrent programs have multiple threads of execution that communicate via shared memory locations.

In a concurrent program, atomics constrain the order in which each thread reads and writes shared memory, and consequently the executions of other threads that read from that memory. This is necessary, since modern multicore processors can, and will, execute instructions out of order. This highly dynamic behaviour is known as relaxed memory concurrency and is exhibited by processors implementing the Arm, RISC-V, Intel x86 (and x64), IBM PowerPC, and MIPS architectures (and many others…).

Not wanting to outlaw perfectly valid concurrent program behaviour, the ISO C/C++ standards adopted a relaxed memory consistency model. Memory consistency models specify the behaviours allowed by concurrent programs in terms of read and write operations to shared memory.

Operations on C/C++ atomics take a memory order parameter, which is one of: sequentially consistent, acquire-release, acquire, release, consume, or relaxed. For instance, a C/C++ atomic_load_explicit(x, memory_order_acquire) is compiled to an LDAR (load-acquire) instruction when targeting the Armv8-a AArch64 architecture. I focus on operations with memory_order_relaxed semantics.

The semantics of relaxed atomic operations has generated a wealth of questions, most notably whether they permit thin-air reads: behaviours that a source model ought to forbid, yet which must be allowed if perfectly valid compiler optimisations in LLVM and GCC are to remain valid. Finding all the programs that should, or should not, exhibit such behaviour has been a central focus for academics. Mark Batty and Peter Sewell describe this problem well, but I emphasize that thin-air reads are a modelling problem, and my focus here is on what relaxed atomics do in practice. Several C/C++ models have been proposed in academia over the last decade, resulting in many papers, PhDs, and proposed solutions. Several proposals have also arisen in C/C++ standards discussions, but little progress has been made.

Recent ISO C standards require very little of relaxed atomic operations: namely that “there are relaxed atomic operations, which are not synchronization operations”, “although, like synchronization operations, they cannot contribute to data races”. This means that relaxed operations are atomic, but there are no guarantees of memory ordering. Nothing prevents compilers from implementing stronger (synchronizing) semantics (the “shall” keyword is not used), but the standard also guarantees nothing about what happens when such code is compiled and run on a machine that implements a relaxed model.

We learned that developers use relaxed atomics exactly as specified. Following discussions with maintainers of concurrency libraries (who use relaxed operations), we understand that the standard describes their practice exactly: relaxed atomic operations do not provide synchronization guarantees and are used for atomicity and UB-freedom guarantees. UB-freedom, and in particular data-race freedom, is obtained by using relaxed atomics in place of non-atomics, which in theory guarantees that any data-race-free program has sequentially consistent semantics. Unfortunately, we observe, and the standard permits, non-SC behaviour despite the use of relaxed atomics under current implementations.

Current Implementation

I describe the implementation of relaxed atomics by example. Consider the following litmus test:

where P0:r0=1 means the local variable r0 in P0 ends up with the value 1.

Litmus tests consist of a fixed initial state, a concurrent program, and a predicate over the final state. Executing the concurrent program, starting from the fixed initial state, ends up in one of several outcomes. The predicate over the final state is used to explore outcomes of execution that are unexpected under a given model.

When executing the above program under the C/C++ model of Lahav et al. (which captures the most widely accepted behaviour of the standard so far), the program finishes with one of the following outcomes:

{ P0:r0=0; P1:r0=0; }
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }

The outcome { P0:r0=0; P1:r0=1; } is observed when the program executes all of P0 followed by all of P1. In detail P0 loads the value 0 from the initial state of location x, followed by the store of value 1 to y, then P1 loads 1 from y followed by storing 1 to x.

Observe that the outcome { P0:r0=1; P1:r0=1; } is not present. Following Hans Boehm’s proposal, many C/C++ models forbid the executions that lead to this outcome. This outcome is allowed by ISO C23, but forbidden under the proposal in question.

You may rightfully ask what happens when we compile this program. Compiling this program to target the Armv8.0-a architecture using clang -O3 produces the following litmus test:

where X%P0_r0 is a symbolic register for thread P0 that holds the local variable r0 (we preserve locals for exposition). At runtime a concrete register is allocated.

Running this under the official Arm AArch64 memory model, we get the following outcomes:

{ P0:r0=0; P1:r0=0; }
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; } <-- forbidden by the proposed model, allowed by C23!	

So what has happened? The Arm AArch64 model permits the effects of the LDR instruction on each thread to be re-ordered after the effects of the STR instruction. Precisely, the store to y on P0 finishes before the load of x (and vice versa for P1). This behaviour is known as load buffering and is allowed by the models of Armv7, Armv8, RISC-V, and IBM PowerPC, to name a few.

Further, this behaviour is observable on many machines. The load and store of each thread may be issued in a pipeline, and the store on each thread finishes whilst the load fetches data from memory. We have observed this by executing the above AArch64 litmus test on an Apple M1 machine. The details of how it happens are irrelevant - what matters is that it does happen in practice and we observe it under architecture models.

Now this behaviour isn’t new. Many academics have mentioned it; for example, Batty and Sewell’s C/C++ mappings, or the modelling papers that propose idealised compilation schemes from C/C++ atomics to a handful of instructions (see Lahav et al.). The problem is that these works do not reflect the code generated by compilers in practice, where new architectures and compiler revisions bring new instructions and new atomics implementations.

Relaxed Atomics Bugs in Practice

Fortunately, we can now test atomics implementations in practice. In recent work, my co-author and I presented the Téléchat tool, which compares the behaviour of source and compiled litmus tests under source and architecture models. Téléchat takes a C/C++ litmus test, compiles it to obtain an assembly test, and executes both under their respective models, flagging a bug when the compiled program has an outcome that is not also an outcome of the source program. Whenever a model or compiler is updated, we can re-run our tests to check for new bugs.

We have reported many concurrency bugs using Téléchat, deployed Téléchat in industry automated testing for Arm Compiler, and assisted Arm’s engineers with queries from Arm’s industry partners. Now we turn to the semantics of relaxed atomics once more.

We found thousands of bug instances related to relaxed atomics accesses. We generated over 9 million litmus tests that mix atomics, non-atomics, control flow, fences, and straight line code. We fed these tests into clang and gcc for optimisation levels -O1/2/3/fast/g for the architectures mentioned below. We compared the number of outcomes of the compiled tests against those of the source program (under each model respectively).

The below table summarizes our findings:

Arch + Compiler              | -O1/O2/O3/Ofast/Og       | Model
Armv8-a + clang              | 2352/2352/2352/2352/2352 | AArch64 (official)
Armv8-a + gcc                | 2352/2352/2352/2352/2352 | AArch64 (official)
Armv7-a + clang              | 2352/2352/2352/2352/2352 | ARM
Armv7-a + gcc                | 2352/2352/2352/2352/2352 | ARM
RISC-V (64-bit) + clang      | 2352/2352/2352/2352/2352 | RISC-V (official)
RISC-V (64-bit) + gcc        | 2352/2352/2352/2352/2352 | RISC-V (official)
IBM PowerPC (64-bit) + clang | 2352/2352/2352/2352/2352 | PowerPC
IBM PowerPC (64-bit) + gcc   | 2352/2352/2352/2352/2352 | PowerPC
Intel x86-64 + clang         | 0/0/0/0/0                | x86-TSO
Intel x86-64 + gcc           | 0/0/0/0/0                | x86-TSO
MIPS (64-bit) + clang        | 0/0/0/0/0                | MIPS
MIPS (64-bit) + gcc          | 0/0/0/0/0                | MIPS


The 2352 figures count compiled-program outcomes, under the respective architecture models, that are not permitted outcomes of the source program under the C/C++ model of Lahav et al. These outcomes are common to all architectures tested, with the exception of Intel x86-64 and MIPS (64-bit), which have stronger models. They are drawn from 2352 programs that exhibit bad code generation, due to 294 concurrency patterns that are instances of the load buffering bug mentioned in the previous section. See our paper for more details.

So what can we do about it?

Proposed Fixes

There are a number of proposals to fix relaxed atomics bugs. These mutually exclusive proposals each fix the bug in the code generation of the above tests. I outline them here before elaborating on my preferred fix in the next section. The proposals I am aware of are:

  1. Branch-After-Load (BAL): Introduce a dependency-preserving branch after all relaxed atomic loads (preferred - see below)
  2. Fence-Before-Store (FBS): Introduce a fence before all relaxed atomic stores (more expensive than BAL on IBM machines - see questions).
  3. Strong-Release-Acquire (SRA): Make all atomic loads have acquire semantics (and, conversely, all stores release semantics). Expensive (see questions below) and not implementable on Armv6-M hardware, which lacks acquire/release instructions. Also, mixing implementations of relaxed atomics will still exhibit bugs.
  4. Prevent optimisations from applying to atomics: the current implementation in GCC following my bug report, but this does not prevent re-ordering at runtime (see questions).

The Branch-After-Load Proposal

Hans Boehm’s proposal fixes these bugs by suffixing load instructions with conditional branches to prevent the re-ordering of later stores. For Armv8-a, a C/C++ relaxed atomic_load is compiled to the sequence ldr; cbz label2; label2: ... rather than the current bare ldr.

We present BAL by example, adapting the Armv8 AArch64 test from the Current Implementation section. Consider the following Arm AArch64 litmus test.

When run under the official Arm AArch64 memory model, we get the following outcomes:

{ P0:r0=0; P1:r0=0; }
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }

This matches the expected outcomes of the C/C++ litmus test under the C/C++ model (of Lahav et al.). Implementing BAL removes the load buffering bugs. The same holds for the RISC-V, Armv7, and PowerPC tests. It makes no difference for MIPS or x86-64.

Why does this fix stand out? It has been proven sound with respect to academic compilation schemes, has a good performance trade-off for its correctness guarantees, was proposed by standards committee members, and can be implemented on old hardware that may not support atomic instructions. It would solve real issues in the Linux kernel without costly fences (the kernel does not use relaxed atomics or the ISO C/C++ model - the load buffering issue affects both the ISO C and Linux memory models), and the problem of thin-air reads evaporates into thin air! Furthermore, it would restore the ability to reason about the compilation of relaxed atomics, removing the need for complicated models. Best of all, I can test this fix in practice using Téléchat.

We propose to guard this implementation behind a command line flag such as -morder-relax-atomics to support current practice. After discussions with maintainers of concurrency libraries we realise we cannot outlaw the current implementation entirely. Rather we give users two options:

  1. Use LLVM/GCC as normal, keeping the default C23 behaviour - with the performance of plain loads/stores.
  2. Use a command-line flag such as -morder-relax-atomics (naming TBC) under which the BAL proposal is implemented - with the correctness guarantees outlined above.

This way, both current practice and the stronger implementation are supported but not enforced. Users can choose between performance and the ability to reason about concurrent code. Lahav et al.’s model covers the BAL behaviour, which is a subset of the behaviours allowed by the current C23 standard, so no major changes to the standards would be necessary.

I end by elaborating on these points in the form of an FAQ.

Questions

Here are answers to some common questions I have received:

Has anything similar been implemented before?

Yes. Paul McKenney and Raul Silvera proposed that the C/C++ sequentially consistent load for IBM PowerPC be implemented using a lwz; b 0x1; 0x1: isync sequence, which has been in use for over a decade now.


Relaxed atomics have worked fine for a decade, why change now?

It has taken a decade for academics to understand the costs of relaxed atomics; I have tried to summarise what I know here. Importantly, the current implementation makes it hard to reason about relaxed atomics. We can enumerate the expected behaviours of small tests in practice using Téléchat, but checking all concurrent behaviour of a large program is computationally complex and, in general, undecidable. By adopting BAL, we can restore some ability to reason about relaxed atomics at a relatively small performance cost.


What is the performance cost of the branch-after-load in practice?

Not bad. Dolan et al. tested the above in the context of the Multicore OCaml memory model and found favourable performance results on a Cavium ThunderX and an IBM pSeries:

“On average, BAL, FBS and SRA are 2.5%, 0.6% and 85.3% slower than the baseline on AArch64 and 2.9%, 26.0% and 40.8% slower on PowerPC.”

This suggests BAL incurs, on average, a roughly 2.5-2.9% slowdown on the machines tested, in contrast to the (unacceptable) SRA and FBS numbers. A compiler engineer notes this hardware is rather old now, so your results may vary.


Isn’t BAL the same as memory_order_consume?

Consume ordering is stronger than BAL. Consume must ensure that no dependent accesses after a consume load happen before the load is committed to memory - it is a weaker form of acquire. From a standards perspective, release stores and consume loads are synchronized in a dependency-ordered-before relationship, where all stores before the release store become visible side effects to the consume load. For BAL there is no such constraint: adding a dummy control dependency enforces no such costly operation in the hardware. Also, consume is not currently well supported, as it is hard to implement; BAL is not hard to implement.


What work needs to be done to implement this?

For LLVM, changes need to be made to the Armv7, Armv8 AArch64/AArch32, IBM PowerPC, and RISC-V backends, including changes to outline atomics, libgcc, and inline atomics. Further checks will be needed for newer architecture versions (e.g. Armv8.2-a and above).


Why not just use non-atomic accesses if both are compiled to plain-old loads/stores for relaxed architectures?

We want to preserve atomicity, and plain accesses do not guarantee it on older architectures. Consider a version of the load buffering test above, but with 128-bit integer types. Arm AArch64 has 64-bit registers, so we must use load/store-pair instructions to implement the expected 128-bit behaviour. These LDP/STP instructions are made exclusive to ensure that no other thread accesses a given location whilst the LDP/STP are in flight. With newer architectures we gain single-copy atomicity guarantees for LDP/STP, but on older Armv8.0 cores (e.g. the Raspberry Pi) we need exclusive access.


Atomics are treated as volatile inside GCC and LLVM, surely that prevents nasty optimisations?

That only prevents the compiler from applying certain optimisations; the generated code may still exhibit re-ordering when run on hardware or under an architecture model.


Your load buffering example has a data race, all bets are off right?

The C/C++ example is well-defined (atomic accesses cannot contribute to data races), but we observe re-ordering when it is compiled and run on Arm-based hardware. We want to prevent this re-ordering in practice.


Does this only happen for this specific load/store sequence (the load buffering example above), and not for any other relaxed atomics?

We wish to prevent the re-ordering seen in load buffering tests. The issue applies to any relaxed C/C++ atomic operation that is compiled to an instruction (or sequence thereof) with no ordering semantics in a load buffering pattern. Load buffering implemented with a store-release instruction forbids the re-ordering, but store buffering with a store-release instruction allows it. Thus we want to prevent load buffering wherever a load-store;load-store pattern occurs. For instance, LLVM translates the relaxed compare exchange into a SWP instruction with no ordering semantics when targeting Armv8.2-a; if a compare exchange is followed by a store, we may observe re-ordering.


Can we make all stores have release semantics too? So either make all loads acquire or all stores release? Failing that use barriers or branches as a weaker form?

Like the SRA proposal, this should work: release stores synchronize with non-atomic loads (on Arm AArch64), and we do not observe re-ordering with non-atomic loads. The branch-after-load proposal, by contrast, uses instructions (loads and branches) available on almost all, if not all, architectures supported by LLVM and GCC. Further, a branch after a load offers a form of ‘synchronization’ through a dummy control dependency, without the hardware cost of release-acquire semantics or fences.


Why change the compilation of relaxed loads instead of the relaxed fence?

I propose to change loads and not fences because we want to prevent reordering wherever a load is present, which is not necessarily where there is a fence.

What’s the semantics of atomic_thread_fence(relaxed)? When would I want to use it?

On Armv7 and Armv8, relaxed fences are NOPs, and so it is not clear there is a use case for them so far.


So the branch fix is just for loads, not for atomic RMW operations?

RMW operations (such as CAS and SWP) are also affected.


One of the issues is that the models appear to be much weaker than current CPUs - this means we cannot be sure whether the model is too weak or the CPUs are too strong.

Yes, that is true; fortunately, we have observed load buffering behaviour on hardware.


If the relaxed atomic operations cannot be used for synchronisation (per the standard that you quote), why is the litmus test using them for synchronisation (between threads)?

The problem is that C/C++ atomics don’t fully capture the semantics of relaxed accesses as they are implemented in compilers and run on Arm machines. I propose we fix it in practice by strengthening the implementation to enforce this degree of synchronisation. The standards can then ratify the newly accepted practice.


What if this specific litmus test is invalid?

The test is well-defined.


Are there any cases where we can elide the branch? I mean, obviously we can if there’s an existing branch, but can we take advantage of other memory barriers? ldar? stlr?

A valid case, as long as the existing branch isn’t later removed. This means we would need to emit the branch late in the pipeline to prevent optimisations from eliding it.

One of the other proposals is Strong-Release-Acquire (SRA), which either makes all stores release or all loads acquire. It has the same effect as the branch-after-load proposal, but performs considerably worse (85.3% slower on AArch64 and 40.8% on PowerPC in Dolan et al.’s numbers above). Also, we cannot use load-acquire on machines that lack those instructions (Armv6-M), whereas almost every machine supported by LLVM has branch instructions. See the questions section of the paper.


If we lower an atomic operation to a loop (a CAS loop, or ldxr/stxr), does that naturally protect us from issues like this?

Not by default. Consider the following litmus test, which has a CAS loop on P0:

When run under the AArch64 model, we get the outcomes:

{ P0:X1=0; P0:X4=0; P1:X1=0; }
{ P0:X1=0; P0:X4=0; P1:X1=1; }
{ P0:X1=1; P0:X4=0; P1:X1=0; }
{ P0:X1=1; P0:X4=0; P1:X1=1; }

To enforce the order, the branch must depend on the loaded value and not on the exclusivity check. I acknowledge this behaviour is unexpected, but as of the time of writing it is permitted by the Arm AArch64 model.

Acknowledgements

Wilco Dijkstra (Arm), Eli Friedman (Qualcomm), Lee Smith (Formerly Arm), James Brotherston (UCL), Tomas Matheson (Arm), Szabolcs Nagy (Arm), Ola Liljedahl (Arm), Wonderful Void (on GitHub, Arm), and members of the LLVM and GCC compiler teams at Arm who participated in discussions.

This article is a personal opinion and does not represent a change of position by Arm on this matter.