Lock-Free Programming

The goal of lock-free programming is not to make synchronization “free”. It is to avoid as much of the extra overhead of traditional locks as possible in highly concurrent scenarios.

In a multithreaded environment, the most common and simplest synchronization tool is still a lock. For example, many implementations of std::mutex try to stay in user space when there is no contention. But once contention becomes heavy, threads may block, wake up, and context switch, and performance can drop sharply. Even when the kernel is not involved, locks can still introduce cache synchronization, pipeline stalls, and scheduling overhead.

So in scenarios that care deeply about throughput and latency, we often want to replace mutexes with lighter synchronization primitives. The usual building blocks are atomic operations, CAS, and more precise control over memory ordering.

Why focus on queues here? Because the producer-consumer model is extremely common in multithreaded programs. In a game engine, for example, the logic thread may submit render commands into a queue, while the render thread consumes and executes them asynchronously. Logging systems, task schedulers, network receive threads, and business logic threads also often pass data around through queues.

From an implementation point of view, lock-free queues usually follow one of two routes:

linked lists
circular arrays

The linked-list version is usually easier to explain, especially as an introduction. But if the goal is high throughput, mainstream implementations tend to prefer circular arrays. This post first clarifies the basic concepts, then gives both a single-producer single-consumer linked queue and a circular-array queue, and finally summarizes common optimization directions in real systems.

Open-source implementations worth looking at include:

Atomic Operations and CAS

In multithreaded code, if multiple threads access the same memory and at least one of them writes to it, there must be some form of synchronization. Otherwise it is a data race, and the behavior of the program is undefined.

The most basic synchronization primitive is an atomic variable:

std::atomic<int> counter { 0 };
counter.fetch_add(1, std::memory_order_relaxed);

Atomic variables solve problems such as “this read or write cannot be torn” and “operations on the same atomic object are atomic”. But they do not automatically make the whole algorithm thread-safe. The truly difficult parts are usually:

ordering between several shared variables
when data written by one thread becomes visible to another thread
how to keep compound operations consistent

This brings us to CAS.

CAS stands for compare-and-swap. In C++, it corresponds to compare_exchange_weak and compare_exchange_strong. Its behavior can be summarized in one sentence:

If the current value is still equal to the old value I just observed, replace it with the new value. Otherwise, some other thread changed it first, so I fail and retry.

In pseudocode:

if (*addr == expected) {
    *addr = desired;
    return true;
}
return false;

For example, if several threads all want to change a linked-list tail pointer from A to B, only the first thread whose CAS succeeds can complete the update. The others will find that “the value is no longer A”, so they reread the latest state and decide what to do next.

A typical pattern looks like this:

std::atomic<int> value { 0 };

void AddOne()
{
    int expected = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(
        expected,
        expected + 1,
        std::memory_order_release,
        std::memory_order_relaxed))
    {
        // On failure, expected is updated with the current value, so retry.
    }
}

A few common conclusions:

weak is allowed to fail spuriously, so it is usually a better fit inside loops.
strong does not fail spuriously, but that does not mean it is always faster.
CAS only guarantees a conditional update on one atomic object. It does not automatically solve consistency for an entire data structure.

The ABA Problem

CAS has a classic problem called ABA.

Suppose thread 1 reads a pointer value A and prepares to CAS it. Thread 2 changes it to B, then changes it back to A. When thread 1 performs its CAS, it may wrongly conclude that “the value never changed”, even though the intermediate state was completely different.

In lock-free linked lists and stacks, A -> B -> A is often more than “the value went around in a circle”. It may mean:

the original node has been unlinked
the node has even been freed
the same address has been allocated to another object

So when thread 1 sees “still A”, what it may really be seeing is only “the address looks the same”, not “this is still the original object”.

ABA is common in linked stacks, free lists, and MPMC queues. Typical solutions include:

pointer plus version number
hazard pointers
epoch-based reclamation
memory-pool strategies that avoid immediate address reuse

SPSC queues usually do not hit the most complicated ABA cases, because there is only one producer and one consumer, and the state space is much smaller. Once you move to MPSC or MPMC, however, memory reclamation and ABA are almost impossible to avoid.

Memory Ordering

When people first encounter atomic operations, they often ask:

If the variable is already atomic, why do I still need to care about memory ordering?

The answer is: atomicity only means this variable itself will not be read or written in a broken way. It does not describe the ordering relationship between this operation and other writes.

Modern CPUs execute out of order for performance, and compilers also reorder instructions. Cache-coherence protocols can guarantee that “the same atomic variable eventually becomes consistent”, but they do not automatically guarantee that another thread observes “I wrote the data first, then published the pointer” in the same order.

If you want to write correct lock-free code, it is best to separate several key ideas clearly:

atomicity
visibility
ordering
happens-before

These words look similar, but they are not the same thing.

Separate Three Things First: Atomicity, Visibility, and Ordering

1. Atomicity

Atomicity answers this question:

Can another thread observe only half of this read or write on a shared object?

For example, if an atomic pointer changes from A to B, another thread sees either A or B; it does not see “half of A and half of B”. This solves torn reads and writes.

2. Visibility

Visibility answers this question:

When can a write done by one thread be observed by another thread?

Notice that this is not just about one atomic variable itself. It is about the whole group of state written by one thread. For example:

node->value = 123;
node->next = nullptr;
published.store(node, ???);

The consumer really cares about:

when it can see published == node
whether it can also see node->value = 123 when it sees published == node

3. Ordering

Ordering answers this question:

If an operation is written earlier in a thread, will another thread also observe it earlier?

The classic pitfall is:

payload = 42;
flag = true;

In the source code, you wrote “payload first, flag second”. But without enough ordering constraints, another thread may see flag == true before it sees payload = 42.

So a rough way to remember it is:

atomicity: whether a single operation can be torn
visibility: when another thread can see it
ordering: whether another thread sees things in the order you expect

Who Is Reordering Things?

At first, many people blame “reordering” entirely on the CPU. In reality there are at least two layers:

compiler reordering
hardware reordering

Why the Compiler Reorders

From a single-threaded point of view, as long as the final result is unchanged, the compiler has a lot of freedom to swap, combine, move earlier, or delay instructions.

For example:

x = 1;
y = 2;

If the compiler can prove that the single-threaded result is unaffected, it has every incentive to adjust the order of generated code.

You can picture the process like this:

flowchart LR
    A["Source order<br/>x = 1;<br/>y = 2;"] --> B["Compiler optimization phase"]
    B --> C["Possible machine instruction order<br/>write y first<br/>then write x"]
    C --> D["Single-threaded result is still correct"]

The point is not that the compiler is “trying to fight you”. It is following a lower-level goal:

As long as the observable single-threaded result does not change, it can freely adjust implementation details.

For a more realistic-looking example:

int a = 0;
int b = 0;

void Foo()
{
    a = 1;
    b = 2;
}

If there are no atomic semantics, no volatile, and no other synchronization constraints, the compiler may generate code that writes b before a. The assembly might look like this:

; The source code is:
; a = 1;
; b = 2;

mov dword ptr [b], 2
mov dword ptr [a], 1
ret

This assembly is not saying “the compiler will definitely generate this”. It is saying:

source order is not the same thing as final instruction order
without synchronization semantics, the compiler is not obligated to preserve the positions you wrote

Now consider an example closer to lock-free code:

payload = 42;
flag = 1;

Without atomics and ordering constraints, the compiler may still see these as two ordinary writes. So the illustrative assembly may again look like:

mov dword ptr [flag], 1
mov dword ptr [payload], 42

If another thread is using flag to decide whether payload is ready, this kind of reordering, perfectly legal for a single thread, can directly break the multithreaded logic.

So from the compiler layer, one meaning of memory ordering is:

Tell the compiler explicitly that this is not ordinary optimization space. There is cross-thread meaning here, so the order cannot be changed casually.

Why the CPU Also Reorders

Modern CPUs try to keep their pipelines full. They use:

out-of-order execution
store buffers
speculative execution
cache hierarchies and invalidation propagation

So even if the compiler leaves the instructions alone, the hardware layer does not guarantee that “the order in my source code” is the same as “the order another core observes”.

The most useful piece to look at separately is the store buffer, because many cases of “I clearly already executed the write, so why has another core not seen it?” come from it.

A minimal hardware view looks like this:

flowchart LR
    subgraph Core0["Core 0"]
        I0["execute store payload = 42"]
        I1["execute store flag = 1"]
        SB["store buffer"]
    end

    subgraph Coherence["cache coherence / cache hierarchy"]
        C["cache and coherence propagation"]
    end

    subgraph Core1["Core 1"]
        L0["load flag"]
        L1["load payload"]
    end

    I0 --> SB
    I1 --> SB
    SB --> C
    C --> L0
    C --> L1

The key point is: after core 0 finishes executing a store, that does not mean the write has immediately become observable by other cores in the order you expect. It often still has to pass through:

this core’s store buffer
the cache hierarchy
coherence propagation

Here is a timing diagram closer to what another core may observe:

sequenceDiagram
    participant C0 as "Core 0"
    participant SB as "Core0 store buffer"
    participant C1 as "Core 1"

    C0->>SB: store payload = 42
    C0->>SB: store flag = 1
    SB-->>C1: flag becomes visible first
    C1->>C1: load(flag) == 1
    C1->>C1: load(payload) may still see old value
    SB-->>C1: payload becomes visible later

This diagram does not mean hardware always makes flag arrive first. It means:

Even if core 0 internally executed the two stores in order, core 1 does not necessarily observe them in the same order.

This is exactly why execution order and visibility order are not the same thing.

Correct Assembly Order Still Does Not Guarantee Cross-Core Observation Order

When people first learn this, they often react with:

Then I can just inspect the assembly and know the order, right?

That only solves half of the problem.

Suppose the compiler really does generate the assembly you expected:

mov dword ptr [payload], 42
mov dword ptr [flag], 1

This only means:

for the current core, the instruction stream says “write payload first, then write flag”

It does not by itself guarantee:

another core must observe payload first, then flag

Between issuing a store on this core and other cores actually seeing its effect, there are still microarchitectural details and coherence propagation.

Why Acquire / Release Help

At the CPU layer, the meaning of acquire/release can be understood intuitively:

release prevents “the data I prepared before this” from becoming visible after the “publish” action
acquire prevents “the data I need to read after observing the publication” from being pulled before that observation

In other words, they are not about making source code look nice. They constrain:

how the compiler orders things
how the CPU makes results visible to other cores

This is why memory ordering must be considered from both the compiler perspective and the hardware perspective.

Why Cache Coherence Is Not Enough

Many people say:

Do we not already have cache-coherence protocols? Why can things still go wrong?

Because cache coherence mainly guarantees:

For the same memory location, everyone will eventually converge to a consistent value.

It does not automatically guarantee:

For writes to several different memory locations, other threads will see them in the order you wanted.

This is crucial. For example:

payload = 42;
flag.store(true, std::memory_order_relaxed);

Here flag itself is still atomic and still participates in cache coherence. But whether seeing flag == true also means seeing payload = 42 is not something cache coherence gives you automatically.

The Most Important Line in the C++ Memory Model: Happens-Before

If you remember only one term, remember happens-before.

A rough explanation is:

If operation A happens-before operation B, then B must be able to see A and all visible effects before A.

Within a single thread, program order already forms part of happens-before.

In multiple threads, the key is: you must use atomic operations and memory ordering to establish this line between threads as well.

The classic way is:

one thread performs store(..., release)
another thread performs load(..., acquire) on the same atomic object and reads that value

This forms:

synchronizes-with
and therefore establishes a cross-thread happens-before

You do not have to memorize the terminology, but you should remember the result:

Data written before a release publication must be visible after another thread observes that publication through acquire.

What Each memory_order Guarantees

Here are the most common memory orders in C++.

memory_order_relaxed Only guarantees that this atomic operation itself is atomic. It does not establish additional cross-thread ordering.
memory_order_acquire Often used to receive a result published by another thread. In the current thread, reads and writes after it cannot be reordered before it.
memory_order_release Often used to publish data that has already been written. In the current thread, reads and writes before it cannot be reordered after it.
memory_order_acq_rel Has both acquire and release semantics. It is common on read-modify-write operations such as fetch_add and CAS.
memory_order_seq_cst The strongest and easiest to reason about. In addition to acquire/release effects, it also requires all seq_cst atomic operations to appear as if they are placed in one global total order.

The standard also has memory_order_consume, but it has long been “theoretically present, practically rarely used”. Many implementations treat it as acquire. When writing business code or lock-free data structures, you can usually ignore it at first.

`memory_order_relaxed`

relaxed is the easiest one to misunderstand.

It is not “unsafe”, and it is not “do whatever you want”. It still guarantees:

this atomic read or write is not torn
modification order is consistent for the same atomic object

What it does not guarantee is:

ordering with other ordinary memory reads and writes
the ability to publish extra state across threads

So relaxed is a good fit for:

statistical counters
some paths in reference counting
obtaining a unique ticket when it is not used to publish other data

For example:

globalId.fetch_add(1, std::memory_order_relaxed);

Here we only want to generate a unique number, so we usually do not need semantics saying “other writes must become visible too”.

But if you write:

payload = 42;
flag.store(true, std::memory_order_relaxed);

then relaxed is not enough to express “if you see flag, you should also see payload”.

`memory_order_release`

The core meaning of release is:

In the current thread, reads and writes before this release operation cannot be moved after it.

So it is very suitable for a “publish” action, for example:

node->value = 123;
node->next = nullptr;
tail->next.store(node, std::memory_order_release);

Here the point of release is not to make writing tail->next “more atomic”. It means:

initialize the new node first
then publish it as a complete object

`memory_order_acquire`

The core meaning of acquire is:

In the current thread, reads and writes after the acquire operation cannot be moved before it.

So it is very suitable for “receiving a published result”, for example:

Node* next = head->next.load(std::memory_order_acquire);

As long as this next came from some release publication, the current thread can safely inspect the contents that were initialized before that release.

You can remember release/acquire as a pair:

release: I have finished writing it, and now I officially publish it
acquire: since I have seen the publication, my later reads must see the published data

`memory_order_acq_rel`

acq_rel usually appears on read-modify-write operations, such as:

fetch_add
exchange
compare_exchange_*

These operations often play two roles at once:

read the current state
write back a new state

For example, a CAS may need to:

acquire data that was already published behind the old state
release the new state it just constructed

That is when acq_rel is natural.

`memory_order_seq_cst`

seq_cst can be understood as “the strongest acquire/release plus one global order everyone agrees on”.

Its biggest advantages are:

easier reasoning
less chance of getting the order wrong

Its biggest disadvantages are:

it is usually heavier than more precise acquire/release
it may be unnecessary in hot paths

A common strategy in lock-free code is:

write it correctly with seq_cst first
after the logic is confirmed, tighten it to acquire/release/acq_rel/relaxed where needed

But the second step is not a mechanical replacement. You really have to revalidate the conditions that make the algorithm correct.

The most common publish-subscribe pattern is:

data = 42; // ordinary write
flag.store(true, std::memory_order_release);

In another thread:

if (flag.load(std::memory_order_acquire)) {
    // Here it can see the earlier data = 42.
}

If both release/acquire are changed to relaxed, then even though flag itself is still atomic, the consumer is not guaranteed to see the producer’s earlier data when it sees flag == true.

This model is almost the basic building block of lock-free queues, linked lists, and stacks:

write the object contents first
then publish a pointer, index, or state bit
the other side first observes the published result
then reads the published object contents

Once you fully understand this block, a lot of queue code becomes much smoother to read.

What Memory Orders Are Commonly Used in Queues?

With the concepts above in place, the memory ordering in queue implementations becomes easier to understand.

1. Accessing an Index Owned by This Thread Can Often Be relaxed

For example, in an SPSC queue, the producer reads its own tail:

const size_t tail = tail_.load(std::memory_order_relaxed);

Only the producer writes tail_, so it does not need to use this read to receive data published by another thread.

2. Observing a Boundary Published by the Other Side Usually Needs acquire

For example, the consumer reads tail_ to decide whether there is new data:

const size_t tail = tail_.load(std::memory_order_acquire);

The producer publishes “this slot has been written” through tail_.store(..., release), so the consumer needs acquire to truly receive that publication.

3. Publishing a New Node, Element, or State Usually Needs release

For example, after the producer writes an element into a ring-buffer slot, it updates tail_:

buffer_[tail & mask_] = std::move(v);
tail_.store(tail + 1, std::memory_order_release);

Without release, even if the consumer sees the new tail, it may not reliably see the value written into the slot.

4. Competing for a Shared Position While Both Reading and Writing Often Uses acq_rel

In an MPMC queue, multiple threads often advance head_ or tail_ through CAS, and acq_rel is common.

That kind of operation is neither a pure publication nor a pure receive. It:

makes a decision based on current shared state
changes that shared state to a new value

An Easy-to-Miss CAS Detail: Success and Failure Orders

C++ compare_exchange can specify separate memory orders for success and failure:

head_.compare_exchange_weak(
    expected,
    desired,
    std::memory_order_acq_rel,
    std::memory_order_acquire);

This means:

if CAS succeeds, the current thread both reads old state and publishes new state, so it uses acq_rel
if CAS fails, the current thread did not write successfully, but it read the latest value, so it usually needs at least acquire for the observed state

Part of why lock-free code is hard to read is that it contains both:

ordering requirements on the success path
ordering requirements on the failure-and-retry path

A Practical Rule of Thumb

When you are writing lock-free code and are unsure about the memory order, ask yourself:

Is this atomic operation publishing an object or state that has already been written?
Is this atomic operation receiving a result published by another thread?
Is this atomic operation only for counting, reserving a slot, or getting a ticket, or does it also carry data-visibility meaning?

The usual mapping is:

pure atomic counter or ticket: consider relaxed first
publish: consider release first
receive a published result: consider acquire first
both receive and publish: consider acq_rel first
not sure, and you want simple reasoning first: use seq_cst to validate the logic

This rule of thumb is not a formal proof, but it is very useful in engineering.

Looking at release/acquire on a Timeline

Definitions can still feel abstract, so put them on a timeline:

sequenceDiagram
    participant P as "Producer thread"
    participant S as "shared atomic flag"
    participant C as "Consumer thread"

    Note over P: write payload / next / value
    P->>S: store(true, release)
    C->>S: load(acquire) == true
    Note over C: now it can safely read the previously published data

The important thing is not the boolean flag itself. It is:

the producer first completes the payload write in its own thread
then uses release semantics to publish a “data is ready” signal
once the consumer reads that signal with acquire
it can observe the producer’s writes that happened before the release

A practical way to remember it:

release pushes previous writes out, and acquire pulls subsequent reads in.

If you only have atomicity but not this ordering guarantee, you may see the following counterintuitive result:

Producer's view: write payload first, then flag = true
Consumer's view: see flag = true first, but payload still looks unwritten

This is why “using atomic variables” does not automatically make memory ordering reliable.

The Difference Between x86 and ARM

If you go one layer deeper, you will run into a very common engineering phenomenon:

The same code feels “fine” on x86, then suddenly breaks on ARM.

This is not an illusion. The two architectures have different default memory-model strengths.

Why x86 Can Create a False Sense of Safety

x86 usually follows a relatively strong TSO, or Total Store Order, style. Roughly speaking:

store-store ordering is relatively strong
load-load ordering is also relatively strong
many acquire/release operations look like ordinary loads and stores at the assembly level

So some code that lacks explicit synchronization may “happen to run for a long time without problems” on x86.

But this does not mean x86 is a sequential-consistency paradise. It still has:

store buffers
delays at the observable level
store-load issues in some scenarios

It is just less likely than ARM to expose these problems directly.

Why ARM Exposes Bugs More Easily

ARM’s default memory model is weaker. Roughly speaking:

unexpected visibility and ordering issues are more likely
both the compiler and the hardware rely more on explicit synchronization semantics
acquire/release often requires more explicit hardware constraints

So lock-free code that “seems fine on x86” may quickly expose problems on ARM:

reading data that is not fully initialized
seeing a state bit first, then old payload
CAS logic failing because it relied on implicit ordering

The Most Important Engineering Conclusion

When writing cross-platform lock-free code, the worst thing to do is:

Run it on x86 first, then treat “no bug appeared” as proof that the memory ordering is correct.

The reliable approach is:

express your synchronization intent strictly through the C++ memory model
use acquire/release/acq_rel where they are actually needed
do not depend on luck from one particular architecture being “strong by default”

For lock-free programming, this advice is close to survival gear.

A Linked Lock-Free Queue

Start with the easiest version for learning: a single-producer single-consumer linked queue.

SPSC has a natural separation of responsibilities:

only the producer modifies the tail pointer
only the consumer modifies the head pointer
the truly shared question is when a node becomes linked into the list

This makes the implementation much simpler than a multi-producer multi-consumer queue.

A Simple SPSC Linked Queue

Here is an implementation that shows the core idea. To make the example clearer, it uses a dummy node as a sentinel:

template<typename T>
class SpscLinkedQueue
{
private:
    struct Node
    {
        std::atomic<Node*> next { nullptr };
        std::optional<T> value;

        Node() = default;
        explicit Node(T v) : value(std::move(v)) {}
    };

public:
    SpscLinkedQueue()
    {
        Node* dummy = new Node();
        head_ = dummy;
        tail_ = dummy;
    }

    ~SpscLinkedQueue()
    {
        Node* cur = head_;
        while (cur != nullptr) {
            Node* next = cur->next.load(std::memory_order_relaxed);
            delete cur;
            cur = next;
        }
    }

    void Enqueue(T v)
    {
        Node* node = new Node(std::move(v));

        // Write data into the new node first, then link it to the tail.
        tail_->next.store(node, std::memory_order_release);
        tail_ = node;
    }

    bool Dequeue(T& out)
    {
        Node* next = head_->next.load(std::memory_order_acquire);
        if (next == nullptr) {
            return false;
        }

        out = std::move(*next->value);

        delete head_;
        head_ = next;
        return true;
    }

private:
    Node* head_ { nullptr }; // modified only by the consumer thread
    Node* tail_ { nullptr }; // modified only by the producer thread
};

The key points are:

The producer constructs node first and finishes writing value.
It then publishes the new node through tail_->next.store(..., release).
The consumer observes the new node through head_->next.load(acquire).
Once load(acquire) sees a non-null node, it can see the producer’s initialization of that node before the release.

This is the classic “write data first, then publish the pointer” pattern.

Why This Version Is Relatively Easy to Get Right

It avoids most of the truly difficult problems:

no multiple producers competing for the tail pointer
no multiple consumers competing for the head pointer
no CAS competition over the same head or tail
nodes are reclaimed by only one consumer

If you only want to explain acquire/release and the idea of “publishing a node”, an SPSC linked queue is a very good starting point.

Problems With the Linked Version

Although the linked implementation is easy to understand, it is often not the highest-throughput choice in real systems.

1. Each enqueue allocates a node

new/delete hurts throughput badly. Even if you build your own object pool, allocation and reclamation are still extra costs.

2. Pointer chasing has poor cache locality

Linked-list nodes are usually scattered across memory. The consumer has to jump through pointers every time, which makes prefetching hard and increases cache misses.

3. Scaling to MPMC is hard

Once you need to support multiple producers or multiple consumers, advancing head and tail can no longer rely on the natural SPSC division of work. You usually need CAS, and immediately run into:

ABA
node reclamation timing
whether a dequeued node might still be read by another thread

This is why many truly high-performance lock-free queues try to reduce dynamic allocation at the node level.

A Circular-Array Lock-Free Queue

If queue capacity can be known in advance, a circular array is usually a better choice.

Its advantages are direct:

no per-element node allocation
contiguous memory, so it is cache-friendly
advancing indexes is easier to optimize than chasing pointers

The simplest SPSC ring queue needs only two monotonically increasing counters:

head points to the next position to read
tail points to the next position to write

The array index is mapped to the real slot through index & mask, assuming the capacity is a power of two.

A Simple SPSC Circular-Array Queue

template<typename T>
class SpscRingQueue
{
public:
    explicit SpscRingQueue(size_t capacityPow2)
        : capacity_(capacityPow2)
        , mask_(capacityPow2 - 1)
        , buffer_(capacityPow2)
    {
    }

    bool Enqueue(T v)
    {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        const size_t head = head_.load(std::memory_order_acquire);

        if (tail - head == capacity_) {
            return false; // full
        }

        buffer_[tail & mask_] = std::move(v);
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    bool Dequeue(T& out)
    {
        const size_t head = head_.load(std::memory_order_relaxed);
        const size_t tail = tail_.load(std::memory_order_acquire);

        if (head == tail) {
            return false; // empty
        }

        out = std::move(*buffer_[head & mask_]);
        buffer_[head & mask_].reset();
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

private:
    const size_t capacity_;
    const size_t mask_;
    std::vector<std::optional<T>> buffer_;

    alignas(64) std::atomic<size_t> head_ { 0 };
    alignas(64) std::atomic<size_t> tail_ { 0 };
};

The idea is essentially the same as in the linked version:

the producer writes the element into the slot first
then publishes “a new element is readable” with tail.store(..., release)
the consumer uses tail.load(acquire) to decide whether there is new data

Why is this faster than the linked version?

element slots are contiguous in memory
there is no new/delete on every operation
synchronization happens mainly between two increasing counters

This is why circular arrays are so common in high-throughput queues.

This Version Is Still Only SPSC

It is worth stressing that the ring queue above is still single-producer single-consumer only.

If multiple producers write at the same time, they will compete for the same tail. Then you must:

use CAS or fetch_add to reserve slots
handle the intermediate state where a thread has reserved a slot but has not finished writing data
solve how the consumer decides whether a slot is truly ready

At that point, implementation complexity rises quickly. You usually need to maintain an additional state bit or sequence number for each slot.

From SPSC to MPSC / MPMC

The SPSC ring queue above is simple because:

tail is advanced by only one producer
head is advanced by only one consumer

Once you move to MPSC or MPMC, at least one of these conditions breaks. Several threads begin competing for the same index.

Why You Cannot Just Change tail to fetch_add

The most obvious change is usually:

size_t ticket = tail_.fetch_add(1, std::memory_order_acq_rel);
buffer_[ticket & mask_] = value;

This only solves “multiple producers will not get the same slot”. It is far from enough, because the consumer immediately faces two problems:

has this slot actually been written?
is the content in this slot from the current round, or leftover data from the previous round?

In other words, tail only solves position reservation. It does not solve commit completion or slot generation.

This is one of the easiest things to miss when writing an MPMC queue for the first time.

A Common Engineering Approach: One sequence per Slot

Mainstream bounded MPMC ring queues commonly attach a sequence number to each slot. This sequence is not decoration. It solves three things at once:

whether the current slot is empty
whether the current slot has been written and can be read
which round this slot belongs to after the array wraps around

A typical slot looks like this:

template<typename T>
struct Slot
{
    std::atomic<size_t> sequence;
    T value;
};

During initialization, assume the capacity is N; the sequence of slot i is initialized to i.

Then there is a memorable convention:

for producers, if slot.sequence == ticket, the slot is writable in this round
for consumers, if slot.sequence == head + 1, the slot is readable in this round

A Simplified MPMC Ring Queue

The following is not a complete industrial implementation. It extracts the core mechanism so it is easier to understand what sequence is doing.

template<typename T>
class BoundedMpmcQueue
{
private:
    struct Slot
    {
        std::atomic<size_t> sequence;
        T value;
    };

public:
    explicit BoundedMpmcQueue(size_t capacityPow2)
        : capacity_(capacityPow2)
        , mask_(capacityPow2 - 1)
        , buffer_(capacityPow2)
    {
        for (size_t i = 0; i < capacity_; ++i) {
            buffer_[i].sequence.store(i, std::memory_order_relaxed);
        }
    }

    bool Enqueue(T v)
    {
        size_t pos = tail_.load(std::memory_order_relaxed);

        for (;;) {
            Slot& slot = buffer_[pos & mask_];
            size_t seq = slot.sequence.load(std::memory_order_acquire);
            intptr_t diff = static_cast<intptr_t>(seq) - static_cast<intptr_t>(pos);

            if (diff == 0) {
                if (tail_.compare_exchange_weak(
                    pos,
                    pos + 1,
                    std::memory_order_relaxed)) {
                    slot.value = std::move(v);
                    slot.sequence.store(pos + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false; // full
            } else {
                pos = tail_.load(std::memory_order_relaxed);
            }
        }
    }

    bool Dequeue(T& out)
    {
        size_t pos = head_.load(std::memory_order_relaxed);

        for (;;) {
            Slot& slot = buffer_[pos & mask_];
            size_t seq = slot.sequence.load(std::memory_order_acquire);
            intptr_t diff = static_cast<intptr_t>(seq) - static_cast<intptr_t>(pos + 1);

            if (diff == 0) {
                if (head_.compare_exchange_weak(
                    pos,
                    pos + 1,
                    std::memory_order_relaxed)) {
                    out = std::move(slot.value);
                    slot.sequence.store(pos + capacity_, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false; // empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }

private:
    const size_t capacity_;
    const size_t mask_;
    std::vector<Slot> buffer_;

    alignas(64) std::atomic<size_t> head_ { 0 };
    alignas(64) std::atomic<size_t> tail_ { 0 };
};

This looks twisty at first. Split it into two layers and it becomes much clearer.

Enqueue, Line by Line

First look only at the producer path:

size_t pos = tail_.load(std::memory_order_relaxed);

This reads a current guess for the write position. It is only a guess, not ownership, because other producers may be competing for the same position.

Slot& slot = buffer_[pos & mask_];
size_t seq = slot.sequence.load(std::memory_order_acquire);
intptr_t diff = static_cast<intptr_t>(seq) - static_cast<intptr_t>(pos);

These three lines ask the slot a question:

From the point of view of logical position pos, is it my turn to write this physical slot?

So the meaning of diff is crucial:

diff == 0 The slot’s sequence is exactly equal to pos, so it is writable in this round.
diff < 0 The slot’s sequence is behind pos, which means the consumer has not returned this slot yet, and the queue is tending toward full.
diff > 0 The slot state is already ahead of what this thread expected, which means some other thread advanced the state and this thread’s view of pos is stale.

Only when diff == 0 may the producer continue competing for write ownership:

if (tail_.compare_exchange_weak(
    pos,
    pos + 1,
    std::memory_order_relaxed)) {
    slot.value = std::move(v);
    slot.sequence.store(pos + 1, std::memory_order_release);
    return true;
}

Two different state transitions happen here:

tail_ is CASed from pos to pos + 1 This means “I reserved logical position pos; nobody else should touch it.”
slot.sequence is changed from pos to pos + 1 This means “the data in this slot has really been written, so it can now be read.”

These two actions must not be conflated.

tail_ solves “who owns this logical position”
sequence solves “whether the data in this slot is ready”

If you only have the former without the latter, the consumer may read the slot too early, right after the producer reserves the position but before the value is written.

Dequeue, Line by Line

The consumer path is symmetric. The condition changes to “is this slot readable?”

size_t pos = head_.load(std::memory_order_relaxed);
Slot& slot = buffer_[pos & mask_];
size_t seq = slot.sequence.load(std::memory_order_acquire);
intptr_t diff = static_cast<intptr_t>(seq) - static_cast<intptr_t>(pos + 1);

The pos + 1 here is critical. For a consumer, the readable state of the slot is not pos; it is pos + 1, after the producer has published it.

So the consumer interprets diff like this:

diff == 0 The slot has been written and can be read.
diff < 0 The slot has not entered the readable state for this round, so the queue is tending toward empty.
diff > 0 The observation is stale again; another thread already advanced first.

The order during the actual read also matters:

if (head_.compare_exchange_weak(
    pos,
    pos + 1,
    std::memory_order_relaxed)) {
    out = std::move(slot.value);
    slot.sequence.store(pos + capacity_, std::memory_order_release);
    return true;
}

Again there are two state transitions:

head_ changes from pos to pos + 1 This means “I got the right to consume this logical position.”
slot.sequence changes from pos + 1 to pos + capacity_ This means “this physical slot has been cleared and can be written again when the next round wraps around.”

Why not write pos back? Because a circular array wraps around. The same physical slot is reused for many rounds. Adding capacity_ encodes “the next round” and prevents mixing up a new round with an old one.

Layer One: Reserve a Number First, Then Touch the Slot

Multiple producers cannot write into the same slot directly, so they first compete around tail_. Whoever succeeds in CAS owns write permission for position pos. Consumers do the same around head_. Whoever succeeds in CAS owns read permission for position pos.

This step guarantees:

multiple producers do not write the same logical position
multiple consumers do not read the same logical position

Layer Two: sequence Distinguishes Empty, Full, and Generation

The hard part is that the array wraps around. The same physical slot repeatedly carries different data from different rounds. Without sequence, it is hard to determine from the slot alone:

whether the slot is empty
whether it has been written by a new-round producer
whether it still contains leftover data from a previous round

sequence encodes this generation information.

For producers:

seq == pos means this slot is exactly ready to be written by the current producer
seq < pos means the consumer has not released this slot yet, and the queue is tending toward full
seq > pos means another thread already advanced the state, so the producer should observe again

For consumers:

seq == pos + 1 means this slot has been written and can be read
seq < pos + 1 means the producer has not finished publishing, and the queue is tending toward empty
seq > pos + 1 means another thread already handled it first, so the consumer should observe again

Why This Avoids Many Linked-List Problems

Compared with the linked version, a bounded ring queue has several obvious advantages:

slots are preallocated, so there is no new/delete per element
there is no linked-list node reclamation, which avoids a large part of the memory-reclamation problem
each slot’s state is localized, which is much more cache-friendly

But it also has costs:

capacity is usually fixed, or resizing is very hard
the implementation details are much harder than SPSC
correctness depends heavily on memory ordering and state transitions

A Slot-Rotation Diagram Makes It Clearer

Assume capacity = 4, and look only at physical slot[0]. This slot is reused by logical positions 0, 4, 8, 12, and so on.

Its sequence follows this cycle:

flowchart LR
    A["slot[0]<br/>sequence = 0<br/>matches pos = 0<br/>state: writable in round 0"]
    B["slot[0]<br/>sequence = 1<br/>matches consumer pos + 1 = 1<br/>state: readable in round 0"]
    C["slot[0]<br/>sequence = 4<br/>matches next-round pos = 4<br/>state: writable in round 1"]
    D["slot[0]<br/>sequence = 5<br/>matches next-round consumer pos + 1 = 5<br/>state: readable in round 1"]

    A -->|"after producer writes value<br/>store(pos + 1)"| B
    B -->|"after consumer reads value<br/>store(pos + capacity)"| C
    C -->|"next-round producer writes again"| D

Once this diagram makes sense, pos, pos + 1, and pos + capacity_ stop looking like magic constants:

pos The slot is writable by the producer in the current round.
pos + 1 The slot has been written in the current round and can be read by the consumer.
pos + capacity_ The slot has completed a full round and enters the next-round writable state.

Or, as a forced mnemonic:

The producer advances the slot from “writable in this round” to “readable in this round”; the consumer advances it again to “writable in the next round”.

Understanding “Reservation” and “Publication” on a Timeline

If we extract one producer and one consumer, the core MPMC flow can be simplified like this:

sequenceDiagram
    participant P as "one producer"
    participant T as "tail / head atomic index"
    participant Slot as "one slot sequence + value"
    participant C as "one consumer"

    P->>T: CAS / fetch_add, get pos
    P->>Slot: write value
    P->>Slot: sequence.store(pos + 1, release)
    C->>T: CAS / fetch_add, get pos
    C->>Slot: sequence.load(acquire) == pos + 1
    C->>Slot: read value
    C->>Slot: sequence.store(pos + capacity, release)

The key actions are just three steps:

reserve a logical position
read or write data in the slot
use sequence to switch the slot from “writable” to “readable”, then back to “writable in the next round”

Once you understand these three steps, mainstream bounded MPMC queue implementations become much easier to read.

Memory Reclamation, Hazard Pointers, and EBR

When writing lock-free data structures, one problem is very easy to underestimate: an element has been dequeued does not mean the node can be immediately deleted.

For locked structures, this is usually intuitive. While holding the lock, nobody else can enter, so you know no other thread is touching the node and deletion is relatively simple.

In lock-free structures, another thread may already have:

read the address of this node
not yet completed its CAS
or be using this address to read the next pointer

If you free the node directly, the risks appear:

at best, a dangling pointer read
at worst, the memory is reused and triggers a very subtle ABA

So the truly hard part of many lock-free algorithms is not “how to unlink a node”, but:

when you can know that no thread will access it again
how to hold it safely before that point

This is why memory-reclamation strategies exist.

Why ABA and Memory Reclamation Keep Appearing Together

Conceptually, ABA looks like a “CAS comparison was fooled” problem. Engineering-wise, memory reclamation looks like a “do not delete too early” problem. In practice, they often share the same root.

Consider a typical case:

Thread A reads the head pointer head = X.
Thread B pops X and frees it.
The allocator reuses the same address for a new node Y.
The new node is linked back at the head, and the address is still X.
Thread A performs CAS and wrongly thinks “the head is still the original node”.

From the CAS point of view, this is ABA.

From the memory-lifetime point of view:

a node was freed too early
an old address was reused too early

So in many cases, solving memory reclamation already greatly reduces ABA risk.

The Most Naive Method: Delayed Free

The simplest idea is: do not delete nodes immediately. Put them into a “retired list” first, and free them in batches later.

This idea is correct in spirit, but two questions matter:

how much later is “later”
how do you know no thread still references the node during that time

hazard pointer and epoch based reclamation are two mainstream answers to these questions.

What Is a Hazard Pointer?

A hazard pointer can be understood as:

Before a thread actually accesses a shared node, it publicly announces: “I am looking at this node, do not delete it yet.”

This “announcement” is usually writing the node address into the current thread’s own hazard slot. When another thread wants to reclaim nodes, it scans all hazard pointers published by all threads:

if a retired node still appears in any thread’s hazard pointer, it cannot be deleted yet
only when no hazard pointer references it can it really be freed

A Minimal Flow

When a thread prepares to read the head node of a linked list, the flow roughly looks like this:

Atomically read the current head pointer p.
Publish p into this thread’s hazard pointer.
Confirm again that the head pointer is still p.
If it is still p, then it is safe to use p.
Clear this thread’s hazard pointer after use.

Why “publish, then confirm again”?

Because after the thread reads p but before it has time to announce “I am looking at it”, another thread may already have unlinked and retired it. So implementations usually check the shared pointer again to ensure that the protected pointer is really the node they are about to access.

A Simplified Sketch

Thread A:
1. p = head.load()
2. hazard[A] = p
3. if (head.load() != p) retry
4. safely access *p
5. hazard[A] = nullptr

When thread B wants to reclaim a node:

1. retired_list.push(p)
2. scan hazard pointers from all threads
3. if no hazard pointer points to p
4. delete p

Advantages of Hazard Pointers

very precise, protecting specific nodes
no need to pause all threads globally
relatively friendly to a slow thread, because it only blocks the nodes it is currently protecting

Costs of Hazard Pointers

paths that access shared pointers become more verbose
each thread needs hazard slots
reclamation needs to scan hazard pointers from all threads, which can be costly with many threads
implementation details are demanding, especially the “publish protection then confirm again” step

When They Fit Well

In general, hazard pointer fits:

pointer-chasing data structures such as linked lists, stacks, and hash buckets
scenarios that need precise protection of a small number of nodes
scenarios where you do not want one slow thread to hold back a large batch of retired nodes

What Is Epoch-Based Reclamation?

epoch based reclamation, or EBR, is a more batch-oriented approach:

Instead of protecting every node precisely, decide by time batch whether a batch of retired nodes is old enough to delete.

It usually maintains a global epoch. For example:

the current global epoch is 10
when a thread enters a critical section, it records that it is in epoch 10
when a node is unlinked, it is placed into the “retired in epoch 10” list

Only when the system can confirm:

all active threads have left epoch 10
or all active threads have advanced to a later epoch

can nodes retired in epoch 10 be truly freed.

Why It Works

EBR relies on an assumption:

Threads only read shared nodes after entering a critical section.

If a node was retired in epoch 10, and we can confirm that no thread is still in epoch 10 or an earlier read path, then no one can legally access that node anymore.

A Minimal Flow

Thread enters a read/write critical section:
1. active = true
2. local_epoch = global_epoch

Thread leaves the critical section:
3. active = false

Node retirement:
4. retired[global_epoch].push(node)

Try reclamation:
5. if all active threads have entered a later epoch
6. free retired nodes from older epochs

Many implementations use a “three-epoch ring” or similar structure. The core idea is the same: put newly retired nodes into the current epoch, keep a buffer, and after all threads have crossed it, reclaim the older batch.

Advantages of EBR

the read path is usually lighter than hazard pointers
batch reclamation often has good throughput
the idea is common in queues, linked lists, and skip lists

Costs of EBR

reclamation precision is coarse, by batch rather than by node
if one thread stays in a critical section for a long time, it can hold back an entire batch of nodes
threads must explicitly register entry into and exit from protected regions

When It Fits Well

In general, EBR fits:

read-heavy scenarios where the read path should be as light as possible
systems with controlled thread models, where threads do not get stuck indefinitely in critical sections
systems that care more about overall throughput than precise immediate reclamation of each node

Choosing Between Hazard Pointers and EBR

A very engineering-oriented way to distinguish them is:

hazard pointer: “I am currently looking at these specific nodes; do not delete them.”
EBR: “Retired nodes older than a certain era can now be deleted in batches.”

Neither is absolutely more advanced. They make different tradeoffs.

If You Care More About Precision

hazard pointer is often a better fit, because it is precise down to the node. A slow thread only holds back the specific nodes it has declared protected; it does not block all retired garbage in an entire epoch.

If You Care More About a Lightweight Read Path

EBR is often more attractive. The read path usually only needs to record “I entered a protected period”; it does not need to publish a hazard pointer for every node.

A Simple Comparison

Dimension	Hazard Pointer	EBR
Protection granularity	Per node	Per batch
Read-path cost	Higher	Lower
Reclamation method	Scan hazards, then free precisely	Batch free after epochs advance
Slow-thread impact	Holds back a small number of protected nodes	May hold back an entire batch of retired nodes
Implementation difficulty	Publishing protection and confirming again	Epoch advancement and thread liveness management

Their Relationship With ABA

To be clear: hazard pointer and EBR are not algorithms invented specifically for ABA. They are fundamentally safe memory reclamation schemes.

Why do they always appear together with ABA?

Because the most dangerous ABA cases often come from:

a node is unlinked
the node is freed
the address is reused
CAS mistakes “a new object at the same address” for “the same old object”

If you can guarantee that a node is not freed and reused while another thread may still be looking at it, you have already suppressed many ABA risks.

Of course, some algorithms may still need:

version numbers
tagged pointers
sequence numbers

Some ABA problems are not caused by free-and-reuse. The algorithm’s state may genuinely go around and return to the same value.

Why Ring Buffers Often Do Not Need Them, While Linked Structures Often Do

The bounded ring queue discussed earlier usually does not need hazard pointer or EBR. The fundamental reasons are:

slots are preallocated
nodes are not frequently newed and deleted at runtime
the focus is slot state and generation, not node lifetime

Linked lists, stacks, and lock-free hash tables are different:

nodes are often dynamically allocated
nodes are unlinked
whether a node can still be seen by another thread is not always obvious

Without an extra reclamation strategy, the code easily becomes “logically unlinked, but lifetime never explained”.

A Practical Test

When you look at a lock-free data structure, if it satisfies all of these:

dynamically allocated nodes
nodes may be removed concurrently
other threads may still hold old pointers

then immediately ask:

When are these old nodes freed, and what guarantees no one can access them at that moment?

If this question has no clear answer, the algorithm is probably not complete.

Putting Them Into Michael-Scott Queue

After all these concepts, use a classic linked lock-free queue to connect them. The closest example to this article’s theme is the Michael-Scott queue, often abbreviated as MS queue.

Its basic structure is simple:

maintain an atomic head
maintain an atomic tail
always keep a dummy node in the linked list
enqueue links a new node at the tail
dequeue actually takes head->next

Very roughly:

head --> [dummy] --> [A] --> [B] --> null
tail ----------------------^

When a consumer takes an element, it does not directly delete A. Instead:

read the current head
read head->next
CAS head forward to head->next
the old dummy node becomes a “retired node”

This is exactly where memory reclamation is easiest to get wrong.

Why MS Queue Cannot Directly delete oldHead in Dequeue

Suppose thread A and thread B both call Dequeue.

Thread A may have:

read oldHead
not yet read oldHead->next

Thread B then succeeds first, advances head to the next node, and immediately does:

delete oldHead;

Logically, this may look fine because head no longer points to the old node. But thread A may still:

hold the address oldHead
be accessing oldHead->next
even be preparing to compare based on it in the next loop

So the risk here is not abstract “thread unsafety”. It is two very concrete risks:

use-after-free
ABA after the node address is reused

In a correct MS queue implementation, oldHead usually cannot be freed immediately after a successful CAS. It must be retired first and handed to a reclamation strategy.

A Timing Diagram

Without a reclamation strategy, the dangerous timing roughly looks like this:

sequenceDiagram
    participant A as "Thread A"
    participant Q as "MS queue"
    participant B as "Thread B"
    participant M as "Allocator"

    A->>Q: oldHead = head.load()
    Note over A: has not safely dereferenced oldHead->next yet
    B->>Q: oldHead = head.load()
    B->>Q: next = oldHead->next
    B->>Q: CAS(head, oldHead, next) succeeds
    B->>M: delete oldHead
    M-->>Q: this address may be reused later
    A->>Q: continue accessing oldHead->next
    Note over A,Q: this may now be use-after-free

The key point is not that “threads A and B both read head”. It is:

B has logically unlinked the old dummy node
but A still physically holds that node address
logical removal and physical destruction are not the same thing

Many lock-free bugs hide exactly in this gap.

A Dequeue Skeleton With Only the Core Logic

The following code intentionally omits many engineering details. It keeps only the main thread: why a reclamation strategy is needed.

bool Dequeue(T& out)
{
    for (;;) {
        Node* oldHead = head_.load(std::memory_order_acquire);
        Node* next = oldHead->next.load(std::memory_order_acquire);

        if (next == nullptr) {
            return false;
        }

        if (head_.compare_exchange_weak(
            oldHead,
            next,
            std::memory_order_acq_rel,
            std::memory_order_acquire)) {
            out = std::move(next->value);
            Retire(oldHead);
            return true;
        }
    }
}

The most important line here is not the CAS. It is:

Retire(oldHead);

The hard question is not “can I advance the head to the next node?” It is “after advancing it, when can I really delete the old node?”

How Hazard Pointers Usually Fit

In an MS queue, the consumer’s most dangerous operation is walking through shared pointers:

read head
then read head->next

A typical approach is to protect the node being dereferenced with a hazard pointer.

Typical Pattern When Reading head

1. oldHead = head.load()
2. hp[me] = oldHead
3. if (head.load() != oldHead) retry
4. next = oldHead->next.load()

This means:

first declare oldHead as “currently in use”
then confirm that the shared head really is still it
only after confirmation dereference oldHead->next

Then, even if another thread has already unlinked oldHead from the list, it cannot actually free it as long as it sees some thread’s hazard pointer still pointing at oldHead.

How Timing Changes With Hazard Pointers

In the same two-thread scenario, with hazard pointers, thread A first publicly declares “I am looking at oldHead”:

sequenceDiagram
    participant A as "Thread A"
    participant HP as "hazard slot[A]"
    participant Q as "MS queue"
    participant B as "Thread B"

    A->>Q: oldHead = head.load()
    A->>HP: publish(oldHead)
    A->>Q: confirm head is still oldHead
    B->>Q: next = oldHead->next
    B->>Q: CAS(head, oldHead, next) succeeds
    B->>Q: retire(oldHead)
    B->>HP: scan hazard pointers
    Note over B,HP: finds that A still protects oldHead
    B-->>Q: cannot delete oldHead yet
    A->>Q: safely read oldHead->next
    A->>HP: clear()
    B->>HP: next scan
    B-->>Q: now oldHead can really be freed

This diagram shows the core value of hazard pointers:

they do not remove nodes from the queue
they prevent nodes from being destroyed while another thread may still dereference them

This is why reclamation is not an optional optimization. It is part of correctness.

After CAS Succeeds

Once the current thread successfully advances head to next, the old oldHead enters a retired list:

retired_list.push(oldHead)
scan_all_hazard_pointers()
if (oldHead not protected) {
    delete oldHead
}

Note that the reclaimed node is not next; it is the old dummy node oldHead. This is one of the easy-to-miss details in MS queue.

Why This Is a Good Hazard Pointer Example

In this example, the thread only needs to precisely protect a small number of nodes, usually:

the current head
and in some implementations, next

This matches the hazard pointer style very well: precisely protect a few nodes currently being accessed.

How EBR Usually Fits

With EBR, the idea is no longer “announce protection for each node”. Instead:

before entering Dequeue, the thread declares that it has entered a protected epoch
inside this epoch, it reads head, reads next, and performs CAS
if CAS succeeds, it puts oldHead into the retired list for the current epoch
when the operation finishes, it leaves the protected region

The pseudoflow is:

EnterEpoch()

oldHead = head.load()
next = oldHead->next.load()
if (CAS(head, oldHead, next)) {
    retire_in_current_epoch(oldHead)
}

LeaveEpoch()

Now oldHead is not actually deleted by this Dequeue. It is decided by epoch advancement across the whole system:

if some thread is still in an old epoch
then nodes retired in that old epoch cannot be deleted
only after all active threads pass that epoch
can the batch of retired nodes be freed together

Why It Is Also Common in Queues

Enqueue and dequeue operations are usually very frequent. If every pointer traversal has to publish a separate hazard pointer, the hot path can become heavy. EBR is attractive because:

register once when the operation starts
leave once when the operation ends
hand retired nodes to batch reclamation

If the thread model is regular, this style usually works smoothly.

How to Think About EBR Here

EBR does not publish protection for each node, but it solves the same time-window problem at a coarser granularity:

sequenceDiagram
    participant A as "Thread A"
    participant E as "epoch system"
    participant Q as "MS queue"
    participant B as "Thread B"

    A->>E: EnterEpoch()
    A->>Q: read oldHead / next
    B->>E: EnterEpoch()
    B->>Q: CAS(head, oldHead, next) succeeds
    B->>E: retire oldHead into current epoch
    B->>E: LeaveEpoch()
    Note over E: because A is still in the old epoch, oldHead cannot be freed yet
    A->>Q: finish this Dequeue path
    A->>E: LeaveEpoch()
    Note over E: after no active thread remains in the old epoch, oldHead can be batch reclaimed

Intuitively:

hazard pointers protect “this specific node”
EBR protects “all old nodes in this time period”

The Difference in the Same Dequeue

Looking at the same Dequeue, the difference is clear:

hazard pointer cares about “can this specific node in my hand be deleted?”
EBR cares about “am I still inside a protected era?”

That is:

hazard pointer is node-level protection
EBR is critical-section-level protection

Both can work for an MS queue, but the code style is quite different.

Another Classic ABA Scene: Treiber Stack

If MS queue is better for explaining why safe reclamation is needed, Treiber stack is better for making ABA itself obvious.

Its core structure is very simple:

an atomic top
each node has a next
both push and pop CAS around top

A minimal pop skeleton looks like this:

Node* Pop()
{
    for (;;) {
        Node* oldTop = top_.load(std::memory_order_acquire);
        if (oldTop == nullptr) {
            return nullptr;
        }

        Node* next = oldTop->next;
        if (top_.compare_exchange_weak(
            oldTop,
            next,
            std::memory_order_acq_rel,
            std::memory_order_acquire)) {
            return oldTop;
        }
    }
}

At first glance it looks textbook-clean. But because it relies so heavily on CAS over top, ABA becomes especially obvious.

A Classic ABA Timeline

Assume the initial stack is:

top --> A --> B --> C

Thread 1 starts pop:

reads oldTop = A
reads next = B
has not yet performed CAS

Thread 2 changes the stack around:

pops A
pops B
pushes A back

Now the stack may become:

top --> A --> C

For thread 1, the dangerous part is that when it looks at top again, the address is still A. So its CAS may succeed and change top from A to its cached old B.

The problem is that B may now:

no longer be in the stack
have been taken by another thread
even have been freed or reused

A Timeline Diagram

sequenceDiagram
    participant T1 as "Thread 1"
    participant S as "Treiber stack"
    participant T2 as "Thread 2"

    T1->>S: read top = A
    T1->>S: read next = B
    Note over T1: paused, CAS not performed yet

    T2->>S: pop A, top becomes B
    T2->>S: pop B, top becomes C
    T2->>S: push A, top becomes A again

    T1->>S: CAS(top, A, B) succeeds
    Note over T1,S: Thread 1 thought it only removed A\nbut actually wrote a stale B back into top

This is the classic smell of ABA:

from the perspective of “is the value equal to A”, the CAS condition is satisfied
but from the perspective of “is the stack still the state I observed”, the world has already changed

What Is Actually Wrong?

Thread 1’s implicit assumption is:

As long as top is still A, A->next is still the B I saw earlier.

ABA proves this assumption false.

CAS verifies only “the current value of the head pointer”. It does not verify whether other state changes happened in between.

This is why Treiber stack is such a good example of one fact:

compare_exchange verifies a point, not a history.

Why Version Numbers / Tagged Pointers Are Common Here

A common idea is to stop comparing only raw pointers. Compare:

pointer plus version number
pointer plus tag

That changes top from:

to:

(A, version = 17)

Then even if the address returns to A, if a pop or push happened in between, the version number is no longer the original value. Thread 1’s CAS fails, forcing it to reread the latest state.

Intuitively:

a raw pointer answers “is this the current address?”
a tagged pointer answers “is this the current address, and has no one touched it in between?”

But Version Numbers Are Not the Whole Answer

One important note: even with a version number, the memory-reclamation problem does not disappear automatically.

The reason is simple:

the version number solves “did the state go around and come back?”
safe reclamation solves “can this node really be destroyed now?”

So in real Treiber stack implementations, you often see two kinds of things together:

tagged pointer / pointer plus counter
hazard pointers, EBR, or another reclamation scheme

The former mainly solves ABA observation distortion; the latter mainly solves node lifetime.

The Contrast With MS Queue Builds Intuition

Putting the two classic structures together is useful:

MS queue makes it easier to see that a node cannot be deleted immediately after unlinking
Treiber stack makes it easier to see that “the address came back” does not mean “the state did not change”

One is more about lifetime; the other is more about state wrapping around. Together they cover the two most troublesome classes of lock-free problems.

Why This Example Connects the Previous Concepts

MS queue ties together several hard parts from the whole article:

CAS advances head and tail
memory_order guarantees the order between node contents and pointer publication
ABA if old nodes are freed and reused too early, CAS observations may be distorted
hazard pointer / EBR answers “when can an old node be safely deleted?”

If someone can follow this line and understand the Dequeue of an MS queue, they have probably moved past the stage of merely memorizing lock-free terminology.

The One Sentence Worth Remembering

For linked lock-free structures, CAS often only solves “how to unlink a node”. Hazard pointers, EBR, or another reclamation strategy solve “when the unlinked node can actually be destroyed”.

Typical Optimization Directions

After understanding the two SPSC versions above, mainstream open-source implementations become much less unfamiliar. Most of them optimize further in the following directions.

If head and tail fall on the same cache line, the producer and consumer will constantly invalidate each other’s cache line. Common approaches are:

use alignas(64) on hot variables
separate producer-side and consumer-side state in memory layout

2. Batch commits to reduce atomic operations

The producer can write several elements locally, then update the visible tail once. The consumer can also consume in batches. This can significantly reduce the number of atomic synchronizations.

3. Cache the other side’s index locally

The producer does not necessarily need to atomically reread head on every enqueue. It can cache a local headCache and synchronize only when it suspects the queue is full. The consumer can similarly cache tail. This is common in high-frequency paths.

4. Use a power-of-two capacity

Then slot indexing can use:

index & (capacity - 1)

instead of %, which is usually cheaper.

Many high-performance MPMC queues do not make every thread fight over one global head/tail. Instead:

each producer maintains its own subqueue
consumers poll several subqueues
or the structure is sharded by thread, NUMA node, or CPU core

This essentially reduces hot-spot contention.

6. Handle memory reclamation, not just enqueue and dequeue

When moving from SPSC to MPMC, the hardest part is often not “putting elements in and taking them out”, but:

who reclaims nodes
when reclamation is safe
whether address reuse can trigger ABA

This is also one of the hardest parts of many lock-free algorithm papers.

What Mainstream Implementations Are Roughly Doing

With the background above, the two libraries mentioned at the beginning become easier to understand.

ConcurrentQueue

The core idea of ConcurrentQueue is not to make all threads fight over one shared linked list. Instead, it:

uses block storage to reduce per-element allocation
localizes producer state
splits contention across multiple substructures

It is not the easiest implementation to explain in a textbook, but it fits engineering reality very well: if you want throughput, reduce global hot spots as much as possible.

AtomicQueue

AtomicQueue leans more toward the circular-array approach, and its code structure makes it easier to connect directly to “preallocated slots plus atomic index advancement”.

If you only want to build intuition for high-performance queues, starting from the ring-buffer version is often easier.

When Not to Use Lock-Free

Lock-free does not necessarily mean faster, and it is not always a better fit for business code.

Locks may be more appropriate in these cases:

the critical section is large and not suitable for high-frequency synchronization anyway
concurrency is low and lock contention is almost nonexistent
correctness matters more than maximum throughput
the team is not familiar with lock-free programming, so maintenance cost is too high

A very practical rule is: start with the simple correct solution, use profiling to prove that locks are the bottleneck, and only then consider switching to lock-free.

Summary

The core of lock-free queues is not “knowing how to write CAS”. It is understanding the following:

atomicity and visibility are not the same thing
what release/acquire actually guarantees when publishing data
why SPSC is a good starting point because its state space is small
why linked lists are easy to explain but not always fast, while circular arrays are usually better for high throughput
once you enter MPMC, slot state, generation checks, and memory reclamation must be taken seriously
whenever a data structure dynamically removes nodes, you must ask how ABA and safe reclamation are handled
hazard pointer and EBR are not “extra optimizations”; often they are part of correctness

If this article is expanded later, natural directions include:

a separate post on the C++ memory model, going deeper into fences, release sequences, and hardware models
a fully compilable bounded MPMC queue with tests
a separate post connecting classic linked lock-free structures such as Michael-Scott queue and Treiber stack with reclamation strategies

That would take us from “able to read lock-free queue code” to “able to implement and verify one ourselves”.

Lock-Free Programming#

Atomic Operations and CAS#

The ABA Problem#

Memory Ordering#

Separate Three Things First: Atomicity, Visibility, and Ordering#

1. Atomicity#

2. Visibility#

3. Ordering#

Who Is Reordering Things?#

Why the Compiler Reorders#

Why the CPU Also Reorders#

Correct Assembly Order Still Does Not Guarantee Cross-Core Observation Order#

Why Acquire / Release Help#

Why Cache Coherence Is Not Enough#

The Most Important Line in the C++ Memory Model: Happens-Before#

What Each memory_order Guarantees#

memory_order_relaxed#

memory_order_release#

memory_order_acquire#

memory_order_acq_rel#

memory_order_seq_cst#

Why the Publish-Subscribe Model Matters So Much#

What Memory Orders Are Commonly Used in Queues?#

1. Accessing an Index Owned by This Thread Can Often Be relaxed#

2. Observing a Boundary Published by the Other Side Usually Needs acquire#

3. Publishing a New Node, Element, or State Usually Needs release#

4. Competing for a Shared Position While Both Reading and Writing Often Uses acq_rel#

An Easy-to-Miss CAS Detail: Success and Failure Orders#

A Practical Rule of Thumb#

Looking at release/acquire on a Timeline#

The Difference Between x86 and ARM#

Why x86 Can Create a False Sense of Safety#

Why ARM Exposes Bugs More Easily#

The Most Important Engineering Conclusion#

A Linked Lock-Free Queue#

A Simple SPSC Linked Queue#

Why This Version Is Relatively Easy to Get Right#

Problems With the Linked Version#

1. Each enqueue allocates a node#

2. Pointer chasing has poor cache locality#

3. Scaling to MPMC is hard#

A Circular-Array Lock-Free Queue#

A Simple SPSC Circular-Array Queue#

This Version Is Still Only SPSC#

From SPSC to MPSC / MPMC#

Why You Cannot Just Change tail to fetch_add#

A Common Engineering Approach: One sequence per Slot#

A Simplified MPMC Ring Queue#

Enqueue, Line by Line#

Dequeue, Line by Line#

Layer One: Reserve a Number First, Then Touch the Slot#

Layer Two: sequence Distinguishes Empty, Full, and Generation#

Why This Avoids Many Linked-List Problems#

A Slot-Rotation Diagram Makes It Clearer#

Understanding “Reservation” and “Publication” on a Timeline#

Memory Reclamation, Hazard Pointers, and EBR#

Why ABA and Memory Reclamation Keep Appearing Together#

The Most Naive Method: Delayed Free#

What Is a Hazard Pointer?#

A Minimal Flow#

A Simplified Sketch#

Advantages of Hazard Pointers#

Costs of Hazard Pointers#

When They Fit Well#

What Is Epoch-Based Reclamation?#

Why It Works#

A Minimal Flow#

Advantages of EBR#

Costs of EBR#

When It Fits Well#

Choosing Between Hazard Pointers and EBR#

If You Care More About Precision#

If You Care More About a Lightweight Read Path#

A Simple Comparison#

Their Relationship With ABA#

Why Ring Buffers Often Do Not Need Them, While Linked Structures Often Do#

A Practical Test#

Putting Them Into Michael-Scott Queue#

Why MS Queue Cannot Directly delete oldHead in Dequeue#

A Timing Diagram#

Lock-Free Programming

Atomic Operations and CAS

The ABA Problem

Memory Ordering

Separate Three Things First: Atomicity, Visibility, and Ordering

1. Atomicity

2. Visibility

3. Ordering

Who Is Reordering Things?

Why the Compiler Reorders

Why the CPU Also Reorders

Correct Assembly Order Still Does Not Guarantee Cross-Core Observation Order

Why Acquire / Release Help

Why Cache Coherence Is Not Enough

The Most Important Line in the C++ Memory Model: Happens-Before

What Each memory_order Guarantees

`memory_order_relaxed`

`memory_order_release`

`memory_order_acquire`

`memory_order_acq_rel`

`memory_order_seq_cst`

Why the Publish-Subscribe Model Matters So Much

What Memory Orders Are Commonly Used in Queues?

1. Accessing an Index Owned by This Thread Can Often Be relaxed

2. Observing a Boundary Published by the Other Side Usually Needs acquire

3. Publishing a New Node, Element, or State Usually Needs release

4. Competing for a Shared Position While Both Reading and Writing Often Uses acq_rel

An Easy-to-Miss CAS Detail: Success and Failure Orders

A Practical Rule of Thumb

Looking at release/acquire on a Timeline

The Difference Between x86 and ARM

Why x86 Can Create a False Sense of Safety

Why ARM Exposes Bugs More Easily

The Most Important Engineering Conclusion

A Linked Lock-Free Queue

A Simple SPSC Linked Queue

Why This Version Is Relatively Easy to Get Right

Problems With the Linked Version

1. Each enqueue allocates a node

2. Pointer chasing has poor cache locality

3. Scaling to MPMC is hard

A Circular-Array Lock-Free Queue

A Simple SPSC Circular-Array Queue

This Version Is Still Only SPSC

From SPSC to MPSC / MPMC

Why You Cannot Just Change tail to fetch_add

A Common Engineering Approach: One sequence per Slot

A Simplified MPMC Ring Queue

Enqueue, Line by Line

Dequeue, Line by Line

Layer One: Reserve a Number First, Then Touch the Slot

Layer Two: sequence Distinguishes Empty, Full, and Generation

Why This Avoids Many Linked-List Problems

A Slot-Rotation Diagram Makes It Clearer

Understanding “Reservation” and “Publication” on a Timeline

Memory Reclamation, Hazard Pointers, and EBR

Why ABA and Memory Reclamation Keep Appearing Together

The Most Naive Method: Delayed Free

What Is a Hazard Pointer?

A Minimal Flow

A Simplified Sketch

Advantages of Hazard Pointers

Costs of Hazard Pointers

When They Fit Well

What Is Epoch-Based Reclamation?

Why It Works

A Minimal Flow

Advantages of EBR

Costs of EBR

When It Fits Well

Choosing Between Hazard Pointers and EBR

If You Care More About Precision

If You Care More About a Lightweight Read Path

A Simple Comparison

Their Relationship With ABA

Why Ring Buffers Often Do Not Need Them, While Linked Structures Often Do

A Practical Test

Putting Them Into Michael-Scott Queue

Why MS Queue Cannot Directly delete oldHead in Dequeue

A Timing Diagram