Basics of Futexes: Fast Userspace Mutexes in Linux - Deep Dive into Low-Level Synchronization

After years of optimizing concurrent systems at Semrush, I’ve come to appreciate the elegance of Linux’s futex mechanism. While most developers never directly interact with futexes (and that’s perfectly fine), understanding them gives you deeper insight into how higher-level synchronization primitives actually work.

The futex (short for “Fast userspace mutex”) was proposed by IBM contributors in 2002 and integrated into the Linux kernel in late 2003. It’s the secret sauce that makes modern Linux threading efficient, powering everything from pthread mutexes to Go’s sync.Mutex.

Important disclaimer: Futexes are extremely low-level. Unless you’re writing a runtime, standard library, or specialized system software, you’ll never need to use them directly. But knowing how they work? That’s valuable knowledge.

The Motivation Behind Futexes #

Before futexes, every lock operation required a system call (like semop). System calls are expensive - they require context switching from userspace to kernel space. As applications became more concurrent, locking started showing up as a significant bottleneck in performance profiles. Frustrating, considering locks don’t do any actual “work” - they’re just overhead for safety.

The key insight behind futexes is brilliant: most locks are actually uncontended. When a thread encounters a free lock, it’s unlikely another thread is trying to grab it at that exact moment. So why pay for a system call every time?

The solution: try cheap atomic operations first. Only if there’s actual contention do we involve the kernel. This gives us:

  • Fast path: Uncontended case uses only atomic operations (nanoseconds)
  • Slow path: Contended case uses kernel assistance to sleep/wake threads (microseconds)

In my experience profiling production systems, the fast path handles 95%+ of lock acquisitions. That’s a massive win.
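
To make the two paths concrete, here is a minimal sketch of the idea (my own illustration, not code from any real library; it uses the FUTEX_WAIT/FUTEX_WAKE operations explained in the next section, and a properly tuned three-state version is built later in this post):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

std::atomic<int> lock_word{0};  // 0 = free, 1 = held (no waiter bookkeeping yet)

void acquire() {
  int expected = 0;
  // Fast path: a single CAS in userspace, no syscall at all.
  while (!lock_word.compare_exchange_strong(expected, 1)) {
    // Slow path: ask the kernel to put us to sleep while the word is still 1.
    syscall(SYS_futex, (int*)&lock_word, FUTEX_WAIT, 1, 0, 0, 0);
    expected = 0;
  }
}

void release() {
  lock_word.store(0);
  // Naive: always wake one potential sleeper. The three-state mutex shown
  // later skips this syscall when nobody is waiting.
  syscall(SYS_futex, (int*)&lock_word, FUTEX_WAKE, 1, 0, 0, 0);
}

Even this naive version keeps the uncontended case syscall-free; the refinement the rest of this post builds toward is also avoiding the FUTEX_WAKE call when no one is asleep.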

How Futexes Work: Wait and Wake #

The futex(2) system call multiplexes many operations, but we’ll focus on the essentials: FUTEX_WAIT and FUTEX_WAKE.

From the man page:

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space.

Simply put: futexes let userspace code ask the kernel to suspend execution until signaled. The waiting threads don’t burn CPU cycles - they’re properly suspended by the scheduler.

Let’s see a practical example - a handshake between two processes using shared memory:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

// Defined below; declared here so the snippet compiles on its own.
void wait_on_futex_value(int* futex_addr, int val);
void wake_futex_blocking(int* futex_addr);

int main(int argc, char** argv) {
  int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
  if (shm_id < 0) {
    perror("shmget");
    exit(1);
  }
  int* shared_data = shmat(shm_id, NULL, 0);
  if (shared_data == (void*)-1) {
    perror("shmat");
    exit(1);
  }
  *shared_data = 0;

  int forkstatus = fork();
  if (forkstatus < 0) {
    perror("fork");
    exit(1);
  }

  if (forkstatus == 0) {
    // Child process
    printf("child waiting for A\n");
    wait_on_futex_value(shared_data, 0xA);

    printf("child writing B\n");
    // Write 0xB to the shared data and wake up parent.
    *shared_data = 0xB;
    wake_futex_blocking(shared_data);
  } else {
    // Parent process
    printf("parent writing A\n");
    // Write 0xA to the shared data and wake up child.
    *shared_data = 0xA;
    wake_futex_blocking(shared_data);

    printf("parent waiting for B\n");
    wait_on_futex_value(shared_data, 0xB);

    // Wait for the child to terminate
    wait(NULL);
    shmdt(shared_data);
  }

  return 0;
}

The helper functions handle the futex syscall details:

#include <errno.h>
#include <linux/futex.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// glibc does not provide a futex() wrapper, so we call the raw syscall.
static int futex(int* uaddr, int futex_op, int val,
                 const struct timespec* timeout, int* uaddr2, int val3) {
  return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

void wait_on_futex_value(int* futex_addr, int val) {
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAIT, val, NULL, NULL, 0);
    if (futex_rc == -1) {
      if (errno != EAGAIN) {
        perror("futex");
        exit(1);
      }
    } else if (futex_rc == 0) {
      if (*futex_addr == val) {
        // This is a real wakeup
        return;
      }
    } else {
      abort();
    }
  }
}

// Wake one waiter; retry until the kernel reports that somebody was woken.
void wake_futex_blocking(int* futex_addr) {
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    if (futex_rc == -1) {
      perror("futex wake");
      exit(1);
    } else if (futex_rc > 0) {
      return;
    }
  }
}

Key gotcha: FUTEX_WAIT returns immediately (with errno set to EAGAIN) if the value at futex_addr doesn’t match val. That check closes the race between testing the value in userspace and going to sleep, but it makes the semantics tricky. I’ve debugged many issues where developers didn’t account for this behavior.

Futexes as Kernel-Managed Queues #

At its core, a futex is a queue the kernel manages for userspace. Instead of busy-waiting (burning CPU), threads can sleep efficiently until woken.

Here’s the architecture from LWN’s “A futex overview and update”:

Futex implementation diagram from LWN

The kernel maintains a hash table keyed by memory addresses, mapping to wait queues. When a thread calls FUTEX_WAIT, it’s added to the appropriate queue and suspended. FUTEX_WAKE removes threads from the queue and makes them runnable again.
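
To get a feel for that structure, here is a heavily simplified userspace model of FUTEX_WAIT/FUTEX_WAKE (my own illustration - the real kernel implementation shares hash buckets between unrelated addresses and is considerably more involved, and it obviously doesn't use pthread-style primitives):

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <unordered_map>

// One bucket per futex address: a lock plus a wait queue. The condition
// variable stands in for the kernel's list of sleeping tasks.
struct FutexBucket {
  std::mutex lock;
  std::condition_variable waiters;
};

std::mutex table_lock;
std::unordered_map<const void*, FutexBucket> futex_table;

FutexBucket& bucket_for(const void* addr) {
  std::lock_guard<std::mutex> guard(table_lock);
  return futex_table[addr];  // element references stay valid across rehashes
}

// FUTEX_WAIT: sleep only if *addr still equals `expected`. The check happens
// under the bucket lock, which is what makes check-and-sleep atomic with
// respect to concurrent wakers.
void model_futex_wait(std::atomic<int>* addr, int expected) {
  FutexBucket& b = bucket_for(addr);
  std::unique_lock<std::mutex> guard(b.lock);
  if (addr->load() != expected) return;  // the real syscall fails with EAGAIN here
  b.waiters.wait(guard);                 // suspended until someone calls wake
}

// FUTEX_WAKE with val=1: make one sleeping waiter runnable again.
void model_futex_wake(std::atomic<int>* addr) {
  FutexBucket& b = bucket_for(addr);
  std::lock_guard<std::mutex> guard(b.lock);
  b.waiters.notify_one();
}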

Timed Waiting with Futexes #

Production systems often need timeouts. The futex syscall supports this via the timeout parameter:

printf("child waiting for A\n");
struct timespec timeout = {.tv_sec = 0, .tv_nsec = 480000000};  // 480ms
while (1) {
  unsigned long long t1 = time_ns();
  int futex_rc = futex(shared_data, FUTEX_WAIT, 0xA, &timeout, NULL, 0);
  printf("child woken up rc=%d errno=%s, elapsed=%llu\n", futex_rc,
         futex_rc ? strerror(errno) : "", time_ns() - t1);
  if (futex_rc == 0 && *shared_data == 0xA) {
    break;
  }
}

If the wait exceeds 480ms, the process wakes up with ETIMEDOUT and can decide whether to retry or handle the timeout condition.
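
The snippet assumes a time_ns() helper that isn’t shown; a plausible implementation (my guess - the original could differ) reads the monotonic clock:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Monotonic clock reading in nanoseconds, used above to measure how long the
// FUTEX_WAIT call actually blocked.
unsigned long long time_ns(void) {
  struct timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
    perror("clock_gettime");
    exit(1);
  }
  return (unsigned long long)ts.tv_sec * 1000000000ull + (unsigned long long)ts.tv_nsec;
}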

Implementing a Mutex with Futexes #

Now for the interesting part - building a real mutex using futexes and atomics. This implementation is based on Ulrich Drepper’s excellent “Futexes are Tricky” paper:

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Helper for atomic compare-exchange: returns the value of *atom observed
// before the operation, like the classic cmpxchg instruction.
int cmpxchg(std::atomic<int>* atom, int expected, int desired) {
  int* ep = &expected;
  std::atomic_compare_exchange_strong(atom, ep, desired);
  return *ep;
}

class Mutex {
public:
  Mutex() : atom_(0) {}

  void lock() {
    int c = cmpxchg(&atom_, 0, 1);
    // If the lock was previously unlocked, there's nothing else for us to do.
    // Otherwise, we'll probably have to wait.
    if (c != 0) {
      do {
        // If the mutex is locked, we signal that we're waiting by setting the
        // atom to 2. A shortcut checks if it's 2 already and avoids the atomic
        // operation in this case.
        if (c == 2 || cmpxchg(&atom_, 1, 2) != 0) {
          // Here we have to actually sleep, because the mutex is actually
          // locked. Note that it's not necessary to loop around this syscall;
          // a spurious wakeup will do no harm since we only exit the do...while
          // loop when atom_ is indeed 0.
          syscall(SYS_futex, (int*)&atom_, FUTEX_WAIT, 2, 0, 0, 0);
        }
        // We're here when either:
        // (a) the mutex was in fact unlocked (by an intervening thread).
        // (b) we slept waiting for the atom and were awoken.
        //
        // So we try to lock the atom again. We set the state to 2 because we
        // can't be certain there's no other thread at this exact point. So we
        // prefer to err on the safe side.
      } while ((c = cmpxchg(&atom_, 0, 2)) != 0);
    }
  }

  void unlock() {
    if (atom_.fetch_sub(1) != 1) {
      atom_.store(0);
      syscall(SYS_futex, (int*)&atom_, FUTEX_WAKE, 1, 0, 0, 0);
    }
  }

private:
  // 0 means unlocked
  // 1 means locked, no waiters
  // 2 means locked, there are waiters in lock()
  std::atomic<int> atom_;
};

The three-state design (0/1/2) is clever:

  • 0: Unlocked - fast path via atomic CAS
  • 1: Locked, no waiters - still fast unlock
  • 2: Locked with waiters - need futex wake on unlock

This is essentially how glibc implements pthread mutexes, though with more complexity for different mutex types.
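
Here’s a quick smoke test I use for implementations like the one above (the harness is mine, not part of Drepper’s paper or the original code): several threads hammering a shared counter through the futex-based Mutex.

#include <cstdio>
#include <thread>
#include <vector>

Mutex counter_mutex;
int counter = 0;  // a plain int on purpose: the Mutex is what keeps it consistent

int main() {
  const int kThreads = 4;
  const int kItersPerThread = 100000;

  std::vector<std::thread> threads;
  for (int i = 0; i < kThreads; ++i) {
    threads.emplace_back([&] {
      for (int j = 0; j < kItersPerThread; ++j) {
        counter_mutex.lock();
        ++counter;
        counter_mutex.unlock();
      }
    });
  }
  for (auto& t : threads) {
    t.join();
  }

  printf("counter = %d (expected %d)\n", counter, kThreads * kItersPerThread);
  return 0;
}

If lock()/unlock() are correct, the final count always matches; drop the lock calls and the lost updates show up immediately.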

Real-World Implementations #

glibc’s pthread_mutex #

The glibc implementation lives in nptl/pthread_mutex_lock.c. Despite the added support for various mutex types (recursive, error-checking, etc.), the core pattern is recognizable. Look for the lll_ (low-level lock) prefixes throughout the codebase.

From sysdeps/nptl/lowlevellock.h:

/* Low-level locks use a combination of atomic operations (to acquire and
   release lock ownership) and futex operations (to block until the state
   of a lock changes).  A lock can be in one of three states:
   0:  not acquired,
   1:  acquired with no waiters; no other threads are blocked or about to block
       for changes to the lock state,
   >1: acquired, possibly with waiters; there may be other threads blocked or
       about to block for changes to the lock state.

   We expect that the common case is an uncontended lock, so we just need
   to transition the lock between states 0 and 1; releasing the lock does
   not need to wake any other blocked threads...
 */

Go Runtime’s sync.Mutex #

Go doesn’t link against libc on Linux, so the runtime implements its own futex-based locks, which sync.Mutex ultimately relies on to park goroutine threads. The implementation in src/runtime/lock_futex.go defines familiar constants:

const (
  mutex_unlocked = 0
  mutex_locked   = 1
  mutex_sleeping = 2
  ...
)

// Possible lock states are mutex_unlocked, mutex_locked and mutex_sleeping.
// mutex_sleeping means that there is presumably at least one sleeping thread.

The runtime calls the futex syscall directly via futexsleep in src/runtime/os_linux.go, using FUTEX_WAIT_PRIVATE (sufficient for single-process Go programs).

Practical Lessons #

After working with futex-based systems for years, here are my key takeaways:

  1. Futexes are everywhere - Every pthread mutex, Go channel, and Rust lock uses them under the hood on Linux
  2. The fast path is critical - Optimize for uncontended cases; they dominate in real workloads
  3. Debugging is hard - Futex bugs often manifest as hangs or rare race conditions. Always use higher-level primitives unless absolutely necessary
  4. Performance wins are real - The userspace fast path makes modern Linux synchronization incredibly efficient

At Semrush, we’ve built systems handling millions of requests per second, and the efficiency of futex-based synchronization is a key enabler. But we never use futexes directly - pthread mutexes and Go’s sync package provide all we need with better safety guarantees.

Conclusion #

Futexes represent one of those beautiful systems programming ideas: solve the common case (uncontended locking) in userspace, involve the kernel only when necessary (contention). This hybrid approach gives us the best of both worlds - efficiency and correctness.

While you’ll likely never write futex() directly, understanding this mechanism deepens your knowledge of how modern concurrent systems actually work. Next time you use a mutex or channel, you’ll know there’s a carefully orchestrated dance between userspace atomics and kernel queues making it all possible.


Have you encountered futex-related issues in production? Or implemented custom synchronization primitives? I’d love to hear about your experiences with low-level Linux concurrency.