C++ · Systems · Performance · Concurrency

Multithreading in C++: std::thread, Mutexes, and the Cost of Synchronization

April 18, 2026 · 11 min read

Why Multithreading?

Modern hardware has stopped getting faster in single-threaded terms. Clock speeds plateaued around 2005. What we got instead was more cores. Writing software that uses only one of them leaves performance on the table, sometimes by an order of magnitude.

C++11 brought a portable threading model to the standard library. Before then, you were writing pthreads on POSIX or Windows threads directly, and portability was a mess. Now there's std::thread, std::mutex, std::condition_variable, and std::atomic — and they compose well.

std::thread Basics

Spawning a thread is straightforward:

#include <thread>
#include <iostream>

void worker(int id) {
    std::cout << "Thread " << id << " running\n";
}

int main() {
    std::thread t1(worker, 1);
    std::thread t2(worker, 2);

    t1.join();
    t2.join();
    return 0;
}

join() blocks until the thread finishes. detach() lets it run independently — but then you lose the ability to join it, so you must ensure the thread's lifetime doesn't outlast its data.
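To see why that matters, here is a minimal sketch of the hazard (risky and the sleep durations are invented for illustration): a detached thread that outlives the stack frame it captured.

#include <chrono>
#include <iostream>
#include <thread>

void risky() {
    int local = 42;
    std::thread t([&local] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        std::cout << local << "\n";  // reads `local` after risky() has returned
    });
    t.detach();
}  // `local` is destroyed here; the detached thread now reads a dead stack frame

int main() {
    risky();
    // Keep the process alive long enough for the bug to fire.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    return 0;
}

Passing data into the detached thread by value, instead of capturing by reference, avoids this class of bug entirely.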

Data Races and Mutexes

The moment two threads touch the same memory and at least one is writing, you have a data race. The behavior is undefined in the C++ memory model — the compiler is free to reorder your code in ways that assume no concurrent access.

#include <mutex>
#include <thread>

int counter = 0;
std::mutex mtx;

void increment() {
    for (int i = 0; i < 100000; i++) {
        std::lock_guard<std::mutex> lock(mtx);
        counter++;
    }
}

int main() {
    std::thread t1(increment);
    std::thread t2(increment);
    t1.join();
    t2.join();
    // counter == 200000, always
    return 0;
}

std::lock_guard uses RAII to lock on construction and unlock on destruction. No manual unlock, no exception leaks.
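To make the exception point concrete, here is a minimal sketch (push_checked and its validation rule are invented for illustration): the mutex is released during stack unwinding even though no unlock is ever written.

#include <mutex>
#include <stdexcept>
#include <vector>

std::vector<int> shared_data;
std::mutex data_mtx;

void push_checked(int value) {
    std::lock_guard<std::mutex> lock(data_mtx);
    if (value < 0)
        throw std::invalid_argument("negative value");
    // If the throw fires, `lock`'s destructor still runs during stack
    // unwinding and releases data_mtx -- the next caller doesn't deadlock.
    shared_data.push_back(value);
}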

Condition Variables

Mutexes protect shared state. Condition variables let threads wait for state to change:

#include <condition_variable>
#include <mutex>
#include <queue>

#include <condition_variable>
#include <mutex>
#include <queue>

std::queue<int> work_queue;
std::mutex mu;
std::condition_variable cv;

void producer() {
    for (int i = 0; i < 10; i++) {
        {
            std::lock_guard<std::mutex> lock(mu);
            work_queue.push(i);
        }
        // Notify after releasing the lock so the woken consumer
        // doesn't immediately block trying to acquire mu.
        cv.notify_one();
    }
}

void consumer() {
    while (true) {
        std::unique_lock<std::mutex> lock(mu);
        cv.wait(lock, [] { return !work_queue.empty(); });
        int item = work_queue.front();
        work_queue.pop();
        lock.unlock();
        // process item
    }
}

The lambda in cv.wait() guards against spurious wakeups — condition variables can wake for no reason, so always re-check the condition.
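The predicate overload is shorthand for a manual re-check loop. A consumer written without it (same globals as above) behaves identically:

void consumer_manual() {
    while (true) {
        std::unique_lock<std::mutex> lock(mu);
        // cv.wait(lock, pred) expands to exactly this: sleep, wake
        // (possibly spuriously), re-check the condition, repeat.
        while (work_queue.empty()) {
            cv.wait(lock);  // atomically releases mu; re-acquires on wakeup
        }
        int item = work_queue.front();
        work_queue.pop();
        lock.unlock();
        // process item
    }
}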

The Cost of Synchronization

Locking is not free. A mutex acquisition involves an atomic compare-and-swap, a potential OS context switch if the lock is contended, and memory fence instructions that prevent hardware reordering. On modern x86, an uncontended std::mutex lock/unlock costs around 20–50 nanoseconds. Contended locks can cost microseconds.
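If you want a rough number for your own machine, a single-threaded loop gives the uncontended cost (a minimal sketch; results depend on CPU, OS, and standard library, and this measures the whole loop, not the mutex in isolation):

#include <chrono>
#include <iostream>
#include <mutex>

int main() {
    std::mutex m;
    const int kIters = 10000000;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; i++) {
        m.lock();    // never contended: only one thread exists
        m.unlock();
    }
    auto end = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "avg lock+unlock: " << ns / static_cast<double>(kIters) << " ns\n";
    return 0;
}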

The practical implication: coarse-grained locking is often faster than fine-grained. A single mutex protecting a large block runs fewer lock operations than many mutexes each protecting tiny sections. Profile before optimizing.
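Concretely, compare the two shapes below (Stats and the batch update are invented for illustration). Both are correct; the coarse version pays for one lock round-trip per batch instead of one per element, at the cost of holding the lock longer.

#include <mutex>
#include <vector>

struct Stats {
    std::mutex mtx;
    long sum = 0;
    long count = 0;
};

// Fine-grained: one lock/unlock per element -- N atomic round-trips.
void add_all_fine(Stats& s, const std::vector<int>& values) {
    for (int v : values) {
        std::lock_guard<std::mutex> lock(s.mtx);
        s.sum += v;
        s.count++;
    }
}

// Coarse-grained: one lock/unlock for the whole batch.
void add_all_coarse(Stats& s, const std::vector<int>& values) {
    std::lock_guard<std::mutex> lock(s.mtx);
    for (int v : values) {
        s.sum += v;
        s.count++;
    }
}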

When Not to Use Threads

More threads doesn't mean more speed:

- I/O-bound work: async I/O (epoll, io_uring) often outperforms thread-per-connection models
- Fine-grained tasks: thread creation overhead (~100µs) dominates if tasks are sub-millisecond
- Shared, frequently-mutated state: lock contention can make threaded code slower than single-threaded

The classic mistake is adding threads to make code "faster" without understanding where time is actually spent.

Conclusion

C++ threading is powerful but demands discipline. Data races are silent and non-deterministic; they may not manifest under light load or on your machine, only surfacing under production conditions. Use tools: ThreadSanitizer (-fsanitize=thread) catches data races at runtime and requires nothing more than a recompile. Write threaded code assuming the worst about execution order, and protect shared state with std::atomic or mutexes everywhere it is touched.
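For instance, the counter example from earlier with the mutex removed is exactly the kind of bug ThreadSanitizer reports (the build line is in the comment; the report format varies by toolchain):

// race.cpp -- build with: g++ -fsanitize=thread -g race.cpp -o race
#include <thread>

int counter = 0;  // shared and unprotected

void increment() {
    for (int i = 0; i < 100000; i++)
        counter++;  // TSan flags a data race on this line
}

int main() {
    std::thread t1(increment);
    std::thread t2(increment);
    t1.join();
    t2.join();
    return 0;
}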