New Perf latency profiling

Picture this scenario: you’re writing Python code that, somewhere in the middle, calls a library for ML inference. You notice the program runs slowly, so you collect a profile with a standard tool like perf and end up with this picture:

90% matmul
10% some Python code

You look at this and get discouraged - matrix multiplication is probably so optimized that there’s no point even trying to improve it.

Fortunately, it’s not that simple. Perf collects events from all threads and CPUs and sums them together. So if the matrix multiplication happens to run on many cores, say 8, matmul’s weight in the profile appears 8 times larger than its actual share of wall-clock time.

Recently, perf gained --latency, an option that divides each sample’s weight by the number of CPUs the profiled workload was actively running on at that moment. With it, you might see a profile like:

40% matmul
60% some Python code

Now optimizing the Python code makes sense for latency, because the profile shows that out of real wall-clock time, you spent 40% in matrix multiplication (possibly on many cores) and 60% in Python (likely on a single core). This is often useful for servers, build systems, and command-line tools where latency matters more than throughput.
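The correction itself is simple arithmetic. Here is a back-of-the-envelope sketch in Python, assuming the matmul phase runs on 8 cores and the Python phase on 1 (illustrative numbers only; the 40/60 split above implies matmul running on even more cores): divide each phase’s CPU-time share by its parallelism and renormalize to get wall-clock shares.

# Rough conversion of CPU-time shares into wall-clock shares,
# assuming matmul runs on 8 cores and the Python code on 1.
cpu_share = {"matmul": 0.90, "python": 0.10}
parallelism = {"matmul": 8, "python": 1}

# Divide each phase's CPU-time share by its parallelism, then renormalize.
wall = {name: share / parallelism[name] for name, share in cpu_share.items()}
total = sum(wall.values())
for name, value in wall.items():
    print(f"{name}: {value / total:.0%} of wall-clock time")
# matmul: 53% of wall-clock time
# python: 47% of wall-clock time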

The Wall-Clock vs CPU Time Dilemma #

There are two fundamental notions of time in computing: wall-clock time and CPU time. For a single-threaded program, or a program running on a single-core machine, the two are identical. However, for multi-threaded or multi-process programs running on multi-core machines, they diverge significantly: each second of wall-clock time can yield up to number-of-cores seconds of CPU time. A program that keeps 8 cores busy for 10 seconds of wall-clock time accumulates 80 seconds of CPU time.

Traditional profilers, including perf until now, only allowed profiling CPU time. This creates a fundamental mismatch when optimizing for latency rather than throughput:

  • CPU profiling helps improve throughput - how much work gets done across all cores
  • Latency profiling helps improve wall-clock time - how long users actually wait

Consider these use cases where latency profiling is essential:

  • Optimizing build system latency
  • Reducing server request latency
  • Speeding up ML training/inference pipelines
  • Improving command-line program response times

For these scenarios, CPU profiles are at best unhelpful and at worst actively misleading if you don’t understand the distinction.

How Latency Profiling Works #

The implementation, merged in Linux 6.15, is elegantly simple:

  1. Context switch collection: During perf record, the tool now tracks context switches
  2. Parallelism calculation: During perf report, it calculates the number of threads running on CPUs at each moment
  3. Weight adjustment: Each sample’s weight is divided by the parallelism level

This effectively models taking 1 sample per unit of wall-clock time, giving us a true picture of where our program spends its time from the user’s perspective.
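A toy simulation of these three steps helps make them concrete. The Python sketch below uses entirely invented scheduler events, timestamps, and symbols (the real logic lives inside perf’s own code): scheduler events produce a parallelism timeline, and each sample’s weight is divided by the parallelism in effect at its timestamp.

# Toy model of latency weighting; all event data here is made up.
from collections import defaultdict

# Scheduler events for the profiled process:
# (timestamp, +1 = a thread was scheduled in, -1 = scheduled out)
sched_events = [(0, +1), (10, +1), (10, +1), (10, +1),
                (40, -1), (40, -1), (40, -1), (100, -1)]

# Steps 1 and 2: build a timeline of (start_time, parallelism) segments.
timeline = []
running = 0
for ts, delta in sched_events:
    running += delta
    timeline.append((ts, running))

def parallelism_at(ts):
    # Parallelism of the last segment starting at or before ts
    level = 1
    for start, segment_level in timeline:
        if start <= ts:
            level = segment_level
    return max(level, 1)

# Samples as perf would collect them: 4 CPUs sampled during the parallel
# phase, 1 CPU otherwise, so matmul dominates the raw sample counts.
samples = ([(ts, "matmul") for ts in range(12, 38, 3)] +
           [(ts, "python") for ts in range(45, 100, 10)])

cpu_profile, latency_profile = defaultdict(float), defaultdict(float)
for ts, symbol in samples:
    cpu_profile[symbol] += 1.0
    latency_profile[symbol] += 1.0 / parallelism_at(ts)  # step 3: weight adjustment

for name in ("matmul", "python"):
    cpu = cpu_profile[name] / sum(cpu_profile.values())
    wall = latency_profile[name] / sum(latency_profile.values())
    print(f"{name}: {cpu:.0%} of CPU time, {wall:.0%} of wall-clock time")
# matmul: 60% of CPU time, 27% of wall-clock time
# python: 40% of CPU time, 73% of wall-clock time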

Using the New Feature #

The new latency profiling is available with a simple flag:

# Traditional CPU profiling (default)
perf record ./my_program
perf report

# New latency profiling
perf record --latency ./my_program
perf report

The --latency option enables tracking of scheduler information and adjusts sample weights based on parallelism. This means code running in parallel gets its contribution reduced proportionally, while serial execution bottlenecks become more prominent.

Real-World Example: Python ML Inference #

Let’s revisit our opening example with actual commands:

# Traditional profiling might show:
$ perf record python inference.py
$ perf report
# 90.2% libopenblas.so  [.] dgemm_kernel
#  9.8% python          [.] PyEval_EvalFrameEx

# With latency profiling:
$ perf record --latency python inference.py  
$ perf report
# 41.3% libopenblas.so  [.] dgemm_kernel
# 58.7% python          [.] PyEval_EvalFrameEx

The latency profile reveals that while BLAS operations use many cores efficiently, the Python interpreter becomes the actual bottleneck for wall-clock time. This insight completely changes optimization priorities.
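If you want to reproduce this pattern yourself, a minimal stand-in for inference.py might look like the sketch below (the script, array sizes, and iteration counts are invented; any NumPy-backed model shows the same shape): the matmul is handed to a multi-threaded BLAS library, while the surrounding preprocessing loop is plain single-threaded Python.

# Hypothetical inference.py: multi-threaded BLAS matmul wrapped in
# single-threaded Python preprocessing.
import numpy as np

def preprocess(row):
    # Pure-Python feature munging: runs on one core under the GIL
    return [float(x) * 0.5 + 1.0 for x in row]

def main():
    weights = np.random.rand(2048, 2048)
    batch = np.random.rand(256, 2048)
    for _ in range(50):
        prepared = np.array([preprocess(row) for row in batch])  # Python-heavy
        _ = prepared @ weights  # matmul, typically spread across many cores

if __name__ == "__main__":
    main()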

Limitations and Caveats #

The approach is not without limitations. If you have a multi-threaded server with a thread pool handling specific work, the latency math becomes more complex. Latency profiling works best when:

  • Testing in isolation without system load
  • Profiling specific processes rather than system-wide
  • Understanding your application’s threading model

Currently, latency profiling is limited to process-level profiling - system-wide profiling is not yet supported. This makes sense given the complexity of attributing wall-clock time across multiple independent processes.

Implementation Details #

For the curious, here’s a simplified view of how perf calculates the adjusted weights:

// Conceptual sketch, not the actual perf source
#include <stdint.h>

struct sample {
    uint64_t timestamp;
    uint64_t period;  // Original sample weight
    uint32_t cpu;
};

// Hypothetical helper: parallelism of the workload at a given time,
// reconstructed from the recorded context-switch events
uint32_t count_active_cpus_at(uint64_t timestamp);

// During the report phase: scale each sample by the parallelism at its timestamp
uint64_t adjusted_weight(const struct sample *s)
{
    uint32_t parallelism = count_active_cpus_at(s->timestamp);
    return s->period / parallelism;
}

The actual implementation tracks context switches and maintains a timeline of CPU activity, allowing it to accurately determine parallelism levels at any point during execution.
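As a rough illustration of what such a timeline lookup can look like (a sketch only; perf’s actual data structures differ), one can keep the segment start times sorted and binary-search them for each sample:

import bisect

# Timeline built from context-switch events: parallel lists of segment
# start times and the parallelism level in effect from that time onward.
start_times = [0, 10, 40, 100]
levels = [1, 4, 1, 0]

def parallelism_at(ts):
    # Index of the last segment starting at or before ts
    i = bisect.bisect_right(start_times, ts) - 1
    return max(levels[i], 1)

print(parallelism_at(25))  # 4: falls inside the parallel phase
print(parallelism_at(60))  # 1: falls inside single-threaded execution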

Practical Applications #

This feature particularly shines for:

  1. Build Systems: Understanding whether compilation, linking, or code generation is the bottleneck
  2. Web Servers: Identifying if request handling or backend calls dominate latency
  3. Data Processing: Distinguishing between I/O wait and computation time
  4. ML Workloads: Balancing preprocessing, inference, and postprocessing

The Future of Performance Analysis #

Dmitry Vyukov from Google, who spearheaded this feature, notes that this fills a critical gap in the performance tooling ecosystem. No other mainstream profiler previously offered true wall-clock time profiling for multi-threaded applications.

The feature also introduces a new “parallelism” key in perf reports, making it easier to understand how effectively your application uses available cores. This metric alone can guide architectural decisions about threading models and work distribution.
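Conceptually, that parallelism key boils down to a distribution of wall-clock time over parallelism levels. The sketch below illustrates the idea with the same made-up timeline as before (this is not perf’s actual output format):

# Toy parallelism histogram: share of wall-clock time spent at each
# parallelism level, derived from (start, end, parallelism) segments.
segments = [(0, 10, 1), (10, 40, 4), (40, 100, 1)]

total = sum(end - start for start, end, _ in segments)
histogram = {}
for start, end, level in segments:
    histogram[level] = histogram.get(level, 0) + (end - start) / total

for level in sorted(histogram, reverse=True):
    print(f"parallelism {level}: {histogram[level]:.0%} of wall-clock time")
# parallelism 4: 30% of wall-clock time
# parallelism 1: 70% of wall-clock time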

Conclusion #

The new --latency flag in perf fundamentally changes how we approach performance optimization. By exposing true wall-clock time bottlenecks, it helps developers focus on what actually makes users wait. Whether you’re optimizing build times, server latency, or ML pipelines, this tool provides insights that were previously invisible or misleading.

The days of staring at CPU profiles wondering why optimization efforts don’t improve user-perceived performance are over. With latency profiling, we can finally see our code the way users experience it - in real time.

Remember: CPU profiles show how busy your cores are. Latency profiles show how long your users wait. Choose wisely.