New Perf latency profiling
Picture this scenario: you’re writing Python code that, somewhere in the middle, calls into a library for ML inference. The program runs slowly, so you collect a profile using standard tools like perf, and end up seeing this picture:
- 90% matmul
- 10% some Python code
You look at this and get discouraged - matrix multiplication is probably so optimized that there’s no point even trying to improve it.
Fortunately, it’s not that simple. Perf collects events from all threads and CPUs, which get summed together. So if matrix multiplication happened to run on many cores, say 8, you’ll see matmul’s weight in the profile as 8 times larger than it actually is relative to wall-clock time.
Recently, perf added a --latency option that divides each event’s weight by the number of CPUs active at that moment. After this, you might see a profile like:
- 40% matmul
- 60% some Python code
Now optimizing the Python code makes sense for latency, because the profile shows that out of real wall-clock time, you spent 40% in matrix multiplication (possibly on many cores) and 60% in Python (likely on a single core). This is often useful for servers, build systems, and command-line tools where latency matters more than throughput.
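To make the arithmetic concrete, here is a small back-of-the-envelope calculation in Python. The per-phase timings and the 8-core figure are illustrative assumptions, not measurements:

```python
# Illustrative assumption: 1 second of wall-clock time, with matmul running
# on 8 cores for 0.4 s and Python code running on 1 core for 0.6 s.
matmul_wall, matmul_cores = 0.4, 8
python_wall, python_cores = 0.6, 1

matmul_cpu = matmul_wall * matmul_cores    # 3.2 CPU-seconds
python_cpu = python_wall * python_cores    # 0.6 CPU-seconds
total_cpu = matmul_cpu + python_cpu
total_wall = matmul_wall + python_wall

print(f"CPU-time profile:   matmul {matmul_cpu / total_cpu:.0%}, "
      f"Python {python_cpu / total_cpu:.0%}")     # ~84% vs ~16%
print(f"Wall-clock profile: matmul {matmul_wall / total_wall:.0%}, "
      f"Python {python_wall / total_wall:.0%}")   # 40% vs 60%
```

The CPU-time view inflates matmul’s share by roughly the number of cores it occupied; the wall-clock view is what --latency reports.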
The Wall-Clock vs CPU Time Dilemma #
There are two fundamental notions of time in computing: wall-clock time and CPU time. For a single-threaded program, or a program running on a single-core machine, the two are identical. For multi-threaded or multi-process programs on multi-core machines, however, they diverge significantly: each second of wall-clock time can supply up to number-of-cores seconds of CPU time.
Traditional profilers, including perf until now, only allowed profiling CPU time. This creates a fundamental mismatch when optimizing for latency rather than throughput:
- CPU profiling helps improve throughput - how much work gets done across all cores
- Latency profiling helps improve wall-clock time - how long users actually wait
Consider these use cases where latency profiling is essential:
- Optimizing build system latency
- Reducing server request latency
- Speeding up ML training/inference pipelines
- Improving command-line program response times
For these scenarios, CPU profiles are at best unhelpful and at worst misleading if you don’t understand the distinction.
How Latency Profiling Works #
The implementation, merged in Linux 6.15, is elegantly simple:
- Context switch collection: During perf record, the tool now tracks context switches
- Parallelism calculation: During perf report, it calculates the number of threads running on CPUs at each moment
- Weight adjustment: Each sample’s weight is divided by the parallelism level
This effectively models taking 1 sample per unit of wall-clock time, giving us a true picture of where our program spends its time from the user’s perspective.
Using the New Feature #
The new latency profiling is available with a simple flag:
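A minimal session might look like the sketch below. It follows the workflow described above, but option spellings can vary between perf versions, so check perf record --help and perf report --help on your system:

```sh
# Record with context-switch/scheduler tracking enabled for latency analysis
perf record --latency -- ./my_program

# Report with each sample's weight divided by the parallelism at that moment
perf report --latency
```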
The --latency option enables tracking of scheduler information and adjusts sample weights based on parallelism. This means code running in parallel gets its contribution reduced proportionally, while serial execution bottlenecks become more prominent.
Real-World Example: Python ML Inference #
Let’s revisit our opening example with actual commands:
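The exact commands from the original write-up are not preserved here, but a session along these lines matches the scenario (the script name inference.py is a placeholder):

```sh
# Record the Python inference workload with latency data collection
perf record --latency -- python inference.py

# Compare the two views
perf report            # CPU-time view: matmul/BLAS kernels dominate
perf report --latency  # wall-clock view: single-threaded Python dominates
```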
The latency profile reveals that while BLAS operations use many cores efficiently, the Python interpreter becomes the actual bottleneck for wall-clock time. This insight completely changes optimization priorities.
Limitations and Caveats #
The approach is not without limitations. If you have a multi-threaded server that hands specific work off to a thread pool, the latency math becomes more complex. The profiling works best when:
- Testing in isolation without system load
- Profiling specific processes rather than system-wide
- Understanding your application’s threading model
Currently, latency profiling is limited to process-level profiling - system-wide profiling is not yet supported. This makes sense given the complexity of attributing wall-clock time across multiple independent processes.
Implementation Details #
For the curious, here’s a simplified view of how perf calculates the adjusted weights:
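perf’s real implementation lives in C inside the perf tooling; the Python sketch below only models the idea from the three steps above, with a hypothetical sample list and parallelism timeline standing in for perf’s actual data structures:

```python
from bisect import bisect_right

# Hypothetical input reconstructed from context-switch events:
# (timestamp, number of threads of the profiled process on CPU)
parallelism_timeline = [(0.0, 1), (0.1, 8), (0.5, 1)]

# Hypothetical samples: (timestamp, symbol the CPU was executing)
samples = [(0.05, "python_eval"), (0.2, "matmul"),
           (0.3, "matmul"), (0.7, "python_eval")]

def parallelism_at(t):
    """Parallelism level in effect at time t, from the timeline."""
    times = [ts for ts, _ in parallelism_timeline]
    return parallelism_timeline[bisect_right(times, t) - 1][1]

# Latency weighting: a sample taken while N threads were running counts as 1/N.
weights = {}
for t, symbol in samples:
    weights[symbol] = weights.get(symbol, 0.0) + 1.0 / parallelism_at(t)

total = sum(weights.values())
for symbol, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{symbol:12s} {w / total:6.1%}")
```

In the real tool, the timeline comes from the context switches recorded during perf record, and the division happens at perf report time, as described above.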
The actual implementation tracks context switches and maintains a timeline of CPU activity, allowing it to accurately determine parallelism levels at any point during execution.
Practical Applications #
This feature particularly shines for:
- Build Systems: Understanding whether compilation, linking, or code generation is the bottleneck
- Web Servers: Identifying if request handling or backend calls dominate latency
- Data Processing: Distinguishing between I/O wait and computation time
- ML Workloads: Balancing preprocessing, inference, and postprocessing
The Future of Performance Analysis #
Dmitry Vyukov from Google, who spearheaded this feature, notes that this fills a critical gap in the performance tooling ecosystem. No other mainstream profiler previously offered true wall-clock time profiling for multi-threaded applications.
The feature also introduces a new “parallelism” key in perf reports, making it easier to understand how effectively your application uses available cores. This metric alone can guide architectural decisions about threading models and work distribution.
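If your perf build exposes this as a sort key (it was added alongside the latency work, but verify with perf report --help before relying on it), you can break the profile down by parallelism level, for example:

```sh
# Hypothetical example: group overhead by how many threads were on CPU
perf report --latency --sort=parallelism,symbol
```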
Conclusion #
The new --latency flag in perf fundamentally changes how we approach performance optimization. By exposing true wall-clock time bottlenecks, it helps developers focus on what actually makes users wait. Whether you’re optimizing build times, server latency, or ML pipelines, this tool provides insights that were previously invisible or misleading.
The days of staring at CPU profiles wondering why optimization efforts don’t improve user-perceived performance are over. With latency profiling, we can finally see our code the way users experience it - in real time.
Further Reading #
- Official perf documentation on CPU and latency overheads
- Original patch series discussion
- Linux 6.15 perf tools pull request
Remember: CPU profiles show how busy your cores are. Latency profiles show how long your users wait. Choose wisely.