NYTProf v5 – Flaming Precision

As soon as I saw a Flame Graph visualization I knew it would make a great addition to NYTProf. So I’m delighted that the new Devel::NYTProf version 5.00, just released, has a Flame Graph as the main feature of the index page.


In this post I’ll explain the Flame Graph visualization, the new ‘subroutine calls event stream’ that makes the Flame Graph possible, and other recent changes, including improved precision in the subroutine profiler.


Let’s start with the improved precision. That work was actually released a few months ago in Devel::NYTProf 4.23 but not announced.

Devel::NYTProf started life as a line/statement profiler, writing a stream of events, one per statement. It’s important for speed that the stream is space efficient, so statement times were expressed as integer microseconds (a ‘tick’) and written in a compressed form. Values less than 128µs use a single byte. This worked very well for v1. Back in early 2008 minimum statement times were typically just a few microseconds.

When I added the subroutine profiler I chose to use double precision floating point values to hold the subroutine call times with seconds as the units. I presume that seemed reasonable at the time as microseconds (multiples of 1e-6) can be accurately stored double precision floating point values and are significantly above the typical machine epsilon of 2.220446e-16.

I’d assumed the values weren’t at risk from the pernicious effect of cumulative round-off errors. The situation got worse with NYTProf v2 because that switched the clock ‘tick’ from 1µs to 100ns on some systems (those with POSIX realtime clock API and OS X). And then worse again when profiling of ‘slowops’ was added in NYTProf v3 since slowops are often far from slow.

$ perl -we '$n=10_000_000; $t=0.0; $i=3/$n; $t+=$i while $n--; print "$t\n";'

The way the subroutine profiler works, calculating inclusive and exclusive times as it goes, makes it sensitive to these accumulated errors. (Sometimes a subroutine that did nothing but call a very fast subroutine many times could be reported as having taken less time than the sum of the times in the subroutine it called.)

The subroutine profiler still uses double precision floating point values to accumulate the times, but now accumulates integer ticks instead of fractional seconds.

$ perl -we '$n=10_000_000; $t=0.0; $i=3.0; $t+=$i while $n--; $t/=10_000_000; print "$t\n";'

(The $t=0.0 and $i=3.0 ensure perl is using floating point values in that example. I checked it with Devel::Peek.)

Subroutine Call Events

There’s one thing the old and deeply flawed Devel::DProf profiler can do that NYTProf hasn’t been able to: the DProf dprofpp utility can generate a subroutine call tree.

NYTProf hasn’t been able to do that because its subroutine profiler worked entirely in memory, accumulating aggregate data about each call arc, but not outputting anything until the end of the profile. So all the calls on any given arc are merged together.

NYTProf v5 adds a new calls option that enables streaming of subroutine call events as they happen. With calls=2 subroutine call and return events are generated. With calls=1 (the default) only subroutine return events are generated. (A curious side effect of perl internals and the way NYTProf works means it can’t reliably know the name of the subroutine at call entry time. So the call entry event isn’t very useful at the moment.)

The call return events are sufficient to recreate a call tree, albeit with some expensive massaging of the data. NYTProf does this with the new nytprofcalls utility which reads and processes the stream of call return events. At the moment it’s undocumented, rather hackish, and only generates the call data in a collapsed form suitable for generating a flamegraph (more below). It could be extended to produce a call tree without too much work. Then, finally, the ghost of Devel::DProf can be laid to rest.

Flame Graph

Brendan Gregg developed the Flame Graph as a way to visualize very large volumes of stack traces sampled by DTrace.

It’s a wonderfully compact and information-rich way to visualize the where a program is spending its time. It’s also unusual and potentially confusing, so a little explanation is required. Keep in mind that it’s a visualization of distinct call stacks and that the colors are not meaningful.

The y-axis represents stack depth. Each box represents the spent time in a particular subroutine when called by the subroutine below it. So a particular subroutine will appear in multiple places if called via different call stacks.

The x-axis spans the time the profiler was running. It does not show the passing of time from left to right, as most graphs do. The left to right ordering has no meaning (it’s sorted alphabetically).

The width of the box shows the inclusive time the subroutine was running, or part of the ancestry of subroutines that were running (the boxes above it). Wider box functions may be slower than narrow box functions, or they may simply be called more often. The call count is not shown.

Brendan’s original flamegraph script generated an SVG that wasn’t well suited to embedding in an application like NYTProf. He’s kindly accepted a series of pull requests to add the key features I was looking for. The most important being the ability to make the boxes clickable: click on a box and you’ll be taken to the report for that subroutine!

Let’s take a closer look at a simple example using a recursive Fibonacci function:

sub fib {
    my $n = shift;
    return $n if $n < 2;
    fib($n-1) + fib($n-2);
sub foo { fib(8) }
sub bar { fib(8) }

That gives us a Flame Graph like this:


The line at the bottom that spans the full width represents the entire profile run. In this case it was 778µs. (Hover over any block to see the time – you can see one in the image, along with the bold and bordered box it relates to).

The first line above that shows the calls to foo and bar. The line for those is shorter than the total line because the total includes the time perl spent compiling the script. It shows up clearly here because this script is so fast.

Then, above the blocks for both foo and bar, you can see the recursive calls to fib rising like flames (okay, with a little imagination). Two things to note here. Firstly bar is shown to the left of foo simply because the names at each level are in lexicographic order. There’s no deeper meaning in the ordering.

Secondly, you can easily see that bar was faster (narrower) than foo, even though they contain the same code. Why’s that? When foo ran first it would have paid the price for growing the stacks and warming the memory pages. Then when bar was called it gained from foo‘s work.

Flame Graph Generator

Behind the scenes nytprofhtml runs nytprofcalls to generate a file in the report directory called all_stacks_by_time.calls. It then calls flamegraph.pl to read that file and generate the all_stacks_by_time.svg that’s shown in the report.

The all_stacks_by_time.calls has a very simple format. One line per distinct call stack, with subroutine names separated by semicolons, followed by a number (which is either in 1µs or 100ns units depending on the platform). Here’s an example running the code above but calling fib(2) instead of fib(8) to keep it small:

main::bar 37
main::bar;main::fib 45
main::bar;main::fib;main::fib 19
main::foo 416
main::foo;main::fib 222
main::foo;main::fib;main::fib 61

This simple format is perfect for grep’ing! You can effectively zoom-in on any subset of the call stacks by generating a flamegraph of just the stacks that contain the functions you’re interested in. For example, running this command on the profile of perlcritic shown at the top:

grep -w Perl::Critic::Policy::new nytprof/all_stacks_by_time.calls | flamegraph.pl > tmp.svg && open tmp.svg

gives you this Flame Graph:


You can see that a lot of time is being spent gathering stack traces for exceptions (this is with perlcritic 1.118 on perl v5.14.2).

It would be nice to have a Flame Graph generated for each of the top-N files/modules, showing just the subset of call stacks that involve any of the subroutines defined in that file. I didn’t get around to that for v5.00. Feel free to fork the code, add that in, and send me a pull request!

Minor Changes

The very old and very limited nytprofcsv utility has been deprecated. Let me know if you use it, otherwise it would be around much longer.

The blocks option is no longer on by default – it seems that few people used the ability to view statement times rolled up at the block level. You can always enable it with blocks=1 in the options.

What Next?

For NYTProf? I don’t know.

Next up on my to-do list is giving Devel::SizeMe the love it needs. There’s some deep work I’d really like to get done before YAPC::NA in June.

Maybe I’ll see you there.

7 thoughts on “NYTProf v5 – Flaming Precision

    • There is lots to explore with color. The warm colors suits the activity (profiling “hot” CPU usage), and the randomness helps differentiate frames (vertical) and columns (horizontal). I’ve experimented using cool colors to include off-CPU stacks – those blocked on I/O.

      I have wanted to know the function call counts (provided it is available) as well as the sample counts. Are cool colors intuitive for this? A large warm frame would mean many samples and many calls (eg, called by parent function in a hot loop), whereas a large cool frame would mean many samples and few calls (heavy compute). I don’t have a better suggestion: is there something intuitive that would convey compute vs frequency? I would find this information useful, of course. :-)

  1. Pingback: Brendan's blog » Flame Graphs

  2. Pingback: perlbuzz.com » Perlbuzz news roundup for 2013-04-15

Comments are closed.