NYTProf v2 – the background story

This is the back story to the development of NYTProf v2.

Earlier this year (2008) I needed to do some performance profiling of the source code of a large application. Like many perl developers, my first instinct was to try using Devel::DProf (actually the Apache::DProf wrapper as it was a mod_perl application). It was not a great experience.

DProf

DProf seems easily confused by unusual control flow, spewing “… has unstacked calls” warnings. Also, the subroutines it said were taking the most time didn’t make sense to me. Eventually I worked out that Devel::DProf is effectively broken.

The application I was trying to profile has quite a few large subroutines, so knowing just the time spent in the subroutine as a whole didn’t help much for those. I wanted to know where in the subroutine the time was being spent.

FastProf

That led me to look at line-based profilers. At the time there as only one worth looking at, Devel::FastProf by Salvador Fandiño García (which was based on Devel::SmallProf by Ted Ashton).

Line profilers spit out a stream of “file id, line number, time spent” records to a file. The time between starting one statement and starting the next is measured and associated with the line number of the statement.

Devel::FastProf was great. Fast and effective. It gave me far more accurate timings, and when I made changes in the code it highlighted I could see an immediate effect on performance.

Devel::FastProf was great, but I wanted more. The lack of subroutine level timing was frustrating. When you have a ~100,000 lines of code, knowing the time spent on each, and how many times it was executed, is less useful than you may think – there’s just too much detail. Especially when looking for structural problems in the code, or for good places to add caching, or pass extra information down a call chain to save expensive calls deeper in the code. There’s a need for both subroutine and line level timings when profiling.

Idea

I’d had an idea: The line number output in the FastProf profile need not be the line number of the statement. It could output the line number of the subroutine containing the statement. Then you’d automatically get subroutine level timings! Simple.
Then I wondered if it was possible to find the line number of the block the statement was in. That would give block level timing! A first for any perl profiler.

My perl internals knowledge was getting rusty as it was a few years since I’d been Pumpkin for the 5.4.x release, so I asked the wizards on perl5-porters. They gave me hope and enough clues to get going.

Salvador kindly moved the Devel::FastProf code to a public svn, so I could contribute more easily, and I started hacking. I added code to find the nearest enclosing block or sub and it proved very useful.

When I’m optimizing I start by identifying “locally inefficient code”. That is, code you can optimize without significant structural changes. Without making changes outside the subroutine. Moving code outside loops is a common example. Subroutine timings identify the hot subs, then block and line timings pinpoint the hot spots in the code.

That’s the low hanging fruit. Easy pickings, and often very effective. But there’s a limit to how far that’ll take you.

Callers and Callees

There are two ways to optimize a hot piece of code: make it faster, or execute it less often. The former tends to get much more attention than the latter. It’s important to remember to keep stepping back. To keep looking for the bigger picture.

When I’m optimizing I often use a well defined chunk of work, like 10 requests to the same URL, so I can see of the number of times a subroutine is called “feels right”. That often shows subs being called “too often”. But what then? You need to know why the sub is being called too often, so you need to know where it’s being called from.

In Devel::FastProf I added counting of subroutine callers two-levels up the call stack. So I could see that foo() was called 10 times by bar() and that of those 10 calls from bar() 7 has come from baz() and 3 from boo(). It was very simplistic, slow (implemented in perl) and only had counts, not timings, but proved very useful.

Using these additions to Devel::FastProf I reduced the CPU usage of the application by over 40%. Not bad. (I could see another 10% or so to be gained fairly easily but had to draw the line somewhere.)

NYTProf

Meanwhile Adam Kaplan had forked Devel::FastProf, added a test harness and tests, made some internal improvements, grafted on an html report derived from Devel::Cover, and released it as Devel::NYTProf.

When I saw NYTProf I switched to working on that. Again, Adam was kind enough to move the code to a public svn repository. I was attracted not so much by the html report as by the test harness. A lesson to anyone wanting to attract developers to an open source project!

Testing profilers is hard and Adam had come up with a good basic testing framework which was easy to extend. NYTProf now has a strong test suite that profiles 19 different perl scripts with four different combinations of profiler options. The test suite has proven invaluable in identifying regressions during development, and for identifying portability issues between perl versions.

I re-implemented the block/sub level profiling and the subroutine caller tracking from FastProf in NYTProf, but with more care, more attention to performance, and now tests.

I was particularly pleased with the subroutine caller tracking. It intercepts the entersub opcode and uses the save stack for storage and to trigger a ‘destructor’ call to end the timing when the subroutine is exited by any means. The end result is an extremely fast and robust subroutine call profiler. I plan to add an option to disable the other profiler so you can just get subroutine profile details when you don’t need statement level details. It currently lacks the ability to give exclusive times but I think I’ve an efficient solution for that. (Update: Implemented in r340 and r343 so will be in the 2.02 release.)

Accuracy

Another key innovation was to fix a fundamental problem inherent in all statement profilers. Consider a statement that calls a subroutine and then performs some other work that doesn’t execute new statements, for example:

    foo(...) || mkdir(...);

In all other statement profilers the time spent in remainder of the expression (mkdir in the example) will be recorded as having been spent on the last statement executed in foo()! Here’s another example:

    while (<>) {
        ...
        1;
    }

After the first time around the loop, any further time spent evaluating the condition (waiting for input in this example) would be be recorded as having been spent on the last statement executed in the loop!

I fixed this in NYTProf by intercepting all the opcodes which indicate that control is returning into some previous statement and adjusting the profile accordingly.

Reporting

As much effort, if not more, went into the reporting side of the code. And there’s a lot more to be done there. My goal is to keep growing the data model classes to the point where any reporting code can get the information it needs easily enough that there’s no longer a need for the rather limiting Reporter class.

I’d like to see a single ‘nytprof’ command line tool that loads a class to generate the report. That would replace nytprofhtml and nytprofcsv. That would make it easy for other developers to release ‘nytprof reporting modules’ to CPAN.

For example, one very useful report that FastProf has but NTProf currently lacks is a list of most expensive lines (or blocks, or subs) output in the format used by compiler error messages. The format is important because most editors have a special mode for reading such files that means you can hop from one ‘most expensive line’ to the next with a single key stroke. (For vim that’s called quickfix mode.) That’s a wonderful way to browse the hotspots and make edits on the spot.

Future

There are many, many, ways NYTProf can be enhanced further. As I’ve worked on it I’ve dumped ideas, issues and random notes into the HACKING file.

Thanks

I’d like to end by expressing my thanks to Salvador Fandiño García and especially Adam Kaplan for allowing me to contribute to the modules they created and tolerating my strong ideas with understanding. Thank you both. It’s been quite a ride.

NYTProf v2 – A major advance in perl profilers

After much hacking, and just in time for OSCON, I’m delighted to announce the release of Devel::NYTProf version 2. A powerful, efficient, feature-rich perl source code profiler.

“If NYTProf 1 is a Toyota, then 2.0 is a Cadillac”
— Adam Kaplan, author of NYTProf 1.

The Short Story

(I’ve written up the long story, for the record, in another post.)

Adam forked Devel::FastProf (the best statement-level profiler at the time), added a test harness and tests, made some internal improvements, and grafted on an html report derived from Devel::Cover. The resulting Devel::NYTProf v1 was a big hit.

Meanwhile I’d been working on Devel::FastProf, contributing some interesting new profiling features, but it had no test harness and no html reports. When Adam released NYTProf I switched. Attracted not so much by the html report as by the test harness. (A lesson to anyone wanting to attract developers to an open source project.)

I started out by adding in the same new features I’d been adding to FastProf, albeit with more polish and tests. And then I got carried away…

“Holy shit! That is amazing.”
— Andy Lester, after using a recent development version.

Example Screen Shots

As an example I’ve used NYTProf to profile perlcritic 1.088 running on its own source code.

$ cd Perl-Critic-1.088
$ perl -d:NYTProf -S perlcritic .
$ nytprofhtml
$ open nytprof/index.html

The first image is the main index page, showing the top few subroutines and the start of the list of all source files.

NYTProf perlcritic index.png

Each source file has links to separate line-level, block-level, and sub-level reports (though I hope to merge them in future). Clicking on a subroutine name takes you to the line-level report for the file it’s defined in and positions you at the subroutine definition.

(The color coding is based on Median Absolute Deviation of all the values in the column, same as in NYProf v1.)

Here’s the all_perl_files() subroutine, for example:

NYTProf perlcritic all_perl_files.png

The colored numbers show the number of statements executed, the total time taken, and the average. The statement times are always exclusive times. Time actually spent on that statement, the expressions and any built-in functions it uses. It doesn’t include any time spent executing statements elsewhere in subroutines called by it. In NYTProf subroutine timings are inclusive and statement timings are exclusive.

Where did you come from and where are you going?

Notice the grey text.

On lines that define a subroutine NYTProf now adds ‘comments’ giving the total number of times the sub was called, the inclusive time spent in that sub, and the average. Then it adds a break-down of the same details for every location that called the subroutine. Here’s a better example of that:

NYTProf sub-callers.png

On lines that call subroutines NYTProf now adds ‘comments’ giving the name of the actual subroutines called (resolving code refs to subroutine names, including the class that handled a method call). It also tells you how many calls were made and how much time was spent in that subroutine for calls made from that line. Here’s an example:

NYTProf subs-called.png

When you mouse-over the grey text it turns black and you can click on embedded links to take you to the callers or callees. So with a few clicks you can run up and down the call stack exploring where the time is being spent and where the hot spots are being called from. The ability to explore the code so easily, guided by these performance signposts is incredibly useful.

Rolling up for a higher level view

Sometimes per-statement timing can overwhelming. In large subroutines it becomes “hard to see the wood for the trees”. So, for the first time in any Perl profiler, NYTProf now provides a block-level view of the timing data:

NYTProf perlcritic all_perl_files block level.png

What’s happening here is that NYTProf is taking the same time measurements per statement, but instead of accumulating the time against the line the statement is on, it accumulates it against the line of the first statement in the enclosing block. (The actual line it accumulates it against isn’t ideal in some cases. I’m hoping to improve that soon.)

This report is a little more tricky to read but can be very useful, especially in large subroutines. (I hope to improve the readability with some css :hover magic in future.)

The subroutine-level report is very similar except that all the timings are accumulated against line of the first statement in the subroutine.

Have a Look

Back in June I gave a talk at the Irish Open Source Technology conference where I showed the first version of the annotated html report (which I’d been hacking on till 3am the night before while struggling with a cold – presentations are great motivators). You can see the 15 minute video here or here).

Explore for yourself

I’ve uploaded the full set of reports for you to explore here. Take a look. Let me know what you think.

Devel::DProf – broken by the passage of time

Measure twice, cut once.

To measure the performance of your Perl code many developers instinctively reach for Devel::DProf. The venerable perl profiler dates back to 1995.

When profiling you have two broad choices: cpu-time vs wallclock time, and subroutine-level profiling vs line-level profiling. DProf is a subroutine-level profiler. (I’ll talk more about subroutine-level profiling vs line-level profiling in an upcoming post about the soon-to-be-released NYTProf 2.0.)

Measuring cpu-time should, in theory, give you a clear picture of the time spent by perl executing your code, independent of the load on the machine or delays waiting for i/o. Sadly life isn’t so simple.

The big problem is that on most systems cpu-time measurement has a resolution of 0.01 seconds (a HZ value of 100). That may not sound like much time, but you can execute a lot of perl code, and a lot of subroutine calls, in that time.

So any profiler using cpu-time with such a course granularity is going to give ‘noisy’ results. Some subroutines may appear to be slow because the cpu-time ‘tick’ happened to occur more while those subs were running. Other subroutines may appear to be fast because the cpu-time ‘tick’ doesn’t happen much while they’re running.

Try playing around with the venerable Devel::DProf like this:

$ perl -we 'print "sub s$_ { sqrt(42) for 1..100 }; s$_({});\n" for 1..1000' > x.pl
$ perl -d:DProf x.pl
$ dprofpp
Total Elapsed Time = 0.047999 Seconds
  User+System Time = 0.047999 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 20.8   0.010  0.010      1   0.0100 0.0100  main::s986
 20.8   0.010  0.010      1   0.0100 0.0100  main::s29
 20.8   0.010  0.010      1   0.0100 0.0100  main::s321
 0.00       - -0.000      1        -      -  main::s322
 0.00       - -0.000      1        -      -  main::s323
 0.00       - -0.000      1        -      -  main::s329
 0.00       - -0.000      1        -      -  main::s330

Clearly nonsense. Out of 1000 subroutines that would all take about the same length of time, a cpu-time ‘tick’ happened to occur in those three. The rest of the subroutines appear to be ‘free’. (If you play with this yourself you’ll probably need to adjust the 1..100 to see the effect on your system.)

Using dprofpp -r (to “display elapsed real times” instead of cpu-times) gives a very similar granular result. That’s because Devel::DProf artificially reduces the resolution of real time to match the granular cpu-time! It’s also odd that dprofpp doesn’t provide an option to sort by elapsed time, but there’d be little value anyway given the low resolution.

Many people have reported strange results from Devel::DProf over the years. Design decisions that made sense back in 1995 are now causing problems as CPUs have got so much faster. Several have implemented alternative profilers.

Using cpu-time is effectively a kind of statistical sampling. That’s fine as long as you remember it and interpret the results accordingly. The work-around for the course granularity is to execute the code for long enough for the effects to be averaged out. But that can be inconvenient, and how do you know how much is enough?

Personally I’d recommend using wallclock time for profiling. I want to know how much time was spent waiting for network, waiting for disk, waiting for cpu. Those are often significant factors in performance issues. Too many network round trips, or too many individual disk requests, for example, can easily kill performance. If you only profile cpu time you may not spot these kinds of problems. You may just be chasing the random ghosts of occasional cpu clock ticks.

 

So which profiler would I recommend? Stay tuned for an announcement in a day or so. Easy! NYTProf 2.0

Code duplication, cheap but not free

I’m working on a large old codebase at the moment where Repeat Yourself seems to have been standard practice. Here’s a typical example:

    if (exists(Foo::StaticData::NodeRemap->get_data()->{ $args{cid} }->{ $graph_id })
        && Foo::StaticData::NodeRemap->get_data()->{ $args{cid} }->{ $graph_id }->{ as_of_rev } <= $current_revision) {
        $self->{cid} = Foo::StaticData::NodeRemap->get_data()->{ $args{cid} }->{ $graph_id }->{ new_node_id };
    }
    elsif (exists(Foo::StaticData::NodeRemap->get_data()->{ $args{cid} }->{''})
        && Foo::StaticData::NodeRemap->get_data()->{ $args{cid} }->{''}->{ as_of_rev } <= $current_revision) {
        $self->{cid} = Foo::StaticData::NodeRemap->get_data()->{ $args{cid} }->{ '' }->{ new_node_id };
    }

The codebase has many, many, examples of this style written by a variety of developers.

What puzzles me is why this kind of code didn’t raise red flags for the developers at the time. It’s harder to read, harder to maintain, and slower than a simpler approach. Perhaps the elsif part was a copy-n-paste job, but that still doesn’t explain the three instances of Foo::StaticData::NodeRemap->get_data()->{ $args{cid} }->{ $graph_id } in the first part. Perhaps even those were copy-n-pasted.

I’d always have an urge to factor out the common expression into a temporary variable for efficiency and clarity. I wonder if my sensitivity to code duplication is partly due to needing to pay close attention to efficiency for most of my career. (I started Perl programming about 15 years ago, when cpu performance was measured in MHz.)

I also wonder how soon tools like Perl::Critic can help detect duplicate code fragments. Common sub-expressions that may be candidates for elimination.

I’m working on optimizing the codebase at the moment. That chunk showed up as a performance issue so I rewrote it as

    my $NodeRemap = Foo::StaticData::NodeRemap->get_data();
    for my $id ( $graph_id, '' ) {
        my $x = $NodeRemap->{ $args{cid} }->{ $id };
        next unless $x and $x->{as_of_rev} > $current_revision;
        $self->{cid} = $x->{new_node_id};
        last;
    }

Spot my mistake?

Sidebar: This post is also an experiment in posting code to my blog. I’m trying out MarsEdit. It’s good but I’d like to see a “Paste Preformatted” mechanism that would also html escape the contents of the paste buffer. It’s scriptable so I guess I could implement it myself in my copious spare time…