NYTProf screencast from the 2008 London Perl Workshop

I’ve uploaded the screencast of my NYTProf talk at the London Perl Workshop in November 2008.

It’s based on a not-quite-2.08 version and includes some coverage of an early draft of the ‘timings per rolled-up package name’ feature I discussed previously.

It also shows how and why anonymous subroutines, defined at the same line of different executions of the same string eval, get ‘merged’.

The demos use perlcritic and Moose code. It also includes a nice demonstration showing NYTProf highlighting a performance problem with File::Spec::Unix when called using Path::Class::Dir objects.

It’s 36 minutes long, including a good Q&A session at the end (wherein a market rate for performance improvements is established). Enjoy.

NYTProf 2.08 – better, faster, more cuddly

I’ve just released NYTProf 2.08 to CPAN, some three and a half months after 2.07.

If you’ve been profiling large applications then the first difference you’ll notice is that the new version generates reports much faster.

NYTProf 2.08 timings.pngThe next thing you may notice is that statement timings are now nicely formatted with units. Gisle Aas contributed the formatting code for 2.07 but I had to do some refactoring to get it working for the statement timings.

Another nice refinement is that hovering over a time will show a tool-tip with the time expressed as a percentage of the overall runtime.

Almost all the tables are now sortable. I used jQuery and the tablesorter plugin for that. I’ve not added any fancy buttons, just click on a table heading to sort by that column. You’ll see a little black arrow to show the column is sorted. (You can hold the shift key down to add second and third columns to the sort order.)

A profiler isn’t much use if it’s not accurate. NYTProf now has tests for correct handling of times for string evals within string evals. In fact the handling of string evals got a big overhaul for this version as part of ongoing improvements in the underlying data model. I’m working towards being able to show annotated performance reports for the contents of string evals. It’s not there yet, but definitely getting closer.

A related feature is the new savesrc=1 option. When enabled, with a recent version of perl, the source code for each source file is written into the profile data file. That makes the profile self-contained and, significantly, means that accurate reports can be generated even after the original source files have been modified.

Another new option is optimize=0. You can use it to disable the perl optimizer. That can be worth doing if the statement timings, or counts, for some chunk of code seem odd and you suspect that the perl optimizer has rewritten it.

The final new feature noted in the NYTProf 2.08 Changes file is that it’s now possible to generate multiple profile data files from a single application. Since v2.0 you could call DB::disable_profile() and DB::enable_profile() to control profiling at runtime. Now you can pass an optional filename to enable_profile to make it close the previous profile and open a new one. I imagine this would be most useful in long running applications where you’d leave profiling disabled (using the start=none option) and then call enable_profile and disable_profile around some specific code in specific situations – like certain requests to a mod_perl app.

There’s one more new feature that I’ve just realised I’d forgotten to add to the Changes file before the release: Timings per rolled-up package name. What’s that? Well, it’s probably easiest to show you…

These images are taken from a profile of perlcritic. Each shows the time spent exclusively in subroutines belonging to a certain package and any packages below it. Hovering over a time gives the percentage, so I can see that the 57.3s spent in the 36 PPI packages accounted for 42% of the runtime.

NYTProf 2.08 pkg1.png

This gives you a quick overview for large (wide) codebases that would be hard to get in any other way.

Tables are generated for upto five levels of package name hierarchy, so you can drill-down to finer levels of detail.

NYTProf 2.08 pkg2.png

 

NYTProf 2.08 pkg3.png

I can visualize a much better UI for this data than the series of tables nytprofhtml currently produces, but my limited free time and jQuery skills prevent me doing more. Patches welcome, naturally.

Enjoy!

p.s. I’ve a screencast from my NYTProf talk at the London Perl Workshop in November I hope to (finally) upload soon. It includes a demo of the package roll-up timing tables.

Can you reproduce this NYTProf failure?

I’ve a new release of NYTProf ready to upload but I’m stuck.

The CPAN Testers service is reporting a failure on a number of systems but I can’t reproduce it locally or work out the cause.

Can you reproduce the failure with Devel::NYTProf 2.07_94? If so, could you give me remote access via ssh? (Or spare some time to investigate yourself – I’ll happily clue you in if you can reproduce the problem.)

Update: No one could reproduce it. It seems that the failures was not what it appeared to be. A clue was that only one tester was affected. Devel-NYTProf-2.07_94.tar.gz unpacked itself into a directory called Devel-NYTProf-2.07. It seems that when using CPANPLUS if the user had already got an old Devel-NYTProf-2.07 directory its contents got merged and tests would fail. I’m not convinced that’s the whole story, but Devel-NYTProf-2.07_95.tar.gz unpacked into a Devel-NYTProf-2.07_95 directory and didn’t run into the problem.

Update: More usefully, Andreas made my wish come true by pointing out the –solve parameter to the ctgetreports utility in his CPAN::Testers::ParseReports distribution. It “tries to identify the best contenders for a blame using Statistics::Regression. [...] The function prints the [...] top 3 candidates according to R^2 with their regression analysis.” Cool.

NYTProf 2.07 – Even better, and now for Windows too

NYTProf development rolls on, with 2.07 being another significant release.

Here’s a summary of highlights from the many improvements.

Windows, at last

Windows wizard Jan Dubois has contributed patches that enable NYTProf to run on Windows. Many thanks Jan!

Corrected Subroutine Total Time

A significant error in time reported as spent in a subroutine has been fixed. It was showing the sum of the inclusive and exclusive time instead of just the inclusive time.

String Evals

One of NYTProf’s innovative features is tracking of subroutine calls by calling location. In earlier versions you couldn’t see calls made from within string evals. Now you can. Even for evals within evals:

NYTProf string eval.png

That’s a big step forward for profiling code that makes heavy use of evals.

(The use of the µ micro symbol is also new – part of some helpful interface polishing contributed by Gisle Aas. They’re not used on the statement timings yet, but will be at some point.)

Recursion

Previously the ‘inclusive time’ reported for a subroutine wasn’t useful for subroutines called recursively. The overall inclusive time included the inclusive time of nested calls, typically inflating it beyond usefulness.

Now the inclusive time only measures the inclusive time of the outermost calls. NYTProf now also tracks and reports the maximum call depth for each calling location.

Disabling the Statement Profiler

NYTProf started life as a statement profiler. I added a subroutine profiler later and was delighted with how well that worked out. The subroutine profiler is novel in a few ways and also extremely robust and efficient. Part of the efficiency comes from not streaming data to disk while the program is running. It accumulates the counts and timings in memory.

When profiling large and/or performance sensitive applications the costs of NYTProf’s statement profiler, in performance impact and disk space, can be prohibitive. So now I’ve added an option to disable the statement profiler. Setting the stmts=0 option turns NYTProf into a very lightweight subroutine profiler.

Caller Information for XSubs

NYTProf reports time spent in subroutines broken down by calling location. It’s one of its key features. The caller details are embedded as comment annotation into the report for each perl source code file.

That has worked very well for perl subs, but there’s no natural home for XS subroutines because there was no source code file available to annotate.

There’s a partial solution included in NYTProf 2.07: XS subs that are in the same package as profiled perl subs, get annotated subroutine ‘stubs’ appended to the report for the corresponding perl source file.

It works quite well, in particular the common UNIVERSAL methods get annotated:

NYTProf xsub callers.png

Also, now these subs have a place in the reports, any references to them are hyper links, naturally.

Sadly it doesn’t work so well for packages that don’t contain any perl subs, like DBI::st. I plan to add per-package report pages for them in a future release.

Clock Docs

I’ve greatly expanded the description of the various clocks available for profiling: POSIX real-time clocks, gettimeofday(), and Time::HiRes.

I’ve also added a caveats section that discusses issues with SMP systems, processor affinity, and virtual machines.

Feedback Wanted

How has NYTProf helped you optimize your applications? Let me know.

Perhaps as blog posts of your own, linking back here, or comments here, or CPAN ratings, or email to the mailing list or just to me. I’d love to hear from you.

Thanks!

NYTProf 2.05 – assorted fixes

Hot on the heels of NYTProf 2.04, I’ve just released NYTProf 2.05. It has a few small but significant bug fixes. From the Changes file:

  Fixed alteration of $! during profiling, thanks to Slaven Rezic.
  Fixed clock_gettime() clock selection to fallback to CLOCK_REALTIME
    if CLOCK_MONOTONIC is not available, e.g, on linux 2.4.
  Fixed error when application ends after DB::disable_profile().
  Added some docs to Devel::NYTProf::Apache
  Added clock=N option to enable user to select the clock.

All the bugs were long-standing ones that happened to come to light and get fixes after 2.04 was released.

The first was the cause of a couple of odd bugs were applications would behave differently when profiled. The problem was that in certain situations the profiler would alter the value of $! between statements, so code that tested for $! would think something had gone wrong.

The second caused weird timing values on some platforms. The statement times would be zero and the subroutine times would be garbage. The problem was that some systems, such as Linux 2.4, define the CLOCK_MONOTONIC macro but don’t implement that clock in the kernel. NYTProf wasn’t testing that the clock worked. Now it does. If the clock_gettime() call fails with the selected clock it falls back to CLOCK_REALTIME (and if that fails NYTProf aborts).

As a bonus I’ve added the ability to select what clock you want to use by setting clock=N in the NYTPROF environment variable, where N is the integer id of the clock you want to use (usually you can find the CLOCK_* macros in the /usr/include/time.h file). This only works if your system supports clock_gettime(), which most modern unix-like systems do, with the notable exception of OS X.

Why would you want to use a different clock?

Well the default, CLOCK_MONOTONIC, is relatively expensive to call, and NYTProf calls it twice per statement. So it’s the dominant cost of the profiler.

Some systems, like Linux 2.6, offer other clocks, like CLOCK_MONOTONIC_FAST that returns the most recently cached value of the clock. That’s typically just 1/10th the cost of CLOCK_MONOTONIC, so your code runs much faster under the profiler.

Naturally there’s a trade-off: the accuracy of the timings drops significantly. Still far better than Devel::DProf, but far worse than normal. You’re dependent on how often the cached time value gets updated by your kernel. I saw a figure of “600 times per second” quoted for one system, but I can’t find the link now. My advice would be to steer clear of this feature unless you feel comfortable with the issues like these.

Enjoy!

NYTProf 2.04 gives you 90% smaller data files

At OSCON this year I gave a talk on my new pet project Devel::NYTProf v2 to a packed room. Turned out to be a lot of fun.

“The first thing I need to do is talk about Devel::DProf because it needs to be taken out and shot.”

I made a screencast of the 40 minute talk which you can watch on blip.tv here. Worth watching for the background on profilers, the demo of NYTProf, and the questions, not to mention the teasing I get along the way.

One of the final questions was about the size of the profile data file that NYTProf produces. One of the major drawbacks of statement-level profiling is the volume of data it generates while profiling your code. For every statement executed the profiler streams out the file id, the line number, and the time spent. For every statement! When trying to profile a full application doing real work the volume of data generated quickly becomes impractical to deal with. Multi-gigabyte files are common.

This was the major problem with Devel::SmallProf, which generated text files while profiling. Salvador Fandiño García addressed that in Devel::FastProf by writing the data in a compact binary form. A vast improvement that contributed to Devel::FastProf (on which Devel::NYTProf is based) being the first statement-level profiler worth using on large applications. Even so, the volume of data generated was still a problem when profiling all but short running applications.

NYTProf 2.03 was producing profile data at the rate of about 13MB per million statements executed. That might not sound too bad until you realise that on modern systems with cpu intensive code, perl can execute millions of statements every few seconds.

I could see a way to approximately halve the data volume by changing the format to optimize of the common case of consecutive statements being in the same file, but that wasn’t going to be enough. The best way forward would be to add zip compression. It would be easy enough to pipe the output stream through a separate zip process, but that approach has a problem: the zip process will be soaking up cpu time asynchronously from the app being profiled. That would affect the realtime measurements in an unpredictable way.

I realized back in June that a better approach would be to embed zip compression into NYTProf itself. Around the end of July Nicholas Clark, current Perl Pumpkin, got involved and was motivated to implement the internal zipping because he was “generating over 4Gb of profile data trying to profile the harness in the Perl 5 core running tests in parallel“.

He did a great job. The zlib library is automatically detected at build time and, if available, the code to dynamically route i/o through the zip library gets compiled in. The output stream starts in normal mode, so you can easily see and read the plain text headers in the data file, then switches to zip compression for the profile data. How well did it work out? This graph tells the story:

NYTProf 2.04 compression.png

(The data relates to profiling perlcritic running on a portion of its own source code on my MacBook Pre 2GHz laptop. I only took one sample at each compression level so there may be some noise in the results.)

The data file size (red) plummets even at the lowest compression level. Also note the corresponding drop in system time (yellow) due to the reduction in context switches and file i/o.

I’ve set the default compression level to 6. I doubt you’ll want to change it, but you can by adding compression=N to the NYTPROF environment variable.

Here are the change notes for the 2.04 release:

  Fixed rare divide-by-zero error in reporting code.
  Fixed rare core dump in reporting code.
  Fixed detection of #line directives to be more picky.
  Fixed some compiler warnings thanks to Richard Foley.
  Added on-the-fly ~90% zip compression thanks to Nicholas Clark.
    Reduces data file size per million statements executed
    from approx ~13MB to ~1MB (depends on code being profiled).
  Added extra table of all subs sorted by inclusive time.
  No longer warns about '/loader/0x800d8c/...' synthetic file
    names perl assigns reading code from a CODE ref in @INC

Enjoy!

The Italian Perl Workshop

Pisa Cathedral Wall.jpgI spent a very pleasant few days in Pisa, Italy, last week. I’d been invited to speak at the Fourth Italian Perl Workshop. The workshop was a great success. In fact calling it a “workshop” is selling it short. It’s more of a mini-conference:

“2 days of conference, 2 simultaneous tracks, more than 30 talks, 120 attendees, 20 sponsors and many international guests”

The whole event ran very smoothly thanks to a great team lead by Gianni Ceccarelli, Francesco Nitido, and Enrico Sorcinelli. I’ll echo the compliments of one attendee “Complimenti sinceri agli organizzatori! Bravissimi! Tutto perfetto!

I gave short talk on Gofer on the Thursday, and then two 40 minutes talks on Friday: Perl Myths, and Devel::NYTProf. I hope to upload screencasts and slides next week. The talks were all recorded on video so I imagine they’ll be uploaded at some point. I’ll add links here to them when they are.

The majority of the sessions were in Italian so, since my Italian is practically non-existant, I had plenty of time to work.

Or at least try to. The one disappointment of the trip for me was the apparent poor quality of the Italian internet. Using mtr I’d regularly see over 20% packet loss within telecomitalia.it and interbusiness.it from my hotel room. Occasionally over 50%. It got much better at night, so I’d do more work then. At the conference venue the Italian academic network (garr.net) also regularly had over 20% packet loss at its link to the internet. All this was, of course, outside the control of the organisers.

The “corridoor track” at perl conferences is always good. I had a chance to talk to Rafel Garcia-Suarez (and meet his lovely wife and new baby son), Matt S Trout, Hakim Cassimally, Michel Rodriguez, Marcus Ramberg, and many others.

I had opted to take a very early fight so I’d have a day being a tourist in Pisa before the conference started. The weather was beautiful and I had a lovely time strolling though the streets of this ancient city.

Pisa Knights Square.jpg

I didn’t take my camera with me, but I did take my iPhone so I was able to capture a few snaps as I strolled around and climbed the tower. (Yes, it really does lean in a disconcerting “it must be about to fall down” way. All the more dramatic when you’re up close and can appreciate the massive scale of the tower.)

Pisa View over Cathedral.jpg

Perl Myths – OSCON 2008

I gave a updated version of my earlier Perl Myths talk at OSCON this year. It includes updated numbers, updated job trend graphs (showing good growth in perl jobs) and slides for the perl6 portion that were missing from the upload of the previous talk.

Two versions of the slides are available: one with just the slides on a landscape page, and another with slides and notes on a portrait page.

I also have a screencast of the presentation which I hope to edit and upload before long. (I’ll update this page and post a new note when I do.)

NYTProf v2 – the background story

This is the back story to the development of NYTProf v2.

Earlier this year (2008) I needed to do some performance profiling of the source code of a large application. Like many perl developers, my first instinct was to try using Devel::DProf (actually the Apache::DProf wrapper as it was a mod_perl application). It was not a great experience.

DProf

DProf seems easily confused by unusual control flow, spewing “… has unstacked calls” warnings. Also, the subroutines it said were taking the most time didn’t make sense to me. Eventually I worked out that Devel::DProf is effectively broken.

The application I was trying to profile has quite a few large subroutines, so knowing just the time spent in the subroutine as a whole didn’t help much for those. I wanted to know where in the subroutine the time was being spent.

FastProf

That led me to look at line-based profilers. At the time there as only one worth looking at, Devel::FastProf by Salvador Fandiño García (which was based on Devel::SmallProf by Ted Ashton).

Line profilers spit out a stream of “file id, line number, time spent” records to a file. The time between starting one statement and starting the next is measured and associated with the line number of the statement.

Devel::FastProf was great. Fast and effective. It gave me far more accurate timings, and when I made changes in the code it highlighted I could see an immediate effect on performance.

Devel::FastProf was great, but I wanted more. The lack of subroutine level timing was frustrating. When you have a ~100,000 lines of code, knowing the time spent on each, and how many times it was executed, is less useful than you may think – there’s just too much detail. Especially when looking for structural problems in the code, or for good places to add caching, or pass extra information down a call chain to save expensive calls deeper in the code. There’s a need for both subroutine and line level timings when profiling.

Idea

I’d had an idea: The line number output in the FastProf profile need not be the line number of the statement. It could output the line number of the subroutine containing the statement. Then you’d automatically get subroutine level timings! Simple.
Then I wondered if it was possible to find the line number of the block the statement was in. That would give block level timing! A first for any perl profiler.

My perl internals knowledge was getting rusty as it was a few years since I’d been Pumpkin for the 5.4.x release, so I asked the wizards on perl5-porters. They gave me hope and enough clues to get going.

Salvador kindly moved the Devel::FastProf code to a public svn, so I could contribute more easily, and I started hacking. I added code to find the nearest enclosing block or sub and it proved very useful.

When I’m optimizing I start by identifying “locally inefficient code”. That is, code you can optimize without significant structural changes. Without making changes outside the subroutine. Moving code outside loops is a common example. Subroutine timings identify the hot subs, then block and line timings pinpoint the hot spots in the code.

That’s the low hanging fruit. Easy pickings, and often very effective. But there’s a limit to how far that’ll take you.

Callers and Callees

There are two ways to optimize a hot piece of code: make it faster, or execute it less often. The former tends to get much more attention than the latter. It’s important to remember to keep stepping back. To keep looking for the bigger picture.

When I’m optimizing I often use a well defined chunk of work, like 10 requests to the same URL, so I can see of the number of times a subroutine is called “feels right”. That often shows subs being called “too often”. But what then? You need to know why the sub is being called too often, so you need to know where it’s being called from.

In Devel::FastProf I added counting of subroutine callers two-levels up the call stack. So I could see that foo() was called 10 times by bar() and that of those 10 calls from bar() 7 has come from baz() and 3 from boo(). It was very simplistic, slow (implemented in perl) and only had counts, not timings, but proved very useful.

Using these additions to Devel::FastProf I reduced the CPU usage of the application by over 40%. Not bad. (I could see another 10% or so to be gained fairly easily but had to draw the line somewhere.)

NYTProf

Meanwhile Adam Kaplan had forked Devel::FastProf, added a test harness and tests, made some internal improvements, grafted on an html report derived from Devel::Cover, and released it as Devel::NYTProf.

When I saw NYTProf I switched to working on that. Again, Adam was kind enough to move the code to a public svn repository. I was attracted not so much by the html report as by the test harness. A lesson to anyone wanting to attract developers to an open source project!

Testing profilers is hard and Adam had come up with a good basic testing framework which was easy to extend. NYTProf now has a strong test suite that profiles 19 different perl scripts with four different combinations of profiler options. The test suite has proven invaluable in identifying regressions during development, and for identifying portability issues between perl versions.

I re-implemented the block/sub level profiling and the subroutine caller tracking from FastProf in NYTProf, but with more care, more attention to performance, and now tests.

I was particularly pleased with the subroutine caller tracking. It intercepts the entersub opcode and uses the save stack for storage and to trigger a ‘destructor’ call to end the timing when the subroutine is exited by any means. The end result is an extremely fast and robust subroutine call profiler. I plan to add an option to disable the other profiler so you can just get subroutine profile details when you don’t need statement level details. It currently lacks the ability to give exclusive times but I think I’ve an efficient solution for that. (Update: Implemented in r340 and r343 so will be in the 2.02 release.)

Accuracy

Another key innovation was to fix a fundamental problem inherent in all statement profilers. Consider a statement that calls a subroutine and then performs some other work that doesn’t execute new statements, for example:

    foo(...) || mkdir(...);

In all other statement profilers the time spent in remainder of the expression (mkdir in the example) will be recorded as having been spent on the last statement executed in foo()! Here’s another example:

    while (<>) {
        ...
        1;
    }

After the first time around the loop, any further time spent evaluating the condition (waiting for input in this example) would be be recorded as having been spent on the last statement executed in the loop!

I fixed this in NYTProf by intercepting all the opcodes which indicate that control is returning into some previous statement and adjusting the profile accordingly.

Reporting

As much effort, if not more, went into the reporting side of the code. And there’s a lot more to be done there. My goal is to keep growing the data model classes to the point where any reporting code can get the information it needs easily enough that there’s no longer a need for the rather limiting Reporter class.

I’d like to see a single ‘nytprof’ command line tool that loads a class to generate the report. That would replace nytprofhtml and nytprofcsv. That would make it easy for other developers to release ‘nytprof reporting modules’ to CPAN.

For example, one very useful report that FastProf has but NTProf currently lacks is a list of most expensive lines (or blocks, or subs) output in the format used by compiler error messages. The format is important because most editors have a special mode for reading such files that means you can hop from one ‘most expensive line’ to the next with a single key stroke. (For vim that’s called quickfix mode.) That’s a wonderful way to browse the hotspots and make edits on the spot.

Future

There are many, many, ways NYTProf can be enhanced further. As I’ve worked on it I’ve dumped ideas, issues and random notes into the HACKING file.

Thanks

I’d like to end by expressing my thanks to Salvador Fandiño García and especially Adam Kaplan for allowing me to contribute to the modules they created and tolerating my strong ideas with understanding. Thank you both. It’s been quite a ride.