What’s actually installed in that perl library?

A key part of my plan for Upgrading from Perl 5.8 is the ability to take a perl library installed for one version of perl, and reinstall it for a different version of perl.

To do that you have to know exactly what distributions were installed in the original library. And not just which distributions, but which versions of those distributions.

I’ve a solution for that now. It turned out to be rather harder to solve than I’d thought… As I mentioned previously, I had developed a “distinctly hackish solution” that seemed to be working well. Sadly it didn’t withstand battle testing.

We have a library with almost 5000 modules installed from CPAN over many years. I ran that hackish script and it duly listed the distributions it thought were installed. Using that list I reinstalled them into a new library and ran diff -r to compare the two. That found a bunch of differences that led me into a vortex of hacking and rerunning. Eventually I had to admit that the whole approach wasn’t robust enough and I started to explore other ideas.

Some searching turned up BackPAN::Version::Discover, which is meant to “Figure out exactly which dist versions you have installed”. Perfect. Sadly it simply didn’t work well for me, probably because it uses an approach with the same flaws as my own.

I knew brian d foy’s MyCPAN project was working towards a similar goal. His approach required us to either run a large BackPAN indexing process or pay to license the data to offset his costs for doing so. That didn’t seem attractive.

I wondered about using GitPAN and the GitHub API to match git blob hashes of local modules with files in the gitpan repos. Sadly GitPAN has fallen out of date and isn’t being maintained at the moment. With hindsight I’m thankful for that because it led me to a better solution.

MetaCPAN

MetaCPAN is full of awesome. On the surface it looks like another kind of search.cpan.org site. Don’t be fooled. Underneath is a vast repository of CPAN metadata powered by an ElasticSearch distributed database (based on Lucene). How vast? Every file in every distribution on CPAN (and, critically for me, the BackPAN archive) has been indexed in great detail, including the file size and which spans of lines are code and which are pod.

The cherry on the cake is the RESTful API that provides full access to ElasticSearch query expressions.

The key “lightbulb over head” moment came when I realized I could ask MetaCPAN to “find all releases that contain a particular version of a module”. Bingo!
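For the curious, here’s roughly what such a query looks like. This is a minimal sketch rather than Dist::Surveyor’s actual code: it assumes today’s fastapi.metacpan.org/v1 endpoint, the module name and version are placeholders, and the field names come from my reading of the public /file document schema.

use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(encode_json decode_json);

# Hypothetical module and version we want to locate on CPAN/BackPAN.
my ($module, $version) = ('Foo::Bar', '1.23');

# ElasticSearch query against MetaCPAN's /file documents: find files whose
# embedded package list contains this module at this exact version.
my $query = {
    size    => 100,
    _source => [qw(author release distribution)],
    query   => { bool => { must => [
        { term => { 'module.name'    => $module  } },
        { term => { 'module.version' => $version } },
    ] } },
};

my $res = HTTP::Tiny->new->post(
    'https://fastapi.metacpan.org/v1/file/_search',
    { headers => { 'Content-Type' => 'application/json' },
      content => encode_json($query) },
);
die "MetaCPAN query failed: $res->{status} $res->{reason}\n"
    unless $res->{success};

# Each hit names a candidate release that shipped that module version.
for my $hit (@{ decode_json($res->{content})->{hits}{hits} }) {
    printf "%s/%s (dist %s)\n",
        @{ $hit->{_source} }{qw(author release distribution)};
}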

The Method

The next step was to work out which of those candidates was the one actually installed. The key realization here was that I could use MetaCPAN to get version and file size info for all the modules in each candidate release and see how well they matched what was currently installed.

The whole process falls into several distinct phases…

The first phase finds the name, version, and file size of all the modules in the library being surveyed. (Taking care to handle an archlib nested within the main lib.)
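As an illustration of that first phase (a simplified sketch that assumes a single plain lib directory, not the module’s real code), File::Find plus Module::Metadata is enough to collect the three facts needed per module:

use strict;
use warnings;
use File::Find;
use Module::Metadata;

my $libdir = shift @ARGV or die "usage: $0 /some/perl/lib/dir\n";

# module name => { version => declared $VERSION (may be undef), size => bytes }
my %installed;

find(sub {
    return unless /\.pm\z/;
    my $info = Module::Metadata->new_from_file($File::Find::name)
        or return;
    my $name = $info->name
        or return;                         # skip files with no package statement
    $installed{$name} = {
        version => $info->version($name),  # undef for unversioned modules
        size    => -s $File::Find::name,
    };
}, $libdir);

printf "%s %s %d\n", $_, $installed{$_}{version} // 'undef', $installed{$_}{size}
    for sort keys %installed;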

Then, for every module, it asks MetaCPAN for all the distribution releases that included that module version. For rarely changed modules in frequently released distributions there might be many candidates, so it tries to limit the number of candidates by also matching the file size. This is especially helpful for modules that don’t have a version number.

Then, for every candidate distribution release, MetaCPAN is queried to get the modules in the release, along with their version numbers and file sizes. These are compared to the data it gathered about the locally installed modules to yield a “fraction installed” figure between 0 and 1. The candidates that share the highest fraction installed are returned.
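The scoring itself is the simple part. Here’s an illustration of the principle only (treating versions as plain strings and requiring exact size matches), not Dist::Surveyor’s actual implementation:

# $installed is the hash built while surveying the local library;
# $release_modules is an arrayref of { name, version, size } hashes as
# reported by MetaCPAN for one candidate release.
sub fraction_installed {
    my ($installed, $release_modules) = @_;
    return 0 unless @$release_modules;
    my $matched = 0;
    for my $mod (@$release_modules) {
        my $local = $installed->{ $mod->{name} } or next;
        # count a module as matching only if version and file size both agree
        next unless ($local->{version} // '') eq ($mod->{version} // '');
        next unless ($local->{size}    // -1) == ($mod->{size}    // -2);
        $matched++;
    }
    return $matched / @$release_modules;
}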

Typically there’s just one candidate that has a fraction installed of 1. A perfect match. Sometimes the fraction is less than 1 for various obscure but valid reasons. Sometimes life isn’t so simple: there may be multiple candidates that have the same highest fraction installed value. So the next phase attempts to narrow the choice from among the “best candidates” for each module. The results are gathered into a two-level hash of distributions and candidate releases.

The final phase is the first to work in terms of distributions instead of modules. For each distribution it tries to choose among the candidate releases.
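To make the data shape concrete, here’s a toy example of that two-level hash with a naive “pick the highest fraction” selection. The names and numbers are made up, and the real tie-breaking in Dist::Surveyor is more involved than this:

# distribution name => { release id => fraction installed }
my %candidates = (
    'Foo-Bar' => {
        'AUTHORX/Foo-Bar-1.23' => 1.00,
        'AUTHORX/Foo-Bar-1.24' => 0.95,
    },
);

for my $dist (sort keys %candidates) {
    my $releases = $candidates{$dist};
    my ($best) = sort { $releases->{$b} <=> $releases->{$a} } keys %$releases;
    print "$dist => $best (fraction installed $releases->{$best})\n";
}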

The Results

The method seems to work well. It identifies files with local changes. It deals gracefully with ‘remnant’ modules that were included in an old release but not in later ones. And it copes with distributions that have been split into separate distributions.

It reports progress and anything unusual to stderr and writes the list of distributions to stdout. You should investigate anything that’s reported to ensure that the chosen distribution is the right one.

I checked the results by creating a new library (see below) and running diff -r old_lib new_lib. I didn’t see any differences that I couldn’t account for.

The survey process is not fast. It can take a couple of hours on the first run for a large library. Most of that time is spent making MetaCPAN calls (lots and lots of MetaCPAN calls) so you’re dependent on network and MetaCPAN performance. Most of the calls are cached in an external file so later runs are much faster.
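That external cache can be as simple as a hash of request to response persisted with Storable. The sketch below is just one way to do it and isn’t necessarily how dist_surveyor implements its cache; the file name and helper are invented for the example:

use strict;
use warnings;
use Storable qw(retrieve nstore);

my $cache_file = 'metacpan_cache.storable';        # invented name
my $cache = -f $cache_file ? retrieve($cache_file) : {};

# POST a JSON body to a MetaCPAN URL, reusing any previously cached response.
sub cached_metacpan_post {
    my ($ua, $url, $body) = @_;                    # $ua is an HTTP::Tiny object
    my $key = "$url\n$body";
    return $cache->{$key} if exists $cache->{$key};
    my $res = $ua->post($url, {
        headers => { 'Content-Type' => 'application/json' },
        content => $body,
    });
    die "request failed: $res->{status} $res->{reason}\n" unless $res->{success};
    $cache->{$key} = $res->{content};
    nstore($cache, $cache_file);                   # persist after each new call
    return $cache->{$key};
}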

Using The Results

Using a list of distributions to recreate a library isn’t as straightforward as it might seem. You can’t just give the list to cpanm because it would try to install the latest version of any prerequisites. I looked at using --scandeps or topological sorting to reorder the list to put the prerequisites first. It didn’t work out. I also looked at using CPAN::Mini::Inject (and OrePAN and Pinto) to create a local MiniCPAN for cpanm to fetch from. They didn’t work out either, for various reasons.

In the end I added a --makecpan dir option so that the surveyor script itself would fetch the distributions and create a MiniCPAN for cpanm to use.

So now a typical initial run looks like this:

dist_surveyor --makecpan my_cpan /some/perl/lib/dir > installed_dists.txt

followed by building a new library from the results:

cpanm --mirror file:$PWD/my_cpan --mirror-only -l new_lib < installed_dists.txt

If you need to rebuild the library, perhaps due to test failures, then it’s much faster to use a list of modules to drive cpanm. Fortunately dist_surveyor writes one for you:

cpanm --mirror file:$PWD/my_cpan --mirror-only -l new_lib < my_cpan/dist_surveyor/token_packages.txt

Testing Bonus

Speaking of test failures, I was surprised to see how often tests failed due to problems with prerequisites even though the distribution and its prerequisites had passed their tests when originally installed. For example, imagine distribution A v1 and its prerequisite B v1 are installed. Later, distribution B gets upgraded to v2 but the tests for distribution A don’t get rerun against it.

Reinstalling all the distributions forces all distributions to be tested with the prerequisites that are actually being used.

Presentation Slides

I gave a lightning talk on Dist::Surveyor at the 2011 London Perl Workshop (always a great event) and uploaded the slides.

Source Code

The repository is on GitHub and I’ve made a release to CPAN.

7 thoughts on “What’s actually installed in that perl library?”

  1. It’s not that MyCPAN “isn’t usable”. I have the generated data and the scripts to use that data to do the task. You declined to use my data. There’s no “similarly flawed approach”.

    The DPAN you mention in your slides isn’t the part of this process that does the part you are trying to do. You’ve never seen the stuff I told you that I had because I haven’t made it public. You declined to license it.

    I’m perfectly fine with you not using it, but don’t misrepresent things.

    • The “similarly flawed approach” comment was in relation to BackPAN::Version::Discover.

      Re MyCPAN, I had honestly completely forgotten about the data license fees. I had flagged it in my mind as “not an option for us” and when I wrote this post I only remembered your previous blog post where it seemed clear there was still work to be done before it was complete.

      I’m sorry for misrepresenting the state of MyCPAN. It certainly wasn’t intentional. I’ve updated the post. Let me know if it’s not clear.

  2. Thanks for this. My company will likely also do a big upgrade from 5.8, and your tool may prove very useful in this endeavor.

  3. We did our 5.12 -> 5.14 upgrade the hard way – trial and error, but on purpose: we wanted to install what was needed, not what we had previously installed. Still, I would have liked to have this available. :-)

    • We wanted as few variables as possible for the 5.8 -> 5.10 jump (which for us will also include a CentOS5 to CentOS6 migration). We might take the trial-and-error (“take no baggage”) approach for the 5.10 -> 5.12 or 5.12 -> 5.14 upgrades.