A key part of my plan for Upgrading from Perl 5.8 is the ability to take a perl library installed for one version of perl, and reinstall it for a different version of perl.
To do that you have to know exactly what distributions were installed in the original library. And not just which distributions, but which versions of those distributions.
I’ve a solution for that now. It turned out to be rather harder to solve than I’d thought… As I mentioned previously, I had developed a “distinctly hackish solution” that seemed to be working well. Sadly it didn’t withstand battle testing.
We have a library with almost 5000 modules installed from CPAN over many years. I ran that hackish script and it duly listed the distributions it thought were installed. Using that list I reinstalled them into a new library and ran diff -r to compare the two. That found a bunch of differences that led me into a vortex of hacking and rerunning. Eventually I had to admit that the whole approach wasn’t robust enough and I started to explore other ideas.
Some searching turned up BackPAN::Version::Discover which is meant to “Figure out exactly which dist versions you have installed”. Perfect. Sadly it simply didn’t work well for me. Probably because it’s using a similarly flawed approach to my own.
I knew brian d foy’s MyCPAN project was working towards a similar goal. His approach required us to either run a large BackPAN indexing process or pay to license the data to offset his costs for doing so. That didn’t seem attractive.
I wondered about using GitPAN and the github API to match git blob hashes of local modules with files in the gitpan repos. Sadly GitPAN has fallen out of date and isn’t being maintained at the moment. With hindsight I’m thankful for that because it led me to a better solution.
MetaCPAN is full of awesome. On the surface it looks like another kind of search.cpan.org site. Don’t be fooled. Underneath is a vast repository of CPAN metadata powered by an ElasticSearch distributed database (based on Lucene). How vast? Every file in every distribution on CPAN (and, critically for me, the BackPAN archive) has been indexed in great detail. Including details like the file size and which spans of lines are code and which are pod.
The key “lightbulb over head” moment came when I realized I could ask MetaCPAN to “find all releases that contain a particular version of a module”. Bingo!
The next step was how to work out which of those candidates was the one actually installed. The key realization here was that I could use MetaCPAN to get version and file size info for all the modules in each candidate release and see how well they matched what was currently installed.
The whole process falls into several distinct phases…
The first phase finds the name, version, and file size of all the modules in the library being surveyed. (Taking care to handle an archlib nested within the main lib.)
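As a rough illustration of that first phase, here is a minimal Python sketch (not the surveyor's own code). The function name `survey_library` and the version regex are assumptions made for this example. It only handles the common `$VERSION = '...'` forms, and it does not handle an archlib nested inside the main lib.

```python
import os
import re

# Matches common assignments such as:  our $VERSION = '1.23';   $Foo::VERSION = 1.23;
# This is a simplification; real-world $VERSION lines can be far more creative.
VERSION_RE = re.compile(r"""\$(?:\w+::)*VERSION\s*=\s*['"]?v?([\d._]+)['"]?""")


def survey_library(lib_dir):
    """Return {module_name: {"version": str or None, "size": int}} for each .pm file."""
    modules = {}
    for root, _dirs, files in os.walk(lib_dir):
        for fname in files:
            if not fname.endswith(".pm"):
                continue
            path = os.path.join(root, fname)
            # Foo/Bar.pm relative to the library root becomes Foo::Bar
            rel = os.path.relpath(path, lib_dir)
            name = rel[:-3].replace(os.sep, "::")
            version = None
            with open(path, encoding="utf-8", errors="replace") as fh:
                for line in fh:
                    match = VERSION_RE.search(line)
                    if match:
                        version = match.group(1)
                        break
            modules[name] = {"version": version, "size": os.path.getsize(path)}
    return modules
```

Modules with no recognisable version come back with `version` set to `None`, which is where the file size becomes the main piece of evidence.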
Then, for every module it asks MetaCPAN for all the distribution releases that included that module version. For rarely changed modules in frequently released distributions there might be many candidates, so it tries to limit the number of candidates by also matching the file size. This is especially helpful for modules that don’t have a version number.
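To make the shape of that lookup concrete, here is a hedged sketch that only builds an ElasticSearch-style request body and makes no network calls. The field names (`module.name`, `module.version`, `stat.size`) and the returned fields are assumptions about how the file index is laid out, and `build_release_query` is a hypothetical helper. Check the MetaCPAN API documentation before relying on any of them.

```python
def build_release_query(module, version=None, size=None, limit=100):
    """Build a request body asking for files that provide a given module.

    Field names are assumptions for illustration, not a verified schema.
    """
    filters = [{"term": {"module.name": module}}]
    if version is not None:
        # Narrow to releases that shipped exactly this module version
        filters.append({"term": {"module.version": version}})
    if size is not None:
        # File size trims the candidate list further, and is the only
        # discriminator for modules without a version number
        filters.append({"term": {"stat.size": size}})
    return {
        "query": {"bool": {"filter": filters}},
        "_source": ["release", "distribution", "author", "path"],
        "size": limit,
    }
```

Calling `build_release_query("Foo::Bar", "1.23", 4567)` yields a body with three filters, while a module with no version would be queried by name and size alone.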
Then, for every candidate distribution release, MetaCPAN is queried to get the modules in the release, along with their version numbers and file sizes. These are compared to the data it gathered about the locally installed modules to yield a “fraction installed” figure between 0 and 1. The candidates that share the highest fraction installed are returned.
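The scoring step can be sketched in a few lines of Python. This assumes a module counts as installed only when both its version and its file size match exactly, which is a simplification. The names `fraction_installed` and `best_candidates` are made up for this example.

```python
def fraction_installed(release_modules, installed):
    """Fraction (0..1) of a release's modules found locally with the same
    version and file size. Both arguments map module name -> (version, size)."""
    if not release_modules:
        return 0.0
    matched = sum(
        1 for name, info in release_modules.items() if installed.get(name) == info
    )
    return matched / len(release_modules)


def best_candidates(candidates, installed):
    """Return {release: fraction} for the releases sharing the highest fraction."""
    scores = {
        release: fraction_installed(modules, installed)
        for release, modules in candidates.items()
    }
    top = max(scores.values(), default=0.0)
    return {release: score for release, score in scores.items() if score == top}
```

Because ties are kept rather than broken arbitrarily, a later step still has to choose among releases that score equally.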
Typically there’s just one candidate that has a fraction installed of 1. A perfect match. Sometimes the fraction is less than 1 for various obscure but valid reasons. Sometimes life isn’t so simple. There may be multiple candidates that have the same highest fraction installed value. So the next phase attempts to narrow the choice from among the “best candidates” for each module. The results are gathered into a two level hash of distributions and candidate releases.
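One way to picture that two level hash is the Python sketch below. The record layout (`distribution`, `release`, `fraction` keys) and the function name are assumptions for illustration, not the script's actual data structure.

```python
from collections import defaultdict


def gather_by_distribution(best_per_module):
    """Turn {module: [candidate, ...]} into
    {distribution: {release: {"fraction": float, "modules": [names]}}}.

    Each candidate is a dict with "distribution", "release" and "fraction" keys.
    """
    dists = defaultdict(dict)
    for module, candidates in best_per_module.items():
        for cand in candidates:
            entry = dists[cand["distribution"]].setdefault(
                cand["release"], {"fraction": cand["fraction"], "modules": []}
            )
            # Record which local modules point at this release
            entry["modules"].append(module)
    return dict(dists)
```

A distribution that ends up with a single release under it is settled. One with several releases still needs a decision.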
The final phase is the first to work in terms of distributions instead of modules. For each distribution it tries to choose among the candidate releases.
The method seems to work well. It identifies files with local changes. It deals gracefully with ‘remnant’ modules that were included in an old release but not in later ones. And it copes with distributions that have been split into separate distributions.
It reports progress and anything unusual to stderr and writes the list of distributions to stdout. You should investigate anything that’s reported to ensure that the chosen distribution is the right one.
I checked the results by creating a new library (see below) and running diff -r old_lib new_lib. I didn’t see any differences that I couldn’t account for.
The survey process is not fast. It can take a couple of hours on the first run for a large library. Most of that time is spent making MetaCPAN calls (lots and lots of MetaCPAN calls) so you’re dependent on network and MetaCPAN performance. Most of the calls are cached in an external file so later runs are much faster.
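The caching idea is simple enough to sketch. The following is a minimal, hypothetical file-backed cache in Python. The JSON format, the `DiskCache` name, and the save-on-every-miss policy are choices made for this example only.

```python
import json
import os


class DiskCache:
    """Remember results of expensive lookups in a JSON file between runs."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.data = json.load(fh)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) only on a miss."""
        if key not in self.data:
            self.data[key] = fetch(key)
            self._save()
        return self.data[key]

    def _save(self):
        # Write to a temporary file first so an interrupted run
        # cannot leave a half-written cache behind
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(self.data, fh)
        os.replace(tmp, self.path)
```

Wrapping each remote call in `cache.get_or_fetch(key, fetch)` means a second survey of the same library only goes to the network for keys it has not seen before.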
Using The Results
Using a list of distributions to recreate a library isn’t as straightforward as it might seem. You can’t just give the list to cpanm because it would try to install the latest version of any prerequisites. I looked at using --scandeps or topological sorting to reorder the list to put the prerequisites first. It didn’t work out. I also looked at using CPAN::Mini::Inject (and OrePAN and Pinto) to create a local MiniCPAN for cpanm to fetch from. They didn’t work out either, for various reasons.
In the end I added a --makecpan dir option so that the surveyor script itself would fetch the distributions and create a MiniCPAN for cpanm to use.
So now a typical initial run looks like this:
dist_surveyor --makecpan my_cpan /some/perl/lib/dir > installed_dists.txt
followed by building a new library from the results:
cpanm --mirror file:$PWD/my_cpan --mirror-only -l new_lib < installed_dists.txt
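For a directory to serve as a mirror here, it needs, as far as I understand the conventional CPAN layout, the tarballs under authors/id/ and a package index at modules/02packages.details.txt.gz. The Python sketch below writes such an index. It is an illustration under those assumptions, with hypothetical function names, and is not how dist_surveyor itself builds its MiniCPAN.

```python
import gzip
import os
import time


def author_path(author, archive):
    """Index path for a tarball, relative to authors/id/, e.g. T/TI/TIMB/Foo-1.0.tar.gz."""
    return f"{author[0]}/{author[:2]}/{author}/{archive}"


def write_packages_index(cpan_dir, packages):
    """Write modules/02packages.details.txt.gz under cpan_dir.

    packages is a list of (module, version, author, archive) tuples;
    a version of None is written as 'undef'.
    """
    lines = sorted(
        f"{mod:<30} {ver or 'undef':>8}  {author_path(author, archive)}"
        for mod, ver, author, archive in packages
    )
    stamp = time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.gmtime())
    header = [
        "File:         02packages.details.txt",
        "Description:  Package names found in directory $CPAN/authors/id/",
        "Columns:      package name, version, path",
        f"Line-Count:   {len(lines)}",
        f"Last-Updated: {stamp}",
        "",  # a blank line separates the header from the entries
    ]
    modules_dir = os.path.join(cpan_dir, "modules")
    os.makedirs(modules_dir, exist_ok=True)
    index = os.path.join(modules_dir, "02packages.details.txt.gz")
    with gzip.open(index, "wt", encoding="utf-8") as fh:
        fh.write("\n".join(header + lines) + "\n")
    return index
```

Because the index lists exactly one release per module, pointing cpanm at it with --mirror-only keeps prerequisites pinned to the surveyed versions instead of the latest ones on CPAN.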
If you need to rebuild the library, perhaps due to test failures, then it’s much faster to use a list of modules to drive cpanm. Fortunately dist_surveyor writes one for you:
cpanm --mirror file:$PWD/my_cpan --mirror-only -l new_lib < my_cpan/dist_surveyor/token_packages.txt
Speaking of test failures, I was surprised to see how often tests failed due to problems with prerequisites even though the distribution and its prerequisites had passed their tests when originally installed. For example, imagine distribution A v1, and its prerequisite B v1 are installed. Later, distribution B gets upgraded to v2 but the tests for distribution A don’t get rerun.
Reinstalling all the distributions forces all distributions to be tested with the prerequisites that are actually being used.