Fixing the POD synopsis in OSX – take 2 (perldoc, nroff and UTF-8)

Ever copied and pasted a chunk from perldoc output and found you were getting mysterious errors from perl? I have.

I’ve learnt to rewrite the ‘-‘ characters because although they look like ‘-‘ characters they’re really a unicode HYPHEN: U+2010. Some other chars get mangled too, but that’s the most frequent problem for me.

So I was delighted to see a blog post by marcus ramberg called Fixing the POD synopsis in OSX wherein he fingers nroff as being the problem and gives a simple solution:

alias perldoc='perldoc -t'

Trouble is using perldoc -t means you loose the nice bold text that nroff gives you. So I went digging…

It seems the problem only affects people using UTF-8 and that nroff has a -T option that lets you specify an output encoding to use. So if perldoc ran ‘nroff -Tascii’ instead of plain ‘nroff’ that would avoid the hypen problem and let us keep the bold text.

It turns out that perldoc has an option to specify the nroff command to use, so the solution is simple:

alias perldoc="perldoc -n 'nroff -Tascii'"

This is, of course, still a hack. My main worry is that pod docs using non-ascii characters may get mangled. A much better fix would be to arrange for the ascii characters to not get mapped to unicode at all. So I went digging again…

nroff calls groff -m tty-char ... and tty-char refers to a tty-char.tmac file that defines the character mappings. The groff man pages point me to groff_tmac man pages which tell me I can get groff to look for .tmac files elsewhere by passing a -Mdir option or setting the GROFF_TMAC_PATH environment variable.

I looked at the default file, /usr/share/groff/1.19.2/tmac/tty-char.tmac on my Mac, and… decided it was time to go to sleep! The formatting is probably simple enough but I’m out of tuits.

So, what’s needed is for someone to determine what change is needed to the tty-char.tmac file, or the files it refers to, to avoid unwanted conversions to unicode. Then put a modified file into a directory and either add a -Mdir option to the nroff alias above, or set the GROFF_TMAC_PATH environment variable. Setting the env var has the benefit of ‘fixing’ all the man pages.

So, anyone want to dig deeper? (For all I know the solution can already be found on google…)

TIOBE Index is being gamed

It is sad, but inevitable, that the TIOBE index of programming language “popularity” (sic) would be gamed.

Once you start measuring something, and advertising the results, people with an interest in particular outcomes naturally start to look for ways to influence those results. (It’s the Observer Effect writ large.)

The fact that TIOBE’s methodology, which I’ve discussed previously here and here, is simplistic makes it particularly open to gaming. Anyone, or any community, with access to many web pages can simply add the magic phrase “foo programming”, where foo is their language of choice, to get counted.

And it seems that’s exactly what the Delphi community did at the end of 20081. They made a concerted effort and it seems to have paid off. (I’d be very interested in hearing about similar behaviour in other language communities.)

Is that behaviour gaming? The author of the post who exhorted is readers to “Update your Delphi related blog or site to say Delphi programming on every page in visible text (update the template). Stand up and be counted. You can make a difference!” doesn’t seem to think so, as he also said “I am not suggesting we game the system, just that we help TCPI get an accurate count.

An accurate count of what, exactly? That’s always been the fundamental question with TIOBE. It should be obvious that most web pages that talk about “delphi programming” wouldn’t actually contain the phrase “delphi programming”. The same applies to every other language. That’s the paradox at the heart of the TIOBE Index. And yet, somehow, TIOBE seem to think that counting pages containing the phrase “delphi programming” lets them claim that:

The ratings are based on the number of skilled engineers world-wide, courses and third party vendors.

Eh? How can they possibly defend that claim? Certainly their documented definition doesn’t support it, or even mention it.

I presume they’re thinking that CV’s, job postings, and adverts are most likely to contain the magic phrase. It should be obvious, again, that the number of CV’s, job postings, and adverts referring to a given programming language would naturally only be a small fraction of the total web pages referring to the language. (And only distantly related to the “popularity” of a language.) Yet that “small fraction” is what TIOBE measure and make bold claims about.

The fact that TIOBE is making a comparison based on a small fraction makes it even more troubling that TIOBE CEO Paul Jansen appears to support language communities changing their pages to include the magic “foo programming” phrase. In an email quoted on delphi.org he says:

For your information, I think your action has already some effect. Tonight’s run shows that Delphi is #8 at this moment. There is a realistic chance that Delphi will become “TIOBE’s Language of the Year 2008″

He’s endorsing the artificial insertion of the magic phrase. Clearly this distorts the TIOBE index in favour of language communities that infect as many pages as possible with the magic phrase.

That sure seems like an invitation to game the system! It’s likely to lead to other language communities doing the same, and so to further devaluation of the TIOBE Index.

(For alternatives to TIOBE you could look at sites like http://www.langpop.com/, James Robson’s Language Usage Indicators, or my popular comparison of job trends blog post with ‘live’ graphs.)

I have, on a couple of occasions, used the phrase “perl programming” in blog posts for my own amusement, and linked it to my original TIOBE or not TIOBE – “Lies, damned lies, and statistics” post. I haven’t suggested that others do the same. TIOBE’s endorsement of artificial insertion changes that. Now it seems like we’re going to get a dumb “race to the bottom” to see which language community controls the most web pages.

If, as a result, the TIOBE Index is affected significantly, then I simply hope they’ll drop their pretentious claims and state clearly exactly what they’re counting, how they’re doing it, and what it means: not much.


1. Many thanks to Barry Walsh for his blog post that alerted me to this.

Thanks, Iron Man, for the good excuse to perl blog

I’ve been thinking that I haven’t blogged much lately. Assorted half-baked ideas would cross my mind and then evaporate before I’d find the time, or motivation, to actually start writing.

The folks at the Enlightened Perl Organisation have solved the motivation problem by announcing the Iron Man Blogging Challenge: in short, “maintain a rolling frequency of 4 posts every 32 days, with no more than 10 days between posts“.

So about one post a week. I can aim for that!

Can you? “The rules are very simple: you blog, about Perl. Any aspect of Perl you like. Long, short, funny, serious, advanced, basic: it doesn’t matter. It doesn’t have to be in English, either, if that’s not your native language.” Why not try? Help yourself and help perl at the same time.

I’ll try to capture the half-baked ideas for perl blog posts as they cross my mind, then build on them as time and mood allow. Hopefully about one a week will mature into an actual blog post.

Meanwhile…

I remembered that almost exactly a year ago schwern blogged that he was “horrified at the junk which shows up when you search for perl blog on Google“. It seems the situation hasn’t improved much since. The planet.perl.org site is top, but the rest are a bit of a mishmash.

The problem is partly that “perl blog” isn’t a great search term. Google naturally gives preference to words that appear in urls and titles (all else being equal), but blogs rarely explicitly call themselves blogs on their pages or urls. I suspect many on the first page of results are there because ‘blog’ appears in the url. (To help out I’ve included “perl blog” in the title of this post :)

Searching for “perl blogs” (plural) works better because it finds pages talking about perl blogs, which is useful when searching for perl blogs.

One entry in the “perl blogs” results was the Perl Foundation’s wiki page listing perl blogs. That was new to me. This blog wasn’t on it so I’ve added it. Got a perl blog, or know someone who has, that’s not listed on that page? Go and add it, now! It’ll only take a moment.

Along similar lines, I’ve added the phrases “perl blog” and “perl programming” to the sidebar of my blog pages. The first is to help people searching for “perl blog”. The second is mostly for my own amusement.