Interesting Items OSCON 2008 – Dealing with Streaming Data

This is a collection of links to things discussed, or just mentioned, at OSCON that I found interesting enough to note. Hopefully one of a series for OSCON 2008, as time allows.

These items are from a great talk on “A Streaming Database” by Rafael J. Fernández-Moctezuma at PDXPUG day.

Hancock is a C-based domain-specific language designed to make it easy to read, write, and maintain programs that manipulate large amounts of relatively uniform data. In addition to C constructs, Hancock provides domain-specific forms to facilitate large-scale data processing.

The CQL continuous query language (google)

Borealis is a distributed stream processing engine. Borealis builds on previous efforts in the area of stream processing: Aurora and Medusa.

CEDR is the Complex Event Detection and Response project from Microsoft Research.

Google Protocol Buffers “allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice”.
This seems similar to Thrift, which is “a software framework for scalable cross-language services development. It combines a powerful software stack with a code generation engine to build services that work efficiently and seamlessly between languages”.

Lies, damn lies, and search engine rankings

I started a related recent post with a quote that seems just as apt here:

“Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: ‘There are three kinds of lies: lies, damned lies, and statistics.’”
– Mark Twain

If you regularly use just one search engine, as I tend to do, it’s very easy to be lulled into a false sense of security about the quality and relevance of the results.

I was recently reminded of the significant differences that can occur in the results of different search engines. That, in turn, reminded me of tools I’d come across previously that highlight those differences, in particular one that gives a very clear picture of the differences in ranking. After a little digging I found it at langreiter.com (via a list of tools at http://www.seocompany.ca).

As a demonstration, here’s a comparison of the top results for +”perl programming” at Google (top) and Yahoo (bottom):

[Image: Google vs Yahoo rankings for “perl programming”, via langreiter.com]

and here’s the same for +”python programming”:

[Image: Google vs Yahoo rankings for “python programming”, via langreiter.com]

Each dot represents a result url, with the top ranked results on the left. Where a url appears in the top 100 results on both Google and Yahoo then a line is drawn between them to highlight the different rankings. On the site you can hover over the dots to see the corresponding url.
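For anyone who wants to reproduce that kind of comparison offline, here is a minimal sketch in Python. It assumes you already have two ranked lists of result URLs (how you fetch them is up to you), and the function names and example URLs are made up for illustration. It is not how the langreiter.com tool is implemented.

```python
def shared_rankings(results_a, results_b, limit=100):
    """Return (url, rank_a, rank_b) for URLs in the top `limit` of both lists.

    Ranks are 1-based. If a URL is repeated, its first (best) rank is used.
    """
    rank_b = {}
    for rank, url in enumerate(results_b[:limit], start=1):
        rank_b.setdefault(url, rank)

    pairs = []
    seen = set()
    for rank, url in enumerate(results_a[:limit], start=1):
        if url in rank_b and url not in seen:
            pairs.append((url, rank, rank_b[url]))
            seen.add(url)
    return pairs


def missing_from_other(results_a, results_b, top=20, limit=100):
    """URLs in the top `top` of A that do not appear in the top `limit` of B."""
    other = set(results_b[:limit])
    return [url for url in results_a[:top] if url not in other]


# Hypothetical data, purely for illustration.
engine_a = ["a.example", "b.example", "c.example", "d.example"]
engine_b = ["c.example", "e.example", "a.example", "f.example"]

pairs = shared_rankings(engine_a, engine_b)
absent = missing_from_other(engine_a, engine_b, top=2)
print(pairs)
print(absent)
```

Each tuple in `pairs` corresponds to one of the lines in the picture: the same URL with its rank in each engine.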

I remember being very surprised when I first saw these kinds of results a few years ago. I’m no less surprised now. In fact more so, as I’d (naïvely) expected Yahoo and Google to have converged somewhat in their concept of relevancy. At least for top results.

The particular queries I used above are not exceptional. I couldn’t find any query that didn’t have significant differences in rankings. Don’t believe me? Go try it yourself at http://www.langreiter.com/exec/yahoo-vs-google.html.

That so many of the top 20 from one search engine don’t even appear in the top 100 of the other is… is… well, I’m not quite sure what to make of it. At first sight it seems like a bad thing, but I also have to admit that it’s a good thing. At least in some ways. Diversity is important in any ecosystem.

If you only use one major search engine then you have to accept that you’re getting just one view of the internet. Most of the time you may be happy with that. It’s worth keeping it in mind, though, for those times when you’re struggling to find good results.

One way to avoid the issue is to use a meta search engine that’ll query multiple search engines for you and merge the results. There are lots of them.
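To give a feel for what “merge the results” can mean, here is a small Python sketch using reciprocal rank fusion, one common way of combining ranked lists. That choice is my assumption for the example; I don’t know which merging method any particular meta search engine uses. The function name, the constant `k=60`, and the URLs are all illustrative.

```python
from collections import defaultdict


def merge_results(result_lists, k=60):
    """Merge ranked URL lists with reciprocal rank fusion.

    Each URL scores 1 / (k + rank) in every list it appears in (rank is
    1-based), and the scores are summed. Higher total means a better
    merged position. Ties are broken alphabetically for a stable order.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical results from two engines.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "w.example", "x.example"]

merged = merge_results([engine_a, engine_b])
print(merged)
```

URLs that rank well in several engines float to the top, while a URL that only one engine knows about still gets a place further down.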

TIOBE or not TIOBE – “Lies, damned lies, and statistics”

[I couldn't resist the title, sorry.]

“Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: ‘There are three kinds of lies: lies, damned lies, and statistics.’”
– Mark Twain

I’ve been meaning to write a post about the suspect methodology of the TIOBE Index but Andrew Sterling Hanenkamp beat me to it (via Perl Buzz).

I do want to add a few thoughts though…

The TIOBE Programming Community Index is built on two assumptions:

  • that the number of search engine hits for the phrase “foo programming” is proportional to the “popularity” of that language.
  • that the proportionality is the same for different languages.
It’s not hard to pick holes in both of those assumptions.

They also claim that “The ratings are based on the number of skilled engineers world-wide, courses and third party vendors” but I can’t see anything in their methodology that supports that claim.
I presume they’re just pointing out the kinds of sites that are more likely to contain the “foo programming” phrase.

Even if you can accept their assumptions as valid, can you trust their maths? Back in Jan 2008, when I was researching views of perl, TIOBE was mentioned. So I took a look at it.

At the time Python had just risen above Perl, prompting TIOBE to declare Python the “programming language of the year”. When I did a manual search, using the method they described, the results didn’t fit.

I wrote an e-mail to Paul Jansen, the Managing Director and author of the TIOBE Index. Here’s most of it:

    Take perl and python, for example:

    I get 923,000 hits from google for +”python programming” and 3,030,000 for +”perl programming”. (The hits for Jython, IronPython, and pypy programming are tiny.) As reported by the “X-Y of approx Z results” at the top of the search results page.

    Using google blog search I get 139,887 for +”python programming” and 491,267 for +”perl programming”. (The hits for Jython, IronPython, and pypy programming are tiny.)

    So roughly 3-to-1 in perl’s favor from those two sources. It’s hard to imagine that “MSN, Yahoo!, and YouTube” would yield very different ratios.

    So 3:1 for perl, yet python ranks higher than perl. Certainly seems odd.

    Am I misunderstanding something?

I didn’t get a reply.
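The arithmetic behind that “roughly 3-to-1” is easy to check. Here is a quick Python sketch using only the hit counts quoted in the email; the dictionary layout and function name are just my choices for the example, and this is not TIOBE’s actual formula.

```python
# Hit counts as quoted in the email above (Jan 2008).
hits = {
    "google web search": {"python": 923_000, "perl": 3_030_000},
    "google blog search": {"python": 139_887, "perl": 491_267},
}


def perl_to_python_ratio(counts):
    """How many "perl programming" hits there are per "python programming" hit."""
    return counts["perl"] / counts["python"]


ratios = {source: perl_to_python_ratio(counts) for source, counts in hits.items()}

for source, ratio in ratios.items():
    print(f"{source}: {ratio:.2f} to 1 in perl's favor")
```

That works out to about 3.28 to 1 for web search and 3.51 to 1 for blog search.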

I did note that many languages had dipped sharply around that time and have risen sharply since. Is that level of month-to-month volatility realistic?

Meanwhile, James Robson has implemented an alternative, and open source, set of Language Usage Indicators. I’m hoping he’ll add trend graphs soon.

Update: the story continues.


Loaded Perl: A history in 530,000 emails

MarkMail is a free service for searching mailing list archives. They’ve just loaded 530,000 emails from 75 perl-related mailing lists into their index.

They’ve got a home page for searching these lists at http://perl.markmail.org/.

Of course the first thing people often do with new search engines is search for themselves. I’m no exception. Where MarkMail shines is the ability to drill down into the results in many ways with a single click (bugs, announcements, attachments etc). Worth a look.

The graph of messages per month is not just cute: you can click and drag over a range of bars to narrow the search to a specific period. It clearly shows my activity rising sharply in 2001 and then dropping to a lower level after 2004.

I’m particularly pleased that they’ve indexed the dbi-users, dbi-dev, and dbi-announce lists.