Coding

Statistically improbable phrases on Google AppEngine

I read about Google AppEngine early this morning, and applied for an invite. Google’s issuing beta invites to the first 10,000 users. I was pretty convinced I wasn’t among those, but it turns out I was lucky.

AppEngine lets you write web apps that Google hosts. People have been highlighting that it gives you access to the Google File System and BigTable for the first time. But to me, that isn’t a big deal. (I’m not too worried about reliability, and MySQL / flat files work perfectly well for me as a data store.)

What’s more interesting is that, unlike Amazon’s EC2 and S3, this is free up to a certain quota. And you get a fair bit of processing power and bandwidth for free. One of the reasons I’ve held back on creating some apps was simply because it would take away too much bandwidth / CPU cycles from my site. (I’ve had this problem before.) Google’s quota is 10 GB of bandwidth per day (which is about 30 times what my site uses). And this is on Google’s incredibly fast servers. It also offers 200 million megacycles a day. That’s like a dedicated 2.3 GHz processor (200 million megacycles = 200,000 GHz x 1 second ~ 2.3 GHz x 86,400 seconds/day) — better, because this is the average capacity, not peak capacity. The only restriction that really worries me is that only 3 apps are allowed per developer.

So I decided to take a shot at publishing some code I’d kept in reserve for a long time. You may remember my statistical analysis of Calvin & Hobbes. For this, I’d created a Perl script that could generate Statistically Improbable Phrases (SIPs) for any text, based on a (somewhat limited) 23MB corpus of ebooks that I had. I’d wanted to put that up on my website, but …

AppEngine only supports Python. So the first task was to get Python, and then to learn Python. The only saving grace was that I was just cutting-and-pasting most of the time. Google wasn’t helping:

[Screenshot: Google AppEngine over-quota error]

Anyway, the site is up. You can view it at sip.s-anand.net for now. Just type a URL, and it’ll tell you the improbable words in that site.


Technical notes

I realise that these are statistically improbable words, not phrases. I’ll get to the phrases in a while.

The logic is simple (there’s a sketch of it after this list):

  • Get the frequency of words in a corpus. I pre-generated this file. It has over 100,000 words.
  • Get the URL as text. Rather than muck around with Python, I decided to use the W3 html2txt service.
  • Convert the text to words. Splitting text into words is tricky. For now, I’m simply assuming that any group of letters is a word, and anything that’s not a letter is a word delimiter.
  • Find the relative frequency (improbability) of words. This is the frequency in the URL divided by the frequency in the corpus, normalised (i.e. scaled so that the maximum value is 1.0).
  • Create a tag cloud. I use the word frequency as the size and the improbability as the colour. You need a bit of mathematical jugglery to get the pattern right. Right now, I’m taking the 6th root of the improbability and the logarithm of the frequency to get a reasonably smooth tag cloud.
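
Here’s a minimal sketch of that logic in Javascript. (The live app is in Python on AppEngine; improbableWords, corpusFreq and the smoothing constants here are illustrative, not the actual code.)

function improbableWords(text, corpusFreq) {
    // Any group of letters is a word; anything else is a delimiter
    var words = text.toLowerCase().split(/[^a-z]+/), freq = {};
    for (var i = 0; i < words.length; i++) {
        if (words[i]) { freq[words[i]] = (freq[words[i]] || 0) + 1; }
    }

    // Improbability = frequency in the text / frequency in the corpus
    var improb = {}, max = 0;
    for (var w in freq) {
        improb[w] = freq[w] / (corpusFreq[w] || 1);
        if (improb[w] > max) { max = improb[w]; }
    }

    // Normalise improbability to a maximum of 1.0, then smooth:
    // 6th root of improbability for colour, log of frequency for size
    var cloud = [];
    for (var w in improb) {
        cloud.push({
            word:   w,
            colour: Math.pow(improb[w] / max, 1/6),
            size:   Math.log(freq[w] + 1)
        });
    }
    return cloud;
}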

The source code is at statistically-improbable-phrases.googlecode.com.

Update: 12-Apr-2008. I’ve added some interactivity. You can play with the contrast and font size, and filter out common or infrequent words.

Update: 22-Apr-2008. Added concordance. You can click on a word and see the context in which it appears.

Chaining functions in Javascript

One of the coolest features of jQuery is the ability to chain functions. Each function returns the calling object. So instead of writing:

var a = $("<div></div>");
a.appendTo($("#id"));
a.hide();

… I can instead write:

$("<div></div>").appendTo($("#id")).hide();

A reasonable number of predefined Javascript functions can be used this way. I make extensive use of it with the String.replace function.
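
For example, a string can be cleaned up in a single expression (an illustrative snippet):

// Each replace returns the new string, so the calls chain naturally
var slug = "  Calvin & Hobbes  "
    .replace(/&/g, "and")
    .replace(/^\s+|\s+$/g, "")   // trim
    .replace(/\s+/g, "-")
    .toLowerCase();              // "calvin-and-hobbes"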

But where this feature is not available, you can create it in a fairly unobtrusive way. Just add this code to your script:

Function.prototype.chain = function() {
    var that = this;
    return function() {
        // New function runs the old function
        var retVal = that.apply(this, arguments);
        // Returns "this" if the old function returned nothing
        if (typeof retVal == "undefined") { return this; }
        // else returns the old value
        else { return retVal; }
    };
};

var chain = function(obj) {
    for (var fn in obj) {
        if (typeof obj[fn] == "function") {
            obj[fn] = obj[fn].chain();
        }
    }
    return obj;
};

Now, chain(object) returns the same object, with all its functions replaced with chainable versions.

What’s the use? Well, take the Google AJAX search API. Normally, to search for the top 8 “Harry Potter” PDFs on esnips.com, I’d have to do:

    var searcher = new google.search.WebSearch();
    searcher.setQueryAddition("filetype:PDF");
    searcher.setResultSetSize(google.search.Search.LARGE_RESULTSET);
    searcher.setSiteRestriction("esnips.com");
    searcher.setSearchCompleteCallback(onSearch);
    searcher.execute("Harry Potter");

Instead, I can now do this:

chain(new google.search.WebSearch())
.setQueryAddition("filetype:PDF")
.setResultSetSize(google.search.Search.LARGE_RESULTSET)
.setSiteRestriction("esnips.com")
.setSearchCompleteCallback(onSearch)
.execute("Harry Potter");

(On the whole, it’s probably not worth the effort. Somehow, I just like code that looks like this.)

Javascript error logging

If something goes wrong with my site, I like to know about it. My top three problems are:

  1. The site is down
  2. A page is missing
  3. Javascript isn’t working

This is the last of 3 articles on these topics.

I am a bad programmer

I am not a professional developer. In fact, I’m not a developer at all. I’m a management consultant. (Usually, it’s myself I’m trying to convince.)

Since no one pays me for what little code I write, no one shouts at me for getting it wrong. So I have a happy and sloppy coding style. I write what I feel like, and publish it. I don’t test it. Worse, sometimes, I don’t even run it once. I’ve sent little scripts off to people which wouldn’t even compile. I make changes to this site at midnight, upload it, and go off to sleep without checking if the change has crashed the site or not.

But no one tells me so

At work, that’s usually OK. On the few occasions where I’ve written Perl scripts or VB Macros that don’t work, people call me back within a few hours, very worried that THEY’d done something wrong. (Sometimes, I don’t contradict them.) It can be quite a stressful experience.

On my site, I don’t always get that kind of feedback. People just click the back button and go elsewhere.

Recently, I’ve been doing more Javascript work on my site than writing stuff. Usually, the code works for me. (I write it for myself in the first place.) But I end up optimising for Firefox rather than IE, and for the plugins I have, etc. When I try the same app a few months later on my media PC, it doesn’t work, and shockingly enough, no one’s bothered telling me about it all these months. They’d just click, nothing would happen, and they’d vanish.

But their browsers can tell me

The good part about writing code in Javascript is that I can catch exceptions. Any Javascript error can be trapped. So since the end of last year, I’ve started wrapping almost every Javascript function I write in a try {} catch() {} block. In the catch block, I send a log message reporting the error.

The code looks something like this:

function log(e, msg) {
    // Append every property of the error object to the message
    for (var i in e) { msg += i + "=" + e[i] + "\n"; }
    // Send the message to the server as an image request
    (new Image()).src = "log.pl?m=" + encodeURIComponent(msg);
}

function abc() {
    try {
    // ... function code
    } catch(e) { log(e, "abc"); }
}

Any time there’s an error in function abc, the log function is called. It sends the function name ("abc") and the error details (the contents of the error event) to log.pl, which stores the error, along with details like the URL, browser, time and IP address. This way, I know exactly what error occurs, and where.

This is fantastic for three reasons.

  • It tells me when I’ve goofed up. This is instantaneous feedback. I don’t have to wait for a human. If you run my program on your machine, and it fails, I get to know immediately. (Well, as soon as I read the error log, at least.)
  • It tells me where I’ve goofed up. The URL and the function name clearly indicate the point of failure.
  • It tells me why I’ve goofed up. Almost. Using the browser name and the error message, I can invariably pinpoint the reason for the error. Then it’s just a matter of taking the time to fix it.

I’d think this sort of error reporting should be the norm for any software. At least for a web app, given how easy it is to implement.

Scraping RSS feeds using XPath

If a site doesn’t have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what’s changed on a page.

My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don’t offer a feed. I don’t want to track all the other junk on that page. Just the top 250.

There’s a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples:

//a Matches all <a> links
//p/b Matches all <b> bold items in a <p> para. (the <b> must be immediately under the <p>)
//table//a Matches all links inside a table (the links need not be immediately inside the table — anywhere inside the table works)

You get the idea. It’s like a folder structure. / matches a tag that’s immediately below. // matches a tag that’s anywhere below. You can play around with XPath using the Firefox XPath Checker add-on. Try it — it’s much easier to try it than to read the documentation.
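
If you’d rather try XPath programmatically, Firefox (and other browsers) expose it through document.evaluate. A minimal sketch, run against the current page:

// Collect all links inside tables on the current page
var result = document.evaluate("//table//a", document, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i = 0; i < result.snapshotLength; i++) {
    var link = result.snapshotItem(i);
    console.log(link.textContent, link.href);  // needs Firebug or similar
}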

The following XPath matches the IMDb Top 250 exactly.

//tr//tr//tr//td[3]//a

(It’s a link inside the 3rd column in a table row in a table row in a table row.)

Now, all I need is something that converts that to an RSS feed. I couldn’t find anything on the Web, so I wrote my own XPath server. The URL:

www.s-anand.net/xpath?
url=http://www.imdb.com/chart/top&
xpath=//tr//tr//tr//td[3]//a

When I subscribe to this URL on Google Reader, I get to know whenever there’s a new movie on the IMDb Top 250.

This gives only the names of the movies, though, and I’d like the links as well. The XPath server supports this. It accepts a root XPath, and a bunch of sub-XPaths. So you can say something like:

xpath=//tr//tr//tr title->./td[3]//a link->./td[3]//a/@href

This says three things:

//tr//tr//tr Pick all rows in a row in a row
title->./td[3]//a For each row, set the title to the link text in the 3rd column
link->./td[3]//a/@href … and the link to the link href in the 3rd column
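
The root/sub-XPath idea maps neatly onto relative XPath evaluation. Here’s a sketch of the logic in browser Javascript (my actual server is implemented differently; extractItems and its signature are made up for illustration):

function extractItems(doc, rootXPath, subXPaths) {
    function snap(xpath, context) {
        return doc.evaluate(xpath, context, null,
            XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    }
    // Apply the root XPath, then each sub-XPath relative to every match
    var rows = snap(rootXPath, doc), items = [];
    for (var i = 0; i < rows.snapshotLength; i++) {
        var row = rows.snapshotItem(i), item = {};
        for (var field in subXPaths) {
            var match = snap(subXPaths[field], row);
            // textContent works for attribute nodes like @href too
            if (match.snapshotLength) {
                item[field] = match.snapshotItem(0).textContent;
            }
        }
        items.push(item);
    }
    return items;
}

// e.g. extractItems(document, "//tr//tr//tr",
//     { title: "./td[3]//a", link: "./td[3]//a/@href" })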

That provides a more satisfactory RSS feed — one that I’ve subscribed to, in fact. Another feed I track is Mininova’s top-seeded movies category.

You can whip up more complex examples. Give it a shot. Start simple, with something that works, and move up to what you need. Use XPath Checker liberally. Let me know if you have any issues. Enjoy!

Website load distribution using Javascript

My music search engine shows a list of songs as you type — sort of like Google’s autosuggest feature. I load my entire list of songs upfront for this to work. Though it’s compressed to load fast, each time you load the page, it downloads about 500KB worth of song titles.

My allotted bandwidth on my hosting service is 3GB per month. To ensure I don’t exceed it, I uploaded the songs list to an alternate free server: Freehostia. This keeps my load down. If I exceed Freehostia’s limit, my main site won’t be affected — just the songs. I also uploaded half of them to Google Pages, to be safe.

This all worked fine… until recently. Google Pages has a relatively low bandwidth restriction. (Not sure what it is, and they won’t reveal it, but my site is affected.) Freehostia is doing some maintenance, and their site goes down relatively often. So my song search goes down when either of these goes down.

Now, these are rarely down simultaneously. Just one or the other. But whenever Freehostia is down, I can’t listen to one bunch of songs. When Google Pages is down, I can’t listen to another.

What I needed was a load distribution set-up. So I’ve made one in JavaScript.

Normally, I load the song list using an external javascript. I have a line that says:

<script src="http://sanand.freehostia.com/songs/..."></script>

… and the song list is loaded from Freehostia.

What I’d like to do is:

loadscripts(
    "list.hasLoaded()", [
    "http://sanand.freehostia.com/songs/...",
    "http://root.node.googlepages.com/...",
    "http://www.s-anand.net/songs/..."
])

If the function can’t load it from the first link, it loads it from the second, and so on, until list.hasLoaded() returns true.

Here’s how I’ve implemented the function:

function loadscripts(check, libs) {
    document.write('<script src="' + libs.shift() +
        '"><' + '/script>');

    if (libs.length>0) document.write('<script>if (!(' + check + ')) {'
        + 'loadscripts("' + check + '",["' + libs.join('","')+'"])'
        + '}</' + 'script>')
}

The first document.write loads the first script in the list. The if condition checks whether there are more scripts to load. If yes, the second document.write writes a script that

  • checks whether the script is already loaded, and
  • if not, loads the rest of the scripts using the same function.

I’ve also expanded the set of sites that host these free songs. So now, as long as my main site works, and at least one of the other sites works, the search engine will work.

PS: You can easily expand this to do random load distribution as well.
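
For instance, shuffling the list before handing it to loadscripts spreads the first hit evenly across the servers (loadscriptsRandom is just an illustrative name):

function loadscriptsRandom(check, libs) {
    // Fisher-Yates shuffle, so no server is always tried first
    for (var i = libs.length - 1; i > 0; i--) {
        var j = Math.floor(Math.random() * (i + 1));
        var t = libs[i]; libs[i] = libs[j]; libs[j] = t;
    }
    loadscripts(check, libs);
}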