How I do things

Handling missing pages

If something goes wrong with my site, I like to know of it. My top three problems are:

  1. The site is down
  2. A page is missing
  3. Javascript isn’t working

This article covers the second topic.

One thing I’m curious about is hits to non-existent pages (404s) on my site. I usually get 404s because:

  • I renamed the page
  • Someone typed a wrong URL
  • Someone followed a wrong link

Find the 404

The first problem is knowing when someone gets a 404. I’ve seen sites that tell you to contact the administrator in case of a 404. That’s crazy. The administrator should detect 404s automatically! Almost every web server provides this facility.

The real issue is attention. I receive 700 404s a day. That’s too many to inspect manually. And most of these are not for proper web pages, but for images (for example, almost all my 404s used to be for browsers requesting favicon.ico) or weird MS Office files.

I’m interested in a small subset of 404 errors. Those that hit a web page, not support files. And those requested by a human, not a search engine or a program.

A decent way of filtering these is to use Javascript in your 404 page. Javascript is typically executed only by browsers (i.e. humans, not search engines), and only in a web page (not images, etc.). So if you serve Javascript in your 404 page, and it gets executed, it’s likely to be a human requesting a web page.

I have a piece of Javascript in my custom 404 page that looks something like this:

<script>(new Image()).src = "/log.pl";</script>

Every time this code runs, it loads a new image. The source of the image is a Perl script, log.pl. Every time log.pl is accessed, it logs the URL from which it was called. So I can be reasonably sure that these are web pages a human tried to access.
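
If you want to be explicit about what gets logged, a variant like this passes the missing URL and the referring page as query parameters. (The parameter names are just illustrative; adapt them to whatever your logging script expects.)

<script>
// Sketch of a more explicit logger: pass the missing page and the
// referrer as query parameters instead of relying on the Referer header.
// "page" and "from" are placeholder parameter names.
(new Image()).src = "/log.pl" +
    "?page=" + encodeURIComponent(document.location.href) +
    "&from=" + encodeURIComponent(document.referrer);
</script>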

The reduction in volume is tremendous. In a typical month, I get ~20,000 404 errors. With the Javascript logging, it’s down to around 200 a month, and most of those are quite meaningful.

Point to the right page

Sometimes, the 404 happens because I changed the URLs. I keep fiddling with the site structure. Someone may have links to an old page that I’ve since renamed. I may not even know that. Even if I did, I can’t expect them to act on a request to change the link. So I’ve got to handle it.

The quickest way, I find, is to use Apache’s mod_rewrite. You can simply redirect the old URL to the new URL. For example, I used to have a page at /calvin.html which I now serve at /calvinandhobbes.html. That becomes a simple line in my .htaccess file:

RewriteRule calvin.html  calvinandhobbes.html

I don’t do this for every site restructuring, though. I just restructure, wait for someone to request a wrong page, and when my 404 error log warns me, I add a line to the .htaccess. That keeps the redirections to a minimum, and only for links that are actually visited.

Be flexible with the URL structure

Sometimes people type in a wrong URL. Often, these are unintentional. Here are some common misspellings for my Hindi songs search.

s-anand.net/hindi/
s-anand.net/Hindi
s-anand.net/hiundi

Occasionally, people are exploring the structure of my site:

s-anand.net/excel
s-anand.net/music
s-anand.net/hits

I need to decide what to do with both cases. For the former, sometimes my URL structure is too restrictive. I mean, why should someone have to remember to type /hindi instead of /Hindi or /hindi/? Who cares about case? Who cares about a trailing slash?

In such cases, I map all the variants to the right URL using mod_rewrite. For example, typing s-anand.net/HiNDi (with or without caps, with or without a slash at the end) will still take you to the right page.
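
In .htaccess, that mapping can look something like this (the target hindi.html is a placeholder for whatever the real page is):

RewriteEngine On
# Map /hindi, /Hindi, /HiNDi/ and so on to one page.
# [NC] makes the match case-insensitive; /? makes the trailing slash optional.
RewriteRule ^hindi/?$  hindi.html  [NC]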

As I keep discovering new misspellings, I take a call on whether to add a variant or not. The decision is usually based on volume. If two people make the same spelling mistake in a day, I almost certainly add the variant. Most of the time, though, it’s just a typing error like /hiundi, which doesn’t recur more than once a month.

Provide search

To handle the exploratory URLs, and people following wrong links, I’ve turned my custom 404 page into a search engine.

For example, when someone types s-anand.net/excel, I know they’re searching for Excel. So I just do a Google Custom Search within my site for “excel” — that is, for whatever follows the domain name.

It’s a bit more complex than that, actually. I do a bit of tweaking to the URL, like converting punctuation (underscores, hyphens, full-stops, etc.) to spaces, removing common suffixes (.html, .htm) and ignoring numbers. Quite often, it matches something on my site that they’re looking for. If not, ideally, I ought to try various alternatives and subsets of the original search string to figure out a good match. But given that the number of mismatches is down to about one a day, I’m fairly comfortable right now.
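
Roughly, the 404 page does something like this. (A simplified sketch: for illustration it redirects to a site-restricted Google search rather than the Google Custom Search box I actually embed.)

<script>
// Turn the requested path into a query, then search this site for it.
var q = document.location.pathname
    .replace(/\.html?$/, "")      // drop common suffixes
    .replace(/[_\-.\/]+/g, " ")   // punctuation to spaces
    .replace(/\d+/g, " ")         // ignore numbers
    .replace(/\s+/g, " ")
    .replace(/^\s+|\s+$/g, "");
if (q) {
  // Illustration only: a plain site-restricted Google search.
  document.location.replace(
    "http://www.google.com/search?q=" +
    encodeURIComponent("site:s-anand.net " + q));
}
</script>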

What this means, incidentally, is that my site is, by default, a search engine for itself. To search for movie-related stuff on my site, just type s-anand.net/movie and you get a search for the word “movie” on my site. (Sort of like a9.com, where visiting a9.com/keyword searches for the keyword.)

Monitoring site downtime

If something goes wrong with my site, I like to know of it. My top three problems are:

  1. The site is down
  2. A page is missing
  3. Javascript isn’t working

I’ll talk about how I manage these over 3 articles.

My site used to go down a lot. Initially that was because I kept playing around with mod_rewrite and other Apache modules without quite understanding them. I’d make a change and upload it without testing. (I’m like that.) And then I’d go to sleep.

Next morning, the site’s down, and has been down all night.

This was a bit annoying. Partly because I couldn’t use my site, but mostly because of the “Oh yeah, sorry — I goofed up last night” replies I had to send out the next day.

So I started using Site24x7 to track if my website was down. It’s a convenient (and free) service. It pings my site every hour. If it’s down, I get an SMS. If it’s back up, I get an SMS. It also keeps a history of how often the site is down.

Site24x7

Over time, I stopped making mistakes. But my site still kept going down, thanks to my hosting service (100WebSpace). When I goof up, it’s just an annoyance, and I can fix it. But when my hosting service goes down, it’s more than that. My site is where I listen to music, read comics, read RSS feeds, use custom search engines, watch movies, browse for books, etc. Not being able to do these things — nor fix the site — is suffocating.

Worse, I couldn’t sleep. I use my mobile as my alarm. It’s annoying to hear an SMS from under your pillow at 3am every day — especially if it says your site is down.

So I switched to HostGator a few months ago. Nowadays, the site is down a lot less. (In times of trouble, it becomes sluggish, but doesn’t actually go down.)

That came at a cost, though. I was paying 100WebSpace about $25 per annum. I’m paying HostGator about $75 per annum. Being the kind that analyses purchases to death, the big question for me was: is this worth it? This is where my other problem with the site being down kicks in. I get a bit of ad revenue from my site, and I lose that when the site’s down. (Not that it’s much. Still…)

According to Site24x7, my site was up ~98% of the time. So I’m losing about 2% of my potential ad revenue. For the extra $50 to be worth it, my ad revenue needs to be more than $50 / 2% = $2,500 per annum. I’m nowhere near it. So the switch isn’t actually a good idea economically, but it does make life convenient (which is pretty important) and I sleep better (much more so).

The important thing, I’ve realised, is not just to track this stuff. That’s useful, sure. But what really made Site24x7 useful to me is that it would alert me when there was a big problem.

There are many kinds of alerting.

There’s a report you can view whenever you remember to view it. (It could be an RSS feed, so at least you won’t have to remember the site. But you still need to read your feeds.)

Then there’s the more pushy alerting: sending you an e-mail. That may catch you instantly for the half of the day that you’re online. Or, if you’re like me, it may completely escape your attention. (I don’t read e-mail.)

And then there’s the equivalent of shaking you by the shoulder — getting an SMS. (At least, that’s how it is for me. Incidentally, I don’t reply to SMS either. Calling me gets a reply. Nothing else.)

The type of alerting is clearly a function of the severity of the problem. Wake me up when my site goes down. Don’t wake me up if a link is broken.

Site24x7 sends me an SMS when my site is down. Fits the bill perfectly.

Managing feed overload

I have only two problems with Google Reader.

The first is that it doesn’t support authenticated feeds. Ideally, I’d have liked a single reading list that combines my e-mail with newsfeeds. GMail offers RSS feeds of your e-mail. But the feeds require authentication (obviously) and Google Reader doesn’t support that right now. (So I usually don’t read e-mail. 🙂)

The second is that it’s tough to manage large feeds. It’s a personal quirk, really. I like to read all entries. If there are 100, I read all 100. If there are 1000, I struggle but read all 1000. I’m too scared to “Mark all read” because there are some sources that I don’t want to miss.

The 80-20 rule is at work here. There are some prolific writers (like Scoble) who write many entries a day. There are some prolific sources (del.icio.us or digg). I can’t keep up with such writers and sources. I don’t particularly want to. If I miss one day of del.icio.us popular items, I’ll just read the next day’s.

With Google Reader, that makes me uneasy. I don’t like having 200 unread items. I don’t like to mark them all read.

In such cases, popurls’ approach is useful. It shows the top 15-30 entries of the popular sites as a snapshot. Any time you’re short of things to read, visit this. If you’re busy, don’t.

Using Google’s AJAX Feed API, it’s quite trivial to build your own feed reader. So I cloned popurls’ layout into my bookmarks page, and put in feeds that I like.

You can customise my bookmarks page to use your own feeds. Save the page, open it in Notepad, and look for existing feeds. They’ll look like this:

"hacker news" : {
    entries:15,
    url:"http://news.ycombinator.com/rss"
},

The first line (“hacker news”) is the title of the feed. You can call it what you want. Set entries to the number of feed entries you want to show. Set url to the RSS feed URL. Save it, and you have your own feed reader. (If you want to put it up on your site, you may want to change the Google API key.)

Try it! Just save this page and edit the feeds.


Here, I must point out three things about Google’s AJAX Feed API that make it extremely powerful.

The obvious one is that it allows Javascript access to RSS in a very easy way. That makes it very easy to integrate with any web page.

The second is subtler — it includes historical entries. So even if an RSS feed has only 10 entries, I can pick up the last 100 or 1,000, as long as Google has known about the feed for long enough. This is what makes Google Reader more of a platform than a simple feed reader application. Google Reader is a feed archiver — not just a feed reader.
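
Here’s roughly what pulling historical entries looks like with the AJAX Feed API. (A simplified sketch, not the exact code on my bookmarks page; YOUR-API-KEY is a placeholder.)

<div id="feed"></div>
<script src="http://www.google.com/jsapi?key=YOUR-API-KEY"></script>
<script>
google.load("feeds", "1");
google.setOnLoadCallback(function () {
  var feed = new google.feeds.Feed("http://news.ycombinator.com/rss");
  feed.includeHistoricalEntries();   // entries Google has archived, not just the live ones
  feed.setNumEntries(100);           // far more than the feed itself carries
  feed.load(function (result) {
    if (result.error) { return; }
    var html = [];
    for (var i = 0; i < result.feed.entries.length; i++) {
      var e = result.feed.entries[i];
      html.push('<a href="' + e.link + '">' + e.title + '</a>');
    }
    document.getElementById("feed").innerHTML = html.join("<br>");
  });
});
</script>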

The third (I’m a bit crazy here) is that you can use it to schedule tasks. Google’s FeedFetcher refreshes feeds every 3 hours or so. If you want to do something every three hours (or some multiple thereof — say every 24 hours), you can write a program that does what you want, and subscribe to its output as a feed.

You may notice that I have a “Referrers to s-anand.net” feed on my bookmarks page. These are the sites someone clicked on to visit my site. I have a PHP application that searches my access log for external referrers. Rather than visit that page every time, I just made an RSS feed out of it and subscribed to it. Every three hours or so, Google accesses the URL, the script searches my access.log, and Google Reader archives the latest results. So, even after my access.log is trimmed by the server, I have it all on Google Reader to catch up with later.

Since Google doesn’t forget to ping, I can schedule some fairly time-critical processes this way. For instance, if I wanted to download each Dilbert strip, every day as it arrives, I can create an application that downloads the file and returns a feed entry. Now, I don’t need to remember to run it every day! I just subscribe to the application on Google Reader, and Google will remind the application to run every 3 hours. (I know — I could use a crontab, but somehow, I like this.) Plus I would get the Dilbert strip on my feed reader as a bonus.


Update: Google has just released PartnerBar, which is a more flexible way of viewing a snapshot of feeds.

Scraping RSS feeds using XPath

If a site doesn’t have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what’s changed on a page.

My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don’t offer a feed. I don’t want to track all the other junk on that page. Just the top 250.

There’s a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples:

//a           Matches all <a> links
//p/b         Matches all <b> bold items in a <p> para (the <b> must be immediately under the <p>)
//table//a    Matches all links inside a table (the links need not be immediately inside the table — anywhere inside the table works)

You get the idea. It’s like a folder structure. / matches a tag that’s immediately below. // matches a tag that’s anywhere below. You can play around with XPath using the Firefox XPath Checker add-on. Try it — it’s much easier to try it than to read the documentation.
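
You can also test an expression straight from a browser that supports DOM XPath (Firefox does; IE doesn’t), with a small helper like this:

<script>
// Return the nodes on the current page that match an XPath expression.
function xpath(expr) {
  var result = document.evaluate(expr, document, null,
      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var nodes = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}
// e.g. xpath("//table//a").length tells you how many links the
// expression matches on this page.
</script>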

The following XPath matches the IMDb Top 250 exactly.

//tr//tr//tr//td[3]//a

(It’s a link inside the 3rd column in a table row in a table row in a table row.)

Now, all I need is something that converts that into an RSS feed. I couldn’t find anything on the Web, so I wrote my own XPath server. The URL:

www.s-anand.net/xpath?
url=http://www.imdb.com/chart/top&
xpath=//tr//tr//tr//td[3]//a

When I subscribe to this URL on Google Reader, I get to know whenever there’s a new movie on the IMDb Top 250.

This gives only the names of the movies, though, and I’d like the links as well. The XPath server supports this. It accepts a root XPath, and a bunch of sub-XPaths. So you can say something like:

xpath=//tr//tr//tr title->./td[3]//a link->./td[3]//a/@href

This says three things:

//tr//tr//tr              Pick all rows in a row in a row
title->./td[3]//a         For each row, set the title to the link text in the 3rd column
link->./td[3]//a/@href    … and the link to the link href in the 3rd column

That provides a more satisfactory RSS feed — one that I’ve subscribed to, in fact. Another one I track is mininova’s list of top-seeded movies.

You can whip up more complex examples. Give it a shot. Start simple, with something that works, and move up to what you need. Use XPath Checker liberally. Let me know if you have any issues. Enjoy!

Advanced Google Reader

I’ve stopped visiting websites.

No, really. There’s only one website I visit these days. Google Reader.

Google Reader screenshot

Google Reader is a feed reader. If you want to just catch up on the new stuff on a site, you can add the site to Google Reader. Anything new that is published on the site appears in Google Reader. Right now, I’ve subscribed to over 50 feeds. There’s no way I can remember to visit 50 sites — so I’m actually able to read more and miss less.

In a sense, that’s the essence of using feeds to read stuff: remember less, read more and miss less. If I ever find an interesting site, I just add it to Google Reader and forget about it.

But it goes beyond that. You can subscribe to search results.

Videos, for example. Subscribe to a search for google engedu on Google Video. Any new TechTalk at Google, and you get it in your Reader. (The talks are fabulous, by the way.) Or search for hindi movies.

Photos? Track new photos of the Indian cricket team on Flickr.

Movies? Get an update of the latest movie torrents from mininova or torrentspy. Or just get updates on American Gangster DVDRip.

Presentations? Track anything on Javascript or consulting from SlideShare.

Documents? Find new eBooks on Scribd.

Not all pages offer a feed. But Page2Rss offers a reasonable solution. It converts pretty much any page into a feed.

It’s gotten to the point where anything I know I want to read, I put it on Google Reader. The rest of the Internet is when I don’t know what I want to read.

Default camera ISO setting

In those early days, when all I had was an analog SLR, I had to make choices up-front. Do I buy an ISO 100 film for daytime shooting? (It’s cheaper, besides.) Do I go in for the expensive ISO 1600 film for my fancy night shots? Do I lug around the tripod? Do I use the flash? Do I even bother taking indoor shots? etc.

With my new digital camera, at least the ISO choice vanishes. The ISO range varies from 64 to 1600. And so, I don’t need flash or a tripod most of the time.

But once in a while, I get into a tricky situation.

Having a digital camera lets me take pictures a lot faster. Suppose I spot a boat speeding by when strolling along the Thames. The time it takes from the instant I see it to the instant I click the shutter is about 5 seconds. 2 seconds to pull out the camera, 1.5 seconds for the camera to start up, and about 1.5 seconds for me to aim and shoot.

I love being able to do this kind of a thing.

Except, it’s still a bit tricky managing the ISO. It takes me about 10 seconds to change the ISO settings. No, not because the menus are complex… that accounts for only about 3 seconds or so. The bulk of it is because I have to think about what ISO setting to use — especially given that I like to overexpose digital camera images a bit.

So, when I’m going indoors, I have to remember to set the ISO to something like 400 or 800, and back again when I get out. It may sound like I’m going a bit too far, but the thing is, since I don’t keep my tripod always attached, and don’t ever turn on the flash, I’ve spoiled a fair number of impulsive indoor and night shots because I had the wrong ISO setting at the wrong time.

Being digital images, many of these problems can be fixed.

If I use a high ISO setting (say ISO 800), I get a fair bit of digital noise. But NeatImage does a decent job of reducing noise (thanks, Dhar!), so the result is not too bad.

If I use a low ISO setting (say ISO 100), I get clean images in bright light, but blurred images in low light (no tripod, no flash, you see). I haven’t found anything that can recover from a blurred image.

I decided, on the balance, to have a slightly higher ISO setting by default. I get slightly noisier images, but that’s less of a worry.

So I leave the camera in ISO 400. I can quickly shoot indoors. If I have the time and need, I shift to ISO 100, or use a tripod if required. Then I set it back to ISO 400 when done.

Making my music search engine faster

My music search engine takes quite a while to load (typically 40 seconds). That’s an unusually long time for a page, given that most of the people that access it are on broadband connections, and are listening to music online.

The reason is, firstly, that I’m loading a lot of data. Literally every single song that you can search comes through as Javascript. All the downloadable Hindi songs, for instance, occupy 1.3 megabytes before compression. On average, this takes about 20 seconds to load.

The second reason is, I’m doing a lot of processing with the data. Two things take the bulk of the time: uncompressing the data, and massaging it to allow for different spellings. On average, that takes about 20 seconds.

40 seconds is too long to wait. Actually, the perceived waiting time is 8 seconds, because I start showing the search and the results after 1/5th of the data is downloaded. But 8 seconds is still too long.

I managed to cut this time by half with two things I just learned.

Firstly, Javascript loads and executes sequentially. As soon as a Javascript block is loaded, it is executed. While it is being executed, the rest of the HTML file is not parsed. So any subsequent Javascript blocks (or images, CSS, any other external links) are not loaded. Bandwidth is just sitting idle while the browser is busy at work on the Javascript.

Since I’ve split all my songs into five equal-sized Javascript files, my search engine loads the Javascript files one after another! The problem is even worse — the calculations I do in Javascript take up as much time as the load time. If the loading went on in parallel, by the time the first calculation is done, the second script would have loaded.

This problem can be solved for Internet Explorer and Opera. The “defer” attribute loads the scripts in the background, and defers their execution. This reduces the loading time to nearly zero for all the Javascript files except for the first one or two, because by the time my calculations are done, the next script is already loaded.
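
In the HTML, the change is just the defer attribute on each script tag (the file names below are placeholders):

<!-- Before: each script blocks parsing while it loads and runs -->
<script type="text/javascript" src="songs1.js"></script>
<script type="text/javascript" src="songs2.js"></script>

<!-- After: deferred scripts are fetched in the background and
     executed only once the page has been parsed -->
<script type="text/javascript" src="songs1.js" defer="defer"></script>
<script type="text/javascript" src="songs2.js" defer="defer"></script>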

Javascript loading sequence before and after 'defer'

These Javascript files contain a list of songs as a long string, which I then split into individual songs. Then I modify each song using regular expressions so that approximate matches will still work. (For e.g., typing “aa” is the same as typing “a” on this search engine.) The performance of regular expressions is critical to me.

Originally, I did this:

  1. Split the string into songs
  2. Modify each song using regular expressions

Now, I changed the sequence to this (sketched in code after the list):

  1. Modify the string using regular expressions
  2. Split the string into songs
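
In code, the change amounts to something like this. (A simplified sketch: assume “|” separates songs and that collapsing “aa” to “a” stands in for the real spelling rules. “data” is the long string loaded from the Javascript files.)

// Before: split first, then run the regular expression on each song.
var songs = data.split("|");
for (var i = 0; i < songs.length; i++) {
  songs[i] = songs[i].replace(/aa+/g, "a");   // thousands of small matches
}

// After: run the regular expression once on the whole string, then split.
songs = data.replace(/aa+/g, "a").split("|"); // one big match, then one split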

When I timed the speed of this change, I realised browsers differ a lot in the speed of processing regular expressions. Here is the time (in seconds) for each browser to process the Javascript before and after the change.

Browser               Before    After
Internet Explorer     6.3       5.0
Firefox               17.7      4.7
Opera                 93.8      19.9

Internet Explorer wasn’t affected much by this change. But Firefox and Opera were heavily impacted. I’m guessing this is because Firefox and Opera have a high setup time for matching regular expressions. Before, I was matching thousands of strings. After, I was matching only one (long) string. IE didn’t care much. Firefox and Opera speeded up dramatically. (I suspect my Opera data is skewed by a few people using a rather slow machine… but then, that’s probably why they should use Opera in the first place — it works rather well on old machines.)

With these changes, my total load time fell from about 40 seconds to about 20 seconds. Pretty decent improvement.

There are two further steps I could take from here on.

Compress the song files further

Currently, I’m doing two things to compress the songs.

  1. Instead of sending the list as a (movie-song),(movie-song),... combination for every song, I send it as a (movie-song,song,song,...)(movie-song,song,song,...)... combination. So I don’t repeat the movie names. (A sketch of unpacking this format follows the list.)
  2. Secondly, I compress them via HTTP/1.1. (My server doesn’t let me do that with Javascript, actually, because Netscape 4.x doesn’t accept compressed Javascript. But since it’s such an old browser and none of my viewers use it, I trick the server by renaming my Javascript files as .html, and it passes them on compressed.)
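
Here’s a rough sketch of how that grouped format could be unpacked on the client. (The delimiters here are illustrative; the real data differs in detail.)

// Assume each block looks like "(Movie-Song1,Song2,Song3)".
function unpack(data) {
  var pairs = [];
  var blocks = data.match(/\(([^)]*)\)/g) || [];   // one block per movie
  for (var i = 0; i < blocks.length; i++) {
    var inner = blocks[i].slice(1, -1);            // strip the parentheses
    var movie = inner.split("-")[0];
    var songs = inner.slice(movie.length + 1).split(",");
    for (var j = 0; j < songs.length; j++) {
      pairs.push({ movie: movie, song: songs[j] });
    }
  }
  return pairs;
}
// unpack("(MovieA-Song1,Song2)(MovieB-Song3)") gives
// [{movie:"MovieA",song:"Song1"}, {movie:"MovieA",song:"Song2"}, {movie:"MovieB",song:"Song3"}]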

What I could do additionally is:

  1. Remove duplicate songs. If songs are spelt differently, I include them both. But I can knock off the duplicate ones.
  2. Compress the text. Though I’m gzipping the text before sending it, I suspect there may be ways of storing the data in a more compressed form. For example, many songs are repeated with the (male) and (female) versions. These could be clubbed (somehow…)

Speed up the Javascript calculations further

There are two steps I’m doing right now:

  1. Modify the string using regular expressions
  2. Split the string into songs

Here’s how much time each step takes (in seconds) across browsers. (Note: these were based on one day’s data, sampling about 2,000 page views)

Browser               Regular expression    Split
Internet Explorer     0.88                  3.04
Firefox               3.76                  1.08
Safari                0.47                  0.46
Opera                 4.96                  29.78

Firefox takes a lot of time with regular expressions. IE and Opera take a lot longer to split (which involves creating many objects). I may have to think up different optimisations for the two.

The code is available in the HTML of the song search engine. It’s readable. The stuff I’m talking about is in the function sl(p,s). Any thoughts are welcome!

Reducing the server load

I’ve been using a shared hosting service, 100WebSpace, for the last 7 years. It’s an ad-free account that offers 100MB of space and 3GB of bandwidth per month. Things were fine until two months ago, which was when my song search engines started attracting an audience. I had anticipated that I might run out of bandwidth, so I used a different server (with a 5GB-per-month bandwidth quota) for loading the songs. But what I didn’t anticipate was that my server load would run over the allotted CPU limit.

You’d think this is unusual, given how cheap computing power is, and that I’d run out of bandwidth sooner. But no. My allotted limit was 1.3% of CPU usage (whatever that means), and about 2 months ago, I hit 1.5% a few times. I immediately upgraded my account to one with a 2.5% limit, but the question was: why did this happen?

This blog uses a lot of Perl scripts. I store all articles on a MySQL database. Every time a link is requested, I dynamically generate the HTML by pulling up the article from the MySQL database and formatting the text based on a template.

Schematic of how my website displays pages

I also use MySQL to store users’ comments. Every time I display each page, I also pull out the comments related to that page.

I can’t store the files directly as HTML because I keep changing the template. Every time I change the template, I have to regenerate all the files. If I do that on my laptop and upload it, I consume a lot of bandwidth. If I do that on the server, I consume a lot of server CPU time.

Anyway, since I’d upgraded my account, I thought things would be fine. Two weeks ago, I hit the 2.5% limit as well. No choice. Had to do something.

If you read the O’Reilly Radar Database War Stories, you’ll gather that databases are great for queries, joins and the like, while flat files are better for processing large volumes of data as a batch. Since page requests come one by one, and I don’t need to do much batch processing, I’d gone in for a MySQL design. But there’s a fair bit of overhead to each database query, and that’s the key problem. Perl takes a while to load (and I suspect my server is not using mod_perl). The DBI module takes a while to load. Connecting to MySQL takes a while. (The query itself, though, is quite fast.)

So I moved to flat files instead. Instead of looking up a database, I just look up a text file using grep. (I don’t use Perl’s regular expression matching because regular expression matching in UNIX is faster than in Perl.) I have a 1.6MB text file that contains all my blog entries.

But looking up a 1.6MB text file takes a while. So I split the file based on the first letter of the title. So this post (Reducing the server load) would go under a file x.r.txt (for ‘R’) while my last post (Calvin and Hobbes animated) would go under a file x.c.txt (for ‘C’). This speeds up the grep by a factor of 5-10.

On average, a MySQL query used to take 0.9 seconds. Now, using grep, it’s down to about 0.5 seconds per lookup. Flat files reduced the CPU load by about half. (And as a bonus, my site has no SQL code. I never did like SQL that much.)

So that’s why you haven’t seen any posts from me the last couple of weeks. Partly because I didn’t have anything to say. Partly because I was forced to revamp this site.

Tamil spelling corrector

The Internet has a lot of Tamil song lyrics in English. Finding them is not easy, though. Two problems. The lyrics are fragmented: there’s no one site to search them. And Google doesn’t help. It doesn’t know that alaipaayudhe, alaipaayuthe and alaipayuthey are the same word.

This is similar to the problem I faced with Tamil audio. The solution, as before, is to make an index, and provide a search interface that is tolerant of English spellings of Tamil words. But I want to go a step further. Is it possible to display these lyrics in Tamil?

My Tamil Transliterator does a lousy job of this. Though it’s tolerant of mistakes, it’s ignorant of spelling and grammar. So,

kanda nal muthalai kathal peruguthadi

… becomes…

kanda nal muthalai kathal peruguthadi

… when in fact we want…

kanda naaL muthalaay kaathal peruguthadi

(If you’re viewing this on an RSS reader, check my post to see what I mean.)

I need an automated Tamil spelling corrector. Having read Peter Norvig’s “How to Write a Spelling Corrector” and actually understood it, I gave spelling correction in Tamil a shot.

Norvig’s approach, in simple terms, is this:

  1. Get a dictionary
  2. Tweak the word you want to check (add a letter, delete one, swap 2 letters, etc.)
  3. Pick all tweaks that get you to a valid word on the dictionary
  4. Choose the most likely correction (most common correct word, adjusted for the probability of that mistake happening)

Making a dictionary is easy. I just need lots of Tamil literature, and then I pick out words from it. For now, I’m just using the texts in Project Madurai.

Tweaking the word to check is easy. Norvig’s article has a working code example.
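
In Javascript, the tweak generation looks roughly like this. (A straight translation of the idea; the a-z alphabet is a stand-in, since Tamil-in-English really needs a richer set.)

// Generate all tweaks that are one edit away from a word.
function edits1(word) {
  var alphabet = "abcdefghijklmnopqrstuvwxyz";
  var edits = [], i, j;
  for (i = 0; i < word.length; i++)             // deletes
    edits.push(word.slice(0, i) + word.slice(i + 1));
  for (i = 0; i < word.length - 1; i++)         // transposes (swap 2 letters)
    edits.push(word.slice(0, i) + word.charAt(i + 1) +
               word.charAt(i) + word.slice(i + 2));
  for (i = 0; i < word.length; i++)             // replaces
    for (j = 0; j < alphabet.length; j++)
      edits.push(word.slice(0, i) + alphabet.charAt(j) + word.slice(i + 1));
  for (i = 0; i <= word.length; i++)            // inserts
    for (j = 0; j < alphabet.length; j++)
      edits.push(word.slice(0, i) + alphabet.charAt(j) + word.slice(i));
  return edits;
}
// edits1("kathal") contains, among others, "kaathal" (insert) and "kadhal" (replace).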

Picking valid tweaks is easy. Just check against the dictionary.

The tough part is choosing the likely correction. For each valid word, I need the probability of having made this particular error.

Let’s take an example. I’ve spelt kathal. The valid tweaks of this word include: kal, kol, kadal, kanal, and kaadhal. For each of these, I need to figure out how often the valid tweak occurs, and the probability that I typed kathal when I really meant that tweak. This is what such a calculation would look like:

Tweak      Frequency    Probability of typing kathal    Product
kal        1            0.04                            0.04
kol        4            0.02                            0.08
kadal      10           0.1                             1.0
kanal      1            0.01                            0.01
kaadhal    6            0.25                            1.50

Once we have this, we can see that kaadhal is the right one — it has the maximum value (1.50) in the last column, where we multiply the frequency and the probability.
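
In code, the choice is just a maximum over that last column. A small sketch, with the table above hard-coded (the error probabilities are precisely the part I still need to estimate):

// Frequencies come from the (small) dictionary; the error probabilities
// below are the illustrative numbers from the table.
var frequency = { kal: 1, kol: 4, kadal: 10, kanal: 1, kaadhal: 6 };
var pTypedKathal = { kal: 0.04, kol: 0.02, kadal: 0.1, kanal: 0.01, kaadhal: 0.25 };

function bestCorrection(candidates) {
  var best = null, bestScore = 0;
  for (var i = 0; i < candidates.length; i++) {
    var w = candidates[i];
    var score = frequency[w] * pTypedKathal[w];   // the "Product" column
    if (score > bestScore) { bestScore = score; best = w; }
  }
  return best;
}

bestCorrection(["kal", "kol", "kadal", "kanal", "kaadhal"]);  // "kaadhal" (1.50)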

(You probably realise how small my dictionary is, looking at the frequencies. Well, that’s how big Project Madurai is. But increasing the size of a dictionary is a trivial problem.)

Anyway, getting the frequency is easy. How do I get the probabilities, though? That’s what I’m working on right now.

Splitting a sentence into words

I often need to extract words out of sentences. It’s one of the things I used to build the Statistically Improbable Phrases for Calvin and Hobbes. But splitting a sentence into words isn’t as easy as you think.

Think about it. What is a word?

Something that has spaces around it? OK, let’s start with the simplest way to get words: split by spaces. Consider this piece:

"I'd look at McDonald's," he said.
"They sell over 3,000,000 burgers a day -- at $1.50 each."
High-fat foods were the rage. For e.g., margins in fries
were over 50%... and (except for R&M & Dyana [sic]) everyone
was at ~30% net margin; growing at 25% too!

Splitting this by spaces (counting newlines, tabs, etc. as spaces too), we get the following:

"I'd
look
at
McDonald's,"
...

Now, some of these, like “I'd”, are words. But McDonald's," isn’t. I mean, there’s a comma and a double-quote at the end. Clearly we need to remove the punctuation as well. But if we do that, I'd becomes Id. So we need to be careful about which punctuation to remove. Let’s take a closer look.

The following punctuation marks are clear word separators: spaces, the exclamation mark, the question mark, semicolon, brackets of any kind, and double-quotes (not single quotes). No word has these in the middle. If we use these as separators, our list of words is better, but we still have some words with punctuation:

McDonald's,
e.g.,
High-fat
R&M
...

The issue is, these punctuation marks are ambiguous word separators: comma, hyphen, single-quote, ampersand, period and slash. These usually separate words, but there are exceptions:

  • Comma: not inside numbers (3,000,000)
  • Hyphen: not in hyphenated words (High-fat)
  • Single-quote: not in possessives (McDonald's) or joint words (I'd)
  • Ampersand: not in abbreviations (R&M)
  • Period: not in abbreviations (O.K.) or URLs (www.s-anand.net)
  • Slash: not in fractions (3/4) or URLs (google.com/search)

Colon is ambiguous too. In normal English usage, it would be a separator. But URLs like http://www.s-anand.net/ use these characters, and it doesn’t make sense to separate them.

So here are my current rules for splitting a sentence into words. (It’s a Perl regular expression. Don’t worry. Cooper’s Law: If you do not understand a particular word in a piece of technical writing, ignore it. The piece will make perfect sense without it.)

# Split by clear word separators
/       [\s \! \? \;\(\)\[\]\{\}\<\> " ]
 
# ... by COMMA, unless it has numbers on both sides: 3,000,000
|       (?<=\D) ,
|       , (?=\D)
 
# ... by FULL-STOP, HYPHEN, AMPERSAND, unless it has a letter on both sides
|       (?<=\W) [\.\-\&]
|       [\.\-\&] (?=\W)
 
# ... by QUOTE, unless it follows a letter (e.g. McDonald's, Holmes')
|       (?<=\W) [']
 
# ... by SLASH, if it has spaces on at least one side. (URLs shouldn't be split)
|       \s \/
|       \/ \s
 
# ... by COLON, unless it's a URL or a time (11:30am for e.g.)
|       \:(?!\/\/|\d)
/x;

This doesn’t even scratch the surface, though. Here are some open issues:

  • Lots of files split words into two at the end of a line. How do we handle that?
  • How do we handle incorrect punctuation? For instance, if someone types “done.Yet,” without leaving a space after the full-stop, I’ll think it’s an abbreviation.
  • What about other separators? Like the ± symbol or the £ symbol for instance.
  • What about other languages?!

And you thought it was easy!