How I do things

Advanced Google Reader

I’ve stopped visiting websites.

No, really. There’s only one website I visit these days. Google Reader.

Google Reader screenshot

Google Reader is a feed reader. If you want to just catch up on the new stuff on a site, you can add the site to Google Reader. Anything new that is published on the site appears in Google Reader. Right now, I’ve subscribed to over 50 feeds. There’s no way I can remember to visit 50 sites — so I’m actually able to read more and miss less.

In a sense, that’s the essence of using feeds to read stuff: remember less, read more and miss less. If I ever find an interesting site, I just add it to Google Reader and forget about it.

But it goes beyond that. You can subscribe to search results.

Videos, for example. Subscribe to a search for google engedu on Google Video. Any new TechTalk at Google, you get it in your Reader. (The talks are fabulous, by the way.) Or search for hindi movies.

Photos? Track new photos of the Indian cricket team on Flickr.

Movies? Get an update of the latest movie torrents from mininova or torrentspy. Or just get updates on American Gangster DVDRip.

Presentations? Track anything on Javascript or consulting from SlideShare.

Documents? Find new eBooks on Scribd.

Not all pages offer a feed. But Page2Rss offers a reasonable solution. It converts pretty much any page into a feed.

It’s gotten to the point where anything I know I want to read, I put on Google Reader. The rest of the Internet is for when I don’t know what I want to read.

Default camera ISO setting

In those early days, when all I had was an analog SLR, I had to make choices up-front. Do I buy an ISO 100 film for daytime shooting? (It’s cheaper, besides.) Do I go in for the expensive ISO 1600 film for my fancy night shots? Do I lug around the tripod? Do I use the flash? Do I even bother taking indoor shots? etc.

With my new digital camera, at least the ISO choice vanishes. The ISO range varies from 64 to 1600. And so, I don’t need flash or a tripod most of the time.

But once in a while, I get into a tricky situation.

Having a digital camera lets me take pictures a lot faster. Suppose I spot a boat speeding by when strolling along the Thames. The time it takes from the instant I see it to the instant I click the shutter is about 5 seconds. 2 seconds to pull out the camera, 1.5 seconds for the camera to start up, and about 1.5 seconds for me to aim and shoot.

I love being able to do this kind of a thing.

Except, it’s still a bit tricky managing the ISO. It takes me about 10 seconds to change the ISO settings. No, not because the menus are complex… that accounts for only about 3 seconds or so. The bulk of it is because I have to think about what ISO setting to use — especially given that I like to overexpose digital camera images a bit.

So, when I’m going indoors, I have to remember to set the ISO to something like 400 or 800, and back again when I get out. It may sound like I’m going a bit too far, but since I don’t keep my tripod always attached and never turn on the flash, I’ve spoiled a fair number of impulsive indoor and night shots by having the wrong ISO setting at the wrong time.

Being digital images, many of these problems can be fixed.

If I use a high ISO setting (say ISO 800), I get a fair bit of digital noise. But NeatImage does a decent job of reducing noise (thanks, Dhar!), so the result is not too bad.

If I use a low ISO setting (say ISO 100), I get clean images in bright light, but blurred images in low light (no tripod, no flash, you see). I haven’t found anything that can recover from a blurred image.

I decided, on the balance, to have a slightly higher ISO setting by default. I get slightly noisier images, but that’s less of a worry.

So I leave the camera in ISO 400. I can quickly shoot indoors. If I have the time and need, I shift to ISO 100, or use a tripod if required. Then I set it back to ISO 400 when done.

Making my music search engine faster

My music search engine takes quite a while to load (typically 40 seconds). That’s an unusually long time for a page, given that most of the people that access it are on broadband connections, and are listening to music online.

The reason is, firstly, that I’m loading a lot of data. Literally every single song that you can search comes through as Javascript. All the downloadable Hindi songs, for instance, occupy 1.3 megabytes before compression. On average, this takes about 20 seconds to load.

The second reason is, I’m doing a lot of processing with the data. Two things take the bulk of the time: uncompressing the data, and massaging it to allow for different spellings. On average, that takes about 20 seconds.

40 seconds is too long to wait. Actually, the perceived waiting time is 8 seconds, because I start showing the search and the results after 1/5th of the data is downloaded. But 8 seconds is still too long.

I managed to cut this time by half with two things I just learned.

Firstly, Javascript loads and executes sequentially. As soon as a Javascript block is loaded, it is executed. While it is being executed, the rest of the HTML file is not parsed. So any subsequent Javascript blocks (or images, CSS, any other external links) are not loaded. Bandwidth is just sitting idle while the browser is busy at work on the Javascript.

Since I’ve split all my songs into five equal-sized Javascript files, my search engine loads the Javascript files one after another! The problem is even worse — the calculations I do in Javascript take up as much time as the load time. If the loading went on in parallel, by the time the first calculation is done, the second script would have loaded.

This problem can be solved for Internet Explorer and Opera. The “defer” attribute loads the scripts in the background, and defers their execution. This reduces the loading time to nearly zero for all the Javascript files except for the first one or two, because by the time my calculations are done, the next script is already loaded.

Javascript loading sequence before and after 'defer'
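
For what it’s worth, turning this on is just an attribute on the script tags. A minimal sketch (the file names here are made up, not my actual ones):

<script src="songs1.js" defer="defer"></script>
<script src="songs2.js" defer="defer"></script>
<script src="songs3.js" defer="defer"></script>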

These Javascript files contain a list of songs as a long string, which I then split into individual songs. Then I modify each song using regular expressions so that approximate matches will still work. (For e.g., typing “aa” is the same as typing “a” on this search engine.) The performance of regular expressions is critical to me.

Originally, I did this:

  1. Split the string into songs
  2. Modify each song using regular expressions

Now, I changed the sequence to this:

  1. Modify the string using regular expressions
  2. Split the string into songs
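
As a rough illustration (not the actual code; the variable names, the tab separator and the single spelling rule here are made up), the change amounts to this:

// Before: split the long string first, then run the regular expressions on every song
var before = data.split("\t");
for (var i = 0; i < before.length; i++) {
  before[i] = before[i].replace(/aa/g, "a");    // one illustrative spelling rule
}

// After: run the regular expressions once over the whole string, then split it
var after = data.replace(/aa/g, "a").split("\t");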

When I timed the speed of this change, I realised browsers differ a lot in the speed of processing regular expressions. Here is the time (in seconds) for each browser to process the Javascript before and after the change.

Browser              Before (s)   After (s)
Internet Explorer    6.3          5.0
Firefox              17.7         4.7
Opera                93.8         19.9

Internet Explorer wasn’t affected much by this change. But Firefox and Opera were heavily impacted. I’m guessing this is because Firefox and Opera have a high setup time for matching regular expressions. Before, I was matching thousands of strings. After, I was matching only one (long) string. IE didn’t care much. Firefox and Opera speeded up dramatically. (I suspect my Opera data is skewed by a few people using a rather slow machine… but then, that’s probably why they should use Opera in the first place: it works rather well on old machines.)

With these changes, my total load time fell from about 40 seconds to about 20 seconds. Pretty decent improvement.

There are two further steps I could take from here on.

Compress the songs files further

Currently, I’m doing two things to compress the songs.

  1. Instead of sending the list as a (movie-song),(movie-song),... combination for every song, I send it as a (movie-song,song,song,...)(movie-song,song,song,...)... combination. So I don’t repeat the movie names.
  2. Secondly, I compress them via HTTP/1.1. (My server doesn’t let me do that with Javascript, actually, because Netscape 4.x doesn’t accept compressed Javascript. But since it’s such an old browser and none of my viewers use it, I trick the server by renaming my Javascript files as .html, and it passes them on compressed.)

What I could do additionally is:

  1. Remove duplicate songs. If songs are spelt differently, I include them both. But I can knock off the duplicate ones.
  2. Compress the text. Though I’m gzipping the text before sending it, I suspect there may be ways of storing the data in a more compressed form. For example, many songs are repeated with the (male) and (female) versions. These could be clubbed (somehow…)

Speed up the Javascript calculations further

There are two steps I’m doing right now:

  1. Modify the string using regular expressions
  2. Split the string into songs

Here’s how much time each step takes (in seconds) across browsers. (Note: these were based on one day’s data, sampling about 2,000 page views)

Browser              Regular expression (s)   Split (s)
Internet Explorer    0.88                     3.04
Firefox              3.76                     1.08
Safari               0.47                     0.46
Opera                4.96                     29.78

Firefox takes a lot of time with regular expressions. IE and Opera take a lot longer to split (which involves creating many objects). I may have to think up different optimisations for the two.

The code is available in the HTML of the song search engine. It’s readable. The stuff I’m talking about is in the function sl(p,s). Any thoughts are welcome!

Reducing the server load

I’ve been using a shared hosting service with 100 WebSpace for the last 7 years. It’s an ad-free account that offers 100MB of space and 3GB of bandwidth per month. Things were fine until two months ago, which was when my song search engines started attracting an audience. I had anticipated that I might run out of bandwidth, so I used a different server (with a 5GB per month bandwidth quota) for loading the songs. But what I didn’t anticipate was that my server load would run over the allotted CPU limit.

You’d think this is unusual, given how cheap computing power is, and that I’d run out of bandwidth quicker. But no. My allotted limit was 1.3% of CPU usage (whatever that meant), and about 2 months ago, I hit 1.5% a few times. I immediately upgraded my account to one with a 2.5% limit, but the question remained: why did this happen?

This blog uses a lot of Perl scripts. I store all articles on a MySQL database. Every time a link is requested, I dynamically generate the HTML by pulling up the article from the MySQL database and formatting the text based on a template.

Schematic of how my website displays pages

I also use MySQL to store users’ comments. Every time I display each page, I also pull out the comments related to that page.

I can’t store the files directly as HTML because I keep changing the template. Every time I change the template, I have to regenerate all the files. If I do that on my laptop and upload it, I consume a lot of bandwidth. If I do that on the server, I consume a lot of server CPU time.

Anyway, since I’d upgraded my account, I thought things would be fine. Two weeks ago, I hit the 2.5% limit as well. No choice. Had to do something.

If you read the O’Reilly Radar Database War Stories, you’ll gather that databases are great for queries, joins and the like, while flat files are better for processing large volumes of data as a batch. Since page requests come one by one, and I don’t need to do much batch processing, I’d gone in for a MySQL design. But there’s a fair bit of overhead to each database query, and that’s the key problem. Perl takes a while to load (and I suspect my server is not using mod_perl). The DBI module takes a while to load. Connecting to MySQL takes a while. (The query itself, though, is quite fast.)

So I moved to flat files instead. Instead of looking up from a database, I just look up a text file using grep. (I don’t use Perl’s regular expression matching, because regular expression matching in UNIX, via grep, is faster than in Perl.) I have a 1.6MB text file that contains all my blog entries.

But looking up a 1.6MB text file takes a while. So I split the file based on the first letter of the title. So this post (Reducing the server load) would go under a file x.r.txt (for ‘R’) while my last post (Calvin and Hobbes animated) would go under a file x.c.txt (for ‘C’). This speeds up the grep by a factor of 5-10.

On average, a MySQL lookup used to take 0.9 seconds per query. Now, using grep, it’s down to about 0.5 seconds per query. Flat files reduced the CPU load by about half. (And as a bonus, my site has no SQL code. I never did like SQL that much.)

So that’s why you haven’t seen any posts from me the last couple of weeks. Partly because I didn’t have anything to say. Partly because I was forced to revamp this site.

Tamil spelling corrector

The Internet has a lot of Tamil song lyrics written in English. Finding them is not easy, though. Two problems. The lyrics are fragmented: there’s no one site to search them. And Google doesn’t help. It doesn’t know that alaipaayudhe, alaipaayuthe and alaipayuthey are the same word.

This is similar to the problem I faced with Tamil audio. The solution, as before, is to make an index, and provide a search interface that is tolerant of English spellings of Tamil words. But I want to go a step further. Is it possible to display these lyrics in Tamil?

My Tamil Transliterator does a lousy job of this. Though it’s tolerant of mistakes, it’s ignorant of spelling and grammar. So,

kanda nal muthalai kathal peruguthadi

… becomes…

kanda nal muthalai kathal peruguthadi

… when in fact we want…

kanda naaL muthalaay kaathal peruguthadi

(If you’re viewing this on an RSS reader, check my post to see what I mean.)

I need an automated Tamil spelling corrector. Having read Peter Norvig’s “How to Write a Spelling Corrector” (and actually understood it), I gave spelling correction in Tamil a shot.

Norvig’s approach, in simple terms, is this:

  1. Get a dictionary
  2. Tweak the word you want to check (add a letter, delete one, swap 2 letters, etc.)
  3. Pick all tweaks that get you to a valid word on the dictionary
  4. Choose the most likely correction (most common correct word, adjusted for the probability of that mistake happening)
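
To make steps 2 and 3 concrete, here is a rough Javascript sketch (not my actual code; the Latin alphabet below is just for illustration, since the real thing works on transliterated Tamil):

// All strings within edit distance 1 of "word": delete, swap, change, add a letter
function tweaks(word) {
  var letters = "abcdefghijklmnopqrstuvwxyz";
  var out = {};
  for (var i = 0; i <= word.length; i++) {
    var left = word.substring(0, i), right = word.substring(i);
    if (right) out[left + right.substring(1)] = 1;                             // delete a letter
    if (right.length > 1)
      out[left + right.charAt(1) + right.charAt(0) + right.substring(2)] = 1;  // swap 2 letters
    for (var j = 0; j < letters.length; j++) {
      if (right) out[left + letters.charAt(j) + right.substring(1)] = 1;       // change a letter
      out[left + letters.charAt(j) + right] = 1;                               // add a letter
    }
  }
  return out;   // candidate tweaks; step 3 keeps only those present in the dictionary
}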

Making a dictionary is easy. I just need lots of Tamil literature, and then I pick out words from it. For now, I’m just using the texts in Project Madurai.

Tweaking the word to check is easy. Norvig’s article has a working code example.

Picking valid tweaks is easy. Just check against the dictionary.

The tough part is choosing the likely correction. For each valid word, I need the probability of having made this particular error.

Let’s take an example. I’ve spelt kathal. The valid tweaks of this word include kal, kol, kadal, kanal, and kaadhal. For each of these, I need to figure out how often the tweak occurs in the dictionary, and the probability that I typed kathal when I really meant that tweak. This is what such a calculation would look like:

Tweak      Frequency   Probability of typing kathal   Product
kal        1           0.04                           0.04
kol        4           0.02                           0.08
kadal      10          0.1                            1.0
kanal      1           0.01                           0.01
kaadhal    6           0.25                           1.50

Once we have this, we can see that kaadhal is the right one — it has the maximum value (1.50) in the last column, where we multiply the frequency and the probability.
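
Step 4, in the same sketch form: given a frequency table and some error-probability estimate (the names dictionary, freq and errorProb are hypothetical), the best correction is just the candidate with the largest product:

// Pick the valid tweak with the highest frequency x error-probability score
function correct(word, dictionary, freq, errorProb) {
  var candidates = tweaks(word);        // from the earlier sketch
  var best = word, bestScore = 0;
  for (var w in candidates) {
    if (!dictionary[w]) continue;                       // step 3: valid words only
    var score = (freq[w] || 0) * errorProb(word, w);    // frequency x P(typed "word" | meant w)
    if (score > bestScore) { bestScore = score; best = w; }
  }
  return best;
}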

(You probably realise how small my dictionary is, looking at the frequencies. Well, that’s how big Project Madurai is. But increasing the size of a dictionary is a trivial problem.)

Anyway, getting the frequency is easy. How do I get the probabilities, though? That’s what I’m working on right now.

Splitting a sentence into words

I often need to extract words out of sentences. It’s one of the things I used to build the Statistically Improbable Phrases for Calvin and Hobbes. But splitting a sentence into words isn’t as easy as you think.

Think about it. What is a word?

Something that has spaces around it? OK, let’s start with the simplest way to get words: split by spaces. Consider this piece:

"I'd look at McDonald's," he said.
"They sell over 3,000,000 burgers a day -- at $1.50 each."
High-fat foods were the rage. For e.g., margins in fries
were over 50%... and (except for R&M & Dyana [sic]) everyone
was at ~30% net margin; growing at 25% too!

Splitting this by spaces (treating new lines, tabs, etc. as spaces too), we get the following:

"I'd
look
at
McDonald's,"
...

Now, some of these, like “I’d”, are words. But “McDonald’s” isn’t. I mean, there’s a comma and a double-quote at the end. Clearly we need to remove the punctuation as well. But if we do that, I'd becomes Id. So we need to be careful about which punctuation to remove. Let’s take a closer look.

The following punctuation marks are clear word separators: spaces, the exclamation mark, the question mark, semicolon, brackets of any kind, and double-quotes (not single quotes). No word has these in the middle. If we use these as separators, our list of words is better, but we still have some words with punctuation:

McDonald's,
e.g.,
High-fat
R&M
...

The issue is, these punctuation marks are ambiguous word separators: comma, hyphen, single-quote, ampersand, period and slash. These usually separate words, but there are exceptions:

  • Comma: not inside numbers: 3,000,000
  • Hyphen: not for hyphenated words: High-fat
  • Single-quote: not for possessives: McDonald’s. Not for joint words: I’d
  • Ampersand: not for abbreviations: R&M
  • Period: not for abbreviations: O.K. Not for URLs: www.s-anand.net
  • Slash: not for fractions: 3/4. Not for URLs: google.com/search

Colon is ambiguous too. In normal English usage, it would be a separator. But URLs like http://www.s-anand.net/ use these characters, and it doesn’t make sense to separate them.

So here are my current rules for splitting a sentence into words. (It’s a Perl regular expression. Don’t worry. Cooper’s Law: If you do not understand a particular word in a piece of technical writing, ignore it. The piece will make perfect sense without it.)

# Split by clear word separators
/       [\s \! \? \;\(\)\[\]\{\}\<\> " ]
 
# ... by COMMA, unless it has numbers on both sides: 3,000,000
|       (?<=\D) ,
|       , (?=\D)
 
# ... by FULL-STOP, HYPHEN, AMPERSAND, unless it has a letter on both sides
|       (?<=\W) [\.\-\&]
|       [\.\-\&] (?=\W)
 
# ... by QUOTE, unless it follows a letter (e.g. McDonald's, Holmes')
|       (?<=\W) [']
 
# ... by SLASH, if it has spaces on at least one side. (URLs shouldn't be split)
|       \s \/
|       \/ \s
 
# ... by COLON, unless it's a URL or a time (11:30am for e.g.)
|       \:(?!\/\/|\d)
/x;

This doesn’t even scratch the surface, though. Here are some remaining issues:

  • Lots of files split words into two at the end of a line. How do we handle that?
  • How do we handle incorrect punctuation? For instance, if someone types “done.Yet,” without leaving a space after the full-stop, I’ll think it’s an abbreviation.
  • What about other separators? Like the ± symbol or the £ symbol for instance.
  • What about other languages?!

And you thought it was easy!

HTTP download speeds

In some of the Web projects I’m working on, I have a choice of many small files vs a few big files to download.

There are conflicting arguments. I’ve read that many small files are better, because you can choose to use only the required files, and they’ll be cached across the site. (These are typically CSS or Javascript files.) On the other hand, a single large file takes less time to download than the sum of many small files, because there’s less latency. (Latency matters more than bandwidth these days.)

I ran some tests, and the answer is rather easy. The graph below shows the average time taken to download a file of size 1KB – 100KB.

Time to download a file of size ranging from 1KB - 100KB

The X-axis is the size of the file. The Y-axis is the number of milliseconds taken to download the file, averaged over 10 attempts on my home computer. (I did the same thing at work and at a client site. The results are similar.)

The key thing is, it’s not linear. Larger files take less time per kilobyte. Specifically:

  • A file twice as big only takes 30% longer to load.
  • Breaking a file into two parts takes 54% longer to load (two halves take twice the single-file time, while the combined file takes only 1.3 times it, and 2 / 1.3 ≈ 1.54).

These days, it looks like a few big files are better.

To give you an example: my home page had the following components:

Component    Size (KB)   Load time (ms)
HTML         25          680
CSS          4           340
Javascript   6           400
Total        35          1420

The total time for these 3 files would be about 1.4 seconds. Instead, if I put them all in one file…

Component    Size (KB)   Load time (ms)
All-in-one   35          770

The combined file takes only 0.77 seconds — half the download time for split files. It’s a compelling argument to put all your CSS and Javascript (images, too, if possible) into a single page.
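
As a sketch (illustrative markup only, not my actual page), that just means moving the external references into the page itself:

<!-- Before: the page makes two extra requests -->
<link rel="stylesheet" href="style.css">
<script src="site.js"></script>

<!-- After: no extra requests; the CSS and Javascript are inlined -->
<style>  /* contents of style.css pasted here */  </style>
<script> /* contents of site.js pasted here   */ </script>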

But what if people visit multiple pages on my site, and lose the benefit of caching? Not really a problem: the benefit of caching is small. By having a single file, each HTML load takes 770 - 680 = 90 ms longer. But I don’t have to load the CSS and Javascript separately, which takes 740 ms. The breakeven is about 740 / 90 = 8 page visits.

So, on average, if people visit more than 8 pages on my site, it’s worth breaking up the CSS and Javascript. But the average for my site is only 2 pages per person. (It’s a skewed distribution. Most visitors, from Google, see just one page. A few, from bookmarks, visit several pages. On average, it’s just over 2.) I’d argue that unless you’re using an external Javascript / CSS library (prototype, etc.) or people view many pages each visit, you’re better off having a single HTML+Javascript+CSS file.

Making a Media PC

Two weeks ago, I pulled together a Media PC.

This has been a long-term ambition. I’ve always wanted to have a PC as the centre of all my media. To use it as a radio, TV, stereo system, CD player, DVD player, etc.

I finally did it, for just under 1000 pounds.

Media PC Setup

At the centre of the setup is my 42″ Plasma TV (LG 42PC1D). I was debating between a plasma and LCD TV. The differences, as I understand them, are:

  1. Plasma TVs have a higher contrast ratio. My LG 42PC1D has a 10000:1 contrast ratio. An LG 42LC2D has a 1100:1 contrast ratio. The Plasma TV is also brighter (1500cd/m2) than the LCD TV (500cd/m2).
  2. Plasma TVs are cheaper for a given size. A 42″ LCD TV costs about 5-20% more.
  3. LCD TVs are lighter. The only reason this matters for me is if I carry the TV back home to India. But the shipping costs are exorbitant anyway, so I’d be better off leaving the TV behind, and the weight becomes irrelevant.
  4. LCD TVs consume less power. And my power bill is quite high. But I replaced most of the bulbs in our house with energy-efficient ones. Hopefully it will balance out.
  5. LCD TVs work better with computers. If you leave an image on a plasma TV for a long time, it burns on the screen. Screensavers become a must.

I finally picked the Plasma TV, but it was a borderline decision.

The TV is hooked to a Cyberhome DVD player with DivX and a Freeview receiver. The DVD player lets me watch DivX movies I download as torrents. The Freeview player gives me over 40 free channels for casual viewing. (I don’t watch enough TV to need Sky TV.)

The computer itself is an Intel Pentium III tower that I bought on eBay. This is my “media PC”. I hooked it up to my TV (which has a PC input), a pair of Bose MediaMate speakers (a gift from my brother-in-law) with excellent sound response, and a Labtec webcam.

Two other components I bought were wireless: a Microsoft wireless keyboard and mouse to control the system from my sofa, and a Linksys cordless Skype phone (with a speakerphone), so that I could hold videoconferences on Skype while on the sofa.

Having set this up, I’m truly beginning to appreciate the convenience of wireless appliances. Right now, I can do any of these things without getting up from my sofa. My laptop, phone and wireless keyboard are always just a hand’s reach away! Here are some of the things I’ve been doing (wirelessly):

  1. Videoconferencing. I leave the computer on permanently. My parents or in-laws call me on Skype. The cordless phone rings. I can answer Skype calls directly from the phone. When I pick up the call, the webcam turns on automatically. We can sit on the sofa and speak, while they see us. I can turn on the TV and see them through their webcam. It’s a full-fledged wireless video-conference setup.
  2. Listening to radio. I use my laptop to connect wirelessly to my media PC using Remote Desktop, start up WinAmp, and pick a Shoutcast channel (which has a decent tamil channel list). The sound comes through the Bose speakers connected to the media PC, and I can control the volume from any room, using my laptop.
  3. Listening to MP3s. Ditto, except I turn on a playlist on WinAmp.
  4. Watching online videos. I turn on the TV, use my wireless keyboard, and connect to Google Video. Most of the time, I watch recent tamil movies or Google Techtalks.
  5. Watching TV. (Live from BD and ShareVDO being some choices.)
  6. Watching movies. I actually use “low tech” to do this. I record DivX files I download on to a DVD, and play them through my DVD player (which recognises DivX). On those occasions that I download WMV files, I play them through the computer.

With this setup, it’s easy to do more cool things, like a Truman Show-style broadcast (which Justin.TV already does).

Hindi songs online

Click here to search for Hindi songs.
This is an article on how I wrote the search engine.

I find it a nuisance to have to go to Raaga, search for a song, not find it, then go to MusicIndiaOnline, not find it, then go to Musicplug.in, and so on until Google.

So I got the list of songs from some of these sites, put it together in one place, and implemented a find-as-you-type.

Just go to s-anand.net/hindi and type a song or movie name.

Update: How I created this is a very long story, spanning over two years. And here it is.


Over the last 5 years, my MP3 collection had grown quite large. Most of these were from MP3 CDs I had bought, songs I’d downloaded, or songs I’d recorded. In any case, the file names were a mess. In early 2005, I decided to organise the collection and put them all into one big directory per language, and name the files in the format “Movie.Song.mp3”.

People think I’m crazy to use a single large directory. But I prefer one directory with 5,000 files to 1,000 directories with 5 files each, for a simple reason: searching in one directory is easier than in multiple directories. You can just sort everything by name, date modified, size, whatever. On the command prompt, you can type “DIR dil*.mp3” to see all movies starting with “Dil”.

I chose the “Movie.Song.mp3” format because the movie name and the song name were really the only two things I knew about every song I had. I didn’t always know the music director, singers or year of the movie. And I placed “Movie” before “Song” because I often browse songs within a single movie, and it’s useful to sort by name in Windows Explorer and see all the songs in a movie. I’ve never had a need to sort by song name. If I wanted to find a song that started with, say, “pehla”, I’d just type “DIR *pehla*” on the Command Prompt. (As you might have guessed, I have a Command Prompt window always open.)

So, having very rationally organised my music collection, I was happy.

Soon the problem shifted to song discovery. I’d heard the familiar songs in my collection many times. Of the unfamiliar songs, I didn’t know which to pick. I knew I liked some music directors more than others, and had a hunch I liked older songs. (My subsequent analysis of song preferences confirmed this.) But I didn’t know the year or music director for any of my songs.

Since Raaga had a decent collection of songs, along with the year and music director, I decided to download this information and tag my files with it. There were two problems. Firstly, the data in Raaga needed to be parsed: I had to hunt through each HTML file to find the year and music director. The second was worse: my song names were spelt differently from Raaga’s.

Step 1: download the HTML and parse it. Perl is pretty much the only programming language I know. I used Perl’s LWP library to download the movie index of Raaga. Each movie in the index always has the URL http://www.raaga.com/channels/hindi/movie/something.html. So I extracted these patterns and downloaded all these URLs as well. (Others have a recognisable pattern as well: http://www.musicindiaonline.com/music/hindi_bollywood/s/movie.some_number/, http://www.musicplug.in/songs.php?movieid=some_number, http://www.smashits.com/music/hindi_film/songs/some_number, etc.)

You probably realise that I downloaded a fair bit of the entire Raaga.com site’s HTML. Actually, it’s not that big. The 800-odd files in the Hindi collection didn’t take more than 40MB of space, and a few hours to download. Here’s what the code looks like:

use LWP::Simple;                 # provides get()

# Get the list-of-movies HTML file
my $index = get("http://www.raaga.com/channels/hindi/movies.asp");

# Extract the list of movies from that into a hash (url => movie-name pairs)
my %movie = ($index =~ m|<a href="/channels/hindi/movie/([^"]*).html" class="[^"]*">([^>]*)</a>|gsi);

# Loop through each movie
for my $file (keys %movie) {
  # Get the song HTML file
  my $html = get("http://www.raaga.com/channels/hindi/movie/$file.html");

  # Year is typically like this: Movie_Name (1983)
  my $year = ($html =~ m|<b>$movie{$file}\s+\((\d*)\)</b>|) ? $1 : "";

  # Music director always has the URL /channels/hindi/music/something
  my $md = ($html =~ m|<a href="http://www.raaga.com/channels/hindi/music/[^"]*" class="black">(.*?)</a>|) ? $1 : "";

  for (split /\n/, $html) {
    # Now get individual songs and rating. They always have an onClick="setList1(something)"
    if (m|onClick="setList1\((\d*)\)[^>]*>(.*?)<\/a>.*?"http://images.raaga.com/.*?.gif" alt="RATING: ([\d\.]*)"|) {
      my $song   = $2;
      my $rating = $3;
      print join("\t", $movie{$file}, $song, $year, $md, $rating), "\n";
    }
  }
}

Incidentally, I’m showing you a simplified version. I actually wrote the script in a way that I could resume where I left off. The ability to resume was probably the single most useful time-saver in the entire process.

Step 2: match the song names with those on my list. This is tricky. I hate doing it manually. So I developed a set of rules that could compare two spellings of a movie and decide if they were the same or not (see my earlier post on matching misspelt movie names). For Hindi songs and movies, here are my rules (in JavaScript):

w=w.toUpperCase();                      // Convert to upper case
w=w.replace(/\s+/g, " ");               // All whitespaces = 1 space
w=w+" ";                                // Ensure there's a space after every word
w=w.replace(/W/g, "V");                 // V=W
w=w.replace(/Z/g, "J");                 // Z=J
w=w.replace(/PH/g, "F");                // PH=F
w=w.replace(/([KGCJTDPBS])H/g, "$1");   // Ignore H after most consonants
w=w.replace(/(.)\1/g, "$1");            // Ignore repeated letters
w=w.replace(/Y/g, "");                  // Ignore Y
w=w.replace(/([AEIOU])N /g, "$1");      // Ignore trailing N after vowel (hein, mein)
w=w.replace(/[AEIOU]/g, "");            // Ignore vowels
w=w.replace(/ /g, "");                  // Ignore spaces

These are the rules, incidentally, that I use in my Hindi quizzes. Even if you type in a different spelling, the rules above will match the correct answer. For example, both “zakhm” and “jakham” reduce to the same string (JKM), so either spelling matches.

I extended these programs over 2006 to cover MusicIndiaOnline, Musicplug.in and Smashits. (I’ve hit a point of diminishing returns, I think, so I’m not too keen on expanding this list.)

Now, with a database of song information, I needed a good media player to view this in. I’ve used several media players over time: WinAmp, Windows Media Player, RealPlayer, iTunes, and MediaMonkey. I’m a big WinAmp fan, but I’ve been forced to switch to MediaMonkey now: WinAmp has a 10-second delay before playing any song on my new laptop. MediaMonkey’s not bad, though. The lack of advanced filters is countered by the heavy programmability (I can use Javascript to update MP3 ID3 tags on MediaMonkey). Plus, I get all the WinAmp plugins. As for the other media players, I think they’re junk.

There are five things I want in a perfect media player:

  1. Find as I type. I shouldn’t have to type the entire song, or press a “Go” button. I’ll just type. It should show all matches instantly. WinAmp does this, and that’s why I loved it. (Today, most media players can do this.)
  2. Advanced filters. Instead of manually creating playlists, I’d rather create filters, like “Highly rated songs in the 2000s I haven’t heard recently”. (See How I listen to music.)
  3. Enqueable playlists. When I click on a song, I don’t want my current song to be interrupted. Just play it next.
  4. Global hotkeys. I want to pause the song when someone says something — without having to go to the application, search for the pause button, etc. WinAmp does this brilliantly with its global hotkeys.
  5. Online and offline integration. I want to be able to search online collections, like Raaga. Unfortunately none of the media players can do this. They have their own collections (radio stations, really), but even these aren’t really searchable.

Since my favourite media players (WinAmp and MediaMonkey) lack only one of these features, I thought I might be able to build them in.

But no such luck. It was almost easier to build my own media player. So I started to build my own two weeks ago. My hope was to cover as many of these requirements as possible, beginning with find-as-you-type.

The key to find-as-you-type is speed. You can’t afford many calls back and forth between the browser and the server. Even if people have a fast connection, my server is not fast enough. (A good part of the reason why I use Google applications is speed. Google’s server is blazingly fast, and the design of their applications complements that.) To make find-as-you-type fast, ideally the entire database should be loaded. Then, as you type, I just need to check with the database in memory. But downloading an entire database takes ages! (My full music database is 7MB right now.)

Step 3: compress the database. Rather than load the full 4MB, I managed to get the page to start after loading 100KB of data. First, I cut down less important fields. Most searches are for a song or movie, rarely for a year or music director. So I took only the movie and song names. That brought the data down to ~2MB.

I don’t need to repeat the movie name across songs. If I have something like this:

Movie1  Song1
Movie1  Song2
Movie1  Song3
Movie2  Song1
Movie2  Song2

I can store this instead as:

Movie1  Song1   Song2   Song3
Movie2  Song1   Song2

I can also eliminate duplicate names. This brings down the space requirements to ~500KB.
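
Unpacking this grouped form on the client is cheap. A minimal sketch (the tab-separated format and the function name are my own assumptions, not the site’s actual code):

// Expand "Movie<TAB>Song1<TAB>Song2..." lines back into movie-song pairs
function unpack(data) {
  var pairs = [];
  var lines = data.split("\n");
  for (var i = 0; i < lines.length; i++) {
    var fields = lines[i].split("\t");
    for (var j = 1; j < fields.length; j++) {
      pairs.push(fields[0] + " - " + fields[j]);   // movie name repeated for each song
    }
  }
  return pairs;
}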

The next step was the clever one. I don’t need to load the full database before you start searching! It’s enough to load a reasonable subset, and let you start searching while the rest loads in the background. So my Hindi song search engine loads about 100KB of the data from one Javascript file, lets you search, and in the background loads the 400KB Javascript file. As soon as that finishes loading, it displays results from that set as well. (The initial portion is really songs on Raaga. I figured it would represent a decent search set.)

Step 4: find-as-you-type. This is quite easy, actually. I send the onkeyup event to my search function.

<input id="inp" onkeyup="csearch()">

The csearch() function goes through all the songs, checks if there’s a match, and prints all those results.

// Create a case-insensitive regular expression
var re = new RegExp(search, "i");
var str = "";
for (var i in songs) {
  if (re.test(songs[i])) { str += songs[i]; }
}
library.innerHTML = str;

But that, unfortunately, takes ages to finish. If you implemented this as is, you would have to wait 1 – 1.5 seconds between each keystroke. So I made two modifications. Firstly, I restrict the number of results displayed to 100. When you start typing (for example, you’d go ‘c’… ‘ch’… ‘chu’…), there are several songs early on that match the string, so I don’t have to search through the whole list. This speeds up the search for small strings.

When the search gets bigger (‘chupke chu’… ‘chupke chupk’…), there aren’t 100 results. So the search has to cover the full list, and that takes 1-1.5 seconds between each keystroke. So I took another clever step. I broke the search into chunks of 5000 songs. That takes a fraction of a second. I search successive chunks of 5000 songs. If I find any results, I add them. Then I wait for a keystroke. If nothing happens, I continue searching the next 5000 after 50 milliseconds, and so on. If you press a key in the meantime, I stop searching for the original word, and start searching for the new word. This makes the application appear a lot faster.
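
Here is a rough sketch of that idea (not the site’s actual code; “songs” is the full list as before, and “render” is a hypothetical display function):

var currentQuery = "", timer = null;

function csearch() {
  currentQuery = document.getElementById("inp").value;
  if (timer) clearTimeout(timer);            // a new keystroke cancels any pending chunk
  searchChunk(currentQuery, 0, []);
}

function searchChunk(query, start, results) {
  if (query != currentQuery) return;         // stale search: a newer keystroke took over
  var re = new RegExp(query, "i");
  var end = Math.min(start + 5000, songs.length);
  for (var i = start; i < end && results.length < 100; i++) {
    if (re.test(songs[i])) results.push(songs[i]);
  }
  render(results);
  if (end < songs.length && results.length < 100) {
    // search the next 5,000 songs after 50 milliseconds, keeping the page responsive
    timer = setTimeout(function() { searchChunk(query, end, results); }, 50);
  }
}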

There are ways I could make this even faster. For example, people type more often than delete. A typical sequence would be (‘chupke ch’… ‘chupke chu’… ‘chupke chupk’…) rather than the reverse. Backspace is not pressed very often. So, instead of re-searching the whole list, I could just search the already-searched list in such cases. But right now, the search works fast enough, so I’ll leave it at that.

The next step is advanced filters. I’m working on that right now. Hopefully you’ll see it in a while.

Statistically improbable phrases 2

My earlier list of statistically improbable phrases in Calvin and Hobbes is technically just a list of “Statistically Improbable Words”. I re-did the same analysis using phrases. Here are the top 20 statistically improbable phrases (2 – 4 words only):

baby sitter, chocolate frosted sugar bombs, comic books, doing homework, fearless spaceman spiff, good night, hamster huey, ice cream, miss wormwood, new year, peanut butter, really think, slimy girls, spaceman spiff, stuffed tiger, stupendous man, sugar bombs, susie derkins, watch tv, water balloon

That is, these are the 2-4 word phrases whose frequency in Calvin and Hobbes is substantially (at least 5 times) higher than in the other books I have.

While doing this, the single biggest problem that stumped me was: what is a word?

  • Is “it’s” one word or two words?
  • Is “six-year-old” one word or three words?
  • How do I distinguish between abbreviations (g.r.o.s.s.) and full-stops without a space ( … homework.what’s a …)?
  • Does a comma always split words? (It doesn’t in numbers, like “3,500”)

The other problem is, phrases with more words are more improbable. Right now, if a phrase occurs 5 times more frequently in Calvin and Hobbes than in my other books, I include it. But three-word phrases rarely occur that often, and four-word phrases even less so. Maybe I should have a lower cutoff for longer phrases.

Anyway, this analysis is a crude first approximation. Clearly Amazon’s gotten much further with their system.