S Anand

Advanced Google Reader

I’ve stopped visiting websites.

No, really. There’s only one website I visit these days. Google Reader.

Google Reader screenshot

Google Reader is a feed reader. If you want to just catch up on the new stuff on a site, you can add the site to Google Reader. Anything new that is published on the site appears in Google Reader. Right now, I’ve subscribed to over 50 feeds. There’s no way I can remember to visit 50 sites — so I’m actually able to read more and miss less.

In a sense, that’s the essence of using feeds to read stuff: remember less, read more and miss less. If I ever find an interesting site, I just add it to Google Reader and forget about it.

But it goes beyond that. You can subscribe to search results.

Videos, for example. Subscribe to a search for google engedu on Google Video. Any new TechTalk at Google, you get it in your Reader. (The talks are fabulous by the way.) Or search for hindi movies.

Photos? Track new photos of the Indian cricket team on Flickr.

Movies? Get an update of the latest movie torrents from mininova or torrentspy. Or just get updates on American Gangster DVDRip.

Presentations? Track anything on Javascript or consulting from SlideShare.

Documents? Find new eBooks on Scribd.

Not all pages offer a feed. But Page2Rss offers a reasonable solution. It converts pretty much any page into a feed.

It’s gotten to the point where anything I know I want to read, I put it on Google Reader. The rest of the Internet is when I don’t know what I want to read.

Website load distribution using Javascript

My music search engine shows a list of songs as you type — sort of like Google’s autosuggest feature. I load my entire list of songs upfront for this to work. Though it’s compressed to load fast, each time you load the page, it downloads about 500KB worth of song titles.

My allotted bandwidth on my hosting service is 3GB per month. To ensure I don’t exceed it, I uploaded the songs list to an alternate free server: Freehostia. This keeps my load down. If I exceed Freehostia’s limit, my main site won’t be affected — just the songs. I also uploaded half of them to Google Pages, to be safe.
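To put that quota in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1GB = 1,000,000KB) and that every page load downloads the full 500KB list; the function name is made up for illustration.

```javascript
// Rough estimate of how many page loads a monthly bandwidth quota allows.
// Assumes decimal units and that every visit downloads the whole song list.
function pageLoadsPerMonth(quotaGB, pageKB) {
    const quotaKB = quotaGB * 1000 * 1000;
    return Math.floor(quotaKB / pageKB);
}

console.log(pageLoadsPerMonth(3, 500)); // 6000
```

That works out to roughly 200 page loads a day over a 30-day month, before counting anything else the site serves.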

This all worked fine… until recently. Google Pages has a relatively low bandwidth restriction. (Not sure what, and they won’t reveal it, but my site is affected.) Freehostia is doing some maintenance, and their site goes down relatively often. So my song search goes down when any of these go down.

Now, these are rarely down simultaneously. Just one or the other. But whenever Freehostia is down, I can’t listen to one bunch of songs. When Google Pages is down, I can’t listen to another.

What I needed was a load distribution set-up. So I’ve made one in JavaScript.

Normally, I load the song list using an external javascript. I have a line that says:

<script src="http://sanand.freehostia.com/songs/..."></script>

… and the song list is loaded from Freehostia.

What I’d like to do is:

loadscripts(
    "list.hasLoaded()", [
    "http://sanand.freehostia.com/songs/...",
    "http://root.node.googlepages.com/...",
    "http://www.s-anand.net/songs/..."
])

If the function can’t load it from the first link, it loads it from the second, and so on, until list.hasLoaded() returns true.

Here’s how I’ve implemented the function:

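function loadscripts(check, libs) {
    // Load the first script in the list.
    document.write('<script src="' + libs.shift() +
        '"><' + '/script>');

    // If fallbacks remain, write an inline script that runs after the one
    // above and retries with the remaining URLs when the check fails.
    if (libs.length > 0) document.write('<script>if (!(' + check + ')) {'
        + 'loadscripts("' + check + '",["' + libs.join('","') + '"])'
        + '}</' + 'script>');
}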

The first document.write loads the first script in the list. The if condition checks whether there are more scripts to load. If so, the second document.write writes a script that

  • checks whether the script is already loaded, and
  • if not, loads the rest of the scripts using the same function.
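To make the two writes concrete, here is a small sketch that returns the markup loadscripts would emit instead of writing it to the page. scriptMarkup is a hypothetical helper for illustration only, and a.js, b.js and c.js are placeholder file names.

```javascript
// Returns the chunks of HTML that loadscripts(check, libs) would pass to
// document.write, without touching the page.
function scriptMarkup(check, libs) {
    const rest = libs.slice(1);
    const out = ['<script src="' + libs[0] + '"><' + '/script>'];
    if (rest.length > 0) {
        // The inline script re-invokes loadscripts with the remaining URLs,
        // but only if the check expression is still false.
        out.push('<script>if (!(' + check + ')) {'
            + 'loadscripts("' + check + '",["' + rest.join('","') + '"])'
            + '}</' + 'script>');
    }
    return out;
}

const chunks = scriptMarkup('list.hasLoaded()', ['a.js', 'b.js', 'c.js']);
console.log(chunks.join('\n'));
```

The first chunk is a plain script tag for a.js. The second is an inline script that calls loadscripts again with b.js and c.js, so each failed source hands over to the next one in line.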

I’ve expanded the sites that have these free songs as well. So now, as long as my main site works, and at least one of the other sites works, the search engine will work.

PS: You can easily expand this to do random load distribution as well.
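One possible way to do that, as a minimal sketch: shuffle the list of sources before handing it to loadscripts, so each visitor starts with a random one. The shuffled helper and the mirror names below are made up for illustration.

```javascript
// Returns a shuffled copy of the list (Fisher-Yates shuffle).
// The random parameter is injectable for testing; defaults to Math.random.
function shuffled(urls, random = Math.random) {
    const result = urls.slice();
    for (let i = result.length - 1; i > 0; i--) {
        const j = Math.floor(random() * (i + 1));
        const tmp = result[i];
        result[i] = result[j];
        result[j] = tmp;
    }
    return result;
}

// Hypothetical mirror names.
const mirrors = ['mirror-a/songs.js', 'mirror-b/songs.js', 'mirror-c/songs.js'];
const order = shuffled(mirrors);
// In the page this would become: loadscripts("list.hasLoaded()", order)
console.log(order);
```

Since the fallback chain still walks the whole list, a visitor whose first pick is down simply moves on to the next source.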

A busy break from blogging

Between July 17th and August 22nd, I saw 57 movies and read 7 books. There were Saturdays when I watched four movies back-to-back. (I tried five. Couldn’t stay awake.) Amidst this, I also cooked, cleaned, shopped… and went to office. (Oh yes, I was working 10 hours a day.) And managed to build some interesting sites which I’ll release in a while.

But first, let me share the books with you.

Harry Potter and the Deathly Hallows

I wasn’t planning to buy it. I figured I’d just wait for the soft copy. On 21st July at 8am, I went shopping to the local Sainsbury’s to get groceries for my pre-movie cooking. I didn’t know that was the release date. And there it was. In a huge stack. 50% discount. Should I? Shouldn’t I? After finishing the rest of my shopping, and having deeply analysed the cost-benefit and ROI, I figured: if I didn’t buy it now, someone else might tell me the answers!

Was Snape evil? I couldn’t believe that. Not after Dumbledore’s implicit trust. Besides, I re-read Harry Potter and the Half-Blood Prince, and if Dumbledore was dismissing Harry’s explicit warnings about Snape, he had to know something more. Anyway, what was the significance of Dumbledore’s last words? "Severus… please…" Please what? Dumbledore begging for death seemed more likely than Dumbledore begging for life. I had to know.

Who dies? Voldemort, of course. But who else? It couldn’t be Harry, unless J K Rowling was looking to make herself one of the most hated novelists. Yet, it seems so… possible. Harry dying to take Voldemort out. Naah, can’t be. Not Ron or Hermione either. Same reason. One of the other Weasleys? Maybe. Plenty of them anyway. Hopefully Percy. Hagrid? Possible. Lupin? The last of Harry’s father’s friends?

And all the minor questions: What’s the significance of Harry’s eyes? What does Wormtail do to help Harry? What’s the significance of Voldemort having used Harry’s blood to resurrect himself? etc. etc.

So I bought it.

But didn’t start reading.

I knew that if I picked it up, I wouldn’t put it down. It was time to cook. And watch movies.

By 4:00pm, after three movies, I couldn’t stand it any more. So I picked it up. Read until 2:00am. Picked it up again on Sunday at 9:00am, and starved until I finished it at 11:00am.

Whew! What a book. Definitely the raciest of the lot. My earlier favourite in the series was The Prisoner of Azkaban, though The Half-Blood Prince came close. But this one beats them all. Resolves most of the mysteries till date, too. As Stephen King says in his review of Harry Potter, “by the time she penned the final line of Deathly Hallows, she had become one of the finer stylists in her native country.”

How to be Good

And then there was Nick Hornby’s book. I’d seen a couple of his movies: About a Boy and High Fidelity. They were interesting, and I’d heard the books were good. Figured I’d pick one up.

And it was hilarious!

How to be Good is the funniest book I’ve read since Five Point Someone and The Inscrutable Americans. Most of my colleagues kept wondering what I was laughing out so loudly about.

This is the story. The narrator is a doctor and a good wife. “Gooder” than her husband, certainly, and that makes her feel good. Until he suddenly becomes GOOD. Truly good. Saint-like. And then she can’t stand him any more. The story is in first person, so you can see her thoughts almost verbatim. (See thoughts verbatim? Well, whatever the phrase is.)

The Runaway Jury

I had just seen the movie Runaway Jury, so I had to re-read the book immediately. The movie was surprisingly good, though. Dustin Hoffman, Gene Hackman, John Cusack and Rachel Weisz. Many changes from the book, but they didn’t detract from the experience.

Batman Year One, The Dark Knight Returns and Understanding Comics

I saw 300. WOW! Brilliant. One of the greatest visual experiences ever. Possibly better than V for Vendetta and certainly better than Sin City — both of which I thought had incredible visuals. The colours, the texture, the contrast, the surrealism — whew!

That’s when it hit me. Three of the best movies I’ve seen recently were based on graphic novels (comics). Two by Frank Miller. Maybe I should explore this a little more.

I got myself Batman Year One, and that’s when I realised where Batman Begins got its inspiration from. The graphics were pretty old style, but the story, incredible. Then I picked up The Dark Knight Returns. Now THAT is phenomenal graphics. And what a story! Christopher Nolan’s next movie is slated to be The Dark Knight. Really looking forward to that.

With all this, I ended up reading a bunch of new Superman comics as well (but they were lousy, so I won’t mention anything), and in the meanwhile, heard about Scott McCloud’s comic book on comics, Understanding Comics. Like I said, it’s a comic book, but non-fiction. It’s about the history and art of comics. Very nice reading, and quite insightful too. I think every visual designer should take a look at it.

Pro Javascript Techniques

I’d been reading up a lot of Javascript recently, learning mostly from Douglas Crockford, Peter-Paul Koch, Dean Edwards, and John Resig. So when I realised John had a book, I had to read it. Douglas Crockford recommends JavaScript: The Definitive Guide (5th Edition) as the least bad among many bad books. I read it. Sorry, but it was quite a bore. Pro Javascript Techniques, on the other hand, is gripping. Dives right into modern techniques, writing style, and is filled with practical advice. How wonderful.


So anyway, that’s what my month’s been like. Well worth the break from blogging, I think. (And I haven’t even told you about the movies or sites yet! Well, soon.)

Default camera ISO setting

In those early days, when all I had was an analog SLR, I had to make choices up-front. Do I buy an ISO 100 film for daytime shooting? (It’s cheaper, besides.) Do I go in for the expensive ISO 1600 film for my fancy night shots? Do I lug around the tripod? Do I use the flash? Do I even bother taking indoor shots? etc.

With my new digital camera, at least the ISO choice vanishes. The ISO range varies from 64 to 1600. And so, I don’t need flash or a tripod most of the time.

But once in a while, I get into a tricky situation.

Having a digital camera lets me take pictures a lot faster. Suppose I spot a boat speeding by when strolling along the Thames. The time it takes from the instant I see it to the instant I click the shutter is about 5 seconds. 2 seconds to pull out the camera, 1.5 seconds for the camera to start up, and about 1.5 seconds for me to aim and shoot.

I love being able to do this kind of a thing.

Except, it’s still a bit tricky managing the ISO. It takes me about 10 seconds to change the ISO settings. No, not because the menus are complex… that accounts for only about 3 seconds or so. The bulk of it is because I have to think about what ISO setting to use — especially given that I like to overexpose digital camera images a bit.

So, when I’m going indoors, I have to remember to set the ISO to something like 400 or 800, and back again when I get out. It may sound like I’m going a bit too far, but the thing is, since I don’t keep my tripod always attached, and don’t ever turn on the flash, I’ve spoiled a fair number of impulsive indoor and night shots because I’ve had the wrong ISO setting at the wrong time.

Being digital images, many of these problems can be fixed.

If I use a high ISO setting (say ISO 800), I get a fair bit of digital noise. But NeatImage does a decent job of reducing noise (thanks, Dhar!), so the result is not too bad.

If I use a low ISO setting (say ISO 100), I get clean images in bright light, but blurred images in low light (no tripod, no flash, you see). I haven’t found anything that can recover from a blurred image.

I decided, on the balance, to have a slightly higher ISO setting by default. I get slightly noisier images, but that’s less of a worry.

So I leave the camera in ISO 400. I can quickly shoot indoors. If I have the time and need, I shift to ISO 100, or use a tripod if required. Then I set it back to ISO 400 when done.
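The trade-off can be put in rough numbers. At a fixed aperture and the same exposure, shutter time is inversely proportional to ISO, so each doubling of ISO halves the time the shutter stays open. A small sketch follows; the 1/8 second starting point is a made-up example, not a measurement from my camera.

```javascript
// Shutter time needed at newIso for the same exposure and aperture,
// given the time needed at baseIso. Time scales inversely with ISO.
function shutterTime(baseSeconds, baseIso, newIso) {
    return baseSeconds * baseIso / newIso;
}

// Number of stops gained by moving from one ISO to another.
function stopsGained(fromIso, toIso) {
    return Math.log2(toIso / fromIso);
}

console.log(shutterTime(1 / 8, 100, 400)); // 0.03125, i.e. 1/32 second
console.log(stopsGained(100, 400));        // 2
```

A 1/8 second exposure is generally hard to hold steady by hand, while 1/32 second is far more forgiving, which is the reasoning behind the ISO 400 default.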

Tamil songs quiz 2006-2007

Here is the background music from some hit songs from 2006 and 2007. Can you guess which movie they are from?

Don’t worry about the spelling. Just spell it like it sounds, and the box will turn green. (A couple of spellings are tricky. Try J instead of S and V instead of G.)

Search for the song and listen online, if you want to confirm your guess.

Song 1
Song 2
Song 3
Song 4
Song 5
Song 6
Song 7
Song 8
Song 9
Song 10

Making my music search engine faster

My music search engine takes quite a while to load (typically 40 seconds). That’s an unusually long time for a page, given that most of the people that access it are on broadband connections, and are listening to music online.

The reason is, firstly, that I’m loading a lot of data. Literally every single song that you can search comes through as Javascript. All the downloadable Hindi songs, for instance, occupy 1.3 megabytes before compression. On average, this takes about 20 seconds to load.

The second reason is, I’m doing a lot of processing with the data. Two things take the bulk of the time: uncompressing the data, and massaging it to allow for different spellings. On average, that takes about 20 seconds.

40 seconds is too long to wait. Actually, the perceived waiting time is 8 seconds, because I start showing the search and the results after 1/5th of the data is downloaded. But 8 seconds is still too long.

I managed to cut this time by half with two things I just learned.

Firstly, Javascript loads and executes sequentially. As soon as a Javascript block is loaded, it is executed. While it is being executed, the rest of the HTML file is not parsed. So any subsequent Javascript blocks (or images, CSS, any other external links) are not loaded. Bandwidth is just sitting idle while the browser is busy at work on the Javascript.

Since I’ve split all my songs into five equal-sized Javascript files, my search engine loads the Javascript files one after another! The problem is even worse — the calculations I do in Javascript take up as much time as the load time. If the loading went on in parallel, by the time the first calculation is done, the second script would have loaded.

This problem can be solved for Internet Explorer and Opera. The “defer” attribute loads the scripts in the background, and defers their execution. This reduces the loading time to nearly zero for all the Javascript files except for the first one or two, because by the time my calculations are done, the next script is already loaded.

Javascript loading sequence before and after 'defer'
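A toy model of the two loading sequences, using the rough figures from this post: five files, with about 20 seconds of loading and 20 seconds of processing in total, so about 4 seconds of each per file. The model itself is an assumption, not a measurement: it supposes the files download one after another and that processing a file can only start once it has arrived and the previous one is done.

```javascript
// Without defer: the browser downloads a file, runs it, then starts the next.
function sequentialTime(files, loadSecs, computeSecs) {
    return files * (loadSecs + computeSecs);
}

// With defer: downloads continue in the background while earlier files run.
function overlappedTime(files, loadSecs, computeSecs) {
    let downloaded = 0; // time at which the current file finishes downloading
    let finished = 0;   // time at which the current file finishes processing
    for (let i = 0; i < files; i++) {
        downloaded += loadSecs;
        finished = Math.max(finished, downloaded) + computeSecs;
    }
    return finished;
}

console.log(sequentialTime(5, 4, 4)); // 40
console.log(overlappedTime(5, 4, 4)); // 24
```

In this model only the first download is ever waited for. Every later file has already arrived by the time the processing of the previous one ends.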

These Javascript files contain a list of songs as a long string, which I then split into individual songs. Then I modify each song using regular expressions so that approximate matches will still work. (For example, typing “aa” is the same as typing “a” on this search engine.) The performance of regular expressions is critical to me.

Originally, I did this:

  1. Split the string into songs
  2. Modify each song using regular expressions

Now, I changed the sequence to this:

  1. Modify the string using regular expressions
  2. Split the string into songs
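The two orderings can be sketched as follows. The normalisation rule here (lower-case, then collapse repeated letters) is a hypothetical stand-in for the real rules, the newline delimiter is an assumption, and the sample titles are made up.

```javascript
// Hypothetical normalisation rule: lower-case the text and collapse runs of
// a repeated letter, so that "aa" matches "a".
function normalise(text) {
    return text.toLowerCase().replace(/([a-z])\1+/g, '$1');
}

// Before: split first, then run the regular expression once per song.
function splitThenModify(data) {
    return data.split('\n').map(normalise);
}

// After: run the regular expression once over the whole string, then split.
function modifyThenSplit(data) {
    return normalise(data).split('\n');
}

const sample = 'Kaadhal Paattu\nAnnbe Vaa'; // made-up sample titles
console.log(splitThenModify(sample)); // [ 'kadhal patu', 'anbe va' ]
console.log(modifyThenSplit(sample)); // [ 'kadhal patu', 'anbe va' ]
```

The swap is only safe when the regular expression can never match across the delimiter. Here the pattern matches letters only, so the newlines between songs are left alone and both orderings give the same list.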

When I timed the speed of this change, I realised browsers differ a lot in the speed of processing regular expressions. Here is the time (in seconds) for each browser to process the Javascript before and after the change.

Browser            Before  After
Internet Explorer     6.3    5.0
Firefox              17.7    4.7
Opera                93.8   19.9

Internet Explorer wasn’t affected much by this change. But Firefox and Opera were heavily impacted. I’m guessing this is because Firefox and Opera have a high setup time for matching regular expressions. Before, I was matching thousands of strings. After, I was matching only one (long) string. IE didn’t care much. Firefox and Opera sped up dramatically. (I suspect my Opera data is skewed by a few people using a rather slow machine… but then, that’s probably why they should use Opera in the first place: it works rather well on old machines.)

With these changes, my total load time fell from about 40 seconds to about 20 seconds. Pretty decent improvement.

There are two further steps I could take from here on.

Compress the songs files further

Currently, I’m doing two things to compress the songs.

  1. Instead of sending the list as a (movie-song),(movie-song),... combination for every song, I send it as a (movie-song,song,song,...)(movie-song,song,song,...)... combination. So I don’t repeat the movie names.
  2. Secondly, I compress them via HTTP/1.1. (My server doesn’t let me do that with Javascript, actually, because Netscape 4.x doesn’t accept compressed Javascript. But since it’s such an old browser and none of my viewers use it, I trick the server by renaming my Javascript files as .html, and it passes them on compressed.)
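A minimal sketch of unpacking the grouped format from the first point. The exact delimiters are assumptions based on the description above, the sketch assumes names contain no "-", ",", "(" or ")", and the movie and song names are placeholders.

```javascript
// Unpacks "(movie-song,song)(movie-song)" into {movie, song} pairs.
function unpack(packed) {
    const pairs = [];
    // Strip the outer parentheses, then split between groups.
    for (const group of packed.slice(1, -1).split(')(')) {
        const dash = group.indexOf('-');
        const movie = group.slice(0, dash);
        for (const song of group.slice(dash + 1).split(',')) {
            pairs.push({ movie: movie, song: song });
        }
    }
    return pairs;
}

const packed = '(Movie A-Song 1,Song 2,Song 3)(Movie B-Song 4)';
console.log(unpack(packed).length); // 4
```

Each movie name is sent once per group rather than once per song, which is where the saving comes from.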

What I could do additionally is:

  1. Remove duplicate songs. If songs are spelt differently, I include them both. But I can knock off the duplicate ones.
  2. Compress the text. Though I’m gzipping the text before sending it, I suspect there may be ways of storing the data in a more compressed form. For example, many songs are repeated with the (male) and (female) versions. These could be clubbed (somehow…)

Speed up the Javascript calculations further

There are two steps I’m doing right now:

  1. Modify the string using regular expressions
  2. Split the string into songs

Here’s how much time each step takes (in seconds) across browsers. (Note: these were based on one day’s data, sampling about 2,000 page views)

Browser            Regular expression  Split
Internet Explorer                0.88   3.04
Firefox                          3.76   1.08
Safari                           0.47   0.46
Opera                            4.96  29.78

Firefox takes a lot of time with regular expressions. IE and Opera take a lot longer to split (which involves creating many objects). I may have to think up different optimisations for the two.
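For anyone wanting to compare browsers the same way, per-step timings can be collected with something like the sketch below. This is not the measurement code behind the table above: the timed helper, the sample data and the "aa" rule are all stand-ins for illustration.

```javascript
// Runs fn, returning its result and the elapsed wall-clock time in seconds.
function timed(fn) {
    const start = Date.now();
    const result = fn();
    return { result: result, seconds: (Date.now() - start) / 1000 };
}

// Hypothetical stand-ins for the two steps, run on made-up data.
const raw = 'Song Title\n'.repeat(10000);
const modified = timed(() => raw.replace(/aa/g, 'a'));
const split = timed(() => modified.result.split('\n'));

console.log('regex: ' + modified.seconds + 's, split: ' + split.seconds + 's');
```

Timing each step separately is what shows which of the two a given browser is slow at.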

The code is available in the HTML of the song search engine. It’s readable. The stuff I’m talking about is in the function sl(p,s). Any thoughts are welcome!

Reducing the server load

I’ve been using a shared hosting service with 100 WebSpace over the last 7 years. It’s an ad-free account that offers 100MB of space and 3GB of bandwidth per month. Things were fine until two months ago, which was when my song search engines started attracting an audience. I had anticipated that I might run out of bandwidth, so I used a different server (that has a quota of 5GB of bandwidth per month) for loading the songs. But what I didn’t anticipate was that my server load would run over the allotted CPU limit.

You’d think this is unusual, given how cheap computing power is, and that I’d run out of bandwidth quicker. But no. My allotted limit was 1.3% of CPU usage (whatever that meant), and about 2 months ago, I hit 1.5% a few times. I upgraded my account to one which had a 2.5% limit immediately, but the question was: why did this happen?

This blog uses a lot of Perl scripts. I store all articles on a MySQL database. Every time a link is requested, I dynamically generate the HTML by pulling up the article from the MySQL database and formatting the text based on a template.

Schematic of how my website displays pages

I also use MySQL to store users’ comments. Every time I display each page, I also pull out the comments related to that page.

I can’t store the files directly as HTML because I keep changing the template. Every time I change the template, I have to regenerate all the files. If I do that on my laptop and upload it, I consume a lot of bandwidth. If I do that on the server, I consume a lot of server CPU time.

Anyway, since I’d upgraded my account, I thought things would be fine. Two weeks ago, I hit the 2.5% limit as well. No choice. Had to do something.

If you read the O’Reilly Radar Database War Stories, you’ll gather that databases are great for queries, joins and the like, while flat files are better for processing large volumes of data as a batch. Since page requests come one by one, and I don’t need to do much batch processing, I’d gone in for a MySQL design. But there’s a fair bit of overhead to each database query, and that’s the key problem. Perl takes a while to load (and I suspect my server is not using mod_perl). The DBI module takes a while to load. Connecting to MySQL takes a while. (The query itself, though, is quite fast.)

So I moved to flat files instead. Instead of looking up from a database, I just look up a text file using grep. (I don’t use Perl’s regular expression matching because regular expression matching in UNIX is faster than in Perl.) I have a 1.6MB text file that contains all my blog entries.

But looking up a 1.6MB text file takes a while. So I split the file based on the first letter of the title. So this post (Reducing the server load) would go under a file x.r.txt (for ‘R’) while my last post (Calvin and Hobbes animated) would go under a file x.c.txt (for ‘C’). This speeds up the grep by a factor of 5-10.
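The bucketing rule can be sketched like this. The site itself uses Perl and grep, so this Javascript version is only an illustration of the naming scheme; the function names and the x.0.txt fallback for titles that don’t start with a letter are assumptions.

```javascript
// Name of the bucket file for a post title: "x." + first letter + ".txt".
// Titles that do not start with a letter go to a hypothetical "x.0.txt".
function bucketFile(title) {
    const first = title.trim().charAt(0).toLowerCase();
    return 'x.' + (/[a-z]/.test(first) ? first : '0') + '.txt';
}

// Groups titles by bucket file, so a lookup only has to scan one small file.
function partition(titles) {
    const buckets = {};
    for (const title of titles) {
        const file = bucketFile(title);
        (buckets[file] = buckets[file] || []).push(title);
    }
    return buckets;
}

console.log(bucketFile('Reducing the server load'));   // x.r.txt
console.log(bucketFile('Calvin and Hobbes animated')); // x.c.txt
```

A lookup then scans only the one file that can contain the title, which is why the grep gets so much faster.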

On average, a MySQL query used to take 0.9 seconds. Now, using grep, it’s down to about 0.5 seconds per query. Flat files reduced the CPU load by about half. (And as a bonus, my site has no SQL code. I never did like SQL that much.)

So that’s why you haven’t seen any posts from me the last couple of weeks. Partly because I didn’t have anything to say. Partly because I was forced to revamp this site.