How I do things

The Search

I was reading John Battelle’s The Search, and realised: We don’t sit down at the computer and say, “Let’s do a search”.

True. We want to get something done. We know it’s out there somewhere. We search.

So every search on a search engine is a commercial opportunity. Contrariwise, every site must let people do what they want to do on the site.

Think… What do people want to do when they’re on YOUR site?

Search queries to my site

On a related note, 60% of the search queries that lead to my site this year were Calvin and Hobbes quotes. “i can’t help but wonder what kind of desperate straits would drive a man to invent this thing.” topped the list (Calvin referring to a yo-yo), with “i always catch these trick questions” following closely.

People searching for Excel-related stuff were next (20%): excel indirect(address(, row() excel offset address and the like.

A few were also looking for me by name or school (10%).

The last 10% ranged from the puzzling to the bizarre, including these gems.

IMDb Top 250 outliers

On the IMDb top 250, you normally see a correlation between the number of votes and the rating for a movie. Better rated movies are more watched. The outliers are interesting.

IMDb: Correlation between number of votes and rating

The movies that are popular despite not having a high rating are:

I can understand why The Sixth Sense, Pirates of the Caribbean and especially The Matrix are on this list: geeks would have watched these and voted on IMDb, though the ratings they gave need not have been high. But why are Gladiator and Sixth Sense on that list?

Movies that are highly rated, but not as popular are:

Seven Samurai and The Good, the Bad and the Ugly probably didn’t get the votes they deserve because they’re listed under their Japanese and Italian titles on IMDb. I hadn’t seen them for a long time for the same reason. As for The Godfather, I personally think it’s just overrated. But Rear Window? That’s a surprise. Hitchcock thriller with all the classic elements…

Another correlation is between the rating and the year of the movie. Early movies get lower ratings than recent movies. Technique could be the reason, but I doubt it. In any case, some movies stand out from their time.

IMDb: Correlation between rating and year of movie

I haven’t seen Metropolis or M. But among the others, I think Citizen Kane is the one that deserves to stand out, if only for portraying the anti-hero, and for not having a happy ending. The Shawshank Redemption was a bit of a surprise. Few people that I know have heard of it. And yet, there it is, right on top.

How I buy gadgets

I’m a cautious gadget freak. I love buying gadgets, but think a lot before buying them. Invariably, I use spreadsheets to help me decide. I try to buy only those gadgets that are right for me at the cheapest possible price, and I look at two things: features based on usage and breakeven.

Usage-driven buying

I pick the features I want based on my usage. For example, when I bought my first mobile, I listed my most likely uses for the phone:

  • I’m in the car (e.g. 2 hr drive to airport), and want to catch up
  • Emergency calls (means, carry the phone always)

So I need high battery life (at least 2 hours). I need low weight, if I’m going to carry it around. I don’t need colour display or MMS for my usage pattern. Then I ran through all available mobile phone options, filtered them against my criteria, and picked the cheapest (Nokia 3310).

Another example was my digital camera. The reason I wanted one was:

  • I can take a lot more photographs and print only those I want
  • For low light shots, take multiple snaps, so at least one will be OK
  • I can just take one snap and print it, and not have to complete a roll

So my camera should be light (to carry around and take lots of snaps), have a high ISO rating and flash (to work well in low light), and needn’t have much memory (I transfer the photos to my laptop pretty quickly).

Having identified such features, I compared models (Internet / visit shops) in 2002.

Product            Price  Size   Flash  Mpx  Zoom  Mem   Comment
Kodak DC3400       16500  -      Y      2    2x    -     No: 2x zoom not enough
Canon S10          20000  Small  Y      2    2x    -     No: 2x zoom not enough
Sony DSC P50       20000  -      Y      2    3x    4MB   No: too little memory
Nikon 775          19000  Small  Y      2    3x    8MB   OK
Fuji FinePix 2600  15000  OK     Y      2    3x    16MB  OK
Olympus D-230      15000  Small  Y      2    None  16MB  No: no zoom
Nikon 885          27500  OK     Y      3    3x    16MB  Too little manual control
Canon G1           40000  -      Y      3    3x    -     Too expensive
Sony DSC S85       40000  -      Y      4    3x    16MB  Slow shutter
Canon G2           45000  -      Y      4    3x    16MB  Too expensive
Olympus C4040      45000  -      Y      4    3x    -     Too expensive

I finally picked the Fuji FinePix 2600.
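The filter-then-pick-the-cheapest step can be sketched in a few lines of Python. This is only an illustration: the thresholds (at least 3x zoom, at least 8MB of memory) are inferred from the comments in the table, a missing memory value is treated as failing the filter, and the `Camera` class and `pick_cheapest` function are hypothetical names. The non-numeric objections (manual control, shutter speed) are not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Camera:
    name: str
    price: int                  # price as listed in the table
    zoom: Optional[int]         # zoom factor, None if the camera has no zoom
    memory_mb: Optional[int]    # bundled memory in MB, None where the table has no value


CAMERAS = [
    Camera("Kodak DC3400", 16500, 2, None),
    Camera("Canon S10", 20000, 2, None),
    Camera("Sony DSC P50", 20000, 3, 4),
    Camera("Nikon 775", 19000, 3, 8),
    Camera("Fuji FinePix 2600", 15000, 3, 16),
    Camera("Olympus D-230", 15000, None, 16),
    Camera("Nikon 885", 27500, 3, 16),
    Camera("Canon G1", 40000, 3, None),
    Camera("Sony DSC S85", 40000, 3, 16),
    Camera("Canon G2", 45000, 3, 16),
    Camera("Olympus C4040", 45000, 3, None),
]


def pick_cheapest(cameras, min_zoom=3, min_memory_mb=8):
    """Keep cameras that meet both thresholds, then return the cheapest one."""
    shortlist = [
        c for c in cameras
        if c.zoom is not None and c.zoom >= min_zoom
        and c.memory_mb is not None and c.memory_mb >= min_memory_mb
    ]
    return min(shortlist, key=lambda c: c.price) if shortlist else None


choice = pick_cheapest(CAMERAS)
print(choice.name, choice.price)  # Fuji FinePix 2600 15000
```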

Breakeven

I had a normal camera. Would a digital camera be economically worth it? For a normal camera, the roll costs Rs 2.5 per shot (Rs 90 / 36 shots), developing costs Rs 2.8 per shot (Rs 100 / 36 shots), and each print costs Rs 5. Total cost per photo: Rs 10.3. With a digital camera I pay none of this: I don’t need prints, because I see pictures on the computer. The digital camera cost me Rs 20,000 including customs duty. So I break even when I take about 2,000 pictures (Rs 20,000 / Rs 10.3 per photo). That sounded feasible, so I switched to digital in 2002. (I’ve taken about 2,800 snaps since.)

For similar reasons, I also decided I didn’t need a colour printer. Given my expected usage, it would have cost me Rs 34 for a single 4″ x 6″ colour photo printout. I could get the same at a shop for Rs 8.

Recently, I bought a DVD writer. DVDs cost about the same as CDs in bulk. (I bought a 100 DVD pack for 14 pounds, and 100 CDs for about the same.) A DVD stores 6 times as much as a CD. So for every DVD I burn, I save the cost of 5 CDs, about 70 pence. A DVD writer cost 50 pounds. So after burning about 70 DVDs (50 pounds / 70 pence), I’d break even. Once I’m through with my pack of 100 DVDs, I’m guaranteed breakeven. (I’ve burned about 25 DVDs to date.)
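The breakeven arithmetic for the camera and the DVD writer follows the same pattern: divide the purchase price by the saving per use. A minimal Python sketch using the figures quoted above (the function name `breakeven_units` is just illustrative, and rounding up to a whole unit is a choice made here):

```python
import math


def breakeven_units(fixed_cost, saving_per_unit):
    """Number of uses after which the savings cover the purchase price."""
    return math.ceil(fixed_cost / saving_per_unit)


# Film camera: cost per photo = roll + developing + print (all in Rs).
film_cost_per_photo = 90 / 36 + 100 / 36 + 5
print(round(film_cost_per_photo, 1))                 # 10.3
print(breakeven_units(20_000, film_cost_per_photo))  # 1946

# DVD writer: a DVD holds what 6 CDs hold, so each DVD saves the price of 5 CDs.
cd_price_pence = 1400 / 100          # 100 CDs for about 14 pounds
saving_per_dvd = 5 * cd_price_pence  # about 70 pence
print(breakeven_units(5000, saving_per_dvd))         # 72
```

Both results agree with the rounded figures in the text: roughly 2,000 photos and roughly 70 DVDs.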

Tracking

I don’t stop there. After buying, I track my usage. Where I’ve done a breakeven, I try to track quantitatively. Otherwise, I track my usage pattern (high / medium / low). So far, my best return-on-investment has been on my webcam and mic, followed by my digital camera, CD writer, video camera and DVD writer. The worst have been my TV tuner card (I didn’t really record many movies), and my second mobile phone (turned out I didn’t really use GPRS).

I once started doing this sort of analysis for my clothes, but stopped… maybe I was carrying this a bit too far…

MP3 bitrates and sound quality

At what bitrate should you encode your MP3 files? Listening tests show that at 256kbps, you can’t tell the difference. But that’s with 2 amplifiers and big speakers. What about headphones?

I tried an experiment with my cousin, who has the best ear for music that I know. We ripped a good audio CD of his at 128 kbps. He put on a pair of headphones (the kind that fit into your ear) connected to my laptop. I played the first half a minute of the original and the ripped version 10 times, in a random order, asking him to guess which was which. Result: 5 correct and 5 wrong. He couldn’t tell the difference.

We tried again, ripping at 64kbps this time. Same experiment, and surprisingly, same result — 5 correct and 5 wrong.
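One way to read a score like 5 out of 10 is to ask how likely it is under pure guessing. The sketch below assumes each trial is an independent coin flip, which is a simplification of the experiment and not something measured here; `prob_at_least` is an illustrative name.

```python
from math import comb


def prob_at_least(k, n=10, p=0.5):
    """Probability of k or more correct answers out of n by guessing alone."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


print(round(prob_at_least(5), 3))  # 0.623
print(round(prob_at_least(9), 3))  # 0.011
```

So 5 correct is exactly what guessing tends to produce, while 9 or more correct would happen by chance only about 1% of the time.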

Conclusion: With a pair of headphones, even a good ear can’t tell the difference between a 64kbps MP3 and an original CD. So, if you want to cram more songs into your iPod, just re-encode them at 64kbps. You’ll easily halve the size, as most of them are at least 128kbps.

Python vs Perl

Python vs Perl. Sums up my feelings perfectly: Python may be better for larger projects, but for my meddling, I’ll stick to Perl. It’s served me well for 10 years.

Until 1999, I used Perl a fair bit, but no more than Java or C or anything else. My first “real-life” use of Perl was in 2000, when I was processing 600MB of IBES data. Access and SPSS couldn’t handle the load. Perl slurped all the data in a few seconds, though. A few years later, when processing bank data (3GB worth, this time), Perl again was the only saviour. In fact, between Excel and Perl (and CPAN), I think I have all the data analysis power I’ve ever needed. This blog, for instance, is written in an Excel spreadsheet, exported to XML, and converted into the blog format by Perl.

How I listen to music

I have a large MP3 collection (Tamil and Hindi films). I don’t like selecting songs to listen to. Too much effort.

I rated all songs I had listened to (650 songs x 5-10 seconds = 1-2 hrs) and created 7 SmartViews. I just go to one of these and play them in order. Here are my views, in descending order of their use.

  1. Most played. Sorted by Play Count. Songs I play the most. Plays stuff I listen to usually.
  2. Not heard recently. Played Last before 3 months ago AND Rating >= 3. Plays good songs I haven’t heard recently.
  3. Not played much or recently. Played Last before 1 month ago AND Play Count <= 2 AND Rating >= 3. Plays good songs I haven’t heard often enough.
  4. Recent hits. Last updated after 3 months ago AND Play count >= 3. Plays songs recently added and liked.
  5. Recently played. Sorted by Played Last. Often, I like to listen to songs I listened to yesterday.
  6. Top rated. Sorted by Rating. My best songs. (Surprisingly, I don’t use this view much.)
  7. Recently added. Sorted by Last Updated. Plays songs I just downloaded.
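The filter-based views above are simple predicates over a song’s rating, play count and dates. Here is a minimal Python sketch of four of them. It is not WinAmp’s query syntax: the `Song` record and its field names are hypothetical, and “3 months” and “1 month” are approximated as 90 and 30 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Song:
    title: str
    rating: int              # e.g. 1 to 5
    play_count: int
    played_last: datetime
    last_updated: datetime   # when the file was added or changed


def most_played(songs):
    """View 1: all songs, highest play count first."""
    return sorted(songs, key=lambda s: s.play_count, reverse=True)


def not_heard_recently(songs, now):
    """View 2: good songs last played more than three months ago."""
    cutoff = now - timedelta(days=90)
    return [s for s in songs if s.played_last < cutoff and s.rating >= 3]


def not_played_much_or_recently(songs, now):
    """View 3: good songs played at most twice, and not in the last month."""
    cutoff = now - timedelta(days=30)
    return [s for s in songs
            if s.played_last < cutoff and s.play_count <= 2 and s.rating >= 3]


def recent_hits(songs, now):
    """View 4: songs added in the last three months and played at least 3 times."""
    cutoff = now - timedelta(days=90)
    return [s for s in songs if s.last_updated > cutoff and s.play_count >= 3]
```

Each function takes the whole library and returns the songs for that view, so playing a view is just iterating over the returned list.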

But WinAmp’s not good enough. For example, I can’t find out what songs I played at least thrice last month. How do I see what I’ve been listening to a lot recently? Fortunately, there are a few WinAmp history plugins. I installed Pepper, which produces a log file that can be analysed. I did this two weeks ago, and don’t have enough data yet. When I do, I’ll modify two views:

  1. Not played much or recently. I’ll change this to “Not heard much recently” – Rating >= 3, Play Count > 5, Play Count = 0 last month.
  2. Recent hits. Modify it to show songs played at least thrice last month.
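Once a play log exists, “played at least thrice last month” is a count over a time window. The sketch below assumes the log has already been parsed into (timestamp, title) pairs. Pepper’s actual log format is not described here, so that parsing step is left out, and the function names are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta


def plays_in_window(events, now, days=30):
    """Count plays per song within the last `days` days.

    `events` is an iterable of (played_at, title) pairs.
    """
    cutoff = now - timedelta(days=days)
    return Counter(title for played_at, title in events if played_at >= cutoff)


def songs_played_at_least(events, now, min_plays=3, days=30):
    """Titles played at least `min_plays` times in the window, sorted by name."""
    counts = plays_in_window(events, now, days)
    return sorted(title for title, n in counts.items() if n >= min_plays)
```

The same counter also answers the other question: a song with a high overall play count that is missing from `plays_in_window` has not been heard in the last month.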

Matching misspelt Tamil movie names

I don’t like hunting for new songs either. Too much effort.

External recommendations like Raaga Top 10 help, but not much. I usually like only 1 of the top 10.

I don’t really know the recent music directors. But many interesting songs I’ve heard recently (like Ondra Renda in Kakka Kakka, Vaseegara in Minnale, and Kaadhalikkum in Chellame) are by Harris Jayaraj. So maybe if I can find the music directors I like, other songs by them would be good recommendations.

I have an automated way to find the music director for a movie. First, I spent a few hours renaming my MP3s to a Movie.Song.mp3 filename format (using Excel and Perl liberally). After that, I wrote a Perl program that reads movie names and their music directors from Raaga and matches the Raaga movie names with my movie names. (Raaga has all but 5 movies whose songs I’ve heard.) Then I rate music directors based on my songs’ ratings.
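The last step, rating music directors from song ratings, might look like the following. This is a Python sketch, not the Perl program described above. Averaging the ratings is an assumption (the scoring formula isn’t stated), and `parse_filename` and `rate_directors` are hypothetical names.

```python
from collections import defaultdict
from statistics import mean


def parse_filename(filename):
    """Split 'Movie.Song.mp3' into (movie, song)."""
    stem = filename.rsplit(".", 1)[0]   # drop the .mp3 extension
    movie, song = stem.split(".", 1)
    return movie, song


def rate_directors(song_ratings, movie_director):
    """Average song ratings per music director, best first.

    song_ratings:   {(movie, song): rating}
    movie_director: {movie: music director}, e.g. as looked up on Raaga
    """
    by_director = defaultdict(list)
    for (movie, _song), rating in song_ratings.items():
        director = movie_director.get(movie)
        if director is not None:        # skip movies that could not be matched
            by_director[director].append(rating)
    return sorted(((mean(r), d) for d, r in by_director.items()), reverse=True)
```

Movies with no match are simply skipped, which is why the match rate discussed next matters: every unmatched movie is a set of ratings that never reaches its music director.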

Unfortunately, the matching worked only for 45% of my 273 movies. The rest were spelt differently on my list and Raaga. I checked CPAN for a way to match Tamil words roughly. The closest was Lingua::Phonology, but Jesse, the author, mailed me saying that was “like slicing your bread with a chainsaw”.

So I developed these rules. The -> arrow below is to be read as “is also spelt as”. By just applying them sequentially, I matched 33% more movies.

Vowel rules
AEdhiri -> Edhiri
kadhal kondEIn -> kadhal kondEn
chellamEY (at end) -> chellamE
sachIEn -> sachIn
marupadIUm -> marupadIYUm
OI, OY, OVI, OYI are all the same
AAthma -> Athma
azhagiya thEEye -> azhagiya thIye
abOOrva ragam -> abUrva ragam
Ignore H. It is redundant.

Consonant rules
arasakTCHi -> arasakSHi
CHippikkul muthu -> Sippikkul muthu
thenNDRal -> thenNRal
devar maHan -> devar maGan
bagaWathi -> bagaVathi
avvai shanmuGi -> avvai shanmuKi
konJi pesalam -> konCHi pesalam
anDha 7 naatkal -> anTha 7 naatkal
aBoorva sagodharargal -> aPoorva sagodharargal
agni natchaTHIRam -> agni natchaTHRam
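
Applied sequentially, rules like these reduce each title to a spelling-insensitive key, and two titles match when their keys are equal. The Python sketch below is one possible encoding, not the original Perl. The rule order, stripping spaces before matching, and turning H into G only between vowels are all assumptions made so that the rules don’t interfere with each other.

```python
import re

# Ordered (pattern, replacement) pairs. Rules that mention H run first,
# because the final rule drops every remaining H.
RULES = [
    (r"tch", "sh"),                      # arasakTCHi -> arasakSHi
    (r"j", "ch"),                        # konJi -> konCHi
    (r"ch", "s"),                        # CHippikkul -> Sippikkul
    (r"ndr", "nr"),                      # thenNDRal -> thenNRal
    (r"(?<=[aeiou])h(?=[aeiou])", "g"),  # maHan -> maGan (assumed: only between vowels)
    (r"w", "v"),
    (r"g", "k"),
    (r"dh", "th"),
    (r"b", "p"),
    (r"thir", "thr"),
    (r"ae", "e"),
    (r"ei", "e"),
    (r"ey$", "e"),                       # only at the end of the title
    (r"ie", "i"),
    (r"iu", "iyu"),
    (r"o(?:vi|yi|y|i)", "oi"),           # OI, OY, OVI, OYI are all the same
    (r"aa", "a"),
    (r"ee", "i"),
    (r"oo", "u"),
    (r"h", ""),                          # ignore H
]


def normalise(title):
    """Reduce a movie title to a spelling-insensitive key."""
    key = re.sub(r"[^a-z0-9]", "", title.lower())
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key


print(normalise("devar mahan") == normalise("devar magan"))        # True
print(normalise("kadhal kondein") == normalise("kadhal konden"))   # True
```

The keys are not meant to be readable; they only need to be identical for both spellings of the same movie.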

The remaining movies either had spelling mistakes (e.g. Kilipethcu Ketkavaa) or had structural differences (Ilamai Oonjal Aadugiradhu vs Ilamai Oonjal Aadudhu). By permitting approximate matches using String::Approx, I was able to match 12% more, making my total accuracy ~90%.
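For readers without Perl, the approximate-matching step can be sketched in Python with the standard library’s `difflib`. This is a stand-in for String::Approx, not a port of it: `difflib` scores similarity by matching blocks, not by a fixed edit budget, and the 0.8 cutoff is an arbitrary choice for illustration.

```python
import difflib


def best_match(title, candidates, cutoff=0.8):
    """Return the candidate most similar to `title`, or None if none is close enough."""
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(title.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


raaga_titles = ["Ilamai Oonjal Aadudhu", "Minnale", "Chellame"]
print(best_match("Ilamai Oonjal Aadugiradhu", raaga_titles))  # Ilamai Oonjal Aadudhu
print(best_match("Kakka Kakka", raaga_titles))                # None
```

A lower cutoff catches more misspellings but also produces more false matches, so it only makes sense as a last pass after exact and rule-based matching.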

Though this is good enough for identifying music directors, I’m working on improving the approximate matching rules. I hope to have 98% accuracy, and then I can match individual songs — and know who the singers are. Hopefully, this can be extended to other sites like MusicIndiaOnline, and who knows — maybe even IMDb.

Why Google Reader

I switched to Google Reader as my blog reader (I was using Mozilla until now). The reason was simple: speed. Thanks to the Google site’s speed and keyboard navigation, I can read blog entries 10 times faster. Now there’s a unique proposition for Google that a lot of people are missing: that their site loads a whole lot faster than others. It makes a huge difference to the whole browsing experience.

Autoblog

I have an automated (and lazy) way of finding interesting sites. This is what I do every day.

  1. I get the del.icio.us tags of every URL I blog about. (It’s available at http://del.icio.us/rss/url/ followed by the MD5 hex version of the URL).
  2. I pick the most popular tags (at least 50 links must have this tag), and use them as my “preferred tags”
  3. I scan the most popular sites on del.icio.us, and get each site’s tags
  4. If a site has my preferred tags, I give it points (the number of points is equal to the number of times I’ve blogged that tag)
  5. I pick the top 5 sites based on my points, and read them.
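
The five steps can be sketched in Python as below. Fetching and parsing the feeds is left out, so the functions take already-extracted tag lists. The function names are hypothetical, and the “at least 50 links” threshold is read here as a minimum count of the tag across the blogged URLs, which is one interpretation of step 2.

```python
import hashlib
from collections import Counter


def tag_feed_url(url):
    """Step 1: the feed address is the base URL plus the MD5 hex digest of the URL."""
    return "http://del.icio.us/rss/url/" + hashlib.md5(url.encode("utf-8")).hexdigest()


def preferred_tags(blogged_tags, min_count=50):
    """Steps 1-2: count tags across blogged URLs and keep the popular ones.

    blogged_tags: one list of tags per blogged URL.
    Returns {tag: number of times blogged}, which doubles as the points table.
    """
    counts = Counter(tag for tags in blogged_tags for tag in tags)
    return {tag: n for tag, n in counts.items() if n >= min_count}


def top_sites(site_tags, weights, n=5):
    """Steps 3-5: score each popular site by its preferred tags and keep the top n."""
    scores = {site: sum(weights.get(tag, 0) for tag in tags)
              for site, tags in site_tags.items()}
    ranked = sorted(scores, key=lambda site: (-scores[site], site))
    return [site for site in ranked if scores[site] > 0][:n]
```

Ties are broken alphabetically and sites with no preferred tags are dropped, both arbitrary choices for the sketch.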

There are two problems I have now. Firstly, I will find sites similar to those I have blogged about — not discover anything new. That’s fine to start with — I can search for those manually. The bigger problem is, this is restricted to del.icio.us. There are two ways I can extend this (lazily).

  1. By finding new sources of popular URLs (which requires a site with a list of popular URLs updated daily, which I will find interesting)
  2. By finding new sites that tag URLs (which ideally requires an API to get the tags for a given URL)

There are lots of sources for popular URLs. But though many sites, notably Technorati, tag URLs, none that I know of have APIs.