How I do things

My Fuji Finepix S5600

My digital camera conked out. The cover that holds the battery fell off, and I can’t use it any more.

I went back to my buying principles, and prepared an Excel sheet to choose my next camera. Here’s what I was looking for:

  • Low-light photography. Flashes are lousy. This effectively means I need ISO control.
  • Shutter speed control. I sometimes take really long exposure (3-10s) snaps, and sometimes can’t afford the blur (1/250s).
  • Long battery life. My current camera consumed batteries like crazy.
  • Fast start-up. By the time I got my earlier camera out and it started, it was too late.
  • RAW mode. Gives me more control in Photoshop.

I didn’t care about:

  • megapixels. 2 megapixels (1600×1200) is more than enough, even for my printouts. Anything bigger takes too much space besides.
  • zoom. I need wide-angle more than zoom, really.
  • removable lens. I’m not going to carry around multiple lenses.

After scouting around on Amazon for many months, I found the Fuji Finepix S5600. Not an SLR, but had all the features that I wanted, and at a pretty reasonable price.

Fuji Finepix S5600

Here’s a shot I took from my drawing room. This is a 3-second exposure on ISO 100 at F 3.2. The streaks on the road are car headlights.

2006-11-28 01 Newbury Park

As a bonus, it had a pretty good (10X) zoom too. See the brightly lit buildings towards the top-left? That’s Canary Wharf. Below is a blow-up of those buildings from the same spot I took the above photo from.

2006-11-29 01 Canary Wharf

Google search in Tamil

When I wrote my Tamil song lyrics quizzes, I had two problems:

  1. I can’t write in Tamil (neither on paper nor on a computer)
  2. I can’t spell right in Tamil (ந vs ன, ர vs ற)

I overcame the first using a Tamil transliterator. I write in English, and you see it in Tamil.

The problem of ந vs ன was simple. ந occurs as the first letter of a word, and just before த. Nowhere else. (Is this always true?)
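A minimal Python sketch of that rule, working directly on Tamil text. It is only the heuristic stated above (the open question about exceptions still stands), and the function name `normalise_n` is made up for illustration.

```python
# Unicode code points for the letters involved.
NA = "\u0BA8"     # ந
NNNA = "\u0BA9"   # ன
THA = "\u0BA4"    # த
PULLI = "\u0BCD"  # the dot that strips the inherent vowel

def normalise_n(word):
    """Rewrite every ந/ன in a word using the heuristic:
    ந at the start of the word or just before த, ன everywhere else."""
    out = []
    for i, ch in enumerate(word):
        if ch in (NA, NNNA):
            # "just before த" means this letter, a pulli, then த
            before_tha = word[i + 1:i + 3] == PULLI + THA
            ch = NA if (i == 0 or before_tha) else NNNA
        out.append(ch)
    return "".join(out)
```

Feeding it a word typed with the wrong n in both places swaps each one according to its position; any word that breaks the heuristic will come out wrong.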

But ர vs ற can’t be solved except through experience, and I’m short of that. So, rather than bother my family with every quiz, I used the wisdom of crowds. I googled both spellings of the word. The correct spelling has more Google hits than the incorrect one.
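Once the hit counts are in hand, the comparison is tiny. The sketch below assumes a hypothetical `hit_count` function that returns the number of search hits for a spelling; how to fetch real counts is left out, and the numbers here are made up.

```python
def likelier_spelling(candidates, hit_count):
    """Return the candidate spelling with the most search hits.

    hit_count is a hypothetical callable: spelling -> number of hits.
    """
    return max(candidates, key=hit_count)

# Made-up counts standing in for real search results (illustration only).
fake_hits = {"spelling_with_ra": 12000, "spelling_with_rra": 300}
best = likelier_spelling(fake_hits, fake_hits.get)
```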

I did this so often, I made a Google gadget out of it.

Just type the word in English, click ‘Search’, and my gadget will search in Tamil. It’s amazing how much stuff there is in Tamil on the Web, from song lyrics to texts (thirukkuraL, for example).

You can add this gadget to:

  • your desktop (in the Search Gadgets box, type “http://www.s-anand.net/a/tamilsearchgadget.xml”)
  • your website or blog (click here for the code)
  • Google Reader

Here’s the transliteration table:

Tamil English
அ a
ஆ A or aa
இ i
ஈ I or ee
உ u
ஊ U or oo
எ e
ஏ E
ஐ ai
ஒ o
ஓ O
ஔ au
க k or g
ங n
ச ch or s
ஜ j
ஞ n
ட t or d
ண N
த th or dh
ந n
ப p or b
ம m
ய y
ர r
ல l
வ v
ழ zh
ள L
ற R
ஷ sh
ஸ S
ஹ h

Automated resume filtering

I had to screen resumes from a leading MBA school. I’m lazy, and there were hundreds of CVs. So after procrastinating until this morning, I decided on 2 principles:

  1. I will not spend more than 45 minutes on this. (That’s the duration of my train ride to office.)
  2. I will not read a single CV. (I would write a program.)

The CVs were in a single PDF file. I saved it as text (it shrank from 66MB to 1.6MB without the photos). Then I wrote a Perl program to filter CVs by keywords. We were looking for people with an interest and/or experience in IT consulting, so I picked “technology”, “consulting”, “SAP”, “IBM”, “Accenture”, “Deloitte”, etc.

Anyone without these keywords would fall out of the list. This eliminated 75% of the crowd. But since I didn’t want to read the rest, I used my favourite text-analysis technique: concordance. I extracted 3 words on either side of each keyword, and just read those. It was easy to see who’d “worked with suppliers like IBM” as opposed to who’d worked at IBM.

That’s it! I managed to cut the list down to 10%. Better yet, I also had a preference ranking. People with multiple keywords ranked higher than those with fewer keywords. And all this took little more than my train ride to office.
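The filter, the concordance and the ranking fit in a few lines. This is a Python sketch rather than the original Perl program, and it assumes the CVs have already been split into a dictionary of name to text (how the single text file was split is not shown here). The names `concordance` and `rank_cvs` are made up for illustration.

```python
import re

KEYWORDS = ["technology", "consulting", "SAP", "IBM", "Accenture", "Deloitte"]

def concordance(text, keywords, window=3):
    """Return (keyword, snippet) pairs: each keyword hit with
    `window` words of context on either side."""
    words = text.split()
    wanted = {k.lower() for k in keywords}
    hits = []
    for i, word in enumerate(words):
        # Strip punctuation and ignore case (so "SAP" also matches "sap").
        bare = re.sub(r"\W+", "", word).lower()
        if bare in wanted:
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            hits.append((bare, " ".join(left + [word] + right)))
    return hits

def rank_cvs(cvs, keywords=KEYWORDS):
    """Drop CVs with no keyword; rank the rest by distinct keywords found.

    cvs is assumed to be a dict of candidate name -> CV text.
    Returns a list of (distinct_keyword_count, name, snippets).
    """
    ranked = []
    for name, text in cvs.items():
        hits = concordance(text, keywords)
        distinct = {keyword for keyword, _ in hits}
        if distinct:
            ranked.append((len(distinct), name, [s for _, s in hits]))
    ranked.sort(key=lambda row: -row[0])
    return ranked
```

Reading only the snippets is what makes “worked with suppliers like IBM” stand out from “worked at IBM”.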

I can see this going to the next level. It’s easy to write a customised rejection letter, depending on which keywords are missing for each person.

Now, if it’s this easy to filter resumes, I can see every organisation doing it in a few years. Which means you need to write resumes for machines as well, not just for humans! For example, on my next CV, I’ll make sure I include the words “Boston Consulting Group” as well as “BCG”, just in case the software searches for only one of those keywords. Further, I’ll make sure I avoid spelling mistakes!

Playing sounds backwards

You can play a video backwards and still recognise the scenes quite well. Can you do that with sound?

I tried it on this Bryan Adams clip of Summer of ’69 (mp3). When played backwards (mp3), it almost sounds like Arabic!

Instruments sound weird backwards too, like the guitar played backwards and drums played backwards.
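Reversing a clip is easy to try at home. Here is a minimal sketch using Python’s standard `wave` module. It assumes an uncompressed WAV file rather than the MP3s linked above, and the function names are made up for illustration. The one subtlety is to reverse whole frames (one sample per channel), not individual bytes.

```python
import wave

def reverse_frames(raw, frame_size):
    """Reverse audio data frame by frame, keeping each frame's bytes in order."""
    frames = [raw[i:i + frame_size] for i in range(0, len(raw), frame_size)]
    return b"".join(reversed(frames))

def reverse_wav(src, dst):
    """Write a backwards copy of the WAV at src to dst.

    src and dst can be file names or open binary file objects.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        raw = reader.readframes(reader.getnframes())
    frame_size = params.sampwidth * params.nchannels
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(reverse_frames(raw, frame_size))
```

For example, `reverse_wav("guitar.wav", "guitar-backwards.wav")`, where both file names are placeholders.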

It seems obvious once you see the wave file. The picture below shows the guitar. The sounds are clearly not symmetric left to right.

Sound wave diagram of a guitar

Whereas this guitar is a lot more symmetric, and doesn’t sound too different backwards.

Sound wave diagram of another guitar

So how come we can’t recognise sounds played backwards, but can recognise video played backwards? (Initially, I thought it was a trivial question. But I couldn’t find a trivial answer. The question may be subtler than it looks.)

Google searches that lead to my site

I stopped using Google Analytics when I redesigned my site. I track my own statistics. This gives me access to raw data, and I can do my own analyses.

I wanted to know the keywords on Google that led to my site. (Google Analytics only gives you phrases.) I also wanted independent words. If you search for “Calvin and Hobbes”, I want to count only “Calvin”, knowing that it’s in the context of “Hobbes”.
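A rough Python sketch of this kind of counting. It assumes the raw data is a list of referrer URLs, that Google puts the search phrase in the `q` parameter, and it uses a small made-up stopword list. It counts every word along with the words it appears beside; picking a single lead word per query (only “Calvin”, not “Hobbes”) would need an extra rule that isn’t shown here.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse, parse_qs

STOPWORDS = {"and", "the", "of", "in", "to", "a"}  # illustrative list only

def query_from_referrer(url):
    """Return the search phrase from a Google referrer URL, or None."""
    parts = urlparse(url)
    if "google." not in parts.netloc:
        return None
    terms = parse_qs(parts.query).get("q")
    return terms[0] if terms else None

def keyword_contexts(queries):
    """Count each word, and for each word count the words it co-occurs with."""
    counts = Counter()
    contexts = defaultdict(Counter)
    for query in queries:
        words = {w for w in query.lower().split() if w not in STOPWORDS}
        for word in words:
            counts[word] += 1
            for other in words - {word}:
                contexts[word][other] += 1
    return counts, contexts
```

`counts.most_common(10)` then gives the top keywords, and `contexts[word].most_common(3)` gives the “in the context of” words for each.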

So I did this analysis. Here are the keywords that lead to my site. (This is based on 3 weeks of data.)

  1. excel in the context of cell, formula, function, leading to my Excel tips. People mostly want to know how to remove errors like #N/A.
  2. calvin in the context of hobbes, fight, club. (There was a great article on how Fight Club is really Calvin and Hobbes.) Most of these queries are searches for specific quotes, and I’ve typed out all the Calvin and Hobbes quotes.
  3. indian in the context of torrents, tv. One of my most popular posts is Indian Torrents. I simply linked to a couple of Google searches, so its popularity is unjustified.
  4. tamil in the context of songs, lyrics, movie. This is mostly thanks to the recent Tamil quizzes I’ve put up.
  5. mumbai in the context of local, schedule, train. A shockingly large number of people search for Mumbai bus and train schedules, landing on my link to the IIT-B Mumbai Navigator.
  6. anand in the context of s anand, bcg, infosys. This is people searching for me.
  7. irr in the context of calculating, excel, formula. Calculating IRR turned out to be another unexpectedly popular post.
  8. interview in the context of lehman brothers, bcg, landing at some of my interview experiences.
  9. mckinsey in the context of ppt, presentation. Most of these people are looking for presentations, while I have a link to the McKinsey pre-placement talk at LBS. Interesting that BCG is not in the top 10.
  10. google in the context of engedu, types, authors@google. Though I have several posts about Google, the ones about Google video like Meet the author and on Google TechTalks are the most popular.

Having read the actual queries, I’ve concluded that only the keywords excel, mumbai, anand, irr and interview definitely lead to relevant hits. The rest are debatable. Maybe I should reduce the importance of the less relevant posts on my sitemaps file.

Experiments in sound

Wikipedia says the human voice frequency for speech is between 85 and 155 Hz for men, and between 165 and 255 Hz for women. That set me thinking.

  1. What is the limit to our hearing?
  2. How do sounds differ?
  3. How can we synthesise speech?

What are the limits to our hearing?

Kids can hear frequencies from 20 Hz to 20 kHz, while adults hear only up to 12-14 kHz (Frequency Range of Human Hearing).

To check the lower frequency limit, I created an MP3 with sounds from 1 Hz to 100 Hz at 1 second intervals. Just play the sound, and see when you start hearing something. (Of course, whether you can hear something also depends on the volume of your speaker, the ambient noise, etc.) I could hear nothing for the first 40 seconds: so I can’t hear frequencies lower than 40 Hz.

PS: Don’t be worried if you don’t hear anything for a while. You’re not supposed to! Keep the volume at full level, though.

To check the upper frequency limit, I created this MP3 with sounds from 1 kHz to 20 kHz in 1 second intervals. Just play the sound, and see when you stop hearing anything. I couldn’t hear anything beyond 14 seconds: so I can’t hear frequencies beyond 14 kHz.
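Test files like these are easy to generate. Below is a sketch using only Python’s standard library. It writes a WAV file rather than an MP3, uses a 44.1 kHz sample rate (enough to represent 20 kHz), and the function names and amplitude are arbitrary choices for illustration.

```python
import math
import struct
import wave

def stepped_tones(frequencies, seconds=1.0, rate=44100, amplitude=0.8):
    """Return 16-bit samples: one sine tone per frequency, each lasting `seconds`."""
    samples = []
    per_tone = int(seconds * rate)
    for freq in frequencies:
        for n in range(per_tone):
            value = amplitude * math.sin(2 * math.pi * freq * n / rate)
            samples.append(int(value * 32767))
    return samples

def write_wav(path, samples, rate=44100):
    """Save samples as a mono, 16-bit WAV file."""
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Usage (file names are placeholders):
#   write_wav("low.wav", stepped_tones(range(1, 101)))            # 1 Hz to 100 Hz
#   write_wav("high.wav", stepped_tones(range(1000, 21000, 1000)))  # 1 kHz to 20 kHz
```

With one tone per second, the second at which you start (or stop) hearing something maps straight to a frequency.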

How do sounds differ?

I took this audio file of someone reciting vowels and plotted a spectrogram (below). A spectrogram plots time on the X-axis and frequency on the Y-axis.

Vowels spectrogram

Some observations:

  • All the vowels have evenly spaced bars. (In this case, they’re all multiples of something around 120 Hz.)
  • ‘u’ has the lowest frequency mix. ‘a’ spans from low to high. ‘i’ has a bit of low and a bit of high, nothing in the middle. ‘ai’ and ‘au’ look like ‘a’ followed by ‘i’ and ‘u’ respectively.
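To see where evenly spaced bars can come from, here is a small sketch in plain Python. It builds a synthetic tone from a 120 Hz fundamental plus its multiples (an assumption standing in for a real voice), runs a naive DFT over it, and lists the frequencies with significant energy. This is a single spectrum, i.e. one vertical slice of a spectrogram, and the sample rate and threshold are arbitrary illustrative choices.

```python
import cmath
import math

def harmonic_tone(fundamental, harmonics, rate, count):
    """Sum of sine waves at fundamental, 2x, 3x, ... with falling amplitude."""
    return [
        sum(math.sin(2 * math.pi * fundamental * h * t / rate) / h
            for h in range(1, harmonics + 1))
        for t in range(count)
    ]

def magnitude_spectrum(samples):
    """Naive DFT magnitudes for the first half of the bins (slow but simple)."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(samples)))
        for k in range(n // 2)
    ]

def peak_frequencies(samples, rate, threshold=0.25):
    """Frequencies whose magnitude is at least `threshold` times the largest."""
    mags = magnitude_spectrum(samples)
    limit = threshold * max(mags)
    return [k * rate / len(samples) for k, m in enumerate(mags) if m >= limit]

tone = harmonic_tone(120, 3, rate=2000, count=500)
peaks = peak_frequencies(tone, rate=2000)  # energy sits at multiples of 120 Hz
```

A real spectrogram repeats this over short, overlapping windows of the recording and stacks the slices side by side.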

How can we synthesise speech?

I don’t know. There are lots of speech synthesizers. They sound robotic. I’m trying to see if knowing what sounds look like improves things. I’ll let you know if I do well.

Link to a Google search rather than a site

When you make a link, there’s no guarantee that the link will work 5 years later. Sites change their URL structure. I’m finding that many of the links in my blog entries from 2000 no longer work.

Sometimes you want to link to a concept rather than a site. In such cases, it’s better to link to a Google query.

For example, rather than link to a site that defines SVG, I could link to the Google search define:SVG.

Rather than link to a tutorial on Excel array formulas, I could link to the Google search excel array formulas. I could even link to the first hit on Google for excel array formulas, mimicking the “I’m feeling lucky” button. This may change over time, but 5 years from now, it’ll still point to the most relevant link.

To link to the Google query for “excel array formulas”, just link to the URL http://www.google.com/search?q=excel+array+formulas. To link directly to the first result, add &btnI=I'm+Feeling+Lucky to the URL. (Linking to A9 is simpler: http://a9.com/excel+array+formulas)
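If you generate such links in bulk, a small helper avoids hand-encoding the query. A sketch in Python: the function name `google_search_url` is made up, and the URL format is just the one described above.

```python
from urllib.parse import urlencode

def google_search_url(query, lucky=False):
    """Build a link to a Google search, or to its first result if lucky is True."""
    params = {"q": query}
    if lucky:
        params["btnI"] = "I'm Feeling Lucky"
    return "http://www.google.com/search?" + urlencode(params)
```

`google_search_url("excel array formulas")` gives the URL above. With `lucky=True`, `urlencode` writes the apostrophe as `%27`, which is the same parameter in percent-encoded form.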

PS: An alternative is to link to a permanent copy of the page from the Wayback Machine (it has copies of my page all the way from May 2001 to Mar 2005). You can’t use Google’s cache for this: when the site changes, the cache soon changes too. But it’s a good defence against site downtime. Doing this manually is a lot of effort. Ideally, future browsers will automatically take you to the Wayback Machine or the Google cache. (The Firefox plugins ErrorZilla and CacheIt come close.)

Making a Tamil transliterator

I’ve built a simple Tamil transliterator. You can type in words in English and it will spell them out in Tamil. You can copy-paste the Tamil above into Microsoft Word, etc.

You may need to turn on Tamil scripts to see the Tamil fonts above. If you have Windows 98, it may not work well. If you’ve visited this page recently, you will need to refresh this page as well (press F5).

Browse through my JavaScript to see how it works. Feel free to reuse.
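For a feel of the general approach, here is a small Python sketch of one way a transliterator like this can work. It is not the JavaScript on this page, and it covers only a handful of letters. The idea: match the longest key first, attach a vowel sign when a vowel follows a consonant, and add a pulli (the dot) when no vowel follows.

```python
# Partial tables for illustration only: key -> (standalone vowel, vowel sign).
VOWELS = {
    "a": ("\u0B85", ""),         # அ (inherent in the consonant, so no sign)
    "A": ("\u0B86", "\u0BBE"),   # ஆ
    "aa": ("\u0B86", "\u0BBE"),  # ஆ
    "i": ("\u0B87", "\u0BBF"),   # இ
    "u": ("\u0B89", "\u0BC1"),   # உ
}
CONSONANTS = {
    "k": "\u0B95",   # க
    "th": "\u0BA4",  # த
    "m": "\u0BAE",   # ம
    "l": "\u0BB2",   # ல
    "zh": "\u0BB4",  # ழ
}
PULLI = "\u0BCD"

def match(table, text, pos):
    """Return the longest key of `table` found at text[pos:], or None."""
    for size in (2, 1):
        key = text[pos:pos + size]
        if len(key) == size and key in table:
            return key
    return None

def transliterate(text):
    out = []
    pos = 0
    while pos < len(text):
        consonant = match(CONSONANTS, text, pos)
        if consonant:
            pos += len(consonant)
            vowel = match(VOWELS, text, pos)
            if vowel:  # consonant + vowel sign
                out.append(CONSONANTS[consonant] + VOWELS[vowel][1])
                pos += len(vowel)
            else:      # bare consonant gets a pulli
                out.append(CONSONANTS[consonant] + PULLI)
            continue
        vowel = match(VOWELS, text, pos)
        if vowel:      # vowel not attached to a consonant
            out.append(VOWELS[vowel][0])
            pos += len(vowel)
            continue
        out.append(text[pos])  # pass anything else through unchanged
        pos += 1
    return "".join(out)
```

With these tables, `transliterate("thamizh")` spells out தமிழ் and `transliterate("ammA")` spells out அம்மா.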

I’ve also made a Google Gadget that searches Google in Tamil using this tool.

Here’s what to type:

Tamil English
அ a
ஆ A or aa
இ i
ஈ I or ee
உ u
ஊ U or oo
எ e
ஏ E
ஐ ai
ஒ o
ஓ O
ஔ au
க k or g
ங n
ச ch or s
ஜ j
ஞ n
ட t or d
ண N
த th or dh
ந n
ப p or b
ம m
ய y
ர r
ல l
வ v
ழ zh
ள L
ற R
ஷ sh
ஸ S
ஹ h

I also have a gadget that lets you search in Tamil.

Statistically improbable phrases

Calvin and Hobbes has some recurrent themes, like Hobbes pouncing, snow art, polls, letters to Santa, …

Over the last 5 years, I’ve transcribed the Calvin and Hobbes comics, and tagged them manually by theme. But can I generate themes automatically?

One way is to use Amazon’s statistically improbable phrases. It’s a list of words that occur a lot in a book, but rarely occur in others. It gives you a good feel of what topics the book is about.

Here’s how I did it:

  1. Transcribe Calvin & Hobbes. This is 99% of the work.
  2. Make a C&H word list. Just join all the words in Calvin and Hobbes. (Be careful about punctuation, and colloquialisms like “dunno”, “leggo”, etc.)
  3. Get an English corpus. That is, get a big list of words in normally occurring text. I have some e-books, and I picked 23 megabytes worth of these as my corpus.
  4. Compare the word frequency in C&H with the corpus. That is, compare the % of occurrences of a word in Calvin and Hobbes versus the corpus.
  5. Display those with significantly higher frequency in C&H.
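Steps 2 to 5 can be sketched in a few lines of Python. The tokenising pattern, the minimum count, and the add-one smoothing for words missing from the corpus are all assumptions made for this illustration, as is the function name `improbable_words`.

```python
import re
from collections import Counter

def word_counts(text):
    """Lower-case the text and count words (letters, apostrophes, hyphens)."""
    return Counter(re.findall(r"[a-z][a-z'-]*", text.lower()))

def improbable_words(sample, corpus, ratio=10.0, min_count=2):
    """Words whose share of `sample` is at least `ratio` times their share of `corpus`."""
    sample_counts = word_counts(sample)
    corpus_counts = word_counts(corpus)
    sample_total = sum(sample_counts.values())
    corpus_total = sum(corpus_counts.values())
    scores = {}
    for word, count in sample_counts.items():
        if count < min_count:  # ignore one-off words
            continue
        sample_freq = count / sample_total
        # Add one so words absent from the corpus don't divide by zero.
        corpus_freq = (corpus_counts[word] + 1) / (corpus_total + 1)
        if sample_freq / corpus_freq >= ratio:
            scores[word] = sample_freq / corpus_freq
    return sorted(scores, key=scores.get, reverse=True)
```

Everyday words like “the” have a ratio near 1 and drop out, while words that are rare in the corpus but frequent in the sample float to the top.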

The list below has common Calvin & Hobbes words occurring 10 times as often as in normal text. It’s incredible how closely it relates to most of the themes.

(Big words occur more often. Dark words are more improbable.)


allowance assignment babe balloon bat bath beanie bedtime bee beep bet bike blaster boring bug bus butter calvin calvinball cartoon cent cereal cheat chew chocolate click comic cookie crunch dad dame derkins dictator-for-life dinosaur disgusting doll doomed dumb duplicate earthling explorer fang fearless ferocious flip flush frog frosted fun fuzzy genius goggle goodness goon grade gross grown-up gum hack hamburger hamster hate hero hideous hobbes homework huey insect invent jelly jerk jurassic kid leaf loot martian math mild-mannered mom monster moron motto munch mushy nickel oatmeal ouija pant peanut perspective pit playground poll porridge poster quiz recess rosalyn rotten rub sandwich santa scary sculpture scum shovel
sissy sitter sled slimy slug slushball sniff snow snowball snowman soak spaceman spiff splash spoil sport squirt steer sting stuffed stupendous sugar susie tickle tiger toy transmogrifier transmogrify tub tuna twinky tyrannosaur underwear vacation weird wham whiff worm wormwood


Summary: “Statistically improbable phrases” are a powerful tool for text analysis. You can apply them to any content and figure out what topics it talks about.

Update: Technically, these are “Statistically improbable WORDS”, not phrases. So I re-did this analysis using phrases instead of words.