How I do things Archives - Page 13 of 14

Making a Tamil transliterator

August 28, 2006 / How I do things, Tools / 21 Comments

I’ve built a simple Tamil transliterator. You can type in words in English and it will spell them out in Tamil. You can copy-paste the Tamil above into Microsoft Word, etc.

You may need to turn on Tamil script support to see the Tamil fonts above. If you have Windows 98, it may not work well. If you’ve visited this page recently, you will also need to refresh it (press F5).

Browse through my JavaScript to see how it works. Feel free to reuse it.

I’ve also made a Google Gadget that searches Google in Tamil using this tool.

Here’s what to type:

Tamil	English
அ	a
ஆ	A or aa
இ	i
ஈ	I or ee
உ	u
ஊ	U or oo
எ	e
ஏ	E
ஐ	ai
ஒ	o
ஓ	O
ஔ	au
க	k or g
ங	n
ச	ch or s
ஜ	j
ஞ	n
ட	t or d
ண	N
த	th or dh
ந	n
ப	p or b
ம	m
ய	y
ர	r
ல	l
வ	v
ழ	zh
ள	L
ற	R
ஷ	sh
ஸ	S
ஹ	h
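The table above can be turned into a small lookup-driven sketch. This is not the JavaScript behind the tool, only a minimal Python illustration of one way such a mapping could work. The names `VOWELS`, `CONSONANTS` and `transliterate` are made up for this example, and so are three assumptions: keys are matched longest-first, a consonant with no following vowel gets a pulli, and plain "n" always maps to ந (the table lists ங and ஞ under "n" as well, and this sketch does not try to tell them apart).

```python
# Latin key -> (independent vowel, dependent vowel sign written after a consonant)
VOWELS = {
    "a": ("அ", ""),  # "a" is the inherent vowel, so it needs no sign
    "A": ("ஆ", "\u0bbe"), "aa": ("ஆ", "\u0bbe"),
    "i": ("இ", "\u0bbf"),
    "I": ("ஈ", "\u0bc0"), "ee": ("ஈ", "\u0bc0"),
    "u": ("உ", "\u0bc1"),
    "U": ("ஊ", "\u0bc2"), "oo": ("ஊ", "\u0bc2"),
    "e": ("எ", "\u0bc6"), "E": ("ஏ", "\u0bc7"),
    "ai": ("ஐ", "\u0bc8"),
    "o": ("ஒ", "\u0bca"), "O": ("ஓ", "\u0bcb"),
    "au": ("ஔ", "\u0bcc"),
}
CONSONANTS = {
    "k": "க", "g": "க", "ch": "ச", "s": "ச", "j": "ஜ",
    "t": "ட", "d": "ட", "N": "ண", "th": "த", "dh": "த",
    "n": "ந",  # assumption: the table also lists ங and ஞ under "n"
    "p": "ப", "b": "ப", "m": "ம", "y": "ய", "r": "ர", "l": "ல",
    "v": "வ", "zh": "ழ", "L": "ள", "R": "ற", "sh": "ஷ", "S": "ஸ", "h": "ஹ",
}
PULLI = "\u0bcd"  # marks a consonant that carries no vowel


def _match(text, i, table):
    """Return the longest key of `table` starting at text[i], or None."""
    for length in (2, 1):
        key = text[i:i + length]
        if len(key) == length and key in table:
            return key
    return None


def transliterate(text):
    out = []
    i = 0
    pending = False  # True when the last unit emitted was a consonant with no vowel yet
    while i < len(text):
        vowel = _match(text, i, VOWELS)
        consonant = _match(text, i, CONSONANTS)
        if vowel is not None:
            independent, sign = VOWELS[vowel]
            out.append(sign if pending else independent)
            pending = False
            i += len(vowel)
        elif consonant is not None:
            if pending:
                out.append(PULLI)
            out.append(CONSONANTS[consonant])
            pending = True
            i += len(consonant)
        else:
            # Anything not in the table (spaces, digits, ...) passes through unchanged
            if pending:
                out.append(PULLI)
            out.append(text[i])
            pending = False
            i += 1
    if pending:
        out.append(PULLI)
    return "".join(out)


print(transliterate("thamizh"))  # தமிழ்
```

Matching two-letter keys first is what lets "th", "zh", "ai" and "aa" win over their one-letter prefixes.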

I also have a gadget that lets you search in Tamil.


Statistically improbable phrases

August 23, 2006 / How I do things / 2 Comments

Calvin and Hobbes has some recurrent themes, like Hobbes pouncing, snow art, polls, letters to Santa, …

Over the last 5 years, I’ve transcribed the Calvin and Hobbes comics, and tagged them manually by theme. But can I generate themes automatically?

One way is to use Amazon’s statistically improbable phrases. It’s a list of words that occur a lot in a book, but rarely occur in others. It gives you a good feel of what topics the book is about.

Here’s how I did it:

Transcribe Calvin & Hobbes. This is 99% of the work.
Make a C&H word list. Just join all the words in Calvin and Hobbes. (Be careful about punctuation, and colloquialisms like “dunno”, “leggo”, etc.)
Get an English corpus. That is, get a big list of words in normally occurring text. I have some e-books, and I picked 23 megabytes worth of these as my corpus.
Compare the word frequency in C&H with the corpus. That is, compare the % of occurrences of a word in Calvin and Hobbes versus the corpus.
Display those with significantly higher frequency in C&H.
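The comparison in the last two steps can be sketched roughly as follows. This is an illustration under assumptions, not the script actually used: the function names, the `min_ratio` and `min_count` parameters, and the add-one smoothing for words missing from the corpus are all choices made for this example.

```python
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count words, keeping internal hyphens and apostrophes."""
    return Counter(re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower()))


def improbable_words(source_text, corpus_text, min_ratio=10.0, min_count=2):
    """Return {word: ratio} for words at least `min_ratio` times as frequent
    in the source as in the corpus, most improbable first."""
    source = word_counts(source_text)
    corpus = word_counts(corpus_text)
    source_total = sum(source.values())
    corpus_total = sum(corpus.values())

    results = {}
    for word, count in source.items():
        if count < min_count:  # skip words too rare in the source to judge
            continue
        source_freq = count / source_total
        # Add-one smoothing (an assumption) so words absent from the corpus
        # do not cause a division by zero
        corpus_freq = (corpus.get(word, 0) + 1) / (corpus_total + 1)
        ratio = source_freq / corpus_freq
        if ratio >= min_ratio:
            results[word] = ratio
    return dict(sorted(results.items(), key=lambda kv: kv[1], reverse=True))
```

Called as `improbable_words(calvin_text, corpus_text)`, with both arguments being plain strings, it returns the words whose share of the source is at least ten times their share of the corpus.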

The list below has common Calvin & Hobbes words occurring 10 times as often as in normal text. It’s incredible how closely it relates to most of the themes.

(Big words occur more often. Dark words are more improbable.)

allowance assignment babe balloon bat bath beanie bedtime bee beep bet bike blaster boring bug bus butter calvin calvinball cartoon cent cereal cheat chew chocolate click comic cookie crunch dad dame derkins dictator-for-life dinosaur disgusting doll doomed dumb duplicate earthling explorer fang fearless ferocious flip flush frog frosted fun fuzzy genius goggle goodness goon grade gross grown-up gum hack hamburger hamster hate hero hideous hobbes homework huey insect invent jelly jerk jurassic kid leaf loot martian math mild-mannered mom monster moron motto munch mushy nickel oatmeal ouija pant peanut perspective pit playground poll porridge poster quiz recess rosalyn rotten rub sandwich santa scary sculpture scum shovel
sissy sitter sled slimy slug slushball sniff snow snowball snowman soak spaceman spiff splash spoil sport squirt steer sting stuffed stupendous sugar susie tickle tiger toy transmogrifier transmogrify tub tuna twinky tyrannosaur underwear vacation weird wham whiff worm wormwood

Summary: “Statistically improbable phrases” are a powerful tool for text analysis. You can apply them to any content to figure out what topics it covers.

Update: Technically, these are “Statistically improbable WORDS”, not phrases. So I re-did this analysis using phrases instead of words.


How I use Google Spreadsheets

June 28, 2006 / How I do things / 3 Comments

I work across multiple computers (my office laptop, home laptop, client desktop) on a daily basis.

I used to transfer data between them by e-mailing it to myself before I travelled. (I often forgot to do so.) Mostly, these are notes: telephone numbers, things to buy, places to visit, etc.

Google Notebook solves the problem. But not entirely. I store a lot of my notes on spreadsheets, as lists. For example:

Gadgets to buy (and accompanying research)
Movies I’ve seen
Books to read (and which library they’re available from)
To do lists

That’s what I use Google Spreadsheets for — to share lists with myself, across computers.


Demographics prediction from online behaviour

June 26, 2006 / How I do things / 3 Comments

Microsoft adCenter Labs has a demographics prediction engine. Based on a person’s search queries and web sites visited, it can predict their gender and age.

So I tried that on parts of the body, to see what men were interested in vs women.

topic	male	female
hair	25%	75%
eyes	33%	67%
cheek	33%	67%
hands	33%	67%
lips	36%	64%
ears	39%	61%
fingers	40%	60%
forehead	42%	58%
nose	43%	57%
neck	46%	54%
beard	55%	45%
moustache	58%	42%
leg	60%	40%
palm	61%	39%
toe	64%	36%

While I can understand men being more interested in beards and moustaches (perhaps even legs), why are they far more interested in toes than women?


Cut-and-paste is not understanding

June 21, 2006 / How I do things / 3 Comments

Cut and paste has become easier. So we make less effort to understand. We don’t need to. Like when we pay less attention if we’re recording a lecture.

Solution? I suggest the Tunnel in the Sky strategy. Rod Walker is going for survival training on an alien planet, and asks his sister, Captain Walker…

“Uh, Sis, what sort of gun should I carry?”

“Huh? Why the deuce do you want a gun?”

“Why, for things I might run into of course.”

“Your only purpose is to stay alive. Not to be brave, not to fight. One time in a hundred a gun might save your life; the other ninety-nine it will tempt you into folly.”

“Did you take a gun on your solo test?”

“I did. And I lost it the first day. Which saved my life. I know how good a gun makes you feel. You’re ready for anything and hoping you’ll find it. Which is exactly what is dangerous about it – because you aren’t anything of that sort.”

So, don’t take a gun.

Don’t record lectures. Don’t give yourself the illusion of perfect memory.

Don’t bookmark for future reading. You won’t read it later.

Don’t cut and paste. You don’t understand it now. You won’t understand it later.


The Search

April 15, 2006 / How I do things / Leave a Comment

I was reading John Battelle’s The Search, and realised: we don’t sit down at the computer and say, “Let’s do a search”.

True. We want to get something done. We know it’s out there somewhere. We search.

So every search on a search engine is a commercial opportunity. Contrariwise, every site must let people do what they want to do on the site.

Think… What do people want to do when they’re on YOUR site?


Search queries to my site

April 6, 2006 / How I do things / 5 Comments

On a related note, 60% of the search queries that led to my site this year were Calvin and Hobbes quotes. “i can’t help but wonder what kind of desperate straits would drive a man to invent this thing.” topped the list (Calvin referring to a yo-yo), with “i always catch these trick questions” following closely.

People searching for Excel-related stuff were next (20%): excel indirect(address(, row() excel offset address and the like.

A few were also looking for me by name or school (10%).

The last 10% ranged from the puzzling to the bizarre, including these gems.

michalengelo hidden skull. Probably looking for the alleged hidden skull in Last Judgement.
googlemail access between england and india. Why? Did he think there wouldn’t be any?
greenwich meridien time for india. This is usually the same in Greenwich and in India. Rest of the world too.
origin of monkey in fez. What?
address of sexy girl in ahmedabad. But why my site?
address of tool makers for converting html to xml in chennai. Do they have a license to convert?


IMDB Top 250 outliers

April 4, 2006 / How I do things / 9 Comments

On the IMDb top 250, you normally see a correlation between the number of votes and the rating for a movie. Better rated movies are more watched. The outliers are interesting.

The movies that are popular despite not having a high rating are:

I can understand why The Sixth Sense, Pirates of the Caribbean and especially The Matrix are on this list: geeks would have watched these and voted on IMDb, though they need not have rated them highly. But why are Gladiator and Sixth Sense on that list?

Movies that are highly rated, but not as popular are:

Seven Samurai and The Good, the Bad and the Ugly probably didn’t get the votes they deserve because IMDb lists them under their original Japanese and Italian titles. For the same reason, I didn’t see them for a long time. As for The Godfather, I personally think it’s just overrated. But Rear Window? That’s a surprise. A Hitchcock thriller with all the classic elements…

Another correlation is between the rating and the year of the movie. Early movies get lower ratings than recent movies. Technique could be the reason, but I doubt it. In any case, some movies stand out for their time.

I haven’t seen Metropolis or M. But among the others, I think Citizen Kane is the one that deserves to stand out, if only for portraying the anti-hero, and for not having a happy ending. The Shawshank Redemption was a bit of a surprise. Few people that I know have heard of it. And yet, there it is, right on top.


How I buy gadgets

March 26, 2006 / How I do things / 9 Comments

I’m a cautious gadget freak. I love buying gadgets, but think a lot before buying them. Invariably, I use spreadsheets to help me decide. I try to buy only those gadgets that are right for me at the cheapest possible price, and I look at two things: features based on usage and breakeven.

Usage-driven buying

I pick the features I want based on my usage. For example, when I bought my first mobile, I listed my most likely uses for the phone:

I’m in the car (e.g. 2 hr drive to airport), and want to catch up
Emergency calls (means, carry the phone always)

So I need high battery life (at least 2 hours). I need low weight, if I’m going to carry it around. I don’t need colour display or MMS for my usage pattern. Then I ran through all available mobile phone options, filtered them against my criteria, and picked the cheapest (Nokia 3310).

Another example was my digital camera. The reason I wanted one was:

I can take a lot more photographs and print only those I want
For low light shots, take multiple snaps, so at least one will be OK
I can just take one snap and print it, and not have to complete a roll

So my camera should be light (to carry around and take lots of snaps), have a high ISO rating and flash (to work well in low light), and needn’t have much memory (I transfer it to my laptop pretty quickly).

Having identified such features, I compared models (online and in shops) in 2002.

Product	Price	Size	Flash	Mpx	Zoom	Mem	Comment
Kodak DC3400	16500		Y	2	2x		No: 2x zoom not enough
Canon S10	20000	Small	Y	2	2x		No: 2x zoom not enough
Sony DSC P50	20000		Y	2	3x	4MB	No: too little memory
Nikon 775	19000	Small	Y	2	3x	8MB	OK
Fuji FinePix 2600	15000	OK	Y	2	3x	16MB	OK
Olympus D-230	15000	Small	Y	2	None	16MB	No: no zoom
Nikon 885	27500	OK	Y	3	3x	16MB	Too little manual control
Canon G1	40000		Y	3	3x		Too expensive
Sony DSC S85	40000		Y	4	3x	16MB	Slow shutter
Canon G2	45000		Y	4	3x	16MB	Too expensive
Olympus C4040	45000		Y	4	3x		Too expensive

I finally picked the Fuji FinePix 2600.
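The "filter against criteria, then pick the cheapest" step can be sketched in a few lines. The prices, zoom and memory figures are copied from the table above. The thresholds (at least 3x zoom, at least 8MB) are only inferred from the comments column, blank memory cells are treated as unknown and excluded, and the names `Camera`, `shortlist` and `cheapest` are hypothetical.

```python
from typing import NamedTuple, Optional


class Camera(NamedTuple):
    name: str
    price: int
    zoom: int                  # optical zoom factor, 0 for none
    memory_mb: Optional[int]   # None where the table leaves the cell blank


CAMERAS = [
    Camera("Kodak DC3400", 16500, 2, None),
    Camera("Canon S10", 20000, 2, None),
    Camera("Sony DSC P50", 20000, 3, 4),
    Camera("Nikon 775", 19000, 3, 8),
    Camera("Fuji FinePix 2600", 15000, 3, 16),
    Camera("Olympus D-230", 15000, 0, 16),
    Camera("Nikon 885", 27500, 3, 16),
    Camera("Canon G1", 40000, 3, None),
    Camera("Sony DSC S85", 40000, 3, 16),
    Camera("Canon G2", 45000, 3, 16),
    Camera("Olympus C4040", 45000, 3, None),
]


def shortlist(cameras, min_zoom=3, min_memory_mb=8):
    """Keep cameras that meet the (assumed) zoom and memory thresholds."""
    return [
        c for c in cameras
        if c.zoom >= min_zoom
        and c.memory_mb is not None
        and c.memory_mb >= min_memory_mb
    ]


def cheapest(cameras):
    """Return the lowest-priced camera in the list."""
    return min(cameras, key=lambda c: c.price)


best = cheapest(shortlist(CAMERAS))
print(best.name, best.price)  # Fuji FinePix 2600 15000
```

Softer criteria such as manual control or shutter speed are not in the table as numbers, so they would still need a manual pass over the shortlist.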

Breakeven

I had a normal (film) camera. Would a digital camera be economically worth it? For a film camera, the roll costs Rs 2.5 per shot (Rs 90 / 36 shots), developing costs Rs 2.8 per shot (Rs 100 / 36 shots), and each print costs Rs 5. Total cost per photo: Rs 10.3. With a digital camera I don’t need prints, since I view pictures on the computer. The digital camera cost me Rs 20,000 including customs duty. So I break even when I take about 2,000 pictures. That sounded feasible, so I switched to digital in 2002. (I’ve taken about 2,800 snaps since.)
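The arithmetic above, worked through as a short sketch. It assumes, as the text does, that a digital photo has no per-shot cost; `breakeven_units` is a hypothetical helper name.

```python
def breakeven_units(fixed_cost, saving_per_unit):
    """Number of units after which the up-front cost is recovered."""
    return fixed_cost / saving_per_unit


# Per-photo cost with film: roll + developing + one print (all in Rs)
film_cost_per_photo = 90 / 36 + 100 / 36 + 5

photos = breakeven_units(20_000, film_cost_per_photo)
print(round(film_cost_per_photo, 2), round(photos))  # 10.28 1946
```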

For similar reasons, I also decided I didn’t need a colour printer. Given my expected usage, it would have cost me Rs 34 for a single 4″ x 6″ colour photo printout. I could get the same at a shop for Rs 8.

Recently, I bought a DVD writer. DVDs cost about the same as CDs in bulk. (I bought a 100-DVD pack for 14 pounds, and 100 CDs for about the same.) A DVD stores 6 times as much as a CD. So for every DVD I burn, I save the cost of 5 CDs, about 70 pence. A DVD writer cost 50 pounds. So after burning about 70 DVDs, I’d break even. Once I’m through with my pack of 100 DVDs, I’m guaranteed breakeven. (I’ve burned about 25 DVDs to date.)
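The same calculation for the DVD writer, kept in whole pence to avoid rounding noise. The variable names are illustrative, and the figures are the ones quoted above.

```python
import math

disc_price_pence = 1400 // 100   # 14 pounds for a pack of 100, DVD or CD
cds_per_dvd = 6                  # one DVD holds as much as 6 CDs
writer_price_pence = 50 * 100

# Each DVD burned replaces 6 CDs but costs one disc itself
saving_per_dvd_pence = (cds_per_dvd - 1) * disc_price_pence

dvds_to_breakeven = math.ceil(writer_price_pence / saving_per_dvd_pence)
remaining = dvds_to_breakeven - 25   # about 25 burned so far

print(saving_per_dvd_pence, dvds_to_breakeven, remaining)  # 70 72 47
```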

Tracking

I don’t stop there. After buying, I track my usage. Where I’ve done a breakeven analysis, I try to track it quantitatively. Otherwise, I track my usage pattern (high / medium / low). So far, my best return-on-investment has been on my webcam and mic, followed by my digital camera, CD writer, video camera and DVD writer. The worst have been my TV tuner card (I didn’t really record many movies), and my second mobile phone (turned out I didn’t really use GPRS).

I once started doing this sort of analysis for my clothes, but stopped… maybe I was carrying this a bit too far…


MP3 bitrates and sound quality

February 10, 2006 / How I do things / 11 Comments

At what bitrate should you encode your MP3 files? Listening tests show that at 256 kbps, you can’t tell the difference from the original. But that’s with 2 amplifiers and big speakers. What about headphones?

I tried an experiment with my cousin, who has the best ear for music that I know. We ripped a good audio CD of his at 128 kbps. He put on a pair of headphones (the kind that fit into your ear) connected to my laptop. I played the first half a minute of the original and the ripped version 10 times, in a random order, asking him to guess which was which. Result: 5 correct and 5 wrong. He couldn’t tell the difference.

We tried again, ripping at 64 kbps this time. Same experiment and, surprisingly, the same result: 5 correct and 5 wrong.
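A score of 5 out of 10 is exactly what coin-flip guessing gives on average. As a rough check, and not part of the original experiment, the sketch below computes how likely a given score is under pure guessing, treating each of the 10 plays as an independent 50/50 guess. The function name `prob_at_least` is made up for this example.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more correct answers out of n,
    when each answer is right with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


print(prob_at_least(5, 10))  # 0.623046875: guessing alone gets 5+ right most of the time
print(prob_at_least(9, 10))  # 0.0107421875: 9+ right would be hard to explain by luck
```

With only 10 trials, a listener would need to score 9 or 10 before guessing could be comfortably ruled out, so more rounds would make the result firmer.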

Conclusion: With a pair of headphones, even a good ear can’t tell the difference between a 64 kbps MP3 and an original CD. So, if you want to cram more songs into your iPod, just re-encode them at 64 kbps. You’ll easily halve the size, as most of them are at least 128 kbps.

