Calvin and Hobbes Dad explains science
My second most favourite series from Calvin and Hobbes, where Dad teaches Calvin the wonders of science.
Take a look at the Excel screenshot below.
Yes, that’s right. I have a user-defined function called AMAZONPRICE. And it returns these cameras’ prices directly from Amazon.com. (Given the category and some keywords, it returns the price of the bestselling item on Amazon.com.)
Here’s the code behind the function.
```vb
Function AmazonPrice(index As String, keywords As String) As String
    Dim xDoc As MSXML2.DOMDocument30
    Set xDoc = New MSXML2.DOMDocument30
    xDoc.async = False
    ' Ask Amazon's ItemSearch for matching products, sorted by sales rank
    If xDoc.Load("http://ecs.amazonaws.com/onca/xml?Service=AWSECommerceService" & _
            "&Version=2005-03-23&Operation=ItemSearch&SubscriptionId=03SDGQDFEB455W53SB82" & _
            "&AssociateTag=sanand-20&MinimumPrice=10000&ResponseGroup=OfferSummary,Small" & _
            "&Sort=salesrank&SearchIndex=" & index & "&Keywords=" & keywords) Then
        ' Search with XPath, mapping Amazon's default namespace to the prefix "a"
        xDoc.setProperty "SelectionLanguage", "XPath"
        xDoc.setProperty "SelectionNamespaces", _
            "xmlns:a=""http://webservices.amazon.com/AWSECommerceService/2005-03-23"""
        ' Return the first price Amount in the response (in cents)
        AmazonPrice = xDoc.selectSingleNode("/a:ItemSearchResponse//a:Amount").Text
    End If
End Function
```
This is how it all started…
Flickr has a camera finder that shows the most popular cameras in the Flickr community.
I love comparing gadgets, and I’d been doing some research on these cameras, as well as the Fuji series (because I own a Fuji Finepix S5600). I’d normally make a spreadsheet that compares these cameras on various parameters, including price.
Since I believe in never typing in data, I wondered if there was a way to pull the prices in automatically…
Two things made this possible: Amazon’s web services, which return product details as XML for a plain URL, and Excel user-defined functions, which can fetch and parse that XML.
The first task is to get the XML feed for a product you want. Amazon lets you do that simply by typing in a URL. The best way to construct the URL is through AWSZone. I picked the US ItemSearch method, which searches for a title or keyword within a category and returns the matches. The feed for the Canon EOS Digital Rebel XT, based on this, would be at:
```
http://ecs.amazonaws.com/onca/xml?
    Service=AWSECommerceService&
    Version=2005-03-23&
    Operation=ItemSearch&
    SubscriptionId=0525E2PQ81DD7ZTWTK82&
    SearchIndex=Electronics&
    Keywords=Canon EOS Digital Rebel XT&
    Sort=salesrank&
    ResponseGroup=Offers,Small
```
(You really need to replace the Subscription ID with your own.)
If you retrieve this URL, you get an XML file containing the details of all Canon EOS Digital Rebel XTs, sorted by sales rank.
To load this in Excel, you need to create a UDF in Visual Basic. First, go to Tools – References and enable Microsoft XML, v3.0 or v4.0. (The code below uses the v3.0 object names; with v4.0, use MSXML2.DOMDocument40 instead.) Now, to load an XML document, do this:
```vb
Dim xDoc As MSXML2.DOMDocument30
Set xDoc = New MSXML2.DOMDocument30
xDoc.async = False
xDoc.Load (url)
```
If the load succeeds, then you can extract the information fairly easily, using XPath.
```vb
xDoc.setProperty "SelectionLanguage", "XPath"
xDoc.setProperty "SelectionNamespaces", _
    "xmlns:a=""http://webservices.amazon.com/AWSECommerceService/2005-03-23"""
```
The first line says we’ll be searching using XPath. The second line is a workaround to support default namespaces. (Don’t worry about it. I don’t quite get it either.)
Finally, you get the price from the XML tree. In this case, it’s under ItemSearchResponse/Items/Item/OfferSummary/LowestNewPrice/Amount, and it’s in cents. (The // in the XPath below is a shortcut that matches any descendant, so we don’t have to spell out the intermediate levels.)
AmazonPrice = xDoc.selectSingleNode("/a:ItemSearchResponse//a:Amount").Text
That’s it! Now that this function is defined, just pass it the category and keywords, and you have the price of the first matching product. You can retrieve any other information about products as well — like sales rank, weight, ratings, reviews, etc.
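For example, with this function in a VBA module, a cell formula along the lines of =AmazonPrice("Electronics", "Canon EOS Digital Rebel XT") should return that camera’s price. (The Amount comes back in cents, so you may want to divide by 100 in the sheet; the first argument has to be one of Amazon’s SearchIndex names, like Electronics or Books.)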
Here’s the spreadsheet for you to play around with.
The technique of Web lookups in Excel I described yesterday is very versatile. I will be running through some of the practical uses it can be put to over the next few days.
To generalise things beyond just getting the Amazon price, I created a user-defined function called XPath2. It takes two parameters:
1. The URL of the XML feed to read
2. A search string: a space-separated list of XPath expressions
This function can be used to extract information out of any XML file on the Web and get it out as a table. For example, if you wanted to watch the Top 10 movies on the IMDb Top 250 and were looking for torrents, an RSS feed is available from mininova. The URL http://www.mininova.org/rss/movie_name/4 gives you an RSS file matching all movies with “movie_name”. From this, we need to extract the <item><title> and <item><link> elements. That’s represented by “//item title link” as my search string.
The formula becomes XPath2( "http://www.mininova.org/rss/"&A2&"/4", "//item title link"). The result is a two-dimensional array, with individual items in rows and the title and link as columns. Pulling it all together, you can get the sheet above.
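I haven’t walked through the code of this function here, but a minimal sketch of such a UDF could look like the following. (This is an illustration, not my exact implementation: it assumes the Microsoft XML reference is enabled, and it ignores namespaces and error handling.)

```vb
Function XPath2(url As String, search As String) As Variant
    ' Sketch: the first token of "search" picks the rows (e.g. "//item"),
    ' and the remaining tokens pick one column from each row (e.g. "title", "link")
    Dim xDoc As MSXML2.DOMDocument30
    Dim parts() As String
    Dim items As MSXML2.IXMLDOMNodeList
    Dim result() As String
    Dim i As Long, j As Long

    Set xDoc = New MSXML2.DOMDocument30
    xDoc.async = False
    If Not xDoc.Load(url) Then Exit Function
    xDoc.setProperty "SelectionLanguage", "XPath"

    parts = Split(search, " ")
    Set items = xDoc.selectNodes(parts(0))
    If items.Length = 0 Then Exit Function

    ReDim result(1 To items.Length, 1 To UBound(parts))
    For i = 1 To items.Length
        For j = 1 To UBound(parts)
            result(i, j) = items(i - 1).selectSingleNode(parts(j)).Text
        Next j
    Next i
    XPath2 = result
End Function
```

Entered as an array formula over a block of cells, something like this fills the block with one matching item per row.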
All of this could be done using a command-line program. Excel has one huge advantage, though: it’s one of the most powerful user interfaces around. Increasingly, I’m beginning to rely on just two user interfaces for almost any task. One is the browser, and the other is Excel. With Excel, I could have a sheet that holds my movie wishlist (which changes often), and add a check to see if a torrent exists. Every time I add a bunch of movies to the wishlist, it’s just a matter of copying the formula down. No need to visit a torrent search engine and type each movie in, one by one.
Another example. Someone suggests 10 movies to watch. I’d rather watch the ones with a higher IMDb rating. Again, enter the Web lookup. Type in the movie names. Use a similar function to look up the rating on IMDb, and sort by the rating.
The possibilities are endless.
Now that I’m well on my way to watching the Top 250 movies on IMDb, I’ve slowly turned my attention to fiction. My interest is mainly in the Fantasy & Science Fiction area. Unfortunately, I don’t know of any list like the IMDb Top 250, but there are a few awards that could take the place of the Oscars for books. That’s probably a good place to start.
The most popular awards in Science Fiction are the Hugo and Nebula awards, followed by the Philip K. Dick, John W. Campbell, Arthur C. Clarke and other awards. I collated a list of all the awards (from LocusMag) into the spreadsheet below.
Very few of these books have won multiple awards. None have won more than 3 on this list. Of these, only five have 3 awards:
Over 20 books have won two of these awards.
I haven’t read most of these books. The ones I have read are:
That’s about a 50% ratio, so I guess this list doesn’t work very well for me. Or at least, my taste doesn’t match that of the award judges. But maybe you will find something interesting to read…
The Internet has a lot of Tamil song lyrics in English. Finding them is not easy, though. Two problems. The lyrics are fragmented: there’s no single site to search them all. And Google doesn’t help. It doesn’t know that alaipaayudhe, alaipaayuthe and alaipayuthey are the same word.
This is similar to the problem I faced with Tamil audio. The solution, as before, is to make an index, and provide a search interface that is tolerant of English spellings of Tamil words. But I want to go a step further. Is it possible to display these lyrics in Tamil?
My Tamil Transliterator does a lousy job of this. Though it’s tolerant of mistakes, it’s ignorant of spelling and grammar. So,
kanda nal muthalai kathal peruguthadi
… becomes…
kanda nal muthalai kathal peruguthadi
… when in fact we want…
kanda naaL muthalaay kaathal peruguthadi
(If you’re viewing this on an RSS reader, check my post to see what I mean.)
I need an automated Tamil spelling corrector. Having read Peter Norvig’s “How to Write a Spelling Corrector”, and actually understood it, I gave spelling correction in Tamil a shot.
Norvig’s approach, in simple terms, is this: make a dictionary of valid words, tweak the misspelt word in every small way (deleting, swapping, replacing and inserting letters), keep only the tweaks that are valid words, and choose the most likely of those as the correction.
Making a dictionary is easy. I just need lots of Tamil literature, and then I pick out words from it. For now, I’m just using the texts in Project Madurai.
Tweaking the word to check is easy. Norvig’s article has a working code example.
Picking valid tweaks is easy. Just check against the dictionary.
The tough part is choosing the likely correction. For each valid word, I need the probability of having made this particular error.
Let’s take an example. I’ve spelt kathal. The valid tweaks of this word include kal, kol, kadal, kanal, and kaadhal. For each of these, I need to figure out how often the valid tweak occurs, and the probability that I typed kathal when I really meant that tweak. This is what such a calculation would look like:
| Tweak | Frequency | Probability of typing kathal | Product |
|---|---|---|---|
| kal | 1 | 0.04 | 0.04 |
| kol | 4 | 0.02 | 0.08 |
| kadal | 10 | 0.1 | 1.0 |
| kanal | 1 | 0.01 | 0.01 |
| kaadhal | 6 | 0.25 | 1.50 |
Once we have this, we can see that kaadhal is the right one — it has the maximum value (1.50) in the last column, where we multiply the frequency and the probability.
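Put as a formula, the pick in the table above is: correction(w) = the candidate c that maximises freq(c) × P(w | c). That is just Norvig’s argmax of P(c) × P(w | c), with raw frequency counts standing in for P(c).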
(You probably realise how small my dictionary is, looking at the frequencies. Well, that’s how big Project Madurai is. But increasing the size of a dictionary is a trivial problem.)
Anyway, getting the frequency is easy. How do I get the probabilities, though? That’s what I’m working on right now.
I often need to extract words out of sentences. It’s one of the things I used to build the Statistically Improbable Phrases for Calvin and Hobbes. But splitting a sentence into words isn’t as easy as you think.
Think about it. What is a word?
Something that has spaces around it? OK, let’s start with the simplest way to get words: split by spaces. Consider this piece:
"I'd look at McDonald's," he said. "They sell over 3,000,000 burgers a day -- at $1.50 each." High-fat foods were the rage. For e.g., margins in fries were over 50%... and (except for R&M & Dyana [sic]) everyone was at ~30% net margin; growing at 25% too!
Splitting this by spaces (treating new lines, tabs, etc. as spaces too), we get the following:
"I'd look at McDonald's," ...
Now, some of these like “I’d” are words. But “McDonald’s,” isn’t. I mean, there’s a comma and a double-quote at the end. Clearly we need to remove the punctuation as well. But if we do that, I’d becomes Id. So we need to be careful about which punctuation to remove. Let’s take a closer look.
The following punctuation marks are clear word separators: spaces, the exclamation mark, the question mark, semicolon, brackets of any kind, and double-quotes (not single quotes). No word has these in the middle. If we use these as separators, our list of words is better, but we still have some words with punctuation:
McDonald's, e.g., High-fat R&M ...
The issue is, these punctuation marks are ambiguous word separators: comma, hyphen, single-quote, ampersand, period and slash. These usually separate words, but there are exceptions: commas inside numbers (3,000,000), hyphens in words like High-fat, apostrophes in McDonald’s, ampersands in R&M, periods in abbreviations like e.g., and slashes in URLs.
The colon is ambiguous too. In normal English usage, it would be a separator. But URLs like http://www.s-anand.net/ use it, and it doesn’t make sense to split them.
So here are my current rules for splitting a sentence into words. (It’s a Perl regular expression. Don’t worry. Cooper’s Law: If you do not understand a particular word in a piece of technical writing, ignore it. The piece will make perfect sense without it.)
```perl
    # Split by clear word separators
/   [\s \! \? \;\(\)\[\]\{\}\<\> " ]

    # ... by COMMA, unless it has numbers on both sides: 3,000,000
|   (?<=\D) , | , (?=\D)

    # ... by FULL-STOP, SINGLE-QUOTE, HYPHEN, AMPERSAND, unless it has a letter on both sides
|   (?<=\W) [\.\-\&] | [\.\-\&] (?=\W)

    # ... by QUOTE, unless it follows a letter (e.g. McDonald's, Holmes')
|   (?<=\W) [']

    # ... by SLASH, if it has spaces on at least one side. (URLs shouldn't be split)
|   \s \/ | \/ \s

    # ... by COLON, unless it's a URL or a time (11:30am for e.g.)
|   \:(?!\/\/|\d)
/x;
```
This doesn’t even scratch the surface of the issue, though. Here are some issues:
And you thought it was easy!
In some of the Web projects I’m working on, I have a choice between many small files and a few big files to download.
There are conflicting arguments. I’ve read that many small files are better, because you can choose to use only the required files, and they’ll be cached across the site. (These are typically CSS or Javascript files.) On the other hand, a single large file takes less time to download than the sum of many small files, because there’s less latency. (Latency is more important than bandwidth these days.)
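A rough way to see the latency argument (a back-of-the-envelope model, ignoring parallel connections and pipelining): the time to fetch a file is roughly latency + size / bandwidth, so n separate requests pay the latency n times, while one combined file of the same total size pays it only once.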
I ran some tests, and the answer is rather easy. The graph below shows the average time taken to download a file of size 1KB – 100KB.
The X-axis is the size of the file. The Y-axis is the number of milliseconds taken to download the file, averaged over 10 attempts on my home computer. (I did the same thing at work and at a client site. The results are similar.)
The key thing is, it’s not linear: larger files take disproportionately less time per kilobyte. These days, it looks like a few big files are better.
To give you an example: my home page had the following components:
| | Size (KB) | Load time (ms) |
|---|---|---|
| HTML | 25 | 680 |
| CSS | 4 | 340 |
| Javascript | 6 | 400 |
| Total | 35 | 1420 |
The total time for these 3 files would be about 1.4 seconds. Instead, if I put them all on one file…
| | Size (KB) | Load time (ms) |
|---|---|---|
| All-in-one | 35 | 770 |
The combined file takes only 0.77 seconds — half the download time for split files. It’s a compelling argument to put all your CSS and Javascript (images, too, if possible) into a single page.
But what if people visit multiple pages on my site, and I lose the benefit of caching? Not really a problem: the benefit of caching is small. By having a single file, I add 770 – 680 = 90 ms to each HTML page load. But I don’t have to load the CSS and Javascript individually, which takes 740 ms. The breakeven is about 740 / 90 = 8 page visits.
So, on average, if people visit more than 8 pages on my site, it’s worth breaking up the CSS and Javascript. But the average for my site is only 2 pages per person. (It’s a skewed distribution. Most visitors, from Google, just visit one page. A few, from bookmarks, visit several pages. On average, it’s just over 2.) I’d argue that unless you’re using an external Javascript / CSS library (prototype, etc.) or people view many pages each visit, you’re better off having a single HTML+Javascript+CSS file.
John Resig has written a Sparklines library. Here’s an example. I wrote that HTTP download speeds are not linear and that they flatten out over time. A linear line would look like this: The little red line here is a sparkline that’s based on real data. John’s Javascript converts the data into a graph.
Sparklines were introduced by Edward Tufte.