A R Rahman Hindi songs
By popular demand, here are interludes from 15 Hindi songs of A R Rahman. Can you guess which movie they are from?
Don’t worry about the spelling. Just spell it like it sounds, and the box will turn green.
One of the reasons I moved to WordPress was the ability to write posts offline, for which I use Windows Live Writer most of the time. The beauty of this is that I can preview the post exactly as it will appear on my site. Nothing else that I know is as WYSIWYG, and it’s very useful to be able to type knowing exactly where each word will be.
The only hitch is: if you write your own WordPress theme, Live Writer probably won’t be able to detect your theme — unless you’re an expert theme writer.
I hunted on Google to see how to get my theme to work with Live Writer. I didn’t find any tutorials. So after a bit of hit-and-miss, I’m sharing a quick primer of what worked for me. If you don’t want to go through the hassle, you can always call on professionals who are adept at services like professional custom website design.
Open any post on your blog (using your new theme) and save it as view.html in your theme folder. Now replace the page’s title with {post-title} and the page’s content with {post-body}. For example:
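A minimal view.html might look something like this sketch (the markup and class names here are illustrative; keep whatever structure your own theme uses):

<html>
<head>
  <title>{post-title}</title>
  <!-- Link your theme's stylesheets so the preview matches your site -->
  <link rel="stylesheet" type="text/css" href="style.css" />
</head>
<body>
  <div class="post">
    <h2>{post-title}</h2>
    <div class="entry">{post-body}</div>
  </div>
</body>
</html>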
This is the file Live Writer will be using as its theme. This page will be displayed exactly as it is by Live Writer, with {post-title} and {post-body} replaced with what you type. You can put in anything you want in this page — but at least make sure you include your CSS files.
To let Live Writer know that view.html is what it should display, copy WordPress’ /wp-includes/wlwmanifest.xml to your theme folder and add the following lines just before </manifest>:
<views>
    <view type="WebLayout" src="view.html"/>
</views>
Live Writer searches for the manifest in the <link rel="wlwmanifest"> tag of your home page. Since WordPress already links to its default wlwmanifest.xml, we need to remove that link and add our own. So add the following code to your functions.php:
function my_wlwmanifest_link() {
    // Point Live Writer at the copy of the manifest in the theme folder
    echo '<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="'
        . get_bloginfo('template_url') . '/wlwmanifest.xml" />';
}
remove_action('wp_head', 'wlwmanifest_link');
add_action('wp_head', 'my_wlwmanifest_link');
That’s it. Now if you add your blog to Live Writer, it will automatically detect the theme.
This morning, I was watching an episode of Finley the Fire Engine in which one of the trucks had hiccups. Reminded me of this Calvin & Hobbes — especially Hobbes’ remark in the second strip.
By curious coincidence, just a day after my post on client side scraping, I had a chance to demo this to a client. They were making a contacts database. Now, there are two big problems with managing contacts.
Now, people are happy to fill out information about themselves in great detail. If you look at the public profiles on LinkedIn, you’ll find enough and more detail about most people.
Normally, when getting contact details about someone, I search for their name on Google with a “site:linkedin.com” and look at that information.
Could this be automated?
I spent a couple of hours and came up with a primitive contacts scraper. Click on the link, type in a name, and you should get the LinkedIn profile for that person. (Caveat: it’s very primitive, and works only for public profiles with the standard URL format. Try ‘Peter Peverelli’ as an example.)
It uses two technologies: the Google AJAX Search API and YQL. The search() function searches for a phrase…
// Load the Google AJAX Search API and set up a web searcher
google.load("search", "1");
google.setOnLoadCallback(function () {
  gs = new google.search.WebSearch();
  $('#getinfo').show();
});

// Search for a phrase and pass the results to a callback
function search(phrase, fn) {
  gs.setSearchCompleteCallback(gs, function() { fn(this.results); });
  gs.execute(phrase);
}
… and the linkedin() function takes a LinkedIn URL and extracts the relevant information from it, using XPath.
// Run a YQL query on a URL + XPath and pass the JSON results to a callback
function scrape(url, xpath, fn) {
  $.getJSON('http://query.yahooapis.com/v1/public/yql?callback=?', {
    q: 'select * from html where (url="' + url + '" and xpath="' + xpath + '")',
    format: 'json'
  }, fn);
}

// Pull the profile fields out of a LinkedIn public profile page
function linkedin(url, fn) {
  scrape(url, "//li[@class][h3]", fn);
}
So if you wanted to find Peter Peverelli, it searches on Google for “Peter Peverelli site:linkedin.com” and picks the first result.
From this result, it displays all the <LI> tags which have a class and an <H3> element inside them (that’s what the //li[@class][h3] XPath does).
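Chaining the two together is straightforward. Here is a rough sketch of the glue (the callback handling is simplified from what the demo actually does):

// Find the first LinkedIn result for a name, then scrape that profile
search('"Peter Peverelli" site:linkedin.com', function(results) {
  if (results && results.length) {
    linkedin(results[0].unescapedUrl, function(data) {
      // data.query.results holds the matching <li> elements
      console.log(data.query.results);
    });
  }
});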
The real value of this is in bulk usage. When there’s a big list of contacts, you don’t need to scan each of them for updates. They can be automatically updated — even if all you know is the person’s name, and perhaps where they worked at some point in time.
“Scraping” is extracting content from a website. It’s often used to build something on top of the existing content. For example, I’ve built a site that tracks movies on the IMDb 250 by scraping content.
There are libraries that simplify scraping in most languages: Ruby’s Hpricot and Python’s BeautifulSoup, for instance.
But all of these are on the server side. That is, the program scrapes from your machine. Can you write a web page where the viewer’s machine does the scraping?
Let’s take an example. I want to display Amazon’s bestsellers that cost less than $10. I could write a program that scrapes the site and gets that information. But since the list updates hourly, I’ll have to run it every hour.
That may not be so bad. But consider Twitter. I want to display the latest iPhone tweets from http://search.twitter.com/search.atom?q=iPhone, but the results change so fast that your server can’t keep up.
Nor do you want it to. Ideally, your scraper should just be Javascript on your web page. Any time someone visits, their machine does the scraping. The bandwidth is theirs, and you avoid the popularity tax.
This is quite easily done using Yahoo Query Language. YQL converts the web into a database. All web pages are in a table called html, which has 2 fields: url and xpath. You can get IBM’s home page using:
select * from html where url="http://www.ibm.com"
Try it at Yahoo’s developer console. The whole page is loaded into the query.results element. This can be retrieved using JSONP. Assuming you have jQuery, try the following in Firebug. You should see the contents of IBM’s site on your page.
$.getJSON(
  'http://query.yahooapis.com/v1/public/yql?callback=?',
  { q: 'select * from html where url="http://www.ibm.com"', format: 'json' },
  function(data) { console.log(data.query.results) }
);
That’s it! Now, it’s pretty easy to scrape, especially with XPath. To get the links on IBM’s page, just change the query to
select * from html where url="http://www.ibm.com" and xpath="//a"
Or to get all external links from IBM’s site:
select * from html where url="http://www.ibm.com" and xpath="//a[not(contains(@href,'ibm.com'))][contains(@href,'http')]"
Now you can display this on your own site, using jQuery.
This leads to interesting possibilities, such as Map-Reduce in the browser. Here’s one example. Each movie on the IMDb (e.g. The Dark Knight) comes with a list of recommendations (like this). I want to build a repository of recommendations based on the IMDb Top 250. So here’s the algorithm. First, I’ll get the IMDb Top 250 using:
select * from html where url="http://www.imdb.com/chart/top" and xpath="//tr//tr//tr//td[3]//a"
Then I’ll get a random movie’s recommendations like this:
select * from html where url="http://www.imdb.com/title/tt0468569/recommendations" and xpath="//td/font//a[contains(@href,'/title/')]"
Then I’ll send off the results to my aggregator.
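Strung together, it’s just two YQL calls. Here is a sketch; the aggregator URL and parameter names are placeholders, and the real code is linked below:

// A small YQL helper
function yql(query, fn) {
  $.getJSON('http://query.yahooapis.com/v1/public/yql?callback=?',
    { q: query, format: 'json' }, fn);
}

// Get the IMDb Top 250, pick a random movie, fetch its recommendations
yql('select * from html where url="http://www.imdb.com/chart/top"' +
    ' and xpath="//tr//tr//tr//td[3]//a"', function(data) {
  var movies = data.query.results.a,
      movie  = movies[Math.floor(Math.random() * movies.length)];
  yql('select * from html where url="http://www.imdb.com' + movie.href +
      'recommendations" and xpath="//td/font//a[contains(@href,\'/title/\')]"',
    function(reco) {
      // Ship the recommendations off to the aggregator (placeholder URL)
      $.get('http://example.com/aggregate', { movie: movie.href, reco: reco });
    });
});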
Check out the full code at http://250.s-anand.net/build-reco.js.
In fact, if you visited my IMDb Top 250 tracker, you already ran this code. You didn’t know it, but you just shared a bit of your bandwidth and computation power with me. (Thank you.)
And, if you think a little further, here’s another way of monetising content: borrow a bit of the user’s computation power for complex tasks. There are already startups built around this concept.
I don’t have any copyright declaration on this website.
The problem with that is: content is copyrighted by default. As Jeff Atwood indicates, this means that people with experience in such matters won’t copy the content because they have no legal right to use it.
Let me clarify: I don’t care what you do with my content. Feel free. You don’t have to ask. You don’t have to attribute it to me. You can change it. You can misquote me. Whatever.
I tried to find a Creative Commons license that suits my purposes. Of their licenses, the most liberal is the Creative Commons Attribution license.
This says you can do what you want as long as you attribute my content to me.
But that imposes a constraint. Given the choice, I’d rather have my content quoted than insist on attribution.
The license that best captures this is the WTFPL, or Do What The Fuck You Want To Public License.
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004

Copyright (C) 2004 Sam Hocevar
14 rue de Plaisance, 75014 Paris, France

Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.

DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. You just DO WHAT THE FUCK YOU WANT TO.
So, in the spirit of a happy and open Internet, the contents and code on this site are released under the WTFPL. Do what you want with it.
A tribute to our Academy Award winner, A R Rahman. Here are interludes from 25 Tamil songs of A R Rahman. Can you guess which movie they are from?
Don’t worry about the spelling. Just spell it like it sounds, and the box will turn green.
It’s been a good movie month for me, and I’ve managed to nudge closer to my target of watching the IMDb Top 250.
But one tool I had in the past, that I sorely miss, is twofifty.org. It’s a now-defunct site that kept track of the IMDb Top 250, and let you strike off the movies that you had watched. You could see which movies you hadn’t seen, keep score, and discuss the movies.
Since its demise, my movie watching slowed down as well.
Earlier this month, I set up a similar site at 250.s-anand.net. It has the same basic function. You can log in, strike out movies that you’ve seen, and keep track of what’s left to see. For the more technically minded, the source-code is at two-fifty.googlecode.com.
Happy movie tracking, and looking forward to your suggestions.
I just finished Stephen Few‘s book on Information Dashboard Design. It talks about what’s wrong with the dashboards of most Business Intelligence vendors (Business Objects, Oracle, Informatica, Cognos, Hyperion, etc.), and brings Tuftian principles of chart design to dashboards.
So I took a shot at designing a dashboard based on those principles, and made this dashboard for InfyBLOGS.
You can try for yourself. Go to http://www.s-anand.net/reco/
Note: This only works within the Infosys intranet.
The rest of this article discusses design principles and the technology behind the implementation. (It’s long. Skim by reading just the bold headlines.)
Dashboards are minimalist
I’ll use the design of the dashboard to highlight some of the concepts in the book.
I designed the dashboard first in PowerPoint, keeping these principles in mind: fit everything on one screen, show only the information that matters, and present it so the important things stand out.
The first was easy. I just constrained myself to one page of PowerPoint, though if you had a lot of toolbars, the viewing area of your browser may be less than mine.
The second is a matter of picking what you want to see. For me, that means the handful of things I check whenever I log into InfyBLOGS.
Then I dig deeper, occasionally, into popular entries, popular referrers, how fast the blogs are growing, etc. So I’ve put in what I think are the useful things.
The third is reflected in the way some of this information is shown. Let me explain.
Keep the charts bare
Consider the graphs on the right. Notice the wiggly line next to each number. It’s a chart type called a sparkline, introduced by Edward Tufte. Sparklines are great for showing trends compactly. Forget the axes. Forget the axis labels. Forget the gridlines. The text on the left ("visitors per day") tells you what it is. The number (10475) is the current value. And the line is the trend. Clearly the number of visitors has exploded recently, from a relatively slow and flat start. Labels and axes wouldn’t tell you much more.
Boldly highlight what’s important
The most important thing here, the title, is on top. It’s pure black, in large font, positioned right on top, and has a line segregating it from the rest of the page.
The sections are highlighted by a bigger font, different colour, and a line, but the effect is a bit more muted.
The numbers on the right are prominent only by virtue of size and position. If anything, the colour de-emphasizes them. This is to make sure that they don’t overwhelm your attention. (They would if they were in red, for instance.)
The number 10475 is carefully set to occupy exactly two line spaces. That alignment is very important. The small lines are at a font size of 11px, and line spacing is 1.5. So at a font size of 2 x 11px x 1.5 = 33px, the size of the large number covers exactly two rows.
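In CSS, that sizing works out to something like this sketch (the class names are made up for illustration):

.stat-label  { font-size: 11px; line-height: 1.5; }  /* small rows: 16.5px each */
.stat-number { font-size: 33px; line-height: 1; }    /* 2 x 11px x 1.5 = 33px: exactly two rows */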
The labels, such as "visitors" or "sites" are in blue, bold. Nothing too out of the way, but visible enough that they stand out.
The "View more…" links just use colour to stand out. They’re pretty unimportant.
The bulk of the text is actually made of links. Unlike traditional links, they’re not underlined and they’re not blue. It would just add to the noise if everything were underlined. But on mouseover, they turn blue and get underlined, clearly indicating that they’re links.
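The styling for that is a couple of lines of CSS, roughly like this (the colours and selector are illustrative):

/* Links look like ordinary text until you hover over them */
.dashboard a       { color: #333; text-decoration: none; }
.dashboard a:hover { color: #00f; text-decoration: underline; }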
I’ve used four mechanisms to highlight relative importance: position, size, colour and lines. There are many more: font, styling, boxes, underlining, indentation, etc.
The purpose of good design is to draw attention to what’s important, to direct the flow of the eye in a way that builds a good narrative. Don’t be shy of using every tool at your disposal in doing that. While on the topic, the Non-Designer’s Design Book is a fantastic and readable book on design for engineers.
Always use grids to display
I can’t say this any better than Khoi Vinh of subtraction.com and Mark Boulton, in their presentation Grids are Good. Grids are pervasive in every form of good design. It’s a fantastic starting point as well. Just read the slideshow and follow it blindly. You can’t go wrong.
This dashboard uses a 12-column grid. The page is divided into two halves vertically. The left half has mostly text and the right half has statistics and help. There are 3 blocks within each, all the same size (alignment is critical in good design). Two of the blocks on the left are subdivided into halves, while the bottom-right "Links and help" section is subdivided into three. It’s easier to show than to explain.
Copy shamelessly
The design for this dashboard was copied in part from WordPress 2.7’s new dashboard, part from the dashboards on Stephen Few’s book, and part from the winners of the 2008 Excel dashboard competition.
Most of these designs are minimalist. There’s no extra graphics, jazzy logos, etc. that detract from the informational content. This is an informational dashboard, not a painting.
Good design is everywhere. View this presentation on How to be a Creative Sponge to get a sense of the many sources you can draw inspiration from. You’re much, much better off copying good design than inventing bad ones.
You can hack the dashboard yourself
I’ve uploaded the source for the dashboard at http://infyblog-dashboard.googlecode.com/
Please feel free to browse through it, but don’t stop there! Go ahead and tweak it to suit what you think should be on an ideal dashboard. I’ll give access to the project to anyone who asks (leave a comment, mail me or call me at +44 7957 440 260).
Please hack at it. Besides, it’s great fun learning jQuery and working on a distributed open source project.
Now for some notes on how this thing works.
Bookmarklets bring Greasemonkey to any browser
This is implemented as a bookmarklet (a bookmark written in Javascript). It just loads the file http://www.s-anand.net/infyblog-dashboard.js which does all the grunt work. This is the code for the bookmarklet.
// Create a new script element
var s = document.createElement('script');

// Set the source of the script element to my bookmarklet script
s.setAttribute('src', 'http://www.s-anand.net/infyblog-dashboard.js');

// Add the script element to the head
// This does the equivalent of:
// <head><script src="..."></script></head>
document.getElementsByTagName('head')[0].appendChild(s);

// Return void. Not sure why, but this is required
void(s);
This form of the bookmarklet is perhaps its most powerful use. It lets you inject any script into any site. If you want to change a site’s appearance, content, anything, just write the change as a bookmarklet. (Think of it as the IE and Safari equivalent of Firefox’s Greasemonkey.)
So if you wanted to load jQuery into any page:
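You could use a bookmarklet along these lines (the jQuery URL here assumes Google’s CDN copy; any hosted copy works):

javascript:(function() {
  var s = document.createElement('script');
  s.setAttribute('src',
    'http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js');
  document.getElementsByTagName('head')[0].appendChild(s);
})();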
jQuery is a powerful Javascript library
Firstly, you don’t want to program Javascript without a library. You just don’t. The state of browser incompatibilities and DOM usability is just too pathetic to contemplate. Stop learning DOM methods. Pick any library instead.
Going by popularity, you would do well to pick jQuery. Here’s a graph of searches on Google for the popular Javascript libraries, and jQuery leads the pack.
(The graph compares search volumes for jquery, prototype, dojo, ext and yui.)
I picked it in early 2007 for several reasons. These days, I have some additional ones.
So anyway, the first thing I do in the dashboard script is to load jQuery, using the same code outlined in the bookmarklet section above. The only addition is:
script.onload = script.onreadystatechange = function() {
  if (!this.readyState || this.readyState == 'loaded' || this.readyState == 'complete') {
    callback();
  }
};
This tells the browser to run the callback function once the script has loaded. My main dashboard function runs once jQuery is loaded.
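Put together, the loader looks roughly like this (loadScript and dashboard are illustrative names, not the ones in the actual script):

function loadScript(src, callback) {
  var script = document.createElement('script');
  script.src = src;
  // readyState is for older IE; onload covers the rest
  script.onload = script.onreadystatechange = function() {
    if (!this.readyState || this.readyState == 'loaded' || this.readyState == 'complete') {
      callback();
    }
  };
  document.getElementsByTagName('head')[0].appendChild(script);
}

// Load jQuery, then start the dashboard
loadScript('http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js', dashboard);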
InfyBLOGS has most statistics the dashboard needs
Thanks to posts from Utkarsh and Simon, I knew where to find most of the information for the dashboard. There were only a few things I couldn’t find.
Using IFRAMEs, we can load these statistics onto any page
Let’s say we want the latest 10 comments. These are available on the comments page. To load data from another page, we’d normally use XMLHTTPRequest, get the data, and parse it — perhaps using regular expressions.
$.get('/tools/recent_comments.bml', function(data) {
  // Do some regular expression parsing with data
});
But from a readability and maintainability perspective, regular expressions suck. A cleaner way is to use jQuery itself to parse the HTML.
$.get('/tools/recent_comments.bml', function(data) {
  var doc = $(data);
  // Now process the document
});
This works very well for simple pages, but sadly, for our statistics, this throws a stack overflow error (at least on Firefox; I didn’t have the guts to try it on IE.)
So on to a third approach. Why bother using Javascript to parse HTML when we’re running the whole application inside a browser, the most perfect HTML parser there is? Using IFRAMEs, we can load the whole page within the same page and let the browser do the HTML parsing.
$('<iframe src="/tools/recent_comments.bml"></iframe>')
  .appendTo('body')
  .load(function() {
    var doc = $(this).contents();
    // Now process the document
  });
This way, you can read any page from Javascript within the same domain. Since both the bookmarklet and statistics are on the InfyBLOGS domain, we’re fine.
jQuery parses these statistics rather well
Now the doc variable contains the entire comments page, and we can start processing it. For example, the comments are in the rows of the second table on the page (the first table is the navigation on the left). So
doc.find('table:eq(1) tr')
gets the rows in the second table (table:eq(0) is the first table). Now we can go through each row and extract the user, when the comment was made, and links to the entry and the comment.
doc.find('table:eq(1) tr').slice(0, 8).map(function() {
  var e    = $(this),
      user = e.find('td:eq(0) b').text(),
      when = e.find('td:eq(0)').text().replace(user, ''),
      post = e.find('td:eq(1) a:contains(Entry Link)').attr('href'),
      cmnt = e.find('td:eq(1) a:contains(Comment Link)').attr('href');
  // return HTML constructed using the above information
});
Google Charts API displays dynamic charts
Another cool thing is that we can use Google Charts to display charts without having to create an image beforehand. For instance, the URL:
http://chart.apis.google.com/chart?cht=lc&chs=300x200&chd=t:5,10,20,40,80
shows an image with a line chart through the values (5,10,20,40,80).
The dashboard uses sparklines to plot the trends in InfyBLOG statistics. These are supported by Google Charts. We can extract the data from the usage page and create a new image using Google Charts that contains the sparkline.
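Google Charts has a sparkline chart type (cht=ls) that drops the axes entirely. So a URL like this (the data values are made up):

http://chart.apis.google.com/chart?cht=ls&chs=100x30&chd=t:5,10,20,40,80

gives a bare 100×30 trend line that can sit right next to a number.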
If you want to play around with Google Charts without mucking around with the URL structure, I have a relatively easy to use chart constructor.
Always use CSS frameworks
Libraries aren’t restricted to Javascript. CSS frameworks exist, and as with Javascript libraries, these are a no-brainer. It’s not worth hand-coding CSS from scratch. Just build on top of one of these.
The state of CSS frameworks isn’t as advanced as Javascript’s. Yahoo has its UI grids, Blueprint‘s pretty popular and I found Tripoli to be nice, but the one I’m using here is 960.gs. It lets you create a 12-grid or 16-grid on a 960-pixel wide frame, which is good enough for my purposes.
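With 960.gs, the layout is just divs with grid classes. A sketch of the dashboard’s top-level split might look like this:

<div class="container_12">
  <div class="grid_6">posts, comments and other text</div>
  <div class="grid_6">statistics and help</div>
  <div class="clear"></div>
</div>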
At nearly 2,500 words, this is the longest post I’ve ever written. It took a day to design and code the dashboard, but longer to write this post. Like Paul Buchheit says, it’s easier to communicate with code. I’ve also violated my principle of Less is More. My apologies. But to re-iterate:
Please hack it
The code is at http://infyblog-dashboard.googlecode.com/. You can join the project and make changes. Leave a comment, mail me or call me.
I’ve recently switched to Python, after having programmed in Perl for many years. I’m sacrificing all my knowledge of the libraries and language quirks of Perl. Despite that, I moved for a somewhat trivial reason, actually: Python doesn’t require a closing brace.
Consider this Javascript (or very nearly C or Java) code:
var s = 0;
for (var i = 0; i < 10; i++) {
    for (var j = 0; j < 10; j++) {
        s = s + i * j
    }
}
That’s 6 lines, with two lines just containing the closing brace. Or consider Perl.
my $s = 0;
foreach my $i (1 .. 10) {
    foreach my $j (1 .. 10) {
        $s = $s + $i * $j
    }
}
Again, 6 lines with 2 for the braces. The $ before the variables also drops readability just a little bit. Here’s Python:
s = 0
for i in xrange(1, 10):
    for j in xrange(1, 10):
        s = s + i * j
On the margin, I like writing shorter programs, and it annoys me to no end to have about 20 lines in a 100-line program devoted to standalone closing braces.
What I find is that once you really know one language, the rest are pretty straightforward. OK, that’s not true. Let me qualify. Knowing one language well out of C, C++, Java, Javascript, PHP, Perl, Python and Ruby means you can program in any of the others pretty quickly, perhaps with a day’s reading and a reference manual handy. It does NOT mean you can pick up and start coding in Lisp, Haskell, Erlang or OCaml.
Occasionally, availability constrains which programming language I use. If I’m writing a web app, and my hosting provider does not offer Ruby or Python, that rules them out. If I don’t have a C or Java compiler on my machine, that rules them out. But quite often, these can be overcome. Installing a compiler is trivial and switching hosting providers is not too big a deal either.
Most often, it’s the libraries that determine the language I pick for a task. Perl’s regular expression library is why I’ve been using it for many years. Ruby’s HPricot and Python’s BeautifulSoup make them ideal for scraping, much more than any regular expression setup I could use with Perl. Python Image Library is great with graphics, though for animated GIFs, I need to go to the GIF89 library in Java. And I can’t do these easily with other languages. Though each of these languages boast of vast libraries (and it’s true), there are still enough things that you want done on a regular basis for which some libraries are a lot easier to use than others.
So these days, I just find the library that suits my purpose, and pick the language based on that. Working with Flickr API or Facebook API? Go with their default PHP APIs. Working on AppEngine? Python. These days, I pick Python by default, unless I need something quick and dirty, or if it’s heavy text processing. (Python’s regular expression syntax isn’t as elegant as Perl’s or Javascript’s, mainly because it isn’t built into the language.)
To get a better hang of Python (and out of sheer bloody-mindedness), I’m working through the problems in Project Euler in Python. For those who don’t know about Project Euler,
Project Euler is a series of challenging mathematical/computer programming problems that will require more than just mathematical insights to solve. Although mathematics will help you arrive at elegant and efficient methods, the use of a computer and programming skills will be required to solve most problems.
Each problem has been designed according to a "one-minute rule", which means that although it may take several hours to design a successful algorithm with more difficult problems, an efficient implementation will allow a solution to be obtained on a modestly powered computer in less than one minute.
It’s a great way of learning a new programming language, and my knowledge of Python is pretty rudimentary. At the very least, going through this exercise has taught me the basics of generators.
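As an illustration (this is a generic example, not one of my actual solutions), the Fibonacci numbers fall out naturally as a generator:

def fibonacci():
    """Yield Fibonacci numbers one at a time, forever."""
    a, b = 1, 2
    while True:
        yield a
        a, b = b, a + b

# Project Euler problem 2: sum the even Fibonacci numbers below four million
total = 0
for f in fibonacci():
    if f >= 4000000:
        break
    if f % 2 == 0:
        total += f
print total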
I’ve solved around 40 problems so far. Here are my solutions to Project Euler. I’m also measuring the time it takes to run each solution. My target is no more than 10 seconds per problem, rather than the one minute of the rule, for a very practical reason: the solutions page itself is generated by a Python script that runs each script, and I can’t stand waiting longer than that for the page to refresh each time I solve a problem.