S Anand

Client side scraping

“Scraping” is extracting content from a website. It’s often used to build something on top of the existing content. For example, I’ve built a site that tracks movies on the IMDb 250 by scraping content.

There are libraries that simplify scraping in most languages.

But all of these are on the server side. That is, the program scrapes from your machine. Can you write a web page where the viewer’s machine does the scraping?

Let’s take an example. I want to display Amazon’s bestsellers that cost less than $10. I could write a program that scrapes the site and gets that information. But since the list updates hourly, I’ll have to run it every hour.

That may not be so bad. But consider Twitter. I want to display the latest iPhone tweets from http://search.twitter.com/search.atom?q=iPhone, but the results change so fast that your server can’t keep up.

Nor do you want it to. Ideally, your scraper should just be Javascript on your web page. Any time someone visits, their machine does the scraping. The bandwidth is theirs, and you avoid the popularity tax.

This is quite easily done using Yahoo Query Language. YQL converts the web into a database. All web pages are in a table called html, which has 2 fields: url and xpath. You can get IBM’s home page using:

select * from html where url="http://www.ibm.com"

Try it at Yahoo’s developer console. The whole page is loaded into the query.results element. This can be retrieved using JSONP. Assuming you have jQuery, try the following in Firebug. You should see the contents of IBM’s site in the console.

$.getJSON(
  'http://query.yahooapis.com/v1/public/yql?callback=?',
  {
    q: 'select * from html where url="http://www.ibm.com"',
    format: 'json'
  },
  function(data) {
    console.log(data.query.results)
  }
);

That’s it! Now, it’s pretty easy to scrape, especially with XPath. To get the links on IBM’s page, just change the query to

select * from html where url="http://www.ibm.com" and xpath="//a"

Or to get all external links from IBM’s site:

select * from html where url="http://www.ibm.com" and xpath="//a[not(contains(@href,'ibm.com'))][contains(@href,'http')]"

Now you can display this on your own site, using jQuery.
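The queries above all follow one pattern, so a small helper can compose them. This is only a sketch: the names yqlQuery and yqlUrl are hypothetical (they are not part of YQL or jQuery), and it assumes the XPath contains no double quotes, as in the examples above.

```javascript
// Hypothetical helpers that compose the queries shown above.
// Assumes the XPath contains no double quotes.
function yqlQuery(url, xpath) {
  var q = 'select * from html where url="' + url + '"';
  if (xpath) {
    q += ' and xpath="' + xpath + '"';
  }
  return q;
}

// Build the request URL for the public endpoint used in the $.getJSON call.
// jQuery replaces the trailing callback=? with a generated JSONP callback name.
function yqlUrl(url, xpath) {
  return 'http://query.yahooapis.com/v1/public/yql' +
    '?format=json' +
    '&q=' + encodeURIComponent(yqlQuery(url, xpath)) +
    '&callback=?';
}

console.log(yqlQuery('http://www.ibm.com', '//a'));
```

With jQuery loaded, $.getJSON(yqlUrl('http://www.ibm.com', '//a'), function(data) { ... }) should behave like the call shown earlier.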

 

This leads to interesting possibilities, such as Map-Reduce in the browser. Here’s one example. Each movie on the IMDb (e.g. The Dark Knight) comes with a list of recommendations (like this). I want to build a repository of recommendations based on the IMDb Top 250. So here’s the algorithm. First, I’ll get the IMDb Top 250 using:

select * from html where url="http://www.imdb.com/chart/top" and xpath="//tr//tr//tr//td[3]//a"

Then I’ll get a random movie’s recommendations like this:

select * from html where url="http://www.imdb.com/title/tt0468569/recommendations" and xpath="//td/font//a[contains(@href,'/title/')]"

Then I’ll send off the results to my aggregator.

Check out the full code at http://250.s-anand.net/build-reco.js.
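As a rough sketch of the two steps that don’t touch the network: picking a random movie, and folding one visitor’s results into a running tally. The names pickRandom and addRecommendations, the shape of the tally, and the example IDs are assumptions made for illustration. They are not taken from build-reco.js.

```javascript
// Pick one item at random. The random source is injectable so that the
// choice can be reproduced (assumption: it defaults to Math.random).
function pickRandom(items, rng) {
  rng = rng || Math.random;
  return items[Math.floor(rng() * items.length)];
}

// Fold one visitor's result into a tally of the form
// { movieId: { recommendedId: count } }. This shape is hypothetical.
function addRecommendations(tally, movieId, recommendedIds) {
  var row = tally[movieId] || (tally[movieId] = {});
  recommendedIds.forEach(function (id) {
    row[id] = (row[id] || 0) + 1;
  });
  return tally;
}

// 'example-a' and 'example-b' are made-up IDs, not real IMDb titles.
var tally = {};
addRecommendations(tally, 'tt0468569', ['example-a', 'example-b']);
addRecommendations(tally, 'tt0468569', ['example-a']);
console.log(JSON.stringify(tally));
```

Each visitor does a small piece of the "map" work, and the aggregator only has to add counts, which is why the server stays cheap.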

 

In fact, if you visited my IMDb Top 250 tracker, you already ran this code. You didn’t know it, but you just shared a bit of your bandwidth and computation power with me. (Thank you.)

And, if you think a little further, here’s another way of monetising content: borrowing a bit of the user’s computation power to perform complex tasks. There already are startups built around this concept.

No copyright

I don’t have any copyright declaration on this website.

The problem with that is: content is copyrighted by default. As Jeff Atwood indicates, this means that people with experience in such matters won’t copy the content because they have no legal right to use it.

Let me clarify: I don’t care what you do with my content. Feel free. You don’t have to ask. You don’t have to attribute it to me. You can change it. You can misquote me. Whatever.

I tried to find a Creative Commons license that suits my purposes. Of their licenses, the most liberal is the Creative Commons Attribution license.

This says you can do what you want as long as you attribute my content to me.

But that creates a constraint. And if I had a choice, I’d rather have my content quoted than be attributed.

The license that best captures this is the WTFPL, or Do What The Fuck You Want To Public License.

           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
                   Version 2, December 2004

Copyright (C) 2004 Sam Hocevar
 14 rue de Plaisance, 75014 Paris, France
Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.

           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
  TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

 0. You just DO WHAT THE FUCK YOU WANT TO.

So, in the spirit of a happy and open Internet, the contents and code on this site are released under the WTFPL. Do what you want with it.

A R Rahman songs

A tribute to our Academy Award winner, A R Rahman. Here are interludes from 25 Tamil songs of A R Rahman. Can you guess which movie they are from?

Don’t worry about the spelling. Just spell it like it sounds, and the box will turn green.

Search for the song and listen online, if you want to confirm your guess.

Try your luck with the Hindi songs of A R Rahman too.

twofifty.org

It’s been a good movie month for me, and I’ve managed to nudge closer to my target of watching the IMDb Top 250.

But one tool I had in the past, that I sorely miss, is twofifty.org. It’s a now-defunct site that kept track of the IMDb Top 250, and let you strike off the movies that you had watched. You could see which movies you hadn’t seen, keep score, and discuss the movies.

Since its demise, my movie watching has slowed down as well.

Earlier this month, I set up a similar site at 250.s-anand.net. It has the same basic function. You can log in, strike out movies that you’ve seen, and keep track of what’s left to see. For the more technically minded, the source-code is at two-fifty.googlecode.com.

Visit 250.s-anand.net

Happy movie tracking, and looking forward to your suggestions.

Infyblogs dashboard

I just finished Stephen Few’s book, Information Dashboard Design. It talks about what’s wrong with the dashboards from most Business Intelligence vendors (Business Objects, Oracle, Informatica, Cognos, Hyperion, etc.), and brings Tuftian principles of chart design to dashboards.

So I took a shot at designing a dashboard based on those principles, and made this dashboard for InfyBLOGS.

Infyblog dashboard

You can try for yourself. Go to http://www.s-anand.net/reco/
Note: This only works within the Infosys intranet.

  1. Right click on the "Infyblog Dashboard" link and click "Add to Favourites…" (Non-IE users — drag and drop it to your links bar)
  2. If you get a security alert, say "Yes" to continue
  3. Return to InfyBLOGS, make sure you’re logged in (that’s important) and click on the "Infyblog Dashboard" bookmark
  4. You’ll see a dashboard for your account, with comments and statistics

The rest of this article discusses design principles and the technology behind the implementation. (It’s long. Skim by reading just the bold headlines.)

Dashboards are minimalist

I’ll use the design of the dashboard to highlight some of the concepts in the book.

I designed the dashboard first in PowerPoint, keeping these principles in mind.

  1. Fits in one screen. No scrolling. Otherwise, it isn’t a dashboard.
  2. Shows only what I need to see. Ideally, from this dashboard, I should receive all the information I need to act on, and no more.
  3. Maximises the data-ink ratio. Don’t draw a single pixel that’s not required.

The first was easy. I just constrained myself to one page of PowerPoint, though if you had a lot of toolbars, the viewing area of your browser may be less than mine.

The second is a matter of picking what you want to see. For me, these are the things I look for when I log into InfyBLOGS:

  1. Any new comments?
  2. Any new posts from my friends?
  3. What’s new? What’s hot?

Then I dig deeper, occasionally, into popular entries, popular referrers, how fast the blogs are growing, etc. So I’ve put in what I think are the useful things.

The third is reflected in the way some of this information is shown. Let me explain.

Keep the charts bare

Consider the graphs on the right. They look like this.

Notice the wiggly line to the right. It’s a graph called a sparkline, and was introduced by Edward Tufte. Sparklines are great for showing trends in a compact way. Forget the axes. Forget the axis labels. Forget the gridlines. The text on the left ("visitors per day") tells you what it is. The number (10475) is the current value. And the line is the trend. Clearly the number of visitors has exploded recently, from a relatively slow and flat start. The labels and axes aren’t going to tell you much more.

Boldly highlight what’s important

The most important thing here, the title, is on top. It’s pure black, in large font, positioned right on top, and has a line segregating it from the rest of the page.

The sections are highlighted by a bigger font, different colour, and a line, but the effect is a bit more muted.

The numbers on the right are prominent only by virtue of size and position. If anything, the colour de-emphasizes them. This is to make sure that they don’t overwhelm your attention. (They would if they were in red, for instance.)

The number 10475 is carefully set to occupy exactly two line spaces. That alignment is very important. The small lines are at a font size of 11px, and line spacing is 1.5. So at a font size of 2 x 11px x 1.5 = 33px, the size of the large number covers exactly two rows.
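The same arithmetic works for any number of rows. A minimal sketch, assuming the large text itself is set with a line height of 1 (the function name is hypothetical):

```javascript
// Font size (in px) at which one line of large text is as tall as `rows`
// lines of body text, given the body font size and its line-height multiplier.
function spanningFontSize(rows, baseFontPx, lineHeight) {
  return rows * baseFontPx * lineHeight;
}

console.log(spanningFontSize(2, 11, 1.5)); // 33
```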

The labels, such as "visitors" or "sites" are in blue, bold. Nothing too out of the way, but visible enough that they stand out.

The "View more…" links just use colour to stand out. They’re pretty unimportant.

The bulk of the text is actually made of links. Unlike traditional links, they’re not underlined and they’re not blue. It would just add to the noise if everything were underlined. But on mouseover, they turn blue and are underlined, clearly indicating that they’re links.

I’ve used four mechanisms to highlight relative importance: position, size, colour and lines. There are many more: font, styling, boxes, underlining, indentation, etc.

The purpose of good design is to draw attention to what’s important, to direct the flow of the eye in a way that builds a good narrative. Don’t be shy of using every tool at your disposal in doing that. While on the topic, the Non-Designer’s Design Book is a fantastic and readable book on design for engineers.

Always use grids to display

I can’t say this any better than Khoi Vinh of subtraction.com and Mark Boulton, in their presentation Grids are Good. Grids are pervasive in every form of good design. They’re a fantastic starting point as well. Just read the slideshow and follow it blindly. You can’t go wrong.

This dashboard uses a 12-column grid. The page is divided vertically into two halves. The left half has mostly text and the right half has statistics and help. There are 3 blocks within each, and they’re all the same size (alignment is critical in good design). Two of the blocks on the left are subdivided into halves, while the bottom right "Links and help" section is subdivided into three. Well, it’s easier to show it than explain it:

Picture of grid

Copy shamelessly

The design for this dashboard was copied in part from WordPress 2.7’s new dashboard, in part from the dashboards in Stephen Few’s book, and in part from the winners of the 2008 Excel dashboard competition.

Most of these designs are minimalist. There are no extra graphics, jazzy logos, etc. that detract from the informational content. This is an informational dashboard, not a painting.

Good design is everywhere. View this presentation on How to be a Creative Sponge to get a sense of the many sources you can draw inspiration from. You’re much, much better off copying good designs than inventing bad ones.

You can hack the dashboard yourself

I’ve uploaded the source for the dashboard at http://infyblog-dashboard.googlecode.com/

Please feel free to browse through it, but don’t stop there! Go ahead and tweak it to suit what you think should be on an ideal dashboard. I’ll give access to the project to anyone who asks (leave a comment, mail me or call me at +44 7957 440 260).

Please hack at it. Besides, it’s great fun learning jQuery and working on a distributed open source project.

Now for some notes on how this thing works.

Bookmarklets bring Greasemonkey to any browser

This is implemented as a bookmarklet (a bookmark written in Javascript). It just loads the file http://www.s-anand.net/infyblog-dashboard.js which does all the grunt work. This is the code for the bookmarklet.

// Create a new script element
var s = document.createElement('script');

// Set the source of the script element to my bookmarklet script
s.setAttribute('src','http://www.s-anand.net/infyblog-dashboard.js');

// Add the script element to the head
// This does the equivalent of:
// <head><script src="..."></script></head>
document.getElementsByTagName('head')[0].appendChild(s);

// Evaluate to undefined. If a bookmarklet ends in a value,
// the browser replaces the page with that value
void(s);

This form of the bookmarklet is perhaps its most powerful use. It lets you inject any script into any site. If you want to change a site’s appearance, content, anything, just write it as a bookmarklet. (Think of it as the IE and Safari equivalent of Firefox’s Greasemonkey.)
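Since only the script URL changes from one such bookmarklet to the next, the bookmark’s javascript: URL can be generated. The sketch below is one way to do it. The helper name makeBookmarklet is hypothetical, and it wraps the code in a function call so that the whole expression evaluates to undefined, which stands in for the void at the end.

```javascript
// Build a javascript: URL that injects the given script into the current page.
// Hypothetical helper; JSON.stringify is used only to quote the URL safely.
function makeBookmarklet(scriptUrl) {
  return 'javascript:(function(){' +
    "var s=document.createElement('script');" +
    "s.setAttribute('src'," + JSON.stringify(scriptUrl) + ');' +
    "document.getElementsByTagName('head')[0].appendChild(s);" +
    '})();';
}

console.log(makeBookmarklet('http://www.s-anand.net/infyblog-dashboard.js'));
```

Paste the printed string into the location field of a new bookmark. If you embed it in an HTML link instead, the double quotes around the URL would need escaping.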

So if you wanted to load jQuery into any page:

  1. Change the URL in line 5 above (the s.setAttribute call) to the jQuery URL and add it as a bookmark. (Or use Karl’s jQuery bookmarklet. It’s better.)
  2. Go to any page, say the IMDB Top 250 for instance
  3. Click on the bookmarklet. Now your page is jQueryified. (If you used Karl’s bookmarklet, it’ll tell you so.)
  4. On the address bar, type "javascript:alert($('table:eq(10) tr').length)" and press enter.
  5. You should see 252, the number of rows in the main table of the IMDB Top 250

Notes:

jQuery is a powerful Javascript library

Firstly, you don’t want to program Javascript without a library. You just don’t. The state of browser incompatibilities and DOM usability is just too pathetic to contemplate. Stop learning DOM methods. Pick any library instead.

Going by popularity, you would do well to pick jQuery. Here’s a graph of searches on Google for the popular Javascript libraries, and jQuery leads the pack.

Legend: jquery prototype dojo ext yui

Google Trends showing jQuery as the dominant library

I picked it in early 2007, and my reasons were that it:

  1. Is small. At that time, it was 19KB and was the smallest of the libraries. (Today it’s 18KB and has more features.)
  2. Makes your code compact. It lets you chain functions, and overloads the $() function to work with DOM elements, HTML strings, arrays, objects, anything.
  3. Doesn’t pollute the namespace. Just 2 global variables: jQuery and $. And you can make it not use $ if you like.
  4. Is very intuitive. You can learn it in an hour, and the documentation is extremely helpful.
  5. Is fully functional. Apart from DOM manipulation, it covers CSS, animations and AJAX, which is all I want from a library.

These days, I have some additional reasons:

  1. It’s extensible. The plugin ecosystem is rich.
  2. It’s hosted at Google, and is really fast to load. Most often, it’s already cached in your users’ browser.
  3. John Resig is a genius. After env.js and processing.js, I trust him to advance jQuery better than other libraries.

So anyway, the first thing I do in the dashboard script is to load jQuery, using the same code outlined in the bookmarklet section above. The only addition is:

script.onload = script.onreadystatechange = function() {
    if (!this.readyState||this.readyState=='loaded'||this.readyState=='complete'){
        callback();
    }
};

This tells the browser to run the function "callback" once the script is loaded. The check on readyState is there for IE, which fires onreadystatechange rather than onload. My main dashboard function runs once jQuery is loaded.
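Put together with the bookmarklet code, the loader might look like the sketch below. The function name loadScript, the doc parameter (passed in so that the function can be tried with a stand-in object outside a browser) and the done flag are my additions here, not the dashboard’s actual code.

```javascript
// Load a script and run callback once it has loaded.
// `doc` is normally the browser's document (assumption: passed in explicitly).
function loadScript(doc, url, callback) {
  var script = doc.createElement('script');
  var done = false; // guard: a browser may fire more than one of these events

  script.onload = script.onreadystatechange = function () {
    var state = this.readyState;
    if (!done && (!state || state == 'loaded' || state == 'complete')) {
      done = true;
      callback();
    }
  };

  script.setAttribute('src', url);
  doc.getElementsByTagName('head')[0].appendChild(script);
  return script;
}
```

In a browser, this would be called as loadScript(document, jQueryUrl, main), where jQueryUrl and main stand for the library URL and the function to run afterwards.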

InfyBLOGS has most statistics the dashboard needs

Thanks to posts from Utkarsh and Simon, I knew where to find most of the information for the dashboard:

The only things I couldn’t find were:

  • Friends posts. I’d have liked to show the titles of recent posts by friends. Yeah, I know: blogs/user_name/friends has it, but thanks to user styles, it’s nearly impossible to parse in a standard way across users. I’d really like an RSS feed for friends.
  • Interests posts. It would be cool to show recent posts by users who share your interests.
  • Communities posts. Again, I’d have liked to show the recent posts in communities, rather than just the names of communities.

Using IFRAMEs, we can load these statistics onto any page

Let’s say we want the latest 10 comments. These are available on the comments page. To load data from another page, we’d normally use XMLHttpRequest, get the data, and parse it, perhaps using regular expressions.

$.get('/tools/recent_comments.bml', function(data) {
    // Do some regular expression parsing with data
});

But from a readability and maintainability perspective, regular expressions suck. A cleaner way is to use jQuery itself to parse the HTML.

$.get('/tools/recent_comments.bml', function(data) {
    var doc = $(data);
    // Now process the document
});

This works very well for simple pages, but sadly, for our statistics, this throws a stack overflow error (at least on Firefox; I didn’t have the guts to try it on IE).

So on to a third approach. Why bother using Javascript to parse HTML, when we’re running the whole application in a browser, the most perfect HTML parser? Using IFRAMEs, we can load the whole page within the same page, and let the browser do the HTML parsing.

$('<iframe src="/tools/recent_comments.bml"></iframe>')
    .appendTo('body')
    .load(function() {
        var doc = $(this).contents();
        // Now process the document
    });

This way, you can read any page from Javascript within the same domain. Since both the bookmarklet and statistics are on the InfyBLOGS domain, we’re fine.

jQuery parses these statistics rather well

Now the doc variable contains the entire comments page, and we can start processing it. For example, the comments are in the rows of the second table in the page. (The first table is the navigation on the left.) So

doc.find('table:eq(1) tr')

gets the rows in the second table (table:eq(0) is the first table). Now, we can go through each row, extract the user, when the comment was made, links to the entry and to the comment.

doc.find('table:eq(1) tr').slice(0,8).map(function() {
    var e = $(this),
        user = e.find('td:eq(0) b').text(),
        when = e.find('td:eq(0)').text().replace(user, ''),
        post = e.find('td:eq(1) a:contains(Entry Link)').attr('href'),
        cmnt = e.find('td:eq(1) a:contains(Comment Link)').attr('href');
    // return HTML constructed using the above information
});

Google Charts API displays dynamic charts

Another cool thing is that we can use Google Charts to display charts without having to create an image beforehand. For instance, the URL:

http://chart.apis.google.com/chart?cht=lc&chs=300x200&chd=t:5,10,20,40,80

shows an image with a line chart through the values (5,10,20,40,80).

Google chart example

The dashboard uses sparklines to plot the trends in InfyBLOGS statistics. These are supported by Google Charts. We can extract the data from the usage page and create a new image using Google Charts that contains the sparkline.
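Here is a sketch of how such a URL could be put together. The function names are hypothetical. It relies on two things from the Google Charts documentation as I understand it: cht=ls is the sparkline chart type, and text-encoded data (chd=t:) should lie between 0 and 100, so raw statistics are scaled first.

```javascript
// Build a Google Charts image URL in the format shown above.
// type: 'lc' for a line chart, 'ls' for a sparkline; size: e.g. '300x200'.
function chartUrl(type, size, values) {
  return 'http://chart.apis.google.com/chart' +
    '?cht=' + type +
    '&chs=' + size +
    '&chd=t:' + values.join(',');
}

// Scale raw values against their maximum so they fit the 0 to 100 range.
function scaleTo100(values) {
  var max = Math.max.apply(null, values);
  return values.map(function (v) {
    return max > 0 ? Math.round(v / max * 100) : 0;
  });
}

console.log(chartUrl('lc', '300x200', [5, 10, 20, 40, 80]));
```

For a sparkline of daily visitors, something like chartUrl('ls', '100x20', scaleTo100(visitorsPerDay)) would do, where visitorsPerDay is the array scraped from the usage page and the size is picked to fit the row height.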

If you want to play around with Google Charts without mucking around with the URL structure, I have a relatively easy to use chart constructor.

Always use CSS frameworks

Libraries aren’t restricted to Javascript. CSS frameworks exist, and as with Javascript libraries, these are a no-brainer. It’s not worth hand-coding CSS from scratch. Just build on top of one of these.

The state of CSS frameworks isn’t as advanced as in Javascript. Yahoo has its UI grids, Blueprint’s pretty popular and I found Tripoli to be nice, but the one I’m using here is 960.gs. It lets you create a 12-column or 16-column grid on a 960-pixel wide frame, which is good enough for my purposes.
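To get a feel for the numbers, here is a small sketch of the widths such a grid produces. It assumes 960.gs’s convention of a 10px margin on each side of every block. The function name is hypothetical.

```javascript
// Width in pixels of a block spanning `span` columns in a 960px frame,
// assuming 10px margins on each side of every block (as in 960.gs).
function gridWidth(span, columns) {
  var unit = 960 / columns;  // 80px for 12 columns, 60px for 16
  return span * unit - 20;   // subtract the left and right margins
}

console.log(gridWidth(6, 12)); // 460: one half of the page
console.log(gridWidth(3, 12)); // 220: half of a half
console.log(gridWidth(2, 12)); // 140: a third of a half
```

That is how a 12-column grid divides cleanly into the halves and thirds the dashboard layout needs.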


At nearly 2,500 words, this is the longest post I’ve ever written. It took a day to design and code the dashboard, but longer to write this post. Like Paul Buchheit says, it’s easier to communicate with code. I’ve also violated my principle of Less is More. My apologies. But to re-iterate:

Please hack it

The code is at http://infyblog-dashboard.googlecode.com/. You can join the project and make changes. Leave a comment, mail me or call me.

To Python from Perl

I’ve recently switched to Python, after having programmed in Perl for many years. I’m sacrificing all my knowledge of the libraries and language quirks of Perl. I moved despite that for a somewhat trivial reason, actually: Python doesn’t require a closing brace.

Consider this Javascript (or very nearly C or Java) code:

var s=0;
for (var i=0; i<10; i++) {
    for (var j=0; j<10; j++) {
        s = s + i * j
    }
}

That’s 6 lines, with two lines just containing the closing brace. Or consider Perl.

my $s = 0;
foreach my $i (0 .. 9) {
    foreach my $j (0 .. 9) {
        $s = $s + $i * $j
    }
}

Again, 6 lines with 2 for the braces. The $ before the variables also drops readability just a little bit. Here’s Python:

s = 0
for i in xrange(10):
    for j in xrange(10):
        s = s + i * j

On the margin, I like writing shorter programs, and it annoys me to no end to have about 20 lines in a 100-line program devoted to standalone closing braces.
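If you want to check that figure on your own code, a rough count is easy. The sketch below is just a regular expression over lines, not a parser, and the function name is hypothetical.

```javascript
// Count the lines that hold nothing but a closing brace (plus optional
// whitespace or a trailing semicolon). A rough measure, not a parser.
function braceOnlyLines(source) {
  return source.split('\n').filter(function (line) {
    return /^\s*\}\s*;?\s*$/.test(line);
  }).length;
}

// The Javascript loop from above, as a string
var sample = [
  'var s=0;',
  'for (var i=0; i<10; i++) {',
  '    for (var j=0; j<10; j++) {',
  '        s = s + i * j',
  '    }',
  '}'
].join('\n');

console.log(braceOnlyLines(sample) + ' of ' + sample.split('\n').length + ' lines');
```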

What I find is that once you really know one language, the rest are pretty straightforward. OK, that’s not true. Let me qualify. Knowing one language well out of C, C++, Java, Javascript, PHP, Perl, Python and Ruby means that you can program in any other pretty quickly, perhaps with a day’s reading and a reference manual handy. It does NOT mean that you can pick up and start coding with Lisp, Haskell, Erlang or OCaml.

Occasionally, availability constrains which programming language I use. If I’m writing a web app, and my hosting provider does not offer Ruby or Python, that rules them out. If I don’t have a C or Java compiler on my machine, that rules them out. But quite often, these can be overcome. Installing a compiler is trivial and switching hosting providers is not too big a deal either.

Most often, it’s the libraries that determine the language I pick for a task. Perl’s regular expression library is why I’ve been using it for many years. Ruby’s Hpricot and Python’s BeautifulSoup make them ideal for scraping, much more than any regular expression setup I could use with Perl. The Python Imaging Library is great with graphics, though for animated GIFs, I need to go to the GIF89 library in Java. And I can’t do these easily with other languages. Though each of these languages boasts of vast libraries (and it’s true), there are still enough things that you want done on a regular basis for which some libraries are a lot easier to use than others.

So these days, I just find the library that suits my purpose, and pick the language based on that. Working with Flickr API or Facebook API? Go with their default PHP APIs. Working on AppEngine? Python. These days, I pick Python by default, unless I need something quick and dirty, or if it’s heavy text processing. (Python’s regular expression syntax isn’t as elegant as Perl’s or Javascript’s, mainly because it isn’t built into the language.)

To get a better hang of Python (and out of sheer bloody-mindedness), I’m working through the problems in Project Euler in Python. For those who don’t know about Project Euler,

Project Euler is a series of challenging mathematical/computer programming problems that will require more than just mathematical insights to solve. Although mathematics will help you arrive at elegant and efficient methods, the use of a computer and programming skills will be required to solve most problems.

Each problem has been designed according to a "one-minute rule", which means that although it may take several hours to design a successful algorithm with more difficult problems, an efficient implementation will allow a solution to be obtained on a modestly powered computer in less than one minute.

It’s a great way of learning a new programming language, and my knowledge of Python is pretty rudimentary. At the very least, going through this exercise has taught me the basics of generators.

I’ve solved around 40 problems so far. Here are my solutions to Project Euler. I’m also measuring the time it takes to run each solution. My target is no more than 10 seconds per problem, rather than the one-minute, for a very practical reason: the solutions page itself is generated by a Python script that runs each script, and I can’t stand waiting for more than that to refresh the page each time I solve a problem.

Ubuntu 8.10 on a Dell Latitude D420

Here’s the fastest way I’ve found to install Ubuntu on a USB flash drive, for my Dell Latitude D420. (Pendrivelinux.com is a great resource for this sort of thing.)

Ingredients

  1. One large USB flash drive like this one. Not less than 4GB. I’d suggest 8GB or more
  2. One CD (not a DVD)
  3. Ubuntu 8.10 desktop CD ISO
  4. IMGBurn or any other CD burning software
  5. Direct Internet via LAN cable (without proxy, without wireless)

Installation

  1. Burn the Ubuntu ISO file on the CD
  2. Press F12 when the laptop boots up, and select CD/DVD Drive as the boot device
  3. On the Ubuntu splash screen, select "Try Ubuntu without making any change to your computer" and wait
  4. Insert the flash drive
  5. Go to System > Administration > Create a USB startup disk and follow instructions there
  6. Once done, remove the CD and reboot using the USB flash drive (pressing F12 during the boot sequence)

To enable wireless, which won’t work by default:

  1. Connect to the Internet using a LAN cable
  2. Go to System > Administration > Hardware devices
  3. Select the Broadcom LAN driver, and activate it

That’s it. It’s been a fairly painless installation.

I do have one big crib. I planned to use Hibernation (or suspend-to-disk on Ubuntu) to switch between Windows and Ubuntu. But there are a couple of problems:

  • Hibernate doesn’t work on Ubuntu. I need to reboot Ubuntu every time, and that takes 3 minutes
  • When Windows is hibernating, Ubuntu can’t access any files on the hard disk

This means switching between Ubuntu and Windows is roughly a 6-minute shutdown-one-OS-reboot-the-other process rather than the 1-minute hibernate-one-OS-resume-the-other that I had hoped for.

Another minor problem I have is that our Exchange server doesn’t seem to have an IMAP interface, at least that I know of. So I can’t check mail. But like I said, it’s minor. I just forward mails from my BlackBerry to GMail.

On teaching

This vacation, I took a session each for class XI and XII at my school, Vidya Mandir. The subject was Computer Science (the only one I can teach with some confidence), and the topic was networks.

It was an experiment, in two parts. The first was to understand how students of this generation interact with the Internet. (I’m twice as old as them, so I guess they qualify as the next generation.) The second was to see whether I’d leave them far behind, or they’d leave me far behind.

I began the class with a series of questions.

How many of you have… (expected vs. actual)

  • Access to a PC and the Internet (home or nearby). Expected 80%, actual 100%. Every single one of them raised their hands. Every single one.
  • Chatted online. Expected 70%, actual 100%. Every single one, except for one girl, raised their hands.
  • Used a bluetooth device. Expected 60%, actual 100%. I got nearly everyone, but the remaining were wondering what that was.
  • Video-chatted. Expected 50%, actual 80%.
  • Uploaded a photo or video. Expected 40%, actual 80%. Again, far more than expected.
  • Own a blog or website. Expected 30%, actual 5%. This is where the surprises started. I thought that at least one in 3 would have a blog. Turns out I was wrong. There were very few.
  • Written a web application. Expected 10%, actual 0%. Not one soul. Some thought they had, but no…
  • Contributed to an open source project. Expected 1 or 2, actual 0%. None at all.

It was an eye-opener. On the one hand, everyone has an Internet connection. (In fact, the announcements following the morning prayer began with the Principal warning about the dangers of chatting with strangers online.) On the other hand, they’re doing little of the cool stuff.

Some of the discussions I had after class did lessen my concern a bit. There are, as always, a few who are very interested in hacking, and are playing around with a lot of interesting things. But still, on average…

As for the other part of the experiment, I spent an hour talking about what goes on behind the scenes when they search on Google, taking them down to some of the elements of HTTP. My slides are below. I do suspect I left a fair number of them behind, but there were a handful that were with me right up to the end.

Computer Networks: An Introduction

But I learned something that I did not expect. I spent a lot of time at the staff room, and talking with the teachers. The best way I can summarise what I learnt is through this Calvin and Hobbes strip.

Somehow, I thought the bulk of the discussion in the staff room would centre around students. Or, at the very least, around education. It was eye-opening to listen to a two-hour-long argument on the political reasons behind the tea in the primary school staff room being better than the tea in the high school’s.

I remember my first book on acting defining a modern-day magician as "an actor who plays the role of a magician". The modern-day teacher is, in similar vein, an employee assigned the role of a teacher. Teaching is their profession, not passion. Not that they are uninterested, quite the opposite. But oh, it could be so much better!

I read a speech by John Taylor Gatto titled "The Six-Lesson Schoolteacher". He gave this speech on being awarded the New York State Teacher of the Year award in 1991. He teaches six lessons at school, he says.

The first lesson I teach is: "Stay in the class where you belong." I don’t know who decides that my kids belong there but that’s not my business.

The second lesson I teach kids is to turn on and off like a light switch. I demand that they become totally involved in my lessons… But when the bell rings I insist that they drop the work at once and proceed quickly to the next work station. Nothing important is ever finished in my class, nor in any other class I know of.

The third lesson I teach you is to surrender your will to a predestined chain of command… As a schoolteacher I intervene in many personal decisions, issuing a Pass for those I deem legitimate, or initiating a disciplinary confrontation for behavior that threatens my control.

The fourth lesson I teach is that only I determine what curriculum you will study…. Of the millions of things of value to learn, I decide what few we have time for. Curiosity has no important place in my work, only conformity.

In lesson five I teach that your self-respect should depend on an observer’s measure of your worth… A monthly report, impressive in its precision, is sent into students’ homes to spread approval or to mark exactly — down to a single percentage point — how dissatisfied with their children parents should be.

In lesson six I teach children that they are being watched. I keep each student under constant surveillance and so do my colleagues… Students are encouraged to tattle on each other, even to tattle on their parents. Of course I encourage parents to file their own child’s waywardness, too.

I smiled a bit when I read this. It had been a while since I’d been in school, and I was lucky to have been in very liberal colleges. But then I went back to school and saw it for myself. The organisation that comes closest to the school is the military… or the prison. Not exactly the best place to foster creativity.

I began my class this time by saying, "Look, I might be wrong in what I tell you. Usually, it’s not deliberate. Quite often, I simply may not know. Or I may mis-communicate. When in doubt, Google and Wikipedia. Let me repeat: this is the single most important thing that I can tell you. When in doubt, Google and Wikipedia."

At the end of the class, a few came over and said, "But how do we do that? Our teachers are asking us not to waste time on the Internet, and to stay away from Wikipedia!"

Sir Ken Robinson gave a TED Talk on Do Schools Kill Creativity? Do watch it. Apart from being one of the funniest 20-minute talks ever, it drives home a strong message. Schools aren’t quite organised to foster creativity. When they were created, that wasn’t the intent.

Teaching as a profession, I imagine, does not pay as much as many others. So there’s little incentive for practitioners to enter the field. I can therefore understand and appreciate that it takes a long time for new knowledge to enter the curriculum. But what is also sad is the way the curriculum is treated. It isn’t treated, as Gatto says, as a choice among the millions of things of value to learn. It is treated as a Bible that defines knowledge.

It is easy for teachers to fall into the trap. If it contradicts the curriculum, it is wrong. If it is not in the curriculum, it is irrelevant. Since I know the curriculum inside out, I know all that is required to know. It’s not that I refuse to learn. Just that there is nothing more to learn that is relevant.

As an institution, schools aren’t going away any time soon. Nor perhaps should they. But in the interest of knowledge and creativity, I can only hope for two things.

  1. Students: keep learning what you like outside of school. It may be your only hope.
  2. Everyone else: drop by your old school or a nearby school, and offer to teach one class on any subject you have a passion for. You’d be surprised at how well you’ll be received, how much you know, and how much you can learn from that interaction.

The hunt for a Twitter client

I hadn’t jumped on to the Twitter bandwagon for a while. I’m not much of a conversationalist, nor am I very sociable. I also tend to stay away from social networks. But I figured I would try Twitter out for a while, mostly because it’s an outlet for short comments. For long articles, I have my blog. For sharing links, I have Google Reader and del.icio.us. I don’t quite have anything for that occasional moment when I want to say, "Hey! A great way to shred mint leaves is to freeze them!"

The question is what client to use. I wanted something free, portable and featherweight (as in lighter than lightweight: no additional memory usage.)

SMS is the classic Twitter channel. But I don’t like being bothered by SMS messages often. Besides, it’s not free. So that’s out.

The next best would be e-mail via my BlackBerry. The problem is, Twitter doesn’t accept tweets via e-mail. So when looking for alternatives, I found Identi.ca, which is even better than Twitter except for the fact that it doesn’t have Twitter’s user base. Anyway, it accepted e-mail, so that was fine.

On the desktop, the browser is the obvious choice. But somehow, going to the Twitter home page and typing out a tweet felt so… Web 1.0. I didn’t fancy installing a client like Twhirl just for tweeting. The closest alternative was instant messenger software. Since Identi.ca accepts messages via XMPP, I could install Google Talk and send messages via instant messenger.

That worked for a couple of weeks. Then I pulled out. Instant messenger has the disadvantage of making you accessible, and I honestly don’t have the time. Plus, I don’t fancy running apps persistently, not even something as light as Google Talk. So back to square one.

In the meantime, I was having another problem with sending updates via BlackBerry. My corporate mails have a HUGE disclaimer attached to them. It doesn’t make sense to have a 140-character message followed by a 940-character disclaimer. I’d have to get rid of the disclaimer anyway.

After a bit of digging around, I came across mail handlers. I can write a program on my server to handle mails. So I wrote one that strips out the disclaimer and forwards it to my identi.ca e-mail ID. (Now I’ve modified it to use the API.) So that solves my mobile twittering problem.
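The disclaimer-stripping step might look something like the sketch below. It is only an illustration, not the handler that runs on my server: the function name stripDisclaimer and the marker text are made up for the example, and it assumes the disclaimer always begins with the same known phrase.

```javascript
// Hypothetical marker: assumes every disclaimer starts with exactly this text.
var DISCLAIMER_MARKER = "This e-mail is confidential";

// Return the part of the mail body that comes before the disclaimer.
function stripDisclaimer(body, marker) {
    var at = body.indexOf(marker);
    // If the marker is missing, keep the whole body rather than guess.
    var message = at === -1 ? body : body.slice(0, at);
    // A status update is a single line, so collapse line breaks and extra spaces.
    return message.replace(/\s+/g, " ").trim();
}

var mail = "A great way to shred mint leaves is to freeze them!\n\n" +
           "This e-mail is confidential and intended solely for the addressee...";
var tweet = stripDisclaimer(mail, DISCLAIMER_MARKER);
```

Here tweet holds just the first sentence, ready to be forwarded or posted through the API.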

It also solves my cross-posting problem. I maintain a twitter.com/sanand0 and an identi.ca/sanand0 account and keep them updated in parallel. My mail handler updates the post on both services.

As for the desktop, I have the best solution of all. I use the browser address bar to twitter. I’ve created a keyword search with the keyword "twitter", which is keyed to a URL like http://www.s-anand.net/twitter/%s. So if I say "twitter Some message" on the address bar and press enter, it contacts the server, which updates Identi.ca and Twitter using the API.

Of course, you don’t really need to do that to update Twitter. Just create a keyword search with a keyword "twitter" and a URL http://twitter.com/home?status=%s, and you’re done. Remember: you can create keyword searches in Internet Explorer as well (read how). With this, you can update twitter from the address bar by just typing "twitter your message goes here".
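Roughly speaking, a keyword search replaces the %s in the URL with whatever you typed after the keyword, escaped for use in a URL. The sketch below mimics that step. The function name expandKeywordSearch is made up, and real browsers may escape some characters differently (for instance, a + instead of %20 for spaces), so treat it as an approximation.

```javascript
// Hypothetical helper: mimic what the browser does with a keyword search URL.
function expandKeywordSearch(template, text) {
    // encodeURIComponent escapes spaces, commas and so on for use in a URL.
    return template.replace("%s", encodeURIComponent(text));
}

var url = expandKeywordSearch("http://twitter.com/home?status=%s", "Hello, world");
// url is "http://twitter.com/home?status=Hello%2C%20world"
```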


Anyway, that was a long-winded way of saying just two things.

  1. Mail handlers are cool.
  2. Keyword searches let you update Twitter from the address bar using the URL http://twitter.com/home?status=%s

Bound methods in Javascript

The popular way to create a class in Javascript is to define a function and add methods to its prototype. For example, let’s create a class Node that has a method hide().

var Node = function(id) {
    this.element = document.getElementById(id);
};
Node.prototype.hide = function() {
    // Hide the wrapped element; the Node object itself has no style property.
    this.element.style.display = "none";
};

If you had a header, say <h1 id="header">Heading</h1>, then this piece of code will hide the element.

var node = new Node("header");
node.hide();

If I want to hide the element a second later, I’m tempted to use:

var node = new Node("header");
setTimeout(node.hide, 1000);

… except that it won’t work. setTimeout has no idea that the function node.hide has anything to do with the object node. It just runs the function. When node.hide() is called by setTimeout, the this object isn’t set to node, it’s set to window. node.hide() ends up trying to hide window, not node.
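You can see the same effect without a page or a timer. The sketch below uses a made-up stand-in class, Item, with a flag in place of the DOM element, and calls the method once through the object and once as a bare function, which is how setTimeout would call it.

```javascript
// Stand-in for the Node class above: no DOM, just a flag to flip.
var Item = function(name) {
    this.name = name;
    this.hidden = false;
};
Item.prototype.hide = function() {
    // `this` is decided by how the function is called, not where it was defined.
    if (this instanceof Item) {
        this.hidden = true;
        return "hid " + this.name;
    }
    return "this is not an Item";
};

var item = new Item("header");
var detached = item.hide;       // what setTimeout receives: just the function

var viaObject = item.hide();    // called as a method, so this is item
item.hidden = false;            // reset before the second attempt
var viaDetached = detached();   // called bare, so this is the global object
                                // (or undefined in strict mode), never item
```

The first call hides the item. The second leaves item.hidden untouched, because the function no longer knows which object it came from.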

The standard way around this is:

var node = new Node("header");
setTimeout(function() { node.hide(); }, 1000);

I’ve been using this for a while, but it gets tiring. It’s so easy to forget to do this. Worse, it doesn’t work very well inside a loop:

var ids = ["a", "b", "c"];
for (var i = 0; i < ids.length; i++) {
    var node = new Node(ids[i]);
    setTimeout(function() { node.hide(); }, 1000);
}

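This actually hides node "c" thrice, and doesn’t touch nodes "a" and "b". A var is scoped to the enclosing function, not to the loop body, so all three callbacks share a single node variable, and by the time they run it holds the last node. You’ve got to remember to wrap the body of any loop that creates a function in a function of its own.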

var ids = ["a", "b", "c"];
for (var i = 0; i < ids.length; i++) {
    (function() {
        // Each call of the wrapper gets its own node variable.
        var node = new Node(ids[i]);
        setTimeout(function() { node.hide(); }, 1000);
    })();
}
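The difference between the plain loop and the wrapped loop can be checked without a page or timers. In this sketch the callbacks are collected in arrays and called after the loops finish, which is roughly what happens when the timers fire. The names shared and separate are made up for the example.

```javascript
var ids = ["a", "b", "c"];

// Plain loop: every callback closes over the same `id` variable.
var shared = [];
for (var i = 0; i < ids.length; i++) {
    var id = ids[i];
    shared.push(function() { return id; });
}

// Wrapped loop: each call of the wrapper has its own `id` parameter.
var separate = [];
for (var j = 0; j < ids.length; j++) {
    (function(id) {
        separate.push(function() { return id; });
    })(ids[j]);
}

// Run the callbacks only after both loops are done, as a timer would.
var sharedResult = shared.map(function(f) { return f(); });      // ["c", "c", "c"]
var separateResult = separate.map(function(f) { return f(); });  // ["a", "b", "c"]
```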

Now, compare that with this:

var ids = ["a", "b", "c"];
for (var i = 0; i < ids.length; i++) {
    setTimeout((new Node(ids[i])).hide, 1000);
}

Wouldn’t something this compact be nice?

To do this, the method node.hide must be bound to the object node. That is, node.hide must know that it belongs to node. And when we call another_node.hide, it must know that it belongs to another_node. This, incidentally, is the way many other languages behave. For example, in Python, try the following:

>>> class Node:
...     def hide(self):
...         pass
...
>>> node = Node()
>>> node
<__main__.Node instance at 0x00BA32D8>
>>> node.hide
<bound method Node.hide of <__main__.Node instance at 0x00BA32D8>>

The method hide is bound to the object node.

To do this in Javascript, instead of adding the methods to the prototype, you need to do two things:

  1. Add the methods to the object in the constructor
  2. Don’t use this. Set another variable called that to this, and use that instead.

var Node = function(id) {
    var that = this;
    that.element = document.getElementById(id);
    that.hide = function() {
        that.element.style.display = "none";
    };
};

Now node.hide is bound to node. The following code will work.

var node = new Node("header");
setTimeout(node.hide, 1000);
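The same check as before can be run on this pattern without a page. The sketch below again uses a made-up stand-in, Item, with a flag in place of the DOM element. It creates the objects in a loop, keeps only the detached hide functions, and calls them bare, the way a timer would.

```javascript
// Stand-in for the constructor above: a plain flag replaces the DOM element.
var Item = function(name) {
    var that = this;
    that.name = name;
    that.hidden = false;
    that.hide = function() {
        // Uses the closed-over `that`, so the caller's `this` is irrelevant.
        that.hidden = true;
    };
};

var names = ["a", "b", "c"];
var items = [];
var callbacks = [];
for (var i = 0; i < names.length; i++) {
    var item = new Item(names[i]);
    items.push(item);
    callbacks.push(item.hide);   // detached, exactly as setTimeout would receive it
}

// Call each function bare, with no object in front of it.
for (var j = 0; j < callbacks.length; j++) {
    var hide = callbacks[j];
    hide();
}

// Collect the names of the items that ended up hidden.
var hiddenNames = items
    .filter(function(it) { return it.hidden; })
    .map(function(it) { return it.name; });
```

All three items end up hidden, with no wrapper functions in sight. One trade-off worth knowing: every object now carries its own copy of hide instead of sharing one through the prototype, which costs a little memory per object.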

I’ve taken to using this pattern almost exclusively these days, rather than prototype-based methods. It saves a lot of trouble, and I find it makes the code a lot more compact and easier to read.