How I do things

Ubuntu 8.10 on a Dell Latitude D420

Here’s the fastest way I’ve found to install Ubuntu on a USB flash drive, for my Dell Latitude D420. (Pendrivelinux.com is a great resource for this sort of thing.)

Ingredients

  1. One large USB flash drive like this one. Not less than 4GB. I’d suggest 8GB or more
  2. One CD (not a DVD)
  3. Ubuntu 8.10 desktop CD ISO
  4. IMGBurn or any other CD burning software
  5. Direct Internet via LAN cable (without proxy, without wireless)

Installation

  1. Burn the Ubuntu ISO file on the CD
  2. Press F12 when the laptop boots up, and select CD/DVD Drive as the boot device
  3. On the Ubuntu splash screen, select "Try Ubuntu without making any change to your computer" and wait
  4. Insert the flash drive
  5. Go to System > Administration > Create a USB startup disk and follow instructions there
  6. Once done, remove the CD and reboot using the USB flash drive (pressing F12 during the boot sequence)

To enable wireless, which won’t work by default

  1. Connect to the Internet using a LAN cable
  2. Go to System > Administration > Hardware Drivers
  3. Select the Broadcom LAN driver, and activate it

That’s it. It’s been a fairly painless installation.

I do have one big crib. I planned to use Hibernation (or suspend-to-disk on Ubuntu) to switch between Windows and Ubuntu. But there are a couple of problems:

  • Hibernate doesn’t work on Ubuntu. I need to reboot Ubuntu every time, and that takes 3 minutes
  • When Windows is hibernating, Ubuntu can’t access any files on the hard disk

This means switching between Ubuntu and Windows is roughly a 6-minute shutdown-one-OS-reboot-the-other process rather than the 1-minute hibernate-one-OS-resume-the-other that I had hoped for.

Another minor problem I have is that our Exchange server doesn’t seem to have an IMAP interface, at least that I know of. So I can’t check mail. But like I said, it’s minor. I just forward mails from my BlackBerry to GMail.

On teaching

This vacation, I took a session each for class XI and XII at my school, Vidya Mandir. The subject was Computer Science (the only one I can teach with some confidence), and the topic was networks.

It was an experiment, in two parts. The first was to understand how students of this generation interact with the Internet. (I’m twice as old as them, so I guess they qualify as the next generation.) The second was to see whether I’d leave them far behind, or they’d leave me far behind.

I began the class with a series of questions.

How many of you have… (expected vs. actual)

Access to a PC and the Internet (home or nearby). Expected: 80%. Actual: 100%.
I was expecting ~80%. Every single one of them raised their hands. Every single one.

Chatted online. Expected: 70%. Actual: 100%.
I was expecting ~70%. Every single one, except for one girl, raised their hands.

Used a bluetooth device. Expected: 60%. Actual: 100%.
I was expecting around 60%. I got nearly everyone, but the rest were wondering what that was.

Video-chatted. Expected: 50%. Actual: 80%.
I expected ~50%. Got ~80%.

Uploaded a photo or video. Expected: 40%. Actual: 80%.
Again, far more than expected.

Own a blog or website. Expected: 30%. Actual: 5%.
This is where the surprises started. I thought that at least one in 3 would have a blog. Turns out I was wrong. There were very few.

Written a web application. Expected: 10%. Actual: 0%.
Not one soul. Some thought they had, but no…

Contributed to an open source project. Expected: 1 or 2. Actual: 0%.
None at all.

It was an eye-opener. On the one hand, everyone has an Internet connection. (In fact, the announcements following the morning prayer began with the Principal warning about the dangers of chatting with strangers online.) On the other hand, they’re doing little of the cool stuff.

Some of the discussions I had after class did lessen my concern a bit. There are, as always, a few who are very interested in hacking, and are playing around with a lot of interesting things. But still, on average…

As for the other part of the experiment, I spent an hour talking about what goes on behind the scenes when they search on Google, taking them down to some of the elements of HTTP. My slides are below. I do suspect I left a fair number of them behind, but there were a handful that were with me right up to the end.

Computer Networks: An Introduction


But I learned something that I did not expect. I spent a lot of time in the staff room, talking with the teachers. The best way I can summarise what I learnt is through this Calvin and Hobbes strip.

Somehow, I thought the bulk of the discussion in the staff room would centre around students. Or, at the very least, around education. It was eye-opening to listen to a two-hour-long argument on the political reasons behind the tea in the primary school staff room being better than in the high school’s.

I remember my first book on acting defining a modern-day magician as "an actor who plays the role of a magician". The modern-day teacher is, in a similar vein, an employee assigned the role of a teacher. Teaching is their profession, not their passion. Not that they are uninterested, quite the opposite. But oh, it could be so much better!

I read a speech by John Taylor Gatto titled "The Six-Lesson Schoolteacher". He gave this speech on being awarded the New York State Teacher of the Year award in 1991. He teaches six lessons at school, he says.

The first lesson I teach is: "Stay in the class where you belong." I don’t know who decides that my kids belong there but that’s not my business.

The second lesson I teach kids is to turn on and off like a light switch. I demand that they become totally involved in my lessons… But when the bell rings I insist that they drop the work at once and proceed quickly to the next work station. Nothing important is ever finished in my class, nor in any other class I know of.

The third lesson I teach you is to surrender your will to a predestined chain of command… As a schoolteacher I intervene in many personal decisions, issuing a Pass for those I deem legitimate, or initiating a disciplinary confrontation for behavior that threatens my control.

The fourth lesson I teach is that only I determine what curriculum you will study…. Of the millions of things of value to learn, I decide what few we have time for. Curiosity has no important place in my work, only conformity.

In lesson five I teach that your self-respect should depend on an observer’s measure of your worth… A monthly report, impressive in its precision, is sent into students’ homes to spread approval or to mark exactly — down to a single percentage point — how dissatisfied with their children parents should be.

In lesson six I teach children that they are being watched. I keep each student under constant surveillance and so do my colleagues… Students are encouraged to tattle on each other, even to tattle on their parents. Of course I encourage parents to file their own child’s waywardness, too.

I smiled a bit when I read this. It had been a while since I’d been in school, and I was lucky to have been in very liberal colleges. But then I went back to school and saw it for myself. The organisation that comes closest to the school is the military… or the prison. Not exactly the best place to foster creativity.

I began my class this time by saying, "Look, I might be wrong in what I tell you. Usually, it’s not deliberate. Quite often, I simply may not know. Or I may mis-communicate. When in doubt, Google and Wikipedia. Let me repeat: this is the single most important thing that I can tell you. When in doubt, Google and Wikipedia."

At the end of the class, a few came over and said, "But how do we do that? Our teachers are asking us not to waste time on the Internet, and to stay away from Wikipedia!"

Sir Ken Robinson gave a TED Talk on Do Schools Kill Creativity? Do watch it. Apart from being one of the funniest 20-minute talks ever, it drives home a strong message. Schools aren’t quite organised to foster creativity. When they were created, that wasn’t the intent.

Teaching as a profession, I imagine, does not pay as much as many others. So there’s little incentive for practitioners to enter the field. I can therefore understand and appreciate that it takes a long time for new knowledge to enter the curriculum. What is sad, though, is the way the curriculum is treated. It isn’t treated, as Gatto says, as a choice among the millions of things of value to learn. It is treated as a Bible that defines knowledge.

It is easy for teachers to fall into the trap. If it contradicts the curriculum, it is wrong. If it is not in the curriculum, it is irrelevant. Since I know the curriculum inside out, I know all that is required to know. It’s not that I refuse to learn. Just that there is nothing more to learn that is relevant.

As an institution, schools aren’t going away any time soon. Nor perhaps should they. But in the interest of knowledge and creativity, I can only hope for two things.

  1. Students: keep learning what you like outside of school. It may be your only hope.
  2. Everyone else: drop by your old school or a nearby school, and offer to teach one class on any subject you have a passion for. You’d be surprised at how well you’ll be received, how much you know, and how much you can learn from that interaction.

The hunt for a Twitter client

I hadn’t jumped on to the Twitter bandwagon for a while. I’m not much of a conversationalist, nor am I very sociable. I also tend to stay away from social networks. But I figured I would try Twitter out for a while, mostly because it’s an outlet for short comments. For long articles, I have my blog. For sharing links, I have Google Reader and del.icio.us. I don’t quite have anything for that occasional moment when I want to say, "Hey! A great way to shred mint leaves is to freeze them!"

The question is what client to use. I wanted something free, portable and featherweight (as in lighter than lightweight: no additional memory usage.)

SMS is the classic Twitter channel. But I don’t like being bothered by SMS messages often. Besides, it’s not free. So that’s out.

The next best would be e-mail via my BlackBerry. The problem is, Twitter doesn’t accept tweets via e-mail. So when looking for alternatives, I found Identi.ca, which is even better than Twitter except for the fact that it doesn’t have Twitter’s user base. Anyway, it accepted e-mail, so that was fine.

On the desktop, the browser is the obvious choice. But somehow, going to the Twitter home page and typing out a tweet felt so… Web 1.0. I didn’t fancy installing a client like Twhirl just for tweeting. The closest was instant messenger software. Since Identi.ca accepts messages via XMPP, I could install Google Talk and send messages via instant messenger.

That worked for a couple of weeks. Then I pulled out. Instant messenger has the disadvantage of making you accessible, and I honestly don’t have the time. Plus, I don’t fancy running apps persistently, not even something as light as Google Talk. So back to square one.

In the meantime, I was having another problem with sending updates via BlackBerry. My corporate mails have a HUGE disclaimer attached to them. It doesn’t make sense to have a 140-character message followed by a 940-character disclaimer. I’d have to get rid of those anyway.

After a bit of digging around, I came across mail handlers. I can write a program on my server to handle mails. So I wrote one that strips out the disclaimer and forwards it to my identi.ca e-mail ID. (Now I’ve modified it to use the API.) So that solves my mobile twittering problem.

It also solves my cross-posting problem. I maintain a twitter.com/sanand0 and an identi.ca/sanand0 account and keep them updated in parallel. My mail handler updates the post on both services.
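
For the curious, here is a minimal sketch of what the disclaimer-stripping part of such a mail handler could look like. It is not my actual script: the disclaimer marker text and the function name are assumptions made up for illustration, and the step that posts the result to Identi.ca and Twitter is left out.

```python
import email
from email import policy

# Hypothetical marker: assumes the corporate disclaimer always starts with this text.
DISCLAIMER_MARKER = "This e-mail and any attachments"
MAX_LENGTH = 140  # the message length limit mentioned above


def extract_status(raw_message: str) -> str:
    """Return the plain-text body of a mail with the trailing disclaimer removed."""
    message = email.message_from_string(raw_message, policy=policy.default)
    body = message.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    # Keep only what comes before the disclaimer, if it is present.
    text = text.split(DISCLAIMER_MARKER, 1)[0]
    # Collapse whitespace so stray line breaks don't eat into the limit.
    text = " ".join(text.split())
    return text[:MAX_LENGTH]
```

The server would pipe each incoming mail into something like this and pass the returned text on to the posting step.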

As for the desktop, I have the best solution of all. I use the browser address bar to twitter. I’ve created a keyword search with the keyword "twitter" which is keyed to a URL like http://www.s-anand.net/twitter/%s. So if I say "twitter Some message" on the address bar and press enter, it contacts the server, which updates Identi.ca and Twitter using the API.

Of course, you don’t really need to do that to update Twitter. Just create a keyword search with a keyword "twitter" and a URL http://twitter.com/home?status=%s, and you’re done. Remember: you can create keyword searches in Internet Explorer as well (read how). With this, you can update twitter from the address bar by just typing "twitter your message goes here".


Anyway, that was a long-winded way of saying just two things.

  1. Mail handlers are cool.
  2. Keyword searches let you update Twitter from the address bar using the URL http://twitter.com/home?status=%s
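
If you are wondering what the browser does with that %s, the sketch below shows the idea: the text typed after the keyword is URL-encoded and substituted into the template. The function name is my own invention for illustration, and real browsers may encode some characters slightly differently.

```python
from urllib.parse import quote


def expand_keyword_search(template: str, typed_text: str) -> str:
    """Substitute the typed text for %s in a keyword search URL, URL-encoding it first."""
    return template.replace("%s", quote(typed_text, safe=""))


print(expand_keyword_search("http://twitter.com/home?status=%s", "Mail handlers are cool"))
# prints http://twitter.com/home?status=Mail%20handlers%20are%20cool
```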

Dilbert search statistics

It’s been three weeks since I initiated the effort to type in the Dilbert strips and the results are encouraging. About 2 years’ worth of strips have been typed out. So this Dilbert viewer now has a reasonably sized index for searching.

Many thanks are in order here. The first is due to geek.nl, whose images I have taken the liberty of hotlinking. Thanks also to those who’ve taken the time out to type strips:

  • granger95
  • bthangaraj
  • gdibyo
  • adrienbernard
  • sundar.ramakrishnan
  • pistohl
  • waywardone
  • balamurugan.cse
  • sruppenthal

… and several others.

When I initially planned to share the typing of the Dilbert strips, I anticipated that I would probably type in the most, and almost no one would pitch in. While I still have typed in the most, the contributions of the above have been of great help in more than the obvious way. When I typed out 10 years of Calvin & Hobbes, it took me 5 years. This is 2 years of Dilbert in 3 weeks. If nothing else, it’s pushing me to work harder on this. So thanks again for the motivation.

Here’s my request again to all you Dilbert fans.

  1. Please go to dilbert-search.appspot.com
  2. Log in using your Google account and type in as many strips as you like
  3. Bookmark it for the future, whenever you’re bored

Here’s a Wordle cloud of all the strips typed out so far (with Dilbert and Pointy Haired Boss removed.)

Dilbert

(Seeing that there’s more "good" than "bad", and more "like" than "dislike" or "hate", you might even call Dilbert an optimistic strip.)
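
A comparison like that boils down to counting words. Here is a rough sketch of how one might do the counting, assuming the typed strips are available as a list of strings. The function name and the ignore list are assumptions for illustration; this is not how Wordle itself works.

```python
import re
from collections import Counter


def word_counts(transcripts, ignore=()):
    """Count how often each word appears across transcripts, skipping ignored words."""
    ignored = {word.lower() for word in ignore}
    counts = Counter()
    for text in transcripts:
        # Lower-case everything and treat runs of letters/apostrophes as words.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in ignored:
                counts[word] += 1
    return counts


strips = ["Dilbert: That is a good idea.", "Boss: Good. I like it."]
counts = word_counts(strips, ignore=("dilbert", "boss"))
print(counts["good"], counts["like"], counts["hate"])
```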

Recording online songs

In the 1980s, we rarely used to buy audio cassettes. It was a lot cheaper to record songs from the radio. It’s amazing that in the 2000s, this technique seems to be less used than before.

If you wanted to record a song that was streamed online, you could go through the complex procedures I’d mentioned earlier to download online songs, or you could use the 1980s technologies. Get a tape recorder, connect the headphones of your PC to the tape recorder’s microphone using a stereo cable, and record to your heart’s content.

Except, of course, that tape recorders are rather outdated. And with the right software, your PC can act like a tape recorder. Here’s how you can go about it.

  1. Download Audacity and install it
  2. Download Lame and save it
  3. Open Audacity and select "Wave Out" as the source
  4. Play a song online and click on the Record button. Press the Stop button when done
  5. File – Export as MP3. (The first time, you need to tell Audacity where you’ve saved Lame)

That’s it. You can convert anything your computer plays into an MP3 file. (The general rule in digital media is: if you can see / hear it, you can copy it.)

OK, let’s do this more slowly.

1. Download Audacity and install it.

Audacity is a program that lets you record and edit music. Just visit the link above (or search on Google for "Download Audacity") and install the program. This is what it looks like.

Audacity

2. Download Lame and save it

When you record something with Audacity, you’ll usually want to save it as an MP3 file. Lame is another piece of software that lets you do that. Go to the link above, download the ZIP file, and unzip it in some folder. (Remember where you unzipped it.)

3. Open Audacity and select "Wave Out" as the source

You can choose which source to record from in Audacity. Do you see the "Line In" in the screenshot below? That’s the source from which Audacity will record sound. Usually, your PC will have a "Microphone" socket, and may have a "Line in" socket. It may also have a built-in microphone. Depending on what sockets and capabilities your PC has, you may see different things.

Audacity source

One of these sources will probably be "Wave Out". That lets you record any sound played by your computer. So if you want to record a song your computer’s playing, that’s what you should choose.

Not all sound cards have the "Wave Out" option, though. Many laptops that I have used don’t seem to have this option. If that’s the case with you, there’s a fairly simple solution. Just buy a stereo-to-stereo cable (shown below) and connect your headphone socket to your microphone socket.

Stereo to stereo cable

This transfers everything your computer plays back into the microphone, and you can select "External Mic" as your source.

Buying this stereo cable has another advantage. Rather than connect one end to your computer’s headphones, you can connect it to anything: your old cassette player, your radio, a microphone, whatever. So that means you can now:

  • Convert your old tapes to MP3
  • Record songs on the radio as MP3
  • Record songs from the TV / DVD player as MP3
  • Record live conversations as MP3
  • Record phone conversations as MP3
  • etc.

4. Play a song online and click on the Record button. Press the Stop button when done

That’s easy. The Record button is the red circular button that’s third from the left. The Stop button is the yellow square button that’s second from the right.

5. File – Export as MP3

When you’ve stopped recording, you can actually do a bunch of useful things with Audacity.

The first is to adjust the volume level. Go to the Effect menu and select Amplify. Then you can try different amplification levels to see how it sounds.

The next is to trim the audio. Unless you’re really fast with the keyboard, you probably have some unwanted sound recorded at the beginning or the end. You can select these pieces by dragging the mouse over the wiggly blue lines, and go to the Edit menu and pick Delete.

Lastly, you’ll want to set the sound quality. Go to Edit – Preferences and under the File Formats tab, set the bit rate under the MP3 Export Setup section. (If you don’t know what rate to put in there, 128 is a safe number. If you want better quality, increase it. If you’re short of disk space or want to mail it to someone, decrease it. Based on my experiments, even a good ear can’t tell the difference at 128. I use 64 or 96. My ear is pretty bad.)
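
To get a feel for what the bit rate means for disk space, here is a small back-of-the-envelope sketch. It assumes a constant bit rate and ignores tags and headers, so treat the results as approximations; the function name is just for illustration.

```python
def mp3_size_mb(bitrate_kbps: int, duration_seconds: float) -> float:
    """Approximate size of a constant-bit-rate MP3 in megabytes (1 MB = 1,000,000 bytes)."""
    bits = bitrate_kbps * 1000 * duration_seconds
    return bits / 8 / 1_000_000


# A 4-minute (240-second) song at the bit rates mentioned above.
for rate in (64, 96, 128):
    print(rate, "kbps:", round(mp3_size_mb(rate, 240), 2), "MB")
```

So a 4-minute song comes to roughly 1.92 MB at 64, 2.88 MB at 96 and 3.84 MB at 128.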

All of the above was optional. If you just wanted to save the file, go to the File menu and select "Export as MP3". The first time you do that, you’ll be asked to mention the folder where you saved lame_enc.dll (which is where you unzipped Lame.) Show Audacity the folder, and that’s it.

Dilbert search engine

Wouldn’t it be cool to be able to search through the Dilbert archives using text?

This used to be possible at Dilbert.com some years ago, as a paid service. In late 2003, I needed to find some Dilbert strips for a client, so I’d subscribed for a year. I could then search for the quotes (I happened to be looking for "outsourcing", so you can guess the context).

But I can’t seem to find the feature any more, even as a paid service. The site looks a lot better, of course. But I can’t find strips.

Well, why not type them out? After all, I’d done that with Calvin and Hobbes.

This would be a much larger exercise, though. And I’m hoping to take your help. I’ve set up a site at dilbert-search.appspot.com. You can type in a comic randomly, starting from 2000. These will be made searchable on my Dilbert page. You can export the data and use it yourself, of course.

When typing in Calvin and Hobbes, I did have a few volunteers willing to pitch in, but collaboration tools weren’t easy to set up, and I ended up typing the whole thing myself. This time, I’d be delighted if even 10 people typed in just a strip each.


So, here’s my request, to all you Dilbert fans.

  1. Please go to dilbert-search.appspot.com
  2. Log in using your Google account and type in as many strips as you like
  3. Bookmark it for the future, whenever you’re bored

As I said, the data is readily exportable from the page, so if you’re looking to do cool mash-ups with it, great! And if you want the data exported in other formats, please let me know.

Incidentally, I created the site using Google AppEngine. The source code is at dilbert-search.googlecode.com.

Gadgets

Some gadgets I’ve bought / got over the last few years.


SDHC Card Reader on 17 March 2009

16GB USB Flash Drive on 8 Jan 2009

16GB SD Card on 14 March 2009

USB MIDI cable on 30 Dec 2008

Creative Labs EP-630/A Earphones on 30 Dec 2008

Recta Micro Compass Accessory on 30 Dec 2008

Strand iPod Cassette adapter on 30 Dec 2008

Keysonic Compact Notebook Layout Wireless 2.4Ghz Radio Frequency Keyboard With Integrated Touch Pad on 6 Sep 2008

TomTom ONE v3 Great Britain on 31 Aug 2008

BlackBerry Curve 8320 on 10 Aug 2008

Acer Aspire 5715Z Notebook Laptop, Intel Pentium Dual Core T2330 1.6GHz, 15.4″ TFT, 2GB RAM, 80GB Hard-drive, DVD±RW, Intel Graphics Media Accelerator X3100, WiFi, Vista Home Premium on 29 Jul 2008

BlueNEXT BN-909 GPS Receiver SiRF Star III on 6 Jul 2008

TDK Recordable Blank 16x DVD+R Discs 25pack Cakebox on 6 Jul 2008

Sandisk MicroSDHC 4GB Card on 6 Jul 2008

HTC S620 (Excalibur) on 1 Jul 2008

Masterplug 4 Gang Switched Extension Lead 2m 13 Amp Fused on 31 Mar 2008

TRUST HU-4440P 4 PORT USB2 MINI HUB on 31 Mar 2008

Hama Compact USB 2.0 Hub 1:4 on 31 Mar 2008

Nintendo Wii Remote on 1 Mar 2008

Sandisk 2GB Secure Digital Card on 29 Feb 2008

Canon IXUS 70 Digital Camera – Silver (7.1MP, 3x Optical Zoom) 2.5″ LCD on 29 Feb 2008

Verbatim DVD+R 25Pk 16x Spindle on 15 Feb 2008

Western Digital My Book Essential 500GB External USB 2.0 Hard Drive on 15 Feb 2008

LUPO DIGITAL TV DVB-T USB ADAPTER/DONGLE/STICK FREEVIEW RECEIVER & AERIAL FOR PC AND LAPTOP on 6 Jan 2008

Bontempi Keyboard – 61 Full Key GM, Midi, Stereo (AD177.12) on 4 Jan 2008

Mini-Headphone Splitter (Stereo) on 1 Jan 2008

Microsoft LifeCam VX-6000 on 26 Dec 2007

Kenwood FP580 Food Processor 2 Speed White on 26 Dec 2007

SanDisk Sansa m240 1Gb MP3 Player on 24 Jul 2007

Sennheiser CX300 Eco Ear Canal Headphones Black on 24 Jul 2007

Uniross AAA 1000mAh (4)Rechargeable Battery Ni-Mh on 24 Jul 2007

Logitech EX110 Wireless Desktop on 5 Mar 2007

Cordless Skype Phone Kit NON VISTA on 5 Mar 2007

LG 42PC1D 42″ Plasma TV on 20 Jan 2007

Sony Ericsson Standard Travel Charger (UK) CST-13 on 12 Jan 2007

Uniross Sprint 1 Hour Battery Charger inc 4 x AA 2700 mAh Rechargeable Batteries – batteries upgraded from 2500 mAh on 11 Jan 2007

Fuji FinePix S5600 Zoom Digital Camera [5.0MP,10xOptical Zoom] on 17 Nov 2006

Fuji 1GB XD Type M Picture Card on 17 Nov 2006

DIGIHOME DVB915 FREEVIEW Digital Terrestrial Receiver with SCART Lead on 17 Nov 2006

Sony 80min/700MB Thermo printable CD-R spin 50pk on 28 Jul 2006

CyberHome DVD 401/0 Multi-region Capable DVD Player with DIVX on 28 Jul 2006

Rivision DVD+R 8x 4.7Gb 100 Cake Box on 15 Jun 2006

Dynamode 3.5″ IDE Interface Disk Enclosure on 1 May 2006

SANDISK CRUZER MICRO 1GB on 3 Jan 2006

BenQ 16x External Dual Format, Double Layer DVD Writer – EW162I, Beige on 3 Jan 2006

Emtec DVD+R Cake Box 100pk on 3 Jan 2006

Panasonic NV-GS17B MiniDV Digital Camcorder [24x Optical, 2.5″ LCD, DV out] on 9 Nov 2005

Mobile browsing

When I analysed my HTTP log last week, I had another motive: are there enough people accessing my site on a mobile device? Or is it too small at this stage for me to care about?

Well, have a look at the numbers.

Windows 98.4%
Mobile 0.6%
Linux 0.5%
OS X 0.5%

Yes, there are more people accessing my site through a mobile device than there are using Linux or OS X. That’s shocking!
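
In case you want to try this on your own logs, here is a rough sketch of the kind of classification involved. It assumes you have already pulled the user-agent strings out of the log. The marker substrings are my own guesses at what identifies each platform, not an exhaustive or authoritative list, and this is not the script I actually used.

```python
from collections import Counter

# Assumed substrings that suggest a mobile browser; extend as needed.
MOBILE_MARKERS = ("blackberry", "windows ce", "iphone", "symbian", "opera mini")


def classify_platform(user_agent: str) -> str:
    """Guess the platform from a user-agent string using simple substring checks."""
    ua = user_agent.lower()
    # Check mobile first: some mobile user-agents also mention Windows or Mac OS X.
    if any(marker in ua for marker in MOBILE_MARKERS):
        return "Mobile"
    if "windows" in ua:
        return "Windows"
    if "mac os x" in ua:
        return "OS X"
    if "linux" in ua:
        return "Linux"
    return "Other"


def platform_shares(user_agents):
    """Return each platform's share of hits as a percentage."""
    counts = Counter(classify_platform(ua) for ua in user_agents)
    total = sum(counts.values())
    return {platform: 100.0 * n / total for platform, n in counts.items()}
```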

Now, I’m not saying that this is representative of the rest of the world or anything, but at least it tells me a couple of things.

Firstly, the whole mobile browsing thing is bigger than I thought it was. I started worrying about this a couple of months ago and got myself an HTC S620 phone and a BlackBerry (for free, through some innovative social engineering and smooth talking). It really does get pretty useful on the move… which is frankly anywhere outside of the home and the office, and sometimes even within. (It’s handier to read recipes off the HTC than a laptop.) Google had caught on to the whole mobile browsing trend a very long time ago, and are rather well positioned to make use of it.

Secondly, it means that rather than worrying about my site working on Linux or OS X (i.e. worrying about what plugins to use), I should worry more about it working on mobile devices (i.e. small screen, no Javascript / CSS).

That’s a fairly big shift in my thinking. Earlier, I had been all for shifting all the processing to client-side Javascript. Now it appears I need to design more towards plain HTML pages generated by Perl / PHP.

Google Chrome screenshots

I went to the Google Chrome site.

Clicking on the “Accept and Install” button…

… automatically launched the downloader in Firefox…

… and (after a fairly short while) started installing the application directly. This may be the most painless install I’ve done in a while.

I clicked on “Customise the settings”

This is what it looks like.

And that’s it! It installs, and launches in just a few seconds. First impressions: the startup and rendering are really fast.

The address bar doubles up as a search bar. Very sensible.

Several nice features: incognito mode, application shortcuts, and developer tools.

The Javascript console has Javascript autocompletion! Watch out, Firebug.

The “Use DNS pre-fetching” looks interesting. My browsing certainly seems faster. Might be faster than Opera, even.

The “Show suggestions for navigation errors” feature.

There’s a task manager…

… that shows how much memory each site uses.

But not all is good. This jQuery animation on my site leaves trails behind.

And the text box resizing is good, but feels a bit… wrong, somehow.

Plus: I can re-import history, bookmarks, etc. from Firefox at any point, so I don’t have to worry about using this as a secondary browser.

Update (8am UK, 3rd Sep): Chrome.exe isn’t installed in your “Program Files” folder. It’s in your “Documents and Settings” folder, under “Local Settings\Application Data\Google\Chrome\Application”. (That’s on Windows XP. Not sure about Vista.)

There’s a Themes folder, so I imagine more themes should be on their way.

There doesn’t seem to be an about:config option. But there are a whole lot of others:

  • about:cache
  • about:dns
  • about:histograms
  • about:memory
  • about:plugins
  • about:stats
  • about:version
  • about:crash
  • about:internets
  • about:network
  • about:blank
  • about:shorthang
  • about:hang
  • about:objects

I’m not entirely sure if the last two work. Based on comments at John Resig’s blog. Go through the code to see if you can find more.

Attack of the bots

One out of every 5 hits to my site is from a bot.

I spent a fair bit of time this weekend analysing my log file for last month (which runs to gigabytes, and I ended up learning a few things about file system optimisation, but more on that later). 80% of the hits were from regular browsers. 20% were from robots. Here’s a sample of the user-agents:

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mediapartners-Google
DotBot/1.0.1 (http://www.dotnetdotcom.org/#info, crawler@dotnetdotcom.org)
Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)
msnbot/1.1 (+http://search.msn.com/msnbot.htm)
FeedBurner/1.0 (http://www.FeedBurner.com)
Mozilla/5.0 (compatible; attributor/1.13.2 +http://www.attributor.com)
WebAlta Crawler/2.0 (http://www.webalta.net/ru/about_webmaster.html) (Windows; U; Windows NT 5.1; ru-RU)
Yandex/1.01.001 (compatible; Win16; I)
...

You get the idea. The bulk of these are search engines. Over two-thirds of the bot requests were from Yahoo Slurp. Now, this struck me as weird. If I take the top 3 search engines that are sending traffic my way,

Search engine   Referral %   Crawl %
Google          90%          24%
Yahoo           6%           66%
Microsoft       3%           0.3%
Others          1%           9%

The search engine that sends me the most traffic is being reasonably conservative, while Yahoo is just eating up the bandwidth on my site. Actually, this shouldn’t bother me too much. It’s not taking up too much bandwidth, or even CPU usage, given that all the bots put together make up only 20% of my traffic. But somehow… it’s sub-optimal. Inelegant, even.

So I decided to take a closer look. Just how often are they crawling my site?

Yahoo Every 5 seconds
Google Every 13 seconds
DotBot Every 9 minutes
Cuill Every 9 minutes
Microsoft Every 18 minutes
Feedburner Every 18 minutes
Attributor Every 23 minutes
Yandex Every 27 minutes

Look at those numbers. Yahoo is hitting my site once every 5 seconds. No wonder there’s a help page at Yahoo titled How can I reduce the number of requests you make on my web site? I followed their advice and set the crawl-delay to 60, so at least it slows down to once a minute.

Just that one little line change should (hopefully) reduce the load on my site by around 15%.
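
Here is a small sketch of what that setting looks like and how a well-behaved crawler would read it, using Python's standard robots.txt parser. The robots.txt content shown is an assumed minimal example (with "Slurp" as the user-agent token for Yahoo's crawler), not a copy of my actual file.

```python
from urllib.robotparser import RobotFileParser

# Assumed minimal robots.txt: ask Yahoo's crawler to wait 60 seconds between requests.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.crawl_delay("Slurp"))      # 60: the delay, in seconds, for Yahoo's crawler
print(parser.crawl_delay("Googlebot"))  # None: no rule applies to other bots
```

Whether a given bot actually honours the delay is up to the bot, of course.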

As for the other engines, I don’t mind that much in terms of load.

  • Google, for all that it crawls every 13 seconds, has faithfully reported that it has only 11% of my site under its index, so I’ve no idea what they’re doing, but I’m not complaining about the traffic that’s coming my way.
  • DotBot. Today was the first I’d heard of them. Visited the site, and smiled. These guys can do all the crawling of my site that they like, and I hope something interesting comes out of their work.
  • Cuill sends me 0.2% of my traffic, but it’s a new search engine, so I’m happy to give it time.
  • Microsoft’s OK; it sends me a tiny stream of traffic.
  • Feedburner is just pinging my RSS feed every 18 minutes.
  • Attributor and Yandex I’m hearing of for the first time, again. Not too much load on my system, so that’s OK.

What’s amazing is the sheer number of bots out there. Last month, I counted over 600 distinct user-agent strings just representing bots. So it’s true. The Web is no longer just for humans. We do need a Semantic Web.