S Anand

Scraping for a laptop

I’ve returned my laptop, and it’s time to buy a new one. For the first time in my life, I’m buying a laptop for myself.

I have a fairly clear idea of what I want: a 500GB+ 7200 rpm hard disk with 4GB of RAM and an Intel Core i7. I thought that would make it easy to find one of those powerful laptops for producing music, since I record some stuff as a hobby too.

Sheer naïveté. Not a single site let me filter by hard disk rpm in India. (To be fair, I haven’t found any sites outside India that did that either.)

After spending a good two hours hunting for the details and collating it, I did what I normally would: spend 30 minutes writing a scraper. The scraper runs through all laptops on Flipkart and pulls out all of their specs. Thanks to the diligence of the good folks at Flipkart, this information is readily available on each page. The HTML is structured quite neatly too, so it was just a 30-line program to scrape it all. Full credit to ScraperWiki as well — I could use it on a netbook without any developer tools installed.

The scraper took 2 hours to run. Feel free to filter through the output (CSV) for your favourite laptop, or fork the code and pull any other data you like.
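For anyone curious about the shape of such a scraper, here is a minimal sketch in Python. It is not the ScraperWiki code linked above. The HTML snippet, the class names specs-key and specs-value, and the spec labels are all assumptions made up for illustration; the real pages would need their own selectors.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real product pages are not reproduced here.
SAMPLE_PAGE = """
<table>
  <tr><td class="specs-key">Processor</td><td class="specs-value">Intel Core i7</td></tr>
  <tr><td class="specs-key">RAM</td><td class="specs-value">4 GB</td></tr>
  <tr><td class="specs-key">HDD Capacity</td><td class="specs-value">500 GB</td></tr>
  <tr><td class="specs-key">RPM</td><td class="specs-value">7200</td></tr>
</table>
"""


class SpecParser(HTMLParser):
    """Collect key/value pairs from cells tagged with the assumed class names."""

    def __init__(self):
        super().__init__()
        self.specs = {}
        self._cell = None  # 'key', 'value', or None while outside a spec cell
        self._key = None   # the most recent key seen

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "td" and cls == "specs-key":
            self._cell = "key"
        elif tag == "td" and cls == "specs-value":
            self._cell = "value"

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._cell == "key":
            self._key = text
        elif self._cell == "value" and self._key is not None:
            self.specs[self._key] = text


def parse_specs(html):
    """Return a dict of spec name -> spec value for one product page."""
    parser = SpecParser()
    parser.feed(html)
    return parser.specs


specs = parse_specs(SAMPLE_PAGE)
# The filter no shopping site offered: hard disk speed.
is_fast_disk = int(specs.get("RPM", 0)) >= 7200
```

Run over every product page and written out as one row per laptop, a dict like this is all the CSV needs.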

The next chapter of my life

I’m writing this post on a one-way flight from London back to India. I’ve moved on from Infosys Consulting, and am starting up on my own.

I’ve wanted to do this for a long time. There’s always more freedom in your own company than someone else’s. There’s often more money in it too, if you’re lucky enough. But my upbringing is a bit too conservative to make that bold step. However, given that my father runs his own firm, I figured it was just a question of time for me to do the same.

Two years ago, in Jan 2010, I picked up Rashmi Bansal’s Stay Hungry Stay Foolish at an airport. That book killed the last bit of resistance I had. If the people in that book could succeed, I felt I could too. And if what they did (building small companies, not huge ones) could be called a success, I could be successful too.

After the flight, it was clear in my mind. I would be an entrepreneur. I would create a small company that would probably fold. Then I’d do it again. And again, 10 times, because 1 in 10 companies survive. And finally, I’d be running a small business that’d be called successful by virtue of having survived. A modest, achievable ambition that I had the courage for.

I usually make big decisions without analysis, by just sleeping over them. I slept over it and announced it to my family the next day. I’m not sure they believed me.

Two months later, along with a friend, I built a dynamic digital image resizing product. We had our wives start a company in the UK, and tried selling it to retailers. There clearly was a demand. The problem was, we didn’t know how to sell. After a year and having spent £500 with no sales, it was clear to us that venture #1 had failed. We eventually shut it down.

In the middle of this, my ex-boss from IBM told me that he was looking to start a venture, focusing on mobile, rural BPO and energy management. This later on changed to data analytics and visualisation. They all sounded like fun, so I said I’d help out in my spare time.

A few months later, a classmate told me he’d started a business digitising school report cards. That sounded like fun too, so I said I’d help out in my spare time.

Now, if that sounds like I had a lot of spare time on my hands — you’re right, I did. And it’s time to talk about the jobs in my life. My first 3 years at IBM were fun. I was coding, learning, and leading a bachelor’s life with friends, money, and no responsibilities. My 4 years at BCG were strenuous with 80-hour weeks, but it was interesting and challenging. I was newly married, and between work and home responsibilities, I had no time for fun.

I moved to Infosys Consulting in the UK with the specific aim of rectifying that (and for health reasons as well). In the last 7 years, the work has (except on occasion) been a bit boring, but very relaxing. On most days, I would spend 4 hours working, and 4 hours learning new stuff. The things I learnt only helped me be more efficient. So I ended up getting even more work done in less time.

Many things came out of this. Firstly, I recovered my health. We had a daughter, and I spent more time with her. I started coding in earnest again. By 2007, I was writing code as part of my projects — stuff that others whose job it was were unable to. By 2009, I had a few websites running, like an Indian music search engine, an IMDb Top 250 tracker, a few transliterators, and so on.

So when I said I’d help out with these startups, it wasn’t an empty promise. For the last 18 months, I’ve had a day job and three night jobs. I never did justice to any of them in my opinion, but I had more fun than ever in my life, I learnt more than ever in my life, and I produced more tangible output than ever in my life. Sometimes, quantity beats quality or reliability.

Both these startups are doing well today. Gramener.com offers data visualisation and IT services. I will be joining them as Chief Data Scientist. Reportbee.com offers a hosted report card solution. I will continue helping them out. And I will continue working with a few NGOs.

You’ll see me a lot more active online now. I can publicly write about my work — something I’ve been unable to do the last 11 years.

I am relocating to Bangalore. From a professional front, it’s an obvious choice. That’s where the geeks are. In my last visit to India, I was at Bangalore, Chennai and Hyderabad. In the latter two, it’s tough to meet geeks. And when you do, it’s no easier to find the next. Bangalore has many more geeks, and they’re fairly well networked.

From a personal front, too, Bangalore works well. It’s close enough to Chennai without actually being in Chennai.

It’s 10am on Thu 12th Jan. Our flight is descending into Delhi airport. It’s the start of a new chapter in my life. Scary, but exciting. Wish me luck!

Markdress

This year, I’ve converted the bulk of my content into Markdown – a simple way of formatting text files in a way that can be rendered into HTML.

Not out of choice, really. It was the only solution if I wanted to:

  • Edit files on my iPad / iPhone (I’ve started doing that a lot more recently)
  • Allow the contents to be viewable as HTML as well as text, and
  • Allow non techies to edit the file

As a bonus, it’s already the format Github and Bitbucket use for markup.

If you toss Dropbox into the mix, there’s a powerful solution there. You can share files via Dropbox as Markdown, and publish them as web pages. There are already a number of solutions that let you do this. DropPages.com and Pancake.io let you share Dropbox files as web pages. Calepin.co lets you blog using Dropbox.

My needs were a bit simpler, however. I sometimes publish Markdown files on Dropbox that I want to see in a formatted way – without having to create an account. Just to test things, or share temporarily.

Enter Markdress.org. My project for this morning.

Just add any URL after markdress.org to render it as Markdown. For example, to render the file at http://goo.gl/zTG1q, visit http://markdress.org/goo.gl/zTG1q.

To test it out, create any text file in your Dropbox public folder, get the public link:

… and append it to http://markdress.org/ without the http:// prefix.
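If you’d rather build the link in code, the rule is simple enough to sketch in Python. The function name markdress_url is mine, not part of the service, and the sketch only handles the http:// prefix mentioned above.

```python
def markdress_url(source_url):
    """Drop a leading http:// and append the rest to markdress.org."""
    prefix = "http://"
    if source_url.startswith(prefix):
        source_url = source_url[len(prefix):]
    return "http://markdress.org/" + source_url


print(markdress_url("http://goo.gl/zTG1q"))
# prints http://markdress.org/goo.gl/zTG1q
```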

GarageBand in Phir Se Ud Chala

A month ago, I was at the theatre watching Ra.One. The movie was terrible, yet enjoyable. But I’m going to talk about something else – a song I heard that caught my imagination.

The song is Phir Se Ud Chala from Rockstar. Around 14 seconds into the video, you’ll hear a guitar start off at the background. That’s what caught my ear first – because I’d heard it before. Listen to this piece below:

Mystic light

I’d created this a couple of months ago with GarageBand on my iPad2. It just plays two Apple Loops one after another.

photo

The first one that you hear – Cheerful Mandolin 07 – is exactly the same background music that you hear in Phir Se Ud Chala. Guess A R Rahman uses GarageBand too!

(The strange thing is, I found no mention of this anywhere on the internet, as of 2 Dec 2011. Thought I’d have a go and be the first… just in case someone searches for Apple Loops or GarageBand in Phir Se Ud Chala from Rockstar.)

Protect static files on Apache with OpenID

I moved from static HTML pages to web applications and back to static HTML files. There’s a lot to be said for the simplicity and portability of a bunch of files. Static site generators like Jekyll are increasingly popular; I’ve built a simple publisher that I use extensively.

Web apps give you something else, though, that are still useful on a static site. Access control. I’ve been resorting to htpasswd to protect static files, and it’s far from optimal. I don’t want to know or manage users’ passwords. I don’t want them to remember a new ID. I just want to allow specific people to log in via their Google Accounts. (OpenID is too confusing, and most people use Google anyway.)

The easiest option would be to use Google AppEngine. But their new pricing worries me. Hosting on EC2 is expensive in the long run. All my hosting is now out of a shared Hostgator server that offers Apache and PHP.

So, obviously, I wrote a library that protects static files on Apache/PHP using OpenID.

Download the code

 

Say you want to protect /home/www which is accessible at http://example.com/.

  1. Copy .htaccess and _auth/ under /home/www.
  2. In .htaccess, change RewriteBase to /
  3. In _auth/, copy config.sample.php into config.php, and
    1. change $AUTH_PATH to http://example.com/
    2. add permitted email IDs to function allow()

Now, when you visit http://example.com, you’ll be taken to Google’s login page. Once you log in, if your email ID is allowed, you’ll be able to see the file.

Feel free to try, or fork the code.

Codecasting

The best way to explain code to a group of people is by walking through it. If they’re far away in space or time, then a video is the next best thing. You can recommend them to try out the best coding apps as well.

The trouble with videos, though, is that they’re big. I can’t host them on my server – I’d need YouTube. Editing them is tough. You can’t copy & paste code from videos. And so on.

One interesting alternative is to use presentations with audio. Slideshare, for instance, lets you share slides and sync it with audio. That almost works. But it’s still not good enough. I’d like code to be stored as code.

What I really need is codecasting: a YouTube or Slideshare for code. The closest I’d seen until the day before yesterday was etherpad or ttyrec – but neither supports audio.

Enter Popcorn. It’s a Javascript library from Mozilla that, among other things, can fire events when an audio/video element reaches a particular point.

Watch a demo of how I used it for codecasting

A look at the code will show you that I’m using two libraries: SyntaxHighlighter to highlight the code, and Popcorn. The meat of the code I’ve written is in this subtitle function.

function subtitle(media_node, pre_node, events) {
  var pop = Popcorn(media_node);
  for (var i=0, l=events.length; i<l; i++) {
    // Build a selector for every line to highlight during this event
    for (var j=0, line_selector=[], line_no; line_no=events[i][1][j]; j++) {
      line_selector.push(pre_node + ' .number' + line_no)
    }
    // An event lasts until the next one starts; the last one gets 999 seconds
    var start = events[i][0],
        end = i<l-1 ? events[i+1][0] : events[i][0]+999;
    (function(start, end, selector) {
      pop.code({start: start, end: end,
        onStart: function(o) { $(selector).addClass('highlighted'); },
        onEnd: function(o) { $(selector).removeClass('highlighted'); }
      })
    })(start, end, line_selector.join(','));
  }
}

When called like this:

subtitle('#audio', 'pre', [
  [ 1, [1,2,3]],
  [ 5, [4,5,6]],
  [ 9, [7,8]],
])

… it takes the #audio element, when it plays to 1 second, highlights lines 1,2,3; at 5 seconds, highlights lines 4,5,6; and so on.
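The timing rule is easier to see in isolation. Here is a rough re-expression of it in Python, not the code the demo runs: each event’s highlight lasts until the next event starts, and the last one gets an extra 999 seconds. The name highlight_windows is an assumption of mine, and the selector format simply mirrors the ' .number' + line_no pattern used in subtitle.

```python
def highlight_windows(pre_selector, events, last_duration=999):
    """Turn [start, [line numbers]] events into (start, end, selector) tuples."""
    windows = []
    for i, (start, lines) in enumerate(events):
        if i < len(events) - 1:
            end = events[i + 1][0]        # ends when the next event begins
        else:
            end = start + last_duration   # the last event has no successor
        selector = ",".join("%s .number%d" % (pre_selector, n) for n in lines)
        windows.append((start, end, selector))
    return windows


for window in highlight_windows("pre", [[1, [1, 2, 3]], [5, [4, 5, 6]], [9, [7, 8]]]):
    print(window)
```

So lines 1,2,3 stay highlighted from second 1 to second 5, lines 4,5,6 from 5 to 9, and lines 7,8 from 9 until the audio ends (or 999 seconds later, whichever comes first).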

Another thing that helped was that my iPad has a much better mic than my laptop, and ClearRecord is a really simple way to create recordings with minimal noise. [Note to self: sampling at 16KHz and saving as a VBR MP3 (45-85kbps) seems the best trade-off.]

With these tools, my time to prepare a tutorial went down from 4 hours to half an hour!


Javascript arrays vs objects

Summary: Arrays are a lot smaller than objects, but only slightly faster on newer browsers.

I’m writing an in-memory Javascript app that handles several thousand rows. Each row could be stored either as an array [1,2,3] or an object {"x":1,"y":2,"z":3}. Having read up on the performance of arrays vs objects, I thought I’d do a few tests on storing numbers from 0 to 1 million. The results for Chrome are below. (Firefox 7 was similar.)

                                              Time    Size (MB)
Array: x[i] = i                               2.44s   8
Object: x[i] = i                              3.02s   57
Object: x["a_long_dummy_testing_string"+i]=i  4.21s   238

The key lessons for me were:

  • Browsers used to process arrays MUCH faster than objects. This gap has now shrunk.
  • However, arrays are still better: not for their speed, but for their space efficiency.
  • If you’re processing a million rows or less, don’t worry about memory. If you’re storing stuff as arrays, you can store 128 columns in 1GB of RAM (1024/8=128).
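The 1024/8 arithmetic extends to the other two rows of the table. A quick Python sketch, using only the sizes measured above and assuming memory scales linearly with the number of columns (which I haven’t verified):

```python
RAM_MB = 1024  # 1GB, as in the text

# Measured sizes (MB) for one million values, taken from the table above.
size_mb = {
    "array": 8,
    "object": 57,
    "object with long string keys": 238,
}


def columns_in_ram(mb_per_column, ram_mb=RAM_MB):
    """How many million-row columns fit, assuming linear scaling."""
    return ram_mb // mb_per_column


for kind, mb in size_mb.items():
    print(kind, columns_in_ram(mb))
```

That works out to 128 columns as arrays, 17 as objects, and only 4 as objects with long string keys.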

Software for my new laptop 2

Time for a new laptop, and to replace software. Here’s my new list.

A lot has changed in the last 5 years. Mainly, I use the browser, cygwin and Portable Apps a lot more. (The last is to escape jailers, not registry bloat.)

Media

  • Chrome [new]: For browsing and development. Fast, light, and stays out of the way.
  • Firefox: I keep it just for printing. Chrome sucks at printing.
  • Media Player Classic: Nothing against it, but I decided to stick to just one app, which is…
  • VLC: Continues to be the best media player, IMHO.
  • WinAmp: I just manage my playlists as M3U files, using Python programs.
  • Audacity: Still the easiest way to record audio.
  • Camstudio: The simplest free portable screen capture software I know.
  • PicPick [new]: Lightweight, powerful screenshot grabber
  • VirtualDub: Not the simplest, but still good for what I need: cropping and joining video.
  • MediaCoder [new]: Good for video/audio conversions. Maybe I’ll install this later.
  • Foxit Reader: The simplest free portable PDF reader I know, better than…
  • NitroPDF Reader [new]: … which is good for printing PDFs – better than…
  • Primo PDF: … which has trouble on rare occasions.
  • Microsoft Reader: I have a lot of ebooks in .LIT.
  • Kindle for PC [new]: I don’t own a Kindle, but I’ve bought a few ebooks.
  • Paint.NET: Good enough for cropping and adjusting colours on images.
  • Windows Live Writer [new]: The best way to write this blog WYSIWYG
  • Inkscape [new]: I occasionally edit vector graphics.
  • Google Earth. Google Maps is good enough.
  • ImgBurn: I no longer use CDs/DVDs. Just flash drives and external hard disks.
  • Picasa: I’ve stopped browsing pictures. No time.

Sharing

  • Dropbox [new]: Simplest way of sharing files.
  • Skype: I use it more than my phone.
  • Google Talk: For those friends who have chat enabled on Gmail.
  • TeamViewer [new]: Pretty efficient screen sharing. Works better than Skype, I think.
  • Google Calendar Sync: To keep Outlook in sync with Google Calendar.

Utilities

  • 7-Zip [new]: Covers all compressed formats, and has the best compression ratio.
  • WinRAR: 7-Zip has it covered.
  • AutoHotKey [new]: Shockingly powerful macro functionality. Shockingly underused.
  • Clip [new]: Command line clipboard. dir | clip copies the directory to the clipboard.
  • ClipX [new]: Stores multiple clipboard entries and history. Invaluable.
  • DiskTT [new]: I’m paranoid about disk speed. I keep measuring it.
  • WinDirStat [new]: Best way to find what’s taking up space on disk.
  • ProcessExplorer [new]: Just in case Task Manager doesn’t show you everything.
  • Google Desktop: Well, it’s dead.
  • mDesktop [new]: A Virtual Desktop Manager (multiple screens) for Windows 7.
  • PowerToys: doesn’t work on Windows 7, but I got X-Mouse working.
  • Teracopy: I don’t worry too much about copying files any more. Maybe later.
  • Junction Link Magic [new]: To map folders. But I now use Cygwin, and symlinks rock.
  • uTorrent [new]: For bittorrent.
  • ntlmaps [new]: Turns a proxy requiring a password into a proxy not requiring a password.
  • Putty [new]: SSH for Windows, but can also act as an SSH tunnel
  • TrueCrypt [new]: To securely back up my bank details on the cloud.

Development

Data Visualisation

  • R [new]. The God of all statistical packages. Install reshape and ggplot2.
  • Gephi [new]: Does network visualisations quite well. 
  • GraphViz [new]: Does network visualisations not quite as well.
  • Google Refine [new]: Helps clean up messy data.
  • qhull [new]: For voronoi treemaps. Don’t ask.
  • wkhtml2pdf [new]: To print web pages as PDF.

What am I missing that you really like?

Faster data crunching

I’ve been playing with big data lately.

The good part is, it’s easy to get interesting results. The data is so unwieldy that even average value calculations provoke an “Amazing! I didn’t know that” response. (No exaggeration. I heard this from two separate ~$1bn businesses this month.)

The bad part is that calculating even that simple average is slow.

For example, take this 40MB file (380MB unzipped) and extract the first column.

The simplest Python script to get the first column looks like this:

import csv
import fileinput
for row in csv.reader(fileinput.input(), delimiter='\t'):
    if len(row) > 0: print(row[0])

That took a good 3 minutes to execute on my laptop.

Since I’m used to UNIX data processing, I tried cut -f1. Weirdly, that’s worse: 5 minutes. Paradoxically, awk '{print $1}' takes only 17 seconds. That’s about 11 times faster than the Python script. Clearly the tool makes a big difference. And we always knew UNIX was fast.

But I also ran these on an Amazon EC2 server, and a Hostgator server. Here’re the results.

               python       cut           awk
My Dell E5400  3:04 (1x)    5:42 (0.5x)   0:17 (11x)
EC2 standard   0:33 (6x)    0:5.6 (33x)   0:16 (11x)
Hostgator      0:19 (10x)   0:2.5 (74x)   0:0.7 (265x)

What took 3 minutes with Python on my Dell E5400 took less than a second on Hostgator’s server with awk. Over 250 times faster. (Not 250%. 250 times).

And it’s not just hardware. A good tool (awk) made things 11x faster on my machine. Good hardware (Hostgator) made the same program 10x faster. But choosing the right combination can make things go faster than 11 x 10 = 110 times. Much faster.
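The multipliers are easy to recompute. This Python sketch uses the times from the table, converted to seconds (3:04 is 184s, 5:42 is 342s). Since the table shows rounded times, the factors it produces can differ a little from the ones printed there.

```python
# Wall-clock times in seconds, read off the table above (rounded).
times = {
    "My Dell E5400": {"python": 184.0, "cut": 342.0, "awk": 17.0},
    "EC2 standard": {"python": 33.0, "cut": 5.6, "awk": 16.0},
    "Hostgator": {"python": 19.0, "cut": 2.5, "awk": 0.7},
}
baseline = times["My Dell E5400"]["python"]


def speedup(machine, tool):
    """How many times faster than Python on the Dell this combination is."""
    return baseline / times[machine][tool]


tool_gain = speedup("My Dell E5400", "awk")     # better tool, same machine
hardware_gain = speedup("Hostgator", "python")  # better machine, same tool
combined = speedup("Hostgator", "awk")          # both at once

print(round(tool_gain), round(hardware_gain), round(combined))
print(combined > tool_gain * hardware_gain)
```

The last line compares the combined gain with the product of the two individual gains, which is the point of the paragraph above.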

There are a few things I’m taking away from this.

  1. Good hardware can speed you up as much as (or more than) choosing the right tool.
  2. Good hardware can be rented. From many places. Cheaply.
  3. Always test what’s fast. awk’s fastest on my machine and Hostgator, but not on EC2.
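Testing what’s fast doesn’t need much code. Here’s a small Python timing harness as a sketch. It runs on synthetic tab-separated data rather than the 380MB file, and the split-based variant is my own addition for comparison, not something measured in this post. Note that it ignores CSV quoting, so it only matches csv.reader on plain data like this.

```python
import csv
import io
import time

# Synthetic tab-separated data; the 380MB file from the post is not used here.
sample = "".join("%d\tfoo\tbar\n" % i for i in range(100000))


def first_column_csv(text):
    """First column via the csv module, skipping empty rows."""
    return [row[0] for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def first_column_split(text):
    """First column via plain string splitting (no quoting support)."""
    return [line.split("\t", 1)[0] for line in text.splitlines() if line]


def timed(func, text):
    """Return (elapsed seconds, result) for one call."""
    started = time.perf_counter()
    result = func(text)
    return time.perf_counter() - started, result


for func in (first_column_csv, first_column_split):
    seconds, column = timed(func, sample)
    print("%-20s %.3fs %d rows" % (func.__name__, seconds, len(column)))
```

Which one wins, and by how much, is exactly the kind of thing that varies by machine, so run it where the real job will run.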

India district map

I put together a district map of India in SVG this weekend.

So what?

You can now plot data available at a district level on a map, like the temperature in India over the last century (via IndiaWaterPortal). The rows are years (1901, 1911, … 2001) and the columns are months (Jan, Feb, … Dec). Red is hot, green is cold.

temperature

(Yeah, the west coast is a great place to live in, but I probably need to look into the rainfall.)

districts.svg has 640 districts (I’ve no idea what the 641st looks like) and is tagged with the State and District names as titles:

<g title="Madhya Pradesh">
  <path title="Alirajpur" d="..." />
  <path title="Jhabua" d="..." />
  ...
</g>
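Because the names sit in title attributes, a few lines of Python can look up districts and colour them. This is only a sketch under assumptions: it works on a tiny stand-in for districts.svg, assumes each district path is a direct child of its state group, and the colour value is made up.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for districts.svg using the structure shown above.
SAMPLE_SVG = """
<svg xmlns="http://www.w3.org/2000/svg">
  <g title="Madhya Pradesh">
    <path title="Alirajpur" d="M0 0 L1 1" />
    <path title="Jhabua" d="M1 1 L2 2" />
  </g>
</svg>
""".strip()


def local_name(tag):
    """Strip an XML namespace such as '{http://www.w3.org/2000/svg}path'."""
    return tag.rsplit("}", 1)[-1]


def colour_districts(svg_text, colours):
    """Set fill on the named districts; return the tree and a district -> state map."""
    root = ET.fromstring(svg_text)
    district_state = {}
    for group in root.iter():
        if local_name(group.tag) != "g" or "title" not in group.attrib:
            continue
        for path in group:
            if local_name(path.tag) == "path" and "title" in path.attrib:
                name = path.attrib["title"]
                district_state[name] = group.attrib["title"]
                if name in colours:
                    path.set("fill", colours[name])
    return root, district_state


# Hypothetical data: paint one district red.
root, district_state = colour_districts(SAMPLE_SVG, {"Alirajpur": "#cc0000"})
print(district_state)
```

Feed it a dict of district name to colour (say, derived from temperature) and write the tree back out to get a map like the one above.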

How?

I made it from the 2011 census map (0.4MB PDF). I opened it in Inkscape, removed the labels, added a layer for the districts, and used the paint bucket to fill each district’s area. I then saved the districts layer, cleaning it up a bit. Then I labelled each district with a title. (Seemed like the easiest way to get this done.)

Thanks to @planemad, @gkjohn, @arjunram for inputs. Play around. Feedback welcome.