Coding

Restartable and Parallel

When processing data at a large scale, there are two characteristics that make a huge difference to my life.

Restartability. When something goes wrong, being able to continue from where it stopped. In my opinion, this is more important than parallelism. There’s nothing as depressing as having to start from scratch every time. Think of it as the ability to save a game as opposed to starting from Level 1 in every life.

Parallelism. Being able to run multiple processes in parallel. Often, this is easy. You don’t need threads. Good old UNIX xargs can do a great job of it. Interestingly, I’ve never used Hadoop for any real-life problem. I’ve gotten by with UNIX commands and smart partitioning.

The “smart partitioning” bit is important. For example, if you’re dealing with telecom data, most of your metrics (e.g. did the number of calls grow or fall, are there more outgoing or incoming calls, etc.) are calculated per mobile number. So if you have multiple data sets, as long as all the data related to one mobile number is on the same system, you’re fine. If you have 100 machines, just split the data based on the last 2 digits of the mobile number. So data about 9012345678 would go to machine 78 (the last two digits). Given a mobile number, for any type of data, you’d know exactly which machine holds it. For all practical purposes, that gives you the basics of a distributed file system.
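
Splitting a dump across 100 machines this way is just a few lines of code. Here’s a minimal sketch in Node.js (the file name and record layout are made up for illustration):

var fs = require('fs');

// Hypothetical input: one call record per line, mobile number as the first field.
var buckets = {};
fs.readFileSync('calls.csv', 'utf8').split('\n').forEach(function (line) {
  if (!line) return;
  var mobile = line.split(',')[0],
      machine = mobile.slice(-2);                       // "9012345678" -> "78"
  (buckets[machine] = buckets[machine] || []).push(line);
});

// part-NN.csv now holds everything about its mobile numbers and can be
// shipped to (and processed on) machine NN independently.
Object.keys(buckets).forEach(function (machine) {
  fs.writeFileSync('part-' + machine + '.csv', buckets[machine].join('\n'));
});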

(I’m not saying you don’t need Hadoop. Just that I haven’t needed it.)

Colour spaces

In reality, a colour is a combination of light waves with frequencies of roughly 430–750THz, just like sound is a combination of sound waves with frequencies from 20–20,000Hz. Just as mixing various pure notes produces a new sound, mixing various pure colours (like those in a rainbow) produces new colours (like white, which isn’t in the rainbow).

Our eyes aren’t like our ears, though. They have 3 sensors that are triggered differently by different frequencies. The sensors roughly peak around red, green and blue. Roughly.

It turns out that it’s possible to recreate most (not all) colours using a combination of just red, green and blue by mimicking these three sensors to the right level. That’s why TVs and monitors have red, blue and green cells, and we represent colours using hex triplets for RRGGBB – like #00ff00 (green).

There are a number of problems with this from a computational perspective. Conceptually, we think of (R, G, B) as a 3-dimensional cube. That’d mean that 100% red is about as bright as 100% green or blue. Unfortunately, green is a lot brighter than red, which is a lot brighter than blue. Our 3 sensors are not equally sensitive.

You’d also think that a colour that’s numerically mid-way between 2 colours should appear to be mid-way. Far from it.

This means that if you’re picking colours using the RGB model, you’re using something very far from the intuitive human way of perceiving colours.

Which is all very nice, but I’m usually in a rush. So what do I do?

  1. I go to the Microsoft Office colour themes and use a colour picker to pick one. (I extracted them to make life easier.) These are generally good on the eye.
  2. Failing that, I pick something from http://kuler.adobe.com/
  3. Or I go to http://colorbrewer2.org/ and pick a set of colours
  4. If I absolutely have to do things programmatically, I use the HCL colour space. The good part: it’s perceptually uniform. The bad part: not every interpolation lands on a valid (displayable) colour. (There’s a sketch below.)
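
When I do need HCL programmatically, I lean on a library rather than doing the maths myself. A minimal sketch, assuming d3 (which ships an HCL interpolator); the endpoint colours here are arbitrary:

// Five perceptually evenly spaced colours between two endpoints.
var steps = 5,
    interpolate = d3.interpolateHcl('#1f77b4', '#ff7f0e'),
    palette = [];

for (var i = 0; i < steps; i++) {
  palette.push(interpolate(i / (steps - 1)));
}

// palette holds 5 CSS colour strings. Interpolating in HCL can pass through
// points outside the RGB gamut, which is the "bad part" mentioned above.
console.log(palette);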

Is Protocol Buffers worth it?

Google’s Protocol Buffers is a “language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler”.

XML is slow and large. There’s no doubting that. JSON’s my default alternative, though it’s a bit large. CSV’s ideal for tabular data, but ragged hierarchies are a bit difficult.

I was trying to see if Protocol Buffers would be smaller and faster, at least when using Python. I took JSON as the base, and checked the write speed, read speed and file sizes. Here’s the comparison:

[Chart: write speed, read speed and file size comparison]

Protocol Buffers are 17 times slower to write and almost 6 times slower to read than JSON files. File sizes are smaller, but then, all it takes is a simple gzip operation to compress the JSON files even smaller. Reading json.gz files is just 2% slower than JSON files, and writing them is only 4 times slower.
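
The benchmark code itself is Python (repository below), but the json.gz approach is the same in any language. A rough sketch in Node.js, with a made-up payload:

var fs = require('fs'),
    zlib = require('zlib');

var data = {rows: [{x: 1, y: 2}, {x: 3, y: 4}]};        // hypothetical payload

// Write: serialise to JSON, then gzip.
fs.writeFileSync('data.json.gz', zlib.gzipSync(JSON.stringify(data)));

// Read: gunzip, then parse.
var loaded = JSON.parse(zlib.gunzipSync(fs.readFileSync('data.json.gz')));
console.log(loaded.rows.length);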

The code base is at https://bitbucket.org/sanand0/protobuftest

On the whole, it appears that GZipped JSON files are smaller, faster, and just as simple as Protocol Buffers. What am I missing?

Update: When you add GZipped CSV to the mix, it’s twice as fast as GZipped JSON to read: clearly a huge win. It’s only slightly slower to write, and compresses a tiny bit more than JSON.

Audio data URI

Turns out that you can use data URIs in the <audio> tag.

Just upload an MP3 file to http://dataurl.net/#dataurlmaker and you’ll get a long string starting with data:audio/mp3;base64...

Insert this into your HTML:

<audio controls src="data:audio/mp3;base64...">

That’s it – the entire MP3 file is embedded into your HTML page without requiring additional downloads.

This takes a bit more bandwidth than the MP3, and won’t work on Internet Explorer. But for modern browsers, and small audio files, it reduces the overall load time – sort of like CSS sprites.
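
You don’t even need the website. The same data URI can be built in the browser itself. A sketch, assuming a modern browser and a hypothetical clip.mp3 on the same server:

// Fetch the MP3, base64-encode it via FileReader, and embed it in an <audio> tag.
fetch('clip.mp3')
  .then(function (res) { return res.blob(); })
  .then(function (blob) {
    var reader = new FileReader();
    reader.onload = function () {
      var audio = document.createElement('audio');
      audio.controls = true;
      audio.src = reader.result;          // "data:audio/mpeg;base64,..."
      document.body.appendChild(audio);
    };
    reader.readAsDataURL(blob);
  });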

So, on my bus ride today, I built a little HTML5 musical keyboard that generates data URIs on the fly. Click to play.

[Screenshot: the HTML5 musical keyboard]

Markdress

This year, I’ve converted the bulk of my content into Markdown – a simple way of formatting text files in a way that can be rendered into HTML.

Not out of choice, really. It was the only solution if I wanted to:

  • Edit files on my iPad / iPhone (I’ve started doing that a lot more recently)
  • Allow the contents to be viewable as HTML as well as text, and
  • Allow non techies to edit the file

As a bonus, it’s already the format Github and Bitbucket use for markup.

If you toss Dropbox into the mix, there’s a powerful solution there. You can share files via Dropbox as Markdown, and publish them as web pages. There are already a number of solutions that let you do this. DropPages.com and Pancake.io let you share Dropbox files as web pages. Calepin.co lets you blog using Dropbox.

My needs were a bit simpler, however. I sometimes publish Markdown files on Dropbox that I want to see in a formatted way – without having to create an account. Just to test things, or share temporarily.

Enter Markdress.org. My project for this morning.

Just add any URL after markdress.org to render it as Markdown. For example, to render the file at http://goo.gl/zTG1q, visit http://markdress.org/goo.gl/zTG1q.
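
There isn’t much under the hood. This isn’t the actual Markdress code, just a sketch of the idea, assuming a recent Node.js (with a global fetch), Express and the marked library:

var express = require('express'),
    marked = require('marked'),
    app = express();

// Whatever follows the hostname is treated as a URL, fetched, and rendered as Markdown.
app.get(/^\/(.+)/, function (req, res) {
  fetch('http://' + req.params[0])                      // e.g. /goo.gl/zTG1q
    .then(function (r) { return r.text(); })
    .then(function (text) { res.send(marked.parse(text)); })
    .catch(function () { res.status(404).send('Could not fetch that URL'); });
});

app.listen(3000);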

To test it out, create any text file in your Dropbox Public folder, get the public link, and append it to http://markdress.org/ without the http:// prefix.

Protect static files on Apache with OpenID

I moved from static HTML pages to web applications and back to static HTML files. There’s a lot to be said for the simplicity and portability of a bunch of files. Static site generators like Jekyll are increasingly popular; I’ve built a simple publisher that I use extensively.

Web apps give you something else, though, that’s still useful on a static site: access control. I’ve been resorting to htpasswd to protect static files, and it’s far from optimal. I don’t want to know or manage users’ passwords. I don’t want them to remember a new ID. I just want to allow specific people to log in via their Google Accounts. (OpenID is too confusing, and most people use Google anyway.)

The easiest option would be to use Google AppEngine. But their new pricing worries me. Hosting on EC2 is expensive in the long run. All my hosting is now out of a shared Hostgator server that offers Apache and PHP.

So, obviously, I wrote a library that protects static files on Apache/PHP using OpenID.

Download the code

 

Say you want to protect /home/www which is accessible at http://example.com/.

  1. Copy .htaccess and _auth/ under /home/www.
  2. In .htaccess, change RewriteBase to /
  3. In _auth/, copy config.sample.php into config.php, and
    1. change $AUTH_PATH to http://example.com/
    2. add permitted email IDs to function allow()

Now, when you visit http://example.com, you’ll be taken to Google’s login page. Once you log in, if your email ID is allowed, you’ll be able to see the file.

Feel free to try, or fork the code.

Codecasting

The best way to explain code to a group of people is by walking through it. If they’re far away in space or time, then a video is the next best thing.

The trouble with videos, though, is that they’re big. I can’t host them on my server – I’d need YouTube. Editing them is tough. You can’t copy & paste code from videos. And so on.

One interesting alternative is to use presentations with audio. Slideshare, for instance, lets you share slides and sync it with audio. That almost works. But it’s still not good enough. I’d like code to be stored as code.

What I really need is codecasting: a YouTube or Slideshare for code. The closest I’d seen until the day before yesterday was etherpad or ttyrec – but neither supports audio.

Enter Popcorn. It’s a Javascript library from Mozilla that, among other things, can fire events when an audio/video element reaches a particular point.

Watch a demo of how I used it for codecasting

A look at the code will show you that I’m using two libraries: SyntaxHighlighter to highlight the code, and Popcorn. The meat of the code I’ve written is in this subtitle function.

function subtitle(media_node, pre_node, events) {
  var pop = Popcorn(media_node);
  for (var i=0, l=events.length; i<l; i++) {
    // Build a selector covering every line to highlight for this event
    for (var j=0, line_selector=[], line_no; line_no=events[i][1][j]; j++) {
      line_selector.push(pre_node + ' .number' + line_no);
    }
    // Each highlight runs from its own start time to the next event's start time
    var start = events[i][0],
        end = i < l-1 ? events[i+1][0] : events[i][0] + 999;
    (function(start, end, selector) {
      pop.code({
        start: start,
        end: end,
        onStart: function(o) { $(selector).addClass('highlighted'); },
        onEnd: function(o) { $(selector).removeClass('highlighted'); }
      });
    })(start, end, line_selector.join(','));
  }
}

When called like this:

subtitle('#audio', 'pre', [
  [ 1, [1,2,3]],
  [ 5, [4,5,6]],
  [ 9, [7,8]],
])

… it takes the #audio element and, when the audio plays to 1 second, highlights lines 1, 2 and 3; at 5 seconds, highlights lines 4, 5 and 6; and so on.

Another thing that helped was that my iPad has a much better mic than my laptop, and ClearRecord is a really simple way to create recordings with minimal noise. [Note to self: sampling at 16kHz and saving as a VBR MP3 (45–85kbps) seems the best trade-off.]

With these tools, my time to prepare a tutorial went down from 4 hours to half an hour!


Javascript arrays vs objects

Summary: Arrays are a lot smaller than objects, but only slightly faster on newer browsers.

I’m writing an in-memory Javascript app that handles several thousand rows. Each row could be stored either as an array [1,2,3] or an object {"x":1,"y":2,"z":3}. Having read up on the performance of arrays vs objects, I thought I’d do a few tests on storing numbers from 0 to 1 million. The results for Chrome are below. (Firefox 7 was similar.)
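
The test was roughly the following (a sketch; the exact harness I used may have differed slightly):

var N = 1000000, i;

console.time('array');
var arr = [];
for (i = 0; i < N; i++) arr[i] = i;                     // Array: x[i] = i
console.timeEnd('array');

console.time('object');
var obj = {};
for (i = 0; i < N; i++) obj[i] = i;                     // Object: x[i] = i
console.timeEnd('object');

console.time('object, long keys');
var obj2 = {};
for (i = 0; i < N; i++) obj2['a_long_dummy_testing_string' + i] = i;
console.timeEnd('object, long keys');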

                                                  Time    Size (MB)
Array:  x[i] = i                                  2.44s   8
Object: x[i] = i                                  3.02s   57
Object: x["a_long_dummy_testing_string"+i] = i    4.21s   238

The key lessons for me were:

  • Browsers used to process arrays MUCH faster than objects. This gap has now shrunk.
  • However, arrays are still better: not for their speed, but for their space efficiency.
  • If you’re processing a million rows or less, don’t worry about memory. Stored as an array, a million-row column takes about 8MB, so 1GB of RAM holds around 128 such columns (1024/8 = 128).

Server speed benchmarks

Yesterday, I wrote about node.js being fast. Here are some numbers. I ran Apache Benchmark on the simplest Hello World program possible, testing 10,000 requests with 100 concurrent connections (ab -n 10000 -c 100). These are on my Dell E5400, with lots of applications running, so take them with a pinch of salt.

Server                        Code                          Throughput   Comment
PHP5 on Apache 2.2.6          <?php echo "Hello world" ?>   1,550/sec    Base case. But this isn't too bad
Tornado/Python                See Tornadoweb example        1,900/sec    Over 20% faster
Static HTML on Apache 2.2.6   Hello world                   2,250/sec    Another 20% faster
Static HTML on nginx 0.9.0    Hello world                   2,400/sec    6% faster
node.js 0.4.1                 See nodejs.org example        2,500/sec    Faster than a static file on nginx!
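
For reference, the node.js case was essentially the canonical hello-world server from the nodejs.org front page (reproduced from memory, so details may differ slightly):

var http = require('http');
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});   // plain-text response
  res.end('Hello World\n');
}).listen(8124, '127.0.0.1');
console.log('Server running at http://127.0.0.1:8124/');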

I was definitely NOT expecting this result… but it looks like serving a static file with node.js could be faster than nginx. This might explain why Markup.io is exposing node.js directly, without an nginx or varnish proxy.

Why node.js

I’ve moved from Python to Javascript on the server side – specifically, Tornado to Node.js.

Three years ago, I moved from Perl to Python because I got free hosting at AppEngine. Python’s a cleaner language, but that was not enough to make me move. Free hosting was.

Initially, my apps were on AppEngine, but that wouldn’t work for corporate apps, so I tried Django. IMHO, Django’s too bulky, has too much “magic”, and templates are restrictive. Then I tried Tornado: small; independent modules; easy to learn. I used it for almost 2 years.

The unexpected bonus with Tornado was its event-based model: it wouldn’t wait for file or HTTP requests to complete before serving the next request. I ended up getting a fair bit of performance from a single server.

Trouble is, Python’s a rare skill. I tried selling Python in corporates a couple of times, and barring RBS (which used it before I came in, and made it really easy for me to build an IRR calculator), I’ve failed every time. Apart from general fear, uncertainty and doubt, getting people is tougher.

Javascript’s a good choice. It has many of Python’s benefits. It’s easy to recruit people. Corporates aren’t terrified of it. Rhino was a good enough server. All it lacked was the “cool” factor, which node.js has now brought. And besides,

  • It’s fast. About 20 times faster than Rhino, by my crude benchmarks.
  • It’s stable. (Well, at least, it feels stable. Rock solid stable. Sort of like nginx.)
  • It’s asynchronous. So I don’t miss Tornado.
  • It has a pretty good set of libraries, thanks to everyone jumping on to it.
  • I can write code that works on the client and server – e.g. form validation.

Bye, Python.