S Anand, Author at S Anand

Auto reloading pages

September 30, 2012 / Coding / Leave a Comment

After watching Bret Victor’s Inventing on Principle, I just had to figure out a way of getting live reloading to work. I know about LiveReload, of course, and everything I’ve heard about it is good. But their Windows version is in alpha, and I’m not about to experiment just yet.

This little script does it for me instead:

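(function(interval, location) {
  // Last-Modified value seen on the first successful check
  var lastdate = "";
  function updateIfChanged() {
    // Synchronous HEAD request: fetch only the headers of the current page
    var req = new XMLHttpRequest();
    req.open('HEAD', location.href, false);
    req.send(null);
    var date = req.getResponseHeader('Last-Modified');
    if (!lastdate) {
      lastdate = date;
    }
    else if (lastdate != date) {
      // The file has changed on the server, so reload the page
      location.reload();
    }
  }
  setInterval(updateIfChanged, interval);
})(300, window.location);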

It checks the current page every 300 milliseconds and reloads it if the Last-Modified header has changed. I usually include it as a minified script:

<script>(function(d,c){var b="";setInterval(function(){var a=new
XMLHttpRequest;a.open("HEAD",c.href,false);a.send(null);
a=a.getResponseHeader("Last-Modified");if(b)b!=a&&
c.reload();else b=a},d)})(300,window.location)</script>

There are no dependencies on any library, like jQuery. However, it requires that the file be on a web server. (It’s easy to fix that, but since I always run a local webserver, I’ll let you solve that problem yourself.)
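For readers who want to poke at the change-detection logic outside the browser, here is a rough Python sketch of the same idea. The names `last_modified` and `make_change_checker` are made up for this illustration, and the URL in the comment is an assumption. It mirrors the script's behaviour: remember the first Last-Modified value seen, then report a change when a later value differs.

```python
import urllib.request


def last_modified(url):
    """Send a HEAD request and return the Last-Modified header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Last-Modified")


def make_change_checker(fetch):
    """Wrap fetch() so each call reports whether its value differs from the first one seen."""
    first_seen = []  # holds the first non-empty value, like `lastdate` in the script

    def has_changed():
        current = fetch()
        if not first_seen:
            if current:
                first_seen.append(current)
            return False
        return first_seen[0] != current

    return has_changed


# Hypothetical usage against a local web server (the URL is an assumption):
# check = make_change_checker(lambda: last_modified("http://localhost:8000/index.html"))
# Call check() on a timer, and reload whatever you're viewing when it returns True.
```

Passing the fetch function in as a parameter keeps the comparison logic testable without a server. Python's built-in `python -m http.server` sends a Last-Modified header for static files, so it should work as the local web server here.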


Windows XP virtual machine

September 16, 2012 / Coding / 2 Comments

Here’s the easiest way to set up a Windows XP virtual machine that I could find.

(This is useful if you want to try out programs without installing them on your main machine; test your code on a new machine; or test your website on IE6 / IE7 / IE8.)

Go to the Virtual PC download site. (I tried VirtualBox and VMWare Player. Virtual PC is better if you’re running Windows on Windows.)

If you have Windows 7 Starter or Home, select “Don’t need XP Mode and want VPC only? Download Windows Virtual PC without Windows XP Mode.”
If you have Windows Vista or Windows 7, select “Looking for Virtual PC 2007?”
Download it. (You may have to jump through a few hoops like activation.)
Download Windows XP and run it to extract the files. (It’s a 400MB download.)
Open the “Windows XP.vmc” file – just double-clicking ought to work. At this point, you have a working Windows XP version. (The Administrator password is “Password1”.)
Under Tools – Settings – Networking – Adapter 1, select “Shared Networking (NAT)”

That’s pretty much it. You’ve got a Windows XP machine running inside your other Windows machine.

Update (18 Sep 2012): I noticed something weird. The memory usage of VMWindow and vpc.exe is tiny!

Between the two processes, they take up less than 30MB of memory. This is despite the Windows XP Task Manager inside the virtual machine showing me 170MB of usage. I’ve no clue what’s happening, but am beginning to enjoy virtualisation. I’ll start up a few more machines, and perhaps install a database cluster across them.


Inspecting code in Python

September 4, 2012 / Coding / 2 Comments

Lisp users would laugh, since they have macros, but Python supports some basic code inspection and modification.

Consider the following piece of code:

margin = lambda v: 1 - v['cost'] / v['sales']

What if you wanted another function that lists all the dictionary indices used in the function? That is, you wanted to extract cost and sales?

This is a real-life problem I encountered this morning. I have 100 functions, each defining a metric. For example,

lambda v: v['X'] + v['Y']
lambda v: v['X'] - v['Z']
lambda v: (v['X'] + v['Y']) / v['Z']
…

I had to plot the functions, as well as each of the corresponding elements (‘X’, ‘Y’ and ‘Z’) in the formula.

Two options. One: along with each formula, maintain a list of the elements used. Two: figure it out automatically.

Each function has a func_code attribute. So, when you take

margin = lambda v: 1 - v['cost'] / v['sales']

margin.func_code is a “code object”. This has a bunch of interesting attributes, one of which is co_consts:

>>> margin.func_code.co_consts
(None, 1, 'cost', 'sales')

There — I just pick the strings out of that list and we’re done (for simple functions at least.)
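A minimal sketch of that "pick the strings out" step, written for Python 3, where the attribute is called `__code__` rather than `func_code`. The helper name `keys_used` is made up for this example. As noted above, it only works for simple functions: any other string constant in the function would be picked up too.

```python
def keys_used(func):
    """Return the string constants that appear in a function's compiled code."""
    constants = func.__code__.co_consts  # Python 3 name for func_code
    # Keep only the strings; numbers and None are dropped
    return sorted({c for c in constants if isinstance(c, str)})


margin = lambda v: 1 - v['cost'] / v['sales']
print(keys_used(margin))  # ['cost', 'sales']
```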

Check out http://docs.python.org/reference/datamodel.html and search for func_ — you’ll find a number of interesting things you can do with functions, such as

Finding and changing the default parameters
Accessing the global variables of the namespace where the function was defined (!)
Replacing the function code with new code

Also search for co_ — you’ll find some even more interesting things you can do with the code:

Find all local variable names
Find all constants used in the code
Find the filename and line number where the code was compiled from
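Here is a small sketch of a few of the items listed above, using the Python 3 attribute names (`__defaults__`, `__globals__`, `__code__`) in place of the `func_` ones. The functions `scale` and `triple` are throwaway examples, not anything from the metrics code.

```python
def scale(value, factor=2):
    result = value * factor
    return result


# Finding and changing the default parameters
original_defaults = scale.__defaults__      # (2,)
scale.__defaults__ = (10,)
scaled = scale(3)                           # uses the new default: 3 * 10

# The global namespace where the function was defined
shares_globals = scale.__globals__ is globals()

# Local variable names (arguments first), file name and first line number
local_names = scale.__code__.co_varnames    # ('value', 'factor', 'result')
source_file = scale.__code__.co_filename
first_line = scale.__code__.co_firstlineno


# Replacing the function code with another function's code
def triple(value, factor=2):
    return value * 3


scale.__code__ = triple.__code__
replaced = scale(3)                         # scale now behaves like triple
```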

Python also comes with a disassembly module dis. A look at its source is instructive.


Restartable and Parallel

August 30, 2012 / Coding, Data / 2 Comments

When processing data at a large scale, there are two characteristics that make a huge difference to my life.

Restartability. When something goes wrong, being able to continue from where it stopped. In my opinion, this is more important than parallelism. There’s nothing as depressing as having to start from scratch every time. Think of it as the ability to save a game as opposed to starting from Level 1 in every life.

Parallelism. Being able to run multiple processes in parallel. Often, this is easy. You don’t need threads. Good old UNIX xargs can do a great job of it. Interestingly, I’ve never used Hadoop for any real-life problem. I’ve gotten by with UNIX commands and smart partitioning.
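One simple way to get restartability is to log each finished item and skip logged items on the next run. The sketch below assumes each item has a unique string form and can be processed independently. The names `process_all`, `flaky` and the `done.log` file are invented for the illustration.

```python
import os
import tempfile


def process_all(items, work, checkpoint_path):
    """Run work(item) for each item, skipping items finished in an earlier run."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as log:
            done = {line.strip() for line in log}
    with open(checkpoint_path, "a") as log:
        for item in items:
            key = str(item)
            if key in done:
                continue          # finished in a previous run
            work(item)            # may raise; nothing is recorded for this item
            log.write(key + "\n")
            log.flush()           # record progress before moving on


# Demo: the first run crashes on item 3, the second run resumes from there.
processed = []
attempted = set()


def flaky(item):
    if item == 3 and 3 not in attempted:
        attempted.add(3)
        raise RuntimeError("simulated crash")
    processed.append(item)


checkpoint = os.path.join(tempfile.mkdtemp(), "done.log")
try:
    process_all([1, 2, 3, 4], flaky, checkpoint)
except RuntimeError:
    pass
process_all([1, 2, 3, 4], flaky, checkpoint)  # items 1 and 2 are not redone
```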

The “smart partitioning” bit is important. For example, if you’re dealing with telecom data, most of your metrics (e.g. did the number of calls grow or fall, are there more outgoing or incoming calls, etc.) are calculated on a single mobile number. So if you have multiple data sets, as long as all the data related to one mobile number are on the same system, you’re fine. If you have 100 machines, just split the data based on the last 2 digits of the mobile number. So data about 9012345678 would go to machine 78 (the last two digits). Given a mobile number for any type of data, you’d know exactly which machine would have that data. For all practical purposes, that gives you the basics of a distributed file system.
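The last-two-digits rule can be sketched in a few lines. The function names and the `(mobile_number, payload)` record shape are assumptions made for this example.

```python
from collections import defaultdict


def machine_for(mobile_number, machines=100):
    """Pick a machine from the trailing digits of a mobile number."""
    return int(mobile_number) % machines


def partition(records, machines=100):
    """Group (mobile_number, payload) records by the machine that should hold them."""
    shards = defaultdict(list)
    for number, payload in records:
        shards[machine_for(number, machines)].append((number, payload))
    return shards


print(machine_for("9012345678"))  # 78
```

Because the rule depends only on the number itself, every data set split this way puts a given mobile number on the same machine.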

(I’m not saying you don’t need Hadoop. Just that I haven’t needed it.)


Storytelling: Part 1

August 29, 2012 / Business realities, How I do things / 2 Comments

In a number of sessions I’ve been to, people ask analysts to make their results more interesting – to tell stories with them. I’m co-teaching a course, part of which involves telling stories with data. So this got me thinking: what is a story? How does one teach storytelling to, let’s say, an alien?

Consider this mini-paper.

ABSTRACT: Meter readings exhibit spikes at slab boundaries. We also
find significant evidence of improbable events at round numbers.

Electricity shortage is a serious problem in most Indian states. Part
of this problem is due to the inaccuracy of reporting procedures used
in monitoring meter readings. Our focus here is not to document or
experimentally determine the degree of inaccuracy. We have adopted a
data driven approach to this problem and attempt to model the extent
of inaccuracy using basic statistical analysis techniques such as
histograms and the comparison of means.

Our dataset comprises of the frequency analysis 12-month dataset
containing monthly meter readings of 1.8 million customers in the
State of Andhra Pradesh.

We find that a histogram of these readings shows unexpectedly high
values at the slab boundaries: 50 (+45.342%, t > 13.431), 100
(+55.134%, t > 16.384), 200 (+33.341%, t > 15.232), and 300
(+42.138%, t > 19.958).

We also detected spikes at round numbers: 10 (+15.341%, t > 5.315),
20 (+18.576%, t > 6.152), 30 (+11.341%, t > 4.319).

The statistical significance of every deviation listed above is over
99.9%. Further, every deviation has a positive mantissa. This leads us
to confidently declare the existence of a systematic bias in the meter
readings analysed.

You’re probably thinking: “I know why he’s put this example here. It must be a bad one. So, what a rotten paper it must be!”

Well, not quite. It’s a good piece of analysis. I did it myself and there’s a fair bit of effort and care behind these short paragraphs.

The trouble is, if I read it out to my daughter, she’d say “What?” and not understand a word. My wife’d say “So what?” and not care a bit. I might as well not have written it.

It’s like that Zen thing: If a tree falls in a forest and no one hears it, does it make a sound?

If you did a piece of analysis, and no one understands or cares about it, why did you do it in the first place?

Why do you do it?

That last question is important: why do we analyse?

Sometimes, we do it for fun. The knowledge is beautiful. Knowing Tetris is NP-Complete is rewarding, even though my colleague sarcastically remarked, “Thank God! I’m sooo relieved now that I know that Tetris is NP whatever.” If that’s the case with you, great. Write the analysis any which way you’ll enjoy.

Sometimes, we do it because we’re forced to. In class. At work. Wherever. But that’s another way of saying “I don’t know why I’m doing it.” In that case, I’d gently recommend watching 3 Idiots.

Most often, we do it to share knowledge and drive actions. In that case, if no one understands it, or does anything with it, why do it?

Keep it simple

We prerajulisation of Farhanitate flagellated with ...

Would your audience understand that? Or are you just scared that simple words indicate a simple mind?

I was once afraid. 15 years ago, when writing a paper on IBM India’s competitive advantage for the CXOs, I was worried about it being too simple. I didn’t know anything about management. So I filled it with jargon. They politely nodded when I presented it, but I wasn’t fooling anyone. If there’s no content, jargon doesn’t help.

Unfortunately, it’s become polite to accept jargon as a substitute for substance. Why were they not ripping me apart? Or at least, kindly asking me what on earth I wanted to say?

My friend Manoj did that. In his nice, humble way, he asked, “But Anand, what does this mean?” When I explained it to him, I found I didn’t have a clue. He was OK with that. He just wanted to make sure he hadn’t missed something.

(That’s the technique I use these days. Ask people to explain things clearly. It’s OK if they’re just lost in jargon. I just want to make sure I haven’t missed something.)

Don’t cloak your ignorance. No one will think less of you. In the long run, you’ll learn more, and won’t need the jargon.

Part 2 of the article will talk about focusing on people and actions; storylining and the pyramid principle; and the structure of messages.


Colour spaces

August 27, 2012 / Coding, Visualisation / 1 Comment

In reality, a colour is a combination of light waves with frequencies of roughly 430-750THz (wavelengths of about 400-700nm), just like sound is a combination of sound waves with frequencies from 20-20000Hz. Just like mixing various pure notes produces a new sound, mixing various pure colours (like from a rainbow) produces new colours (like white, which isn’t on the rainbow).

Our eyes aren’t like our ears, though. They have 3 sensors that are triggered differently by different frequencies. The sensors roughly peak around red, green and blue. Roughly.

It turns out that it’s possible to recreate most (not all) colours using a combination of just red, green and blue by mimicking these three sensors to the right level. That’s why TVs and monitors have red, blue and green cells, and we represent colours using hex triplets for RRGGBB – like #00ff00 (green).

There are a number of problems with this from a computational perspective. Conceptually, we think of (R, G, B) as a 3-dimensional cube. That’d mean that 100% red is about as bright as 100% green or blue. Unfortunately, green is a lot brighter than red, which is a lot brighter than blue. Our 3 sensors are not equally sensitive.
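To put rough numbers on "green is a lot brighter than red, which is a lot brighter than blue", the sketch below uses the Rec. 709 / sRGB luminance weights. That choice of weights is an assumption of this example, and the channels are treated as linear light values between 0 and 1.

```python
# Rec. 709 / sRGB weights for linear-light red, green and blue
WEIGHTS = (0.2126, 0.7152, 0.0722)


def relative_luminance(r, g, b):
    """Relative luminance of a linear RGB colour with channels in the range 0..1."""
    return WEIGHTS[0] * r + WEIGHTS[1] * g + WEIGHTS[2] * b


red = relative_luminance(1, 0, 0)
green = relative_luminance(0, 1, 0)
blue = relative_luminance(0, 0, 1)
print(red, green, blue)
```

With these weights, pure green carries over three times the luminance of pure red, and pure red about three times that of pure blue, so the three corners of the RGB cube are far from equally bright.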

You’d also think that a colour that’s numerically mid-way between 2 colours should appear to be mid-way. Far from it.

This means that if you’re picking colours using the RGB model, you’re using something very far from the intuitive human way of perceiving colours.

Which is all very nice, but I’m usually in a rush. So what do I do?

I go to the Microsoft Office colour themes and use a colour picker to pick one. (I extracted them to make life easier.) These are generally good on the eye.
Failing that, I pick something from http://kuler.adobe.com/
Or I go to http://colorbrewer2.org/ and pick a set of colours
If I absolutely have to do things programmatically, I use the HCL colour scheme. The good part is it’s perceptually uniform. The bad part is: not every interpolation is a valid colour.


Style of blogging

August 27, 2012 / How I do things / 1 Comment

Until 2007, my blog was mostly just linking to stuff I found interesting on the Web. Since 2007, I’ve tried to write longer articles, mostly based on my own experiences.

At the moment, that’s unsustainable. Right now, being in a startup, I’m doing more stuff than I ever have in the past. (That does not mean working more hours, by the way.)

My posts, going forward, are likely to be smaller, less original, but hopefully more frequent.


Is Protocol buffers worth it?

August 1, 2012 / Coding / 7 Comments

Google’s Protocol Buffers is a “language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler”

XML is slow and large. There’s no doubting that. JSON’s my default alternative, though it’s a bit large. CSV’s ideal for tabular data, but ragged hierarchies are a bit difficult.

I was trying to see if Protocol Buffers would be smaller and faster, at least when using Python. I took JSON as the base, and checked the write speed, read speed and file sizes. Here’s the comparison:

Protocol Buffers are 17 times slower to write and almost 6 times slower to read than JSON files. File sizes are smaller, but then, all it takes is a simple gzip operation to compress the JSON files even smaller. Reading json.gz files is just 2% slower than JSON files, and writing them is only 4 times slower.

The code base is at https://bitbucket.org/sanand0/protobuftest

On the whole, it appears that GZipped JSON files are smaller, faster, and just as simple as Protocol Buffers. What am I missing?
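For reference, reading and writing GZipped JSON takes only the standard library. This is a minimal sketch, not the benchmark code from the repository above. The sample records are made up, and the timings and ratios quoted in this post are not reproduced here.

```python
import gzip
import json
import os
import tempfile

# Made-up sample records, only to have something to serialise
records = [{"id": i, "name": "item %d" % i, "tags": ["x", "y"]} for i in range(1000)]


def write_json(path, data):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)


def write_json_gz(path, data):
    # "wt" opens the gzip stream in text mode so json.dump can write str
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(data, f)


def read_json_gz(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


folder = tempfile.mkdtemp()
plain_path = os.path.join(folder, "data.json")
gz_path = os.path.join(folder, "data.json.gz")
write_json(plain_path, records)
write_json_gz(gz_path, records)

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)
print(plain_size, gz_size)
```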

Update: When you add GZipped CSV to the mix, it’s twice as fast as GZipped JSON to read: clearly a huge win. It’s only slightly slower to write, and compresses a tiny bit more than JSON.


Audio data URI

July 17, 2012 / Coding / 1 Comment

Turns out that you can use data URIs in the <audio> tag.

Just upload an MP3 file to http://dataurl.net/#dataurlmaker and you’ll get a long string starting with data:audio/mp3;base64...

Insert this into your HTML:

<audio controls src="data:audio/mp3;base64..."></audio>

That’s it – the entire MP3 file is embedded into your HTML page without requiring additional downloads.

This takes a bit more bandwidth than the MP3, and won’t work on Internet Explorer. But for modern browsers, and small audio files, it reduces the overall load time – sort of like CSS sprites.
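If you'd rather build the data URI yourself than upload the file, a sketch in Python could look like this. The helper names `data_uri` and `audio_tag` are invented for the example, and the `audio/mp3` type simply follows the string shown above.

```python
import base64


def data_uri(data, mime="audio/mp3"):
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return "data:%s;base64,%s" % (mime, encoded)


def audio_tag(data, mime="audio/mp3"):
    """Wrap the data URI in an <audio> element."""
    return '<audio controls src="%s"></audio>' % data_uri(data, mime)


# Hypothetical usage with a local file (the file name is an assumption):
# with open("note.mp3", "rb") as f:
#     html = audio_tag(f.read())
```

Base64 turns every 3 bytes into 4 characters, which is where the extra bandwidth comes from: the embedded string is about a third larger than the original file.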

So, on my bus ride today, I built a little HTML5 musical keyboard that generates data URIs on the fly. Click to play.


Recent Tamil Songs Quiz

May 21, 2012 / Quizzes / 42 Comments

After a long break, here’s another quiz, featuring relatively recent Tamil songs. Can you guess which movie they are from?

Don’t worry about the spelling. Just spell it like it sounds, and the box will turn green.


Author name: S Anand