Data

How to direct a data movie

Ganes and I created a data movie on speed-cubing records as part of a Gramener hackathon.

Here’s a video of us talking about how we created it.

Anand: We picked the Rubik’s cube story for this hackathon. Tell me more about how this excited you.

Ganes: Since my son started solving the Rubik’s cube a few months back, I’ve been fascinated with these competitions. I still don’t know how to solve it, but I like watching it.

Anand: But he does?

Ganes: Yeah, he does. So, in the competitions, I’ve seen kids solving the Rubik’s cube in under 10 seconds. That was the first source of amazement. I’ve seen kids doing it with one hand, blindfolded. At first, I couldn’t believe it. Doing it with their legs. So that got me really interested.

When we were talking about this and I was sharing my amazement, we were also talking about the hackathon, and the conversations kind of merged. So I think the curiosity around it led to picking this as the story.

Anand: And what was the next step?

Ganes: I have always seen the World Cube Association publishing these records. Their website is great. So I thought maybe we could scrape from that, and that’s when I started looking at the website and the competitions we could pick. And then I stumbled on the export feature, where they have multiple formats neatly curated that you can take and directly start the analysis.

Anand: Which was actually a big factor in deciding to go for this. Big data set. Very rich, interesting possibilities.

Ganes: So we had some five or six ideas. This immediately shot up to the top. So after we got the idea, you kind of took over. I think after I mentioned that all these formats were available, it got you excited. So what did you do after that?

Anand: Then it became a question of what interesting things we could find. It’s almost exploratory data analysis, but my approach to EDA is: let’s formulate the hypotheses, then validate them, and see if there is an interesting story behind them.

So it begins with, for instance, the speed at which records have been broken. Today, it’s at 3½ seconds. We know that. But how fast did it fall? Or: what’s the spread of solving-speed for somebody who solves it fast? Does the same person solve it really fast sometimes and really slow sometimes? Is there a movement in their average? You said, “Let’s see how much longer it takes to solve bigger cubes.” Nikhil was going to take the demographics of solvers and see how they’re spread out. There are definitely a lot of Chinese solvers in the mix. So, the thing was, let’s look at possible ideas that could lead to an interesting answer, and then validate those.

Ganes: It was almost like “What would we be interested in finding out?” and not necessarily just looking at the columns of data.

Anand: Yes. And that I think is important, because, from the data, there may be some ideas. But after absorbing it, knowing what’s interesting is what should drive the story.

Ganes: Right. Yeah. So that was a good starting point. We listed all of these on the board. Then, what did you do next?

Anand: Then it’s about proving these. So, we know here are some possible interesting stories; let us explore and validate whether these are, in fact, interesting, or can be turned into something interesting. So, when I looked at the speed at which records were broken, for instance, I thought that would be an interesting story. But it wasn’t. It was just getting broken at a steady pace.

But something that I did not expect emerged, which is that Yusheng Du, who holds the world record, is not the person who has been in the records consistently. In fact, Feliks Zemdegs has been the consistent winner for the last 10 years and is the only cubing champion who’s won the WCA World Championship twice. So, that was something that emerged from doing the analysis. So, the analysis has the ability, therefore, of both proving what we’re looking to prove (or disproving it), and also of turning up new stuff that we can choose to incorporate into the story.

Ganes: Almost like starting with a business hypothesis, or what, in the enterprise world, the business wants to know, and then once you get into the data, the data is revealing a few interesting insights, and then you kind of marry both. Looks just like that.

Anand: Exactly. Exactly.

Ganes: So, we identified the insights. And then, the target here was to come up with a 2-minute video. So how did you plan from insights to the video?

Anand: So, one of my cousins is a director, and she tried explaining to me the concept of a screenplay. I never really understood it, even though I’ve read a number of screenplays. So, in the last hackathon, when I was creating a (data) movie, that’s when I realised: as I started writing what I wanted to shoot (because it requires a whole lot of planning), I was effectively writing a screenplay.

The steps are, basically, you have to decide what are the frames or the sequences you want to shoot. So, one sequence was: we want to introduce this Rubik’s cube win. Another sequence was: we want to show how quickly different types of cubes can be solved, etc.

So, for each of these, what I do is: create a storyline that has the following structure. One: what is the message I want people to take away from that.

Ganes: The headline from there.

Anand: Exactly.

And then, in order to do that, what are the words I would narrate on top of it? That literally forms the dialogue. The third thing is, what are the visuals that prove the dialogue. That I structure in the form of a video. The fourth thing is the transition — from one video to another, or from one sequence to another, how do I flow. These are the 4 things that I captured.

Then I write down the full dialogue, speak it out with a timer running, and say, “OK, this took 10 seconds, this took 15 seconds, this took 14 seconds,” and so on.

Then comes the process of recording (the audio). Assembling the visuals, yes, but timing and sequencing them based on the recording is pretty critical. So, actually, I wanted your voice – it’s better. And initially, I wanted you to do the recording, but because you were busy in the Dell workshop, I had to do the recording myself to make sure I got the timing. Then you re-recorded after that.

That recording makes a huge difference. The audio quality on my iPhone is better than on the laptop. I transfer it via Dropbox onto the system.

Ganes: Were there some issues because you have some insights and you have a certain sequence, but it may not add up to 2 minutes? Or there might be something which will just not flow? How do you correct those issues?

Anand: I found that I consistently underestimate (the time). I thought that we only had material for 1½ minutes, but I knew at that point that, because of this bloat, it would invariably add up to 2 minutes. Which is exactly what happened. It came to 2 minutes 4 seconds.

Ganes: Yes. Exactly. Yeah.

Anand: So, once you’ve done it once or twice, that amount of correction is there. It’s in fact a whole lot easier to control a video than something as crazy as a (software) program, for instance. The estimation error in programming is much higher than this.

The good part is that post production or editing can take care of a lot of stuff. That 2-minute video can be cut to 1½ if required.

Ganes: Yeah, it can be improved, but my biggest fear is that after recording, the post-production is a nightmare. It takes hours and hours of effort. Post-producing a five-minute video probably takes 2 hours.

Anand: That is true.

Ganes: How do you go about it? After having these audio clips, videos and images, how do you stitch them all together into a video?

Anand: My workflow is on PowerPoint, mostly, and then on Windows Video Editor. And then you introduced iMovie into the mix.

PowerPoint makes it fairly simple. I can put in audio in the background. I can handle the animations. It’s not a great tool at all, but it’s a tool I’m very familiar with. So, my workflow is: one slide is one shot or one headline in the storyline. Then I record the video independently or download it from YouTube, and put it in the background or wherever. I create all the visuals, create the animations around them, and put them in. At this point, the raw material is in. Then I insert the audio and let it play in the background for that particular slide. Then I time the animation to the audio.

This is a slow process because PowerPoint doesn’t have the right tools. So I play the audio till a point and then set the animation. Then I start from the beginning again, play the audio to the next point, and then set that animation. Which takes a long time. But once that’s sorted out, I play the full slide, and if it doesn’t quite work out, I go back and correct it.

The good part is that the audio is the timekeeper. I pre-recorded the audio, so I knew that the entire duration was only going to be 1.8 minutes (and then towards the end we added a few more videos that took it to 2 minutes). So the audio keeps you in control, and if you synchronize everything to the audio, then it becomes easier.

Then I exported it into a video file from PowerPoint directly, and then did a little bit of post-processing on Windows Video Editor – mostly adding background music and a few captions – and then gave it to you. Which was at around 9 o’clock or so. What did you do from 9 o’clock to 3 o’clock?

Ganes: So, the first thing – I couldn’t believe that you’d done all this on PowerPoint. You’re taking the tool beyond what it was designed for.

I’ve been working with iMovie for a year, and I find it very powerful. For someone who doesn’t come from that background, it was very easy to pick up. I had the images and raw video footage for the different portions we were trying to introduce. I was able to split the audio that you recorded from the video, and then record mine and add it. iMovie has these multiple streams you can insert and remove. I had one stream for my voice-over audio. And there was the video which you had.

On top of that, I could overlay the pictures and other videos that I had towards the end – two videos playing side by side. So all of that was possible. And then I could also introduce background music at the very end. iMovie makes it very easy to move all of these things around. And even the synchronization issue you mentioned is much easier to resolve in iMovie.

So, all of this finally came together, I think, at 3 o’clock… when I had all of it, I was hunting for the background music (laughs). I was playing all kinds of clips and finally I chose one. So that’s how we got the final YouTube video.

Anand: My lesson from this is: make sure you have a team member who has a Mac!

Ganes: Right, yeah. So let’s go back and look at our video and see what we can learn from it. Thank you!

2 inches will change my life

I walked ~11 million steps in the last 3 years, at ~10K steps daily.

Since 1 Jan 2018, I steadily increased my walking average until Aug 2018. Then my legs started aching. So I cut it down until Jan 2019. In Feb, I resumed and was fairly steady until May 2020.

In May, my wife refused to let me walk for more than an hour a day. It took me a few months to convince her and level up. I ended 2020 averaging a little over 10K steps for the year.

I’m becoming more regular: I hit 10K steps a day 15 percentage points more often in 2020 than in 2018.

2018: I walked 10K steps almost half the time.
2019: it grew a bit more, to 56%.
2020: I walked 10K steps a day almost two-thirds of the time.

But in May 2020, I went for 5 days without walking even 3K steps.

In 2018, I started being more and more regular until my leg started aching.
2019 was fairly consistent.
2020 is when I applied the brakes again — for very different reasons.

Since 2018, I’d never before gone 5 days without walking even 3K/day. At most, it was 3 days at a stretch.

But when my wife refused to let me walk for more than an hour a day in May 2020, I went on strike! 😉

I walk ~77 min daily. This has increased over the years.

In 2020, this went up slightly to 84 min — but that’s still under an hour and a half. I spend most of this time on calls or listening to audiobooks / podcasts.
Instead of spending it with my family.

Sometimes, I lose myself in calls and walk for almost 3 hrs and 20K steps.

Naveen is usually to blame. But this happens rarely. I walked 20K steps just 6 times over the last 3 years.

Though the longest walk here indicates over 3 hrs, I’ve never walked 3 hrs in a day.

On 21 Nov, my daughter borrowed my phone and went for her walk. So my phone shows our combined walks, not mine. Many of the other long walks are spread out during the day when I commute by walking in Singapore.

#  Date       hrs   km    Why?
1  21-Nov-20  3.46  15.5  My daughter took my phone. These are her + my walking stats.
2  15-Nov-19  2.98  11.5  Walked to meetings in Singapore.
3  17-Sep-19  2.96  10.7  Walked to meetings in Singapore.
4  11-Jul-20  2.89  13.9  Was talking to Pratap & Ganes.
5  15-Oct-18  2.83  9.5   Walked to meetings in Singapore.
6  03-Sep-20  2.82  13.0  Was talking to Naveen & my coach.

I want to walk faster. I walk at ~4.4 km/hr. My target is 5 km/hr.

Walking at over 5 km/hr speeds the heart up and improves metabolism. (Or so I’ve heard.)

I was steadily going towards 5 km/hr in my early days of walking. I slowed down starting Aug 2018, since my legs were aching. Then I picked up speed at the end of 2018.

I slowed down again in Nov 2019 — and I don’t remember why.

In Jun 2020, I started walking much faster — mainly to complete 10K steps within the hour my wife gave me. That seems to have had a lasting impact. I walked faster overall in 2020.

I’ve managed fast walking (over 5 km/hr) 66 times in 2020, a bit more than before.

In Jun 2020, I walked at over 5 km/hr on 20 out of 30 days — a very consistent high speed. I’ve never gotten close to this in any other month.
(Clearly, there are adverse effects of being able to convince my wife.)

The fastest I walked was in 2018, at 6.8 km/hr. It might have led to my leg aches.

Four of my top 5 walking speeds were in 2018. In 2020, I’ve managed to walk faster than 6 km/hr just once.

Fastest days  km/hr
07-Jun-2018   6.80
05-Jan-2019   6.65
16-Mar-2018   6.34
08-Jun-2018   6.31
06-Feb-2018   6.19
05-Jun-2020   6.02

The normal stride/height ratio is 0.43. I’m 5’8″, i.e. 68″. My stride is 2.4 ft, or about 29″. That’s almost exactly 0.43 times my height (0.43 × 68″ ≈ 29.2″). So all is well.

By increasing my stride by 2 inches, I can cover 10,000 steps in 8 min less time.

For every inch I lengthen my stride, I walk ~0.2 km/hr faster.

I’ve walked with a stride as long as 32″, which is 3″ more than my 2020 average stride. By walking with a 2″ longer stride, I can be 9.2% faster.

So in 2021, I plan to get healthier (and scolded less) with a 2″ longer stride.

A longer stride means a faster walk. That’s a good cardio exercise.
A faster walk also means that it takes less time. So I’ll get beaten up less.
All it takes is stretching my legs 2″ more. Might hurt a bit. I’ll report on this when I know better.

                         Now   New   Change  Benefit
Longer stride            29″   31″   2″      Builds character?
Faster walk (km/hr)      4.5   5.0   0.5     Better cardio exercise
Time to 10K steps (min)  84    77    -8      Less scolding from wife
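
A quick back-of-the-envelope check of this plan in Python, assuming my cadence (steps per minute) stays constant and only the stride lengthens. The numbers land roughly, though not exactly, on the table’s targets:

# Constant-cadence sketch: lengthening the stride scales speed up
# and time-to-10K-steps down by the same factor.
old_stride, new_stride = 29, 31      # inches
speed_kmph, mins_to_10k = 4.5, 84    # today's numbers

scale = new_stride / old_stride      # ~1.07: each step covers ~7% more ground
print(round(speed_kmph * scale, 1))  # ~4.8 km/hr, near the 5.0 target
print(round(mins_to_10k / scale))    # ~79 min, near the 77 in the table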

PostScript: This analysis was done in Excel. Download the sheet below.

Mystery of the extra returns

This month, I sold half my Indian equity mutual funds and was researching funds to invest in. I was looking for something safe & long term.

As I was exploring 10-year Gilt Funds (mutual funds that invest in the Indian Government’s 10-year bond), I noticed that they had a pretty high yield — mostly over 10%.

I took a closer look at ICICI Prudential’s Constant Maturity Gilt Fund. (They had the lowest expense ratio.) The annualized return over the last 5 years was 10.77%, and it never fell below 10% in that period.

But the strange thing is that the underlying 10-year bond always yielded less than 10% in the last 10 years.

So, how does a mutual fund that buys only one bond yield more than that bond’s ever done in the last 10 years?

(I went ahead and invested. But this is going to worry me to no end.)
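
One candidate explanation, as a hedged sketch with illustrative numbers (an 8% coupon bond, not the fund’s actual holdings): when market yields fall, a bond the fund already holds rises in price, so a year’s return is the coupon plus a capital gain, and that can beat anything the bond itself ever yielded.

def price(coupon, ytm, years, face=100):
    # Price of an annual-coupon bond at a given yield to maturity (ytm)
    pv_coupons = sum(coupon / (1 + ytm) ** t for t in range(1, years + 1))
    return pv_coupons + face / (1 + ytm) ** years

buy = price(8, 0.08, 10)          # a 10-year 8% bond bought at an 8% yield: 100.0
year_on = price(8, 0.07, 9)       # a year later, yields have fallen to 7%: ~106.5
print((year_on - buy + 8) / buy)  # ~0.145, i.e. a 14.5% return from an 8% bond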

Restartable and Parallel

When processing data at a large scale, there are two characteristics that make a huge difference to my life.

Restartability. When something goes wrong, being able to continue from where it stopped. In my opinion, this is more important than parallelism. There’s nothing as depressing as having to start from scratch every time. Think of it as the ability to save a game as opposed to starting from Level 1 in every life.
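
As a sketch, restartability can be as simple as a checkpoint file that records every finished item, so a rerun skips them. (The filenames and process() here are hypothetical stand-ins.)

import os

def process(item):
    ...  # stand-in for the real per-item work

# Items finished in earlier runs
done = set()
if os.path.exists("done.txt"):
    done = set(open("done.txt").read().split())

with open("done.txt", "a") as checkpoint:
    for item in open("items.txt"):
        item = item.strip()
        if item in done:
            continue                   # "saved game": skip completed work
        process(item)
        checkpoint.write(item + "\n")  # save progress after every item
        checkpoint.flush()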

Parallelism. Being able to run multiple processes in parallel. Often, this is easy. You don’t need threads. Good old UNIX xargs can do a great job of it. Interestingly, I’ve never used Hadoop for any real-life problem. I’ve gotten by with UNIX commands and smart partitioning.

The “smart partitioning” bit is important. For example, if you’re dealing with telecom data, most of your metrics (e.g. did the number of calls grow or fall, are there more outgoing or incoming calls, etc.) are calculated per mobile number. So if you have multiple data sets, as long as all the data related to one mobile number is on the same system, you’re fine. If you have 100 machines, just split the data based on the last 2 digits of the mobile number. So data about 9012345678 would go to machine 78 (the last two digits). Given a mobile number, for any type of data, you’d know exactly which machine has that data. For all practical purposes, that gives you the basics of a distributed file system.
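
In code, that routing rule is a one-liner (machine_for is just an illustrative name):

def machine_for(mobile_number):
    # The last 2 digits pick one of the 100 machines. Every record for a
    # given number lands on the same machine, so per-number metrics never
    # need data from any other machine.
    return int(str(mobile_number)[-2:])

machine_for(9012345678)  # -> 78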

(I’m not saying you don’t need Hadoop. Just that I haven’t needed it.)

Faster data crunching

I’ve been playing with big data lately.

The good part is, it’s easy to get interesting results. The data is so unwieldy that even an average-value calculation provokes an “Amazing! I didn’t know that” response. (No exaggeration. I heard this from two separate ~$1bn businesses this month.)

The bad part is that calculating even that simple average is slow.

For example, take this 40MB file (380MB unzipped) and extract the first column.

The simplest Python script to get the first column looks like this:

import csv, fileinput

# Print the first column of a tab-delimited file
for row in csv.reader(fileinput.input(), delimiter='\t'):
    if row: print(row[0])

That took a good 3 minutes to execute on my laptop.

Since I’m used to UNIX data processing, I tried cut -f1. Weirdly, that’s worse: 5 minutes. Paradoxically, awk '{print $1}' takes only 17 seconds. That’s about 11 times faster. Clearly the tool makes a big difference. And we always knew UNIX was fast.

But I also ran these on an Amazon EC2 server, and a Hostgator server. Here’re the results.

               python      cut           awk
My Dell E5400  3:04 (1x)   5:42 (0.5x)   0:17 (11x)
EC2 standard   0:33 (6x)   0:05.6 (33x)  0:16 (11x)
Hostgator      0:19 (10x)  0:02.5 (74x)  0:00.7 (265x)

What took 3 minutes with Python on my Dell E5400 took less than a second on Hostgator’s server with awk. Over 250 times faster. (Not 250%. 250 times.)

And it’s not just hardware. A good tool (awk) made things 11x faster on my machine. Good hardware (Hostgator) made the same program 10x faster. But choosing the right combination can make things go faster than 11 × 10 = 110 times. Much faster.

There are a few things I’m taking away from this.

  1. Good hardware can speed you up as much as (or more than) choosing the right tool.
  2. Good hardware can be rented. From many places. Cheaply.
  3. Always test what’s fast. awk’s fastest on my machine and on Hostgator, but not on EC2. (See the timing sketch below.)
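
Here’s a minimal timing harness for that. (data.tsv and first_col.py are hypothetical stand-ins for the file and the Python script above.)

import subprocess, time

for cmd in ["python first_col.py < data.tsv",
            "cut -f1 data.tsv",
            "awk '{print $1}' data.tsv"]:
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, stdout=subprocess.DEVNULL, check=True)
    print(f"{cmd!r}: {time.perf_counter() - start:.2f}s")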

India district map

I put together a district map of India in SVG this weekend.

So what?

You can now plot data available at a district level on a map, like the temperature in India over the last century (via IndiaWaterPortal). The rows are years (1901, 1911, … 2001) and the columns are months (Jan, Feb, … Dec). Red is hot, green is cold.

[image: district-level temperature map of India]

(Yeah, the west coast is a great place to live in, but I probably need to look into the rainfall.)

districts.svg has 640 districts (I’ve no idea what the 641st looks like) and is tagged with the State and District names as titles:

<g title="Madhya Pradesh">
  <path title="Alirajpur" d="..." />
  <path title="Jhabua" d="..." />
  ...
</g>
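
Since each path carries its district name as a title, a sketch like this could paint district-level data onto the map. (The data values and filenames are made up for illustration; it assumes the file uses the standard SVG namespace.)

import xml.etree.ElementTree as ET

data = {"Alirajpur": 1.2, "Jhabua": -0.4}  # hypothetical values per district

def shade(value):
    return "#cc0000" if value > 0 else "#00aa44"  # red = hot, green = cold

tree = ET.parse("districts.svg")
svg = "{http://www.w3.org/2000/svg}"
for path in tree.iter(svg + "path"):
    district = path.get("title")
    if district in data:
        path.set("fill", shade(data[district]))
tree.write("districts-coloured.svg")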

How?

I made it from the 2011 census map (0.4MB PDF). I opened it in Inkscape, removed the labels, added a layer for the districts, and used the paint bucket to fill each district’s area. I then saved the districts layer, cleaning it up a bit. Then I labelled each district with a title. (Seemed like the easiest way to get this done.)

Thanks to @planemad, @gkjohn, @arjunram for inputs. Play around. Feedback welcome.

What does India search for?

Over the last couple of years, I’ve been tracking the top 5 hot searches in India on Google Trends (http://www.google.co.in/trends). Here are the results:

If you’re interested in making visualisations out of it, please feel free. But there’s one particular thing I’m trying out, which is to categorise these searches and see if there’s a trend around that. I’ve added a “Tag” column.

Could you please help me tag the spreadsheet: https://spreadsheets.google.com/ccc?key=0Av599tR_jVYgdE5zTU5QWjcxVWVCaTBuY3d0NkUtc1E&hl=en_GB

It’s publicly editable, no special access required. If you could stick to the tags I already have (Business, Education, Entertainment, News, Politics, Sports, Technology), that would be great. If not, that’s fine as well.

And if you’ve made any visualisations or done any analysis using this data, please do drop a comment.

Shortening sentences

When writing Mixamail, I wanted tweets automatically shortened to 140 characters – but in the most readable manner.

Some steps are obvious. Removing redundant spaces, for example. And URL shortening. I use bit.ly because it has an API. I’ll switch to Goo.gl once theirs is out.

I tried a few more strategies (a code sketch follows the list):

  1. Replace words with short forms. “u” for “you”, “&” for and, etc.
  2. Remove articles – a, an, the
  3. Remove optional punctuation – comma, semicolon, colon and quotes, in particular
  4. Replace “one” with “1”, “to” or “too” with 2, etc. “Before” becomes “Be4”, for example
  5. Remove spaces after punctuations. So “a, b” becomes “a,b” – the space after the comma is removed
  6. Remove vowels in the middle. nglsh s lgbl wtht vwls.
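
Here’s a sketch of how a few of these rules might chain together (illustrative only, not Mixamail’s actual code). It applies the gentlest rule first and tries the next only if the text still doesn’t fit:

import re

def shorten(text, limit=140):
    rules = [
        (r"\s+", " "),                        # remove redundant spaces
        (r"\byou\b", "u"),                    # 1: short forms
        (r"\band\b", "&"),
        (r"\b(a|an|the) ", ""),               # 2: drop articles
        (r'[;:,"]', ""),                      # 3: optional punctuation
        (r"([.!?]) ", r"\1"),                 # 5: no space after punctuation
        (r"(?<=[a-z])[aeiou](?=[a-z])", ""),  # 6: strip vowels in the middle
    ]
    for pattern, repl in rules:
        if len(text) <= limit:
            break                             # readable enough; stop early
        text = re.sub(pattern, repl, text)
    return text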

How did they pan out? I tested these out on the English sentences in the Tanaka Corpus, which has about 150,000 sentences. (No, they’re not typical tweets, but hey…) Applying each rule independently, here is the percentage reduction in the size of the text:

2.0% Remove optional punctuations – comma, semicolon, colon and quotes
2.2% Remove spaces after punctuations. So “a, b” becomes “a,b”
3.3% Replace words with short forms. “u” for “you”, “&” for and, etc.
3.3% Replace “one” with “1”, “to” or “too” with 2, etc.
6.7% Remove articles – a, an, the
18.2% Remove vowels in the middle

Touching punctuation doesn’t have much impact. There isn’t that much of it anyway. Word substitution helps, but not too much. I could’ve gone in for a wider base, but the key is the last one: removing vowels in the middle kills a whopping 18%! That’s tough to beat with any strategy. So I decided to just stop there.

The overall reduction, applying all of the above, is about 22%. So there’s a decent chance you can type in a 180-character tweet, and Mixamail.com will still tweet it intelligibly.

I had one such tweet a few days ago. I try and stay well within 140, but this one was just too long.

The Lesson: If you’re writing an app (or building anything), find a use for yourself. There’s no better motivation — and it won’t ever be a wasted effort.

That was 156 characters. It got shortened to:

Lesson If u’re writing app (or building anything) find use 4 yourself. There’s no better motivation — & it won’t ever be wasted ef4t.

Perfectly acceptable.

You may notice that Mixamail didn’t have to employ vowel shortening. It makes the most readable shortenings first, checks if it’s within 140, and tries the next only if required.

If anyone has a simple, readable way of shortening Tweets further, please let me know!

Bayes’ Theorem

I’ve tried understanding Bayes’ Theorem several times. I’ve always managed to get confused. Specifically, I’ve always wondered why it’s better than simply using the average estimate from the past. So here’s a little attempt to jog my memory the next time I forget.

Q: A coin shows 5 heads when tossed 10 times. What’s the probability of a heads?
A: It’s not 0.5. That’s just the most likely estimate. The probability distribution is actually:

[plot: dbeta(x, 6, 6)]

That’s because you don’t really know the probability with which the coin will throw a heads. It could be any number p. So let’s say we have a probability distribution for it, f(p).

Initially, you don’t know what this probability distribution is. So assume every p is equally likely – a flat function: f(p) = 1.

[plot: dbeta(x, 1, 1)]

Now, given this, let’s say a heads falls on the next toss. What’s the revised probability distribution? It’s:

f(p) ← f(p) × P(heads | p) / P(heads) ∝ 1 × p^1 (1-p)^0 = p

[plot: dbeta(x, 2, 1)]

Let’s say the next is again a heads. Now it’s

f(p) ← f(p) × P(heads | p) / P(heads) ∝ p × p^1 (1-p)^0 = p^2

[plot: dbeta(x, 3, 1)]

Now if it’s a tails, it becomes:

f(p) ← f(p) × P(tails | p) / P(tails) ∝ p^2 × p^0 (1-p)^1 = p^2 (1-p)

[plot: dbeta(x, 3, 2)]

… and so on. (This happens to be called a Beta distribution.)

Now, instead of the probability of heads, this could be the probability of a person having high blood pressure, or of a document being spam. As you get more data, the probability distribution of the probability keeps getting revised.
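
The same updating, as a short sketch (using scipy, which the derivation above doesn’t depend on): starting from a flat Beta(1, 1) prior, h heads and t tails leave a Beta(1 + h, 1 + t) posterior.

from scipy.stats import beta

def posterior(h, t):
    # Flat Beta(1, 1) prior + h heads and t tails -> Beta(1 + h, 1 + t)
    return beta(1 + h, 1 + t)

coin = posterior(5, 5)      # the coin above: 5 heads in 10 tosses
print(coin.mean())          # 0.5, the single best estimate
print(coin.interval(0.95))  # ~(0.23, 0.77): still quite uncertain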

R scatterplots

I was browsing through Beautiful Data, and stumbled upon this gem of a visualisation.

[image: R’s default scatterplot matrix]

This is the default plot R provides when supplied with a table of data. A beautiful use of small multiples. Each box is a scatterplot of a pair of variables. The diagonal is used to label the rows. It shows for every pair of variables their correlation and spread – at a glance.

Whenever I get any new piece of data, this is going to be the very first thing I do:

plot(data)
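
(A rough Python near-equivalent, for the pandas-inclined; the filename is a hypothetical stand-in:)

import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Scatterplot matrix of every pair of columns, with histograms
# labelling the diagonal -- the same small-multiples idea as R's plot()
df = pd.read_csv("data.csv")
scatter_matrix(df)
plt.show()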