S Anand

Protect static files on Apache with OpenID

I moved from static HTML pages to web applications and back to static HTML files. There’s a lot to be said for the simplicity and portability of a bunch of files. Static site generators like Jekyll are increasingly popular; I’ve built a simple publisher that I use extensively.

Web apps give you something else, though, that are still useful on a static site. Access control. I’ve been resorting to htpasswd to protect static files, and it’s far from optimal. I don’t want to know or manage users’ passwords. I don’t want them to remember a new ID. I just want to allow specific people to log in via their Google Accounts. (OpenID is too confusing, and most people use Google anyway.)

The easiest option would be to use Google AppEngine. But their new pricing worries me. Hosting on EC2 is expensive in the long run. All my hosting is now out of a shared Hostgator server that offers Apache and PHP.

So, obviously, I wrote a library protects static files on Apache/PHP using OpenID.

Download the code

 

Say you want to protect /home/www which is accessible at http://example.com/.

  1. Copy .htaccess and _auth/ under /home/www.
  2. In .htaccess, change RewriteBase to /
  3. In _auth/, copy config.sample.php into config.php, and
    1. change $AUTH_PATH to http://example.com/
    2. add permitted email IDs to function allow()

Now, when you visit http://example.com, you’ll be taken to Google’s login page. Once you log in, if your email ID is allowed , you’ll be able to see the file.

Feel free to try, or fork the code.

Codecasting

The best way to explain code to a group of people is by walking through it. If they’re far away in space or time, then a video is the next best thing. You can recommend them to try out the best coding apps as well.

The trouble with videos, though, is that they’re big. I can’t host them on my server – I’d need YouTube. Editing them is tough. You can’t copy & paste code from videos. And so on.

One interesting alternative is to use presentations with audio. Slideshare, for instance, lets you share slides and sync it with audio. That almost works. But it’s still not good enough. I’d like code to be stored as code.

What I really need is codecasting: a YouTube or Slideshare for code. The closest I’ve seen until day-before was etherpad or ttyrec – but neither support audio.

Enter Popcorn. It’s a Javascript library from Mozilla that, among other things, can fire events when an audio/video element reaches a particular point.

Watch a demo of how I used it for codecasting

A look at the code will show you that I’m using two libraries: SyntaxHighlighter to highlight the code, and Popcorn. The meat of the code I’ve written is in this subtitle function.

function subtitle(media_node, pre_node, events) {
  var pop = Popcorn(media_node);
  for (var i=0, l=events.length; i<l; i++)="" {="" for="" (var="" j="0," line_selector="[]," line_no;="" line_no="events[i][1][j];" j++)="" line_selector.push(pre_node="" +="" '="" .number'="" line_no)="" }="" var="" start="events[i][0]" ,="" end="i<l-1" ?="" events[i+1][0]="" :="" events[i][0]+999;="" (function(start,="" end,="" selector)="" pop.code({start:="" start,="" end:end,="" onstart:="" function(o)="" $(selector).addclass('highlighted');="" },="" onend:="" $(selector).removeclass('highlighted');="" })="" })(start,="" line_selector.join(','));="" }<="" pre="">

When called like this:

subtitle('#audio', 'pre', [
  [ 1, [1,2,3]],
  [ 5, [4,5,6]],
  [ 9, [7,8]],
])

… it takes the #audio element, when it plays to 1 second, highlights lines 1,2,3; at 5 seconds, highlights lines 4,5,6; and so on.

Another thing that helped was that my iPad has a much better mic than my laptop, and ClearRecord is a really simple way to create recordings with minimal noise. [Note to self: sampling at 16KHz and saving as a VBR MP3 (45-85kbps) seems the best trade-off.]

With these tools, my time to prepare a tutorial went down from 4 hours to half an hour!

</l;>

Javascript arrays vs objects

Summary: Arrays are a lot smaller than objects, but only slightly faster on newer browsers.

I’m writing an in-memory Javascript app that handles several thousand rows. Each row could be stored either as an array [1,2,3] or an object {"x":1,"y":2,"z":3}. Having read up on the performance of arrays vs objects, I thought I’d do a few tests on storing numbers from 0 to 1 million. The results for Chrome are below. (Firefox 7 was similar.)

  Time Size (MB)
Array: x[i] = i 2.44s 8
Object: x[i] = i 3.02s 57
Object: x["a_long_dummy_testing_string"+i]=i 4.21s 238

The key lessons for me were:

  • Browsers used to process arrays MUCH faster than objects. This gap has now shrunk.
  • However, arrays are still better: not for their speed, but for their space efficiency.
  • If you’re processing a million rows or less, don’t worry about memory. If you’re storing stuff as arrays, you can store 128 columns in 1GB of RAM (1024/8=128).

Software for my new laptop 2

Time for a new laptop, and to replace software. Here’s my new list.

A lot has changed in the last 5 years. Mainly, I use the browser, cygwin and Portable Apps a lot more. (The last is to escape jailers, not registry bloat.)

Media

  • Chrome [new]: For browsing and development. Fast, light, and stays out of the way.
  • Firefox: I keep it just for printing. Chrome sucks at printing.
  • Media Player Classic: Nothing against it, but I decided to stick to just one app, which is…
  • VLC: Continues to be the best media player, IMHO.
  • WinAmp: I just manage my playlists as M3U files, using Python programs.
  • Audacity: Still the easiest way to record audio.
  • Camstudio: The simplest free portable screen capture software I know.
  • PicPick [new]: Lightweight, powerful screenshot grabber
  • VirtualDub: Not the simplest, but still good for what I need: cropping and joining video.
  • MediaCoder [new]: Good for video/audio conversions. Maybe I’ll install this later.
  • Foxit Reader: The simples free portable PDF reader I know, better than…
  • NitroPDF Reader [new]: … which is good for Printing PDFs – better than…
  • Primo PDF: … which has trouble on rare occasions.
  • Microsoft Reader: I have a lot of ebooks in .LIT.
  • Kindle for PC [new]: I don’t own a Kindle, but I’ve bought a few ebooks.
  • Paint.NET: Good enough for cropping and adjusting colours on images.
  • Windows Live Writer [new]: The best way to write this blog WYSIWYG
  • Inkscape [new]: I occasionally edit vector graphics.
  • Google Earth. Google Maps is good enough.
  • ImgBurn: I no longer use CDs/DVDs. Just flash drives and external hard disks.
  • Picasa: I’ve stopped browsing pictures. No time.

Sharing

  • Dropbox [new]: Simplest way of sharing files.
  • Skype: I use it more than my phone.
  • Google Talk: For those friends who have chat enabled on Gmail.
  • TeamViewer [new]: Pretty efficient screen sharing. Works better than Skype, I think.
  • Google Calendar Sync: To keep Outlook in sync with Google Calendar.

Utilities

  • 7-Zip [new]: Covers all compressed formats, and has the best compression ratio.
  • WinRAR: 7-Zip has it covered.
  • AutoHotKey [new]: Shockingly powerful macro functionality. Shockingly underused.
  • Clip [new]: Command line clipboard. dir | clip copies the directory to the clipboard.
  • ClipX [new]: Stores multiple clipboard entries and history. Invaluable.
  • DiskTT [new]: I’m paranoid about disk speed. I keep measuring it.
  • WinDirStat [new]: Best way to find what’s taking up space on disk.
  • ProcessExplorer [new]: Just in case Task Manager doesn’t show you everything.
  • Google Desktop: Well, it’s dead.
  • mDesktop [new]: A Virtual Desktop Manager (multiple screens) for Windows 7.
  • PowerToys: doesn’t work on Windows 7, but I got X-Mouse working.
  • Teracopy: I don’t worry too much about copying files any more. Maybe later.
  • Junction Link Magic [new]: To map folders. But I now use Cygwin, and symlinks rock.
  • uTorrent [new]: For bittorrent.
  • ntlmaps [new]: proxies requiring a password to a proxy not requiring a password
  • Putty [new]: SSH for Windows, but can also act as an SSH tunnel
  • TrueCrypt [new]: To securely back up my bank details on the cloud.

Development

Data Visualisation

  • R [new]. The God of all statistical packages. Install reshape and ggplot2.
  • Gephi [new]: Does network visualisations quite well. 
  • GraphViz [new]: Does network visualisations not quite as well.
  • Google Refine [new]: Helps clean up messy data.
  • qhull [new]: For voronoi treemaps. Don’t ask.
  • wkhtml2pdf [new]: To print web pages as PDF.

What am I missing that you really like?

Faster data crunching

I’ve been playing with big data lately.

The good part is, it’s easy to get interesting results. The data is so unwieldy that even average value calculations provoke a “Amazing! I didn’t know that,” response (No exaggeration. I heard this from two separate ~ $1bn businesses this month.)

The bad part is that calculating even that simple average is slow.

For example, take this 40MB file (380MB unzipped) and extract the first column.

The simplest Python script to get the first column looks like this:

for row in csv.reader(fileinput.input(), delimiter='\t'):
    if len(row) > 0: print row[0]

That took a good 3 minutes to execute on my laptop.

Since I’m used to UNIX data processing, I tried cut -f1. Weirdly, that’s worse. 5 minutes. Paradoxically, awk ‘{print $1}’ only takes 17 seconds. That’s about 12 times faster. Clearly the tool makes a big difference. And we always knew UNIX was fast.

But I also ran these on an Amazon EC2 server, and a Hostgator server. Here’re the results.

  python cut awk
My Dell E5400 3:04 (1x) 5:42 (0.5x) 0:17 (11x)
EC2 standard 0:33 (6x) 0:5.6 (33x) 0:16 (11x)
Hostgator 0:19 (10x) 0:2.5 (74x) 0:0.7 (265x)

What took 3 minutes with Python my Dell E5400 took less than a second on Hostgator’s server with awk. Over 250 times faster. (Not 250%. 250 times).

And it’s not just hardware. A good tool (awk) made things 11x faster on my machine. Good hardware (hostgator) made the same program 10x faster. But choosing the right combination can make things go faster than 11 x 10 = 110 times. Much faster.

There are a few of things I’m taking away from this.

  1. Good hardware can speed you up much as (or more than) choosing the right tool.
  2. Good hardware can be rented. From many places. Cheaply.
  3. Always test what’s fast. awk’s fastest on my machine and Hostgator, but not on EC2.

India district map

I put together a district map of India in SVG this weekend.

So what?

You can now plot data available at a district level on a map, like the temperature in India over the last century (via IndiaWaterPortal). The rows are years (1901, 1911, … 2001) and the columns are months (Jan, Feb, … Dec). Red is hot, green is cold.

temperature

(Yeah, the west coast is a great place to live in, but I probably need to look into the rainfall.)

districts.svg has has 640 districts (I’ve no idea what the 641st looks like) and is tagged with the State and District names as titles:

<g title="Madhya Pradesh">
  <path title="Alirajpur" d="..." />
  <path title="Jhabua" d="..." />
  ...
</g>

How?

I made it from the 2011 census map (0.4MB PDF). I opened it in Inkscape, removed the labels, added a layer for the districts, and used the paint bucket to fill each district’s area. I then saved the districts layer, cleaning it up a big. Then I labelled each district with a title. (Seemed like the easiest way to get this done.)

Thanks to @planemad, @gkjohn, @arjunram for inputs. Play around. Feedback welcome.

Formatting tables

Formatting tables in Excel is a fairly common task, but there are a number of ways to improve on the way it’s done most of the time.

Here are a few tips. Fairly basic stuff, but hopefully useful.

Eating more for less

A couple of years ago, I managed to lose a fair bit of weight. At the start of 2010, I started putting it back on, and the trajectory continues. I’m at the stage where I seriously need to lose weight. I subscribe to The Hacker’s Diet principle – that you lose weight by eating less, not exercising.
An hour of jogging is worth about one Cheese Whopper. Now, are you going to really spend an hour on the road every day just to burn off that extra burger? You don't exercise to lose weight (although it certainly helps). You exercise because you'll live longer and you'll feel better.
I’m afraid I’ll live too long anyway, so I won't bother exercising just yet. It's down to eating less. Sadly, I like food. So to make my “diet” work, I need foods that add less calories per gram. Usually, when browsing stores, I check these manually. But being a geek, I figured there’s an easier way. Below is a graph of some foods (the kind I particularly need to avoid, but still end up eating). The ones on the top add a lot of calories (per 100g), and better to avoid. The ones at the right cost a lot more. Now, I’m no longer at the point where I need to worry about food expenses, but still, I can’t quite kick the habit, also you might want to check out this Rootine's comparison of B12 methylcobalamin and cyanocobalamin that will help you in your diet. Hover over the foods to see what they are, and click on them to visit the product. (If you’re using an RSS reader and this doesn’t work, read on my site.)
(The data was picked from Tesco.) It’s interesting that cereals are in the middle of the calorie range. I always thought they’d be low calories per gram. Turns out that if I want to to have such foods, I’m better off with desserts or ice creams (profiterole, lemon meringue or tiramisu). In fact, even jams have less calories than cereals. But there are some desserts to avoid. Nuts are a disaster. So are chocolates. Gums, dates and honey are in the middle – about as good as cereals. Salsa dip seems surprisingly low. Custards seem to hit the sweet spot – cheap, and very low in calories. Same for jellies. So: custards and jelly. My daughter’s going to be happy.

Birthday matters

Does it matter which month you’re born in?

Based on the results of the 20 lakh students taking the Class XII exams at Tamil Nadu over the last 3 years (via Reportbee), it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200 – or 10%!

Most students who took the Class XII exams in 2011 were born between March 1991 and June 1992. The average marks of each student (out of 1200) is shown in the graph below.

tn-2011

Students born in June 1991 scored the lowest – around 720/1200. This suddenly shoots up in July, then in August, and the students born in September score as much as 840/1200 on average. From there on, it’s downhill.

This result is consistent across years. In 2009 and 2010, you see a similar pattern.

tn-2009 tn-2010

Why could this be?

Malcolm Gladwell’s book Outliers offers a clue.

Outliers opens, for example, by examining why a hugely disproportionate number of professional hockey and soccer players are born in January, February and March.

The answer turns out to be completely unrelated to numerology or astrology.

It’s simply that in Canada the eligibility cutoff for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year—and at that age, in preadolescence, a twelve-month gap in age represents an enormous difference in physical maturity.

In Tamil Nadu, students must be 5 years old before entering Class 1. Schools open mid-June. So students born in June 1994 would barely make it in June 1999 – making them the youngest students in the class. July and August students would be missed – but since many schools implement this policy leniently, they sometimes make it in as well. September borns are often consistently the eldest students in a class.

This pattern reflected in the marks. The eldest – the September 1993 borns – score the highest. The next eldest, the October 1993 borns, score a bit less. And so on. (There are older students who take the exam – the ones born before September 1993 – but many of these are failed students from the previous year, introducing a bias in the results.)

Perhaps this initial advantage that the elder students have over their classmates continues through the years? Whatever the reason, it’s clear that if your child is born in September, he or she already has a 100 mark advantage! If you’re overwhelmed from being a parent, you can always take a breather and play games like 토토사이트.

Visualising the IMDb

The IMDb Top 250, as a source of movies, dries out quickly. In my case, I’ve seen about 175/250. Not sure how much I want to see the rest.

When chatting with Col Needham (who’s working his way through every movie with over 40,000 votes), I came up with this as a useful way of finding what movies to watch next.

visualising-the-imdb-1

Each box is one or more movies. Darker boxes mean more movies. Those on the right have more votes.  Those on top have a better rating. The ones I’ve seen are green, the rest are red. (I’ve seen more movies than that – just haven’t marked them green yet 🙂

I think people like to watch the movies on the top right – that popularity compensates (at least partly) for rating, and the number of votes is an indication of popularity.

For example, my movie pattern tells me that I ought to see Cidade de Deus, Inglourious Basterds and Heat – which I knew from the IMDb Top 250, but also that I ought to cover Kick-Ass, The Hangover and Juno.

visualising-the-imdb-2

It’s easy to pick movies in a specific genre as well.

visualising-the-imdb-3

Clearly, there are many more Comedy movies in the list than any other type – though Romance and Action are doing fine too. And I seem the have a strong preference for the Fantasy genre, in stark contrast to Horror.

(Incidentally, I’ve given up trying to see The Shining after three attempts. Stephen King’s scary enough. The novel kept me awake checking under my bed for a week at night. Then there’s Stanley Kubrick’s style. A Clockwork Orange was disturbing enough, but Haley Joel Osment in the first part of A.I. was downright scary. Finally, there’s Jack Nicholson. Sorry, but I won’t risk that combination on a bright sunny day with the doors open.)

You can track your list at http://250.s-anand.net/visual.

For those who want to play with the code, it’s at http://code.google.com/p/two-fifty/source/browse/trunk/visual.html.