S Anand

Motorbike science lab

My cousin’s working on an interesting project at the Agastya Foundation. A group of scientifically inclined volunteers go around on a bike to schools, taking with them a science lab kit, and show children in rural schools a variety of experiments.

Google will award this and 3 other projects (out of 10) Rs 3 crores based on public votes. You can read more and vote at https://impactchallenge.withgoogle.com/india2013#/agastya|vote

Courtesy

We are often subject to body searches, baggage inspections, and identity verifications. At malls. At airports. At offices.

These are to ensure that no one carries ammunition inside, or goods or secrets outside. In other words, to deter terrorists and thieves.

It’s nothing personal, of course. When someone does not know me, I can choose to accept that (or not; the choice is mine).

When I’m invited somewhere, however, I assume that I am not deemed a security threat. Therefore, I expect that:

  • Neither I nor my belongings will be searched or scanned
  • I need not leave behind my personal belongings
  • I need not carry an identity card

Please afford me this courtesy if you are inviting me.


For some months now, I’ve visited many corporate offices. The reception consists of security guards, a metal detector and a register. I’m given a tag and an escort.

I’m not fussy. I’m not worried about being greeted, for example. I’m quite happy to plug into a power socket and work on my laptop until logistics are sorted out. But when that happens at the security outpost with no sitting space, or outside the gate in the rain, it inconveniences me.

A few weeks ago, I was in Singapore, and visited a client’s office in slippers. One of them complimented my choice of footwear, and remarked that he had not yet risen high enough on the corporate ladder to afford this luxury. (There’s a series of stories behind my footwear that I’ll get to later.)

That told me something. After a long time, I now can afford this luxury. Especially if someone knows me well enough to invite me to their office.

I hope to point them to this blog post and request that security be arranged so that I can be afforded this small courtesy; be treated with trust rather than as a terrorist or a thief.

(If their organisation’s practice does not permit this, I’m happy to meet outside. Besides, our office is happy to extend warm hospitality.)

Open source in corporates

[This is a post that I’d published internally in InfyBlogs in Dec 2009. Time to share it.]

Last month, my first application went live.

I’ve been writing code for 20 years. Not one line of my code has been officially deployed in a corporate. (Loser…)

It’s a happy feeling. Someone defined happiness as the intersection of pleasure and meaning. Writing code is pleasurable. Others using it is meaningful.

But this post isn’t quite about that. It’s about the hoops I’ve had to jump through to make this happen.

I’ve been living in a nightmare since March 2009. That was when I decided that I’d try and get corporates to use open source.

March 2009

It began with a pitch to a VC firm. They were looking to build a content management system (CMS). Normally we’d pull together slides that say we’ll deliver the moon. This time, we put together a demo based on WordPress’ CMS plugins.

The meeting went fabulously well. We said, “Here’s a demo we’ve built for you. Do you like it?” The business lead (Stuart) was drooling and declared that that’s exactly what they wanted. The IT lead (another Stuart) was happy too, but warned the business users: “Just remember: this isn’t how we do development, so don’t get your hopes up that we can deliver stuff like this :-)”

Time to make my point. I asked, “What’s your policy on open source software?”

The business lead went quiet. “I don’t know,” he finally said. Fair enough.

I turned to the IT lead. “Well, we don’t use it as a matter of policy… there are security concerns…” he said.

“Which web server do you use?”

“Oh, OK. I see what you mean. We use Apache. So on a case to case basis, we have exceptions. But generally we have security concerns.”

“Why? Do you believe open source software is more insecure than commercial software?”

He thought about it for a while. “Well… maybe. I don’t know.” We debated this a bit. Then we found the real issue: “It’s just that we don’t have control over the process. We don’t know enough about it to decide.”

A couple of weeks later, I tried pitching to a newspaper. This time, it was our sales team that raised the same question. “But… isn’t open source insecure?”

I didn’t even bother pitching any open source stuff to them. But I’d learnt my lessons:

  1. Demo the application. Don’t talk about it.
  2. Show it to the business first, and then tackle IT.

Aside: June 2009

In June, I got another chance at a client where we were building their new website. The very first thing I did was ask to see the Javascript. Total mess, and filled with browser-incompatible DOM requests. So I went over to their web development team.

“Look, why don’t you guys use a Javascript library? It’ll get you cross browser compatibility and compact maintainable code at the same time.”

And, to their credit, they said, “Sure. Which library?”

I showed them this and we agreed on jQuery. So, if nothing else, I’ve managed to get one open source library into a corporate.

July 2009

I was also looking at payments on the website, and our client was looking to replace their chargeback application. Since I had a week off, I built a working PCI compliant prototype on Django. (I must clarify what I mean by PCI compliant. You see, any application that stores credit card information must pass through a stringent security clearance process. I bypassed the problem by not storing the card information. I’ve realised that I’ve been building PCI compliant applications all my life – and it’s a huge benefit to let people know that.)

This time, I applied the lessons I’d learned, and demo-ed it to the business, who were thrilled. Time to tackle IT.

I started with the architecture team. Matt on the architecture team was the most approachable. So I went over, demo-ed it, and said, “Matt, this took a week to put together. It’s based on some new technologies. Are you game to try these out?”

He was. And quite enthused about it too. So we put together a proposal for the architecture review board, proposing a new technology stack: Django / Python and MySQL. As before, I showed the demo before I talked technology. I had prepared answers to all security related questions upfront (and practically memorised section 3 of the PCI guidelines). The clincher, though, was the business case. To build it on Java, it would cost ~1,000 person days. On Django, I’d mostly done it in 5. There was no way of justifying 1,000 person days for an application that could save, at best, £100,000 a year.

So they said “Go ahead, we’re fine if operations and infrastructure are fine.”

It was time to find a Django developer in Infy. I hunted for a couple of weeks but none was available. (Only 2 people that I knew knew Django in the first place.) So that effort got canned, and we were back to the 1,000 person day solution. (Which got canned too, later.)

But in the process, I’d learned my third lesson.

  3. If you’re trying new technologies, plan on delivering them yourself.

October 2009

Another application popped up that looked like a prime candidate for introducing open source. They were using an Excel application to fraud screen orders, and wanted to make a web app out of it.

I followed the same route as before. Demo it. Show it to business first, then IT. Build it myself. I skipped Architecture, since they’d already approved the technology stack, and took it straight to Infrastructure.

“This application uses Apache as the web server, MySQL as the database, and uses PHP and Javascript for the application logic. Could we get a Linux server to host it?”

Our entire conversation lasted 30 seconds. He said, “No. We use Windows servers” (I was fine)

“… and you’ll need to change Apache to IIS” (fine again)

“… and we don’t support PHP, so it’ll have to be Java or .NET” (I don’t know .NET or Java… but fine)

“… and we don’t support MySQL, it’ll have to be SQL Server” (fine, I guess)

“… and we don’t have DBAs available until January, so you’ll have to wait.” (definitely not good.)

So back to the drawing board on the technology stack. I needed something in Java (I know very little Java, but nothing at all in .NET) and to avoid the DBA headache, it would have to bundle in a database. I first explored key-value stores like CouchDB, Redis, etc. None of them worked on Java. The only one I found that did was Persevere, and it was a JSON data store, which fit perfectly with my plans.

By this time, I’d also learnt my fourth and most important lesson.

  4. Don’t try to promote open source. Just deliver the application.

I said, “This is a custom-built application that runs on Java. Could we get a Windows server to host it?”

The answer was “Yes”, and we had it live the next day.

PS: December 2009

The application’s deployed and running. It has about 10,000 orders fraud screened by now.

And the lessons are well learnt. So when someone came over asking if there was any image resizing solution I knew of, I said: “Sure, who’s your business sponsor?” Then I went over and said, “Let me show you this open source application called ImageMagick. It handles aspect ratios correctly, and can crop too. Doesn’t this look professional?” Then I went over to IT and said, “It’s open source, so you can change it. It has Java bindings, so you can integrate it into your environment. It can handle 8 3000×2400 images a second on my puny laptop. It’s used by your competitors. And I can build it for you if you like.”

I might just have my second open source entry into a corporate this year.

The scary Internet

I’m not that difficult to scare, and this log message certainly didn’t help:

ip223.hichina.com [223.4.183.127] failed - POSSIBLE BREAK-IN ATTEMPT!

That’s the message I saw – one thousand five hundred and seventy times yesterday in /var/log/auth.log on one of my Amazon EC2 instances.

Someone, presumably from China, has been patiently trying out a variety of SSH keys to log into this system.

These were grouped as batches. There were exactly 314 attempts at 8am yesterday, then 314 at 12noon, then 314 at 4pm, then 314 at 8pm, then 232 at 3am today. (All times are in UTC – that is, UK time without daylight saving). Every burst took 9 minutes to run through all 314 attempts.
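A tally like this can be produced with a few lines of Python. This is only a minimal sketch, not the method used for the counts above: it assumes each line of /var/log/auth.log starts with the usual syslog timestamp (for example "Jun  2 08:00:01"), and the sample lines and the name count_attempts_by_hour are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape:
# "<Mon> <day> <HH>:<MM>:<SS> ... [<ip>] failed - POSSIBLE BREAK-IN ATTEMPT!"
PATTERN = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2}\s.*"
    r"\[([\d.]+)\] failed - POSSIBLE BREAK-IN ATTEMPT!"
)

def count_attempts_by_hour(lines):
    """Count break-in warnings per (ip, 'Mon day HH') bucket."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hour, ip = match.groups()
            counts[(ip, hour)] += 1
    return counts

# Hypothetical sample lines, not taken from the real log
sample = [
    "Jun  2 08:00:01 host sshd[101]: reverse mapping checking getaddrinfo "
    "for ip223.hichina.com [223.4.183.127] failed - POSSIBLE BREAK-IN ATTEMPT!",
    "Jun  2 08:00:03 host sshd[102]: reverse mapping checking getaddrinfo "
    "for ip223.hichina.com [223.4.183.127] failed - POSSIBLE BREAK-IN ATTEMPT!",
    "Jun  2 12:00:02 host sshd[103]: reverse mapping checking getaddrinfo "
    "for ip223.hichina.com [223.4.183.127] failed - POSSIBLE BREAK-IN ATTEMPT!",
    "Jun  2 12:05:00 host sshd[104]: Accepted publickey for ubuntu from 10.0.0.1",
]

for (ip, hour), n in sorted(count_attempts_by_hour(sample).items()):
    print(ip, hour, n)
```

On a real server, the lines would come from open("/var/log/auth.log") instead of the sample list.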

The worst part was, when I tried using SSH this morning, I wasn’t able to log in. (It turned out that I had made a configuration error, but this is the sort of thing that gets me quite worried.)

Perhaps I shouldn’t be complaining. I’ve written enough scrapers to make most webmasters cringe at their logs. I remember a few years ago, when I was working on a project at Tesco, and was scraping bestsellers lists from most sites. (Here’s a blog post about it.) We were putting together a prototype to see how real-time competitive pricing could help.

The scraper was a pretty mild one. It would visit a hundred links, roughly at the pace of one a second. No images were loaded, of course, just the HTML.
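For the curious, the shape of such a scraper is roughly the following. This is a sketch under assumptions, not the original script: the function name crawl, the fetch callback and the sample paths are all hypothetical, and the real HTTP call is replaced by a stand-in so the example runs offline.

```python
import time

def crawl(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to stay gentle on the site."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # throttle: roughly one request per delay_seconds
        pages[url] = fetch(url)   # fetch is expected to return just the HTML
    return pages

# Hypothetical stand-in for a real HTTP call
def fake_fetch(url):
    return "<html>" + url + "</html>"

pauses = []  # records the pauses instead of actually sleeping
result = crawl(["/bestsellers/1", "/bestsellers/2", "/bestsellers/3"],
               fake_fetch, sleep=pauses.append)
print(len(result), pauses)
```

Passing the sleep function in makes the pacing easy to check without waiting; in real use the default time.sleep applies.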

One fine day, a few weeks after this had started, I got a call from Andy.

“Hi Anand, are you running any scrapers on our books website?”

“Yes, why?”

“Oh! The site’s very slow. Could you shut it down immediately?”

Turns out that not a single page on the site loaded, and it had almost crawled to a halt. Now, obviously, my little 100-page script could hardly cause damage, but it’s easy to understand their reactions. No unauthorised scraping! After a few days of trying to figure out what the problem was, they increased the memory and things went back to normal. Not a bad solution, actually – throw hardware at the problem, and if it vanishes, it’s probably the cheapest solution.

But anyway, I’m sure it’s some nice chap who’s just curious to know what I’ve got on my servers. I’d be happy to share some of it. And even if it’s not so nice a chap, there’s little that I can do, is there?

Update (1pm India, 3rd June): Actually, I now realise that this has been happening every four hours since May 29th, as regular as clockwork. Wish I knew enough UNIX programming to pull a prank…

Hosting options

I’ve been trying out a number of options for hosting recently, and have settled on Amazon spot instances.

Here were my options:

  • Application hosting, like Google AppEngine. I used this a lot until 2 years ago. Then they changed their pricing, and I realised what “lock-in” means. I can’t just take that code and move it to another server. Besides, I’m a bit wary of Google pulling the plug. Heroku? Same problem. I just want to take the code elsewhere and run it.
  • Shared hosting, like Hostgator. This blog is run on Hostgator and I’m extremely happy with them. But the trouble is, with shared hosting, I don’t get to run long-running processes on any ports I like.
  • Run your own servers. The problem here is quite simple: power cuts in India.
  • Dedicated hosting, like Amazon EC2, Azure, GCE, etc. This remains pretty much the main hosting option.

I’m a price optimisation freak. So I ran the numbers for a year’s worth of usage. I was looking at the CPU cost of a large machine with 7-8GB RAM. Bandwidth and storage are negligible. The cost per hour worked out to:

  • Amazon: $0.32 / hr in Singapore, $0.24 in Virginia
  • Google: $0.29 / hr in Europe
  • Microsoft: $0.32 / hr in US

The price is not all that different, but I need low latency, so Singapore is what it’ll have to be.

EC2 location      Latency (ms)
Singapore         139
Oregon, US        334
Japan             517
Ireland           618
Australia         620
California, US    677
Virginia, US      710

Now comes the choice of the right model. At $0.32 per hour, that’s $230 a month.

Amazon offers some ways of getting this down. Instead of on-demand instances, I could go for reserved instances. For a year of usage, that’d get the price down to about $131 a month, nearly halving it. ($739 upfront for a heavy utilisation large reserved instance, with $0.095 * 24 * 365.25 for the year.)

In this case, I know I’ll need the servers for a year. Probably more, but then, I might want to switch later. So this isn’t a bad move. But we can do better. Amazon also offers spot instances. Spot instances might get shut down any time – but in reality, so can on-demand instances. I need to plan for it anyway. I’m not going to host anything that’s so sensitive that if it’s down for a few hours, I’ll have a problem.

But what’s attractive is the pricing. Typically, it’s $0.04 per hour, making it about $29 per month. Even if it shoots up to twice that, at $58, it’s about a fourth of the on-demand price and less than half the reserved instance price.
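The monthly figures for the three pricing models can be reproduced with a few lines of arithmetic. This is a sketch using the rates quoted above; the 30-day month is an assumption inferred from the $230 figure, and the variable names are mine.

```python
HOURS_PER_MONTH = 24 * 30      # assumption: a 30-day month, which matches the $230 figure
HOURS_PER_YEAR = 24 * 365.25

on_demand = 0.32 * HOURS_PER_MONTH               # Singapore on-demand rate
reserved = (739 + 0.095 * HOURS_PER_YEAR) / 12   # upfront fee plus hourly charge, per month
spot = 0.04 * HOURS_PER_MONTH                    # typical spot price
spot_doubled = 2 * spot                          # if the spot price doubles

for label, cost in [("on-demand", on_demand), ("reserved", reserved),
                    ("spot", spot), ("spot at 2x", spot_doubled)]:
    print(f"{label:12s} ${cost:7.2f} / month")
```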

I’ve managed to script the entire setup sequence as shell scripts, and it takes less than an hour to get a new server up and running the software I need. I need to work out a decent backup mechanism. Plus, I could use more reliable storage like Amazon’s EBS to preserve the data. But on the whole, the pricing is far too attractive and makes the risks worthwhile.

Geocoding in Excel

It’s easy to convert addresses into latitudes and longitudes in Excel. Here’s the Github project with a downloadable Excel file.

It works via a Visual Basic function, GoogleGeocode, that geocodes addresses.

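' Geocodes an address via the Google Geocoding API and returns "lat,lng".
' Needs a reference to Microsoft XML (Tools > References in the VBA editor).
Function GoogleGeocode(address As String) As String
    Dim xDoc As New MSXML2.DOMDocument
    Dim lat As String, lng As String
    xDoc.async = False    ' load synchronously so the result is ready on return
    xDoc.Load ("http://maps.googleapis.com/maps/api/geocode/" + _
        "xml?address=" + address + "&sensor=false")
    If xDoc.parseError.ErrorCode <> 0 Then
        ' Return the parser's error message instead of coordinates
        GoogleGeocode = xDoc.parseError.reason
    Else
        xDoc.setProperty "SelectionLanguage", "XPath"
        ' Take the first <lat> and <lng> elements in the response
        lat = xDoc.SelectSingleNode("//lat").Text
        lng = xDoc.SelectSingleNode("//lng").Text
        GoogleGeocode = lat & "," & lng
    End If
End Function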
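The same two steps (build the request URL, then pull the first lat and lng out of the XML) can be sketched in Python for use outside Excel. This is an illustration rather than part of the Github project: the function names are mine, the sample response is a hypothetical, heavily trimmed one, and unlike the VBA above this version percent-encodes the address.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://maps.googleapis.com/maps/api/geocode/xml"

def geocode_url(address):
    """Build the request URL, percent-encoding the address."""
    return BASE_URL + "?" + urlencode({"address": address, "sensor": "false"})

def parse_lat_lng(xml_text):
    """Return 'lat,lng' from the first <lat> and <lng> elements, or '' if absent."""
    root = ET.fromstring(xml_text)
    lat = root.find(".//lat")
    lng = root.find(".//lng")
    if lat is None or lng is None:
        return ""
    return lat.text + "," + lng.text

# Hypothetical, trimmed response used only to exercise the parser
sample = """<GeocodeResponse><status>OK</status><result><geometry>
<location><lat>12.9716</lat><lng>77.5946</lng></location>
</geometry></result></GeocodeResponse>"""

print(geocode_url("MG Road, Bangalore"))
print(parse_lat_lng(sample))
```

Fetching the URL is left out so that the sketch runs without network access.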

Goodbye Google

Google Reader was where I spent most of my browsing time, but now, it’s shutting down.

Time for alternatives, but not just for Reader: for all Google products. I’m not sure when one of these might go down, become paid, or become unusable.

I just uninstalled Google Drive and Google Talk, but I don’t use them much (I use Skype), so no loss. I’ll leave Chrome for the time being, but I’m hearing reports that Firefox is improving faster than Chrome is. Or there’s Chromium.

I’m not worried much about search services (including image, video, scholar and books). When needed, I can switch. Scholar might be a bit sad to lose, but I don’t use it much. Google Translate, too, isn’t essential.

Likewise for content. YouTube’s not a problem. There’re enough other video services. Trends are useful, but not critical. Maps might be, so I’ll try and switch to OpenStreetMap. I don’t use News or Picasa much.

I don’t care much for social media anyway, so Blogger, Orkut and Plus can die any time.

Google’s apps are the worrying ones. Mail and Calendar, in particular. I’ll probably migrate away from them last, but the attempt is on. I’ll be documenting the alternatives I find at https://gist.github.com/sanand0/5176161 (safely cloned locally).

Looks like there’s no safe long-term alternative to being able to host your own apps. Pity.

Github page-only repository

Github offers Github Pages that let you host web pages on Github.

You create these by adding a branch to git called gh-pages, and this is often in addition to the default branch master.

I just needed the gh-pages branch. So thanks to YJL, here’s the simplest way to do it.

  1. Create the repository on Github.
  2. Create your local repository and git commit into it.
  3. Type git push -u origin master:gh-pages
  4. In .git/config, under the [remote "origin"] section, add push = +refs/heads/master:refs/heads/gh-pages

The magic is the last :gh-pages.
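If you’d rather not edit .git/config by hand, step 4 can be scripted. This is a minimal sketch that treats the config as plain text: the function name add_push_refspec and the sample config (including its placeholder URL) are hypothetical.

```python
PUSH_LINE = "\tpush = +refs/heads/master:refs/heads/gh-pages"

def add_push_refspec(config_text):
    """Insert the push refspec right after the [remote "origin"] header, once."""
    if PUSH_LINE.strip() in config_text:
        return config_text  # already configured, leave untouched
    out = []
    for line in config_text.splitlines():
        out.append(line)
        if line.strip() == '[remote "origin"]':
            out.append(PUSH_LINE)
    return "\n".join(out) + "\n"

# Hypothetical .git/config contents; the URL is a placeholder
sample = (
    '[core]\n'
    '\trepositoryformatversion = 0\n'
    '[remote "origin"]\n'
    '\turl = git@github.com:user/repo.git\n'
    '\tfetch = +refs/heads/*:refs/remotes/origin/*\n'
)

updated = add_push_refspec(sample)
print(updated)
```

With that line in place, a plain git push should send the local master to gh-pages on origin.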

The most popular scientific Python modules

I just scraped the scientific packages on pypi. Here are the top 100 by downloads.

Name Description Size Downloads
numpy NumPy: array processing for numbers, strings, records, and objects. 2000000 133076
scipy SciPy: Scientific Library for Python 7000000 33990
pygraphviz Python interface to Graphviz 99000 22828
geopy Python Geocoding Toolbox 32000 18617
googlemaps Easy geocoding, reverse geocoding, driving directions, and local search in Python via Google. 69000 15135
Rtree R-Tree spatial index for Python GIS 495000 14370
nltk Natural Language Toolkit 1000000 12844
Shapely Geometric objects, predicates, and operations 93000 12635
pyutilib.component.doc Documentation for the PyUtilib Component Architecture. 372000 10181
geojson Encoder/decoder for simple GIS features 12000 9407
GDAL GDAL: Geospatial Data Abstraction Library 410000 8957
scikits.audiolab A python module to make noise from numpy arrays 1000000 8856
pupynere NetCDF file reader and writer. 16000 8809
scikits.statsmodels Statistical computations and models for use with SciPy 3000000 8761
munkres munkres algorithm for the Assignment Problem 42000 8409
scikit-learn A set of python modules for machine learning and data mining 2000000 7735
networkx Python package for creating and manipulating graphs and networks 1009000 7652
pyephem Scientific-grade astronomy routines 927000 7644
PyBrain PyBrain is the swiss army knife for neural networking. 255000 7313
scikits.learn A set of python modules for machine learning and data mining 1000000 7088
obspy.seisan SEISAN read support for ObsPy. 3000000 6990
obspy.wav WAV(audio) read and write support for ObsPy. 241000 6985
obspy.seishub SeisHub database client for ObsPy. 237000 6941
obspy.sh Q and ASC (Seismic Handler) read and write support for ObsPy. 285000 6926
crcmod CRC Generator 128000 6714
obspy.fissures DHI/Fissures request client for ObsPy. 1000000 6339
stsci.distutils distutils/packaging-related utilities used by some of STScI’s packages 25000 6215
pyopencl Python wrapper for OpenCL 1000000 6124
Kivy A software library for rapid development of hardware-accelerated multitouch applications. 11000000 5879
speech A clean interface to Windows speech recognition and text-to-speech capabilities. 17000 5809
patsy A Python package for describing statistical models and for building design matrices. 276000 5517
periodictable Extensible periodic table of the elements 775000 5498
pymorphy Morphological analyzer (POS tagger + inflection engine) for Russian and English (+perhaps German) languages. 70000 5174
imposm.parser Fast and easy OpenStreetMap XML/PBF parser. 31000 4940
hcluster A hierarchical clustering package for Scipy. 442000 4761
obspy.core ObsPy – a Python framework for seismological observatories. 487000 4608
Pyevolve A complete python genetic algorithm framework 99000 4509
scikits.ann Approximate Nearest Neighbor library wrapper for Numpy 82000 4368
obspy.imaging Plotting routines for ObsPy. 324000 4356
obspy.xseed Dataless SEED, RESP and XML-SEED read and write support for ObsPy. 2000000 4331
obspy.sac SAC read and write support for ObsPy. 306000 4319
obspy.arclink ArcLink/WebDC client for ObsPy. 247000 4164
obspy.iris IRIS Web service client for ObsPy. 261000 4153
Orange Machine learning and interactive data mining toolbox. 14000000 4099
obspy.neries NERIES Web service client for ObsPy. 239000 4066
pandas Powerful data structures for data analysis, time series,and statistics 2000000 4037
pycuda Python wrapper for Nvidia CUDA 1000000 4030
GeoAlchemy Using SQLAlchemy with Spatial Databases 159000 3881
pyfits Reads FITS images and tables into numpy arrays and manipulates FITS headers 748000 3746
HTSeq A framework to process and analyze data from high-throughput sequencing (HTS) assays 523000 3720
pyopencv PyOpenCV – A Python wrapper for OpenCV 2.x using Boost.Python and NumPy 354000 3660
thredds THREDDS catalog generator. 25000 3622
hachoir-subfile Find subfile in any binary stream 16000 3540
fluid Procedures to study geophysical fluids on Python. 210000 3520
pygeocoder Python interface for Google Geocoding API V3. Can be used to easily geocode, reverse geocode, validate and format addresses. 7000 3514
csc-pysparse A fast sparse matrix library for Python (Commonsense Computing version) 111000 3455
topex A very simple library to interpret and load TOPEX/JASON altimetry data 7000 3378
arrayterator Buffered iterator for big arrays. 7000 3320
python-igraph High performance graph data structures and algorithms 3000000 3260
csvkit A library of utilities for working with CSV, the king of tabular file formats. 29000 3236
PyVISA Python VISA bindings for GPIB, RS232, and USB instruments 237000 3201
Quadtree Quadtree spatial index for Python GIS 40000 3000
ProxyHTTPServer ProxyHTTPServer — from the creator of PyWebRun 3000 2991
mpmath Python library for arbitrary-precision floating-point arithmetic 1000000 2901
bigfloat Arbitrary precision correctly-rounded floating point arithmetic, via MPFR. 126000 2879
SimPy Event discrete, process based simulation for Python. 5000000 2871
Delny Delaunay triangulation 18000 2790
pymc Markov Chain Monte Carlo sampling toolkit. 1000000 2727
PyBUFR Pure Python library to encode and decode BUFR. 10000 2676
collective.geo.bundle Plone Maps (collective.geo) 11000 2676
dap DAP (Data Access Protocol) client and server for Python. 125000 2598
rq RQ is a simple, lightweight, library for creating background jobs, and processing them. 29000 2590
pyinterval Interval arithmetic in Python 397000 2558
StarCluster StarCluster is a utility for creating and managing computing clusters hosted on Amazon’s Elastic Compute Cloud (EC2). 2000000 2521
fisher Fast Fisher’s Exact Test 43000 2503
mathdom MathDOM – Content MathML in Python 169000 2482
img2txt superseded by asciiporn, http://pypi.python.org/pypi/asciiporn 443000 2436
DendroPy A Python library for phylogenetics and phylogenetic computing: reading, writing, simulation, processing and manipulation of phylogenetic trees (phylogenies) and characters. 6000000 2349
geolocator geolocator library: locate places and calculate distances between them 26000 2342
MyProxyClient MyProxy Client 67000 2325
PyUblas Seamless Numpy-UBlas interoperability 51000 2252
oroboros Astrology software 1000000 2228
textmining Python Text Mining Utilities 1000000 2198
scikits.talkbox Talkbox, a set of python modules for speech/signal processing 147000 2188
asciitable Extensible ASCII table reader and writer 312000 2160
scikits.samplerate A python module for high quality audio resampling 368000 2151
tabular Tabular data container and associated convenience routines in Python 52000 2114
pywcs Python wrappers to WCSLIB 2000000 2081
DeliciousAPI Unofficial Python API for retrieving data from Delicious.com 19000 2038
hachoir-regex Manipulation of regular expressions (regex) 31000 2031
Kamaelia Kamaelia – Multimedia & Server Development Kit 2000000 2007
seawater Seawater Libray for Python 2000000 1985
descartes Use geometric objects as matplotlib paths and patches 3000 1983
vectorformats geographic data serialization/deserialization library 10000 1949
PyMT A framework for making accelerated multitouch UI 18000000 1945
times Times is a small, minimalistic, Python library for dealing with time conversions between universal time and arbitrary timezones. 4000 1929
CocoPy Python implementation of the famous CoCo/R LL(k) compiler generator. 302000 1913
django-shapes Upload and export shapefiles using GeoDjango. 9000 1901
sympy Computer algebra system (CAS) in Python 5000000 1842
pyfasta fast, memory-efficient, pythonic (and command-line) access to fasta sequence files 14000 1836
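To slice this list further, each row can be split back into its four fields. The sketch below is not the scraper itself: it assumes the name contains no spaces and that the last two tokens are always the size and the download count, which holds for the rows above. The function name parse_row is mine.

```python
def parse_row(row):
    """Split 'name description... size downloads' into its four fields."""
    name, rest = row.split(" ", 1)                        # first token is the name
    description, size, downloads = rest.rsplit(" ", 2)    # last two tokens are numbers
    return {"name": name, "description": description,
            "size": int(size), "downloads": int(downloads)}

# Three rows copied from the table above
rows = [
    "scipy SciPy: Scientific Library for Python 7000000 33990",
    "numpy NumPy: array processing for numbers, strings, records, and objects. 2000000 133076",
    "geopy Python Geocoding Toolbox 32000 18617",
]

packages = sorted((parse_row(r) for r in rows),
                  key=lambda p: p["downloads"], reverse=True)
for p in packages:
    print(p["name"], p["downloads"])
```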