Coding

Restartable and Parallel

When processing data at a large scale, there are two characteristics that make a huge difference to my life.

Restartability. When something goes wrong, being able to continue from where it stopped. In my opinion, this is more important than parallelism. There’s nothing as depressing as having to start from scratch every time. Think of it as the ability to save a game as opposed to starting from Level 1 in every life.

Parallelism. Being able to run multiple processes in parallel. Often, this is easy. You don’t need threads. Good old UNIX xargs can do a great job of it. Interestingly, I’ve never used Hadoop for any real-life problem. I’ve gotten by with UNIX commands and smart partitioning.

The “smart partitioning” bit is important. For example, if you’re dealing with telecom data, most of your metrics (e.g. did the number of calls grow or fall, are there more outgoing or incoming calls, etc.) are calculated per mobile number. So if you have multiple data sets, as long as all the data related to one mobile number is on the same system, you’re fine. If you have 100 machines, just split the data based on the last 2 digits of the mobile number. So data about 9012345678 would go to machine 78 (the last two digits). Given a mobile number, for any type of data, you’d know exactly which machine holds it. For all practical purposes, that gives you the basics of a distributed file system.
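
Splitting a dump across 100 machines this way is just a few lines of code. Here’s a minimal sketch in Node.js (the file name and record layout are made up for illustration):

var fs = require('fs');

// Hypothetical input: one call record per line, mobile number as the first field.
var buckets = {};
fs.readFileSync('calls.csv', 'utf8').split('\n').forEach(function (line) {
  if (!line) return;
  var mobile = line.split(',')[0],
      machine = mobile.slice(-2);                       // "9012345678" -> "78"
  (buckets[machine] = buckets[machine] || []).push(line);
});

// part-NN.csv now holds everything about its mobile numbers and can be
// shipped to (and processed on) machine NN independently.
Object.keys(buckets).forEach(function (machine) {
  fs.writeFileSync('part-' + machine + '.csv', buckets[machine].join('\n'));
});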

(I’m not saying you don’t need Hadoop. Just that I haven’t needed it.)

Colour spaces

In reality, a colour is a combination of light waves with frequencies of roughly 430–750THz, just like sound is a combination of sound waves with frequencies from 20–20,000Hz. Just as mixing various pure notes produces a new sound, mixing various pure colours (like those in a rainbow) produces new colours (like white, which isn’t in the rainbow).

Our eyes aren’t like our ears, though. They have 3 sensors that are triggered differently by different frequencies. The sensors roughly peak around red, green and blue. Roughly.

It turns out that it’s possible to recreate most (not all) colours using a combination of just red, green and blue by mimicking these three sensors to the right level. That’s why TVs and monitors have red, blue and green cells, and we represent colours using hex triplets for RRGGBB – like #00ff00 (green).

There are a number of problems with this from a computational perspective. Conceptually, we think of (R, G, B) as a 3-dimensional cube. That’d mean that 100% red is about as bright as 100% green or blue. Unfortunately, green is a lot brighter than red, which is a lot brighter than blue. Our 3 sensors are not equally sensitive.

You’d also think that a colour that’s numerically mid-way between 2 colours should appear to be mid-way. Far from it.

This means that if you’re picking colours using the RGB model, you’re using something very far from the intuitive human way of perceiving colours.

Which is all very nice, but I’m usually in a rush. So what do I do?

  1. I go to the Microsoft Office colour themes and use a colour picker to pick one. (I extracted them to make life easier.) These are generally good on the eye.
  2. Failing that, I pick something from http://kuler.adobe.com/
  3. Or I go to http://colorbrewer2.org/ and pick a set of colours
  4. If I absolutely have to do things programmatically, I use the HCL colour space. The good part: it’s perceptually uniform. The bad part: not every interpolation lands on a valid (displayable) colour. (There’s a sketch below.)
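
When I do need HCL programmatically, I lean on a library rather than doing the maths myself. A minimal sketch, assuming d3 (which ships an HCL interpolator); the endpoint colours here are arbitrary:

// Five perceptually evenly spaced colours between two endpoints.
var steps = 5,
    interpolate = d3.interpolateHcl('#1f77b4', '#ff7f0e'),
    palette = [];

for (var i = 0; i < steps; i++) {
  palette.push(interpolate(i / (steps - 1)));
}

// palette holds 5 CSS colour strings. Interpolating in HCL can pass through
// points outside the RGB gamut, which is the "bad part" mentioned above.
console.log(palette);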

Is Protocol Buffers worth it?

Google’s Protocol Buffers is a “language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler”.

XML is slow and large. There’s no doubting that. JSON’s my default alternative, though it’s a bit large. CSV’s ideal for tabular data, but ragged hierarchies are a bit difficult.

I was trying to see if Protocol Buffers would be smaller and faster, at least when using Python. I took JSON as the base, and checked the write speed, read speed and file sizes. Here’s the comparison:

[Chart: write speed, read speed and file size comparison]

Protocol Buffers are 17 times slower to write and almost 6 times slower to read than JSON files. File sizes are smaller, but then, all it takes is a simple gzip operation to compress the JSON files even smaller. Reading json.gz files is just 2% slower than JSON files, and writing them is only 4 times slower.
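
The benchmark code itself is Python (repository below), but the json.gz approach is the same in any language. A rough sketch in Node.js, with a made-up payload:

var fs = require('fs'),
    zlib = require('zlib');

var data = {rows: [{x: 1, y: 2}, {x: 3, y: 4}]};        // hypothetical payload

// Write: serialise to JSON, then gzip.
fs.writeFileSync('data.json.gz', zlib.gzipSync(JSON.stringify(data)));

// Read: gunzip, then parse.
var loaded = JSON.parse(zlib.gunzipSync(fs.readFileSync('data.json.gz')));
console.log(loaded.rows.length);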

The code base is at https://bitbucket.org/sanand0/protobuftest

On the whole, it appears that GZipped JSON files are smaller, faster, and just as simple as Protocol Buffers. What am I missing?

Update: When you add GZipped CSV to the mix, it’s twice as fast as GZipped JSON to read: clearly a huge win. It’s only slightly slower to write, and compresses a tiny bit more than JSON.

Audio data URI

Turns out that you can use data URIs in the <audio> tag.

Just upload an MP3 file to http://dataurl.net/#dataurlmaker and you’ll get a long string starting with data:audio/mp3;base64...

Insert this into your HTML:

<audio controls src="data:audio/mp3;base64...">

That’s it – the entire MP3 file is embedded into your HTML page without requiring additional downloads.

This takes a bit more bandwidth than the MP3, and won’t work on Internet Explorer. But for modern browsers, and small audio files, it reduces the overall load time – sort of like CSS sprites.
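
You don’t even need the website. The same data URI can be built in the browser itself. A sketch, assuming a modern browser and a hypothetical clip.mp3 on the same server:

// Fetch the MP3, base64-encode it via FileReader, and embed it in an <audio> tag.
fetch('clip.mp3')
  .then(function (res) { return res.blob(); })
  .then(function (blob) {
    var reader = new FileReader();
    reader.onload = function () {
      var audio = document.createElement('audio');
      audio.controls = true;
      audio.src = reader.result;          // "data:audio/mpeg;base64,..."
      document.body.appendChild(audio);
    };
    reader.readAsDataURL(blob);
  });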

So, on my bus ride today, I built a little HTML5 musical keyboard that generates data URIs on the fly. Click to play.

[Screenshot: the HTML5 musical keyboard]

Markdress

This year, I’ve converted the bulk of my content into Markdown – a simple way of formatting text files in a way that can be rendered into HTML.

Not out of choice, really. It was the only solution if I wanted to:

  • Edit files on my iPad / iPhone (I’ve started doing that a lot more recently)
  • Allow the contents to be viewable as HTML as well as text, and
  • Allow non techies to edit the file

As a bonus, it’s already the format Github and Bitbucket use for markup.

If you toss Dropbox into the mix, there’s a powerful solution there. You can share files via Dropbox as Markdown, and publish them as web pages. There are already a number of solutions that let you do this. DropPages.com and Pancake.io let you share Dropbox files as web pages. Calepin.co lets you blog using Dropbox.

My needs were a bit simpler, however. I sometimes publish Markdown files on Dropbox that I want to see in a formatted way – without having to create an account. Just to test things, or share temporarily.

Enter Markdress.org. My project for this morning.

Just add any URL after markdress.org to render it as Markdown. For example, to render the file at http://goo.gl/zTG1q, visit http://markdress.org/goo.gl/zTG1q.
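
There isn’t much under the hood. This isn’t the actual Markdress code, just a sketch of the idea, assuming a recent Node.js (with a global fetch), Express and the marked library:

var express = require('express'),
    marked = require('marked'),
    app = express();

// Whatever follows the hostname is treated as a URL, fetched, and rendered as Markdown.
app.get(/^\/(.+)/, function (req, res) {
  fetch('http://' + req.params[0])                      // e.g. /goo.gl/zTG1q
    .then(function (r) { return r.text(); })
    .then(function (text) { res.send(marked.parse(text)); })
    .catch(function () { res.status(404).send('Could not fetch that URL'); });
});

app.listen(3000);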

To test it out, create any text file in your Dropbox Public folder, get the public link, and append it to http://markdress.org/ without the http:// prefix.

Protect static files on Apache with OpenID

I moved from static HTML pages to web applications and back to static HTML files. There’s a lot to be said for the simplicity and portability of a bunch of files. Static site generators like Jekyll are increasingly popular; I’ve built a simple publisher that I use extensively.

Web apps give you something else, though, that’s still useful on a static site: access control. I’ve been resorting to htpasswd to protect static files, and it’s far from optimal. I don’t want to know or manage users’ passwords. I don’t want them to remember a new ID. I just want to allow specific people to log in via their Google Accounts. (OpenID is too confusing, and most people use Google anyway.)

The easiest option would be to use Google AppEngine. But their new pricing worries me. Hosting on EC2 is expensive in the long run. All my hosting is now out of a shared Hostgator server that offers Apache and PHP.

So, obviously, I wrote a library that protects static files on Apache/PHP using OpenID.

Download the code

 

Say you want to protect /home/www which is accessible at http://example.com/.

  1. Copy .htaccess and _auth/ under /home/www.
  2. In .htaccess, change RewriteBase to /
  3. In _auth/, copy config.sample.php into config.php, and
    1. change $AUTH_PATH to http://example.com/
    2. add permitted email IDs to function allow()

Now, when you visit http://example.com, you’ll be taken to Google’s login page. Once you log in, if your email ID is allowed, you’ll be able to see the file.

Feel free to try, or fork the code.

Codecasting

The best way to explain code to a group of people is by walking through it. If they’re far away in space or time, then a video is the next best thing.

The trouble with videos, though, is that they’re big. I can’t host them on my server – I’d need YouTube. Editing them is tough. You can’t copy & paste code from videos. And so on.

One interesting alternative is to use presentations with audio. Slideshare, for instance, lets you share slides and sync it with audio. That almost works. But it’s still not good enough. I’d like code to be stored as code.

What I really need is codecasting: a YouTube or Slideshare for code. The closest I’d seen until the day before yesterday was etherpad or ttyrec – but neither supports audio.

Enter Popcorn. It’s a Javascript library from Mozilla that, among other things, can fire events when an audio/video element reaches a particular point.

Watch a demo of how I used it for codecasting

A look at the code will show you that I’m using two libraries: SyntaxHighlighter to highlight the code, and Popcorn. The meat of the code I’ve written is in this subtitle function.

function subtitle(media_node, pre_node, events) {
  var pop = Popcorn(media_node);
  for (var i=0, l=events.length; i<l; i++) {
    // Build a selector covering every line to highlight for this event
    for (var j=0, line_selector=[], line_no; line_no=events[i][1][j]; j++) {
      line_selector.push(pre_node + ' .number' + line_no);
    }
    // Each highlight runs from its own start time to the next event's start time
    var start = events[i][0],
        end = i < l-1 ? events[i+1][0] : events[i][0] + 999;
    (function(start, end, selector) {
      pop.code({
        start: start,
        end: end,
        onStart: function(o) { $(selector).addClass('highlighted'); },
        onEnd: function(o) { $(selector).removeClass('highlighted'); }
      });
    })(start, end, line_selector.join(','));
  }
}

When called like this:

subtitle('#audio', 'pre', [
  [ 1, [1,2,3]],
  [ 5, [4,5,6]],
  [ 9, [7,8]],
])

… it takes the #audio element and, when the audio plays to 1 second, highlights lines 1, 2 and 3; at 5 seconds, highlights lines 4, 5 and 6; and so on.

Another thing that helped was that my iPad has a much better mic than my laptop, and ClearRecord is a really simple way to create recordings with minimal noise. [Note to self: sampling at 16kHz and saving as a VBR MP3 (45–85kbps) seems the best trade-off.]

With these tools, my time to prepare a tutorial went down from 4 hours to half an hour!


Javascript arrays vs objects

Summary: Arrays are a lot smaller than objects, but only slightly faster on newer browsers.

I’m writing an in-memory Javascript app that handles several thousand rows. Each row could be stored either as an array [1,2,3] or an object {"x":1,"y":2,"z":3}. Having read up on the performance of arrays vs objects, I thought I’d do a few tests on storing numbers from 0 to 1 million. The results for Chrome are below. (Firefox 7 was similar.)
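
The test was roughly the following (a sketch; the exact harness I used may have differed slightly):

var N = 1000000, i;

console.time('array');
var arr = [];
for (i = 0; i < N; i++) arr[i] = i;                     // Array: x[i] = i
console.timeEnd('array');

console.time('object');
var obj = {};
for (i = 0; i < N; i++) obj[i] = i;                     // Object: x[i] = i
console.timeEnd('object');

console.time('object, long keys');
var obj2 = {};
for (i = 0; i < N; i++) obj2['a_long_dummy_testing_string' + i] = i;
console.timeEnd('object, long keys');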

                                                  Time    Size (MB)
Array:  x[i] = i                                  2.44s   8
Object: x[i] = i                                  3.02s   57
Object: x["a_long_dummy_testing_string"+i] = i    4.21s   238

The key lessons for me were:

  • Browsers used to process arrays MUCH faster than objects. This gap has now shrunk.
  • However, arrays are still better: not for their speed, but for their space efficiency.
  • If you’re processing a million rows or less, don’t worry about memory. Stored as an array, a million-row column takes about 8MB, so 1GB of RAM holds around 128 such columns (1024/8 = 128).

Server speed benchmarks

Yesterday, I wrote about node.js being fast. Here are some numbers. I ran Apache Benchmark on the simplest Hello World program possible, testing 10,000 requests with 100 concurrent connections (ab -n 10000 -c 100). These are on my Dell E5400, with lots of applications running, so take them with a pinch of salt.

Server                        Code                          Throughput   Comment
PHP5 on Apache 2.2.6          <?php echo "Hello world" ?>   1,550/sec    Base case. But this isn't too bad
Tornado/Python                See Tornadoweb example        1,900/sec    Over 20% faster
Static HTML on Apache 2.2.6   Hello world                   2,250/sec    Another 20% faster
Static HTML on nginx 0.9.0    Hello world                   2,400/sec    6% faster
node.js 0.4.1                 See nodejs.org example        2,500/sec    Faster than a static file on nginx!
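
For reference, the node.js case was essentially the canonical hello-world server from the nodejs.org front page (reproduced from memory, so details may differ slightly):

var http = require('http');
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});   // plain-text response
  res.end('Hello World\n');
}).listen(8124, '127.0.0.1');
console.log('Server running at http://127.0.0.1:8124/');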

I was definitely NOT expecting this result… but it looks like serving a static file with node.js could be faster than nginx. This might explain why Markup.io is exposing node.js directly, without an nginx or varnish proxy.

Why node.js

I’ve moved from Python to Javascript on the server side – specifically, Tornado to Node.js.

Three years ago, I moved from Perl to Python because I got free hosting at AppEngine. Python’s a cleaner language, but that was not enough to make me move. Free hosting was.

Initially, my apps were on AppEngine, but that wouldn’t work for corporate apps, so I tried Django. IMHO, Django’s too bulky, has too much “magic”, and templates are restrictive. Then I tried Tornado: small; independent modules; easy to learn. I used it for almost 2 years.

The unexpected bonus with Tornado was its event-based model: it wouldn’t wait for file or HTTP requests to complete before serving the next request. I ended up getting a fair bit of performance from a single server.

Trouble is, Python’s a rare skill. I tried selling Python in corporates a couple of times, and barring RBS (which used it before I came in, and made it really easy for me to build an IRR calculator), I’ve failed every time. Apart from general fear, uncertainty and doubt, getting people is tougher.

Javascript’s a good choice. It has many of Python’s benefits. It’s easy to recruit people. Corporates aren’t terrified of it. Rhino was a good enough server. All it lacked was the “cool” factor, which node.js has now brought. And besides,

  • It’s fast. About 20 times faster than Rhino, by my crude benchmarks.
  • It’s stable. (Well, at least, it feels stable. Rock solid stable. Sort of like nginx.)
  • It’s asynchronous. So I don’t miss Tornado.
  • It has a pretty good set of libraries, thanks to everyone jumping on to it.
  • I can write code that works on the client and server – e.g. form validation.

Bye, Python.