S Anand

lxml is fast enough

Given the blazing speed of Node.js these days, I expected HTML parsing to be faster on Node than on Python.

So I compared lxml with htmlparser2 — the fastest libraries on Python and JS in parsing the reddit home page (~700KB).

  • lxml took ~8.6 milliseconds
  • htmlparser2 took ~14.5 milliseconds

Looks like lxml is much faster. I’m likely to stick around with Python for pure HTML parsing (without JavaScript) for a while longer.

In [1]: from lxml.html import parse

In [2]: %timeit tree = parse('reddit.html')
8.69 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
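
const { Parser } = require("htmlparser2");
const { DomHandler } = require("domhandler");
const fs = require("fs");

// Read the file as a UTF-8 string: Parser.write() expects text, not a Buffer
const html = fs.readFileSync("reddit.html", "utf8");

const start = Date.now();
for (let i = 0; i < 100; i++) {
  // Pass the handler to the Parser, otherwise no DOM tree is built.
  // Use a fresh handler per run so nodes don't accumulate across runs.
  const handler = new DomHandler(function (error, dom) { });
  const parser = new Parser(handler);
  parser.write(html);
  parser.end();
}
const end = Date.now();
// Mean milliseconds per parse
console.log((end - start) / 100);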

Note: If I run the htmlparser2 code 100 times instead of 10, it only takes 7ms per loop. The more loops it runs, the faster each parse gets. I guess Node.js optimizes repeated loops. But I’m only interested in the first iteration, since I’ll be parsing each file only once.
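
Since the first iteration is the number that matters here, it helps to record it separately from the mean. Below is a minimal Python sketch of that idea. The helper name time_first_and_mean is hypothetical, and the workload is a stand-in: to time lxml, swap in something like lambda: parse('reddit.html').

```python
import time


def time_first_and_mean(fn, loops=100):
    """Call fn() `loops` times; return (first_call_ms, mean_ms)."""
    timings = []
    for _ in range(loops):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return timings[0], sum(timings) / len(timings)


if __name__ == "__main__":
    # Stand-in workload (an assumption); replace with the real parse call.
    first_ms, mean_ms = time_first_and_mean(lambda: sum(range(100_000)), loops=20)
    print(f"first: {first_ms:.3f} ms, mean: {mean_ms:.3f} ms")
```

A large gap between the first call and the mean would suggest warm-up effects (caching, JIT) rather than steady-state parsing speed.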

Mining for Ancient Debris

I’ve been active on Minecraft for the last 6 months, thanks to my daughter. She keeps watching game videos for hours. I thought I’d see what the big deal was, and made one myself.

In this 5-minute clip, I’m mining for Ancient Debris in the Nether by placing beds — which explode when used in the Nether. That’s a quick way to clear large areas and is cheaper than TNT. Ancient Debris is used to make Netherite Scrap which makes Netherite ingots that can upgrade to Netherite weapons and armor — the strongest things in Minecraft.

Why do I care? Well, when my friend’s son said “You’re the only adult I know who plays Minecraft”, I felt 20 years younger 😊.

Books in 2020

My Goodreads 2020 Reading Challenge target is 50 books. I’m at 45/50, with little hope of getting to 50. (I managed 25/24 in 2019.)

The 10 non-fiction books I read (most useful first) are below.

  1. The Lean Startup by Eric Ries.
    The principle of Build – Measure – Learn is useful everywhere in life too, not just in startups.
  2. Never Split The Difference by Chris Voss.
    Shares principle-driven strategies to convince people.
  3. The 4 Disciplines of Execution by McChesney, Covey & Huling.
    Teaches how to build execution rigor in an organization. A bit long at the end, but the first section is excellent.
  4. Sprint by Jake Knapp.
    A detailed step-by-step guide to running product development sprints that you can follow blindly.
  5. How to Fail at Almost Everything and Still Win Big by Scott Adams.
    Dilbert’s author shares his strategies for life. Very readable, intelligent, and slightly provocative, but always interesting.
  6. The Five Dysfunctions of a Team by Patrick Lencioni.
    Written as a story (like The Goal). Talks about the 5 problems in teams and how to overcome them.
  7. The Culture Code by Daniel Coyle.
    Explains the elements of strong cultures – belongingness, shared vulnerability, and shared purpose.
  8. Data-Driven Storytelling by Nathalie Henry Riche et al.
    Shares the latest points of view on telling data stories. My team and I read these chapters as a group.
  9. Leaders Eat Last by Simon Sinek.
    Inspiring when I read it, but I don’t remember what it said.
  10. Deep Work by Cal Newport.
    Shares tactics to focus. Practical and useful.

I also started, but haven’t finished, these four:

  1. Hacking Growth by Sean Ellis & Morgan Brown
  2. The Laws of Human Nature by Robert Greene
  3. How to Fail at Almost Everything and Still Win Big by Scott Adams
  4. Stories at Work by Indranil Chakraborty

I read these 25 works of fiction — mostly by Brandon Sanderson (my current favorite author) and Brent Weeks.

  1. Lightbringer (Books 1-5) by Brent Weeks.
    In a world where color is woven as magic, the most powerful man is caught in politics. This series had enough twists and turns to keep me hooked till the end.
  2. Skyward (Books 1-2) by Brandon Sanderson.
    An outcast girl on an outcast planet becomes a fighter pilot with an alien spaceship. I love the way this is developing.
  3. The Wheel of Time (Books 1-6) by Robert Jordan.
    I picked it up again mainly because Brandon Sanderson wrote the last 3 books. It was great up to book 4 but has started dragging.
  4. Alcatraz Versus The Evil Librarians (Books 1-4) by Brandon Sanderson.
    The author lies to you. Literally. And tells you that he will, in almost every other paragraph. Great book for kids to laugh over.
  5. Night Angel (Books 1-3) by Brent Weeks.
    An assassin in a story that spans centuries of the history of magic.
  6. Legion (Books 1-3) by Brandon Sanderson.
    A detective with multiple split personalities that help him solve cases.
  7. Snapshot by Brandon Sanderson.
    What if you could create a snapshot of the world, enter it, interact with it, and solve crimes?
  8. The Art of Letting Go: Poetry for the Seekers by Sanhita Baruah.
    It’s my first poetry book. (I hate poetry.) I took this up to see if I could survive it, and get a fresh perspective. I survived.

… and these 10 comic books/series.

  1. Batman, Volume 1: The Court of Owls
  2. Batman, Volume 2: The City of Owls
  3. World War Hulk (1-5)
  4. Superman: Red Son (1-3)
  5. Flashpoint (1-5)
  6. Batman – The Long Halloween (1-13)
  7. Batman – The Killing Joke
  8. Kingdom Come (Vol 1-4)
  9. Spider-Man: Ends of the Earth
  10. Amazing Spider-Man, Vol. 1

At the moment, I’m at 45 books, with little hope of completing 5 more this month unless I pick up comics. So that’s exactly what I’m going to do 😉

Micro-notes

I maintain my (extensive) notes in text files. I’ve explored Evernote, OneNote, Google Keep, Apple Notes, and many other platforms. But text files work. I store them as Markdown and sync them on Dropbox.

They used to be relatively large files (50-100KB) each, on broad topics. For example:

  • todo.txt was a consolidated list of things I had to do
  • people.txt was a list of everything I knew about people (addresses, birthdays, etc)
  • towrite.txt was a list of everything I wanted to write about
  • notes.txt was where I tracked notes about any topics
  • … and more

This led to a couple of problems.

  1. Searching across files was hard. I wouldn’t remember if I wrote ideas for my next talk in todo.txt or towrite.txt, or if my meeting minutes were in notes.txt or todo.txt. I had to open each file and search.
  2. Files were getting too big. Editing them on mobile was harder. Scanning them was harder.

So I changed this system a few years ago into micro-notes. These files became a folder. For example, my notes/ folder looks like this:

  • time-management.txt has my time management notes
  • book-never-split-the-difference.txt has book notes on Never Split the Difference
  • eat-food-sleep-exercise-live-healthy.txt has notes on fitness

The folder has nearly 300 files. Here’s a glimpse of the latest files.

Similarly, my people/ folder has details of my discussions with various people I interact with — friends, colleagues & clients.

What made this change possible is Everything, a fast file search tool on Windows that lets me find files as I type. For example, if I’m looking for my notes on SlideSense, I just type “notes s” and it appears on the list.

I usually sort the files by run count (how often I opened them). That makes it easy to re-open the most used files.
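
The find-as-you-type lookup can be roughly approximated in a few lines of Python. This is only a sketch, not how Everything itself works: it assumes each space-separated term must appear somewhere in the file's path relative to a root folder, and find_files is a hypothetical helper name.

```python
from pathlib import Path


def find_files(root, query):
    """Return files under root whose relative path contains every query term."""
    terms = query.lower().split()
    root = Path(root)
    matches = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Match against the path relative to root, case-insensitively
        rel = path.relative_to(root).as_posix().lower()
        if all(term in rel for term in terms):
            matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    # Assumed layout: the current folder contains notes/, people/, etc.
    for match in find_files(".", "notes s"):
        print(match)
```

Unlike Everything, this walks the folder on every call, so it only suits small collections like a few hundred notes.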

It also makes it easier to edit these notes on mobile. I sync the folder on Dropbox, and use iA Writer to edit them while on walks. Dictation is pretty good, so I’ve been using that to take notes too.

Happiness generator

In my current thrust towards greater management responsibilities, I have discovered a mechanism for generating happiness.

I set up meetings on important topics. That makes me happy — I’m driving something useful.

Often, the meeting gets cancelled. That makes me happy — I’ve more free time.

It’s the perfect perpetual motion machine.

Dissecting my Airtel bills

My monthly postpaid mobile bills have been in the Rs 2,000 – Rs 3,000 range for some time now, and I spent a few hours dissecting them yesterday.

Page 3 had the good stuff. It’s a little hard to figure out, but what the last 2 columns say is that most of my spend is offset by discounts.

What’s not getting offset are outgoing roaming calls. Followed by calls to local landlines. For all practical purposes, that’s the only thing that counts in this bill. Everything else is close enough to zero.

It took me some time to figure out that Airtel postpaid has something called myPlan. Based on your plan, you get a set of “myPacks” or discounts. That determines your final bill. And it turns out that I was barely using my quota in some areas – specifically data. I had 3GB of data available. I was typically using 200MB – 500MB. The last 2 columns on page 2 show the usage of myPacks.

Clearly, I can do with less data, less SMS, less local mobile, and perhaps less STD mobile. I might need more outgoing roaming, but that’s about it. This means I need fewer myPacks. So I was able to switch to the Rs 799 plan from the Rs 999 plan, while simultaneously increasing the number of free outgoing roaming calls I can make.

There seems to be no myPack for incoming roaming, so I’m actually better off calling people if I’m travelling, rather than receiving calls!

The rest of the bill is a treasure-trove of data, listing every call and every pulse of data connection. I only wish it also had the location of the calls, and were available as CSV files.

An ambulance ride

I rode in an ambulance yesterday.

I’ve been in a few of these in the last 3 weeks, but yesterday was the first time I had to give directions, so I was paying a bit more attention to the traffic.

It’s remarkable how well Bangalore traffic responds to ambulances. Almost every single person gave way. (Not that this is easy. Merely slowing towards the left isn’t always effective if it ends up blocking the way. Many people were wise enough to give way at the appropriate place, and our flow was not impeded.)

Not everyone gave way, though.

The first was the driver of a small car. (I can’t identify vehicle models. This sort of looked like a Maruti.) He was driving right in front of us for a while, right in the middle of the road, without giving way on either side. After some time, when the road widened, we passed him. I noticed that he was wearing earphones. Based on his body posture, my guess was that he was listening to music — as opposed to speaking on the phone, for example. Clearly, he had not heard the siren nor the horn.

The second was a young girl riding a moped. (Scooty, or something like that.) The road was narrow. There was no chance of overtaking her. She did hear the siren, though, and tried her best to rush ahead of the ambulance. After about half a minute, the road widened again, and she gave way. My guess is that she was under 18, out on a ride on a relatively small and safe road, and had no idea what to do when an ambulance scared her noisily from behind.

The third was a bus driver, though the circumstances were different. The car ahead of us (a call taxi) gave way to the left. The bus was approaching us from the opposite side. The bus stopped right next to the car. Given the width of the road, the ambulance could not pass.

The taxi driver tried to guide the bus driver, telling him that he should move a bit forward to let us pass, but the bus driver (again, a very young chap) seemed frozen. He didn’t (or couldn’t) budge. The taxi driver in front of us started his car, moved 50 metres ahead, and let us pass.

The fourth caused the longest delay. A couple riding a bike were ahead of us. The road turned ahead. They tried to give way, skidded, and they fell right in the middle of the road.

They were shaken, but not hurt, thankfully. It took a couple of minutes for them to gather themselves and their vehicle (with assistance from a passerby) and give way. We moved on after confirming that they were unhurt.

I’m not sure what to make of this. In every case, the cause for delay was ignorance. Either not hearing, or not knowing what to do, or not knowing how to do it well. But it’s gladdening that the bulk of Bangalore is both knowledgeable and responsive to the needs of those in ambulances.

But: please don’t endanger yourselves while giving way.

Software I currently use

Every few years, I review the software I use. Here are some of my earlier lists.

Right now, among browsers, Chrome is my primary browser. What’s interesting is that IE 11 has overtaken Firefox in terms of usage. That’s partly because we’re working with Microsoft a lot, but also because Firefox has a number of weird bugs like IE6 used to have, and is slowly lagging in the race.

Next to browsers, I spend most of my time on the command prompt. I use Console2 for tabbed console windows. Given the number of command prompts I open, this is often necessary. I use bash in Cygwin as the default shell. Haven’t had the need for PowerShell.

The only text editor I use is Sublime Text 3. This is the only text editor I’ve used for the last 3 years. The only plugin I use is PlainTasks which I use as my todo list. I write my blog posts in Windows Live Writer, which makes blogging offline quite painless.

For image editing, I use PicPick to capture screens and do basic editing. Since I haven’t upgraded to Windows 8, I don’t have the snipping tool. But PicPick also lets me pick colors from the screen, which is pretty useful when copying designs. For slightly more serious editing like changing colours, adding annotations, etc., I use Paint.NET. It’s close enough to Photoshop for most practical uses. On rare occasions, I’ve needed the power of GIMP – especially to remove the background on images. But when even this fails me, it’s ImageMagick to the rescue, with inscrutable command line options that can morph Obama into Osama. If I want to edit icon files (to create favicons, for example), I use IcoFX. For vector graphics, I use Inkscape, which has a steep learning curve but doesn’t seem to have a good free alternative. To edit shapefiles, I use QGIS, and Shape Viewer to view them.

For music and movies, I’ve kept it simple: I use VLC. It lets me stream on to my iPad. I can also watch/stream movies as they are being downloaded via μTorrent – which is probably the coolest feature of any torrent client. I store all my music in one large folder, and keep .m3u playlists. These are rsynced periodically to my Android phone.

For audio editing, Audacity remains my best bet. However, for video editing, my needs have changed. It’s usually screen-recordings I need to create, so I don’t use VirtualDub much. I’ve moved from CamStudio to Microsoft Expression Encoder Screen Capture (a long name for a rather nice piece of software that works reasonably well).

To read books, I’ve started using Calibre, simply because it can read both ePub and .mobi formats. Since then, I’ve been using Kindle less. I continue to use my old copy of Microsoft Reader, even though the product is dead, because I have a lot of .lit files. (That’s one of the advantages of software over online services. Even if they pull the plug, you can use an old copy of the software. And it works!) To read PDF files, I use Foxit PDF Reader. On the extremely rare occasion that I need to print PDF files from software that does not support PDF printing, I use CutePDF Writer.

For file sharing, I use Dropbox. It’s simple, popular and just works. I tried BitTorrent Sync as a peer-to-peer alternative to Dropbox, but the interface has a long way to go before it’s usable. I do hope something emerges. For screen sharing, I use TeamViewer (which is fast) or join.me (which doesn’t require a client). Though I use Skype for calls, I don’t find its screen-sharing fast enough.

I play around with data a lot. This is mostly done in Python, for which I use Continuum’s Anaconda builds – they have most of the useful packages built-in. When I need to scrape JavaScript-based websites, I try CasperJS on top of PhantomJS. This is particularly handy for the several ASPX-based Government websites. I also have Node.js installed, but don’t really use it much.

I use RStudio as my R IDE. I’m experimenting with Tabula to see if it’s practical to extract PDF tables with it, but my current preference is to use xpdf to convert PDF to text and then process it. For data cleansing, there’s only one tool that I know that’s effective: OpenRefine. For network visualisations, I use Gephi, though NodeXL can do a small but useful subset of that within Excel.

For compression, I use 7-zip. The 7z format provides the best compression across most file types that I’ve seen, but even if you want to use ZIP files, 7-zip creates smaller ZIP files. For image compression, I use kraken.io, which offers the best compression I’ve seen. On the desktop, TruePNG and jpegoptim do the trick.

There are several small utilities I use. WinDirStat tells me how my hard disk space is used up, helping clean drives and Dropbox folders. ClipX lets you copy and keep multiple items in the clipboard. Restoration can undelete even permanently deleted files. TrueCrypt keeps files encrypted. PuTTY lets you connect via SSH if you don’t have Cygwin. But the mother of all tools is AutoHotkey, which I use for everything ranging from typing my signature to resizing windows to storing our conference bridge numbers.

I’ve a number of web servers on my system. I use XAMPP for Apache, MySQL and PHP, but also have nginx handy. But the simplest, easiest and smallest web server is perhaps Mongoose. Just run it in any folder to start a web server. python -mSimpleHTTPServer does the same for developers. I also have Fiddler installed as a proxy – partly to monitor what URLs my applications access, and partly to simulate slow speed connections for the web apps I build. Apart from MySQL in XAMPP, I have a few databases installed: SQL Server, SQLite and SQLite Studio to read the sqlite3 files.
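
A note for anyone on Python 3: the SimpleHTTPServer module was renamed to http.server, so the command line equivalent is python -m http.server. The same thing can be done from a script. Here is a minimal sketch, assuming Python 3.7 or later; serve_folder is a hypothetical helper name, and it binds to localhost on a free port.

```python
import functools
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_folder(directory, port=0):
    """Start a background static file server for `directory`; return the server."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    # port=0 asks the OS for any free port
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = serve_folder(".")
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    # Bypass any system proxy for this local request
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        print(url, response.status)
    server.shutdown()
    server.server_close()
```

This is meant for local development only; like Mongoose run in a folder, it serves whatever is in the directory.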

Of course, some of my apps have moved online, and my earlier post on the A-Z of my browsing history covers that. But there are a few applications that I’ve hosted which I must talk about. WordPress, which this blog runs on, is the primary one on the list. I also use GitLab as an internal alternative to GitHub, slideshare.net to share slides, and etherpad.mozilla.org to chat / collaborate on code. But the application that I spend the most time on is selfoss – an RSS reader, my replacement for the late beloved Google Reader.