Given the blazing speed of Node.js these days, I expected HTML parsing to be faster in Node than in Python.
So I compared lxml and htmlparser2 — among the fastest HTML parsing libraries in Python and JavaScript respectively — on the Reddit home page (~700KB).
- lxml took ~8.6 milliseconds
- htmlparser2 took ~14.5 milliseconds
Looks like lxml is much faster. I'm likely to stick with Python for pure HTML parsing (without JavaScript execution) for a while longer.
In [1]: from lxml.html import parse
In [2]: %timeit tree = parse('reddit.html')
8.69 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
const { Parser } = require("htmlparser2");
const { DomHandler } = require("domhandler");
const fs = require("fs");

const html = fs.readFileSync("reddit.html", "utf8");
const handler = new DomHandler(function (error, dom) {
  if (error) throw error;
});

const start = Date.now();
for (let i = 0; i < 10; i++) {
  // The handler must be passed to the Parser, or no DOM is built
  const parser = new Parser(handler);
  parser.write(html);
  parser.end();
}
const end = Date.now();
console.log((end - start) / 10); // average ms per parse
Note: If I run the htmlparser2 loop 100 times instead of 10, it averages only ~7ms per parse. The more iterations, the faster each parse gets — I guess V8's JIT kicks in on repeated runs. But I'm only interested in the first iteration, since I'll be parsing each file just once.
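One way to see this warm-up effect directly is to time each iteration separately rather than averaging. A rough sketch — using a stand-in workload (`JSON.parse` of a large string) where the post would use a fresh htmlparser2 parse per iteration:

```javascript
// Time each iteration separately to expose JIT warm-up.
// Stand-in workload: parsing a large JSON string; substitute
// the htmlparser2 parser.write(html); parser.end() calls here.
const big = JSON.stringify(Array.from({ length: 100000 }, (_, i) => ({ i })));

const timings = [];
for (let i = 0; i < 10; i++) {
  const start = process.hrtime.bigint();
  JSON.parse(big); // replace with the actual parse under test
  const end = process.hrtime.bigint();
  timings.push(Number(end - start) / 1e6); // elapsed time in ms
}
console.log(timings); // compare the first timing against the later ones
```

Printing the per-iteration timings makes it obvious which number you actually care about: the cold first parse, not the warmed-up average.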