Scraping RSS feeds using XPath

If a site doesn’t have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what’s changed on a page.

My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don’t offer a feed. I don’t want to track all the other junk on that page. Just the top 250.

There’s a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples:

//a	Matches all <a> links
//p/b	Matches all <b> bold items in a <p> paragraph (the <b> must be immediately under the <p>)
//table//a	Matches all links inside a table (the links need not be immediately inside the table — anywhere inside the table works)

You get the idea. It’s like a folder structure: / matches a tag that’s immediately below, and // matches a tag that’s anywhere below. You can play around with XPath using the Firefox XPath Checker add-on. Try it. It’s much easier to experiment than to read the documentation.
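If you would rather experiment in code, here is a minimal sketch of the three expressions above using Python's standard xml.etree.ElementTree module. Some assumptions: the sample page is made up, ElementTree only supports a subset of XPath and needs well-formed markup (real-world HTML usually calls for a more forgiving parser), and its paths are relative, so each expression starts with a leading dot.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed sample page (made up for illustration).
PAGE = """
<html><body>
  <p>Intro with <b>bold</b> text and <i><b>nested bold</b></i>.</p>
  <a href="/home">Home</a>
  <table>
    <tr><td><a href="/one">One</a></td></tr>
    <tr><td><span><a href="/two">Two</a></span></td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(PAGE)

# //a : every link, anywhere in the document
all_links = [a.text for a in root.findall(".//a")]

# //p/b : only <b> elements that sit immediately under a <p>
direct_bold = [b.text for b in root.findall(".//p/b")]

# //table//a : links at any depth inside a table
table_links = [a.text for a in root.findall(".//table//a")]

print(all_links)    # ['Home', 'One', 'Two']
print(direct_bold)  # ['bold']
print(table_links)  # ['One', 'Two']
```

Note that the bold text wrapped in <i> is not matched by //p/b, because it is not immediately under the <p>.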

The following XPath matches the IMDb Top 250 exactly.

//tr//tr//tr//td[3]//a

(It’s a link inside the 3rd column in a table row in a table row in a table row.)

Now, all I need is to get something that converts that to an RSS feed. I couldn’t find anything on the Web, so I wrote my own XPath server. The URL:

www.s-anand.net/xpath?
url=http://www.imdb.com/chart/top&
xpath=//tr//tr//tr//td[3]//a

When I subscribe to this URL on Google Reader, I get to know whenever there’s a new movie on the IMDb Top 250.
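The URL above is shown split across lines and unencoded for readability. As a rough sketch, this is how such a URL could be assembled with proper percent-encoding in Python. The function name build_feed_url and the http:// scheme on the server address are assumptions for illustration, and whether the server expects encoded parameters is also an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_feed_url(page_url, xpath, server="http://www.s-anand.net/xpath"):
    """Join the server address with percent-encoded url and xpath parameters.

    build_feed_url is a hypothetical helper; the default server value
    assumes an http:// scheme in front of the address shown in the post.
    """
    return server + "?" + urlencode({"url": page_url, "xpath": xpath})

feed_url = build_feed_url("http://www.imdb.com/chart/top",
                          "//tr//tr//tr//td[3]//a")

# Decoding the query string gives back the original values.
params = parse_qs(urlsplit(feed_url).query)
print(feed_url)
print(params["xpath"][0])
```

Encoding matters because characters such as /, [ and ] in the XPath, and any & in the page URL, could otherwise be misread as part of the outer URL.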

This gives only the names of the movies, though, and I’d like the links as well. The XPath server supports this. It accepts a root XPath, and a bunch of sub-XPaths. So you can say something like:

xpath=//tr//tr//tr title->./td[3]//a link->./td[3]//a/@href

This says three things:

//tr//tr//tr	Pick all rows in a row in a row
title->./td[3]//a	For each row, set the title to the link text in the 3rd column
link->./td[3]//a/@href	… and the link to the link href in the 3rd column
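The root XPath plus sub-XPaths idea can be sketched in a few lines of Python. This is not the server's actual code, just an illustration under stated assumptions: the sample table and movie names are made up, the helper names extract_items and to_rss are hypothetical, and because the standard library's ElementTree cannot select @href directly, the href is read from the matched <a> element instead.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a table nested three levels deep.
PAGE = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td>1.</td><td>x</td><td><a href="/title/a/">Movie A</a></td></tr>
      <tr><td>2.</td><td>x</td><td><a href="/title/b/">Movie B</a></td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

def extract_items(page, root_path, link_path):
    """Apply the root path, then resolve title and link relative to each row."""
    root = ET.fromstring(page)
    items = []
    for row in root.findall(root_path):
        a = row.find(link_path)
        if a is None:
            continue  # row has no link in that column
        title = "".join(a.itertext()).strip()
        # Stand-in for the /@href step, which ElementTree does not support.
        items.append({"title": title, "link": a.get("href", "")})
    return items

def to_rss(items, feed_title, feed_link):
    """Wrap the extracted items in a bare-bones RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_title
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")

items = extract_items(PAGE, ".//tr//tr//tr", "./td[3]//a")
feed = to_rss(items, "Top movies (sample)", "http://www.imdb.com/chart/top")
print(items)
print(feed)
```

Resolving the sub-paths relative to each row (note the leading ./) is what keeps each title paired with its own link.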

That provides a more satisfactory RSS feed, one that I’ve subscribed to, in fact. Another one that I track is the mininova top seeded movies category.

You can whip up more complex examples. Give it a shot. Start simple, with something that works, and move up to what you need. Use XPath Checker liberally. Let me know if you have any issues. Enjoy!

Mark

December 17, 2007 at 12:00 pm

Have you ever thought about introducing authentication to the XPath server? I would like to parse certain fields of a page that is authenticated with cookies.

S Anand

October 28, 2008 at 1:43 am

Sure Rog. I’ve mailed it to you

Rog

October 28, 2008 at 1:07 am

Any chance you could share your xpath.php code? It seems the server is no longer available.

March 7, 2009 at 10:35 am

Post Yahoo’s introduction of Yahoo Query Language, you’re much better off using that instead of my XPath utility. I’ve covered it in this article on client side scraping.

Pingback: Scraping your way to RSS Feeds! « Technosiastic!

Bart P

March 3, 2012 at 11:48 am

It would be great if you could share this code. I would really like to use this server, but I want to remove session ids from the links (so my reader doesn’t think all links are new every time).

Is that possible?

Ben

June 11, 2015 at 12:46 am

Is it possible to share your xpath.php code? Yahoo Pipes is going to be shut down.
