If a site doesn’t have an RSS feed, your simplest option is to use Page2Rss, which gives a feed of what’s changed on a page.
My needs, sometimes, are a bit more specific. For example, I want to track new movies on the IMDb Top 250. They don’t offer a feed. I don’t want to track all the other junk on that page. Just the top 250.
There’s a standard called XPath. It can be used to search in an HTML document in a pretty straightforward way. Here are some examples:
//a | Matches all <a> links |
//p/b | Matches all <b> bold items in a <p> para. (the <b> must be immediately under the <p>) |
//table//a | Matches all links inside a table (the links need not be immediately inside the table — anywhere inside the table works) |
You get the idea. It’s like a folder structure. / matches a tag that’s immediately below. // matches a tag that’s somewhere below. You can play around with XPath using the Firefox XPath Checker add-on. Try it: it’s much easier to experiment than to read the documentation.
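To see the three patterns above without a browser add-on, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample markup is made up for illustration. ElementTree supports only a subset of XPath and needs well-formed XML, so each expression is written with a leading dot (.//a rather than //a). Real-world HTML would likely need a full XPath engine.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed sample page (invented for this example).
SAMPLE = """
<html><body>
  <p>Plain <b>bold</b> and <i><b>nested bold</b></i></p>
  <a href="/home">Home</a>
  <table><tr><td><span><a href="/deep">Deep link</a></span></td></tr></table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# //a : every link, anywhere in the document.
all_links = [a.text for a in root.findall(".//a")]

# //p/b : only <b> tags that sit directly under a <p>.
direct_bold = [b.text for b in root.findall(".//p/b")]

# //table//a : links at any depth inside a table.
table_links = [a.text for a in root.findall(".//table//a")]

print(all_links)
print(direct_bold)
print(table_links)
```

The "nested bold" text is not matched by //p/b because its <b> sits inside an <i>, not directly under the <p>.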
The following XPath matches the IMDb Top 250 exactly.
//tr//tr//tr//td[3]//a
(It’s a link inside the 3rd column in a table row in a table row in a table row.)
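As a rough illustration of how that expression walks nested tables, the sketch below runs it against a small invented page (three levels of table rows, with made-up movie names and links). It uses Python's xml.etree.ElementTree, which handles this particular expression when written with a leading dot. The actual IMDb markup is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Invented sample: a table inside a table inside a table.
SAMPLE = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td>1.</td><td>9.2</td><td><a href="/title/first">First Movie</a></td></tr>
      <tr><td>2.</td><td>9.1</td><td><a href="/title/second">Second Movie</a></td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""

root = ET.fromstring(SAMPLE)

# td[3] keeps only cells that are the third <td> of their row.
links = root.findall(".//tr//tr//tr//td[3]//a")
names = [a.text for a in links]

print(names)
```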
Now, all I need is to get something that converts that to an RSS feed. I couldn’t find anything on the Web, so I wrote my own XPath server. The URL:
www.s-anand.net/xpath?
url=http://www.imdb.com/chart/top&
xpath=//tr//tr//tr//td[3]//a
When I subscribe to this URL on Google Reader, I get to know whenever there’s a new movie on the IMDb Top 250.
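The URL above is shown unescaped for readability. Characters such as /, [ and ] in the query string are safer percent-encoded when you build the URL in code. Here is a small sketch of that in Python. The url and xpath parameter names come from the example above, while the build_feed_url helper and the http:// scheme are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical helper: the base address and scheme are assumed, not confirmed.
def build_feed_url(page_url, xpath, base="http://www.s-anand.net/xpath"):
    """Return the feed URL with both parameters percent-encoded."""
    return base + "?" + urlencode({"url": page_url, "xpath": xpath})

feed_url = build_feed_url("http://www.imdb.com/chart/top", "//tr//tr//tr//td[3]//a")
print(feed_url)

# Decoding the query string recovers the original values.
decoded = parse_qs(urlsplit(feed_url).query)
print(decoded["xpath"][0])
```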
This gives only the names of the movies, though, and I’d like the links as well. The XPath server supports this. It accepts a root XPath, and a bunch of sub-XPaths. So you can say something like:
This says three things:
//tr//tr//tr | Pick all rows in a row in a row |
title->./td[3]//a | For each row, set the title to the link text in the 3rd column |
link->./td[3]//a | … and the link to the link href in the 3rd column |
That provides a more satisfactory RSS feed, and it’s one that I’ve subscribed to, in fact. Another one that I track is the list of top seeded movies on mininova.
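The root-plus-sub-XPath idea described above can be sketched in a few lines of Python. This is not the code behind the XPath server (which is not shown here). It is an assumed reconstruction of the technique using xml.etree.ElementTree, with a hypothetical rows_to_rss function, an invented sample page, and a placeholder page address.

```python
import xml.etree.ElementTree as ET

# Invented sample page; the header row has no <td> cells, so it is skipped.
SAMPLE = """
<table><tr><td><table><tr><td><table>
  <tr><th>Rank</th><th>Rating</th><th>Title</th></tr>
  <tr><td>1.</td><td>9.2</td><td><a href="/title/first">First Movie</a></td></tr>
  <tr><td>2.</td><td>9.1</td><td><a href="/title/second">Second Movie</a></td></tr>
</table></td></tr></table></td></tr></table>
"""

def rows_to_rss(page, page_url, root_xpath, title_xpath, link_xpath):
    """Build an RSS 2.0 string: one <item> per node matched by root_xpath."""
    doc = ET.fromstring(page)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "XPath feed"
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = "Items matching " + root_xpath
    for row in doc.findall(root_xpath):
        title_node = row.find(title_xpath)  # sub-XPath, relative to the row
        link_node = row.find(link_xpath)
        if title_node is None or link_node is None:
            continue  # rows without a matching link are ignored
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "".join(title_node.itertext()).strip()
        ET.SubElement(item, "link").text = link_node.get("href", "")
    return ET.tostring(rss, encoding="unicode")

feed = rows_to_rss(
    SAMPLE,
    "http://example.com/chart",  # placeholder address
    ".//tr//tr//tr",             # root XPath: one feed item per row
    "./td[3]//a",                # title: link text in the 3rd column
    "./td[3]//a",                # link: href of the same link
)
print(feed)
```

The title and link use the same sub-XPath. The sketch reads the link text for one and the href attribute for the other, mirroring the title-> and link-> rules listed above.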
You can whip up more complex examples. Give it a shot. Start simple, with something that works, and move up to what you need. Use XPath Checker liberally. Let me know if you have any issues. Enjoy!
Have you ever thought about introducing authentication to the XPath server? I would like to parse certain fields of a page that is authenticated with cookies.
Sure, Rog. I’ve mailed it to you.
Any chance you could share your xpath.php code? It seems the server is no longer available.
Since Yahoo introduced the Yahoo Query Language, you’re much better off using that instead of my XPath utility. I’ve covered it in this article on client side scraping.
It would be great if you could share this code, I really like to use this server, but want to remove session ids from the links (so my reader doesn’t think all links are new every time).
Is that possible? 🙂
Is it possible to share your xpath.php code? Yahoo Pipes is going to be shut down 🙁