“Scraping” is extracting content from a website. It’s often used to build something on top of the existing content. For example, I’ve built a site that tracks movies on the IMDb 250 by scraping content.
There are libraries that simplify scraping in most languages:
- Perl: WWW::Mechanize
- Python: BeautifulSoup
- Ruby: Hpricot
- PHP: XPath (built-in)
- JavaScript: jQuery on env.js on Rhino
But all of these are on the server side. That is, the program scrapes from your machine. Can you write a web page where the viewer’s machine does the scraping?
Let’s take an example. I want to display Amazon’s bestsellers that cost less than $10. I could write a program that scrapes the site and gets that information. But since the list updates hourly, I’ll have to run it every hour.
That may not be so bad. But consider Twitter. I want to display the latest iPhone tweets from http://search.twitter.com/search.atom?q=iPhone, but the results change so fast that your server can’t keep up.
Nor do you want it to. Ideally, your scraper should just be JavaScript on your web page. Any time someone visits, their machine does the scraping. The bandwidth is theirs, and you avoid the popularity tax.
This is quite easily done using Yahoo Query Language. YQL converts the web into a database. All web pages are in a table called html, which has 2 fields: url and xpath. You can get IBM’s home page using:
select * from html where url="http://www.ibm.com"
Try it at Yahoo’s developer console. The whole page is loaded into the query.results element. This can be retrieved using JSONP. Assuming you have jQuery, try the following on Firebug. You should see the contents of IBM’s site logged in the console.
$.getJSON(
  'http://query.yahooapis.com/v1/public/yql?callback=?',
  {
    q: 'select * from html where url="http://www.ibm.com"',
    format: 'json'
  },
  function(data) {
    console.log(data.query.results);
  }
);
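To see what that call actually asks Yahoo for, here is a rough sketch of the request URL it boils down to. The helper name buildYqlUrl and the callback name handleYql are made up for illustration; jQuery generates its own callback name in place of the "?" and may encode spaces differently, so treat this as an equivalent request rather than the exact one jQuery sends. It also assumes the endpoint in the post is still answering.

```javascript
// Hypothetical helper (not from the post or from jQuery): builds a YQL
// request URL equivalent to the one the $.getJSON call produces.
function buildYqlUrl(query, callbackName) {
  var base = 'http://query.yahooapis.com/v1/public/yql';
  var params = [
    'q=' + encodeURIComponent(query), // the YQL statement, URL-encoded
    'format=json'                     // ask for JSON instead of XML
  ];
  if (callbackName) {
    // With a callback parameter the response is wrapped in a function call,
    // which is what makes it JSONP and loadable via a <script> tag.
    params.push('callback=' + encodeURIComponent(callbackName));
  }
  return base + '?' + params.join('&');
}

// 'handleYql' is an assumed name; jQuery would substitute a generated one.
console.log(buildYqlUrl('select * from html where url="http://www.ibm.com"', 'handleYql'));
```

Pasting the printed URL into a script tag is, in effect, what jQuery does behind the scenes for a cross-domain getJSON call.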
That’s it! Now, it’s pretty easy to scrape, especially with XPath. To get the links on IBM’s page, just change the query to
select * from html where url="http://www.ibm.com" and xpath="//a"
Or to get all external links from IBM’s site:
select * from html where url="http://www.ibm.com" and xpath="//a[not(contains(@href,'ibm.com'))][contains(@href,'http')]"
Now you can display this on your own site, using jQuery.
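Here is a minimal sketch of that display step. The function names (htmlQuery, toLinkList, renderLinks) and the #links element are hypothetical, and the result shape is an assumption: I am assuming query.results for an //a query looks like { a: [ { href: ..., content: ... } ] }, with a single object instead of an array when only one link matches. Check the real response in the console before relying on it.

```javascript
// Build a YQL statement for the html table. The xpath must not contain
// double quotes, since it is wrapped in them here.
function htmlQuery(url, xpath) {
  var q = 'select * from html where url="' + url + '"';
  if (xpath) {
    q += ' and xpath="' + xpath + '"';
  }
  return q;
}

// Escape text before putting scraped content into our own page.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Normalise the ASSUMED result shape into [{ href, text }, ...].
function toLinkList(results) {
  // [].concat turns a single object into a one-element array.
  var anchors = results && results.a ? [].concat(results.a) : [];
  var links = [];
  for (var i = 0; i < anchors.length; i++) {
    var a = anchors[i];
    if (a && a.href) {
      // Fall back to the URL when there is no plain-text link label.
      var text = typeof a.content === 'string' ? a.content : a.href;
      links.push({ href: a.href, text: text });
    }
  }
  return links;
}

// Turn the link list into an HTML fragment.
function renderLinks(links) {
  var items = [];
  for (var i = 0; i < links.length; i++) {
    items.push('<li><a href="' + escapeHtml(links[i].href) + '">' +
               escapeHtml(links[i].text) + '</a></li>');
  }
  return '<ul>' + items.join('') + '</ul>';
}

// Made-up sample data, standing in for data.query.results.
var sample = { a: [{ href: 'http://example.com/', content: 'Example' }] };
console.log(renderLinks(toLinkList(sample)));

// In the browser, with jQuery loaded and a <div id="links"> on the page:
// $.getJSON('http://query.yahooapis.com/v1/public/yql?callback=?',
//   { q: htmlQuery('http://www.ibm.com', '//a'), format: 'json' },
//   function (data) {
//     $('#links').html(renderLinks(toLinkList(data.query.results)));
//   });
```

Escaping matters here because the content comes from someone else's site and ends up inside your own page.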
This leads to interesting possibilities, such as Map-Reduce in the browser. Here’s one example. Each movie on the IMDb (e.g. The Dark Knight) comes with a list of recommendations (like this). I want to build a repository of recommendations based on the IMDb Top 250. So here’s the algorithm. First, I’ll get the IMDb Top 250 using:
select * from html where url="http://www.imdb.com/chart/top" and xpath="//tr//tr//tr//td[3]//a"
Then I’ll get a random movie’s recommendations like this:
select * from html where url="http://www.imdb.com/title/tt0468569/recommendations" and xpath="//td/font//a[contains(@href,'/title/')]"
Then I’ll send off the results to my aggregator.
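Roughly, each visitor's browser does one small "map" step, and the aggregator does the "reduce". The sketch below is not the code from build-reco.js; it is a guess at the shape of the idea. All names are hypothetical: fetchYql stands for whatever function runs a YQL query and hands back the JSON, sendToAggregator stands for whatever posts results to the server, and the result shape ({ a: [ { href: ... } ] }) is an assumption.

```javascript
// Pick one item at random; `random` can be injected to make runs repeatable.
function pickRandom(items, random) {
  var r = (random || Math.random)();
  return items[Math.floor(r * items.length)];
}

// Pull the IMDb title id (e.g. tt0468569) out of a link.
function titleId(href) {
  var m = /\/title\/(tt\d+)/.exec(href || '');
  return m ? m[1] : null;
}

// The recommendations query from the post, for a given title id.
function recommendationQuery(id) {
  return 'select * from html where url="http://www.imdb.com/title/' + id +
    '/recommendations" and xpath="//td/font//a[contains(@href,\'/title/\')]"';
}

// One "map" step: choose a movie, fetch its recommendations, report them.
// topMovies is a list of { href } objects, e.g. from the Top 250 query.
function harvestOne(topMovies, fetchYql, sendToAggregator, random) {
  var movie = pickRandom(topMovies, random);
  var id = titleId(movie && movie.href);
  if (!id) {
    return;
  }
  fetchYql(recommendationQuery(id), function (data) {
    var results = data && data.query && data.query.results;
    var anchors = results && results.a ? [].concat(results.a) : [];
    var recos = [];
    for (var i = 0; i < anchors.length; i++) {
      var rid = titleId(anchors[i].href);
      // Skip non-title links, the movie itself, and duplicates.
      if (rid && rid !== id && recos.indexOf(rid) === -1) {
        recos.push(rid);
      }
    }
    sendToAggregator({ movie: id, recommendations: recos });
  });
}

// Dry run with a fake fetcher and made-up links, so nothing hits the network.
harvestOne(
  [{ href: '/title/tt0468569/' }],
  function (query, done) {
    done({ query: { results: { a: [{ href: '/title/tt0000001/' }] } } });
  },
  function (payload) { console.log(JSON.stringify(payload)); }
);
```

Because every visitor picks a movie at random, coverage of the full list builds up over many visits without any one browser doing much work.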
Check out the full code at http://250.s-anand.net/build-reco.js.
In fact, if you visited my IMDb Top 250 tracker, you already ran this code. You didn’t know it, but you just shared a bit of your bandwidth and computation power with me. (Thank you.)
And, if you think a little further, here’s another way of monetising content: by borrowing a bit of the user’s computation power to run complex tasks. There already are startups built around this concept.
Great!
Thank u anand. Nice Information.
Awesome… This is the future
Many thanks for your efforts in simplifying the ‘scraping’.
Hey Anand,
Thanks. This post is really useful. I have managed to scrape my Amazon wishlist using this technique (http://www.venkatsworld.com/WIP/JSON_amazon.html). However, YQL seems to take some time to respond. I am not sure if you experienced this. I did a similar exercise with the Picasa JSON feed and a jQuery plug-in for image scrolling. It works well there too.
Interesting article! There is another powerful scraping technology available that offers Javascript, jQuery, CSS, and XPath instead of XPath-only. It’s called Bobik (http://usebobik.com). There is a cool example of scraping restaurant menus using Bobik at http://news.ycombinator.com/item?id=4066478.
Hey Anand,
Great article. Just one query, though: in the client-side implementation, is it the client’s IP that hits the website (in your case http://www.ibm.com), or will the request come from Yahoo’s server IP?