Web Scraping
Part I
While searching the web for something useful about “web scraping”, I found so little that I decided to write something myself. I know “web scraping” (or “screen scraping” in general) is a disgusting technique and I have to admit: it usually makes me puke.
But, well, there are times when you have no other choice (or even worse: you have a choice, but the alternative is even more horrible).
After several web scraping projects I want to put together some of the results and dead ends. The following examples are shown in PHP (I’m using version 5.1.4) and Tcl (8.4.2 and tdom 0.8.0). As far as I know, similar techniques can be used in other languages (Ruby, for example).
But first of all a …
WARNING
Before starting to scrape something off the web, be sure there is no better way. Often you can find an official API that should be used instead (e.g. through Web Services), or other services that deliver the information you need.
Moreover, make sure that web scraping is at least not forbidden. Some big sites state in their terms and conditions that scraping is not allowed. You should respect that. Furthermore, be aware that your requests add to the load of the target site. Always keep in mind that you are retrieving information in a way that’s surely not intended by the site’s operators. So be friendly and don’t make too many requests.
If you take content from other sites without the permission of its creators, you may, depending on how you use that content, violate copyright law.
Having said that, we start with the simplest method.
Regular expressions
That’s always the first method mentioned when somebody talks about analyzing text (and that’s what you do when you scrape a website). Though this is feasible for grabbing a specific piece of text from a page, you end up in hell if you want more.
So let’s look at a small example where a regular expression is enough. We want to extract the current value of the DAX. Have a look at http://boerse.ftd.de/ftd/kurse_listen.htm.
There is certainly some webservice to retrieve this kind of data. But as I wanted to make a really simple example, let’s assume there is no way around scraping.
We focus on the table row “DAX” and the column “Punkte”. The current value should be extracted, so let’s look at the relevant extract of the HTML code:
... <a href="...">DAX</a></td> <td class="....">5.560,13</td> ...
HTML-Code 1
By the way, the DAX value is from July 25, 2006. This part can be extracted simply by:
<?php
$url = 'boerse.html'; // or 'http://boerse.ftd.de/ftd/kurse_listen.htm'
$html = file_get_contents($url);
// match the linked text "DAX", skip to the next cell and capture its content
$regexp = '/>DAX<\/a><\/td>.*?<td[^>]*>(.*?)<\/td>/s';
if (preg_match_all($regexp, $html, $hit) && count($hit[1]) == 1) {
    print 'Found DAX: '.$hit[1][0];
} else {
    print 'Error! Retrieved '.count($hit[1]).' matches!';
}
?>
- PHP-Code 1 -
set f [open boerse.html r]; set html [read $f]; close $f
# or:
# package require http
# set token [::http::geturl "http://boerse.ftd.de/ftd/kurse_listen.htm"]
# set html [::http::data $token]
set regexp {>DAX</a></td>.*?<td[^>]*>(.*?)</td>}
# -all -inline returns the complete match plus the captured subexpression,
# so exactly one hit yields a list of two elements
set l [regexp -all -inline $regexp $html]
if {[llength $l] == 2} {
    puts "Found DAX: [lindex $l 1]"
} else {
    puts "Error! Retrieved [expr {[llength $l] / 2}] matches"
}
- Tcl-Code 1 -
To make testing easier, I usually store the page locally. With file_get_contents you can simply switch from the locally stored file to the web address (as far as I know there is nothing that easy in Tcl to switch between file and URL). As long as you’re still working out the correct regular expression for the match, you should definitely work with a locally stored HTML file.
Make sure that the pattern matches only once, or you might retrieve the wrong part of the page. To ensure this, the pattern contains a bit of the surrounding tags, on the assumption that there is only one linked text “DAX” in a table cell, with the next cell containing a number.
Furthermore, the regexp in PHP has the modifier /s (treat the string as a single line, so that the dot also matches newlines), because the text to match stretches over multiple lines and I simply wanted to ignore that. In Tcl no extra switch is needed for this, since the dot matches newlines unless -line is given; the switch -inline only makes regexp return the matched strings instead of a count.
Because of unexpected and surely unannounced changes to the page (at least unannounced to you as a “nearly” anonymous scraper), make sure that you check for the right data. If the pattern doesn’t match, something has definitely changed and you have to look at the HTML code again. If the pattern matches more than once, that is wrong, too. Therefore I always use preg_match_all (or in Tcl the switch -all).
Well, this was easy, and in fact I wouldn’t call it “web scraping”. If you want to scrape more than a single number or word in a fixed text pattern from a page, forget about regular expressions.
We need something more powerful, something which can be used on nested structures. Have you ever tried to match paired "<div>...</div>" tags with regular expressions? No way!
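To make the problem concrete, here is a small sketch of what happens with nested tags. The sample markup and the variable names are made up for illustration; the point is only that a non-greedy pattern pairs the outer opening tag with the inner closing tag.

```php
<?php
// Hypothetical sample markup: one <div> nested inside another.
$html = '<div>outer <div>inner</div> tail</div>';

// Non-greedy match: stops at the FIRST closing tag it finds.
preg_match('/<div>(.*?)<\/div>/s', $html, $lazy);

// Greedy match: runs to the LAST closing tag in the whole string.
preg_match('/<div>(.*)<\/div>/s', $html, $greedy);

print "lazy:   ".$lazy[1]."\n";
print "greedy: ".$greedy[1]."\n";
```

The non-greedy capture ends in the middle of the outer block ("outer <div>inner"). The greedy variant happens to look right for this one string, but it swallows everything between the first opening and the last closing tag as soon as the page contains two sibling blocks.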
Part II
A more powerful way than regular expressions? Hardly imaginable? Small mind!
DOM
DOM is for correctly structured XML-like data only? Oh no, there is more. At least in PHP you can use the usual DOMDocument. And as far as I know even “Internet Explorer” somehow handles badly formatted HTML, and it uses a DOM representation internally. So there are “convert bad-bad-bad HTML to DOM” tools out there.
Let’s start with another simple example. We want to find out how long a search on MSN takes.
First we have to feed the HTML into the DOMDocument (let’s search for “scraping”). To get the URL, just go to the website, enter “scraping” and copy the resulting URL into the code.
$url = 'http://search.msn.de/results.aspx?q=scraping';
$html = file_get_contents($url);
// create DOM from bad HTML (loadHTML is called on an instance)
$dom = new DOMDocument();
if ($dom->loadHTML($html)) {
    // go on with parsing
}
- PHP-Code 2 -
package require tdom
package require http
set url "http://search.msn.de/results.aspx?q=scraping"
set token [::http::geturl $url]
set html [::http::data $token]
# create DOM from bad HTML
if {![catch {dom parse -html $html} dom]} {
set root [$dom documentElement]
# go on with parsing
}
- Tcl-Code 2 -
You will get tons of warnings from the method loadHTML. As we know that this is badly formatted HTML, we will silently ignore those.
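One way to keep those warnings off the screen without throwing them away is libxml's internal error buffer. The following is a minimal sketch, assuming a PHP build with the DOM and libxml extensions; the sample markup is invented for illustration and is deliberately sloppy.

```php
<?php
// Deliberately sloppy sample markup (invented for this sketch):
// the <span> and <h5> tags are never closed.
$html = '<html><body><div id="header"><h5>Seite 1'
      . '<span id="rendertime">(0,10 Sekunden)</div></body></html>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Fetch what the parser complained about, then empty the buffer
// and restore the previous setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

print 'Loaded: '.($loaded ? 'yes' : 'no')."\n";
print 'Parser complaints: '.count($errors)."\n";
```

How many complaints are collected depends on the libxml version, so the count is only printed here, not relied upon. The document itself is usable either way.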
If we got a DOM object, we start to parse the HTML. We do this with XPath. After analyzing the HTML code of the result page you can find this specific text (newlines inserted for clarity):
<div id="header">
<h1>Webergebnisse</h1>
<h5>Seite 1 von 307 Ergebnissen, die
<strong>webscraping</strong> enthalten
<span id="rendertime">(0,10 Sekunden)</span>
</h5>
</div>
- HTML-Code 2 -
To find the duration of the search, we simply have to get the span tag with the id rendertime.
$xpath = new DOMXPath($dom);
// get the span-tag with id=rendertime
$queryTime = "//span[@id='rendertime']";
$nodeTimeList = $xpath->query($queryTime);
if ($nodeTimeList && $nodeTimeList->length == 1) {
print 'Query took: '.$nodeTimeList->item(0)->nodeValue;
// further queries ... see below
} else {
// something went wrong, get errormessage from page
$nodeMsgList = $xpath->query("//h1[@id='notification']");
if ($nodeMsgList && $nodeMsgList->length == 1) {
// print error-message from msn
print 'Sorry:'.$nodeMsgList->item(0)->nodeValue;
} else {
print 'Error:Got '.($nodeTimeList?$nodeTimeList->length:0).' nodes.';
}
}
- PHP-Code 3 -
if {![catch {$root selectNodes {//span[@id='rendertime']}} nodeTimeList]
&& [llength $nodeTimeList] == 1} {
puts "Query took: [[$nodeTimeList firstChild] nodeValue]"
# further queries ... see below
} else {
# something went wrong, get errormessage from page
if {![catch {$root selectNodes {//h1[@id='notification']}} nodeMsgList]
&& [llength $nodeMsgList] == 1} {
puts "Sorry: [$nodeMsgList asText]"
} else {
set noNodes [expr {[info exists nodeTimeList]?[llength $nodeTimeList]:0}]
puts "Error:Got $noNodes nodes"
}
}
- Tcl-Code 3 -
With the XPath expression //span[@id='rendertime'] we get all span tags with the id rendertime, and because it is an id there really should be only one. But you never know. The search might give no results. In that case the query yields no usable node list, so we check that the list exists and that it contains exactly one element ($nodeTimeList->length == 1). You should always check completely whether your results exactly meet your expectations.
If the search doesn’t return results, a check for the tag <h1 id="notification">...Message...</h1> is done and the message is printed.
You may ask yourself: “Why haven’t we used the method getElementById?” This would return the node directly. But have a close look at this method. As mentioned in the documentation, you have to call validate() first. You wouldn’t expect that rubbish to validate, would you?
Now let’s print the list of search results. We fill the gap marked // further queries ... see below.
Looking through the HTML code we find (newlines inserted for clarity):
<div id="results">
<h2>Ergebnisse</h2>
<ul>
<li>
<h3><a href="...">TITLE</a></h3>
<p>DESCRIPTION</p>
<ul>
<li class="first">...</li>
</ul>
</li>
<li>
<h3><a href="...">...
...
- HTML-Code 3 -
By now we would have to do complex parsing with regular expressions; with XPath we simply ask for these nodes: //div[@id='results']//h3/a. The script looks like this:
$nodeHitList = $xpath->query("//div[@id='results']//h3/a");
foreach ($nodeHitList as $node) {
print $node->nodeValue;
}
- PHP-Code 4 -
foreach node [$root selectNodes {//div[@id='results']//h3/a}] {
puts [$node asText]
}
- Tcl-Code 4 -
Could it be shorter and cleaner? I guess not. Maybe we could add some error checking again? I will leave this as an exercise for you.
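One possible take on that error checking is sketched below. The function name extractTitles and the sample markup are hypothetical; the checks simply mirror the ones used for the rendertime query, and parser warnings are kept quiet through libxml's internal error buffer.

```php
<?php
// Hypothetical helper: returns the link texts of all hits,
// or null if the page does not look the way we expect.
function extractTitles($html)
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $ok = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$ok) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//div[@id='results']//h3/a");
    // query() returns false for a broken expression; an empty list means
    // the expected structure is missing from the page.
    if ($nodes === false || $nodes->length == 0) {
        return null;
    }

    $titles = array();
    foreach ($nodes as $node) {
        $titles[] = trim($node->nodeValue);
    }
    return $titles;
}

// Sample markup modelled on the result list (invented content).
$sample = '<div id="results"><h2>Ergebnisse</h2><ul>'
        . '<li><h3><a href="#">First hit</a></h3><p>Text</p></li>'
        . '<li><h3><a href="#">Second hit</a></h3><p>Text</p></li>'
        . '</ul></div>';

$titles = extractTitles($sample);
print $titles === null
    ? "Error: no result list found\n"
    : implode("\n", $titles)."\n";
```

Returning null instead of an empty array is a design choice made for this sketch: it forces the caller to treat "the page layout changed" as an error rather than as "zero hits".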
A word about the User-Agent
The way I retrieve the pages in the examples is surely the simplest one. When using file_get_contents, PHP doesn’t send a User-Agent string with the request. Retrieving the URL in Tcl with geturl sends the User-Agent “Tcl http client package …”, which can be changed with:
::http::config -useragent "lala"
In PHP you can use a full-blown HTTP reader like HTTP_Request if you want to do fancier things like setting the User-Agent or retrieving the pages through a proxy.
Setting the User-Agent might be necessary because the target page checks which browser is used, and “tcl client” is probably not among the most widely used “browsers”.
But as stated in the warning at the beginning, you should be honest and friendly toward the scraped site and identifying yourself as a “scraper” is one way to do that.
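For plain file_get_contents calls, PHP's stream contexts are a lighter way to send an honest User-Agent. The sketch below only builds the context and inspects it, so it runs without network access; the agent string and the contact address are placeholders, not a recommendation for specific values.

```php
<?php
// Placeholder identification string; put your own project and contact here.
$agent = 'example-scraper/0.1 (contact: you@example.org)';

// HTTP context options apply to file_get_contents() on http:// URLs.
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => $agent,
        'timeout'    => 10, // seconds to wait before giving up
    ),
));

// Inspect the options instead of firing a real request in this sketch.
$options = stream_context_get_options($context);
print 'Will identify as: '.$options['http']['user_agent']."\n";

// A real request would then look like this:
// $html = file_get_contents('http://search.msn.de/results.aspx?q=scraping', false, $context);
```

Passing the context as the third argument keeps the setting local to that one request, which is handy if the same script talks to several sites.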
Conclusions
If I find some time I will add some chapters about sessions (e.g. if you’d like to get your bank balance automatically), SSL and maybe even some warnings about JavaScript.
But for the time being I’ll leave it as is. Unless someone wants to improve this pidgin English (I’m always glad if someone corrects me, so please don’t hesitate to mail me all errors).
References
As said in the beginning, there is not much information around for this subject.
Professional screen-scraping software:
Wonderful article, very clean, easy to read and very educational.
Just a comment about the Tcl code: it seems you forgot to mention that selectNodes applies only to a domNode and not to a domDocument.
So your example should be:
…[catch {dom parse -html $html} dom]…
>>> set root [$dom documentElement]
… $root selectNodes …
Bye,
Robert
Thanks for your hint, Robert. And you are quite right (and so I changed the examples). Nevertheless, running tdom 0.8.0 it even works to selectNodes from the domDocument-object.
Thanks for a good article. I wrote a screen scraper a short while ago, but it was dedicated to a single web database. If you are interested I can send you the code!
I was using the technique you demonstrated in PHP, but I was reading the links in the first part and scraping the names and addresses in the second half.
Thanks for the feedback.
Yes, go ahead, maybe I can use some of that for the third part of the tutorial!
Nice article. I’m a little new to PHP so I’ll assume compatibility with PHP 4 and PHP 5. Your English is MUCH better than my Deutsch.
Let me know when I might be able to read about the bank account and SSL info.
Hi Stefan,
Cool stuff! Your article on web scraping is very helpful.
I have been looking for a Java web scraping API which is as simple to use as WebZinc ( http://www.webzinc.com/ ) for .NET (VB and C#). Are there really no Java APIs available? Or is it possible to use the WebZinc API with Java (I know, a very difficult question, and doing this needs a lot of very special technical hacker know-how)?
Dominik
Thanks for your comment. I don’t know WebZinc, but you should be able to wrap the API with JNI (you can find some tutorials for the Java Native Interface on the Sun website).
Good luck
Hi. Thanks for this article. While Perl articles on web scraping abound (there’s even a module!) there’s nothing on PHP, so this is good to have. But your technique seems to be so nitty gritty about one website that it cannot be used for generic scraping? Is there any way we can use something that’ll work on any website?
Thanks for your feedback, Erick. I think that generic scraping is a problem in itself. Does the perl-module do “generic scraping”? But what is the “something” that you want to do on any website? Is there something that could be done similar on any website?
Don’t know if this is useful at all, but along with Kapow and Screen Scraper (in the professional category), there is also Mozenda (http://www.mozenda.com), which actually shows you the XPath for each element on the page and allows you to create actions based on the XPath. You can also do RegEx stuff if that’s your preference, but it’s nice to be able to mess with XPath.
That’s a cool tool. Not so much professional, but it’s a good help for me: the Firebug-Addon (added with XPather and DomInspector)