Part I
As I was searching the web for something useful concerning „web scraping“, I was astonished by the lack of information, so I decided to put something together myself. Isn’t there anything useful out there? I know „web scraping“ (or „screen scraping“ in general) is a disgusting technique, and I have to admit: it usually makes me puke.
But, well, there are times when you have no other choice (or even worse: you have a choice, but it is even more horrible).
After doing several web-scraping projects, I will put together some of that experience here. The following examples are shown in PHP and Tcl (version > 8.4.2 and tdom 0.8). But as far as I know, other languages (Ruby, for example) could easily be used with similar techniques.
But first of all a …
WARNING
Before starting to scrape something off the web, be sure there is no better way. Often you may find an official API that should be used (e.g. through web services or a REST API), or there are other services that deliver the needed information.
Moreover, convince yourself that web scraping is at least not forbidden. Some big sites state in their terms and conditions that scraping is not allowed; you should respect that. Furthermore, be aware that your requests add to the load of the target site. Always keep in mind that you are retrieving information in a way that is surely not intended by the site’s owners and operators. So be nice and don’t make too many requests.
If you take content from other sites without the permission of its creators you will, depending on how you use this content, violate copyright law.
Having said that, we start with the simplest method.
Regular expressions
That’s always the first method mentioned when somebody speaks of analyzing texts (and „analyzing text“ is in general what you do when you scrape a website). Though this might be feasible for grabbing isolated bits of text from a page, you will end up in hell if you want more.
So let’s look at a small example where a regular expression is enough. We want to extract the current value of the DAX.
There is certainly some webservice to retrieve this kind of data. But as I wanted to make a really simple example, let’s assume there is no way around scraping.
Have a look at any financial site and you will find some HTML similar to this:
...
<tr>
  <td><a href="...">DAX</a></td>
  <td>5.560,13</td>
</tr>
...
HTML-Code 1
We concentrate our attention on the table row „DAX“ and the column „Punkte“ (points).
To extract the DAX value, this can be done simply by:
<?php
// read the locally stored copy of the page
$html = file_get_contents('boerse.html');
// or fetch it directly from the web:
// $html = file_get_contents('http://boerse.ftd.de/ftd/kurse_listen.htm');
$regexp = '/>DAX<\/a>.*?<td[^>]*>(.*?)<\/td>/s';
if (preg_match_all($regexp, $html, $hit) && count($hit[1]) == 1) {
    print 'Found DAX: '.$hit[1][0];
} else {
    print 'Error! Retrieved '.count($hit[1]).' matches!';
}
?>
PHP-Code 1
Or if you prefer to write that in Tcl:
set f [open boerse.html r]; set html [read $f]; close $f
# or fetch it directly from the web:
package require http
set token [::http::geturl "http://boerse.ftd.de/ftd/kurse_listen.htm"]
set html [::http::data $token]
set regexp {>DAX</a>.*?<td[^>]*>(.*?)</td>}
# -all -inline returns the complete match plus the parenthesized
# subexpression, so exactly one hit yields a list of length 2
if {[llength [set l [regexp -all -inline $regexp $html]]] == 2} {
    puts "Found DAX: [lindex $l 1]"
} else {
    puts "Error! Retrieved [llength $l] matches"
}
Tcl-Code 1
To have a better way of testing, I usually store the page locally. With file_get_contents you can simply switch between the locally stored file and the web address (as far as I know there is nothing quite that easy in Tcl to switch between file and URL). As long as you are still working out the correct regular expression for the match, you should definitely do that with a locally stored HTML file.
Make sure that this pattern matches only once, or you might retrieve the wrong part of the page. To achieve that, the regular-expression pattern contains a little of the surrounding tags, assuming that there will be only one linked text „DAX“ in a table cell, with the next cell containing a number.
Furthermore, in PHP add the modifier /s (treat the string as a single line) to the regular expression, because the text to match stretches across multiple lines (see „HTML-Code 1“) and the dot should simply ignore the line breaks. In Tcl no extra switch is needed, since there the dot matches newlines by default.
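To illustrate what /s changes, here is a tiny self-contained sketch (the two-line HTML string is made up for the demonstration):

```php
<?php
// Hypothetical two-line snippet standing in for the real page.
$html = "<a>DAX</a>\n<td>5.560,13</td>";

// Without /s the dot stops at the newline, so the pattern fails.
var_dump(preg_match('/DAX.*?<td>(.*?)<\/td>/', $html, $m));   // int(0)

// With /s the dot crosses the line break and the value is found.
preg_match('/DAX.*?<td>(.*?)<\/td>/s', $html, $m);
print $m[1];   // 5.560,13
```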
Because of unexpected and surely unannounced changes to the page (at least unannounced to you as a „nearly“ anonymous scraper), make sure that you check for the right data. If the pattern doesn’t match, something is definitely wrong and you have to inspect the HTML code for changes. If the pattern matches more than once, that is just as wrong. Therefore I always use preg_match_all (or in Tcl the switch -all).
Well, this was easy, and in fact I wouldn’t call it „web scraping“. But if you want to scrape more than a single number or word from a page, forget about regular expressions.
We need something more powerful, something that works on nested structures. Have you ever tried to match paired, nested tags like „<table>...</table>“ with regular expressions? No way! Go directly to jail! Do not pass go!
Part II
A more powerful way than regular expressions? Hardly imaginable? Think bigger!
DOM
Isn’t DOM for correctly structured, XML-like data only? Oh no, there is more. At least in PHP you can use the usual DOMDocument even on broken HTML. And as far as I know, even „Internet Explorer“ somehow copes with badly formatted HTML, and it uses a DOM representation internally. So there are other „convert bad-bad-bad HTML to DOM“ tools out there.
Let’s start with another simple example. We want to find out how long a search on google takes.
First we have to feed the HTML into the DOMDocument (let’s search for „scraping“). To get the URL, just go to the website, enter „scraping“ and copy the resulting URL into the code.
$url='http://www.google.de/#q=scraping';
$html=file_get_contents($url);
// create DOM from bad HTML
$dom = new DOMDocument();
if ($dom->loadHTML($html)) {
    // go on with parsing
}
PHP-Code 2
package require tdom
package require http
set url "http://www.google.de/#q=scraping"
set token [::http::geturl $url]
set html [::http::data $token]
# create DOM from bad HTML
if {![catch {dom parse -html $html} dom]} {
    set root [$dom documentElement]
    # go on with parsing
}
Tcl-Code 2
You will get tons of warnings from the method loadHTML. Since we know that this is badly formatted HTML, we silently ignore them.
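One clean way to ignore them: PHP’s standard libxml functions can collect the parser warnings in an internal buffer instead of printing them (the HTML string below is just a stand-in for the real page):

```php
<?php
// Route libxml parser warnings into an internal buffer
// instead of letting them print to the output.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<td>no <b>proper document');  // bad HTML, but quiet now

libxml_clear_errors();  // discard the collected warnings
```

Alternatively, a quick-and-dirty @$dom->loadHTML($html) suppresses the messages as well.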
If we got a dom object, we start to parse the HTML. We do this with XPath. After analyzing the HTML code of the result page you can find this specific text (newlines inserted for clarity):