Patching the simple_html_dom PHP library to handle files with a lot of ‘noise’

simple_html_dom is a PHP library used for screen scraping.
It is easy to use and works pretty well, letting you use CSS selectors to pick out the parts of a page you want to grab.

If you use this code:

$html = file_get_html($your_url);

The $html variable ends up containing the Document Object Model, or DOM, of the web page at $your_url. From there it is almost a breeze to grab parts of the page. For example, this code:

foreach ($html->find('div#TrCons table tr.evidence') as $i => $row) 
{
  echo trim($row->innertext) . "\n";
}

would output the HTML content of all the table rows (tr) with class “evidence”, inside the div whose id attribute is set to “TrCons”.

While scraping some of the internal pages of the official Italian Senate web site, we ran into some really weird parsing errors.

The parsed DOM was plain wrong, right after the invocation of the file_get_html function. Some branches of the DOM contained pieces of HTML taken from other branches, plus some stray integer numbers.

There were two hints:

  • this only happened in some pages
  • the mismatched HTML code always had to do with comments

So, I analyzed the source code (it is all in one single file) and found this:
the algorithm works by first removing the noise from the HTML code, then restoring it after the parsing.
By noise, I mean all the information not useful for the parsing — among other things, the HTML comments.

Since the error had to do with comments, I concentrated on the noise removing and restoring parts of the code.

There are two functions for this: remove_noise and restore_noise.

Very basically, the remove_noise function

  • searches the whole text for certain patterns (for example the HTML comment open and close tags),
  • stores each found pattern in the noise hash, under a key named “___noise___DDD” (DDD is a 3-digit number),
  • substitutes the found pattern with the string “___noise___DDD”.

The restore_noise function does the exact reverse:
for each ___noise___DDD pattern found in the text, it looks up the corresponding value in the hash and substitutes it back.
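The mechanism can be sketched in a few lines. This is a simplified model, not the library's actual code — the function signatures and the key format are patterned on what the post describes, but simple_html_dom implements them differently internally:

```php
<?php
// Simplified sketch of the remove/restore mechanism.
// Assumption: keys look like "___noise___DDD", as described above.
function remove_noise($text, $pattern, &$noise)
{
    return preg_replace_callback($pattern, function ($m) use (&$noise) {
        // One fixed-width placeholder per match, e.g. "___noise___000"
        $key = '___noise___' . sprintf('%03d', count($noise));
        $noise[$key] = $m[0];
        return $key;
    }, $text);
}

function restore_noise($text, $noise)
{
    // Put every stored fragment back in place of its placeholder
    return strtr($text, $noise);
}

$noise = array();
$html  = '<p>hi</p><!-- a comment --><p>bye</p>';
$clean = remove_noise($html, '/<!--.*?-->/s', $noise);
echo $clean . "\n";                        // the comment is now a placeholder
echo restore_noise($clean, $noise) . "\n"; // the original markup is restored
```

The round trip is lossless as long as every placeholder can be matched back to exactly one key — which is where the fixed width becomes important.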

Very nice, and it also makes it immediately clear where the problem lies!

What if there are more than 1000 substitutions to do?
The code breaks, producing weird errors — exactly the errors I was finding, and only on those pages with more than a thousand noise fragments in them.
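The failure mode is easy to reproduce in isolation. sprintf's %03d pads to a *minimum* of 3 digits, so the 1001st key is 4 characters wide; a restorer that reads exactly 3 digits after the marker then matches the wrong key and strands a digit in the text — which would explain the stray integers in the parsed DOM (the exact lookup code in the library may differ; this is an illustration of the principle):

```php
<?php
// %03d pads to AT LEAST 3 digits — it does not truncate:
var_dump(sprintf('%03d', 999));   // string(3) "999"
var_dump(sprintf('%03d', 1000));  // string(4) "1000"

$key = '___noise___' . sprintf('%03d', 1000); // "___noise___1000"
// Reading only 3 digits after the 11-character marker recovers the
// wrong key and leaves the trailing "0" behind in the text:
var_dump(substr($key, 0, 11 + 3)); // string(14) "___noise___100"
```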

So, I patched the code, increasing the number of digits used in the ___noise___ pattern from 3 to 5.
That way, the library can handle pages with up to 100,000 noise fragments in them.
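The widening fix, in essence, is to pad the key to 5 digits and read 5 digits back. A sketch of the idea (the real patch edits simple_html_dom.php in place; these lines only demonstrate the invariant it restores):

```php
<?php
// With 5-digit padding, every key from 0 to 99999 has the same width,
// so a fixed-width read after the 11-character marker is always correct:
$key = '___noise___' . sprintf('%05d', 1234);
echo $key . "\n";                // ___noise___01234
echo substr($key, 11, 5) . "\n"; // 01234 — the index, recovered intact
```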

It is a rough patch, I admit, but it should still be sufficient for most of the web pages out there.

I filed a bug on SourceForge.
You can download the proposed patch here.

1 Comment

  1. Gareth said,

    March 17, 2010 at 8:54 pm

    I’m parsing the dump of wikipedia by running the wikitext through mediawiki then stripping a bunch of stuff out, when I ran into this. I thought I was screwed, thanks SO much for posting this. Patch works wonderfully.

