Parse Links Article

Article taken from: The Art of the Web"

PHP: Parsing HTML to find Links

From blogging to log analysis and search engine optimisation (SEO) people are looking for scripts that can parse web pages and RSS feeds from other websites - to see where their traffic is coming from among other things.

Parsing your own HTML should be no problem - assuming that you use consistent formatting - but once you set your sights at parsing other people's HTML the frustration really sets in. This page presents some regular expressions and a commentary that will hopefully point you in the right direction.

Simplest Case

Let's start with the simplest case - a well formatted link with no extra attributes:

/<a href=\"([^\"]*)\">(.*)<\/a>/iU
 

This, believe it or not, is a very simple regular expression (or "regexp" for short). It can be broken down as follows:

We're also using two 'pattern modifiers':

The first modifier means that we're matching <A> as well as <a>. The 'ungreedy' modifier is necessary because otherwise the second captured string could (by being 'greedy') extend from the contents of one link all the way to the end of another link.

One shortcoming of this regexp is that it won't match link tags that include a line break - fortunately there's a modifer for this as well:

/<a\shref=\"([^\"]*)\">(.*)<\/a>/siU
 

Now the '.' character will match any character including line breaks. We've also changed the first space to a 'whitespace' character type so that it can match a space, tab or line break. It's necessary to have some kind of whitespace in that position so we don't match other tags such as <area>.

For more information on pattern modifiers see the link at the bottom of this page.

Room for Extra Attributes

Most link tags contain a lot more than just an href attribute. Other common attributes include: rel, target and title. They can appear before or after the href attribute:

/<a\s[^>]*href=\"([^\"]*)\"[^>]*>(.*)<\/a>/siU
 

We've added extra patterns before and after the href attribute. They will match any series of characters NOT containing the > symbol. It's always better when writing regular expressions to specify exactly which characters are allowed and not allowed - 0rather that using the wildcard ('.') character.

Allow for Missing Quotes

Up to now we've assumed that the link address is going to be enclosed in double-quotes. Unfortunately there's nothing enforcing this so a lot of people simply leave them out. The problem is that we were relying on the quotes to be there to indicate where the address starts and ends. Without the quotes we have a problem.

It would be simple enough (even trivial) to write a second regexp, but where's the fun in that when we can do it all with one:

/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
 

What can I say? Regular expressions are a lot of fun to work with but when it takes a half-hour to work out where to put an extra ? your really know you're in deep.

Firstly, what's with those extra ?'s?

Because we used the U modifier, all patterns in the regexp default to 'ungreedy'. Adding an extra ? after a ? or * reverses that behaviour back to 'greedy' but just for the preceding pattern. Without this, for reasons that are difficult to explain, the expression fails. Basically anything following href= is lumped into the [^>]* expression.

We've added an extra capture to the regexp that matches a double-quote if it's there: (\"??). There is then a backreference \\1 that matches the closing double-quote - if there was an opening one.

To cater for links without quotes, the pattern to match the link address itself has been changed from [^\"]* to [^\" >]*?. That means that the link can be terminated by not just a double-quote (the previous behaviour) but also a space or > symbol.

This means that links with addresses containing unescaped spaces will no longer be captured!

Refining the Regexp

Given the nature of the WWW there are always going to be cases where the regular expression breaks down. Small changes to the patterns can fix these.

Spaces around the = after href:

 
/<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
 

Matching only links starting with http:

 
/<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
 

Single quotes around the link address:

 
/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
 

And yes, all of these modifications can be used at the same time to make one super-regexp, but the result is just too painful to look at so I'll leave that as an exercise.

Using the Regular Expression to parse HTML

Using the default for preg_match_all the array returned contains an array of the first 'capture' then an array of the second capture and so forth. By capture we mean patterns contained in ():

<?PHP
  // Original PHP code by Chirp Internet: www.chirp.com.au
  // Please acknowledge use of this code by including this header.

  $url = "http://www.example.net/somepage.html";
  $input = @file_get_contents($url) or die("Could not access file: $url");
  $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
  if(preg_match_all("/$regexp/siU", $input, $matches))
		{
		// $matches[2] = array of link addresses
		// $matches[3] = array of link text - including HTML code
		}
?>
 

Using PREG_SET_ORDER each link matched has it's own array in the return value:

<?PHP
  // Original PHP code by Chirp Internet: www.chirp.com.au
  // Please acknowledge use of this code by including this header.

  $url = "http://www.example.net/somepage.html";
  $input = @file_get_contents($url) or die("Could not access file: $url");
  $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
  if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER))
	{
	foreach($matches as $match)
		{
		// $match[2] = link address
		// $match[3] = link text
		}
	}
?>