RegEx to get the Root Domain from a String of HTML

Extracting URLs from a string of HTML text with PHP’s preg_match_all() function can give you either an array of the full URLs or an array of just the root domains, depending on the RegEx used. I’ll provide a premade function snippet for each. The only difference between the two functions is the RegEx. Whichever one you use, each element of the returned array contains the URL at index 0 and the link text at index 1. Using them is just a matter of looping through the array to grab the URLs.

Get the Full URLs from HTML Text

This method is a lot simpler to build because the RegEx only has to look for the closing quote character (" or ') of the hyperlink’s href attribute to figure out where each URL ends.

<?php
function urlExtractor($html)
{
    $linkArray = array();
    if(preg_match_all('/<a\s+.*?href=[\"\']?([^\"\'>]*)[\"\']?[^>]*>(.*?)<\/a>/i', $html, $matches, PREG_SET_ORDER)){
        foreach ($matches as $match) {
            array_push($linkArray, array($match[1], $match[2]));
        }
    }
    return $linkArray;
}
?>

Get the Root Domains from HTML Text

This method is more complex because the end of the root domain varies. When the URL points to something other than the homepage, the RegEx ends the capture at the 3rd ‘/’ in the URL (the first two belong to http://).
Link Example: <a href="http://test.com/demopage.html">test</a>

I also had to add a check for a ' or " character appearing before that forward slash, which means the URL is already limited to the root domain and the capture can end there.
Link Example: <a href="http://test.com">test</a>

<?php
function rootExtractor($html)
{
    $linkArray = array();
    if(preg_match_all('/<a\s+.*?href=[\"\']?([^\/\"\'>]*[^\/]*\/[^\/]*\/[^\/|\'\"]*)[\"\']?[^>]*>(.*?)<\/a>/i', $html, $matches, PREG_SET_ORDER)){
        foreach ($matches as $match) {
            array_push($linkArray, array($match[1], $match[2]));
        }
    }
    return $linkArray;
}
?>
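
To make the two cases described above concrete, here is a minimal sketch that runs the same pattern against both link examples with preg_match(). The variable names ($pattern, $examples, $roots) are only for illustration and are not part of the function above.

```php
<?php
// Same pattern as rootExtractor(), kept in a variable for this demonstration.
$pattern = '/<a\s+.*?href=[\"\']?([^\/\"\'>]*[^\/]*\/[^\/]*\/[^\/|\'\"]*)[\"\']?[^>]*>(.*?)<\/a>/i';

// The two link examples from the explanation above.
$examples = array(
    '<a href="http://test.com/demopage.html">test</a>', // capture ends at the 3rd slash
    '<a href="http://test.com">test</a>',               // capture ends at the closing quote
);

$roots = array();
foreach ($examples as $link) {
    if (preg_match($pattern, $link, $match)) {
        $roots[] = $match[1]; // first capture group: the root domain
    }
}

print_r($roots);
```

Both examples should end up with http://test.com in the first capture group, one stopped by the slash and the other by the quote.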

Usage Examples

We can send the following example string of HTML text to each function and see what the output would be:
$html = 'This is an example block of text with a <a href="http://test.com/demopage.html">link</a>. The regex will determine what gets saved in the <a href="http://test2.com">link2</a>';

Using the Full URL Extractor Function

$linkArray = urlExtractor($html);
$linkArray would contain:
[ ['http://test.com/demopage.html', 'link'], ['http://test2.com', 'link2'] ]

Using the Root Domain Extractor Function

$linkArray = rootExtractor($html);
$linkArray would contain:
[ ['http://test.com', 'link'], ['http://test2.com', 'link2'] ]
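
As mentioned in the introduction, using the result is a matter of looping through the array. The sketch below shows one way to do that with urlExtractor(); the function is repeated (behind a function_exists() guard) only so the snippet runs on its own, and the $lines variable and the "text => URL" output format are just assumptions for illustration.

```php
<?php
// Copy of urlExtractor() from above so this snippet is self-contained.
if (!function_exists('urlExtractor')) {
    function urlExtractor($html)
    {
        $linkArray = array();
        if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\'>]*)[\"\']?[^>]*>(.*?)<\/a>/i', $html, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                array_push($linkArray, array($match[1], $match[2]));
            }
        }
        return $linkArray;
    }
}

$html = 'This is an example block of text with a <a href="http://test.com/demopage.html">link</a>. The regex will determine what gets saved in the <a href="http://test2.com">link2</a>';

$lines = array();
foreach (urlExtractor($html) as $link) {
    // $link[0] is the URL, $link[1] is the link text.
    $lines[] = $link[1] . ' => ' . $link[0];
}

echo implode("\n", $lines), "\n";
```

The same loop works with rootExtractor(), since both functions return the same array shape.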


Get the Root Domain from a Plain URL String

This method can be used when you have a plain URL string and want to extract the root domain from it. It’s a much simpler example, but it’s useful to have handy, since you don’t always need the full RegEx for dissecting a block of HTML text. Use it when a string contains only a long URL and you want to chop off the extra content and keep just the root domain. There shouldn’t be any HTML in the string, and the URL needs to be in its full format, including the http://.
String Example: http://test.com/subdirectory/page1/

<?php
function simpleRootExtractor($string)
{
    // After the call, $matches[0] holds the full match and $matches[1] the captured root domain.
    preg_match('/([^\/\"\'>]*[^\/]*\/[^\/]*\/[^\/|\'\"]*)[\"\']?[^>]*/i', $string, $matches);
    return $matches;
}
?>

Usage Example with the URL String Root Domain Extractor Function

$string = 'http://test.com/subdirectory/page1/';
$linkArray = simpleRootExtractor($string);
$linkArray would contain the full match at index 0 and the root domain at index 1:
['http://test.com/subdirectory/page1/', 'http://test.com']
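
For a plain URL string, PHP’s built-in parse_url() is a different technique that can produce the same root domain without a RegEx. This is a minimal sketch rather than a replacement for the function above; parseUrlRoot() is a hypothetical helper name, and it assumes you only want the scheme and host (any port, path, or query string is dropped).

```php
<?php
// Hypothetical helper (not from the original post): rebuilds "scheme://host" using parse_url().
function parseUrlRoot($url)
{
    $parts = parse_url($url);

    // parse_url() returns false on seriously malformed URLs; relative URLs have no scheme or host.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }

    return $parts['scheme'] . '://' . $parts['host'];
}

echo parseUrlRoot('http://test.com/subdirectory/page1/'), "\n";
```

Like the RegEx version, this expects the full URL format including the http://; a string without a scheme returns null here.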

Code Summary

Regex code for finding full URLs in HTML Text:
'/<a\s+.*?href=[\"\']?([^\"\'>]*)[\"\']?[^>]*>(.*?)<\/a>/i'

Regex code for finding only the root domains in HTML Text:
'/<a\s+.*?href=[\"\']?([^\/\"\'>]*[^\/]*\/[^\/]*\/[^\/|\'\"]*)[\"\']?[^>]*>(.*?)<\/a>/i'

Regex code for finding only the root domain in a URL String:
'/([^\/\"\'>]*[^\/]*\/[^\/]*\/[^\/|\'\"]*)[\"\']?[^>]*/i'
