Parsing only the domain name from a URL in PHP
I need a function to extract just the name from the URL.
For example, when entering www.google.com, I want the result to be google.
www.facebook.com → facebook
After several searches, I found the function parse_url($url, PHP_URL_HOST);
With this function, when I enter www.google.com/blahblah/blahblah
I get the output as www.google.com
In my opinion there is only one semi-reliable way to do this, and you need to create a class for it. I personally use something like a namespace\Domain extends namespace\URI arrangement: a Domain is essentially a subset of a URI, so technically I create two classes.
Your Domain class will probably need a static class member to store the list of valid TLDs. This may as well live in the URI class, so that other subclasses can reuse it.
namespace My;

class URI {
    protected static $tldList;
    private static $_tldRepository = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
    protected $uri;

    public function __construct($sURI = "") {
        if (!self::$tldList) {
            // static method to load the TLD list from Mozilla
            // and parse it into an array, which sets self::$tldList
            self::loadTLDList();
        }
        // if the URI has been passed in - set it
        if ($sURI) {
            $this->setURI($sURI);
        }
    }

    public function setURI($sURI) {
        $this->uri = $sURI; // needs validation and sanity checks of course
    }

    public function getURI() {
        return $this->uri;
    }

    // other methods ...
}
In practice I keep a copy of the TLD list in a file on the server and use that, updating it only every 6 months. This avoids the overhead of fetching the full TLD list remotely the first time a URI object is created on any page.
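The caching idea could look roughly like the sketch below. It is only an illustration: the function name loadCachedTldList, the $fetch callback, and the default age of 180 days are assumptions of mine and not part of the class above. The only fact it relies on about the list format is that comment lines start with "//".

```php
<?php
// Hypothetical helper: return the suffix rules from a local cache file.
// $fetch is any callable returning the raw list text; it is called only
// when the cache file is missing or older than $maxAge seconds.
function loadCachedTldList(string $cacheFile, callable $fetch, int $maxAge = 15552000): array
{
    // 15552000 seconds = 180 days, i.e. roughly six months.
    $stale = !is_file($cacheFile) || (time() - filemtime($cacheFile)) > $maxAge;
    if ($stale) {
        file_put_contents($cacheFile, $fetch());
    }

    $rules = [];
    foreach (file($cacheFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $line = trim($line);
        // Skip blank lines and "//" comment lines.
        if ($line !== '' && strpos($line, '//') !== 0) {
            $rules[] = $line;
        }
    }
    return $rules;
}
```

In a real setup $fetch would presumably wrap something like file_get_contents() on the repository URL. Passing it in as a callable keeps the remote request out of the parsing logic and makes the function easy to exercise without network access.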
You can now have a Domain subclass that extends \My\URI and splits the URI into its component parts. It can have a method to strip the TLD, based on the list of TLDs loaded into parent::$tldList from mxr.mozilla.org. Once you have identified the valid TLD, the label immediately to its left (between the preceding dot and the TLD) is the domain, and everything to the left of that is subdomains.
You can have methods to retrieve each of these pieces as needed.
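A minimal sketch of that splitting step follows, written as a standalone function rather than a Domain method. The name splitHost and the returned array keys are hypothetical, and the suffix list is passed in as a plain array. It handles only literal suffixes; the wildcard and exception rules in the real list are ignored here.

```php
<?php
// Hypothetical helper: split a host name into subdomain, domain, and TLD
// using a caller-supplied list of known suffixes (e.g. ['com', 'co.uk']).
function splitHost(string $host, array $suffixes): ?array
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count = count($labels);

    // Try the longest candidate suffix first, always leaving at least
    // one label to the left of it to serve as the domain.
    for ($i = 1; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            return [
                'subdomain' => implode('.', array_slice($labels, 0, $i - 1)),
                'domain'    => $labels[$i - 1],
                'tld'       => $candidate,
            ];
        }
    }
    return null; // no known suffix matched
}
```

With the suffixes ['com', 'uk', 'co.uk'], the host www.google.com yields the domain google with subdomain www, and a.b.example.co.uk yields the domain example with subdomain a.b.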
This does what you ask, although I agree with the comments about removing the TLD:
preg_match("/([^\.\/]+)\.[a-z\.]{2,6}$/i", "http://www.google.com", $match);
echo $match[1];
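It essentially matches the label just before the TLD. The {2,6} bound assumes the TLD portion is at most 6 characters long, which held for older TLDs such as .museum but not for newer ones such as .photography (a DNS label can be up to 63 characters). The TLD portion is not foolproof, but it works for many common inputs.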
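Because the pattern is anchored at the end of the string, it finds nothing when the input carries a path, as in www.google.com/blahblah/blahblah. One possible workaround, sketched here under a hypothetical function name, is to isolate the host with parse_url() first. parse_url() reports a host only when the input has a scheme or a leading "//", so the sketch prepends "//" to bare inputs.

```php
<?php
// Hypothetical helper: return the label just before the TLD portion, or null.
function nameBeforeTld(string $url): ?string
{
    // Add a leading "//" when the input has neither a scheme nor "//",
    // otherwise parse_url() treats the whole string as a path.
    if (!preg_match('#^([a-z][a-z0-9+.-]*:)?//#i', $url)) {
        $url = '//' . $url;
    }

    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null;
    }

    // Same pattern as above, applied to the host only.
    return preg_match('/([^.\/]+)\.[a-z.]{2,6}$/i', $host, $match) ? $match[1] : null;
}
```

This returns google for www.google.com/blahblah/blahblah and facebook for http://www.facebook.com. It keeps the weaknesses of the pattern itself: for a host such as a.b.com it returns a, because b.com fits within the six-character TLD portion.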
Regex and parse_url() are not the solution for you.
You need a package that uses the Public Suffix List; only this way can you correctly extract domains under second- and third-level suffixes (co.uk, a.bg, b.bg, etc.) and with multi-level subdomains.
I recommend using TLD Extract. Here's some sample code:
$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse('www.google.com/blahblah/blahblah');
$result->getHostname(); // will return (string) 'google'
$result->getRegistrableDomain(); // will return (string) 'google.com'