Parsing only the domain name from a URL in PHP

I need a function to extract just the name from the URL.

For example, when entering www.google.com, I want the result to be google.

www.facebook.com → facebook

After several searches, I found this function: parse_url($url, PHP_URL_HOST);

With this function, when I enter www.google.com/blahblah/blahblah I get www.google.com as the output.



3 answers


In my opinion there is only one semi-reliable way to do this, and you need to create a class for it. I personally use something like namespace\Domain extends namespace\URI: a Domain is essentially a subset of a URI, so technically I create two classes.

Your Domain class will probably need a static class member to store the list of valid TLDs. This may as well live in the URI class, so that other subclasses can reuse it.

namespace My;

class URI {

  protected static $tldList;
  private static $_tldRepository = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';

  protected $uri;

  public function __construct($sURI = "") {
    if (!self::$tldList) {
      // static method (not shown here) to load the TLD list from Mozilla
      // and parse it into an array, which sets self::$tldList
      self::loadTLDList();
    }

    // if the URI has been passed in, set it
    if ($sURI) {
      $this->setURI($sURI);
    }
  }

  public function setURI($sURI) {
    $this->uri = $sURI; // needs validation and sanity checks of course
  }

  public function getURI() {
    return $this->uri;
  }

  // other methods, including loadTLDList() ...

}

      



I actually keep a copy of the TLD list in a file on the server and use that, updating it only every six months or so. This avoids the overhead of fetching the full TLD list the first time a URI object is created on any page.
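A minimal sketch of that caching idea, assuming the list is in the Public Suffix List text format (one rule per line, comment lines starting with //). The function names, the cache file argument and the 180-day threshold are made up for illustration and are not part of the class above.

```php
<?php
// Hypothetical helper: turn Public Suffix List text into an array of rules.
function parseSuffixList(string $text): array
{
    $rules = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        // Skip blank lines and comment lines.
        if ($line === '' || strpos($line, '//') === 0) {
            continue;
        }
        // A rule ends at the first whitespace character.
        $rules[] = strtolower(preg_split('/\s+/', $line)[0]);
    }
    return $rules;
}

// Hypothetical helper: read the local copy, refreshing it only when it is
// missing or older than $maxAgeSeconds (assumed default: roughly six months).
function loadCachedSuffixList(string $cacheFile, string $sourceUrl, int $maxAgeSeconds = 180 * 86400): array
{
    $stale = !is_file($cacheFile) || (time() - filemtime($cacheFile)) > $maxAgeSeconds;
    if ($stale) {
        $fresh = @file_get_contents($sourceUrl);
        if ($fresh !== false) {
            file_put_contents($cacheFile, $fresh);
        }
    }
    // Fall back to an empty list if no copy could be obtained.
    return is_file($cacheFile) ? parseSuffixList(file_get_contents($cacheFile)) : [];
}
```

A loadTLDList() method could then assign the result of loadCachedSuffixList() to self::$tldList. If a refresh fails, this sketch keeps using the old copy.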

You can now have a Domain subclass that extends \My\URI and lets you split the URI into its component parts. It could have a method to strip the TLD, based on the list of TLDs loaded into parent::$tldList from mxr.mozilla.org. Once you have identified the valid TLD, whatever sits immediately to its left (between the last . and the TLD) should be the domain, and everything further left is subdomains.

You can have methods to retrieve this data as needed.
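The splitting step described above could be sketched roughly as follows. This is an illustration, not the Domain class itself: splitHost is a hypothetical standalone function, $suffixes is assumed to be a plain array of known TLDs, and the wildcard and exception rules of the real list are ignored.

```php
<?php
// Hypothetical helper: split a host name into subdomain, domain and TLD.
// $suffixes is a plain array of known suffixes such as ['com', 'co.uk'].
function splitHost(string $host, array $suffixes): ?array
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $known  = array_flip($suffixes);

    // Try the longest candidate suffix first, always leaving
    // at least one label to the left of it for the domain.
    for ($i = 1; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($known[$candidate])) {
            return [
                'subdomain' => implode('.', array_slice($labels, 0, $i - 1)),
                'domain'    => $labels[$i - 1],
                'tld'       => $candidate,
            ];
        }
    }
    return null; // no known TLD found
}

$parts = splitHost('www.google.co.uk', ['com', 'uk', 'co.uk']);
echo $parts['domain']; // google
```

Checking the longest suffix first is what makes co.uk win over uk, so the domain comes out as google rather than co.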



This does what you ask, although I agree with the comments about removing the TLD:

// Capture the label immediately before the TLD part at the end of the string.
if (preg_match('/([^.\/]+)\.[a-z.]{2,6}$/i', "http://www.google.com", $match)) {
    echo $match[1]; // google
}

      



It essentially matches the part before the TLD. The {2,6} length limit is a heuristic rather than a rule: DNS allows labels of up to 63 characters, and many newer TLDs are longer than six. The TLD portion is not foolproof, but it works for most common inputs.
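Because the pattern is anchored to the end of the string, it finds nothing once the input carries a path, as in www.google.com/blahblah/blahblah. One way around that is to isolate the host with parse_url() first. The sketch below assumes inputs may or may not include a scheme. The function name domainLabel is made up for this example, and the TLD portion remains just as heuristic as before.

```php
<?php
// Hypothetical helper: return the label just before the TLD, or null if nothing matches.
function domainLabel(string $url): ?string
{
    // parse_url() only reports a host when a scheme (or a leading //) is present,
    // so add one when the input is a bare "www.example.com/path".
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null;
    }
    // Same heuristic pattern as above, applied to the host only.
    if (preg_match('/([^.\/]+)\.[a-z.]{2,6}$/i', $host, $match)) {
        return $match[1];
    }
    return null;
}

echo domainLabel('www.google.com/blahblah/blahblah'); // google
```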



Regex and parse_url() are not the right solution here.

You need a package that uses the Public Suffix List. Only that way can you correctly extract domains with second- and third-level suffixes (co.uk, a.bg, b.bg, etc.) and multi-level subdomains.

I recommend using TLD Extract. Here's some sample code:

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('www.google.com/blahblah/blahblah');
$result->getHostname(); // will return (string) 'google'
$result->getRegistrableDomain(); // will return (string) 'google.com'

      
