Return the root domain from a URL in R

Given website addresses, for example


how can I return the root domain in R?



For my purposes, I would define the root domain as having the structure

example_name.public_suffix

where example_name excludes "www" and public_suffix is in the list:

Is this regex-based solution still the best option:


Is there something in R that parses the root domain based on the public suffix list, for example:

Edited: adding more information based on Richard's comment

Using XML::parseURI seems to return the part between the first "//" and the next "/", e.g.:

> parseURI("")$server
[1] ""


So the question comes down to having an R function that can return the public suffix from a URI, or implementing the following algorithm on the suffix list:

  • Match the domain against all the rules and take note of the matching ones.
  • If no rules match, the prevailing rule is "*".
  • If more than one rule matches, the prevailing rule is the one that is an exception rule.
  • If there is no matching exception rule, the prevailing rule is the one with the most labels.
  • If the prevailing rule is an exception rule, modify it by removing the leftmost label.
  • The public suffix is the set of labels from the domain that directly match the labels of the prevailing rule (joined by dots).
  • The registered or registrable domain is the public suffix plus one additional label.
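
The steps above can be sketched in base R roughly as follows. This is only a minimal sketch under stated assumptions: the `rules` vector is a tiny illustrative subset rather than the real suffix list, `rule_matches` and `registrable_domain` are hypothetical helper names, and the input is assumed to be a plain ASCII hostname (no scheme, port, or path).

```r
# Illustrative subset of rules (an assumption); the real list is far longer.
rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")

# TRUE if a rule's labels match the rightmost labels of the domain.
# Both label vectors are reversed, so index 1 is the rightmost label.
rule_matches <- function(rule_labels, domain_labels) {
  n <- length(rule_labels)
  if (n > length(domain_labels)) return(FALSE)
  all(rule_labels == "*" | rule_labels == domain_labels[seq_len(n)])
}

registrable_domain <- function(host, rules) {
  labels <- rev(strsplit(tolower(host), ".", fixed = TRUE)[[1]])
  is_exception <- startsWith(rules, "!")
  rule_labels <- lapply(strsplit(sub("^!", "", rules), ".", fixed = TRUE), rev)
  matched <- vapply(rule_labels, rule_matches, logical(1), domain_labels = labels)

  if (!any(matched)) {
    prevailing <- "*"  # no rule matches: "*" prevails
  } else if (any(matched & is_exception)) {
    hit <- which(matched & is_exception)[1]
    # An exception rule prevails; drop its leftmost label
    # (the last element here, because labels are stored reversed).
    prevailing <- head(rule_labels[[hit]], -1)
  } else {
    # Otherwise the matching rule with the most labels prevails.
    hits <- which(matched)
    prevailing <- rule_labels[[hits[which.max(lengths(rule_labels[hits]))]]]
  }

  n_suffix <- length(prevailing)
  # A host that is itself a public suffix has no registrable domain.
  if (length(labels) <= n_suffix) return(NA_character_)
  # Public suffix plus one additional label.
  paste(rev(labels[seq_len(n_suffix + 1)]), collapse = ".")
}

registrable_domain("www.example.co.uk", rules)
registrable_domain("www.ck", rules)
```

With a real rule set, the `rules` vector would be read from the suffix list file after dropping comment and blank lines; that step is left out here.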


2 answers

There are two tasks here. The first is parsing the URL to get the hostname, which can be done with parse_url from the httr package:


library(httr)
host <- parse_url("")$hostname
# [1] ""


The second is to extract the organizational domain (also called the root domain or top private domain). This can be done with the tldextract package, which is inspired by the Python package of the same name and uses Mozilla's public suffix list:

library(tldextract)
# store the result (the variable name here is arbitrary)
domain_info <- tldextract(host)
domain_info
#                       host subdomain   domain   tld
# 1 subdomain example2

tldextract returns a data frame with one row for each domain you give it, and you can easily paste the relevant parts together:

paste(domain_info$domain, domain_info$tld, sep=".")
# [1] ""
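
The two steps could be combined into one helper along these lines. This is a sketch, not a tested recipe: `root_domain` is a hypothetical name, the example URL is made up, and it assumes both httr and tldextract are installed and that tldextract returns the `domain` and `tld` columns shown above.

```r
# Hypothetical helper combining both steps: hostname first, then domain + suffix.
root_domain <- function(url) {
  host <- httr::parse_url(url)$hostname
  parts <- tldextract::tldextract(host)
  paste(parts$domain, parts$tld, sep = ".")
}

# Only run the example when both packages are available.
if (requireNamespace("httr", quietly = TRUE) &&
    requireNamespace("tldextract", quietly = TRUE)) {
  print(root_domain("http://subdomain.example.co.uk/some/page"))
}
```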




Something like this should help:

> strsplit(gsub("http://|https://|www\\.", "", ""), "/")[[c(1, 1)]]
[1] ""

> strsplit(gsub("http://|https://|www\\.", "", ""), "/")[[c(1, 1)]]
[1] ""
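
The same idea can be wrapped in a small function, sketched below with the patterns anchored to the start of the string so that "www." is only removed as a leading label. `strip_to_host` is a hypothetical name and the URLs are made-up examples. Note that this approach returns the hostname without "www", which still includes any other subdomain, so it is not a full public-suffix-based solution.

```r
# Hypothetical helper: strip the scheme and a leading "www.",
# then keep everything before the first "/".
strip_to_host <- function(url) {
  no_scheme <- sub("^https?://", "", url)
  no_www <- sub("^www\\.", "", no_scheme)
  strsplit(no_www, "/", fixed = TRUE)[[1]][1]
}

strip_to_host("http://www.example.com/some/page")
strip_to_host("https://sub.example.co.uk/index.html")
```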



