Return root domain from URL in R
Given website addresses, for example
http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2
how can I return the root domain in R, e.g.
example.com
example2.co.uk
For my purposes, I would define the root domain to have the structure
example_name.public_suffix
where example_name excludes "www" and public_suffix is in the list:
https://publicsuffix.org/list/effective_tld_names.dat
Is there a regex-based solution or, even better, something in R that parses the root domain based on the public suffix list, for example something like:
http://simonecarletti.com/code/publicsuffix/
Edit: adding more information based on Richard's comment
Using XML::parseURI seems to return the stuff between the first "//" and "/", e.g.
> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
So the question comes down to whether there is an R function that can return the public suffix from a URI, or that implements the following algorithm on the public suffix list:
- Match the domain against all the rules and take note of the matching ones.
- If no rules match, the prevailing rule is "*".
- If more than one rule matches, the prevailing rule is the one that is an exception rule.
- If there is no matching exception rule, the prevailing rule is the one with the most labels.
- If the prevailing rule is an exception rule, modify it by removing the leftmost label.
- The public suffix is the set of labels from the domain that match the labels of the prevailing rule (joined by dots).
- The registered or registrable domain is the public suffix plus one additional label.
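The steps above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: `toy_rules` is a tiny hand-picked rule set, not the real list (which would have to be downloaded and parsed from publicsuffix.org), and `rule_matches` and `registrable_domain` are hypothetical helper names, not functions from any package.

```r
# Toy rule set for illustration only (assumption); a real implementation
# would load the full list from publicsuffix.org.
toy_rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")

# Does one rule match the host? Both label vectors are given right-to-left.
rule_matches <- function(rule_labels, host_labels) {
  n <- length(rule_labels)
  if (n > length(host_labels)) return(FALSE)
  all(rule_labels == "*" | rule_labels == host_labels[seq_len(n)])
}

registrable_domain <- function(host, rules = toy_rules) {
  host_labels  <- rev(strsplit(tolower(host), ".", fixed = TRUE)[[1]])
  is_exception <- startsWith(rules, "!")
  rule_labels  <- lapply(strsplit(sub("^!", "", rules), ".", fixed = TRUE), rev)
  matched <- vapply(rule_labels, rule_matches, logical(1),
                    host_labels = host_labels)

  if (!any(matched)) {
    suffix_len <- 1L                        # no match: prevailing rule is "*"
  } else if (any(matched & is_exception)) {
    # an exception rule prevails; remove its leftmost label
    idx <- which(matched & is_exception)[1]
    suffix_len <- length(rule_labels[[idx]]) - 1L
  } else {
    suffix_len <- max(lengths(rule_labels[matched]))  # most labels wins
  }

  # A host that is itself a public suffix has no registrable domain.
  if (length(host_labels) <= suffix_len) return(NA_character_)
  # Public suffix plus one additional label.
  paste(rev(host_labels[seq_len(suffix_len + 1L)]), collapse = ".")
}

registrable_domain("subdomain.example2.co.uk")
# [1] "example2.co.uk"
registrable_domain("www.example.com")
# [1] "example.com"
```

Note that "www" needs no special handling here: it is just another label to the left of the registrable domain, so it is dropped along with any other subdomain.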
There are two tasks here. The first is parsing the URL to get the hostname, which can be done with parse_url from the httr package:
library(httr)
host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"
The second is to retrieve the organizational domain (or root domain, top private domain - whatever you want to name it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses the Mozilla suffix list):
library(tldextract)
domain.info <- tldextract(host)
domain.info
#                       host subdomain   domain   tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk
tldextract returns a data frame with a row for each domain you give it, but you can easily paste together the relevant parts:
paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"
Something like this should help (note that it only strips the scheme and "www.", so other subdomains are kept):
> strsplit(gsub("http://|https://|www\\.", "", "http://www.example.com/page1/#"), "/")[[c(1, 1)]]
[1] "example.com"
> strsplit(gsub("http://|https://|www\\.", "", "https://subdomain.example2.co.uk/asdf?retrieve=2"), "/")[[c(1, 1)]]
[1] "subdomain.example2.co.uk"
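A slightly more defensive variant of the same idea, as a sketch in base R. The pattern above is not anchored, so "www." is removed wherever it occurs in the string (for example, "awww.example.com" would become "aexample.com"), and a port is left in place. The helper names `extract_host` and `strip_www` below are hypothetical, and the sketch assumes ordinary host names (no IPv6 literals). Like the original, it does not remove subdomains other than "www".

```r
# Extract the host part of a URL using anchored patterns only.
extract_host <- function(url) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme
  host <- sub("[/?#].*$", "", host)   # drop path, query and fragment
  host <- sub("^.*@", "", host)       # drop userinfo, if any
  host <- sub(":[0-9]*$", "", host)   # drop port, if any
  tolower(host)
}

# Remove "www." only when it is the leading label.
strip_www <- function(host) sub("^www\\.", "", host)

strip_www(extract_host("http://www.example.com/page1/#"))
# [1] "example.com"
extract_host("http://www.blog.omegahat.org:8080/RCurl/index.html")
# [1] "www.blog.omegahat.org"
```

Both helpers are vectorized, since `sub` and `tolower` operate on character vectors.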