Return the root domain from a URL in R

Given website addresses, for example

http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2

      

how can I return the root domain in R, e.g.

example.com
example2.co.uk

      

For my purposes, I would define the root domain as having the structure

example_name.public_suffix

      

where example_name excludes "www" and public_suffix is in the list:

https://publicsuffix.org/list/effective_tld_names.dat

Is there something more robust than this regex-based solution:

Stackoverflow

Is there something in R that parses the root domain based on the public suffix list, something like:

http://simonecarletti.com/code/publicsuffix/

Edit: adding more information based on Richard's comment.

Using XML::parseURI seems to return everything between the first "//" and the next "/", e.g.

> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"

      

So the question comes down to having an R function that can return the public suffix from a URI, or that implements the following algorithm on the public suffix list:

Algorithm
  • Match the domain against all rules and take note of the matching ones.
  • If no rules match, the prevailing rule is "*".
  • If more than one rule matches, the prevailing rule is the one that is an exception rule.
  • If there is no matching exception rule, the prevailing rule is the one with the most labels.
  • If the prevailing rule is an exception rule, modify it by removing the leftmost label.
  • The public suffix is the set of labels from the domain that match the labels of the prevailing rule (joined by dots).
  • The registered or registrable domain is the public suffix plus one additional label.
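
The steps above can be sketched in base R. This is only an illustration under stated assumptions: the `rules` vector is a tiny hand-picked subset written in the list's syntax (a real run would read the full file and drop comment and blank lines), and the names `rev_labels`, `rule_matches` and `registrable_domain` are hypothetical helpers, not part of any package.

```r
# Tiny, hand-picked subset of rules in public suffix list syntax.
# Assumption: in practice these would come from the full list file.
rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")

# Split a dotted name into labels, rightmost label first.
rev_labels <- function(x) rev(strsplit(x, ".", fixed = TRUE)[[1]])

# Does a rule (without its leading "!") match the host labels?
# A "*" label in the rule matches any single host label.
rule_matches <- function(rule, host_labels) {
  rule_labels <- rev_labels(rule)
  n <- length(rule_labels)
  if (n > length(host_labels)) return(FALSE)
  all(rule_labels == "*" | rule_labels == host_labels[seq_len(n)])
}

registrable_domain <- function(host, rules) {
  host_labels <- rev_labels(tolower(host))
  is_exception <- startsWith(rules, "!")
  bare <- sub("^!", "", rules)
  matched <- vapply(bare, rule_matches, logical(1), host_labels = host_labels)

  if (any(matched & is_exception)) {
    # An exception rule prevails; drop its leftmost label.
    suffix_len <- length(rev_labels(bare[matched & is_exception][1])) - 1
  } else if (any(matched)) {
    # Otherwise the matching rule with the most labels prevails.
    suffix_len <- max(lengths(strsplit(bare[matched], ".", fixed = TRUE)))
  } else {
    # No rule matched: the prevailing rule is "*", i.e. one label.
    suffix_len <- 1
  }

  # Registrable domain = public suffix plus one additional label.
  if (length(host_labels) <= suffix_len) return(NA_character_)
  paste(rev(host_labels[seq_len(suffix_len + 1)]), collapse = ".")
}

registrable_domain("www.example.com", rules)
registrable_domain("subdomain.example2.co.uk", rules)
```

With this rule subset the two calls should give "example.com" and "example2.co.uk". A host that is itself a public suffix, such as "co.uk", returns NA because there is no additional label to add.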

2 answers


There are two tasks here. The first is parsing the URL to get the hostname, which can be done with parse_url from the httr package:

library(httr)

# parse_url() splits the URL into components; keep only the hostname
host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"

      

The second is to retrieve the organizational domain (or root domain, or top private domain, whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):



library(tldextract)

# Split the hostname into subdomain, domain and public suffix (tld)
domain.info <- tldextract(host)
domain.info
#                       host subdomain   domain   tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk

      

tldextract returns a data frame with a row for each domain you give it, but you can easily paste together the relevant parts:

paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"

      



Something like this should help, although it only strips the scheme and a leading "www." and leaves other subdomains in place:



> strsplit(sub("^https?://(www\\.)?", "", "http://www.example.com/page1/#"), "/")[[c(1, 1)]]
[1] "example.com"

> strsplit(sub("^https?://(www\\.)?", "", "https://subdomain.example2.co.uk/asdf?retrieve=2"), "/")[[c(1, 1)]]
[1] "subdomain.example2.co.uk"
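
For several URLs at once, the same idea can be written as a small vectorised helper in base R. This is a rough sketch, not a full URL parser: `extract_host` is a hypothetical name, the patterns are anchored so that only a leading scheme and a leading "www." are removed, and it assumes ordinary hostnames (bracketed IPv6 literals are not handled). Like the one-liner above, it does not remove other subdomains; that still needs the public suffix list.

```r
# Hypothetical helper: pull the host out of each URL using base R only.
extract_host <- function(urls) {
  x <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", urls)  # drop the scheme
  x <- sub("[/?#].*$", "", x)                         # drop path, query, fragment
  x <- sub("^.*@", "", x)                             # drop userinfo, if any
  x <- sub(":[0-9]*$", "", x)                         # drop the port, if any
  tolower(sub("^www\\.", "", x))                      # drop a leading "www."
}

urls <- c("http://www.example.com/page1/#",
          "https://subdomain.example2.co.uk/asdf?retrieve=2",
          "http://www.blog.omegahat.org:8080/RCurl/index.html")
extract_host(urls)
```

For the three inputs above this should give "example.com", "subdomain.example2.co.uk" and "blog.omegahat.org".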

      
