Web scraper with R and rvest

I am experimenting with rvest to explore web scraping with R. I am trying to reproduce the Lego Movie example for several other sections of the page, using SelectorGadget to identify the selectors.

I pulled the example from the RStudio tutorial. With the code below, 1 and 2 work, but 3 doesn't.

library(rvest)
# read_html() replaces html(), which was deprecated in later rvest releases
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

# 1 - Get rating
lego_movie %>% 
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()

# 2 - Grab actor names
lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

# 3 - Get Meta Score 
lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text() %>%
  as.numeric()

      



2 answers


I'm not fully up to speed on pipes and the related code, so there may be some newfangled tool for this, but given that the other answer gets you to "83/100", you could do something like:

as.numeric(unlist(strsplit("83/100", "/")))[1]
[1] 83

      

With pipes, I think it would look like this:



lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text(trim=TRUE) %>%
  strsplit(., "/") %>%
  unlist(.) %>%
  as.numeric(.) %>% 
  head(., 1)

[1] 83

      

Or, as Frank suggested, you can evaluate the string "83/100" as an expression, which returns the ratio rather than 83, with something like:

lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text(trim=TRUE) %>%
  parse(text = .) %>%
  eval(.)
[1] 0.83
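
Evaluating scraped text with parse() and eval() runs whatever code the page happens to contain, so a more defensive variant may be preferable. A minimal sketch in base R, assuming the text always has the form "score/maximum"; the helper name score_ratio is hypothetical and not part of rvest:

```r
# Hypothetical helper: turn "83/100" into a ratio without evaluating it as R code.
score_ratio <- function(x) {
  parts <- strsplit(trimws(x), "/", fixed = TRUE)[[1]]
  # Anything other than exactly two pieces is treated as unparseable.
  if (length(parts) != 2) return(NA_real_)
  nums <- suppressWarnings(as.numeric(parts))
  nums[1] / nums[2]
}

score_ratio("83/100")
score_ratio(" 83/100\n")
```

Input that does not split into two numeric parts comes back as NA instead of raising an error or executing anything.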

      



You can see that, before conversion to a numeric value, the node text is " 83/100\n":

lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text()
# [1] " 83/100\n"

      

You can use trim=TRUE to strip the surrounding whitespace, including the trailing \n. Even then, you cannot convert the result to a numeric value because it contains a /.

lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text(trim=TRUE)
# [1] "83/100"

      



If you try to convert this string to a numeric value, you get NA along with a warning, which is expected:

# [1] NA
# Warning message:
# In function_list[[k]](value) : NAs introduced by coercion
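
The failure can be reproduced without scraping anything, using the string literal shown above. A minimal sketch in base R; trimws() stands in here for html_text(trim=TRUE), on the assumption that both behave the same way for this input:

```r
raw <- " 83/100\n"          # the text returned before trimming

trimmed <- trimws(raw)      # removes leading/trailing whitespace, including "\n"
trimmed

# as.numeric() cannot parse the "/" and returns NA with a coercion warning;
# the warning is suppressed here so the result can be inspected quietly.
value <- suppressWarnings(as.numeric(trimmed))
is.na(value)
```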

      

If you want the numeric value 83 as your final answer, you can use a regex tool such as gsub to remove the 100 and the / (assuming the full score is 100 for all movies).

lego_movie %>% 
  html_node(".star-box-details a:nth-child(4)") %>%
  html_text(trim=TRUE) %>%
  gsub("100|\\/", "", .) %>%
  as.numeric()
# [1] 83
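
Note that the pattern above also deletes a "100" that appears as the score itself, so a hypothetical "100/100" would become an empty string and then NA. A sketch of an alternative that drops everything from the slash onward instead; meta_score is a hypothetical helper name, and the "score/maximum" input format is an assumption:

```r
# Hypothetical helper: keep only the part before the first "/".
meta_score <- function(x) {
  # Remove the first "/" and everything after it, then convert what is left.
  as.numeric(sub("/.*$", "", trimws(x)))
}

meta_score("83/100")
meta_score("100/100")

# For comparison, the gsub() pattern on the same edge case leaves nothing behind:
gsub("100|\\/", "", "100/100")
```

In the pipe, `meta_score()` could take the place of the gsub() and as.numeric() steps.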

      
