Web scraper with R and rvest
I am experimenting with rvest to learn web scraping with R. I am trying to extend the Lego example to several other sections of the page, using SelectorGadget to identify the CSS selectors.
I pulled the example from the RStudio tutorial. With the code below, 1 and 2 work, but 3 doesn't.
library(rvest)
# read_html() replaces html(), which was deprecated in later rvest releases
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
# 1 - Get rating
lego_movie %>%
html_node("strong span") %>%
html_text() %>%
as.numeric()
# 2 - Grab actor names
lego_movie %>%
html_nodes("#titleCast .itemprop span") %>%
html_text()
# 3 - Get Meta Score
lego_movie %>%
html_node(".star-box-details a:nth-child(4)") %>%
html_text() %>%
as.numeric()
I'm not really up to speed on pipes and the related code, so there may well be some newfangled tool for this, but given that the other answer gets you to "83/100", you could do something like:
as.numeric(unlist(strsplit("83/100", "/")))[1]
[1] 83
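To try the splitting step without touching the network, here is a minimal sketch that wraps it in a small helper. The name `score_numerator` is hypothetical (it is not part of rvest), and the sketch assumes the input always has the form "numerator/denominator".

```r
# Hypothetical helper (not part of rvest): take the part before the
# first "/" of a string such as "83/100" and convert it to a number.
score_numerator <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)[[1]]  # c("83", "100")
  as.numeric(parts[1])
}

score_numerator("83/100")
```

Using `fixed = TRUE` treats the slash as a literal character rather than a regular expression, which is all that is needed here.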
I think it would look like this with pipes:
lego_movie %>%
html_node(".star-box-details a:nth-child(4)") %>%
html_text(trim=TRUE) %>%
strsplit(., "/") %>%
unlist(.) %>%
as.numeric(.) %>%
head(., 1)
[1] 83
Or, as Frank suggested, you can evaluate the string "83/100" as an R expression with something like:
lego_movie %>%
html_node(".star-box-details a:nth-child(4)") %>%
html_text(trim=TRUE) %>%
parse(text = .) %>%
eval(.)
[1] 0.83
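Evaluating scraped text with `parse()` and `eval()` runs whatever expression the page happens to contain, so a more conservative option may be to split the string and do the division yourself. The sketch below assumes the text is always "numerator/denominator"; `score_fraction` is a hypothetical name, not an rvest function.

```r
# Hypothetical helper: compute the fraction from "83/100" without
# evaluating the scraped text as R code.
score_fraction <- function(x) {
  parts <- as.numeric(strsplit(trimws(x), "/", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 2)  # assumes exactly one "/"
  parts[1] / parts[2]
}

score_fraction(" 83/100\n")
```

If the page ever returned something other than two numbers separated by a slash, this version would stop with an error or return NA instead of executing arbitrary code.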
You can see that, before the conversion to numeric, it returns " 83/100\n":
lego_movie %>%
html_node(".star-box-details a:nth-child(4)") %>%
html_text()
# [1] " 83/100\n"
You can use trim=TRUE to strip the surrounding whitespace, including the \n. You still cannot convert the result to a numeric value because it contains a /.
lego_movie %>%
html_node(".star-box-details a:nth-child(4)") %>%
html_text(trim=TRUE)
# [1] "83/100"
If you try to convert this string to numeric, you get NA with a warning, which is expected:
# [1] NA
# Warning message:
# In function_list[[k]](value) : NAs introduced by coercion
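The coercion failure can be reproduced on a plain string, with no scraping involved. This is a minimal sketch using only base R; the variable names are illustrative.

```r
score_text <- "83/100"

# as.numeric() cannot parse the "/", so it returns NA and signals a warning.
# suppressWarnings() is used here only to keep the demonstration quiet.
result <- suppressWarnings(as.numeric(score_text))
is.na(result)

# Capturing the warning instead exposes the message R raises.
msg <- tryCatch(as.numeric(score_text),
                warning = function(w) conditionMessage(w))
msg
```

Checking `is.na()` after the conversion is a simple way to detect that the scraped text was not a clean number.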
If you want the numeric value 83 as your final answer, you can use a regex tool such as gsub to remove the 100 and the / (assuming the full score is 100 for all movies).
lego_movie %>%
html_node(".star-box-details a:nth-child(4)") %>%
html_text(trim=TRUE) %>%
gsub("100|\\/", "", .) %>%
as.numeric()
# [1] 83
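One caveat with removing every "100" is that a perfect score of "100/100" would be reduced to an empty string. A minimal sketch that instead drops everything from the slash onward, assuming the text always has the form "score/total" (`meta_score` is a hypothetical helper name):

```r
# Hypothetical helper: keep only what precedes the "/".
meta_score <- function(x) {
  as.numeric(sub("/.*$", "", trimws(x)))
}

meta_score("83/100")
meta_score("100/100")

# For comparison, the gsub pattern strips both occurrences of "100",
# leaving an empty string that as.numeric() would turn into NA.
gsub("100|\\/", "", "100/100")
```

In a pipeline, the `gsub(...)` step could be swapped for `sub("/.*$", "", .)` to get the same behavior.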