Web scraping make / model / year of VIN numbers in RStudio
I am currently working on a project where I need to find the manufacturer, model and year of VIN numbers. I have a list of 300 different VIN numbers. Going through each individual VIN number and manually entering the manufacturer, model and year in Excel is very inefficient and tedious.
I tried using the rvest package with SelectorGadget to write a few lines of code in R to scrape this site for the information, but I was not successful: http://www.vindecoder.net/?vin=1G2HX54K724118697&submit=Decode
Here is my code:
library("rvest")
Vnum <- "1G2HX54K724118697"
site <- paste("http://www.vindecoder.net/?vin=", Vnum, "&submit=Decode", sep = "")
htmlpage <- html(site)
VINhtml <- html_nodes(htmlpage, ".odd:nth-child(6) , .even:nth-child(5) , .even:nth-child(7)")
VIN <- html_text(VINhtml)
paste(VIN, collapse = " ")
When I try to run VINhtml I get the error: list() attr(,"class") [1] "XMLNodeSet"
I don't know what I am doing wrong. I think it doesn't work because it is a dynamic webpage, but I could be wrong. Anyone have any suggestions on the best way to solve this problem?
I am also open to using other websites or alternative approaches to figure this out. I just want to find the model, manufacturer and model year of these VINs. Can anyone help me find an efficient way to do this?
Here are some example VINs: YV4SZ592561226129 YV4SZ592371288470 YV4SZ592371257784 YV4CZ982871331598 YV4CZ982581428985 YV4CZ982481423003 YV4CZ982381423543 YV4CZ982171380593 YV4CZ982081460887 YV4CZ852361288222 YV4CZ852281454409 YV4CZ852281454409 YV4CZ852281454409 YV4CZ592861304665 YV4CZ592861267682 YV4CZ592561266859
Here is a solution using RSelenium and rvest.
To run RSelenium, you must first download the Selenium server from here (mine is version 2.45). Let's say the downloaded file is in the "My Documents" folder. Then you need to follow the next two steps in cmd before running RSelenium in the IDE.
Enter the following in cmd:
a) cd My Documents # the Selenium server jar is in my My Documents folder
b) java -jar selenium-server-standalone-2.45.0.jar
library(RSelenium)
library(rvest)
startServer()
remDr <- remoteDriver(browserName = 'firefox')
remDr$open()
Vnum <- c("YV4SZ592371288470", "1G2HX54K724118697", "YV4SZ592371288470")
kk <- lapply(Vnum, function(j) {
  remDr$navigate(paste("http://www.vindecoder.net/?vin=", j, "&submit=Decode", sep = ""))
  Sys.sleep(30) # this is critical
  # this is RSelenium, but after this we can use rvest functions until we close the session
  test.html <- html(remDr$getPageSource()[[1]])
  test.text <- test.html %>%
    html_nodes(".odd:nth-child(6) , .even:nth-child(5) , .even:nth-child(7)") %>%
    html_text()
})
kk
[[1]]
[1] "Model: XC70" "Type: Multipurpose Passenger Vehicle" "Make: Volvo"
[[2]]
[1] "Model: Bonneville" "Make (Manufacturer): Pontiac" "Model year: 2002"
[[3]]
[1] "Model: XC70" "Type: Multipurpose Passenger Vehicle" "Make: Volvo"
remDr$close()
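Each scraped string has the form "Label: value", so the list can be reshaped into a table afterwards. The following is only a minimal sketch in base R: parse_fields and decoded are hypothetical names, and kk and vins are re-created here from the output printed above so that the snippet runs on its own.

```r
# Example strings copied from the output above; in practice use the kk
# returned by lapply and the Vnum vector of VINs.
kk <- list(
  c("Model: XC70", "Type: Multipurpose Passenger Vehicle", "Make: Volvo"),
  c("Model: Bonneville", "Make (Manufacturer): Pontiac", "Model year: 2002")
)
vins <- c("YV4SZ592371288470", "1G2HX54K724118697")

# Hypothetical helper: split each "Label: value" string at the first colon
# and return the values named by their labels.
parse_fields <- function(x) {
  labels <- trimws(sub(":.*$", "", x))
  values <- trimws(sub("^[^:]*:", "", x))
  setNames(values, labels)
}

# Long format (one row per VIN/field pair), because the fields that the
# selector returns differ from one VIN to the next.
decoded <- do.call(rbind, Map(function(vin, x) {
  f <- parse_fields(x)
  data.frame(vin = vin, field = names(f), value = unname(f),
             stringsAsFactors = FALSE)
}, vins, kk))
rownames(decoded) <- NULL

decoded
```

The long format keeps whatever labels the page returned, so a VIN whose selector picked up "Type" instead of "Model year" is easy to spot by filtering on the field column.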
P.S. As the output shows, the same CSS path does not work for all VINs: for the Volvo VINs it returns the vehicle type instead of the model year. You have to work out the right path beforehand (I just used the path you provided in the question). You can use tryCatch.
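As a rough illustration of the tryCatch idea, the sketch below wraps a per-VIN scraping function so that one failing VIN does not stop the whole loop. The names safe_scrape and fake_scrape are hypothetical; fake_scrape is a stand-in that fails on purpose and would be replaced by the real navigate-and-parse function.

```r
# Hypothetical wrapper: run scrape_fun on one VIN and return NA instead of
# aborting the loop when scrape_fun signals an error.
safe_scrape <- function(vin, scrape_fun) {
  tryCatch(
    scrape_fun(vin),
    error = function(e) {
      message("Failed for VIN ", vin, ": ", conditionMessage(e))
      NA_character_
    }
  )
}

# Stand-in scraper for illustration only; it raises an error for any input
# that is not 17 characters long.
fake_scrape <- function(vin) {
  if (nchar(vin) != 17) stop("VIN must have 17 characters")
  paste("decoded", vin)
}

results <- lapply(c("1G2HX54K724118697", "BADVIN"), safe_scrape,
                  scrape_fun = fake_scrape)
results
```

Note that tryCatch only catches errors. A selector that matches nothing returns an empty result rather than an error, so empty results still need a separate check (for example on length).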