rvest breaks on site redirects

Question

My situation: I have a long list of URLs (20k lines) from which I need to scrape individual data items for analysis. For the purposes of this example, I'm looking for a specific field called "sol-num", which is the solicitation number. Using the following function, I can get the solicitation number for any procurement listed in FedBizOpps:

require(rvest)
require(magrittr)
fetchSolNum<-function(URL){
  URL<-as.character(URL)
  solNum<-html(URL)%>%
    html_node(".sol-num")%>%
    html_text()
}

Now I have a list of thousands of URLs, and I want to pull the solicitation number for each one and store it in a new column of the data frame that the URLs came from. For your own testing, here are the first ten entries in the list of URLs:

list<-c("https://www.fbo.gov/spg/DISA/D4AD/DITCO/HC1028-12-T-0025/listing.html",
"https://www.fbo.gov/notices/c360b067077aabde331d66e0fe2d1f8f",         
"https://www.fbo.gov/notices/f63053a7a6e858a5b7b537a660c473b7",         
"https://www.fbo.gov/spg/DLA/J3/DSCP-I/SPM300-12-R-0024/listing.html",  
"https://www.fbo.gov/spg/DLA/J3/DAPS/SP7000-11-Q-0047/listing.html",    
"https://www.fbo.gov/spg/USAF/AFMC/OCALCCC/F3YCDW1245A001/listing.html",
"https://www.fbo.gov/spg/USAF/AFMC/AFFTC/FA9300-12-R-5001/listing.html",
"https://www.fbo.gov/notices/17ddec6ae37feb69704b1a52e22eeb26",         
"https://www.fbo.gov/notices/3b76d40705a23a749aad46df88dcee0c",         
"https://www.fbo.gov/notices/91873b727968dc664ada76c48e53e4df") 
raw <- data.frame(matrix(unlist(list), nrow=10, byrow=T))

I want to store the output in a variable solNum in my data frame named raw, so my first thought was to use a loop:

raw$solNum<-0

j=1
for (i in list){
  raw$solNum[j]<-fetchSolNum(i)
  j=j+1
}

Running this code currently returns the values for the first five rows and then throws the following error:

 Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
  Unknown input of class: NULL

After further investigation, I found that the problem is probably this URL from the list: https://www.fbo.gov/spg/USAF/AFMC/OCALCCC/F3YCDW1245A001/listing.html. It leads to a disambiguation-style page, because two listings share this URL.

Since my full list is 20k lines long, I don't have time to go through it and remove all the invalid URLs. Is there a way for my current function to just insert NA for rows where the URL is invalid? How can I keep it from breaking on this error?

Also, from what I've read, it might be faster and more efficient to perform this operation as a vectorized function rather than a loop. Can anyone give advice on what that might look like in my case?

+3

vectorization for-loop r web-scraping rvest

jtexnl 01 june 15 at 16:57

1 answer

cory · Accepted Answer · 2015-06-01T17:04:39+0000

tryCatch() will probably work here: catch the error and return NA instead. As for vectorization, I doubt you will see any real benefit. Reading each web page takes a while (a second or two), so 20 thousand of them will take some time regardless. Definitely check Hadley's chapter on Exceptions and Debugging, and set up some error checking so you don't run 4 hours of work ... http://adv-r.had.co.nz/Exceptions-Debugging.html
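A minimal sketch of the tryCatch() idea, not a tested solution for the FedBizOpps pages. The names safeFetch and fakeFetch and the example.com URLs are hypothetical. fakeFetch is only a stand-in for the question's fetchSolNum() so that the sketch runs without network access.

```r
# Wrap any fetcher so that an error yields NA instead of stopping the run.
# `fetch` stands in for the question's fetchSolNum(); it is passed as an
# argument here only so the sketch can run offline.
safeFetch <- function(URL, fetch) {
  tryCatch(
    as.character(fetch(URL)),
    error = function(e) NA_character_
  )
}

# Hypothetical stand-in for fetchSolNum(): it fails on one URL, much as
# the ambiguous listing page does in the question.
fakeFetch <- function(URL) {
  if (grepl("ambiguous", URL)) stop("Unknown input of class: NULL")
  paste0("SOL-", nchar(URL))
}

# Made-up URLs for illustration only.
urls <- c("https://example.com/a",
          "https://example.com/ambiguous",
          "https://example.com/bb")

# vapply() replaces the manual counter j and guarantees a character vector.
solNum <- vapply(urls, safeFetch, character(1),
                 fetch = fakeFetch, USE.NAMES = FALSE)
print(solNum)
```

With the question's own objects, the equivalent call would presumably be raw$solNum <- vapply(list, safeFetch, character(1), fetch = fetchSolNum, USE.NAMES = FALSE). This tidies the loop but does not make it faster, since the time is spent downloading each page.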
