Using R2HTML with rvest / xml2

I was reading this blog post about the new xml2 package. rvest used to depend on XML, and that made my job easier because I could combine functions from the two packages: for example, I would use htmlParse from the XML package whenever I couldn't read an HTML page with html (now renamed read_html). See this question for an example; I could then use rvest functions like html_nodes and html_attr on the parsed page. Now that rvest depends on xml2, this is not possible (at least not on the surface).

I'm just wondering what the main difference is between XML and xml2. Apart from what is said about the XML package in the post I mentioned above, the package author does not explain the differences between XML and xml2.

Another example:

library(R2HTML)  # save page as HTML and read it back later
library(XML)
k1 <- htmlParse("/questions/2235682/html-in-rvest-verses-htmlparse-in-xml")
head(getHTMLLinks(k1), 5)  # this works

[1] "//stackoverflow.com"           "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"          
[5] "http://meta.stackoverflow.com"

# But I want to save the HTML file in my working directory now and work with it later

HTML(k1, "k1")  # later I can work with this
rm(k1)
# read the stored HTML file k1
head(getHTMLLinks("k1"), 5)  # this works too

[1] "//stackoverflow.com"           "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"          
[5] "http://meta.stackoverflow.com"

# with read_html in the rvest package, this is not possible (as far as I know)
library(rvest)
library(R2HTML)
k2 <- read_html("/questions/2235682/html-in-rvest-verses-htmlparse-in-xml")

# this works
df1 <- k2 %>%
  html_nodes("a") %>%
  html_attr("href")

head(df1,5)
[1] "//stackoverflow.com"           "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"          
[5] "http://meta.stackoverflow.com"

# But I want to save the HTML file in my working directory now and work with it later
HTML(k2, "k2")  # later I can work with this
rm(k2, df1)
# now extract the links by reading the stored k2 HTML file back in
# this doesn't work
k2 <- read_html("k2")

df1 <- k2 %>%
  html_nodes("a") %>%
  html_attr("href")

df1
character(0)

      

Updates:

# I have the following package versions loaded:
lapply(c("rvest", "R2HTML", "xml2", "XML"), packageVersion)
[[1]]
[1] ‘0.2.0.9000’

[[2]]
[1] ‘2.3.1’

[[3]]
[1] ‘0.1.1’

[[4]]
[1] ‘3.98.1.2’

I am using Windows 8, R 3.2.1 and RStudio 0.99.441.





1 answer


The R2HTML package just calls capture.output() on the XML object and writes that text back to disk, which doesn't seem like a reliable way to save HTML/XML data. The reason the two behave differently is that XML objects print differently than xml2 objects. You can define a method that calls as.character() instead of relying on capture.output():

HTML.xml_document <- function(x, ...) HTML(as.character(x), ...)
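With that S3 method in place, R2HTML's HTML() dispatches on xml2's xml_document class and writes the real markup rather than its printed summary. A minimal sketch of the round trip; the inline HTML string is just a stand-in for whatever page you actually scraped:

```r
library(xml2)    # provides read_html()
library(rvest)   # provides html_nodes(), html_attr(), and %>%
library(R2HTML)

# S3 method from the answer: serialise the document before handing it to HTML()
HTML.xml_document <- function(x, ...) HTML(as.character(x), ...)

k2 <- read_html("<html><body><a href='//stackoverflow.com'>SO</a></body></html>")
HTML(k2, "k2")         # now writes the actual markup to the file "k2"
rm(k2)

k2 <- read_html("k2")  # read the stored file back in
k2 %>% html_nodes("a") %>% html_attr("href")
```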

      

Or you could probably skip R2HTML altogether and write the data out directly with xml2's write_xml().
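write_xml() serialises the underlying document node itself, so nothing depends on how the object happens to print. A sketch of that round trip, again using an inline stand-in for the real page:

```r
library(xml2)   # provides read_html() and write_xml()
library(rvest)  # provides html_nodes(), html_attr(), and %>%

doc <- read_html("<html><body><a href='//stackoverflow.com'>SO</a></body></html>")

# write the parsed document straight to disk
write_xml(doc, "k2.html")
rm(doc)

# the stored file parses back into a queryable xml_document
doc <- read_html("k2.html")
doc %>% html_nodes("a") %>% html_attr("href")
```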



Perhaps the best approach, though, is to download the file first and then import it:

download.file("http://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml", "local.html")
k2 <- read_html("local.html")









