Using getNodeSet on an XMLNodeSet (XML package)
I have a problem using the R XML package for a particular application. Consider the following sample document. I'm interested in getting the information in the b node inside the first a node. But the nature of my application is that I first need to identify all the a nodes in the document, then subset that set of nodes to get the first a node, and then get its b node. The first step is simple:
library(XML)
doc <- "
<div></div>
<a id='1'><b id='3'>text1</b></a>
<a id='2'><b id='4'>text2</b></a>
"
parsed <- htmlParse(doc)
step1 <- getNodeSet(parsed, "//a")
> step1
[[1]]
<a id="1">
<b id="3">text1</b>
</a>
[[2]]
<a id="2">
<b id="4">text2</b>
</a>
attr(,"class")
[1] "XMLNodeSet"
This gives the expected result. The next step in my application is to extract the b nodes from the first a node. If I use getNodeSet on step1[[1]], I get the b nodes from both a nodes in step1.
step2 <- getNodeSet(step1[[1]], "//b")
> step2
[[1]]
<b id="3">text1</b>
[[2]]
<b id="4">text2</b>
attr(,"class")
[1] "XMLNodeSet"
I figured out that I could use the XPath "b" to get the information in this example, but ultimately I need "//b" to work here. As I understand how the XML package works, I don't think this behavior is a bug, but a consequence of the node still referencing the C-level representation of the whole document. Is there a way to achieve this "two step" process? I want step1[[1]] to behave like a new document.
There are a number of techniques you can use to achieve what you want. First, you can adjust the XPath expression:
doc <- "
<div></div>
<a id='1'><b id='3'>text1</b></a>
<a id='2'><b id='4'>text2</b></a>
"
parsed <- htmlParse(doc)
parsed["//a[1]/b"]
> parsed["//a[1]/b"]
[[1]]
<b id="3">text1</b>
attr(,"class")
[1] "XMLNodeSet"
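One caveat with this expression: in XPath, "//a[1]" selects every a that is the first a child of its own parent. That coincides with the first a in the document only when all a nodes are siblings, as they are here. A minimal sketch, assuming a hypothetical document (doc2 is not from the question) in which each a sits under a different parent, shows how wrapping the path in parentheses, "(//a)[1]", picks the first a in document order instead:

```r
library(XML)

# Hypothetical document: each <a> lives inside its own <div>
doc2 <- "
<div><a id='1'><b id='3'>text1</b></a></div>
<div><a id='2'><b id='4'>text2</b></a></div>
"
parsed2 <- htmlParse(doc2, asText = TRUE)

# Each <a> is the first <a> child of its parent, so both match
per_parent <- getNodeSet(parsed2, "//a[1]/b")

# Parentheses apply [1] to the whole //a result, in document order
first_only <- getNodeSet(parsed2, "(//a)[1]/b")

length(per_parent)
length(first_only)
sapply(first_only, xmlValue)
```

If the a nodes may be scattered across the document, the parenthesized form is probably the safer choice.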
If you need to work with your step1, you can use a relative path in the XPath expression:
step1 <- getNodeSet(parsed, "//a")
getNodeSet(step1[[1]], "./b")
> getNodeSet(step1[[1]], "./b")
[[1]]
<b id="3">text1</b>
attr(,"class")
[1] "XMLNodeSet"
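Note that "./b" only matches direct children, while the question ultimately needs the descendant search of "//b". A leading "//" always starts at the document root, whichever node is passed to getNodeSet; prefixing it with a dot, ".//b", anchors the search at the context node instead. A minimal sketch, using a variant of the sample document in which the first b is wrapped in a span (the span and the *_nested names are my additions for illustration):

```r
library(XML)

# Variant of the sample document: the first <b> is no longer a direct child
doc_nested <- "
<div></div>
<a id='1'><span><b id='3'>text1</b></span></a>
<a id='2'><b id='4'>text2</b></a>
"
parsed_nested <- htmlParse(doc_nested, asText = TRUE)
step1_nested <- getNodeSet(parsed_nested, "//a")

# "./b" looks only at direct children of the first <a>: nothing matches
direct <- getNodeSet(step1_nested[[1]], "./b")

# ".//b" searches all descendants of the first <a>, and only those
nested <- getNodeSet(step1_nested[[1]], ".//b")

length(direct)
sapply(nested, xmlGetAttr, "id")
```

This keeps the two-step workflow without copying the node into a new document.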
To work with step1[[1]] as if it were a new XML document, there are two methods that might work:
mydoc2 <- xmlParse(saveXML(step1[[1]]))
> mydoc2["//b"]
[[1]]
<b id="3">text1</b>
attr(,"class")
[1] "XMLNodeSet"
and maybe better:
mydoc3 <- xmlDoc(step1[[1]])
> mydoc3["//b"]
[[1]]
<b id="3">text1</b>
attr(,"class")
[1] "XMLNodeSet"
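If this pattern is needed repeatedly, the copy-then-query step could be wrapped in a small helper. The sketch below assumes the sample document from the question; query_subtree is a hypothetical name, not a function of the XML package. It returns the text values rather than the nodes themselves:

```r
library(XML)

# Hypothetical helper: evaluate an absolute XPath against one node's subtree
query_subtree <- function(node, xpath) {
  sub_doc <- xmlDoc(node)                 # copy the node into its own document
  xpathSApply(sub_doc, xpath, xmlValue)   # return the text of each match
}

doc <- "
<div></div>
<a id='1'><b id='3'>text1</b></a>
<a id='2'><b id='4'>text2</b></a>
"
parsed <- htmlParse(doc, asText = TRUE)
step1 <- getNodeSet(parsed, "//a")

# "//b" now sees only the b nodes under the chosen <a>
first_vals <- query_subtree(step1[[1]], "//b")
second_vals <- query_subtree(step1[[2]], "//b")
```

Because each call copies the subtree, the relative ".//b" form is likely cheaper for large documents; the helper is mainly useful when the XPath expressions are fixed and absolute.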