Using getNodeSet on a node from an XMLNodeSet (XML package)

I have a problem using the R XML package for a particular application. Consider the sample document below. I'm interested in the information in the b node inside the first a node. The nature of my application is that I first need to identify all the a nodes in the document, then subset that node set to get the first a node, and then get its b node. The first step is simple:

    library(XML)

    doc <- "
    <div></div>
    <a id='1'><b id='3'>text1</b></a>
    <a id='2'><b id='4'>text2</b></a>
    "
    parsed <- htmlParse(doc)

    step1 <- getNodeSet(parsed, "//a")

    > step1
    [[1]]
    <a id="1">
      <b id="3">text1</b>
    </a>

    [[2]]
    <a id="2">
      <b id="4">text2</b>
    </a>

    attr(,"class")
    [1] "XMLNodeSet"

This gives the expected result. The next step in my application is to extract the b nodes from the first a node. But if I call getNodeSet on step1[[1]], I get the b nodes from both a nodes in step1.

    step2 <- getNodeSet(step1[[1]], "//b")
    step2

    [[1]]
    <b id="3">text1</b> 

    [[2]]
    <b id="4">text2</b> 

    attr(,"class")
    [1] "XMLNodeSet"

      

I figured out that I could use the XPath "b" to get the information in this example, but ultimately I need "//b" to work here. As I understand how the XML package works, I don't think this behavior is a bug, but rather a consequence of the node still referencing the C-level representation of the whole document. Is there a way to achieve this "two-step" process? I want step1[[1]] to behave like a new document.

1 answer


There are a number of techniques you can use to achieve what you want. First, you can refine the XPath expression:

    doc <- "
    <div></div>
    <a id='1'><b id='3'>text1</b></a>
    <a id='2'><b id='4'>text2</b></a>
    "
    parsed <- htmlParse(doc)
    parsed["//a[1]/b"]

    > parsed["//a[1]/b"]
    [[1]]
    <b id="3">text1</b>

    attr(,"class")
    [1] "XMLNodeSet"
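
One caveat with this expression, shown as a small sketch on a hypothetical document that is not part of the question: in XPath, `//a[1]` selects every a that is the first a child of its own parent, so it can match more than one node when the a elements sit under different parents. Wrapping the path in parentheses, `(//a)[1]`, applies the index to the whole result set instead. The names `nested`, `per_parent` and `first_overall` are made up for illustration.

```r
library(XML)

# Hypothetical document: the first <a> is nested one level deeper than the second.
xml <- "<root><div><a id='1'><b>text1</b></a></div><a id='2'><b>text2</b></a></root>"
nested <- xmlParse(xml, asText = TRUE)

# a[1] is tested per parent, so both <a> nodes qualify here.
per_parent <- xpathSApply(nested, "//a[1]/b", xmlValue)

# Parentheses apply the index to the whole node set: only the first <a> overall.
first_overall <- xpathSApply(nested, "(//a)[1]/b", xmlValue)

print(per_parent)
print(first_overall)
```

In the document from the question both a nodes share the same parent, so the two forms should give the same result there.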

      

If you need to work with your step1, you can use a relative path in the XPath expression:

    step1 <- getNodeSet(parsed, "//a")
    getNodeSet(step1[[1]], "./b")

    > getNodeSet(step1[[1]], "./b")
    [[1]]
    <b id="3">text1</b>

    attr(,"class")
    [1] "XMLNodeSet"
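
Since the question asks for the behaviour of "//b" rather than "b", it may help to note that "./b" only matches direct children. A path beginning with "//" is evaluated from the document root regardless of which node is passed in, whereas ".//b" searches all descendants of the context node. The following is a minimal sketch of that difference; the variable names `from_root`, `from_node` and `ids` are illustrative, and `asText = TRUE` is added only to make the string input explicit.

```r
library(XML)

doc <- "<div></div><a id='1'><b id='3'>text1</b></a><a id='2'><b id='4'>text2</b></a>"
parsed <- htmlParse(doc, asText = TRUE)
step1 <- getNodeSet(parsed, "//a")

# Absolute path: starts at the document root, so it finds every <b>.
from_root <- getNodeSet(step1[[1]], "//b")

# Relative path: starts at step1[[1]] and searches its descendants only.
from_node <- getNodeSet(step1[[1]], ".//b")

# Collect the id attributes of the nodes found below the first <a>.
ids <- sapply(from_node, xmlGetAttr, "id")

print(length(from_root))
print(ids)
```

With this form the two-step process works without copying anything, because the second query stays inside the node selected in the first step.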

      



To work with step1[[1]] as if it were a new XML document, there are two methods that might work:

    mydoc2 <- xmlParse(saveXML(step1[[1]]))

    > mydoc2["//b"]
    [[1]]
    <b id="3">text1</b>

    attr(,"class")
    [1] "XMLNodeSet"

      

and maybe better:

    mydoc3 <- xmlDoc(step1[[1]])

    > mydoc3["//b"]
    [[1]]
    <b id="3">text1</b>

    attr(,"class")
    [1] "XMLNodeSet"
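
As a rough end-to-end sketch of the xmlDoc approach: the node is copied into a document of its own, so an absolute "//b" is evaluated against that copy only, while the original document keeps both b nodes. The names `sub_doc`, `b_text` and `n_original` are illustrative, and `asText = TRUE` is an addition to make the string input explicit.

```r
library(XML)

doc <- "<div></div><a id='1'><b id='3'>text1</b></a><a id='2'><b id='4'>text2</b></a>"
parsed <- htmlParse(doc, asText = TRUE)
step1 <- getNodeSet(parsed, "//a")

# Copy the first <a> node into a document of its own.
sub_doc <- xmlDoc(step1[[1]])

# An absolute path is now evaluated against the copy only.
b_text <- xpathSApply(sub_doc, "//b", xmlValue)

# The original document is untouched and still holds both <b> nodes.
n_original <- length(getNodeSet(parsed, "//b"))

print(b_text)
print(n_original)
```

Compared with round-tripping through saveXML and xmlParse, this avoids serializing the node to text and parsing it again.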

      
