Working with empty XML nodes in R

I have the following XML file (I am missing the root node, but the editor won't allow me - assume there is a root node here):

<Indvls>
    <Indvl>
        <Info lastNm="HANSON" firstNm="LAURIE"/>
        <CrntEmps>
            <CrntEmp orgNm="ABC INCORPORATED" str1="FOURTY FOUR BRYANT PARK" city="NEW YORK" state="NY" cntry="UNITED STATES" postlCd="10036">
            <BrnchOfLocs>
                <BrnchOfLoc str1="833 NE 55TH ST" city="BELLEVUE" state="WA" cntry="UNITED STATES" postlCd="98004"/>
            </BrnchOfLocs>
            </CrntEmp>
        </CrntEmps>
    </Indvl>
    <Indvl>
        <Info lastNm="JACKSON" firstNm="SHERRY"/>
        <CrntEmps>
            <CrntEmp orgNm="XYZ INCORPORATED" str1="3411 GEORGE STREET" city="SAN FRANCISCO" state="CA" cntry="UNITED STATES" postlCd="94105">
            <BrnchOfLocs>
            </BrnchOfLocs>
            </CrntEmp>
        </CrntEmps>
    </Indvl>
</Indvls>

      

Using R, I want to extract the following columns in a table: (a) lastNm and firstNm from / Info node - values ​​are always present; (b) orgNm from / CrntEmps / CrntEmp node - always present with values; and also (c) str1, city, state from / CrntEmps / BrnchOfLocs / BrnchofLoc node - may or may not have values ​​(in my example, the second object has no office address).

My problem is that many nodes will not have a BrnchOfLoc node. I want to create a record even if the nodes are missing (otherwise the table is not balanced and I get an error when created in the dataframe).

Any thoughts or suggestions? I appreciate any inputs.

Application: Here is my code:

xmlGetNodeAttr <- function(n, xp, attr, default=NA) {
ns<-getNodeSet(n, xp)
if(length(ns)<1) {
    return(default)
} else {
    sapply(ns, xmlGetAttr, attr, default)
}
}

do.call(rbind, lapply(xmlChildren(xmlRoot(doc)), function(x) {
data.frame(
    fname=xmlGetNodeAttr(x, "//Info","firstNm",NA),
    lname=xmlGetNodeAttr(x, "//Info","lastNm",NA),
  orgname=xmlGetNodeAttr(x,"//CrntEmps/CrntEmp[1]","orgNm",NA),
    zip=xmlGetNodeAttr(x, "//CrntEmps/CrntEmp[1]/BrnchOfLocs/BrnchOfLoc[1]","city",NA)
)
}))

      

+3


source to share


1 answer


You have to do

do.call(rbind, lapply(xmlChildren(xmlRoot(doc)), function(x) {
data.frame(
    fname=xmlGetNodeAttr(x, "./Info","firstNm",NA),
    lname=xmlGetNodeAttr(x, "./Info","lastNm",NA),
    orgname=xmlGetNodeAttr(x, "./CrntEmps/CrntEmp[1]","orgNm",NA),
    zip=xmlGetNodeAttr(x, "./CrntEmps/CrntEmp[1]/BrnchOfLocs/BrnchOfLoc[1]","city",NA)
)
}))

      



Pay attention to the use ./

, not //

. The latter will search throughout the document, ignoring the current node that you are going to lapply

. Usage ./

will start at the current x

node and only look at descendants. This returns

        fname   lname          orgname      zip
Indvl  LAURIE  HANSON ABC INCORPORATED BELLEVUE
Indvl1 SHERRY JACKSON XYZ INCORPORATED     <NA>

      

+2


source







All Articles