R: parsing XML into a data.frame

I am new to the R XML package and want to parse the XML below into a data.frame. From searching StackOverflow, it seems that XPath is the better way to get a data.frame like the one below.

   locationName                  StartTime  MaxT  MinT
    Taipei City  2015-08-06T12:00:00+08:00    34    30
    Taipei City  2015-08-06T18:00:00+08:00    30    25
    Taipei City  2015-08-07T06:00:00+08:00    30    25
New Taipei City  2015-08-06T12:00:00+08:00    33    30
New Taipei City  2015-08-06T18:00:00+08:00    30    25
New Taipei City  2015-08-07T06:00:00+08:00    30    25

      

However, I don't know how to select nodes by elementName, parse them, and group the results into a data.frame.

Below is my XML sample:

<?xml version="1.0" encoding="UTF-8"?>
<cwbopendata xmlns="urn:cwb:gov:tw:cwbcommon:0.1">
    <identifier >6a9fd4e8-cf93-7884-fa2e-4a30f6960e13</identifier>
    <sender >weather@cwb.gov.tw</sender>
    <sent >2015-08-06T11:09:03+08:00</sent>
    <status >Actual</status>
    <msgType >Issue</msgType>
    <source >MFC</source>
    <dataid >C0032-001</dataid>
    <scope >Public</scope>
    <dataset >
        <datasetInfo>
            <datasetDescription>36 hours wealther predicts</datasetDescription>
            <issueTime>2015-08-06T11:00:00+08:00</issueTime>
            <update>2015-08-06T11:09:03+08:00</update>
</datasetInfo>
        <location>
            <locationName>Taipei City</locationName>
            <weatherElement>
                <elementName>Wx</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Cloudy</parameterName>
                        <parameterValue>12</parameterValue>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Rain</parameterName>
                        <parameterValue>12</parameterValue>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Rain</parameterName>
                        <parameterValue>26</parameterValue>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>MaxT</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>34</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>MinT</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>25</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>25</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>CI</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>HOT</parameterName>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>comforatble</parameterName>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>comforatble</parameterName>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>PoP</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>50</parameterName>
                        <parameterUnit>percentage</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>70</parameterName>
                        <parameterUnit>percentage</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>80</parameterName>
                        <parameterUnit>percentage</parameterUnit>
</parameter>
</time>
</weatherElement>
</location>
        <location>
            <locationName>New Taipei City</locationName>
            <weatherElement>
                <elementName>Wx</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>rainly</parameterName>
                        <parameterValue>12</parameterValue>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>rainly</parameterName>
                        <parameterValue>12</parameterValue>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>rainly</parameterName>
                        <parameterValue>26</parameterValue>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>MaxT</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>33</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>MinT</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>30</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>25</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>25</parameterName>
                        <parameterUnit>C</parameterUnit>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>CI</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Hot</parameterName>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Hot</parameterName>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>Hot</parameterName>
</parameter>
</time>
</weatherElement>
            <weatherElement>
                <elementName>PoP</elementName>
                <time>
                    <startTime>2015-08-06T12:00:00+08:00</startTime>
                    <endTime>2015-08-06T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>50</parameterName>
                        <parameterUnit>pertcentage</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-06T18:00:00+08:00</startTime>
                    <endTime>2015-08-07T06:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>60</parameterName>
                        <parameterUnit>pertcentage</parameterUnit>
</parameter>
</time>
                <time>
                    <startTime>2015-08-07T06:00:00+08:00</startTime>
                    <endTime>2015-08-07T18:00:00+08:00</endTime>
                    <parameter>
                        <parameterName>70</parameterName>
                        <parameterUnit>pertcentage</parameterUnit>
</parameter>
</time>
</weatherElement>
</location>
</dataset>
</cwbopendata>

      



1 answer


Dude, this was one of the hardest questions I've ever worked on. The problem may look quite simple, but the general complexity of navigating XML data in code, coupled with the task of extracting a subset of the XML structure and laying it out in a regular tabular format, coupled with the specific complexity of your question, in which you need to merge the MaxT and MinT data on StartTime, makes all of this quite difficult. But I'm happy to say I think I've got it.

library('XML');
doc <- xmlInternalTreeParse('sample.xml');
ns <- c(m=xmlNamespaceDefinitions(doc)[[1]]$uri);
df <- do.call(rbind,xpathApply(doc,'//m:location',namespaces=ns,function(locationNode) {
    locationName <- xpathSApply(locationNode,'m:locationName/text()',namespaces=ns,xmlValue);
    cbind(locationName,do.call(merge,xpathApply(locationNode,'m:weatherElement[m:elementName/text()="MaxT" or m:elementName/text()="MinT"]',namespaces=ns,function(elementNode) {
        elementName <- xpathSApply(elementNode,'m:elementName/text()',namespaces=ns,xmlValue);
        startTimes <- xpathSApply(elementNode,'m:time/m:startTime/text()',namespaces=ns,xmlValue);
        values <- xpathSApply(elementNode,'m:time/m:parameter/m:parameterName/text()',namespaces=ns,xmlValue);
        setNames(data.frame(startTimes,values,stringsAsFactors=F),c('StartTime',elementName));
    })));
}));
## fix data types from raw character strings
df$MaxT <- as.integer(df$MaxT);
df$MinT <- as.integer(df$MinT);
tzoSuffixRegex <- '([+-])(\\d{2}):(\\d{2})$';
## Four notes on the StartTime conversion below:
## (1) We have to use lapply() because the tz= parameter of as.POSIXct() (also strptime()) is unfortunately not vectorized.
## (2) Because POSIXct cannot store a time zone offset, but rather requires a time zone, we must "adapt" the full offset suffix to the truncated Etc/GMT pseudo time zone name.
## (3) We must use do.call(c,lapply(...)) rather than sapply(...) because sapply() weirdly simplifies to a named numeric vector, rather than a POSIXct vector.
## (4) We have to reverse the offset sign, because the Etc/GMT pseudo time zone names are bizarrely reversed from the standard notation; see <https://en.wikipedia.org/wiki/Tz_database#Area>
df$StartTime <- do.call(c,lapply(df$StartTime,function(t) as.POSIXct(t,format='%Y-%m-%dT%H:%M:%S',tz=chartr('+-','-+',sub(perl=T,'\\b0+','',sub(perl=T,paste0('.*',tzoSuffixRegex),'Etc/GMT\\1\\2',t))))));
df;
##      locationName           StartTime MaxT MinT
## 1     Taipei City 2015-08-06 00:00:00   34   30
## 2     Taipei City 2015-08-06 06:00:00   30   25
## 3     Taipei City 2015-08-06 18:00:00   30   25
## 4 New Taipei City 2015-08-06 00:00:00   33   30
## 5 New Taipei City 2015-08-06 06:00:00   30   25
## 6 New Taipei City 2015-08-06 18:00:00   30   25

      


The code obviously depends heavily on the design of the XML package, so refer to the package documentation for the important details. Here's my own summary of the XML functions I used in my code:

  • xmlInternalTreeParse()

    I use this to parse your raw data, which I saved as sample.xml in the working directory of my R session. Note that there is also a function xmlTreeParse(). The difference is that the internal function uses "internal" C-level pointer nodes, which are more powerful because they allow you to traverse the tree structure backwards; the non-internal function returns the data as a plain old recursive list structure. I didn't actually need backward traversal for my solution (although I did use it at first, when I had a different design in mind), but it's better to use the more powerful version in general. See http://www.omegahat.org/RSXML/shortIntro.html .
  • xmlNamespaceDefinitions()

    Namespace issues in XML are common and annoying. Typically, if nodes in an XML document carry an xmlns attribute or live under an element that does, they are considered to exist in that namespace and you need to qualify them appropriately in your XPath expressions. Your document has one top-level namespace, urn:cwb:gov:tw:cwbcommon:0.1. I used this XML function to extract that namespace and build a named vector around it. The name I use is m. I pass this vector as the namespaces argument of my subsequent XML function calls, which allows me to use the short prefix m: on all element tag names.
  • xmlValue()

    When you fetch a text node from the document, you can get its raw text content using this function. It is important to understand the difference between a text node and the actual text content of that node: a text node is a data structure representing the node and its relationship to the rest of the document; the text content is just a character string holding the node's raw text.
  • xpathApply()

    Like lapply(), this runs a function once for each item in a list. In this case, however, the list consists of all matches of the XPath query against the given XML document or node. There are four important arguments here: (1) the XML document or node, (2) the XPath query as a character string, (3) the namespaces in effect in the XPath query, and (4) the function to run on each matching node.
  • xpathSApply()

    This is to xpathApply() as sapply() is to lapply(). Generally, you wouldn't want to simplify a list of nodes, so it makes sense to use this when you pass a custom function that returns a primitive value such as a character vector. I used it with xmlValue() to get the raw text values of the text nodes.
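To make the namespace prefix mechanism above concrete, here is a minimal sketch on a made-up document. The URI urn:example:demo and the element names root and item are placeholders for illustration only, not part of your data; the calls are the same xmlNamespaceDefinitions(), xpathSApply() and xmlValue() described in the list.

```r
library(XML)

# Toy document with a default namespace (hypothetical URI, illustration only)
txt <- '<root xmlns="urn:example:demo"><item>a</item><item>b</item></root>'
doc <- xmlParse(txt, asText = TRUE)

# Bind the prefix "m" to the URI of the first namespace definition in the document
ns <- c(m = xmlNamespaceDefinitions(doc)[[1]]$uri)

# Every element name in the XPath is qualified with the "m:" prefix
values <- xpathSApply(doc, '//m:item', xmlValue, namespaces = ns)
print(values)
```

The prefix itself is arbitrary; only the URI it is bound to has to match the one declared in the document.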

The XML traversal begins by applying the XPath expression //m:location, which finds all location elements anywhere in the document (in your example, only two).

For each location node, I then get the location name with the relative XPath m:locationName/text(), evaluated with the location node as the context node, using the xpathSApply() + xmlValue() pattern to get the raw text. Then I descend into the contained weatherElement nodes with the relative XPath m:weatherElement[m:elementName/text()="MaxT" or m:elementName/text()="MinT"]. Notice how I use a predicate to filter for specific weather elements; inside the predicate I use further relative XPath subexpressions to test the raw text of the weather element names. Note that XPath 2.0 allows a more concise syntax, m:weatherElement[m:elementName/text()=("MaxT","MinT")], but the R XML package does not seem to support it.
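The predicate filter can be tried in isolation. The sketch below uses a made-up, stripped-down document (the namespace URI urn:example:demo and the loc wrapper element are placeholders I invented for illustration); the predicate itself is the same one used in the solution.

```r
library(XML)

# Toy document: three weatherElement nodes, only two of which we want
# (hypothetical namespace URI and wrapper element, illustration only)
txt <- paste0(
  '<loc xmlns="urn:example:demo">',
  '<weatherElement><elementName>Wx</elementName></weatherElement>',
  '<weatherElement><elementName>MaxT</elementName></weatherElement>',
  '<weatherElement><elementName>MinT</elementName></weatherElement>',
  '</loc>')
doc <- xmlParse(txt, asText = TRUE)
ns <- c(m = 'urn:example:demo')

# The predicate in [...] keeps only the weatherElement nodes whose
# elementName text is MaxT or MinT; we then read back the matching names
xp <- paste0('//m:weatherElement',
             '[m:elementName/text()="MaxT" or m:elementName/text()="MinT"]',
             '/m:elementName')
kept <- xpathSApply(doc, xp, xmlValue, namespaces = ns)
print(kept)
```

The Wx element is skipped because neither side of the "or" matches its name.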

For each weather element node, I get its name, the start times of all time nodes under it, and all the required numeric values (which, weirdly, actually live under elements with the tag name parameterName) using xpathSApply() + xmlValue(). Then I create a result data.frame that contains two columns: the start times as StartTime, and the values under a column named after the current weather element, as given by its elementName in the XML document.

So the return value of the xpathApply() call that runs over both weather elements will be a list with two components, each of which is a data.frame of the required data under that weather element. The first data.frame will have columns StartTime and MaxT, and the second will have columns StartTime and MinT. We can then combine them with a simple call to merge(). We could store the return value and then make the call manually, e.g. merge(returnValue[[1]],returnValue[[2]]), but I decided to get a little fancy here and just called do.call(), which has the same effect.
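The equivalence between the manual merge() call and do.call() can be checked with base R alone. In this sketch the two data.frames are typed in by hand, using the first two Taipei City values from your sample, in place of the ones the inner function would build; the variable names are mine.

```r
# Two per-element data.frames shaped like the ones built for MaxT and MinT
# (values copied by hand from the Taipei City sample)
maxT <- data.frame(StartTime = c('2015-08-06T12:00:00+08:00', '2015-08-06T18:00:00+08:00'),
                   MaxT = c('34', '30'), stringsAsFactors = FALSE)
minT <- data.frame(StartTime = c('2015-08-06T12:00:00+08:00', '2015-08-06T18:00:00+08:00'),
                   MinT = c('30', '25'), stringsAsFactors = FALSE)
pieces <- list(maxT, minT)

# do.call() passes the list components as the arguments of merge(),
# so this is the same call as merge(pieces[[1]], pieces[[2]])
merged <- do.call(merge, pieces)
manual <- merge(pieces[[1]], pieces[[2]])
print(merged)
```

merge() joins on the columns the two data.frames have in common, which here is only StartTime. Because merge() takes exactly two data.frames, this trick works only for a list of two; for more elements, something like Reduce(merge, pieces) would be needed instead.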

Then, still in the context of the location node, we cbind() the location name on as the leading column of the merged data.frame and return the result to the top level.

The last step at the top level is to rbind() the data.frames from all locations that were matched by the original XPath query and assign the result to a variable.
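The cbind() and rbind() steps can also be sketched without any XML. The helper name addLocation is my own invention for this illustration; the solution does the same thing inline inside the function passed to xpathApply().

```r
# Hypothetical helper: prepend the location name as the leading column
addLocation <- function(locationName, merged) cbind(locationName, merged)

# Stand-ins for the merged per-location data.frames (values from the sample)
taipei <- data.frame(StartTime = '2015-08-06T12:00:00+08:00',
                     MaxT = '34', MinT = '30', stringsAsFactors = FALSE)
newTaipei <- data.frame(StartTime = '2015-08-06T12:00:00+08:00',
                        MaxT = '33', MinT = '30', stringsAsFactors = FALSE)

perLocation <- list(addLocation('Taipei City', taipei),
                    addLocation('New Taipei City', newTaipei))

# Stack the per-location data.frames into one result
df <- do.call(rbind, perLocation)
print(df)
```

Unlike merge(), rbind() accepts any number of data.frames, so do.call(rbind, ...) works no matter how many locations the document contains.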

I figured that you would also want to coerce the raw strings to appropriate data types, so I added a few lines for that. MaxT and MinT look like they should be integers (although you could also use as.double() for doubles), and StartTime should be POSIXct or POSIXlt (personally, I always use POSIXct as it's more compact).

I've tried to be as robust as possible with the StartTime conversion, but IMO, modern software is often just not capable of fully handling the complexities of date/time data. In this case, the XML data contains date/time values with time zone offsets, but the R POSIXct type can only accept an optional time zone specifier, and we don't have one. My solution was to use the somewhat dated and awkwardly designed Etc/GMT... Olson time zone names, which allow a time zone offset to be passed to as.POSIXct() to the nearest hour. Oddly enough, when combining all the values into one vector, which must necessarily discard the individual tzone attributes and replace them with a single attribute or none at all, tzone is actually removed completely from the resulting vector. Fortunately, the times themselves are still correct (by the clock), since times are stored internally as seconds since 1970-01-01 00:00:00 UTC, but the original time zone offsets are lost. The times are displayed in the user's current time zone; for me that is EDT (UTC-4), so the times in my demo output are 12 hours behind those in the XML data.
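Two of the points above can be verified with base R. The first half of this sketch shows the reversed sign of the Etc/GMT names. The second half is an alternative that is not what my code above does: it removes the colon from the offset so that the %z input conversion of strptime() can read it, which has the advantage of being vectorized. The variable names are mine, and the sketch assumes the Etc/GMT-8 zone is available in your time zone database.

```r
x <- c('2015-08-06T12:00:00+08:00', '2015-08-07T06:00:00+08:00')

# Reversed sign: the zone named Etc/GMT-8 is 8 hours AHEAD of UTC
viaEtc <- as.POSIXct('2015-08-06T12:00:00', format = '%Y-%m-%dT%H:%M:%S',
                     tz = 'Etc/GMT-8')

# Alternative approach: turn "+08:00" into "+0800" and parse it with %z
compact <- sub('([+-]\\d{2}):(\\d{2})$', '\\1\\2', x)
viaZ <- as.POSIXct(compact, format = '%Y-%m-%dT%H:%M:%S%z', tz = 'UTC')

# Both routes give the same instant for the first value;
# formatting in UTC makes the output independent of the local time zone
print(format(viaEtc, '%Y-%m-%d %H:%M:%S', tz = 'UTC'))
print(format(viaZ, '%Y-%m-%d %H:%M:%S', tz = 'UTC'))
```

Either way the original offsets are not stored in the result; only the instants are. Passing an explicit tz= to format() or print() is what controls how the times are displayed.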




To add one thing: you might notice that the XML package has a function xmlToDataFrame() that might seem ideal for this task. However, it is a rather limited function that depends on a very regular structure to give a reasonable result. In general, if you want to retrieve a subset of XML data that is somewhat scattered throughout the document, you will have to traverse the document yourself in code.

Here's a demonstration of why we can't use a single xmlToDataFrame() call, or a set of such calls through an apply function, to get the data we need:

Attempt #1: from the location nodes



xpathApply(doc,'//m:location',namespaces=ns,xmlToDataFrame);
## Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("Wx",  :
##   duplicate subscripts for columns

      

The data structure under the location nodes is too irregular, and xmlToDataFrame() refuses to try to coerce it into a data.frame.

Attempt #2: from the weatherElement nodes

xpathApply(doc,'//m:location/m:weatherElement',namespaces=ns,xmlToDataFrame);
## [[1]]
##   text                 startTime                   endTime parameter
## 1   Wx                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00  Cloudy12
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00    Rain12
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00    Rain26
##
## [[2]]
##   text                 startTime                   endTime parameter
## 1 MaxT                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       34C
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       30C
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       30C
##
## [[3]]
##   text                 startTime                   endTime parameter
## 1 MinT                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       30C
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       25C
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       25C
##
## [[4]]
##   text                 startTime                   endTime   parameter
## 1   CI                      <NA>                      <NA>        <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00         HOT
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00 comforatble
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00 comforatble
##
## [[5]]
##   text                 startTime                   endTime    parameter
## 1  PoP                      <NA>                      <NA>         <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00 50percentage
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00 70percentage
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00 80percentage
##
## [[6]]
##   text                 startTime                   endTime parameter
## 1   Wx                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00  rainly12
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00  rainly12
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00  rainly26
##
## [[7]]
##   text                 startTime                   endTime parameter
## 1 MaxT                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       33C
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       30C
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       30C
##
## [[8]]
##   text                 startTime                   endTime parameter
## 1 MinT                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       30C
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       25C
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       25C
##
## [[9]]
##   text                 startTime                   endTime parameter
## 1   CI                      <NA>                      <NA>      <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00       Hot
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00       Hot
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00       Hot
##
## [[10]]
##   text                 startTime                   endTime     parameter
## 1  PoP                      <NA>                      <NA>          <NA>
## 2 <NA> 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00 50pertcentage
## 3 <NA> 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00 60pertcentage
## 4 <NA> 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00 70pertcentage
##

      

The above data is missing the location names, so even if we collected all the location names separately, we wouldn't know which data.frames came from which location nodes, unless we start making assumptions about which weather element names occur under every location node. That is theoretically doable, but it is starting to get a little unreasonable. And obviously there is still a lot of filtering and reshaping work to be done to get the required data into the required form.

It should also be noted that the text content of parameterName and parameterUnit was concatenated into a single column parameter (parameter being the name of their immediate parent element). In this case the result does look reasonable, at least for the temperature parameters, since together they form numeric values with units of measurement (for example 30C), which is a very common notation. In general, though, this behavior is a bit questionable, and if you really want the numbers without the units, you'll have to do some text processing to "undo" the concatenation, which again starts to depart from the realm of sane code.

Attempt #3: from the time nodes

xpathApply(doc,'//m:location/m:weatherElement/m:time',namespaces=ns,xmlToDataFrame);
## [[1]]
##                        text parameterName parameterValue
## 1 2015-08-06T12:00:00+08:00          <NA>           <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>           <NA>
## 3                      <NA>        Cloudy             12
##
## [[2]]
##                        text parameterName parameterValue
## 1 2015-08-06T18:00:00+08:00          <NA>           <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>           <NA>
## 3                      <NA>          Rain             12
##
## [[3]]
##                        text parameterName parameterValue
## 1 2015-08-07T06:00:00+08:00          <NA>           <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>           <NA>
## 3                      <NA>          Rain             26
##
## [[4]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            34             C
##
## [[5]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[6]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[7]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[8]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            25             C
##
## [[9]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            25             C
##
## [[10]]
##                        text parameterName
## 1 2015-08-06T12:00:00+08:00          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>
## 3                      <NA>           HOT
##
## [[11]]
##                        text parameterName
## 1 2015-08-06T18:00:00+08:00          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>
## 3                      <NA>   comforatble
##
## [[12]]
##                        text parameterName
## 1 2015-08-07T06:00:00+08:00          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>
## 3                      <NA>   comforatble
##
## [[13]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            50    percentage
##
## [[14]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            70    percentage
##
## [[15]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            80    percentage
##
## [[16]]
##                        text parameterName parameterValue
## 1 2015-08-06T12:00:00+08:00          <NA>           <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>           <NA>
## 3                      <NA>        rainly             12
##
## [[17]]
##                        text parameterName parameterValue
## 1 2015-08-06T18:00:00+08:00          <NA>           <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>           <NA>
## 3                      <NA>        rainly             12
##
## [[18]]
##                        text parameterName parameterValue
## 1 2015-08-07T06:00:00+08:00          <NA>           <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>           <NA>
## 3                      <NA>        rainly             26
##
## [[19]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            33             C
##
## [[20]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[21]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[22]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            30             C
##
## [[23]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            25             C
##
## [[24]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            25             C
##
## [[25]]
##                        text parameterName
## 1 2015-08-06T12:00:00+08:00          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>
## 3                      <NA>           Hot
##
## [[26]]
##                        text parameterName
## 1 2015-08-06T18:00:00+08:00          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>
## 3                      <NA>           Hot
##
## [[27]]
##                        text parameterName
## 1 2015-08-07T06:00:00+08:00          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>
## 3                      <NA>           Hot
##
## [[28]]
##                        text parameterName parameterUnit
## 1 2015-08-06T12:00:00+08:00          <NA>          <NA>
## 2 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            50   pertcentage
##
## [[29]]
##                        text parameterName parameterUnit
## 1 2015-08-06T18:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 3                      <NA>            60   pertcentage
##
## [[30]]
##                        text parameterName parameterUnit
## 1 2015-08-07T06:00:00+08:00          <NA>          <NA>
## 2 2015-08-07T18:00:00+08:00          <NA>          <NA>
## 3                      <NA>            70   pertcentage
##

      

Now, not only are we missing the location names and their mappings, but we are also missing the weather element names and their mappings to the above data.frames.

Attempt #4: Single Call Passing Location Nodes

xmlToDataFrame(doc,nodes=xpathApply(doc,'//m:location',namespaces=ns));
## Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("Taipei City",  :
##   duplicate subscripts for columns

      

Same problem.

Attempt #5: Single Call Passing weatherElement Nodes

xmlToDataFrame(doc,nodes=xpathApply(doc,'//m:location/m:weatherElement',namespaces=ns));
## Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("Wx",  :
##   duplicate subscripts for columns

      

The data under each individual weatherElement node seems to be regular enough for xmlToDataFrame(), as it worked in attempt #2, but combining them all into one data.frame doesn't work.

Attempt #6: Single Call Passing time Nodes

xmlToDataFrame(doc,nodes=xpathApply(doc,'//m:location/m:weatherElement/m:time',namespaces=ns));
##                    startTime                   endTime     parameter
## 1  2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00      Cloudy12
## 2  2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00        Rain12
## 3  2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00        Rain26
## 4  2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           34C
## 5  2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           30C
## 6  2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           30C
## 7  2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           30C
## 8  2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           25C
## 9  2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           25C
## 10 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           HOT
## 11 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00   comforatble
## 12 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00   comforatble
## 13 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00  50percentage
## 14 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00  70percentage
## 15 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00  80percentage
## 16 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00      rainly12
## 17 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00      rainly12
## 18 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00      rainly26
## 19 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           33C
## 20 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           30C
## 21 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           30C
## 22 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           30C
## 23 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           25C
## 24 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           25C
## 25 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00           Hot
## 26 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00           Hot
## 27 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00           Hot
## 28 2015-08-06T12:00:00+08:00 2015-08-06T18:00:00+08:00 50pertcentage
## 29 2015-08-06T18:00:00+08:00 2015-08-07T06:00:00+08:00 60pertcentage
## 30 2015-08-07T06:00:00+08:00 2015-08-07T18:00:00+08:00 70pertcentage

      

As you can see, we have the same problems here that I described earlier.

So this is not a viable approach. As I said, manual traversal of the XML tree is required here.

Finally, one could argue that we could combine manual traversal with calls to xmlToDataFrame(), and that doesn't sound unreasonable to me. However, it would not take us very far: we would still have to handle most of the navigation ourselves, and we would still have a lot of work to do to reshape the results into the desired output. IMO, trying to fit xmlToDataFrame() into an already complex manual traversal scheme doesn't give much bang for the buck. We can just extract whatever we need manually and combine it into a data.frame with our own call to the data.frame() constructor, just like in my solution. This provides maximum control and (relative) simplicity.
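
To make the manual approach concrete, here is a minimal sketch. It is not the exact code from my solution, and it runs on a hypothetical, cut-down copy of the document; the element names MaxT and MinT are assumed from the desired output in the question, and elemValues() is a helper invented for this illustration. Each location node is visited once, and relative XPath expressions pull out the pieces that go into data.frame().

```r
library(XML)

# Hypothetical miniature of the question's document (one location, two time
# slots). The elementName values MaxT and MinT are an assumption.
xml <- '<?xml version="1.0" encoding="UTF-8"?>
<cwbopendata xmlns="urn:cwb:gov:tw:cwbcommon:0.1"><dataset><location>
<locationName>Taipei City</locationName>
<weatherElement><elementName>MaxT</elementName>
<time><startTime>2015-08-06T12:00:00+08:00</startTime>
<parameter><parameterName>34</parameterName><parameterUnit>C</parameterUnit></parameter></time>
<time><startTime>2015-08-06T18:00:00+08:00</startTime>
<parameter><parameterName>30</parameterName><parameterUnit>C</parameterUnit></parameter></time>
</weatherElement>
<weatherElement><elementName>MinT</elementName>
<time><startTime>2015-08-06T12:00:00+08:00</startTime>
<parameter><parameterName>30</parameterName><parameterUnit>C</parameterUnit></parameter></time>
<time><startTime>2015-08-06T18:00:00+08:00</startTime>
<parameter><parameterName>25</parameterName><parameterUnit>C</parameterUnit></parameter></time>
</weatherElement>
</location></dataset></cwbopendata>'

doc <- xmlParse(xml, asText=TRUE)
ns <- c(m='urn:cwb:gov:tw:cwbcommon:0.1')  # prefix for the default namespace

# Hypothetical helper: the values of one weather element under a location node.
elemValues <- function(loc, name) {
    path <- sprintf('./m:weatherElement[m:elementName="%s"]/m:time/m:parameter/m:parameterName', name)
    as.integer(xpathSApply(loc, path, xmlValue, namespaces=ns))
}

# One data.frame per location; the single location name is recycled
# across that location's time slots.
rows <- lapply(getNodeSet(doc, '//m:location', namespaces=ns), function(loc) {
    data.frame(
        locationName=xpathSApply(loc, './m:locationName', xmlValue, namespaces=ns),
        StartTime=xpathSApply(loc, './m:weatherElement[m:elementName="MaxT"]/m:time/m:startTime', xmlValue, namespaces=ns),
        MaxT=elemValues(loc, 'MaxT'),
        MinT=elemValues(loc, 'MinT'),
        stringsAsFactors=FALSE
    )
})
res <- do.call(rbind, rows)
print(res)
```

The sketch assumes that MaxT and MinT list the same time slots in the same order within each location, as they do in the sample; if that can't be guaranteed, merge the two elements on startTime instead of binding them by position.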
