How to effectively flatten nested lists and data frames into a single data block?

Question

How to effectively flatten nested lists and data frames into a single data block?

I have some data that is formatted in a way that is difficult to use, so I am trying to flatten this out. minimal reproducible example here .

> str(sampleData)
List of 4
 $ Events       :'data.frame':  2 obs. of  3 variables:
  ..$ CateringOptions:List of 2
  .. ..$ :'data.frame': 1 obs. of  3 variables:
  .. .. ..$ Agreed : logi TRUE
  .. .. ..$ Tnc    :'data.frame': 1 obs. of  5 variables:
  .. .. .. ..$ Identity      : chr "SpicyOWing"
  .. .. .. ..$ Schema        : logi NA
  .. .. .. ..$ ElementId     : chr "105031"
  .. .. .. ..$ ElementType   : logi NA
  .. .. .. ..$ ElementVersion: logi NA
  .. .. ..$ Address: chr "New York"
  .. ..$ :'data.frame': 1 obs. of  3 variables:
  .. .. ..$ Agreed : logi TRUE
  .. .. ..$ Tnc    :'data.frame': 1 obs. of  5 variables:
  .. .. .. ..$ Identity      : chr "BaconEggs"
  .. .. .. ..$ Schema        : logi NA
  .. .. .. ..$ ElementId     : chr "105032"
  .. .. .. ..$ ElementType   : logi NA
  .. .. .. ..$ ElementVersion: logi NA
  .. .. ..$ Address: chr "Seattle"
  ..$ Action         : num [1:2] 1 1
  ..$ Volume         : num [1:2] 1000 2000
 $ Host         :List of 5
  ..$ Identity      : chr "John"
  ..$ Schema        : logi NA
  ..$ ElementId     : chr "101505"
  ..$ ElementType   : logi NA
  ..$ ElementVersion: logi NA
 $ Sender       :List of 5
  ..$ Identity      : chr "Jane"
  ..$ Schema        : logi NA
  ..$ ElementId     : chr "101005"
  ..$ ElementType   : logi NA
  ..$ ElementVersion: logi NA
 $ CompletedDate: chr "/Date(1490112000000)/"

Expected

> expectedOutcome
  Events.CateringOptions.Agreed Events.CateringOptions.Tnc.Identity Events.CateringOptions.Tnc.Schema Events.CateringOptions.Tnc.ElementId
1                            NA                          SpicyOWing                                TRUE                               105031
2                            NA                           BaconEggs                                TRUE                               105032
  Events.CateringOptions.Tnc.ElementType Events.CateringOptions.Tnc.ElementVersion Events.CateringOptions.Address Events.Action Events.Volume Host.Identity
1                                     NA                                        NA                       New York             1          1000          John
2                                     NA                                        NA                        Seattle             1          2000          John
  Host.Schema Host.ElementId Host.ElementType Host.ElementVersion Sender.Identity Sender.Schema Sender.ElementId Sender.ElementType Sender.ElementVersion
1          NA         101505               NA                  NA            Jane            NA           101005                 NA                    NA
2          NA         101505               NA                  NA            Jane            NA           101005                 NA                    NA
          CompletedDate
1 /Date(1490112000000)/
2 /Date(1490112000000)/

Check function

check<-function(li){
  areDF<-sapply(1:length(li), function(i) class(li[[i]]) == "data.frame")
  areList<-sapply(1:length(li), function(i) class(li[[i]]) == "list")
  tmp1 <- NULL
  tmp2 <- NULL
  if(any(areDF)){
    for(j in which(areDF)){
      columns <- jsonlite::flatten(li[[j]])
      li[[j]]  <- check(columns)
    }
    tmp1<-plyr::rbind.fill(li[areDF])
    #return(tmp1)
  }
  if(any(areList)){
    for(j in which(areList)){
      li[[j]]<-check(li[[j]])
    }
    tmp2<-do.call(cbind,li)
    #return(tmp2)
  }
  if(!is.null(tmp1) & !is.null(tmp2)){
    return (cbind(tmp1,tmp2))
  }
  else if(!is.null(tmp1)){
    return (tmp1)
  }
  else if(!is.null(tmp2)){
    return (tmp2)
  }
  return(li)
}

results

> str(check(sampleData))
'data.frame': 2 obs. of  29 variables:
 $ CateringOptions.Agreed                   : logi  TRUE TRUE
 $ CateringOptions.Address                  : chr  "New York" "Seattle"
 $ CateringOptions.Tnc.Identity             : chr  "SpicyOWing" "BaconEggs"
 $ CateringOptions.Tnc.Schema               : logi  NA NA
 $ CateringOptions.Tnc.ElementId            : chr  "105031" "105032"
 $ CateringOptions.Tnc.ElementType          : logi  NA NA
 $ CateringOptions.Tnc.ElementVersion       : logi  NA NA
 $ Action                                   : num  1 1
 $ Volume                                   : num  1000 2000
 $ Events.CateringOptions.Agreed            : logi  TRUE TRUE
 $ Events.CateringOptions.Address           : chr  "New York" "Seattle"
 $ Events.CateringOptions.Tnc.Identity      : chr  "SpicyOWing" "BaconEggs"
 $ Events.CateringOptions.Tnc.Schema        : logi  NA NA
 $ Events.CateringOptions.Tnc.ElementId     : chr  "105031" "105032"
 $ Events.CateringOptions.Tnc.ElementType   : logi  NA NA
 $ Events.CateringOptions.Tnc.ElementVersion: logi  NA NA
 $ Events.Action                            : num  1 1
 $ Events.Volume                            : num  1000 2000
 $ Host.Identity                            : Factor w/ 1 level "John": 1 1
 $ Host.Schema                              : logi  NA NA
 $ Host.ElementId                           : Factor w/ 1 level "101505": 1 1
 $ Host.ElementType                         : logi  NA NA
 $ Host.ElementVersion                      : logi  NA NA
 $ Sender.Identity                          : Factor w/ 1 level "Jane": 1 1
 $ Sender.Schema                            : logi  NA NA
 $ Sender.ElementId                         : Factor w/ 1 level "101005": 1 1
 $ Sender.ElementType                       : logi  NA NA
 $ Sender.ElementVersion                    : logi  NA NA
 $ CompletedDate                            : Factor w/ 1 level "/Date(1490112000000)/": 1 1

I almost don't have it, but the nested file structure is being fooled. Also, my code is taking quite a long time. Does anyone know how I can iron this out?

Edit:

I added my solution at the end to the gist

+3

r

Jean 28 Mar 17 at 10:08

source to share

2 answers

Aurèle · Answer 1 · 2017-03-28T12:16:55+0000

Here's my example using purrr

.
The idea is similar to yours, only with a different syntax: the flatten()

most nested data frames, then rbind()

theirs.
If I understand your code correctly, mine is slightly different at the end as I will try to get a more " jsonlite::flatten

-friendly" structure to apply it again to the end result:

library(jsonlite)
library(purrr)
res <- 
  sampleData %>% 
  modify_if(
    is.list, 
    .f = ~ modify_if(
      .x,
      .p = function(x) all(sapply(x, is.data.frame)), 
      .f = ~ do.call("rbind", lapply(.x, jsonlite::flatten))
    )
  ) %>% 
  as.data.frame() %>% 
  jsonlite::flatten()
str(res)

# 'data.frame': 2 obs. of  20 variables:
#  $ Events.Action                            : num  1 1
#  $ Events.Volume                            : num  1000 2000
#  $ Host.Identity                            : chr  "John" "John"
#  $ Host.Schema                              : logi  NA NA
#  $ Host.ElementId                           : chr  "101505" "101505"
#  $ Host.ElementType                         : logi  NA NA
#  $ Host.ElementVersion                      : logi  NA NA
#  $ Sender.Identity                          : chr  "Jane" "Jane"
#  $ Sender.Schema                            : logi  NA NA
#  $ Sender.ElementId                         : chr  "101005" "101005"
#  $ Sender.ElementType                       : logi  NA NA
#  $ Sender.ElementVersion                    : logi  NA NA
#  $ CompletedDate                            : chr  "/Date(1490112000000)/" "/Date(1490112000000)/"
#  $ Events.CateringOptions.Agreed            : logi  TRUE TRUE
#  $ Events.CateringOptions.Address           : chr  "New York" "Seattle"
#  $ Events.CateringOptions.Tnc.Identity      : chr  "SpicyOWing" "BaconEggs"
#  $ Events.CateringOptions.Tnc.Schema        : logi  NA NA
#  $ Events.CateringOptions.Tnc.ElementId     : chr  "105031" "105032"
#  $ Events.CateringOptions.Tnc.ElementType   : logi  NA NA
#  $ Events.CateringOptions.Tnc.ElementVersion: logi  NA NA

I have one inconsistency with yours expectedOutcome

, but if possible it might be on your side:

all.equal(expectedOutcome[sort(names(expectedOutcome))], res[sort(names(res))])
# [1] "Component "Events.CateringOptions.Agreed": 'is.NA' value mismatch: 0 in current 2 in target"

A5C1D2H2I1M1N2O1R2T1 · Answer 2 · 2017-03-29T02:46:52+0000

Not sure if this simplifies your problem, but with the sample you shared it works. Basically, if the column is not a vector, when you do data.frame(your_list)

, that's the unlist

data that does matrix

.

FLAT <- function(inlist) {
  A <- data.frame(inlist)
  out <- lapply(A, function(y) {
    if (is.list(y)) {
      y <- unlist(y)
      m <- matrix(y, nrow(A), byrow = TRUE, dimnames = list(NULL, unique(names(y))))
      y <- data.frame(m, stringsAsFactors = FALSE)
      y[] <- lapply(y, type.convert)
    }
    y
  })
  do.call(cbind, out)
}
FLAT(sampleData)

Here, str

according to your approximate data:

str(FLAT(sampleData))
## 'data.frame':    2 obs. of  20 variables:
##  $ Events.CateringOptions.Agreed            : logi  TRUE TRUE
##  $ Events.CateringOptions.Tnc.Identity      : Factor w/ 2 levels "BaconEggs","SpicyOWing": 2 1
##  $ Events.CateringOptions.Tnc.Schema        : logi  NA NA
##  $ Events.CateringOptions.Tnc.ElementId     : int  105031 105032
##  $ Events.CateringOptions.Tnc.ElementType   : logi  NA NA
##  $ Events.CateringOptions.Tnc.ElementVersion: logi  NA NA
##  $ Events.CateringOptions.Address           : Factor w/ 2 levels "New York","Seattle": 1 2
##  $ Events.Action                            : num  1 1
##  $ Events.Volume                            : num  1000 2000
##  $ Host.Identity                            : Factor w/ 1 level "John": 1 1
##  $ Host.Schema                              : logi  NA NA
##  $ Host.ElementId                           : Factor w/ 1 level "101505": 1 1
##  $ Host.ElementType                         : logi  NA NA
##  $ Host.ElementVersion                      : logi  NA NA
##  $ Sender.Identity                          : Factor w/ 1 level "Jane": 1 1
##  $ Sender.Schema                            : logi  NA NA
##  $ Sender.ElementId                         : Factor w/ 1 level "101005": 1 1
##  $ Sender.ElementType                       : logi  NA NA
##  $ Sender.ElementVersion                    : logi  NA NA
##  $ CompletedDate                            : Factor w/ 1 level "/Date(1490112000000)/": 1 1

How to effectively flatten nested lists and data frames into a single data block?

Edit:

More articles: