How to effectively flatten nested lists and data frames into a single data block?
I have some data that is formatted in a way that is difficult to use, so I am trying to flatten this out. minimal reproducible example here .
> str(sampleData)
List of 4
$ Events :'data.frame': 2 obs. of 3 variables:
..$ CateringOptions:List of 2
.. ..$ :'data.frame': 1 obs. of 3 variables:
.. .. ..$ Agreed : logi TRUE
.. .. ..$ Tnc :'data.frame': 1 obs. of 5 variables:
.. .. .. ..$ Identity : chr "SpicyOWing"
.. .. .. ..$ Schema : logi NA
.. .. .. ..$ ElementId : chr "105031"
.. .. .. ..$ ElementType : logi NA
.. .. .. ..$ ElementVersion: logi NA
.. .. ..$ Address: chr "New York"
.. ..$ :'data.frame': 1 obs. of 3 variables:
.. .. ..$ Agreed : logi TRUE
.. .. ..$ Tnc :'data.frame': 1 obs. of 5 variables:
.. .. .. ..$ Identity : chr "BaconEggs"
.. .. .. ..$ Schema : logi NA
.. .. .. ..$ ElementId : chr "105032"
.. .. .. ..$ ElementType : logi NA
.. .. .. ..$ ElementVersion: logi NA
.. .. ..$ Address: chr "Seattle"
..$ Action : num [1:2] 1 1
..$ Volume : num [1:2] 1000 2000
$ Host :List of 5
..$ Identity : chr "John"
..$ Schema : logi NA
..$ ElementId : chr "101505"
..$ ElementType : logi NA
..$ ElementVersion: logi NA
$ Sender :List of 5
..$ Identity : chr "Jane"
..$ Schema : logi NA
..$ ElementId : chr "101005"
..$ ElementType : logi NA
..$ ElementVersion: logi NA
$ CompletedDate: chr "/Date(1490112000000)/"
Expected
> expectedOutcome
Events.CateringOptions.Agreed Events.CateringOptions.Tnc.Identity Events.CateringOptions.Tnc.Schema Events.CateringOptions.Tnc.ElementId
1 NA SpicyOWing TRUE 105031
2 NA BaconEggs TRUE 105032
Events.CateringOptions.Tnc.ElementType Events.CateringOptions.Tnc.ElementVersion Events.CateringOptions.Address Events.Action Events.Volume Host.Identity
1 NA NA New York 1 1000 John
2 NA NA Seattle 1 2000 John
Host.Schema Host.ElementId Host.ElementType Host.ElementVersion Sender.Identity Sender.Schema Sender.ElementId Sender.ElementType Sender.ElementVersion
1 NA 101505 NA NA Jane NA 101005 NA NA
2 NA 101505 NA NA Jane NA 101005 NA NA
CompletedDate
1 /Date(1490112000000)/
2 /Date(1490112000000)/
Check function
check<-function(li){
areDF<-sapply(1:length(li), function(i) class(li[[i]]) == "data.frame")
areList<-sapply(1:length(li), function(i) class(li[[i]]) == "list")
tmp1 <- NULL
tmp2 <- NULL
if(any(areDF)){
for(j in which(areDF)){
columns <- jsonlite::flatten(li[[j]])
li[[j]] <- check(columns)
}
tmp1<-plyr::rbind.fill(li[areDF])
#return(tmp1)
}
if(any(areList)){
for(j in which(areList)){
li[[j]]<-check(li[[j]])
}
tmp2<-do.call(cbind,li)
#return(tmp2)
}
if(!is.null(tmp1) & !is.null(tmp2)){
return (cbind(tmp1,tmp2))
}
else if(!is.null(tmp1)){
return (tmp1)
}
else if(!is.null(tmp2)){
return (tmp2)
}
return(li)
}
results
> str(check(sampleData))
'data.frame': 2 obs. of 29 variables:
$ CateringOptions.Agreed : logi TRUE TRUE
$ CateringOptions.Address : chr "New York" "Seattle"
$ CateringOptions.Tnc.Identity : chr "SpicyOWing" "BaconEggs"
$ CateringOptions.Tnc.Schema : logi NA NA
$ CateringOptions.Tnc.ElementId : chr "105031" "105032"
$ CateringOptions.Tnc.ElementType : logi NA NA
$ CateringOptions.Tnc.ElementVersion : logi NA NA
$ Action : num 1 1
$ Volume : num 1000 2000
$ Events.CateringOptions.Agreed : logi TRUE TRUE
$ Events.CateringOptions.Address : chr "New York" "Seattle"
$ Events.CateringOptions.Tnc.Identity : chr "SpicyOWing" "BaconEggs"
$ Events.CateringOptions.Tnc.Schema : logi NA NA
$ Events.CateringOptions.Tnc.ElementId : chr "105031" "105032"
$ Events.CateringOptions.Tnc.ElementType : logi NA NA
$ Events.CateringOptions.Tnc.ElementVersion: logi NA NA
$ Events.Action : num 1 1
$ Events.Volume : num 1000 2000
$ Host.Identity : Factor w/ 1 level "John": 1 1
$ Host.Schema : logi NA NA
$ Host.ElementId : Factor w/ 1 level "101505": 1 1
$ Host.ElementType : logi NA NA
$ Host.ElementVersion : logi NA NA
$ Sender.Identity : Factor w/ 1 level "Jane": 1 1
$ Sender.Schema : logi NA NA
$ Sender.ElementId : Factor w/ 1 level "101005": 1 1
$ Sender.ElementType : logi NA NA
$ Sender.ElementVersion : logi NA NA
$ CompletedDate : Factor w/ 1 level "/Date(1490112000000)/": 1 1
I almost don't have it, but the nested file structure is being fooled. Also, my code is taking quite a long time. Does anyone know how I can iron this out?
Edit:
I added my solution at the end to the gist
source to share
Here's my example using purrr
.
The idea is similar to yours, only with a different syntax: the flatten()
most nested data frames, then rbind()
theirs.
If I understand your code correctly, mine is slightly different at the end as I will try to get a more " jsonlite::flatten
-friendly" structure to apply it again to the end result:
library(jsonlite)
library(purrr)
res <-
sampleData %>%
modify_if(
is.list,
.f = ~ modify_if(
.x,
.p = function(x) all(sapply(x, is.data.frame)),
.f = ~ do.call("rbind", lapply(.x, jsonlite::flatten))
)
) %>%
as.data.frame() %>%
jsonlite::flatten()
str(res)
# 'data.frame': 2 obs. of 20 variables:
# $ Events.Action : num 1 1
# $ Events.Volume : num 1000 2000
# $ Host.Identity : chr "John" "John"
# $ Host.Schema : logi NA NA
# $ Host.ElementId : chr "101505" "101505"
# $ Host.ElementType : logi NA NA
# $ Host.ElementVersion : logi NA NA
# $ Sender.Identity : chr "Jane" "Jane"
# $ Sender.Schema : logi NA NA
# $ Sender.ElementId : chr "101005" "101005"
# $ Sender.ElementType : logi NA NA
# $ Sender.ElementVersion : logi NA NA
# $ CompletedDate : chr "/Date(1490112000000)/" "/Date(1490112000000)/"
# $ Events.CateringOptions.Agreed : logi TRUE TRUE
# $ Events.CateringOptions.Address : chr "New York" "Seattle"
# $ Events.CateringOptions.Tnc.Identity : chr "SpicyOWing" "BaconEggs"
# $ Events.CateringOptions.Tnc.Schema : logi NA NA
# $ Events.CateringOptions.Tnc.ElementId : chr "105031" "105032"
# $ Events.CateringOptions.Tnc.ElementType : logi NA NA
# $ Events.CateringOptions.Tnc.ElementVersion: logi NA NA
I have one inconsistency with yours expectedOutcome
, but if possible it might be on your side:
all.equal(expectedOutcome[sort(names(expectedOutcome))], res[sort(names(res))])
# [1] "Component "Events.CateringOptions.Agreed": 'is.NA' value mismatch: 0 in current 2 in target"
source to share
Not sure if this simplifies your problem, but with the sample you shared it works. Basically, if the column is not a vector, when you do data.frame(your_list)
, that's the unlist
data that does matrix
.
FLAT <- function(inlist) {
A <- data.frame(inlist)
out <- lapply(A, function(y) {
if (is.list(y)) {
y <- unlist(y)
m <- matrix(y, nrow(A), byrow = TRUE, dimnames = list(NULL, unique(names(y))))
y <- data.frame(m, stringsAsFactors = FALSE)
y[] <- lapply(y, type.convert)
}
y
})
do.call(cbind, out)
}
FLAT(sampleData)
Here, str
according to your approximate data:
str(FLAT(sampleData))
## 'data.frame': 2 obs. of 20 variables:
## $ Events.CateringOptions.Agreed : logi TRUE TRUE
## $ Events.CateringOptions.Tnc.Identity : Factor w/ 2 levels "BaconEggs","SpicyOWing": 2 1
## $ Events.CateringOptions.Tnc.Schema : logi NA NA
## $ Events.CateringOptions.Tnc.ElementId : int 105031 105032
## $ Events.CateringOptions.Tnc.ElementType : logi NA NA
## $ Events.CateringOptions.Tnc.ElementVersion: logi NA NA
## $ Events.CateringOptions.Address : Factor w/ 2 levels "New York","Seattle": 1 2
## $ Events.Action : num 1 1
## $ Events.Volume : num 1000 2000
## $ Host.Identity : Factor w/ 1 level "John": 1 1
## $ Host.Schema : logi NA NA
## $ Host.ElementId : Factor w/ 1 level "101505": 1 1
## $ Host.ElementType : logi NA NA
## $ Host.ElementVersion : logi NA NA
## $ Sender.Identity : Factor w/ 1 level "Jane": 1 1
## $ Sender.Schema : logi NA NA
## $ Sender.ElementId : Factor w/ 1 level "101005": 1 1
## $ Sender.ElementType : logi NA NA
## $ Sender.ElementVersion : logi NA NA
## $ CompletedDate : Factor w/ 1 level "/Date(1490112000000)/": 1 1
source to share