Converting from wide to long without sorting columns

I want to convert a dataframe from wide format to long format.

Here's a toy example:

mydata <- data.frame(ID=1:5, ZA_1=1:5, 
            ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5),CC_7=6:2)

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1    1    5    3    6    6
2    2    4    3    6    5
3    3    3    3    6    4
4    4    2    3    6    3
5    5    1    3    6    2

      

Some variables stay the same (only ID here), and some are converted to long format (all the other variables here, which all end in _1, _2, or _7).

To convert to long format I use data.table's melt and dcast, in a generic way that detects the variables automatically. Other solutions are also welcome.

library(data.table)
setDT(mydata)
idvars =  grep("_[1-7]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)  
nuevo <- dcast(
  temp[, `:=`(var = sub("_[1-7]$", '', variable),
  measure = sub('.*_', '', variable), variable = NULL)],  
  ... ~ var, value.var='value') 



ID measure BB  CC  ZA
 1      1   3  NA   1
 1      2   6  NA   5
 1      7  NA   6  NA
 2      1   3  NA   2
 2      2   6  NA   4
 2      7  NA   5  NA
 3      1   3  NA   3
 3      2   6  NA   3
 3      7  NA   4  NA
 4      1   3  NA   4
 4      2   6  NA   2
 4      7  NA   3  NA
 5      1   3  NA   5
 5      2   6  NA   1
 5      7  NA   2  NA

      

As you can see, the columns are ordered alphabetically, but I would prefer to keep the original order as far as possible, for example by using the order in which each variable first appeared.

ID ZA_1 ZA_2 BB_1 BB_2 CC_7

Should be

ID ZA BB CC

      

I don't mind whether the idvars columns all come at the beginning or stay in their original positions.

ID ZA_1 ZA_2 TEMP BB_1 BB_2 CC_2 CC_1

will be

ID ZA TEMP BB CC

      

or

ID TEMP ZA BB CC

      

I prefer the latter option.
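
The two layouts can be stated precisely with a small base R sketch that works on the column names only. It assumes the same `_[1-7]$` suffix pattern used above; `in_place` and `ids_first` are illustrative names, not part of any package.

```r
# Column names from the second example in the question
nms <- c("ID", "ZA_1", "ZA_2", "TEMP", "BB_1", "BB_2", "CC_2", "CC_1")

is_measure <- grepl("_[1-7]$", nms)    # same suffix pattern as in the question
prefixes   <- sub("_[1-7]$", "", nms)  # "ZA_1" -> "ZA"; id names are unchanged

# Option 1: every column keeps the position where its name first appeared
in_place <- unique(prefixes)

# Option 2 (the preferred one): id columns first, then prefixes in first-seen order
ids_first <- c(nms[!is_measure], unique(prefixes[is_measure]))
```

A vector like `ids_first` could then be handed to data.table's setcolorder() after the reshape; the long-format `measure` column is left out here, as in the layouts above.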

Another problem is that everything is converted to character.

+2




5 answers


The OP updated the answer to his own question, complaining about the memory consumption of the intermediate melt() step when half of the columns are id.vars. He asked for a direct way in data.table to do this without creating giant intermediate steps.

Well, data.table already has this ability: it's called a join.

Given the sample data from the question, the whole operation can be implemented with less memory by reshaping with only one id.var and then joining the reshaped result with the original data.table:

setDT(mydata)

# add unique row number to join on later 
# (leave `ID` col as placeholder for all other id.vars)
mydata[, rn := seq_len(.N)]

# define columns to be reshaped
measure_cols <- stringr::str_subset(names(mydata), "_\\d$")

# melt with only one id.vars column
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)

# split column names of measure.vars
# Note that "variable" is reused to save memory 
molten[, c("variable", "measure") := tstrsplit(variable, "_")]

# coerce names to factors in the same order as the columns appeared in mydata
molten[, variable := forcats::fct_inorder(variable)]

# remove columns no longer needed in mydata _before_ joining to save memory
mydata[, (measure_cols) := NULL]

# final dcast and right join
result <- mydata[dcast(molten, ... ~ variable), on = "rn"]
result
#    ID rn measure ZA BB CC
# 1:  1  1       1  1  3 NA
# 2:  1  1       2  5  6 NA
# 3:  1  1       7 NA NA  6
# 4:  2  2       1  2  3 NA
# 5:  2  2       2  4  6 NA
# 6:  2  2       7 NA NA  5
# 7:  3  3       1  3  3 NA
# 8:  3  3       2  3  6 NA
# 9:  3  3       7 NA NA  4
#10:  4  4       1  4  3 NA
#11:  4  4       2  2  6 NA
#12:  4  4       7 NA NA  3
#13:  5  5       1  5  3 NA
#14:  5  5       2  1  6 NA
#15:  5  5       7 NA NA  2

      

Finally, you can remove the row number if you no longer need it, with result[, rn := NULL].

In addition, you can remove the intermediate molten with rm(molten).

We started with a data.table of 1 id column, 5 measure columns, and 5 rows. The reshaped result has 1 id column, 3 measure columns, and 15 rows, so the amount of data stored in the id columns has effectively tripled. However, the intermediate step needed only one id.var, rn.

EDIT: If memory consumption is critical, it might be worth keeping the id.vars and the measure.vars in two separate data.tables and joining only the required id.var columns with the measure.vars on demand.

Note that the measure.vars parameter of melt() accepts the special function patterns(). With it, the call to melt() could also have been written as

molten <- melt(mydata, id.vars = "rn", measure.vars = patterns("_\\d$"))

      

+1




You can melt multiple columns at the same time by passing a list of column names to the measure = argument. One way to do this in a scalable way:

  • Extract the column names and the corresponding first two letters:

    measurevars <- names(mydata)[grepl("_[1-9]$",names(mydata))]
    groups <- gsub("_[1-9]$","",measurevars)
    
          

  • Turn groups into a factor and make sure its levels are not ordered alphabetically. We will use it in the next step to build a list with the correct structure.

    split_on <- factor(groups, levels = unique(groups))
    
          

  • Create a list from measurevars with split(), and create a vector for the value.name = argument of melt().

    measure_list <- split(measurevars, split_on)
    measurenames <- unique(groups)
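
To see why the factor step matters, here is a minimal base R sketch using the toy column names from the question. It compares split() on a plain character vector with split() on a factor whose levels follow first appearance; `alphabetical` and `first_seen` are illustrative names.

```r
measurevars <- c("ZA_1", "ZA_2", "BB_1", "BB_2", "CC_7")
groups <- sub("_[1-9]$", "", measurevars)

# A character grouping is turned into a factor with sorted levels
alphabetical <- names(split(measurevars, groups))

# Explicit levels in order of first appearance keep the original column order
first_seen <- names(split(measurevars, factor(groups, levels = unique(groups))))

alphabetical
first_seen
```

The order of the list elements is what determines the order of the value columns that melt() produces from `measure_list`.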
    
          



Putting it all together:

melt(setDT(mydata), 
     measure = measure_list, 
     value.name = measurenames,
     variable.name = "measure")
#    ID measure ZA BB
# 1:  1       1  1  3
# 2:  2       1  2  3
# 3:  3       1  3  3
# 4:  4       1  4  3
# 5:  5       1  5  3
# 6:  1       2  5  6
# 7:  2       2  4  6
# 8:  3       2  3  6
# 9:  4       2  2  6
#10:  5       2  1  6

      

+2




Here is a method using base R's split.default and do.call.

# split the non-ID variables into groups based on their name suffix
myList <- split.default(mydata[-1], gsub(".*_(\\d)$", "\\1", names(mydata[-1])))

# append variables by row after regularizing the variable names, then cbind ID
cbind(mydata[1],
      do.call(rbind, lapply(myList, function(x) setNames(x, gsub("_\\d$", "", names(x))))))
    ID ZA BB
1.1  1  1  3
1.2  2  2  3
1.3  3  3  3
1.4  4  4  3
1.5  5  5  3
2.1  1  5  6
2.2  2  4  6
2.3  3  3  6
2.4  4  2  6
2.5  5  1  6

      

The first line splits the data.frame variables (minus ID) into a list of groups that share the final character of the variable name. This criterion is determined using gsub. The second line uses do.call to call rbind on the elements of this list, after each has been modified with setNames so that the underscore and final digit are removed from its names. Finally, cbind attaches the ID to the resulting data.frame.

Please note that the data should be structured regularly, without missing variables, etc.
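
A quick way to check that requirement before calling rbind is to compare the set of name prefixes found under each suffix. The helper below, `is_regular()`, is a hypothetical name introduced for illustration, and the sketch assumes single-digit suffixes as in the pattern above.

```r
# Hypothetical helper: TRUE when every suffix group contains the same prefixes
is_regular <- function(nms) {
  suffix <- sub(".*_", "", nms)     # "ZA_1" -> "1"
  prefix <- sub("_\\d$", "", nms)   # "ZA_1" -> "ZA"
  groups <- lapply(split(prefix, suffix), sort)
  length(unique(unname(groups))) == 1
}

is_regular(c("ZA_1", "ZA_2", "BB_1", "BB_2"))          # regular layout
is_regular(c("ZA_1", "ZA_2", "BB_1", "BB_2", "CC_7"))  # CC has no _1 or _2 partner
```

When the check fails, as it does once CC_7 is present, the groups produced by split.default have different columns and rbind cannot stack them.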

+1




Alternative approach with data.table:

melt(mydata, id = 'ID')[, c("variable", "measure") := tstrsplit(variable, '_')
                        ][, variable := factor(variable, levels = unique(variable))
                          ][, dcast(.SD, ID + measure ~ variable, value.var = 'value')]

      

which gives:

    ID measure ZA BB CC
 1:  1       1  1  3 NA
 2:  1       2  5  6 NA
 3:  1       7 NA NA  6
 4:  2       1  2  3 NA
 5:  2       2  4  6 NA
 6:  2       7 NA NA  5
 7:  3       1  3  3 NA
 8:  3       2  3  6 NA
 9:  3       7 NA NA  4
10:  4       1  4  3 NA
11:  4       2  2  6 NA
12:  4       7 NA NA  3
13:  5       1  5  3 NA
14:  5       2  1  6 NA
15:  5       7 NA NA  2
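
On the character-conversion concern raised in the question: in this approach `measure` comes out of tstrsplit() as character. A minimal sketch, assuming the suffixes are purely numeric, of using tstrsplit()'s type.convert argument on the toy column names (`parts_chr` and `parts_typed` are illustrative names):

```r
library(data.table)

variable <- c("ZA_1", "ZA_2", "BB_1", "BB_2", "CC_7")

# Default behaviour: both pieces are returned as character vectors
parts_chr <- tstrsplit(variable, "_", fixed = TRUE)

# With type.convert = TRUE the numeric-looking piece becomes integer
parts_typed <- tstrsplit(variable, "_", fixed = TRUE, type.convert = TRUE)

class(parts_chr[[2]])
class(parts_typed[[2]])
```

The same argument could be added to the tstrsplit() call in the chain above, so that `measure` is numeric in the final result.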

      

+1




Finally I found a way by changing my initial solution:

mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1,
BB_2001=rep(3,5),BB_2002=rep(6,5),CC_2007=6:2)

idvars =  grep("_20[0-9][0-9]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)  
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable), 
measure = sub('.*_', '', variable), variable = NULL)]  
temp[,var:=factor(var, levels=unique(var))]
dcast( temp,   ... ~ var, value.var='value' )

      

And it gives the correct result. However, this solution requires a lot of memory.

The trick was to convert the var variable to a factor whose levels specify the order I want, as mtoto did. mtoto's solution is nice because it only needs melt rather than melt plus dcast, but it doesn't work in my updated example; it only works when every name has the same number of numeric suffixes.

PS: I went through each step and found that the melt step can be a big problem when working with large data. If you have a data.table of just 100,000 rows by 1,000 columns and half of the columns are id.vars, the molten output is around 50,000,000 x 500, which is too much for the next step to proceed. data.table needs a straightforward way to do this without generating giant intermediate steps.
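
As a rough sketch of the arithmetic behind these figures, the following plain R snippet counts cells rather than bytes. The variable names and the `+ 2` for melt's `variable` and `value` columns are illustrative assumptions. It compares melting with all 500 id.vars against melting with a single key column, as the rn-based answer does.

```r
# Dimensions taken from the example in the text
n_rows    <- 1e5   # rows in the wide table
n_id      <- 500   # id.vars columns
n_measure <- 500   # measure.vars columns

# melt() stacks every measure column, so each wide row yields n_measure long rows
molten_rows <- n_rows * n_measure

# Cells in the molten table: id columns plus melt's "variable" and "value" columns
cells_all_ids <- molten_rows * (n_id + 2)
cells_one_key <- molten_rows * (1 + 2)   # only a single row-number key is carried

format(molten_rows, big.mark = ",")
cells_all_ids / cells_one_key   # how many times more cells the all-ids melt holds
```

The row count is the same in both cases; the saving comes entirely from not repeating the id columns in every molten row.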

0








