Convert width to long with frequency column

Question

Convert width to long with frequency column

I am trying to convert my data.frame from a large table to a long table with a frequency column.

data("UCBAdmissions")
ucb_admit <- as.data.frame(UCBAdmissions)
ucb_admit
   Admit    Gender Dept Freq
1  Admitted   Male    A  512
2  Rejected   Male    A  313
3  Admitted Female    A   89
4  Rejected Female    A   19
...

And I want to collect this data (a melt-like tidyr package with a reshape), but use Freq to specify how many times the line should be repeated.

So my target data looks something like this:

     Admit      Gender Dept
1    Admitted   Male    A
2    Admitted   Male    A
3    Admitted   Male    A
4    Admitted   Male    A
5    Admitted   Male    A
6    Admitted   Male    A
...
4523 Rejected Female    F
4524 Rejected Female    F
4525 Rejected Female    F
4526 Rejected Female    F

I would like to use tidyr :: gather () to do this, however my results are incorrect as I am not sure what if / how to include the Freq column?

thank

+3

r dataframe tidyr

user27612 01 Aug 17 at 19:11

source to share

2 answers

Here is a solution using dplyr

, tidyr

and purrr

.

library(dplyr)
library(tidyr)
library(purrr)

ucb_admit2 <- ucb_admit %>%
  mutate(Freq = map2(1, Freq, `:`)) %>%
  unnest() %>%
  select(-Freq)

Or use this similar approach, which only needs functions from dplyr

and tidyr

.

ucb_admit2 <- ucb_admit %>%
  rowwise() %>%
  mutate(Freq = list(seq(1, Freq))) %>%
  ungroup() %>%
  unnest() %>%
  select(-Freq)

They both take the same strategy: create a list column and then unnest

it.

We can also use the function separate_row

from tidyr

to achieve this task.

ucb_admit2 <- ucb_admit %>%
  rowwise() %>%
  mutate(Freq = paste(seq(1, Freq), collapse = ",")) %>%
  ungroup() %>%
  separate_rows(Freq) %>%
  select(-Freq)

Benchmarking

I have compared the two methods suggested by eipi10 and the three methods I suggested using the following microbenchmarking

. The result shows that the basic R approach is the fastest, followed by the dplyr

repeat and slice approach . So, I guess if there are no other considerations like code readability don't need to use tidyr

or purrr

for this question.

library(microbenchmark)

library(microbenchmark)


microbenchmark(m1 = (ucb_admit[rep(1:nrow(ucb_admit), 
                                   ucb_admit$Freq), 
                               -grep("Freq", names(ucb_admit))]),
               m2 = (ucb_admit %>% 
                       slice(rep(1:n(), Freq)) %>% 
                       select(-Freq)),
               m3 = (ucb_admit %>%
                       mutate(Freq = map2(1, Freq, `:`)) %>%
                       unnest() %>%
                       select(-Freq)),
               m4 = (ucb_admit %>%
                       rowwise() %>%
                       mutate(Freq = list(seq(1, Freq))) %>%
                       ungroup() %>%
                       unnest() %>%
                       select(-Freq)),
               m5 = (ucb_admit %>%
                       rowwise() %>%
                       mutate(Freq = paste(seq(1, Freq), collapse = ",")) %>%
                       ungroup() %>%
                       separate_rows(Freq) %>%
                       select(-Freq)))

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
   m1  3.455026  3.585888  4.295322  3.845367  4.147506   8.60228   100
   m2  6.888881  7.541269  8.849527  8.031040  9.428189  15.53991   100
   m3 23.252458 24.959122 29.706875 27.414396 32.506805  61.00691   100
   m4 20.033499 21.914645 25.888155 23.611688 27.310155 101.15088   100
   m5 28.972557 31.127297 35.976468 32.652422 37.669135  64.43884   100

+2

www 01 Aug 17 at 19:30

source to share

eipi10 · Accepted Answer · 2017-08-01T19:22:07+0000

This does not seem like a job for gather

, since the data is aggregated rather than wide. You can "disaggregate" the data by indexing, repeating the row indices Freq

once for each row. Below are the methods using base R and dplyr

.

library(dplyr)

# Base R
ucb_admit_disagg = ucb_admit[rep(1:nrow(ucb_admit), ucb_admit$Freq), 
                             -grep("Freq", names(ucb_admit))]

# dplyr
ucb_admit_disagg = ucb_admit %>% 
  slice(rep(1:n(), Freq)) %>% 
  select(-Freq)

Here is part of the data frame. I've added ellipses to the output to mark breaks in the sequence of lines.

ucb_admit_disagg[c(1:6, 510:514, 4523:4526), ]

          Admit Gender Dept
1      Admitted   Male    A
1.1    Admitted   Male    A
1.2    Admitted   Male    A
1.3    Admitted   Male    A
1.4    Admitted   Male    A
1.5    Admitted   Male    A
...
1.509  Admitted   Male    A
1.510  Admitted   Male    A
1.511  Admitted   Male    A
2      Rejected   Male    A
2.1    Rejected   Male    A
...
24.313 Rejected Female    F
24.314 Rejected Female    F
24.315 Rejected Female    F
24.316 Rejected Female    F

Convert width to long with frequency column

Benchmarking

More articles: