Convert width to long with frequency column
I am trying to convert my data.frame from a large table to a long table with a frequency column.
data("UCBAdmissions")
ucb_admit <- as.data.frame(UCBAdmissions)
ucb_admit
Admit Gender Dept Freq
1 Admitted Male A 512
2 Rejected Male A 313
3 Admitted Female A 89
4 Rejected Female A 19
...
And I want to collect this data (a melt-like tidyr package with a reshape), but use Freq to specify how many times the line should be repeated.
So my target data looks something like this:
Admit Gender Dept
1 Admitted Male A
2 Admitted Male A
3 Admitted Male A
4 Admitted Male A
5 Admitted Male A
6 Admitted Male A
...
4523 Rejected Female F
4524 Rejected Female F
4525 Rejected Female F
4526 Rejected Female F
I would like to use tidyr :: gather () to do this, however my results are incorrect as I am not sure what if / how to include the Freq column?
thank
source to share
This does not seem like a job for gather
, since the data is aggregated rather than wide. You can "disaggregate" the data by indexing, repeating the row indices Freq
once for each row. Below are the methods using base R and dplyr
.
library(dplyr)
# Base R
ucb_admit_disagg = ucb_admit[rep(1:nrow(ucb_admit), ucb_admit$Freq),
-grep("Freq", names(ucb_admit))]
# dplyr
ucb_admit_disagg = ucb_admit %>%
slice(rep(1:n(), Freq)) %>%
select(-Freq)
Here is part of the data frame. I've added ellipses to the output to mark breaks in the sequence of lines.
ucb_admit_disagg[c(1:6, 510:514, 4523:4526), ]
Admit Gender Dept
1 Admitted Male A
1.1 Admitted Male A
1.2 Admitted Male A
1.3 Admitted Male A
1.4 Admitted Male A
1.5 Admitted Male A
...
1.509 Admitted Male A
1.510 Admitted Male A
1.511 Admitted Male A
2 Rejected Male A
2.1 Rejected Male A
...
24.313 Rejected Female F
24.314 Rejected Female F
24.315 Rejected Female F
24.316 Rejected Female F
source to share
Here is a solution using dplyr
, tidyr
and purrr
.
library(dplyr)
library(tidyr)
library(purrr)
ucb_admit2 <- ucb_admit %>%
mutate(Freq = map2(1, Freq, `:`)) %>%
unnest() %>%
select(-Freq)
Or use this similar approach, which only needs functions from dplyr
and tidyr
.
ucb_admit2 <- ucb_admit %>%
rowwise() %>%
mutate(Freq = list(seq(1, Freq))) %>%
ungroup() %>%
unnest() %>%
select(-Freq)
They both take the same strategy: create a list column and then unnest
it.
We can also use the function separate_row
from tidyr
to achieve this task.
ucb_admit2 <- ucb_admit %>%
rowwise() %>%
mutate(Freq = paste(seq(1, Freq), collapse = ",")) %>%
ungroup() %>%
separate_rows(Freq) %>%
select(-Freq)
Benchmarking
I have compared the two methods suggested by eipi10 and the three methods I suggested using the following microbenchmarking
. The result shows that the basic R approach is the fastest, followed by the dplyr
repeat and slice approach . So, I guess if there are no other considerations like code readability don't need to use tidyr
or purrr
for this question.
library(microbenchmark)
library(microbenchmark)
microbenchmark(m1 = (ucb_admit[rep(1:nrow(ucb_admit),
ucb_admit$Freq),
-grep("Freq", names(ucb_admit))]),
m2 = (ucb_admit %>%
slice(rep(1:n(), Freq)) %>%
select(-Freq)),
m3 = (ucb_admit %>%
mutate(Freq = map2(1, Freq, `:`)) %>%
unnest() %>%
select(-Freq)),
m4 = (ucb_admit %>%
rowwise() %>%
mutate(Freq = list(seq(1, Freq))) %>%
ungroup() %>%
unnest() %>%
select(-Freq)),
m5 = (ucb_admit %>%
rowwise() %>%
mutate(Freq = paste(seq(1, Freq), collapse = ",")) %>%
ungroup() %>%
separate_rows(Freq) %>%
select(-Freq)))
Unit: milliseconds
expr min lq mean median uq max neval
m1 3.455026 3.585888 4.295322 3.845367 4.147506 8.60228 100
m2 6.888881 7.541269 8.849527 8.031040 9.428189 15.53991 100
m3 23.252458 24.959122 29.706875 27.414396 32.506805 61.00691 100
m4 20.033499 21.914645 25.888155 23.611688 27.310155 101.15088 100
m5 28.972557 31.127297 35.976468 32.652422 37.669135 64.43884 100
source to share