Pivot a large dataset
I have a CSV that looks something like this (tabs added for readability):
Dimension, Date, Metric
A, Mon, 23
A, Tues, 25
B, Mon, 7
B, Tues, 9
I want to run the same hclust distance analysis that I have done before, but for that I would like (and probably need) this format:
Dimension, Mon, Tues
A, 23, 25
B, 7, 9
I could do this quite easily in Excel with a pivot table. The problem is that I have ~10,000 dimensions and ~1,200 dates, so the original CSV is about 12 million rows by 3 columns. I want ~10,000 rows by ~1,200 columns.
Is there a way to do this conversion in R? The logic of a small Python script to do this is simple, but I'm not sure how it would handle such a large CSV, and I can't imagine this is a new problem. I'd rather not reinvent the wheel!
Thanks for any advice :)
Or simply spread:
library(tidyr)
spread(df, Date, Metric)
  Dimension Mon Tues
1         a  23   25
2         b   7    9
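The snippet above does not show how df was built, so here is a minimal reproducible sketch using the four sample rows from the question. The object names wide and wide2 are only illustrative. It also shows pivot_wider(), which newer tidyr releases (1.0.0 and later) recommend over spread(); treat it as one possible equivalent, not as what the benchmarks below measured.

```r
library(tidyr)

# Toy data in the long format from the question
df <- data.frame(
  Dimension = c("A", "A", "B", "B"),
  Date      = c("Mon", "Tues", "Mon", "Tues"),
  Metric    = c(23, 25, 7, 9),
  stringsAsFactors = FALSE
)

# spread(): one column per distinct Date, filled with Metric
wide <- spread(df, Date, Metric)

# pivot_wider(): the newer tidyr interface for the same reshape
wide2 <- pivot_wider(df, names_from = Date, values_from = Metric)

print(wide)
```

Both calls should give one row per Dimension and one column per Date.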
Benchmarks
library(microbenchmark)
microbenchmark(spread(df, Date, Metric))
Unit: milliseconds
expr min lq mean median uq max neval
spread(df, Date, Metric) 1.461595 1.491919 1.628366 1.566753 1.635374 2.606135 100
microbenchmark(suppressMessages(dcast(dt, Dimension~Date)))
Unit: milliseconds
expr min lq mean median uq max neval
suppressMessages(dcast(dt, Dimension ~ Date)) 3.365726 3.416384 3.770659 3.471678 4.011316 7.235719 100
microbenchmark(suppressMessages(dcast.data.table(dt, Dimension~Date)))
Unit: milliseconds
expr min lq mean median uq max neval
suppressMessages(dcast.data.table(dt, Dimension ~ Date)) 2.375445 2.52218 2.7684 2.614706 2.703075 15.96149 100
And here is data.table without suppressMessages:
Unit: milliseconds
expr min lq mean median uq max neval
dcast.data.table(dt, Dimension ~ Date) 2.667337 3.428127 4.749301 4.0476 5.289618 14.3823 100
And here data.table does not have to guess the value column, because value.var is given explicitly:
microbenchmark(dcast.data.table(dt, Dimension ~ Date, value.var = "Metric"))
Unit: milliseconds
expr min lq mean median uq max neval
dcast.data.table(dt, Dimension ~ Date, value.var = "Metric") 2.077276 2.118707 2.28623 2.168667 2.320579 5.780479 100
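The timings above are in milliseconds, so they presumably come from a small test table and not from the 12-million-row file. For the full-size case, a rough sketch of the whole pipeline with data.table might look like the following. The file path is a stand-in (a tiny temporary CSV is written so the example runs as-is), and the object names are illustrative; this has not been timed on data of the size described in the question.

```r
library(data.table)

# Stand-in for the real CSV: write the question's sample rows to a temp file
path <- tempfile(fileext = ".csv")
writeLines(c("Dimension,Date,Metric",
             "A,Mon,23", "A,Tues,25",
             "B,Mon,7",  "B,Tues,9"), path)

# fread() is data.table's fast CSV reader
dt <- fread(path)

# Long to wide: one row per Dimension, one column per Date
wide <- dcast(dt, Dimension ~ Date, value.var = "Metric")

# dist() wants a numeric matrix; keep Dimension as row names
wide_df <- as.data.frame(wide)
m <- as.matrix(wide_df[, -1, drop = FALSE])
rownames(m) <- wide_df$Dimension

d  <- dist(m)    # Euclidean distance by default
hc <- hclust(d)  # complete linkage by default
```

One thing to keep in mind at full scale: a 10,000 x 1,200 numeric matrix is about 12 million doubles (roughly 96 MB), and dist() on 10,000 rows stores 10,000 * 9,999 / 2 = 49,995,000 distances (roughly 400 MB), so the clustering step is likely to need more memory than the reshape itself.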