How do I add seasonal dummy variables?
I would like to add seasonality dummies to my R data.table based on quarters. I've looked at several examples, but I haven't been able to solve this problem yet. My knowledge of R is limited, so I was wondering if you could get me on the right track.
My data.table looks like this:
Year_week artist_id number_of_events number_of_streams
1: 16/30 8296 1 957892
2: 16/33 8296 6 882282
3: 16/34 8296 5 926037
4: 16/35 8296 2 952704
5: 15/37 17879 1 89515
6: 16/22 22690 2 119653
I would like to have a format like this:
Year_week artist_id number_of_events number_of_streams Q2 Q3 Q4
1: 16/50 8296 1 957892 0 0 1
Two approaches:
1) Using dcast, cut and sub:
dcast(DT[, Q := cut(as.integer(sub('.*/','',Year_week)),
breaks = c(0,13,26,39,53),
labels = paste0('Q',1:4))],
Year_week + artist_id + number_of_events + number_of_streams ~ Q,
value.var = 'Q',
drop = c(TRUE,FALSE),
fun.aggregate = length)
gives:
Year_week artist_id number_of_events number_of_streams Q1 Q2 Q3 Q4
1: 15/37 17879 1 89515 0 0 1 0
2: 16/22 22690 2 119653 0 1 0 0
3: 16/30 8296 1 957892 0 0 1 0
4: 16/33 8296 6 882282 0 0 1 0
5: 16/34 8296 5 926037 0 0 1 0
6: 16/35 8296 2 952704 0 0 1 0
What this does:
- as.integer(sub('.*/','',Year_week)) extracts the week number from the column Year_week.
- With cut, the week numbers are split into quarters with appropriate labels (see also ?cut).
- With dcast, the quarter column is converted to wide format using an aggregation function (length). Setting drop = c(TRUE,FALSE) in dcast ensures that all quarters are included as columns, even those that do not occur in the data (Q1 and Q4 here).
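The first two steps can be tried on their own before reshaping. The following is a minimal sketch in base R; the vector year_week is made up for illustration and simply mimics the format of the Year_week column.

```r
# Illustrative week strings in the same "yy/ww" form as the Year_week column
year_week <- c("16/30", "16/33", "15/37", "16/22", "16/50")

# Step 1: drop everything up to and including the slash, keep the week number
week <- as.integer(sub(".*/", "", year_week))

# Step 2: bin the weeks into quarters.
# cut() uses right-closed intervals by default: (0,13], (13,26], (26,39], (39,53]
quarter <- cut(week,
               breaks = c(0, 13, 26, 39, 53),
               labels = paste0("Q", 1:4))

print(week)     # 30 33 37 22 50
print(quarter)  # Q3 Q3 Q3 Q2 Q4, with levels Q1 Q2 Q3 Q4
```

Because the labels are factor levels, Q1 remains a level even though no week in this vector falls into it, which is what lets dcast create an all-zero column for it.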
Notes:
- The Q column is a factor with its levels in quarter order (Q1 to Q4), so you can use it to sort and filter your data.
- Depending on what you need the dummy columns for, you don't always need them. If you only want to group or filter by quarter, you can work with the variable Q directly.
- However, some statistical tests require dummy variables (which justifies the dcast step).
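To illustrate working with Q directly, here is a small sketch that groups and filters by quarter without creating any dummy columns. It rebuilds part of the example data inline so that it runs on its own; the names streams_by_quarter and q3_rows are only chosen for this illustration.

```r
library(data.table)

# Same rows as the example data, limited to the columns needed here
DT <- data.table(
  Year_week         = c("16/30", "16/33", "16/34", "16/35", "15/37", "16/22"),
  artist_id         = c(8296, 8296, 8296, 8296, 17879, 22690),
  number_of_streams = c(957892, 882282, 926037, 952704, 89515, 119653)
)

# Add the quarter factor exactly as in the approaches above
DT[, Q := cut(as.integer(sub(".*/", "", Year_week)),
              breaks = c(0, 13, 26, 39, 53),
              labels = paste0("Q", 1:4))]

# Grouping: total streams per quarter
streams_by_quarter <- DT[, .(total_streams = sum(number_of_streams)), by = Q]

# Filtering: only the rows that fall in the third quarter
q3_rows <- DT[Q == "Q3"]

print(streams_by_quarter)
print(q3_rows)
```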
2) Using cut, sub and lapply:
DT[, Q := cut(as.integer(sub('.*/','',Year_week)),
breaks = c(0,13,26,39,53),
labels = paste0('Q',1:4))
][, paste0('Q',1:4) := lapply(paste0('Q',1:4), function(q) as.integer(q == Q))][]
which gives a similar result. Instead of reshaping with dcast, you simply check, for each quarter label, whether it matches the value in column Q.
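The layout asked for in the question has only Q2, Q3 and Q4, with Q1 as the implicit reference quarter. One possible way to get that with the second approach is to loop over those three labels only and drop the helper column afterwards. This is a sketch on made-up weeks, and dummy_cols is a name introduced here for illustration.

```r
library(data.table)

# Illustrative data with one week in each quarter
DT <- data.table(Year_week = c("16/30", "16/22", "16/50", "16/05"))

DT[, Q := cut(as.integer(sub(".*/", "", Year_week)),
              breaks = c(0, 13, 26, 39, 53),
              labels = paste0("Q", 1:4))]

# Create dummies for Q2..Q4 only; a Q1 row is then all zeros
dummy_cols <- paste0("Q", 2:4)
DT[, (dummy_cols) := lapply(dummy_cols, function(q) as.integer(Q == q))]

# Remove the helper factor once the dummies exist
DT[, Q := NULL]

print(DT)
```

Leaving out one quarter avoids perfectly collinear columns when the dummies are used together with an intercept in a regression.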
Data used:
DT <- fread(' Year_week artist_id number_of_events number_of_streams
16/30 8296 1 957892
16/33 8296 6 882282
16/34 8296 5 926037
16/35 8296 2 952704
15/37 17879 1 89515
16/22 22690 2 119653')
I assumed that Year_week is the column from which the date of each record can be extracted.
library(data.table)
# Return one 0/1 indicator column per quarter for the week numbers in x
whichQuart <- function(x){
  data.frame(+(x <= 13),
             +(x > 13 & x <= 26),
             +(x > 26 & x <= 39),
             +(x > 39 & x <= 53))  # a year can have a week 53
}
dt <- setDT(read.table(text="Year_week artist_id number_of_events number_of_streams
1: 16/30 8296 1 957892
2: 16/33 8296 6 882282
3: 16/34 8296 5 926037
4: 16/35 8296 2 952704
5: 15/37 17879 1 89515
6: 16/22 22690 2 119653", header=TRUE, stringsAsFactors=FALSE))
# Take the part after the slash as the week number of each row
dt[, week := as.integer(sub(".*/", "", Year_week))]
dt[, c("Q1", "Q2", "Q3", "Q4") := whichQuart(week)]
dt[]
# Year_week artist_id number_of_events number_of_streams week Q1 Q2 Q3 Q4
#1: 16/30 8296 1 957892 30 0 0 1 0
#2: 16/33 8296 6 882282 33 0 0 1 0
#3: 16/34 8296 5 926037 34 0 0 1 0
#4: 16/35 8296 2 952704 35 0 0 1 0
#5: 15/37 17879 1 89515 37 0 0 1 0
#6: 16/22 22690 2 119653 22 0 1 0 0