How to avoid a slow cycle with a large data set?
Consider this dataset:
> DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
+ country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
+ action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
+ signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
+ ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002))
> DATA
Agreement_number country action signature_date ratification_date
1 Canada signature 2000 NA
1 Canada ratification NA 2001
1 USA signature 2000 NA
1 USA ratification NA 2002
2 Canada signature 2001 NA
2 Canada ratification NA 2001
2 USA signature 2002 NA
2 USA ratification NA 2002
As you can see, half of the lines contain duplicate information. For a small dataset like this, it is very easy to remove duplicates. I could use a function coalesce
( dplyr package ) to get rid of the "action columns" and then remove any unnecessary lines. There are many other ways, though. The end result should look like this:
> DATA <- data.frame( Agreement_number = c(1,1,2,2),
+ country = c("Canada", "USA", "Canada","USA"),
+ signature_date = c(2000,2000,2001,2002),
+ ratification_date = c(2001, 2002, 2001, 2002))
> DATA
Agreement_number country signature_date ratification_date
1 Canada 2000 2001
1 USA 2000 2002
2 Canada 2001 2001
2 USA 2002 2002
The problem is my real dataset is much larger (102000 x 270) and there are many more variables. The real data is also more irregular and there are more missing values. The function coalesce
seems to be very slow. The best loop I could make so far takes up to 5-10 minutes.
Is there an easy way to make it faster? I have a feeling that there must be some function in R for this type of work, but I couldn't find it.
source to share
The OP said his production data is 100k rows x 270 columns and speed is for him. So I suggest using data.table
.
I know Harland also suggested using data.table
and dcast()
, but the solution below is a different approach. It returns the lines in the correct order and copies the line ratification_date
into the signature line. After some cleaning, we get the desired result.
library(data.table)
# coerce to data.table,
# make sure that the actions are ordered properly, not alphabetically
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# order the rows to make sure that signature row and ratification row are
# subsequent for each agreement and country
setorder(DATA, Agreement_number, country, action)
# copy the ratification date from the row below but only within each group
result <- DATA[, ratification_date := shift(ratification_date, type = "lead"),
by = c("Agreement_number", "country")][
# keep only signature rows, remove action column
action == "signature"][, action := NULL]
result
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
Data
The OP noted that his production data contains 270 columns. To simulate this, I added two dummy columns:
set.seed(123L)
DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
dummy1 = rep(sample(4), each = 2L),
dummy2 = rep(sample(LETTERS[1:4]), each = 2L))
Note what is set.seed()
used for repeatable results when sampling.
Agreement_number country action signature_date ratification_date dummy1 dummy2
1 1 Canada signature 2000 NA 2 D
2 1 Canada ratification NA 2001 2 D
3 1 USA signature 2000 NA 3 A
4 1 USA ratification NA 2002 3 A
5 2 Canada signature 2001 NA 1 B
6 2 Canada ratification NA 2001 1 B
7 2 USA signature 2002 NA 4 C
8 2 USA ratification NA 2002 4 C
Addendum: dcast()
with extra columns
Harland suggested using data.table
and dcast()
. Apart from a few other flaws in his answer, it does not handle the extra columns that the OP mentioned.
The approach dcast()
below will return additional columns as well:
library(data.table)
# coerce to data table
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# use already existing column to "coalesce" dates
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]
# dcast from long to wide form, note that ... refers to all other columns
result <- dcast(DATA, Agreement_number + country + ... ~ action,
value.var = "signature_date")
result
Agreement_number country dummy1 dummy2 signature ratification
1: 1 Canada 2 D 2000 2001
2: 1 USA 3 A 2000 2002
3: 2 Canada 1 B 2001 2001
4: 2 USA 4 C 2002 2002
Note that this approach will reorder the columns.
source to share
I think you need dcast
. The version in the library data.table
calls itself "fast" and in my experience it speeds up on large datasets.
First create one column which is either signature_date
or ratification_date
, depending on the action
library(data.table)
setDT(DATA)[, date := ifelse(action == "ratification", ratification_date, signature_date)]
Now release it so that the action is the columns and the value is the date
wide <- dcast(DATA, Agreement_number + country ~ action, value.var = 'date')
Such a wide view looks like
Agreement_number country ratification signature
1 1 Canada 2001 2000
2 1 USA 2002 2000
3 2 Canada 2001 2001
4 2 USA 2002 2002
source to share
Here is another solution data.table
using uwe-block data.frame. It is similar to the uwe-block method, but uses it max
to fold data.
# covert data.frame to data.table and factor variables to character variables
library(data.table)
setDT(DATA)[, names(DATA) := lapply(.SD,
function(x) if(is.factor(x)) as.character(x) else x)]
# collapse data set, by agreement and country. Take max of remaining variables.
DATA[, lapply(.SD, max, na.rm=TRUE), by=.(Agreement_number, country)][,action := NULL][]
lapply
traverses variables not included in the by statement and computes the maximum after removing the NA values. The next link in the chain discards the unneeded action variable, and the final (unnecessary) link outputs the result.
This returns
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
source to share