How to avoid a slow loop with a large data set?

Question

Consider this dataset:

> DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
+                    country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"), 
+                    action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"), 
+                    signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
+                    ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002))
> DATA
  Agreement_number country       action signature_date ratification_date
1                1  Canada    signature           2000                NA
2                1  Canada ratification             NA              2001
3                1     USA    signature           2000                NA
4                1     USA ratification             NA              2002
5                2  Canada    signature           2001                NA
6                2  Canada ratification             NA              2001
7                2     USA    signature           2002                NA
8                2     USA ratification             NA              2002

As you can see, half of the rows contain duplicate information. For a small dataset like this, it is very easy to remove the duplicates. I could use the coalesce() function (dplyr package) to get rid of the "action" column and then remove the unnecessary rows. There are many other ways, though. The end result should look like this:

> DATA <- data.frame( Agreement_number = c(1,1,2,2),
+                     country = c("Canada", "USA", "Canada","USA"), 
+                     signature_date = c(2000,2000,2001,2002),
+                     ratification_date = c(2001, 2002, 2001, 2002))
> DATA
  Agreement_number country signature_date ratification_date
1                1  Canada           2000              2001
2                1     USA           2000              2002
3                2  Canada           2001              2001
4                2     USA           2002              2002

The problem is that my real dataset is much larger (102000 x 270) and has many more variables. The real data is also more irregular and has more missing values. The coalesce() function seems to be very slow. The best loop I could write so far takes 5-10 minutes.

Is there an easy way to make it faster? I have a feeling that there must be some function in R for this type of work, but I couldn't find it.

+3

r dataframe large-data

Benjamin Tremblay-Auger 01 Aug 17 at 12:24 am

3 answers

I think you need dcast(). The version in the data.table package calls itself "fast", and in my experience it is quick on large datasets.

First create one column which holds either signature_date or ratification_date, depending on the action:

library(data.table)
setDT(DATA)[, date := ifelse(action == "ratification", ratification_date, signature_date)]

Now cast it so that the actions become the columns and the date is the value:

wide <- dcast(DATA, Agreement_number + country ~ action, value.var = 'date')

The resulting wide table looks like this:

  Agreement_number country ratification signature
1                1  Canada         2001      2000
2                1     USA         2002      2000
3                2  Canada         2001      2001
4                2     USA         2002      2002
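One caveat, given that the question describes the real data as more irregular: casting only keeps the dates if each combination of agreement, country and action occurs once. To my knowledge, when a combination is repeated, dcast() falls back to an aggregate function (counting rows by default). A minimal sketch of a check to run first, on made-up rows; the names DT and key_cols are only illustrative:

```r
library(data.table)

# Made-up rows: the Canada signature appears twice for agreement 1.
DT <- data.table(
  Agreement_number = c(1, 1, 1),
  country          = c("Canada", "Canada", "Canada"),
  action           = c("signature", "signature", "ratification"),
  date             = c(2000, 2000, 2001)
)

# Columns that must identify a row uniquely before casting (illustrative name).
key_cols <- c("Agreement_number", "country", "action")

# Count rows per key and keep only the keys that occur more than once.
dups <- DT[, .N, by = key_cols][N > 1L]
print(dups)

# One possible fix: keep only the first row for each key, then cast.
DT_unique <- unique(DT, by = key_cols)
wide <- dcast(DT_unique, Agreement_number + country ~ action, value.var = "date")
print(wide)
```

Whether keeping the first row is the right way to resolve a repeated key depends on the data; the point of the sketch is only to make such rows visible before casting.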

+4

HarlandMason 01 Aug 17 at 1:04

Here is another data.table solution, using Uwe's data.frame. It is similar to Uwe's method, but uses max to collapse the data.

# convert data.frame to data.table and factor variables to character variables
library(data.table)
setDT(DATA)[, names(DATA) := lapply(.SD,
                                    function(x) if(is.factor(x)) as.character(x) else x)]

# collapse data set, by agreement and country. Take max of remaining variables.
DATA[, lapply(.SD, max, na.rm=TRUE), by=.(Agreement_number, country)][, action := NULL][]

lapply() iterates over the variables not included in the by statement and computes the maximum of each after removing the NA values. The next link in the chain drops the unneeded action variable, and the final (optional) link prints the result.

This returns

   Agreement_number country signature_date ratification_date dummy1 dummy2
1:                1  Canada           2000              2001      2      D
2:                1     USA           2000              2002      3      A
3:                2  Canada           2001              2001      1      B
4:                2     USA           2002              2002      4      C
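A possible variation for irregular data: for a numeric column, max(x, na.rm = TRUE) returns -Inf with a warning when every value in a group is missing, for example when an agreement was signed but never ratified. The sketch below swaps max for a small helper that returns the first non-missing value instead. The helper first_non_na and the sample rows are hypothetical, not part of the answer above:

```r
library(data.table)

# Hypothetical helper: first non-missing value of a vector, NA if there is none.
first_non_na <- function(x) x[!is.na(x)][1L]

# Made-up rows: agreement 2 has a signature but no ratification date.
DT <- data.table(
  Agreement_number  = c(1, 1, 2, 2),
  country           = c("Canada", "Canada", "USA", "USA"),
  action            = c("signature", "ratification", "signature", "ratification"),
  signature_date    = c(2000, NA, 2002, NA),
  ratification_date = c(NA, 2001, NA, NA)
)

# Collapse to one row per agreement and country; action is left out via .SDcols.
collapsed <- DT[, lapply(.SD, first_non_na),
                by = .(Agreement_number, country),
                .SDcols = c("signature_date", "ratification_date")]
print(collapsed)
```

With this helper, a group without any ratification date keeps NA in that column rather than -Inf.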

+2

lmo 01 Aug 17 at 12:56

Uwe · Accepted Answer · 2017-08-01T07:25:07+0000

The OP said his production data is 100k rows x 270 columns and that speed is a concern for him. So I suggest using data.table.

I know Harland also suggested using data.table and dcast(), but the solution below is a different approach. It puts the rows in the correct order and copies the ratification_date into the signature row. After some clean-up, we get the desired result.

library(data.table)

# coerce to data.table,
# make sure that the actions are ordered properly, not alphabetically
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]

# order the rows to make sure that signature row and ratification row are
# subsequent for each agreement and country
setorder(DATA, Agreement_number, country, action)

# copy the ratification date from the row below but only within each group
result <- DATA[, ratification_date := shift(ratification_date, type = "lead"), 
                by = c("Agreement_number", "country")][
                  # keep only signature rows, remove action column
                  action == "signature"][, action := NULL]
result

   Agreement_number country signature_date ratification_date dummy1 dummy2
1:                1  Canada           2000              2001      2      D
2:                1     USA           2000              2002      3      A
3:                2  Canada           2001              2001      1      B
4:                2     USA           2002              2002      4      C
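The shift() approach relies on each agreement/country group holding exactly one signature row followed by one ratification row. Since the question says the real data is more irregular, it may be worth listing the groups that break this assumption before applying it. A minimal sketch on made-up rows; the names DT, counts and irregular are illustrative:

```r
library(data.table)

# Made-up rows: agreement 2 / USA has a signature row but no ratification row.
DT <- data.table(
  Agreement_number  = c(1, 1, 2),
  country           = c("Canada", "Canada", "USA"),
  action            = c("signature", "ratification", "signature"),
  signature_date    = c(2000, NA, 2002),
  ratification_date = c(NA, 2001, NA)
)

# Number of signature and ratification rows per agreement and country.
counts <- DT[, .(n_signature    = sum(action == "signature"),
                 n_ratification = sum(action == "ratification")),
             by = .(Agreement_number, country)]

# Groups that do not consist of exactly one row of each kind.
irregular <- counts[n_signature != 1L | n_ratification != 1L]
print(irregular)
```

If irregular comes back empty, the assumption holds; otherwise those groups need to be handled separately.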

Data

The OP noted that his production data contains 270 columns. To simulate this, I added two dummy columns:

set.seed(123L)
DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
                   country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"), 
                   action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"), 
                   signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
                   ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
                   dummy1 = rep(sample(4), each = 2L),
                   dummy2 = rep(sample(LETTERS[1:4]), each = 2L))

Note that set.seed() is used for repeatable results when sampling.

  Agreement_number country       action signature_date ratification_date dummy1 dummy2
1                1  Canada    signature           2000                NA      2      D
2                1  Canada ratification             NA              2001      2      D
3                1     USA    signature           2000                NA      3      A
4                1     USA ratification             NA              2002      3      A
5                2  Canada    signature           2001                NA      1      B
6                2  Canada ratification             NA              2001      1      B
7                2     USA    signature           2002                NA      4      C
8                2     USA ratification             NA              2002      4      C
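To get a feel for timing at the row count mentioned in the question (about 102,000 rows), the sample data can be scaled up. The sketch below rests on several assumptions: it simulates perfectly regular data with a single made-up country and no extra columns, so timings on the real 270-column data will differ. The names n_pairs, sig, big and res are illustrative:

```r
library(data.table)

set.seed(1L)
n_pairs <- 51000L   # assumed: 51,000 agreements x 2 actions = 102,000 rows

# Random signature years; ratification is simulated as one year later.
sig <- sample(1990:2010, n_pairs, replace = TRUE)

big <- data.table(
  Agreement_number  = rep(seq_len(n_pairs), each = 2L),
  country           = "Canada",
  action            = rep(c("signature", "ratification"), times = n_pairs),
  # Interleave values with NA: sig1, NA, sig2, NA, ... and NA, rat1, NA, rat2, ...
  signature_date    = as.vector(rbind(sig, NA)),
  ratification_date = as.vector(rbind(NA, sig + 1L))
)

# Time the ordering and shift() steps on the simulated data.
timing <- system.time({
  big[, action := ordered(action, levels = c("signature", "ratification"))]
  setorder(big, Agreement_number, country, action)
  res <- big[, ratification_date := shift(ratification_date, type = "lead"),
             by = .(Agreement_number, country)][
               action == "signature"][, action := NULL]
})
print(timing)
```

The printed timing depends on the machine and the data.table version, so no figure is given here.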

Addendum: dcast() with extra columns

Harland suggested using data.table and dcast(). Apart from a few other flaws in his answer, it does not handle the extra columns that the OP mentioned.

The dcast() approach below returns the additional columns as well:

library(data.table)

# coerce to data table
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]

# use already existing column to "coalesce" dates
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]

# dcast from long to wide form, note that ... refers to all other columns
result <- dcast(DATA, Agreement_number + country + ... ~ action, 
                value.var = "signature_date")
result

   Agreement_number country dummy1 dummy2 signature ratification
1:                1  Canada      2      D      2000         2001
2:                1     USA      3      A      2000         2002
3:                2  Canada      1      B      2001         2001
4:                2     USA      4      C      2002         2002

Note that this approach will reorder the columns.
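If the original column names and order matter, they can be restored after casting. A minimal sketch using setnames() and setcolorder() from data.table; the small table below is a made-up stand-in for the cast result, and the name front is illustrative:

```r
library(data.table)

# Made-up stand-in for a cast result with one extra column.
result <- data.table(
  Agreement_number = c(1, 1),
  country          = c("Canada", "USA"),
  dummy1           = c(2L, 3L),
  signature        = c(2000, 2000),
  ratification     = c(2001, 2002)
)

# Rename the cast columns back to the original names (by reference).
setnames(result,
         old = c("signature", "ratification"),
         new = c("signature_date", "ratification_date"))

# Put the key and date columns first, followed by all remaining columns.
front <- c("Agreement_number", "country", "signature_date", "ratification_date")
setcolorder(result, c(front, setdiff(names(result), front)))
print(names(result))
```

Both functions modify the table in place, so no copy of the wide table is made.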
