Sequential filtering using data.table in R
I have the following data:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
10 1986 NA
10 1986 NA
10 1987 240
10 1987 NA
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
12 1984 240
I need to filter rows based on the following criteria:
- For each PERMNO, sort the data by date
- Walk through the sorted data and remove the rows after the company gets delisted (i.e. DLSTCD != NA)
- If the very first row for a company is already delisted, do not include any rows for that company
Based on these criteria, the following is my expected result:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
I am using data.table in R to work with this data. The example above is a simplified version of my actual data, which contains about 3M rows corresponding to 30,000 PERMNOs.
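For reproducibility, the sample input above can be built directly as a data.table (a sketch; the exact column types in the real data may differ):

```r
library(data.table)

# Sample input from the question
dt <- data.table(
  PERMNO = c(10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12),
  date   = c(1983, 1985, 1986, 1986, 1987, 1987, 1984, 1984, 1985, 1987, 1984),
  DLSTCD = c(NA, 250, NA, NA, 240, NA, NA, NA, NA, NA, 240)
)
```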
I have implemented three different methods for this, as can be seen here:
r-script: http://www.r-fiddle.org/#/fiddle?id=4GapqSbX&version=3
Below, I compare my implementations with a small dataset of 50k rows. Here are my results:
Time comparison
system.time(dt <- filterbydelistingcode(dt)) # 39.962 seconds
system.time(dt <- filterbydelistcoderowindices(dt)) # 39.014 seconds
system.time(dt <- filterbydelistcodeinline(dt)) # 114.3 seconds
As you can see, all of my implementations are extremely inefficient. Can someone please help me implement a much faster version of this? Thanks.
Edit: Here is a link to a sample 50k-row dataset that I used to compare times: https://ufile.io/q9d8u
Also, here's the setup/read function for this data:
readdata = function(filename){
  data = read.csv(filename, header = TRUE, colClasses = c(date = "Date"))
  PRCABS = abs(data$PRC)                  # absolute price
  mcap = PRCABS * data$SHROUT             # market capitalization
  hpr = data$RET                          # RET comes in as a factor
  hpr[hpr == ""] = NA                     # blank returns become NA before conversion
  HPR = as.numeric(levels(hpr))[hpr]      # fast factor -> numeric conversion
  data = cbind(data, PRCABS, mcap, HPR)
  return(data)
}

library(data.table)
data <- readdata('fewdata.csv')
dt <- as.data.table(data)
Here's an attempt in data.table:
dat[
  dat[order(date),
      {
        pos <- match(TRUE, !is.na(DLSTCD));
        (.I <= .I[pos] & pos != 1) | (is.na(pos))
      },
      by=PERMNO]$V1
]
# PERMNO date DLSTCD
#1: 10 1983 NA
#2: 10 1985 250
#3: 11 1984 NA
#4: 11 1984 NA
#5: 11 1985 NA
#6: 11 1987 NA
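The same per-group rule can also be expressed with cumsum() over the delisting flags. This is a sketch of an equivalent formulation (my own, not the answer's code), shown on the question's sample:

```r
library(data.table)

# Question's sample data
dat <- data.table(
  PERMNO = c(10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12),
  date   = c(1983, 1985, 1986, 1986, 1987, 1987, 1984, 1984, 1985, 1987, 1984),
  DLSTCD = c(NA, 250, NA, NA, 240, NA, NA, NA, NA, NA, 240)
)

res <- dat[order(date), {
  d <- !is.na(DLSTCD)
  # the cumulative count of *previous* delistings is zero up to and
  # including the first delisted row
  keep <- cumsum(shift(d, fill = FALSE)) == 0
  if (d[1]) keep <- rep(FALSE, .N)  # first row already delisted: drop the group
  .SD[keep]
}, by = PERMNO]
```

On this sample, res contains the six expected rows and PERMNO 12 drops out entirely.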
Testing on 2.5 million rows, 400,000 of which have a delisting code:
set.seed(1)
dat <- data.frame(PERMNO=sample(1:22000,2.5e6,replace=TRUE), date=1:2.5e6)
dat$DLSTCD <- NA
dat$DLSTCD[sample(1:2.5e6, 400000)] <- 1
setDT(dat)
system.time({
  dat[
    dat[order(date),
        {
          pos <- match(TRUE, !is.na(DLSTCD));
          (.I <= .I[pos] & pos != 1) | (is.na(pos))
        },
        by=PERMNO]$V1
  ]
})
# user system elapsed
# 0.74 0.00 0.76
Less than a second is not bad.
Building on @thelatemail's answer, here are two more variations on the same theme. In both cases, calling setkey() first simplifies the reasoning:
setkey(dat,PERMNO,date) # sort by PERMNO, then by date within PERMNO
Option 1: select the desired rows (if any) from each group
system.time(
  ans1 <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (!length(w)) .SD
    else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
)
user system elapsed
2.604 0.000 2.605
This is quite slow, because allocating and populating all the little bits of memory for each group's result, only to stitch them together into one single result at the end, takes time and memory.
Option 2: (closer to how you phrased the question) find the row numbers to delete, then delete them.
system.time({
  todelete <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (length(w)) .I[seq.int(from=if (w==1) 1 else w+1, to=.N)]
  }, keyby=PERMNO]
  ans2 <- dat[ -todelete$V1 ]
})
user system elapsed
0.160 0.000 0.159
It's faster because it only has to stack up the to-delete row numbers, followed by a single operation that removes the required rows in one bulk step. Since it groups by the first column of the key, it uses the key to speed up the grouping (the groups are contiguous in RAM).
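As a quick sanity check (my own sketch, rerunning both options above on the question's small sample), the two answers should agree and match the expected six rows:

```r
library(data.table)

dat <- data.table(
  PERMNO = c(10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12),
  date   = c(1983, 1985, 1986, 1986, 1987, 1987, 1984, 1984, 1985, 1987, 1984),
  DLSTCD = c(NA, 250, NA, NA, 240, NA, NA, NA, NA, NA, 240)
)
setkey(dat, PERMNO, date)

# Option 1: keep the head of each group
ans1 <- dat[, {
  w = first(which(!is.na(DLSTCD)))
  if (!length(w)) .SD
  else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]

# Option 2: collect the row numbers to delete, then delete in one go
todelete <- dat[, {
  w = first(which(!is.na(DLSTCD)))
  if (length(w)) .I[seq.int(from=if (w==1) 1 else w+1, to=.N)]
}, keyby=PERMNO]
ans2 <- dat[-todelete$V1]

fsetequal(ans1, ans2)  # both options return the same rows
```

On this sample both answers contain the six expected rows, with PERMNO 12 excluded.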
More information on ?.SD and ?.I can be found on the ?data.table manual page.
You can test and debug what's going on within each group simply by adding a browser() call and taking a look, like this:
> ans1 <- dat[, {
browser()
w = first(which(!is.na(DLSTCD)))
if (!length(w)) .SD
else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]
Browse[1]> .SD # type .SD to look at it
date DLSTCD
1: 21679 NA
2: 46408 1
3: 68378 NA
4: 75362 NA
5: 77690 NA
---
111: 2396559 1
112: 2451629 NA
113: 2461958 NA
114: 2484403 NA
115: 2485217 NA
Browse[1]> w # doesn't exist yet because browser() before that line
Error: object 'w' not found
Browse[1]> w = first(which(!is.na(DLSTCD))) # copy and paste line
Browse[1]> w
[1] 2
Browse[1]> if (!length(w)) .SD else if (w>1) .SD[seq_len(w)]
date DLSTCD
1: 21679 NA
2: 46408 1
Browse[1]> # that is what is returned for this group
Browse[1]> n # or type n to step to next line
debug at #3: w = first(which(!is.na(DLSTCD)))
Browse[2]> help # for browser commands
Let's say you encounter a problem or error with one specific PERMNO. You can make the browser() call conditional, like this:
> ans1 <- dat[, {
if (PERMNO==42) browser()
w = first(which(!is.na(DLSTCD)))
if (!length(w)) .SD
else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]
Browse[1]> .SD
date DLSTCD
1: 31018 NA
2: 35803 1
3: 37494 NA
4: 50012 NA
5: 52459 NA
---
128: 2405818 NA
129: 2429995 NA
130: 2455519 NA
131: 2478605 1
132: 2497925 NA
Browse[1]>