Sequential filtering using data.table in R
I have the following data:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
10 1986 NA
10 1986 NA
10 1987 240
10 1987 NA
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
12 1984 240
I need to filter rows based on the following criteria:
- For each PERMNO, sort the data by date
- Walk through the sorted data and remove the rows after the company gets delisted (i.e. DLSTCD != NA)
- If the very first row for a company is already delisted, do not include any rows for that company
Based on these criteria, the following is my expected result:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
I am using data.table in R to work with this data. The example above is a simplified version of my actual data, which contains about 3M rows corresponding to 30,000 PERMNOs.
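For reproducibility, the sample input above can be built directly as a data.table (a sketch; the exact column types in the real data may differ):

```r
library(data.table)

# Sample input from the question
dt <- data.table(
  PERMNO = c(10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12),
  date   = c(1983, 1985, 1986, 1986, 1987, 1987, 1984, 1984, 1985, 1987, 1984),
  DLSTCD = c(NA, 250, NA, NA, 240, NA, NA, NA, NA, NA, 240)
)
```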
I have implemented three different methods for this, as can be seen here:
r-script: http://www.r-fiddle.org/#/fiddle?id=4GapqSbX&version=3
Below, I compare my implementations with a small dataset of 50k rows. Here are my results:
Time comparison
system.time(dt <- filterbydelistingcode(dt)) # 39.962 seconds
system.time(dt <- filterbydelistcoderowindices(dt)) # 39.014 seconds
system.time(dt <- filterbydelistcodeinline(dt)) # 114.3 seconds
As you can see, all of my implementations are extremely inefficient. Can someone please help me implement a much faster version of this? Thanks.
Edit: Here is a link to a sample 50k-row dataset that I used to compare times: https://ufile.io/q9d8u
Also, here's the setup/read function for this data:
readdata = function(filename){
  data = read.csv(filename, header = TRUE, colClasses = c(date = "Date"))
  PRCABS = abs(data$PRC)                  # absolute price
  mcap = PRCABS * data$SHROUT             # market capitalization
  hpr = data$RET                          # RET comes in as a factor
  hpr[hpr == ""] = NA                     # blank returns become NA before conversion
  HPR = as.numeric(levels(hpr))[hpr]      # fast factor -> numeric conversion
  data = cbind(data, PRCABS, mcap, HPR)
  return(data)
}

library(data.table)
data <- readdata('fewdata.csv')
dt <- as.data.table(data)
Here's an attempt in data.table:
dat[
  dat[order(date),
      {
        pos <- match(TRUE, !is.na(DLSTCD));
        (.I <= .I[pos] & pos != 1) | (is.na(pos))
      },
      by=PERMNO]$V1
]
# PERMNO date DLSTCD
#1: 10 1983 NA
#2: 10 1985 250
#3: 11 1984 NA
#4: 11 1984 NA
#5: 11 1985 NA
#6: 11 1987 NA
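The same per-group rule can also be expressed with cumsum() over the delisting flags. This is a sketch of an equivalent formulation (my own, not the answer's code), shown on the question's sample:

```r
library(data.table)

# Question's sample data
dat <- data.table(
  PERMNO = c(10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12),
  date   = c(1983, 1985, 1986, 1986, 1987, 1987, 1984, 1984, 1985, 1987, 1984),
  DLSTCD = c(NA, 250, NA, NA, 240, NA, NA, NA, NA, NA, 240)
)

res <- dat[order(date), {
  d <- !is.na(DLSTCD)
  # the cumulative count of *previous* delistings is zero up to and
  # including the first delisted row
  keep <- cumsum(shift(d, fill = FALSE)) == 0
  if (d[1]) keep <- rep(FALSE, .N)  # first row already delisted: drop the group
  .SD[keep]
}, by = PERMNO]
```

On this sample, res contains the six expected rows and PERMNO 12 drops out entirely.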
Testing on 2.5 million rows, 400,000 of which have a delisting code:
set.seed(1)
dat <- data.frame(PERMNO=sample(1:22000,2.5e6,replace=TRUE), date=1:2.5e6)
dat$DLSTCD <- NA
dat$DLSTCD[sample(1:2.5e6, 400000)] <- 1
setDT(dat)
system.time({
  dat[
    dat[order(date),
        {
          pos <- match(TRUE, !is.na(DLSTCD));
          (.I <= .I[pos] & pos != 1) | (is.na(pos))
        },
        by=PERMNO]$V1
  ]
})
# user system elapsed
# 0.74 0.00 0.76
Less than a second is not bad.
Building on @thelatemail's answer, here are two more variations on the same theme. In both cases, calling setkey() first simplifies the reasoning:
setkey(dat,PERMNO,date) # sort by PERMNO, then by date within PERMNO
Option 1: select the desired rows (if any) from each group
system.time(
  ans1 <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (!length(w)) .SD
    else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
)
user system elapsed
2.604 0.000 2.605
This is quite slow, because allocating and populating all the little bits of memory for each group's result, only to stitch them together into one single result at the end, takes time and memory.
Option 2: (closer to how you phrased the question) find the row numbers to delete, then delete them.
system.time({
  todelete <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (length(w)) .I[seq.int(from=if (w==1) 1 else w+1, to=.N)]
  }, keyby=PERMNO]
  ans2 <- dat[ -todelete$V1 ]
})
user system elapsed
0.160 0.000 0.159
It's faster because it only has to stack up the to-delete row numbers, followed by a single operation that removes the required rows in one bulk step. Since it groups by the first column of the key, it uses the key to speed up the grouping (the groups are contiguous in RAM).
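As a quick sanity check (my own sketch, rerunning both options above on the question's small sample), the two answers should agree and match the expected six rows:

```r
library(data.table)

dat <- data.table(
  PERMNO = c(10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12),
  date   = c(1983, 1985, 1986, 1986, 1987, 1987, 1984, 1984, 1985, 1987, 1984),
  DLSTCD = c(NA, 250, NA, NA, 240, NA, NA, NA, NA, NA, 240)
)
setkey(dat, PERMNO, date)

# Option 1: keep the head of each group
ans1 <- dat[, {
  w = first(which(!is.na(DLSTCD)))
  if (!length(w)) .SD
  else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]

# Option 2: collect the row numbers to delete, then delete in one go
todelete <- dat[, {
  w = first(which(!is.na(DLSTCD)))
  if (length(w)) .I[seq.int(from=if (w==1) 1 else w+1, to=.N)]
}, keyby=PERMNO]
ans2 <- dat[-todelete$V1]

fsetequal(ans1, ans2)  # both options return the same rows
```

On this sample both answers contain the six expected rows, with PERMNO 12 excluded.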
More information on ?.SD and ?.I can be found on the ?data.table manual page.
You can test and debug what's going on within each group simply by adding a browser() call and taking a look, like this:
> ans1 <- dat[, {
browser()
w = first(which(!is.na(DLSTCD)))
if (!length(w)) .SD
else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]
Browse[1]> .SD # type .SD to look at it
date DLSTCD
1: 21679 NA
2: 46408 1
3: 68378 NA
4: 75362 NA
5: 77690 NA
---
111: 2396559 1
112: 2451629 NA
113: 2461958 NA
114: 2484403 NA
115: 2485217 NA
Browse[1]> w # doesn't exist yet because browser() before that line
Error: object 'w' not found
Browse[1]> w = first(which(!is.na(DLSTCD))) # copy and paste line
Browse[1]> w
[1] 2
Browse[1]> if (!length(w)) .SD else if (w>1) .SD[seq_len(w)]
date DLSTCD
1: 21679 NA
2: 46408 1
Browse[1]> # that is what is returned for this group
Browse[1]> n # or type n to step to next line
debug at #3: w = first(which(!is.na(DLSTCD)))
Browse[2]> help # for browser commands
Let's say you encounter a problem or error with one specific PERMNO. You can make the browser() call conditional, like this:
> ans1 <- dat[, {
if (PERMNO==42) browser()
w = first(which(!is.na(DLSTCD)))
if (!length(w)) .SD
else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]
Browse[1]> .SD
date DLSTCD
1: 31018 NA
2: 35803 1
3: 37494 NA
4: 50012 NA
5: 52459 NA
---
128: 2405818 NA
129: 2429995 NA
130: 2455519 NA
131: 2478605 1
132: 2497925 NA
Browse[1]>