Removing offsetting rows in an R data frame by group
This is what my dataframe looks like:
df <- read.table(text='
CustomerName Sales TradeDate
John 1000 1/1/2015
John -1000 1/1/2015
John 1000 1/1/2015
John 5000 2/1/2015
John -2000 3/1/2015
John 2000 3/2/2015
John 2000 3/3/2015
John -2000 3/4/2015
John 2000 3/5/2015
John 2000 3/6/2015
John -3000 4/1/2015
John 3000 4/1/2015
John -3000 4/1/2015
John 2000 5/1/2015
John -2000 5/1/2015
John 2000 5/1/2015
Tom 1000 1/1/2015
Tom -1000 1/1/2015
Tom 1000 1/1/2015
Tom 5000 2/1/2015
Tom -2000 3/1/2015
Tom 2000 3/1/2015
Tom -2000 3/1/2015
Tom 2000 3/1/2015
Tom 2000 3/1/2015
Tom -3000 4/1/2015
Tom 3000 4/1/2015
Tom -3000 4/1/2015
', header=TRUE)
For each customer, I want to remove Sales that offset each other, meaning they are equal in magnitude and opposite in sign (+, -), and show only the remaining net sales (preferably at the earliest possible dates, but that doesn't really matter). My desired dataframe looks like this:
CustomerName Sales TradeDate
John 1000 1/1/2015
John 5000 2/1/2015
John 2000 3/3/2015
John 2000 3/6/2015
John -3000 4/1/2015
John 2000 5/1/2015
Tom 1000 1/1/2015
Tom 5000 2/1/2015
Tom 2000 3/1/2015
Tom -3000 4/1/2015
In John's case in March, I picked the two 2000s from 3/3/2015 and 3/6/2015, but I am also fine with output that gives me the two 2000s on 3/2/2015 or 5/5/2015. Your help is greatly appreciated!
2 answers
Here's what I would do, in data.table:
library(data.table)
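# identify how many transactions we need to keep
# (net sales divided by the transaction size, per customer and size)
setDT(df)[,
  n_keep := sum(Sales)/transval
, by = .(CustomerName, transval = abs(Sales))]
# tag those transactions: among rows whose sign matches the net sign,
# mark the last abs(n_keep) rows in each group
df[sign(Sales) == sign(n_keep),
  keep := 1:.N %in% tail(1:.N, abs(n_keep[1]))
, by = .(CustomerName, Sales)]
# keep 'em, then drop the helper columns
df[(keep)][, c("n_keep", "keep") := NULL][]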
which gives
CustomerName Sales TradeDate
1: John 1000 1/1/2015
2: John 5000 2/1/2015
3: John 2000 3/5/2015
4: John 2000 3/6/2015
5: John -3000 4/1/2015
6: Tom 1000 1/1/2015
7: Tom 5000 2/1/2015
8: Tom 2000 3/1/2015
9: Tom -3000 4/1/2015
I'm sure my code can be simplified, but I think the procedure is pretty transparent.
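To make the counting step concrete, here is a minimal base R sketch of the same idea on a single group. The vector `sales` is an illustrative subset (John's 2000-sized trades from the question, in their original order), and the names `same_sign` and `kept` are only for this example, not part of the answer's code.

```r
# Illustrative subset: John's trades of size 2000, in original row order
sales <- c(-2000, 2000, 2000, -2000, 2000, 2000, 2000, -2000, 2000)

# Net number of trades of this size that survive cancellation
# (positive means net buys remain, negative means net sells remain)
n_keep <- sum(sales) / abs(sales[1])

# Positions of the trades whose sign matches the sign of the net
same_sign <- which(sign(sales) == sign(n_keep))

# Keep the last abs(n_keep) of them, mirroring the tail() call above
kept <- tail(same_sign, abs(n_keep))

print(n_keep)
print(kept)
```

Swapping `tail()` for `head()` would keep the earliest matching trades instead of the latest ones, which is closer to the preference stated in the question.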
An alternative solution is to simply calculate the daily totals:
library(dplyr)
df %>%
  group_by(CustomerName, TradeDate) %>%
  summarise(Sales = sum(Sales))
#> Source: local data frame [14 x 3]
#> Groups: CustomerName
#>
#> CustomerName TradeDate Sales
#> 1 John 1/1/2015 1000
#> 2 John 2/1/2015 5000
#> 3 John 3/1/2015 -2000
#> 4 John 3/2/2015 2000
#> 5 John 3/3/2015 2000
#> 6 John 3/4/2015 -2000
#> ...
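Note that daily totals only cancel trades booked on the same date, so offsetting trades on different days (such as John's -2000 on 3/1/2015) remain in the result. If netting across dates is wanted, one possible dplyr sketch is to group by customer and trade size instead. The small `df` below is a made-up subset of Tom's rows, and the helper columns `size` and `n_keep` are assumptions for this example. It also assumes no zero-valued Sales, since those would divide by zero.

```r
library(dplyr)

# Made-up subset of the question's data, for illustration only
df <- data.frame(
  CustomerName = c("Tom", "Tom", "Tom", "Tom"),
  Sales        = c(1000, -1000, 1000, 5000),
  TradeDate    = c("1/1/2015", "1/1/2015", "1/1/2015", "2/1/2015"),
  stringsAsFactors = FALSE
)

net <- df %>%
  # net count of surviving trades per customer and trade size
  group_by(CustomerName, size = abs(Sales)) %>%
  mutate(n_keep = sum(Sales) / size) %>%
  # within each customer and signed amount, keep the earliest rows
  # whose sign matches the sign of the net position
  group_by(CustomerName, Sales) %>%
  filter(sign(Sales) == sign(n_keep),
         row_number() <= abs(n_keep)) %>%
  ungroup() %>%
  select(CustomerName, Sales, TradeDate)

print(net)
```

Because `row_number()` counts from the top of each group, this variant keeps the earliest surviving trades rather than the latest.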