Python pandas dataframe: needs to speed up the process associated with calculating 3 rows of data
I have data as shown below:
Tran|Type|Amount|comment
1212|A|12|Buy
1212|AA|13|Buy
1212|CC|25|S
1213|AA|1112|B
1213|A|78|B
1213|CC|1190|SEllding
1214|AA|1112|B
1214|A|78|B
1214|CC|1190|SEllding
1215|AA|1112|B
1215|A|78|B
1216|AA|1112|B
....
I need to filter out all tran that have 3 types A, AA, CC and A.Amount + AA.Amount = CC.Amount
The data is huge (100M records)
My code is below but it is rather slow
df1=df.groupby("tran").filter(lambda x: len(x) == 3)
listrefn=df1.tran.tolist()
df1=df[df.tran.isin(listrefn)]
df2=df1[df1.field=='A']
dfA=df2[['tran','Amount']]
df2=df1[df1.field=='AA']
dfAA=df2[['tran','Amount']]
df2=df1[df1.field=='CC']
dfCC=df2[['tran','Amount']]
dfA=dfA.rename(columns={'tran':'tran','Amount':'A'})
dfAA=dfAA.rename(columns={'tran':'tran','Amount':'AA'})
dfCC=dfCC.rename(columns={'tran':'tran','Amount':'CC'})
dftmp=pandas.merge(dfA,dfAA,how='left')
dftmp1=pandas.merge(dftmp,dfCC,how='left')
dftmp1['diff']=dftmp1.A-dftmp1.AA-dftmp1.CC
dftmp=dftmp1[['tran','diff']]
dftmp1=dftmp[dftmp['diff']==0]
Please help with advice.
source to share
You can use pivot
with query
:
#If necessary filtering:
#df = df[df.groupby("Tran")['Type'].transform('size') == 3]
idx = df.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC').index
print (idx)
Int64Index([1212, 1213, 1214], dtype='int64', name='Tran')
df = df[df.Tran.isin(idx)]
#same as
#df = df.query('Tran in @idx')
print (df)
Tran Type Amount comment
0 1212 A 12 Buy
1 1212 AA 13 Buy
2 1212 CC 25 S
3 1213 AA 1112 B
4 1213 A 78 B
5 1213 CC 1190 SEllding
6 1214 AA 1112 B
7 1214 A 78 B
8 1214 CC 1190 SEllding
Another filtering solution:
df = df.set_index('Tran').loc[idx].reset_index()
print (df)
Tran Type Amount comment
0 1212 A 12 Buy
1 1212 AA 13 Buy
2 1212 CC 25 S
3 1213 AA 1112 B
4 1213 A 78 B
5 1213 CC 1190 SEllding
6 1214 AA 1112 B
7 1214 A 78 B
8 1214 CC 1190 SEllding
source to share
Use set_index
. The good thing is, A + AA == CC
it won't happen if all three are not there, so there is no need to check if all three are there.
df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC')
Type A AA CC
Tran
1212 12.0 13.0 25.0
1213 78.0 1112.0 1190.0
1214 78.0 1112.0 1190.0
You can get a subset of the original with
t = df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC').index
df.query("Tran in @t")
# equivalently
# df[df.Tran.isin(t)]
Tran Type Amount comment
0 1212 A 12 Buy
1 1212 AA 13 Buy
2 1212 CC 25 S
3 1213 AA 1112 B
4 1213 A 78 B
5 1213 CC 1190 SEllding
6 1214 AA 1112 B
7 1214 A 78 B
8 1214 CC 1190 SEllding
source to share
UPDATE: Looking at @ piRSquared's perfect solution , I realized that we don't need to filter the original DF in advance.
So this should be enough:
In [28]: df.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[28]:
Type A AA CC
Tran
1212 12.0 13.0 25.0
1213 78.0 1112.0 1190.0
1214 78.0 1112.0 1190.0
OLD answer:
In [23]: x = df.groupby("Tran").filter(lambda x: len(x) == 3)
In [24]: x
Out[24]:
Tran Type Amount comment
0 1212 A 12 Buy
1 1212 AA 13 Buy
2 1212 CC 25 S
3 1213 AA 1112 B
4 1213 A 78 B
5 1213 CC 1190 SEllding
6 1214 AA 1112 B
7 1214 A 78 B
8 1214 CC 1190 SEllding
In [25]: x.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[25]:
Type A AA CC
Tran
1212 12 13 25
1213 78 1112 1190
1214 78 1112 1190
source to share