Python pandas dataframe: needs to speed up the process associated with calculating 3 rows of data

Question

Python pandas dataframe: needs to speed up the process associated with calculating 3 rows of data

I have data as shown below:

Tran|Type|Amount|comment
1212|A|12|Buy
1212|AA|13|Buy
1212|CC|25|S
1213|AA|1112|B
1213|A|78|B
1213|CC|1190|SEllding
1214|AA|1112|B
1214|A|78|B
1214|CC|1190|SEllding
1215|AA|1112|B
1215|A|78|B
1216|AA|1112|B


....

I need to filter out all tran that have 3 types A, AA, CC and A.Amount + AA.Amount = CC.Amount

The data is huge (100M records)

My code is below but it is rather slow

df1=df.groupby("tran").filter(lambda x: len(x) == 3)
listrefn=df1.tran.tolist()
df1=df[df.tran.isin(listrefn)]
df2=df1[df1.field=='A']
dfA=df2[['tran','Amount']]
df2=df1[df1.field=='AA']
dfAA=df2[['tran','Amount']]
df2=df1[df1.field=='CC']
dfCC=df2[['tran','Amount']]

dfA=dfA.rename(columns={'tran':'tran','Amount':'A'})
dfAA=dfAA.rename(columns={'tran':'tran','Amount':'AA'})
dfCC=dfCC.rename(columns={'tran':'tran','Amount':'CC'})

dftmp=pandas.merge(dfA,dfAA,how='left')
dftmp1=pandas.merge(dftmp,dfCC,how='left')
dftmp1['diff']=dftmp1.A-dftmp1.AA-dftmp1.CC
dftmp=dftmp1[['tran','diff']]
dftmp1=dftmp[dftmp['diff']==0]

Please help with advice.

+3

python pandas dataframe

Thái Bùi 05 Apr 17 at 8:37

source to share

3 answers

Use set_index

. The good thing is, A + AA == CC

it won't happen if all three are not there, so there is no need to check if all three are there.

df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC')

Type     A      AA      CC
Tran                      
1212  12.0    13.0    25.0
1213  78.0  1112.0  1190.0
1214  78.0  1112.0  1190.0

You can get a subset of the original with

t = df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC').index
df.query("Tran in @t")
# equivalently
# df[df.Tran.isin(t)]

   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding

+3

piRSquared 05 Apr 17 at 8:45

source to share

UPDATE: Looking at @ piRSquared's perfect solution , I realized that we don't need to filter the original DF in advance.

So this should be enough:

In [28]: df.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[28]:
Type     A      AA      CC
Tran
1212  12.0    13.0    25.0
1213  78.0  1112.0  1190.0
1214  78.0  1112.0  1190.0

OLD answer:

In [23]: x = df.groupby("Tran").filter(lambda x: len(x) == 3)

In [24]: x
Out[24]:
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding

In [25]: x.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[25]:
Type   A    AA    CC
Tran
1212  12    13    25
1213  78  1112  1190
1214  78  1112  1190

+2

MaxU 05 Apr 17 at 8:44

source to share

jezrael · Accepted Answer · 2017-04-05T08:44:25+0000

You can use pivot

with query

:

#If necessary filtering:
#df = df[df.groupby("Tran")['Type'].transform('size') == 3]

idx = df.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC').index
print (idx)
Int64Index([1212, 1213, 1214], dtype='int64', name='Tran')

df = df[df.Tran.isin(idx)]
#same as
#df = df.query('Tran in @idx')
print (df)
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding

Another filtering solution:

df = df.set_index('Tran').loc[idx].reset_index()
print (df)
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding

Python pandas dataframe: needs to speed up the process associated with calculating 3 rows of data

More articles: