Filling NaN in a DataFrame based on values in another column
I have data that resembles the following simplified example:
Col1 Col2 Col3
a A 10.1
b A NaN
d B NaN
e B 12.3
f B NaN
g C 14.1
h C NaN
i C NaN
... and so on for many thousands of rows. I need to fill Col3 forward within each group defined by Col2, using something similar to the ffill method. The result I'm looking for is the following:
Col1 Col2 Col3
a A 10.1
b A 10.1
d B NaN
e B 12.3
f B 12.3
g C 14.1
h C 14.1
i C 14.1
However, ffill on its own ignores the value in Col2. Any ideas?
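For anyone who wants to reproduce this, the sample frame above can be built like so (a sketch of the example only, not my actual data):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': list('abdefghi'),
    'Col2': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Col3': [10.1, np.nan, np.nan, 12.3, np.nan, 14.1, np.nan, np.nan],
})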
+3
DrTRD
4 answers
If I understand correctly, you can groupby on 'Col2', select 'Col3', and then call transform with ffill:
In [35]:
df['Col3'] = df.groupby('Col2')['Col3'].transform(lambda x: x.ffill())
df
Out[35]:
Col1 Col2 Col3
0 a A 10.1
1 b A 10.1
2 d B NaN
3 e B 12.3
4 f B 12.3
5 g C 14.1
6 h C 14.1
7 i C 14.1
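A slightly shorter way to write the same thing (assuming a pandas version in which the groupby object exposes ffill directly) is:
df['Col3'] = df.groupby('Col2')['Col3'].ffill()
which forward-fills Col3 within each Col2 group without the lambda.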
+2
EdChum
One answer I found is the following:
df['Col3'] = df.groupby('Col2').transform('fillna', method='ffill')['Col3']
Any thoughts?
+1
DrTRD
Is this what you are looking for?
import pandas as pd
import numpy as np
df['Col3'] = np.where(df['Col2'] == 'A', df['Col3'].fillna(10.1), df["Col3"])
Replace accordingly, of course.
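If the goal is a forward fill within every group rather than a hard-coded replacement value, the same per-group idea can also be written as an explicit loop over the group labels (just a sketch, not tested on the full data):
for letter in df['Col2'].unique():
    mask = df['Col2'] == letter  # rows belonging to this group
    # forward-fill Col3 only within this group's rows
    df.loc[mask, 'Col3'] = df.loc[mask, 'Col3'].ffill()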
0
Leb
You can take a slice of the DataFrame for each value in Col2 and then combine the results.
>>> pd.concat((df.loc[df.Col2 == letter, :].ffill() for letter in df.Col2.unique()))
Col1 Col2 Col3
0 a A 10.1
1 b A 10.1
2 d B NaN
3 e B 12.3
4 f B 12.3
5 g C 14.1
6 h C 14.1
7 i C 14.1
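Note that pd.concat stacks the slices in the order returned by df.Col2.unique(), so if the Col2 values were interleaved the rows would come back reordered; adding a sort_index() (a small tweak to the snippet above) restores the original row order:
result = pd.concat(
    df.loc[df.Col2 == letter, :].ffill() for letter in df.Col2.unique()
).sort_index()  # put rows back in their original index order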
EDIT: It seems that the method provided by @EdChum is the fastest so far.
%timeit pd.concat((df.loc[df.Col2 == letter, :].ffill() for letter in df.Col2.unique()))
100 loops, best of 3: 3.57 ms per loop
%timeit df.groupby('Col2').transform('fillna',method='ffill')['Col3']
100 loops, best of 3: 4.59 ms per loop
%timeit df.groupby('Col2')['Col3'].transform(lambda x: x.ffill())
1000 loops, best of 3: 746 µs per loop
0
Alexander