Fetching indices of 1st fill of elements in series in dataframe columns
Consider a python dataframe
A B C
1 random imp1
2 random imp2
5 random imp3
1 yes ---
2 yes ---
3 no ---
4 no ---
5 yes ---
Whenever column B is yes, I want to get the values โโof A. And ultimately for those values โโof A, I want C when those values โโhappened 1st in A. So in this case, I finally want imp1. imp2 and imp3.
Is there an elegant way to do this.
+3
source to share
3 answers
You can use boolean indexing
with loc
first and then the duplicated
last filter with isin
values a
:
a = df.loc[df['B'] == 'yes', 'A']
df = df.drop_duplicates('A')
df = df.loc[df['A'].isin(a), 'C']
print (df)
0 imp1
1 imp2
2 imp3
Name: C, dtype: object
Delay
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'B': np.random.choice(['yes','no', 'a', 'b', 'c'], N),
'A':np.random.randint(1000, size=N),
'C':np.random.randint(1000, size=N)})
print (df)
print (df[df.A.isin(df[df.B == 'yes'].A)].drop_duplicates('A').C)
print (df[df.A.isin(df[df.B == 'yes'].drop_duplicates('A').A)].C)
def fjez(df):
a = df.loc[df['B'] == 'yes', 'A']
df = df.drop_duplicates('A')
return df.loc[df['A'].isin(a), 'C']
def fpir(df):
a = df.A.values
b = df.B.values == 'yes'
d = df.drop_duplicates('A')
return d.C[np.in1d(d.A.values, a[b])]
print (fjez(df))
print (fpir(df))
In [296]: %timeit (df[df.A.isin(df[df.B == 'yes'].A)].drop_duplicates('A').C)
1 loop, best of 3: 226 ms per loop
In [297]: %timeit (df[df.A.isin(df[df.B == 'yes'].drop_duplicates('A').A)].C)
1 loop, best of 3: 185 ms per loop
In [298]: %timeit (fjez(df))
10 loops, best of 3: 156 ms per loop
In [299]: %timeit (fpir(df))
10 loops, best of 3: 87.1 ms per loop
+2
source to share
It should be really fast
a = df.A.values
b = df.B.values == 'yes'
d = df.drop_duplicates('A')
d.C[np.in1d(d.A.values, a[b])]
0 imp1
1 imp2
2 imp3
Name: C, dtype: object
Above the top approach. 50% faster than my other approach.
from numba import njit
@njit
def proc(f, m):
mx = f.max() + 1
a = [False] * mx
b = [0] * mx
z = [0] * f.size
for i in range(f.size):
x = f[i]
y = m[i]
b[x] += 1
z[i] = b[x]
a[x] = a[x] or y
return np.array(z) == 1, np.array(a)[f]
df.C[np.logical_and(*proc(pd.factorize(df.A.values)[0], df.B.values == 'yes'))]
0 imp1
1 imp2
2 imp3
Name: C, dtype: object
+1
source to share