Count the number of blank rows before the current row in pandas
I have a DataFrame with a column is_blank that indicates whether a row is NaN or not. I would like to create a new column that counts the number of NaN rows before the current row, within each set of records grouped by id.
See example below:
import pandas as pd

is_blank = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
id = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
outval = [0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 0]
test_df = pd.DataFrame({'id': id, 'is_blank': is_blank, 'outval': outval})
This is what the toy dataset looks like; outval is the column I want to create. outval[3] equals 2 because there are two blank rows before it. outval[4] is reset to zero because the row immediately before it is not blank.
In[2]: test_df
Out[2]:
id is_blank outval
0 1 0 0
1 1 1 0
2 1 1 1
3 1 0 2
4 1 1 0
5 1 1 1
6 1 1 2
7 2 0 0
8 2 0 0
9 2 0 0
10 2 1 0
I am currently trying to use some form of cumulative count:
In[3]: test_df.groupby(['id'])['is_blank'].cumsum().shift(1)
Out[3]:
0 NaN
1 0.0
2 1.0
3 2.0
4 2.0
5 3.0
6 4.0
7 5.0
8 0.0
9 0.0
10 0.0
But obviously the counter is not reset within each group, so it ends up counting all blank rows seen so far. I have been looking at expanding_apply but cannot fully understand how it works.
Any thoughts on how to solve this efficiently?
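To make the intended reset behaviour explicit, here is a deliberately naive per-group loop that produces the target values (the helper name blanks_before and the variable naive are only for illustration):

import pandas as pd

def blanks_before(blanks):
    # running count of consecutive blank rows before each position, reset after a non-blank row
    out, run = [], 0
    for b in blanks:
        out.append(run)                # blanks immediately preceding this row
        run = run + 1 if b else 0      # extend the run, or reset it on a non-blank row
    return out

naive = pd.concat(
    pd.Series(blanks_before(grp.tolist()), index=grp.index)
    for _, grp in test_df.groupby('id')['is_blank']
).sort_index()

naive.tolist() comes out as [0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 0], i.e. exactly the outval column above, but the Python-level loop is what I would like to avoid.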
Vectorized approach
import numpy as np

b = np.append(0, test_df.is_blank.values[:-1])       # previous row's is_blank (0 for row 0)
i = test_df.id.values
k = len(b)                                           # number of rows
r = np.arange(k)                                     # row positions

bc = b.cumsum()                                      # running total of blanks
w1 = np.where(b == 0)[0]                             # rows whose previous row is not blank
w2 = np.append(0, np.where(i[:-1] != i[1:])[0] + 1)  # first row of each id group

bd2 = bc[w2].repeat(np.diff(np.append(r[w2], k)))    # bc at the last id reset, carried forward
bd1 = bc[w1].repeat(np.diff(np.append(r[w1], k)))    # bc at the last non-blank reset, carried forward

test_df.assign(outval=bc - np.fmax(bd1, bd2))
id is_blank outval
0 1 0 0
1 1 1 0
2 1 1 1
3 1 0 2
4 1 1 0
5 1 1 1
6 1 1 2
7 2 0 0
8 2 0 0
9 2 0 0
10 2 1 0
The idea is to find the positions where the count should "reset": where id changes (w2) and where the previous row's is_blank is zero (w1).
bc is the running total of is_blank; at each position, subtract the value bc had at the most recent reset.
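On the toy data those intermediates work out as follows (a sketch that just prints the arrays defined above; the values in the comments assume k = len(b) and r = np.arange(k) as in the code block):

print(b)                       # [0 0 1 1 0 1 1 1 0 0 0]  previous row's is_blank
print(bc)                      # [0 0 1 2 2 3 4 5 5 5 5]  running blank count
print(w1)                      # [ 0  1  4  8  9 10]      rows whose previous row is not blank
print(w2)                      # [0 7]                    first row of each id
print(bd1)                     # [0 0 0 0 2 2 2 2 5 5 5]  bc at the last w1 reset, carried forward
print(bd2)                     # [0 0 0 0 0 0 0 5 5 5 5]  bc at the last w2 reset, carried forward
print(bc - np.fmax(bd1, bd2))  # [0 0 1 2 0 1 2 0 0 0 0]  == outval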
For the price of less intuitive code ... you get faster execution times
[timing plot omitted: "naive time test"]
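The plot is not reproduced here, but a rough sketch of how such a timing comparison could be run is below; big_df, pir and grp are my own illustrative names wrapping the NumPy approach above and the groupby approach from the next answer:

import numpy as np
import pandas as pd
from timeit import timeit

# a larger frame for timing: tile the toy data, keeping ids distinct per copy
big_df = pd.concat([test_df.assign(id=test_df.id + 2 * n) for n in range(1000)],
                   ignore_index=True)

def pir(df):
    # NumPy approach from above, wrapped for timing
    b = np.append(0, df.is_blank.values[:-1])
    i = df.id.values
    k, r = len(b), np.arange(len(b))
    bc = b.cumsum()
    w1 = np.where(b == 0)[0]
    w2 = np.append(0, np.where(i[:-1] != i[1:])[0] + 1)
    bd1 = bc[w1].repeat(np.diff(np.append(r[w1], k)))
    bd2 = bc[w2].repeat(np.diff(np.append(r[w2], k)))
    return df.assign(outval=bc - np.fmax(bd1, bd2))

def grp(df):
    # groupby approach from the answer below, wrapped for timing
    g = (df.is_blank.diff() != 0).cumsum()
    return df.assign(outval=df.groupby([df.id, g]).is_blank.cumsum()
                               .groupby(df.id).shift().fillna(0))

print(timeit(lambda: pir(big_df), number=100))
print(timeit(lambda: grp(big_df), number=100))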
You can create an additional grouping variable based on is_blank so that the cumsum resets:
test_df['outval'] = (test_df.groupby([test_df.id, (test_df.is_blank.diff() != 0).cumsum()])
                            .is_blank.cumsum().groupby(test_df.id).shift().fillna(0))
test_df
Breakdown:
# create a grouping variable that increments whenever is_blank changes,
# i.e. at the start of each run of blanks or non-blanks
g = (test_df.is_blank.diff() != 0).cumsum()
g
#0 1
#1 2
#2 2
#3 3
#4 4
#5 4
#6 4
#7 5
#8 5
#9 5
#10 6
#Name: is_blank, dtype: int64
# group the frame by both id and g, cumsum is_blank, then shift within each id and fill the leading NaN with 0
test_df.groupby([test_df.id, g]).is_blank.cumsum().groupby(test_df.id).shift().fillna(0)
#0 0.0
#1 0.0
#2 1.0
#3 2.0
#4 0.0
#5 1.0
#6 2.0
#7 0.0
#8 0.0
#9 0.0
#10 0.0
#Name: is_blank, dtype: float64
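As a quick sanity check (my own addition, not part of the original answer), the recomputed series matches the expected outval list from the question:

check = (test_df.groupby([test_df.id, g]).is_blank.cumsum()
                .groupby(test_df.id).shift().fillna(0))
print(check.tolist() == outval)   # True -- 0.0 == 0 compares equal element-wise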