Count the number of blank rows before the current row in pandas
I have a DataFrame with a column is_blank that indicates whether a row is NaN or not. I would like to create a new column that counts the number of NaN rows before the current row, within each set of records grouped by id.
See example below:
import pandas as pd

is_blank = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
id = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
outval = [0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 0]
test_df = pd.DataFrame({'id': id, 'is_blank': is_blank, 'outval': outval})
This is what the toy dataset looks like; outval is the column I want to create. outval[3] equals 2 because there are two blank rows before it. outval[4] is reset to zero because the row immediately before it is not blank.
In[2]: test_df
Out[2]:
id is_blank outval
0 1 0 0
1 1 1 0
2 1 1 1
3 1 0 2
4 1 1 0
5 1 1 1
6 1 1 2
7 2 0 0
8 2 0 0
9 2 0 0
10 2 1 0
I am currently trying to use some form of cumulative count:
In[3]: test_df.groupby(['id'])['is_blank'].cumsum().shift(1)
Out[3]:
0 NaN
1 0.0
2 1.0
3 2.0
4 2.0
5 3.0
6 4.0
7 5.0
8 0.0
9 0.0
10 0.0
But obviously the counter is not reset within each group, so it ends up counting all blank rows seen so far. I have been looking at expanding_apply but cannot fully understand how it works.
Any thoughts on how to solve this efficiently?
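To make the intended reset behaviour explicit, here is a deliberately naive per-group loop that produces the target values (the helper name blanks_before and the variable naive are only for illustration):

import pandas as pd

def blanks_before(blanks):
    # running count of consecutive blank rows before each position, reset after a non-blank row
    out, run = [], 0
    for b in blanks:
        out.append(run)                # blanks immediately preceding this row
        run = run + 1 if b else 0      # extend the run, or reset it on a non-blank row
    return out

naive = pd.concat(
    pd.Series(blanks_before(grp.tolist()), index=grp.index)
    for _, grp in test_df.groupby('id')['is_blank']
).sort_index()

naive.tolist() comes out as [0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 0], i.e. exactly the outval column above, but the Python-level loop is what I would like to avoid.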
Vectorized approach
import numpy as np

b = np.append(0, test_df.is_blank.values[:-1])       # previous row's is_blank (0 for row 0)
i = test_df.id.values
k = len(b)                                           # number of rows
r = np.arange(k)                                     # row positions

bc = b.cumsum()                                      # running total of blanks
w1 = np.where(b == 0)[0]                             # rows whose previous row is not blank
w2 = np.append(0, np.where(i[:-1] != i[1:])[0] + 1)  # first row of each id group

bd2 = bc[w2].repeat(np.diff(np.append(r[w2], k)))    # bc at the last id reset, carried forward
bd1 = bc[w1].repeat(np.diff(np.append(r[w1], k)))    # bc at the last non-blank reset, carried forward

test_df.assign(outval=bc - np.fmax(bd1, bd2))
id is_blank outval
0 1 0 0
1 1 1 0
2 1 1 1
3 1 0 2
4 1 1 0
5 1 1 1
6 1 1 2
7 2 0 0
8 2 0 0
9 2 0 0
10 2 1 0
The idea is to find the positions where the count should "reset": where id changes (w2) and where the previous row's is_blank is zero (w1).
bc is the running total of is_blank; at each position, subtract the value bc had at the most recent reset.
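On the toy data those intermediates work out as follows (a sketch that just prints the arrays defined above; the values in the comments assume k = len(b) and r = np.arange(k) as in the code block):

print(b)                       # [0 0 1 1 0 1 1 1 0 0 0]  previous row's is_blank
print(bc)                      # [0 0 1 2 2 3 4 5 5 5 5]  running blank count
print(w1)                      # [ 0  1  4  8  9 10]      rows whose previous row is not blank
print(w2)                      # [0 7]                    first row of each id
print(bd1)                     # [0 0 0 0 2 2 2 2 5 5 5]  bc at the last w1 reset, carried forward
print(bd2)                     # [0 0 0 0 0 0 0 5 5 5 5]  bc at the last w2 reset, carried forward
print(bc - np.fmax(bd1, bd2))  # [0 0 1 2 0 1 2 0 0 0 0]  == outval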
For the price of less intuitive code ... you get faster execution times
[timing plot omitted: "naive time test"]
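The plot is not reproduced here, but a rough sketch of how such a timing comparison could be run is below; big_df, pir and grp are my own illustrative names wrapping the NumPy approach above and the groupby approach from the next answer:

import numpy as np
import pandas as pd
from timeit import timeit

# a larger frame for timing: tile the toy data, keeping ids distinct per copy
big_df = pd.concat([test_df.assign(id=test_df.id + 2 * n) for n in range(1000)],
                   ignore_index=True)

def pir(df):
    # NumPy approach from above, wrapped for timing
    b = np.append(0, df.is_blank.values[:-1])
    i = df.id.values
    k, r = len(b), np.arange(len(b))
    bc = b.cumsum()
    w1 = np.where(b == 0)[0]
    w2 = np.append(0, np.where(i[:-1] != i[1:])[0] + 1)
    bd1 = bc[w1].repeat(np.diff(np.append(r[w1], k)))
    bd2 = bc[w2].repeat(np.diff(np.append(r[w2], k)))
    return df.assign(outval=bc - np.fmax(bd1, bd2))

def grp(df):
    # groupby approach from the answer below, wrapped for timing
    g = (df.is_blank.diff() != 0).cumsum()
    return df.assign(outval=df.groupby([df.id, g]).is_blank.cumsum()
                               .groupby(df.id).shift().fillna(0))

print(timeit(lambda: pir(big_df), number=100))
print(timeit(lambda: grp(big_df), number=100))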
You can create an additional grouping variable based on is_blank so that the cumsum resets:
test_df['outval'] = (test_df.groupby([test_df.id, (test_df.is_blank.diff() != 0).cumsum()])
                            .is_blank.cumsum().groupby(test_df.id).shift().fillna(0))
test_df
Breakdown:
# create a grouping variable that increments whenever is_blank changes,
# i.e. at the start of each run of blanks or non-blanks
g = (test_df.is_blank.diff() != 0).cumsum()
g
#0 1
#1 2
#2 2
#3 3
#4 4
#5 4
#6 4
#7 5
#8 5
#9 5
#10 6
#Name: is_blank, dtype: int64
# group the frame by both id and g, cumsum is_blank, then shift within each id and fill the leading NaN with 0
test_df.groupby([test_df.id, g]).is_blank.cumsum().groupby(test_df.id).shift().fillna(0)
#0 0.0
#1 0.0
#2 1.0
#3 2.0
#4 0.0
#5 1.0
#6 2.0
#7 0.0
#8 0.0
#9 0.0
#10 0.0
#Name: is_blank, dtype: float64
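As a quick sanity check (my own addition, not part of the original answer), the recomputed series matches the expected outval list from the question:

check = (test_df.groupby([test_df.id, g]).is_blank.cumsum()
                .groupby(test_df.id).shift().fillna(0))
print(check.tolist() == outval)   # True -- 0.0 == 0 compares equal element-wise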