Counting changes in line items
I have a dataframe and need to count how many times the value in one column changes from row to row. If the dataframe were grouped by the "id" column, one group would look like this:
id vehicle
'abc' 'bmw'
'abc' 'bmw'
'abc' 'yamaha'
'abc' 'suzuki'
'abc' 'suzuki'
'abc' 'kawasaki'
So, in this case, I would like to say that id 'abc' changed the vehicle brand 3 times (bmw to yamaha, yamaha to suzuki, suzuki to kawasaki). Is there an efficient way to do this over multiple groups for the "id" column?
I can imagine two ways:
1) groupby on the 'id' column and call nunique on the "vehicle" column. You need to subtract 1 because you are looking for the number of changes, not the total number of unique values:
In [292]:
df.groupby('id')['vehicle'].nunique() -1
Out[292]:
id
'abc' 3
Name: vehicle, dtype: int64
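To make the limits of this approach concrete, here is a minimal sketch on made-up data. The second id, 'xyz', and its rows are assumptions added for illustration and are not part of the question. It shows that nunique() - 1 matches the number of changes only when no brand reappears later within a group:

```python
import pandas as pd

# Sample data: 'abc' is the group from the question,
# 'xyz' is a hypothetical group where a brand comes back.
df = pd.DataFrame({
    "id": ["abc"] * 6 + ["xyz"] * 3,
    "vehicle": ["bmw", "bmw", "yamaha", "suzuki", "suzuki", "kawasaki",
                "bmw", "yamaha", "bmw"],
})

# Number of distinct brands per id, minus one.
unique_minus_one = df.groupby("id")["vehicle"].nunique() - 1
print(unique_minus_one)
```

For 'abc' this gives 3, as wanted. For 'xyz' it gives 1, although the brand actually changed twice (bmw to yamaha, then back to bmw), so this method is only safe if a brand never returns within a group.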
2) apply a lambda that checks whether the current vehicle differs from the previous one using shift. This is more semantically correct, since it detects actual changes rather than just counting unique values (which undercounts if a brand reappears later in the group). Calling sum on booleans converts True and False to 1 and 0 respectively:
In [293]:
df.groupby('id')['vehicle'].apply(lambda x: x != x.shift()).sum() - 1
Out[293]:
3
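The -1 is required because the first row is compared with a row that does not exist: shift yields NaN there, and any comparison with NaN reports "not equal", so the first row always counts as a change (see below). Note that with more than one id, the expression above adds up the flags of all groups and subtracts 1 only once. To get one count per id, the sum and the subtraction have to move inside the lambda.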
In [301]:
df.groupby('id')['vehicle'].apply(lambda x: x != x.shift())
Out[301]:
0 True
1 False
2 True
3 True
4 False
5 True
Name: 'abc', dtype: bool
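For several ids at once, a per-group version could look like the sketch below. The helper name count_changes and the extra 'xyz' rows are assumptions made for illustration. The second variant avoids the Python-level lambda by shifting within each group first, and it assumes the "vehicle" column has no missing values:

```python
import pandas as pd

# Sample data: 'abc' is the group from the question,
# 'xyz' is a hypothetical second group.
df = pd.DataFrame({
    "id": ["abc"] * 6 + ["xyz"] * 3,
    "vehicle": ["bmw", "bmw", "yamaha", "suzuki", "suzuki", "kawasaki",
                "bmw", "yamaha", "bmw"],
})

def count_changes(s):
    # Flag rows that differ from the previous row. The first row is
    # compared with NaN and is always flagged, hence the -1.
    return int((s != s.shift()).sum() - 1)

# Variant 1: one count per id via apply.
changes = df.groupby("id")["vehicle"].apply(count_changes)

# Variant 2: shift within each group, then ignore rows without a
# predecessor instead of subtracting 1 afterwards.
prev = df.groupby("id")["vehicle"].shift()
changed = df["vehicle"].ne(prev) & prev.notna()
changes_vec = changed.groupby(df["id"]).sum()

print(changes)
print(changes_vec)
```

Both variants report 3 changes for 'abc' and 2 for 'xyz'. The second one is likely to scale better when there are many groups, since it does not call a Python function for each group.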