Slice pandas dataframe in groups of sequential values
I have a dataframe containing runs of consecutive values that occasionally "skip" (i.e. increase by more than 1). I would like to split the dataframe at each skip, much like a groupby would
(the alphabetical index is for illustration only):
A
a 1
b 2
c 3
d 6
e 7
f 8
g 11
h 12
i 13
# would return
a 1
b 2
c 3
-----
d 6
e 7
f 8
-----
g 11
h 12
i 13
We can use shift to compare each row with the previous one and test whether the difference is greater than 1, and then build a list of tuple pairs of the required index labels (the column is named value here):
In [128]:
# index labels of the rows where the value jumps by more than 1; the first row's label has to be added as well
index_list = [df.iloc[0].name] + list(df[(df.value - df.value.shift()) > 1].index)
index_list
Out[128]:
['a', 'd', 'g']
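The step above can be reproduced end to end with the sketch below. The construction of df is an assumption reconstructed from the printed output (a single column named value and an index named index); the name starts is only illustrative.

```python
import pandas as pd

# Assumed reconstruction of the example frame from the question.
df = pd.DataFrame(
    {"value": [1, 2, 3, 6, 7, 8, 11, 12, 13]},
    index=pd.Index(list("abcdefghi"), name="index"),
)

# shift() moves every value down one row, so this is "current row minus previous row".
# The first row has no predecessor, so its step is NaN and fails the > 1 test.
step = df["value"] - df["value"].shift()

# Labels that open a new run: the first row plus every row that follows a jump.
starts = [df.index[0]] + list(step[step > 1].index)
print(starts)
```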
We then have to build a list of tuple pairs for the ranges we are interested in. Note that in pandas label-based slicing includes both the start and the end label, so for the end of each range we need the label of the row just before the next range starts:
In [170]:
import numpy as np

final_range = []
for i in range(len(index_list)):
    # handle last range value
    if i == len(index_list) - 1:
        final_range.append((index_list[i], df.iloc[-1].name))
    else:
        # label of the row just before the start of the next range
        final_range.append((index_list[i], df.iloc[np.searchsorted(df.index, df.loc[index_list[i + 1]].name) - 1].name))
final_range
Out[170]:
[('a', 'c'), ('d', 'f'), ('g', 'i')]
I am using numpy's searchsorted to find the integer position at which a label would be inserted into the (sorted) index, and then subtract 1 from that to get the index label of the previous row.
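As a small illustration of that lookup, the sketch below finds the label preceding 'd'. It assumes a sorted, unique index like the one in the example; the names pos and prev_label are only illustrative.

```python
import numpy as np
import pandas as pd

index = pd.Index(list("abcdefghi"))

# Position at which 'd' would be inserted to keep the index sorted.
# For a label that is already present, this is the position of that label.
pos = int(np.searchsorted(index, "d"))

# One step back gives the label of the previous row.
prev_label = index[pos - 1]
print(pos, prev_label)
```

For a unique index, index.get_loc("d") should give the same position without requiring the index to be sorted.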
In [171]:
# now print
for r in final_range:
    print(df[r[0]:r[1]])
value
index
a 1
b 2
c 3
value
index
d 6
e 7
f 8
value
index
g 11
h 12
i 13
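For comparison, here is a shorter sketch of a different technique from the one used above: label each run with a cumulative count of the jumps and let groupby do the splitting. The frame is an assumed reconstruction of the example data, and group_id and pieces are illustrative names.

```python
import pandas as pd

# Assumed reconstruction of the example frame.
df = pd.DataFrame(
    {"value": [1, 2, 3, 6, 7, 8, 11, 12, 13]},
    index=pd.Index(list("abcdefghi"), name="index"),
)

# True wherever the value jumps by more than 1; the cumulative sum of those
# flags stays constant within a run and increases by one at every jump.
group_id = (df["value"].diff() > 1).cumsum()

# One sub-frame per run of sequential values.
pieces = [piece for _, piece in df.groupby(group_id)]
for piece in pieces:
    print(piece)
```

This avoids building the label pairs by hand and does not depend on the index being sorted.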