How can I find the start and end of a regex match using the Python pandas library?

I am retrieving DNA or protein sequences from databases. The sequences are aligned, so although I always know one input sequence, it is often truncated and contains gaps inserted as "-" characters. First, I want to find a region in the query string; a regex search makes sense for that. Then I want to extract the equivalent regions from the other aligned rows (called "markup" and "hit" here). Since the sequences are aligned, the region I want will have the same start and end on every row. Is there an easy way to get the start and end of a regex match within a pandas DataFrame?

import pandas as pd
import re
q1,q2,q3 = 'MPIMGSSVYITVELAIAVLAILG','MPIMGSSVYITVELAIAVLAILG','MPI-MGSSVYITVELAIAVLAIL'
m1,m2,m3 = '|| ||  ||||||||||||||||','||   | ||| :|| || |:: |','||:    ::|: :||||| |:: '
h1,h2,h3 = 'MPTMGFWVYITVELAIAVLAILG','MP-NSSLVYIGLELVIACLSVAG','MPLETQDALYVALELAIAALSVA' 
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'query':[q1,q2,q3],'markup':[m1,m2,m3],'hit':[h1,h2,h3]})
#create a regex search string that finds the desired region even when gap ('-') characters are interleaved
desired_region_from_query = 'PIMGSS'
regex_desired_region_from_query = '(P-*I-*M-*G-*S-*S-*)'

Pandas has a handy extract method for pulling the matching substring out of the query column:

df['query'].str.extract(regex_desired_region_from_query)

However, I need the start and end of the match in order to extract the equivalent regions from the markup and hit columns. For a single row this can be done as follows:

match = re.search(regex_desired_region_from_query, df.loc[2,'query'])
sliced_hit = df.loc[2,'hit'][match.start():match.end()]
sliced_hit
Out[3]:'PLETQDA'

My current solution is as follows. (Edited to include nhahtdh's suggestion and thereby avoid searching twice.)

#define function to obtain regex output (start, stop, etc) as a tuple
def get_regex_output(x):
    m = re.search(regex_desired_region_from_query, x)
    return (m.start(), m.end())
#apply function
df['regex_output_tuple'] = df['query'].apply(get_regex_output)
#convert the tuple into two separate columns
columns_from_regex_output = ['start','end']      
for n, col in enumerate(columns_from_regex_output):
    df[col] = df['regex_output_tuple'].apply(lambda x: x[n])
#delete the unnecessary column
df = df.drop('regex_output_tuple', axis=1)
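
For reference, the spans can also be collected in one pass and unpacked without the temporary tuple column. This is just a compact sketch of the same idea, assuming every row matches the pattern (re.search returns None otherwise):

#search once per row and take the (start, end) span of the match directly
spans = df['query'].apply(lambda s: re.search(regex_desired_region_from_query, s).span())
#unpack the spans into two columns in a single assignment
df['start'], df['end'] = zip(*spans)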

Now I want to use the resulting start and end integers to slice the strings. Something like this would be nice:
df.sliced = df.string[df.start:df.end]


But I don't think anything like that currently exists, so I used lambda functions again:

#create slice functions
fn_slice_hit = lambda x : x['hit'][x['start']:x['end']]
fn_slice_markup = lambda x : x['markup'][x['start']:x['end']]

#apply the slice functions
df['sliced_markup'] = df.apply(fn_slice_markup, axis = 1)
df['sliced_hit'] = df.apply(fn_slice_hit, axis = 1)
print(df)

                       hit                   markup                    query   start  end sliced_markup sliced_hit
0  MPTMGFWVYITVELAIAVLAILG  || ||  ||||||||||||||||  MPIMGSSVYITVELAIAVLAILG       1    7        | ||       PTMGFW
1  MP-NSSLVYIGLELVIACLSVAG  ||   | ||| :|| || |:: |  MPIMGSSVYITVELAIAVLAILG       1    7        |   |      P-NSSL
2  MPLETQDALYVALELAIAALSVA  ||:    ::|: :||||| |::   MPI-MGSSVYITVELAIAVLAIL       1    8       |:    :    PLETQDA
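
For comparison, the same slicing can also be written as plain list comprehensions over the three columns; it is still a row-by-row Python loop, so it is no more pandas-native than the lambdas above:

#slice hit and markup row by row using the start/end columns
df['sliced_hit'] = [h[s:e] for h, s, e in zip(df['hit'], df['start'], df['end'])]
df['sliced_markup'] = [m[s:e] for m, s, e in zip(df['markup'], df['start'], df['end'])]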

Do the pandas .match, .extract, or .findall methods have an equivalent of .start() or .end()?
Is there a more elegant way to do the slicing?
Any help would be appreciated!

1 answer


I don't think this exists in pandas, but it would be a great addition. Go to https://github.com/pydata/pandas/issues and open a new issue explaining the improvement you would like to see.

As for the .start() and .end() methods, they probably make more sense as keyword arguments to the extract() method. str.extract(pat, start_index=True) would then return a Series or DataFrame of starting indices rather than the capture-group values, and the same goes for end_index=True. The two should probably be mutually exclusive.
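
To make that concrete, here is a sketch of how the proposed keywords might look next to what extract() does today; start_index and end_index are hypothetical and do not exist in pandas:

#what str.extract returns today: the captured text itself
df['query'].str.extract(regex_desired_region_from_query)
#hypothetical extension (not implemented): return match indices instead of the captured text
#df['query'].str.extract(regex_desired_region_from_query, start_index=True)
#df['query'].str.extract(regex_desired_region_from_query, end_index=True)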

I also like your suggested syntax:

df.sliced = df.string[df.start:df.end]

Pandas already has a str.slice method:

df.sliced = df.string.str.slice(1, -1)

      

But the bounds must be plain ints. Open a separate GitHub issue asking for str.slice to accept Series of indices and apply them element-wise.
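
As a small runnable illustration of the current behaviour, using the df from the question:

#str.slice today: the bounds are plain ints applied identically to every row
df['hit'].str.slice(1, 7)
#per-row bounds therefore still need an element-wise fallback such as apply
df.apply(lambda row: row['hit'][row['start']:row['end']], axis=1)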

Sorry I don't have a better solution than your lambda hack, but these are exactly the use cases that help make pandas better.
