Check words from list and remove those words in pandas dataframe column

I have a list like this:

remove_words = ['abc', 'deff', 'pls']

      

Below is the dataframe I have with the column name 'string'

     data['string']

0    abc stack overflow
1    abc123
2    deff comedy
3    definitely
4    pls lkjh
5    pls1234

      

I want to check for the words from the remove_words list in the pandas dataframe column and remove them. A word should only be removed when it appears as a standalone word, not when it is part of another word.

For example, if the column contains 'abc', replace it with '', but if it contains abc123, leave it as it is. The output here should be:

     data['string']

0    stack overflow
1    abc123
2    comedy
3    definitely
4    lkjh
5    pls1234

      

In my actual data, I have 2000 words in the remove_words list and 5 billion rows in the pandas dataframe, so I am looking for the most efficient way to do this.

I have tried several things in Python, without much success. Can anyone help me with this? Any ideas would be helpful.

Thanks



3 answers


Try the following:



In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))

In [99]: pat
Out[99]: '\\b(?:abc|def|pls)\\b'

In [100]: df['new'] = df['string'].str.replace(pat, '', regex=True)

In [101]: df
Out[101]:
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
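
The word-boundary behaviour of this pattern can be checked without pandas. The following is a minimal sketch using only the standard `re` module and the sample data from the question. Two things in it are additions rather than part of the answer above: `re.escape` (on the assumption that some words in the list might contain regex metacharacters) and the hypothetical helper `strip_words`, which also collapses the whitespace left behind by a removed word.

```python
import re

remove_words = ['abc', 'deff', 'pls']

# Escape each word in case it contains regex metacharacters (an addition,
# not part of the answer above), then wrap the alternation in \b ... \b
# so that only whole words match.
pat = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, remove_words))))

rows = ['abc stack overflow', 'abc123', 'deff comedy',
        'definitely', 'pls lkjh', 'pls1234']

def strip_words(text):
    # Hypothetical helper: remove whole-word matches, then collapse
    # the leftover whitespace into single spaces.
    return ' '.join(pat.sub('', text).split())

cleaned = [strip_words(r) for r in rows]
print(cleaned)
```

'abc123' and 'pls1234' are left alone because there is no word boundary between a letter and a digit. As far as I know, the same compiled pattern can be passed to pandas as `df['string'].str.replace(pat, '', regex=True)`.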

      



Borrowing @MaxU's pattern entirely!

We can use pd.DataFrame.replace by setting the regex parameter to True and passing in a dictionary of dictionaries that defines the pattern and what to replace it with for each column.



pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])

df.assign(new=df.replace(dict(string={pat: ''}), regex=True)['string'])

               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234

      



How do I do this case-insensitively? E.g. remove_words = ['ABC', 'Def', 'Pls'] should remove both ABC and abc from the string.
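
One possible approach, sketched here with the standard `re` module, is to compile the same word-boundary pattern with the `re.IGNORECASE` flag. The sample rows and the use of `re.escape` are assumptions made for illustration, not taken from the answers above.

```python
import re

remove_words = ['ABC', 'Def', 'Pls']

# Same whole-word pattern as in the answers, compiled case-insensitively.
pat = re.compile(
    r'\b(?:{})\b'.format('|'.join(map(re.escape, remove_words))),
    flags=re.IGNORECASE,
)

# Illustrative rows (not from the original question).
rows = ['ABC stack overflow', 'abc stack overflow', 'abc123', 'def comedy']

# Remove matches, then collapse the whitespace left behind.
cleaned = [' '.join(pat.sub('', r).split()) for r in rows]
print(cleaned)
```

In pandas, the compiled pattern should be usable as `df['string'].str.replace(pat, '', regex=True)`. With a plain string pattern, `str.replace` also takes `case=False` or `flags=re.IGNORECASE`, but those arguments cannot be combined with a precompiled pattern.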
