Check words from list and remove those words in pandas dataframe column
I have a list like this:
remove_words = ['abc', 'deff', 'pls']
Below is the dataframe I have with the column name 'string'
data['string']
0 abc stack overflow
1 abc123
2 deff comedy
3 definitely
4 pls lkjh
5 pls1234
I want to check the dataframe column for words from the remove_words list and remove them. A word should only be removed when it appears on its own, not when it is part of a longer token.
For example, if the column contains 'abc', it should be replaced with '', but 'abc123' should be left as it is. The expected output is:
data['string']
0 stack overflow
1 abc123
2 comedy
3 definitely
4 lkjh
5 pls1234
In my actual data, I have 2000 words in the remove_words list and 5 billion entries in the pandas dataframe, so I am looking for the most efficient way to do this.
I have tried several things in python, without much success. Can anyone help me with this? Any ideas would be helpful.
Thanks.
Try the following:
In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))
In [99]: pat
Out[99]: '\\b(?:abc|deff|pls)\\b'
In [100]: df['new'] = df['string'].str.replace(pat, '', regex=True)
In [101]: df
Out[101]:
               string             new
0  abc stack overflow  stack overflow
1              abc123          abc123
2         deff comedy          comedy
3          definitely      definitely
4            pls lkjh            lkjh
5             pls1234         pls1234
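For completeness, here is a self-contained sketch of the same idea that can be pasted into a script. Two things in it go beyond the snippet above and are my own assumptions: the words are passed through re.escape in case any of them contain regex metacharacters, and the whitespace left behind by a removed word is collapsed and stripped. The sample frame is rebuilt from the question.

```python
import re

import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({'string': ['abc stack overflow', 'abc123', 'deff comedy',
                              'definitely', 'pls lkjh', 'pls1234']})

# Escape each word so any regex metacharacters in it are matched literally,
# then wrap the alternation in \b...\b so only whole words match.
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, remove_words)))

df['new'] = (
    df['string']
    .str.replace(pat, '', regex=True)       # remove whole-word matches
    .str.replace(r'\s+', ' ', regex=True)   # collapse whitespace left behind
    .str.strip()                            # drop leading/trailing spaces
)

print(df)
```

Note that \b only marks a boundary between a word character and a non-word character, so words in remove_words that begin or end with punctuation may need a different boundary test.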
Totally borrowing @MaxU's pattern!
We can use pd.DataFrame.replace by setting the regex parameter to True and passing in a dictionary of dictionaries that specifies, for each column, the pattern and what to replace it with.
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
df.assign(new=df.replace(dict(string={pat: ''}), regex=True)['string'])
               string             new
0  abc stack overflow  stack overflow
1              abc123          abc123
2         deff comedy          comedy
3          definitely      definitely
4            pls lkjh            lkjh
5             pls1234         pls1234
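Regarding the size mentioned in the question: billions of rows are unlikely to fit in memory at once, so one option (not part of either answer, just a sketch) is to compile the pattern once and apply it chunk by chunk. The io.StringIO source below is a hypothetical stand-in for a large CSV file, and the chunk size is arbitrary.

```python
import io
import re

import pandas as pd

remove_words = ['abc', 'deff', 'pls']

# Compile once; the same compiled pattern is reused for every chunk.
pat = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, remove_words))))

# Hypothetical stand-in for a large CSV file on disk.
csv_source = io.StringIO(
    "string\n"
    "abc stack overflow\n"
    "abc123\n"
    "deff comedy\n"
    "definitely\n"
    "pls lkjh\n"
    "pls1234\n"
)

cleaned_chunks = []
# chunksize makes read_csv yield DataFrames of at most that many rows.
for chunk in pd.read_csv(csv_source, chunksize=2):
    chunk['new'] = chunk['string'].str.replace(pat, '', regex=True).str.strip()
    cleaned_chunks.append(chunk)

# In a real pipeline each chunk would be written out instead of collected.
result = pd.concat(cleaned_chunks, ignore_index=True)
print(result)
```

Collecting the chunks with pd.concat is only done here so the result can be inspected; with data of the size described, appending each cleaned chunk to an output file would keep memory use bounded.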