Removing unicode from text in pandas

For a single string, the code below removes non-ASCII characters and newlines / carriage returns:

t = "We've\xe5\xcabeen invited to attend TEDxTeen, an independently organized TED event focused on encouraging youth to find \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions to the complex issues we face every day.,"

import sys

# Python 2 only: str has no decode method on Python 3
t2 = t.decode('unicode_escape').encode('ascii', 'ignore').strip()
sys.stdout.write(t2.strip('\n\r'))
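
The snippet above relies on Python 2, where a str can be decoded. On Python 3 a str is already Unicode text, so a rough equivalent is to encode to ASCII while ignoring anything that does not fit, then decode back. This is a sketch rather than the asker's own code, and the helper name strip_non_ascii and the shortened sample string are made up for illustration.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range, then trim whitespace."""
    # errors='ignore' silently discards characters that ASCII cannot represent
    ascii_bytes = text.encode('ascii', 'ignore')
    # Decode back to str so the result is text rather than bytes
    return ascii_bytes.decode('ascii').strip()


sample = "We've\xe5\xcabeen invited \x89\xdb\xcfsimply irresistible\x89\xdb\x9d\r\n"
cleaned = strip_non_ascii(sample)
print(cleaned)
```

The final strip() also removes the trailing carriage return and newline, so no separate strip('\n\r') call is needed here.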

      

But when I try to write a function that applies this to every cell in a pandas column, it either fails with an AttributeError, or I get a warning that a value is trying to be set on a copy of a slice from a DataFrame:

def clean_text(row):
    row= row["text"].decode('unicode_escape').encode('ascii', 'ignore')#.strip()
    import sys
    sys.stdout.write(row.strip('\n\r'))
    return row

      

This is how I apply it to my DataFrame:

df["text"] = df.apply(clean_text, axis=1)

      

How can I apply this code to each element of the Series?



3 answers


The problem is that you access and modify row['text'] and return the result directly from the applied function. When you call apply on a DataFrame, the function is applied to each Series, so changing it as follows should help:

import pandas as pd

df = pd.DataFrame([t for _ in range(5)], columns=['text'])

df 
                                                text
0  We've      been invited to attend TEDxTeen, an ind...
1  We've      been invited to attend TEDxTeen, an ind...
2  We've      been invited to attend TEDxTeen, an ind...
3  We've      been invited to attend TEDxTeen, an ind...
4  We've      been invited to attend TEDxTeen, an ind...

      




def clean_text(row):
    # Return a list of the decoded cells in the Series instead
    return [r.decode('unicode_escape').encode('ascii', 'ignore') for r in row]

df['text'] = df.apply(clean_text)

df
                                                text
0  We'vebeen invited to attend TEDxTeen, an indep...
1  We'vebeen invited to attend TEDxTeen, an indep...
2  We'vebeen invited to attend TEDxTeen, an indep...
3  We'vebeen invited to attend TEDxTeen, an indep...
4  We'vebeen invited to attend TEDxTeen, an indep...
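
To make the point about apply concrete, here is a small sketch (Python 3, with a made-up two-row frame) of what the function receives in each mode. With the default axis=0 it gets a whole column as a Series; with axis=1 it gets one row as a Series, so row["text"] is a single string.

```python
import pandas as pd

df = pd.DataFrame({"text": ["a\xe5b", "c\xcad"]})

# Default axis=0: the function receives each column as a Series
by_column = df.apply(lambda col: col.str.len())

# axis=1: the function receives each row as a Series,
# so row["text"] is one string and len() counts its characters
by_row = df.apply(lambda row: len(row["text"]), axis=1)

print(type(by_column).__name__)
print(type(by_row).__name__)
```

The column-wise call returns a DataFrame, while the row-wise call returns a Series with one value per row, which is the shape you want when assigning back to a single column.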

      

Alternatively, you can use a lambda as below and apply it directly to the column text:

df['text'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
                                          encode('ascii', 'ignore').\
                                          strip())
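
The lambda above is Python 2 code. On Python 3, one possible equivalent that avoids apply altogether is to chain the Series.str methods encode, decode and strip. This is a sketch under that assumption, using a made-up two-row frame.

```python
import pandas as pd

df = pd.DataFrame({"text": ["We've\xe5\xcabeen invited\r\n", "plain text"]})

df["text"] = (
    df["text"]
    .str.encode("ascii", errors="ignore")  # str -> bytes, dropping non-ASCII characters
    .str.decode("ascii")                   # bytes -> str
    .str.strip()                           # trim whitespace, including \r and \n
)
print(df["text"].tolist())
```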

      



In fact, I cannot reproduce your error: the following code runs for me without error or warning.

df = pd.DataFrame([t,t,t],columns = ['text'])
df["text"] = df.apply(clean_text, axis=1)

      



If it helps, I think a more "pandas" approach to this problem might be to use a regex with one of the Series.str methods, for example:

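# regex=True is required on recent pandas versions, where str.replace treats the pattern literally by default
df["text"] = df["text"].str.replace(r'[^\x00-\x7F]', '', regex=True)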
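
The question does not show the data that triggered the AttributeError, so the following is only a guess at one common cause: a missing value in the column, which is not a string and therefore has no decode or encode method. The sketch below (Python 3, made-up data) shows that the .str accessor passes missing values through instead of raising.

```python
import pandas as pd

df = pd.DataFrame({"text": ["caf\xe9 au lait", None, "plain"]})

# The .str accessor leaves missing values untouched rather than raising AttributeError
df["text"] = df["text"].str.replace(r"[^\x00-\x7F]", "", regex=True)

print(df["text"].tolist())
```

A per-cell lambda such as lambda x: x.encode(...) would fail on the missing row, so the vectorised form is the safer choice when the column may contain gaps.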

      



Something like this, where column_to_convert is the column you want to convert:

series = df['column_to_convert']
df["text"] =  [s.encode('ascii', 'ignore').strip()
               for s in series.str.decode('unicode_escape')]

      
