Removing non-ASCII from a corpus

Question


I am using NLTK for my project. However, if the text contains a non-ASCII token such as "•", NLTK can't handle it. I am using nltk.word_tokenize as the tokenizer. How can I remove such tokens from the whole corpus, or make the tokenizer aware of them?

+3

python unicode ascii nltk information-retrieval

IllSc, Nov 4 '14 at 7:14


1 answer

Irshad Bhat · Accepted Answer · 2014-11-04T07:32:28+0000

Use the code below to remove non-ASCII characters from your corpus:

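# Python 3 version; the input file is assumed to be UTF-8 encoded.
# Note: the output file must be opened in write ('w') mode.
with open("nonascii.txt", "r", encoding="utf-8") as ip, \
        open("ascii.txt", "w", encoding="ascii") as op:
    for line in ip:
        # Drop every character that has no ASCII representation.
        line = line.strip().encode("ascii", "ignore").decode("ascii")
        if line == "":
            continue  # skip lines that are empty after filtering
        op.write(line + "\n")  # strip() removed the newline, so add it back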
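
If the text is already in memory rather than in a file, the same encode/decode idea can be applied to a string, or whole tokens can be dropped after tokenizing. The following is only a sketch for Python 3; the helper names strip_non_ascii and drop_non_ascii_tokens are made up for illustration and are not part of NLTK.

```python
def strip_non_ascii(text):
    """Return text with every non-ASCII character removed."""
    # encode(..., "ignore") silently drops characters that ASCII cannot represent
    return text.encode("ascii", "ignore").decode("ascii")


def drop_non_ascii_tokens(tokens):
    """Keep only tokens that consist entirely of ASCII characters."""
    return [tok for tok in tokens if all(ord(ch) < 128 for ch in tok)]


if __name__ == "__main__":
    # Character-level cleanup, applied before tokenizing
    print(repr(strip_non_ascii("caf\u00e9 \u2022 item")))
    # Token-level cleanup, applied to a tokenizer's output (a plain list here)
    print(drop_non_ascii_tokens(["a", "\u2022", "caf\u00e9", "b"]))
```

The two helpers behave differently: the first removes only the offending characters, so a word with one accented letter survives in shortened form, while the second discards any token that contains a non-ASCII character. Which one fits depends on whether partially stripped words are acceptable in the corpus.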
