Invalid UTF-8 when removing "english" stopwords from a text corpus with the R tm package

While doing text mining, I got an error when removing stop words from a text corpus containing 500 documents. I am using R 3.1.3 on Ubuntu 14.04 LTS with the tm package version 0.6-1. Here is the code; please help.

library(tm)  # provides Corpus, DirSource, tm_map, stopwords

unsup.corpus = Corpus(DirSource(directory.location, encoding = "UTF-8"),
                      readerControl = list(language = "en_US"))


document.collection = unsup.corpus    
document.collection = tm_map(document.collection, stripWhitespace, mc.cores = 1)    
document.collection = tm_map(document.collection, content_transformer(tolower), mc.cores = 1)    
document.collection = tm_map(document.collection, removeNumbers, mc.cores = 1)    
document.collection = tm_map(document.collection, removePunctuation, mc.cores = 1)

document.collection = tm_map(document.collection, removeWords, stopwords("english"), mc.cores = 1)

      

The error:

Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  :
  input string 21 is invalid UTF-8



1 answer


One thing you can do is

document.collection = 
        tm_map(document.collection[-21], removeWords, stopwords("english"), mc.cores = 1) 

      

This gets rid of the document (the "input string" in the error message) that contains the problematic character.
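
If you would rather keep all 500 documents, a common alternative is to repair the encoding instead of dropping the document. The following is a minimal sketch, not part of the answer above; it assumes the stray bytes can simply be discarded (iconv() with sub = "" drops any bytes it cannot interpret as UTF-8):

# Hedged sketch: re-encode every document before removeWords, so the
# stop-word regex in tm never sees invalid UTF-8 input
document.collection = tm_map(document.collection,
    content_transformer(function(x) iconv(x, "UTF-8", "UTF-8", sub = "")),
    mc.cores = 1)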



If you want to investigate the problem yourself, you can simply look at the offending document

document.collection[[21]]

and inspect its contents.
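
For example, here is a minimal sketch (assuming a tm 0.6.x corpus, where content() returns a document's text as a character vector) that flags the exact lines carrying the bad bytes:

# iconv() returns NA for strings it cannot interpret as UTF-8,
# so the NA positions mark the offending lines in document 21
txt = content(document.collection[[21]])
txt[is.na(iconv(txt, "UTF-8", "UTF-8"))]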
