Invalid UTF-8 when removing "English" stopwords from a text corpus in R (tm)

While text mining, I get an error when removing stop words from a text corpus of 500 documents. I am using R 3.1.3 on Ubuntu 14.04 LTS with version 0.6-1 of the tm package. Here is the code; please help.

unsup.corpus = Corpus(DirSource(directory.location, encoding = "UTF-8"),
                      readerControl = list(language = "en_US"))

document.collection = unsup.corpus    
document.collection = tm_map(document.collection, stripWhitespace, mc.cores = 1)    
document.collection = tm_map(document.collection, content_transformer(tolower), mc.cores = 1)    
document.collection = tm_map(document.collection, removeNumbers, mc.cores = 1)    
document.collection = tm_map(document.collection, removePunctuation, mc.cores = 1)

document.collection = tm_map(document.collection, removeWords, stopwords("english"), mc.cores = 1)


###### Error ######

Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), : input string 21 is invalid UTF-8



1 answer

One thing you can do is

document.collection = 
        tm_map(document.collection[-21], removeWords, stopwords("english"), mc.cores = 1) 


This gets rid of the "string" with the problematic character.
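
Dropping the entry also throws away its text. An alternative, sketched here on the assumption that the stray bytes can simply be discarded, is to clean the text with base R's iconv() before removing stop words. The helper name drop_invalid_utf8 and the sample string are made up for illustration and are not part of tm.

```r
# Hypothetical helper: remove bytes that are not valid UTF-8.
drop_invalid_utf8 <- function(x) {
  # With sub = "", iconv() replaces every non-convertible byte with nothing
  iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
}

# Sample string containing a stray Latin-1 byte (0xE9) after "caf"
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9, 0x20, 0x6F, 0x6B)))
clean <- drop_invalid_utf8(bad)

# The cleaned string can be passed to a Perl-style regex like the one
# removeWords builds internally
result <- gsub("\\bok\\b", "", clean, perl = TRUE)
```

With tm, a helper like this could probably be applied to every document with tm_map(document.collection, content_transformer(drop_invalid_utf8), mc.cores = 1) before the removeWords step. How iconv() treats invalid input can vary between platforms, so the result is worth checking on a few documents first.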

If you want to investigate this problem yourself, you can simply call



and do some research on the specifics.
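
One possible way to do that research, as a minimal sketch using only base R: iconv() returns NA for any element it cannot convert, which makes it possible to locate strings that are not valid UTF-8 and then look at their raw bytes. The name find_invalid_utf8 and the sample vector are assumptions for illustration.

```r
# Hypothetical helper: indices of elements that are not valid UTF-8.
find_invalid_utf8 <- function(x) {
  # Without `sub`, iconv() returns NA for elements it cannot convert
  which(is.na(iconv(x, from = "UTF-8", to = "UTF-8")))
}

# Sample data: the second element contains a stray Latin-1 byte (0xE9)
texts <- c("plain ascii",
           rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9, 0x20, 0x6F, 0x6B))),
           "more text")

bad_idx <- find_invalid_utf8(texts)

# Dump the raw bytes of each offending element to spot the culprit
bad_bytes <- lapply(texts[bad_idx], charToRaw)
```

The same check could be run on the text of each document in the corpus, for instance on the character vector returned by content(), assuming that accessor is available in the installed tm version.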


