Invalid UTF-8 when removing English stopwords from a text corpus in R (tm)
While text mining, I get an error when removing stopwords from a text corpus of 500 documents. I am using R 3.1.3 on Ubuntu 14.04 LTS with the tm package 0.6-1. Here is the code; please help.
library(tm)

unsup.corpus = Corpus(DirSource(directory.location, encoding = "UTF-8"),
                      readerControl = list(language = "en_US"))
document.collection = unsup.corpus
document.collection = tm_map(document.collection, stripWhitespace, mc.cores = 1)
document.collection = tm_map(document.collection, content_transformer(tolower), mc.cores = 1)
document.collection = tm_map(document.collection, removeNumbers, mc.cores = 1)
document.collection = tm_map(document.collection, removePunctuation, mc.cores = 1)
document.collection = tm_map(document.collection, removeWords, stopwords("english"), mc.cores = 1)
###### Error
# Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), ...) :
#   input string 21 is not valid UTF-8
1 answer
One thing you can do is

document.collection =
  tm_map(document.collection[-21], removeWords, stopwords("english"), mc.cores = 1)

This drops the 21st document, the one containing the problematic character, before removing the stopwords.
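Alternatively, instead of dropping the document, you can try to repair its encoding in place. A minimal sketch (assuming `document.collection` from the question; `iconv()` with `sub = "byte"` escapes invalid bytes instead of failing):

```r
library(tm)

# Re-encode every document to valid UTF-8; bytes that cannot be
# converted are escaped as "<xx>" rather than aborting.
to.utf8 <- function(x) iconv(x, from = "", to = "UTF-8", sub = "byte")

document.collection <- tm_map(document.collection,
                              content_transformer(to.utf8),
                              mc.cores = 1)

# Stopword removal now runs over the repaired corpus.
document.collection <- tm_map(document.collection, removeWords,
                              stopwords("english"), mc.cores = 1)
```

This keeps all 500 documents, at the cost of leaving escaped byte sequences where the invalid characters were.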
If you want to track the problem down yourself instead, inspect the offending document directly, for example with

document.collection[[21]]

and look at what its content actually contains.
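For that investigation, a short sketch that lists every affected document (assuming the corpus from the question; note that `validUTF8()` requires R >= 3.3, newer than the R 3.1.3 mentioned above):

```r
library(tm)

# Indices of documents whose content contains a string that is not
# valid UTF-8 -- these are the ones that trip up removeWords().
bad.docs <- which(sapply(document.collection,
                         function(doc) !all(validUTF8(content(doc)))))
bad.docs
```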