Invalid UTF-8 when removing "english" stopwords from a text corpus with the R tm package

While doing text mining, I got an error when removing stop words from a text corpus containing 500 documents. I am using R 3.1.3 on Ubuntu 14.04 LTS with the tm package version 0.6-1. Here is the code; please help.

library(tm)  # provides Corpus, DirSource, tm_map, stopwords

unsup.corpus = Corpus(DirSource(directory.location, encoding = "UTF-8"),
                      readerControl = list(language = "en_US"))


document.collection = unsup.corpus    
document.collection = tm_map(document.collection, stripWhitespace, mc.cores = 1)    
document.collection = tm_map(document.collection, content_transformer(tolower), mc.cores = 1)    
document.collection = tm_map(document.collection, removeNumbers, mc.cores = 1)    
document.collection = tm_map(document.collection, removePunctuation, mc.cores = 1)

document.collection = tm_map(document.collection, removeWords, stopwords("english"), mc.cores = 1)

      

The error:

Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  :
  input string 21 is invalid UTF-8



1 answer


One thing you can do is

document.collection = 
        tm_map(document.collection[-21], removeWords, stopwords("english"), mc.cores = 1) 

      

This gets rid of the document (the "input string" in the error message) that contains the problematic character.
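
If you would rather keep all 500 documents, a common alternative is to repair the encoding instead of dropping the document. The following is a minimal sketch, not part of the answer above; it assumes the stray bytes can simply be discarded (iconv() with sub = "" drops any bytes it cannot interpret as UTF-8):

# Hedged sketch: re-encode every document before removeWords, so the
# stop-word regex in tm never sees invalid UTF-8 input
document.collection = tm_map(document.collection,
    content_transformer(function(x) iconv(x, "UTF-8", "UTF-8", sub = "")),
    mc.cores = 1)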



If you want to investigate the problem yourself, you can simply look at the offending document

document.collection[[21]]

and inspect its contents.
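
For example, here is a minimal sketch (assuming a tm 0.6.x corpus, where content() returns a document's text as a character vector) that flags the exact lines carrying the bad bytes:

# iconv() returns NA for strings it cannot interpret as UTF-8,
# so the NA positions mark the offending lines in document 21
txt = content(document.collection[[21]])
txt[is.na(iconv(txt, "UTF-8", "UTF-8"))]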
