Quanteda: How to Remove My Own List of Words

Since there is no ready-made stop word list for Polish in quanteda, I would like to use my own list. I have it in a text file as a space-separated list. If needed, I can also prepare a newline-delimited list.

How can I remove a long custom list of stop words from my corpus? How would I do that after stemming?

I've tried reading the file in various formats and converting the result to character vectors, for example:

stopwordsPL <- as.character(readtext("polish.stopwords.txt",encoding = "UTF-8"))
stopwordsPL <- read.txt("polish.stopwords.txt",encoding = "UTF-8",stringsAsFactors = F))
stopwordsPL <- dictionary(stopwordsPL)

      

I also tried to use such word vectors in the syntax:

myStemMat <-
  dfm(
    mycorpus,
    remove = as.vector(stopwordsPL),
    stem = FALSE,
    remove_punct = TRUE,
    ngrams=c(1,3)
  )

dfm_trim(myStemMat, sparsity = stopwordsPL)

      

or

myStemMat <- dfm_remove(myStemMat,features = as.data.frame(stopwordsPL))

      

Nothing works. My stop words still show up in the corpus and in the analysis. What is the correct way / syntax to apply custom stop words?


1 answer


Assuming your polish.stopwords.txt has one stop word per line, you can easily remove them from your corpus like this:

stopwordsPL <- readLines("polish.stopwords.txt", encoding = "UTF-8")

dfm(mycorpus,
    remove = stopwordsPL,
    stem = FALSE,
    remove_punct = TRUE,
    ngrams=c(1,3))

      

The readtext solution doesn't work because it reads the entire file as one document. To get individual words, you would need to tokenize it and coerce the tokens to character. readLines() is probably easier.
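If the list is space-separated rather than one word per line, as the question describes, base R's scan() is one way to get a plain character vector. The sketch below is only an illustration: it writes a small stand-in file to a temporary path (the file contents and the `path` variable are made up for the example), so replace `path` with the real file name.

```r
# Hypothetical stand-in for polish.stopwords.txt: a few words separated
# by spaces and a newline (made-up sample, ASCII only for portability)
path <- tempfile(fileext = ".txt")
writeLines("i w na z do\nnie to jest", path)

# sep = "" splits on any whitespace (spaces, tabs, newlines);
# quote = "" keeps apostrophes from being treated as quote characters
stopwordsPL <- scan(path, what = character(), sep = "", quote = "",
                    encoding = "UTF-8", quiet = TRUE)

# The result is an ordinary character vector, one element per word
print(stopwordsPL)
```

Because scan() splits on any whitespace here, the same call should also handle a newline-delimited file.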



It is not necessary to create a dictionary from stopwordsPL, as remove should accept a character vector. Also, I am afraid that there is no Polish stemmer implemented yet.

Currently (v0.9.9-65), feature removal in dfm() does not get rid of the stop words that form bigrams. To work around this, try:

# form the tokens, removing punctuation
mytoks <- tokens(mycorpus, remove_punct = TRUE)
# remove the Polish stopwords, leave pads
mytoks <- tokens_remove(mytoks, stopwordsPL, padding = TRUE)
## can't do this next one since no Polish stemmer in 
## SnowballC::getStemLanguages()
# mytoks <- tokens_wordstem(mytoks, language = "polish")
# form the ngrams
mytoks <- tokens_ngrams(mytoks, n = c(1, 3))
# construct the dfm
dfm(mytoks)
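To see why the pads matter, here is a base-R illustration of the idea. It is not quanteda's implementation, and the toy tokens and the `bigrams()` helper are made up for this sketch. As I understand it, padding = TRUE leaves an empty placeholder where each stop word was, so that n-grams are not formed from words that were never adjacent in the original text.

```r
# Toy tokens and stop words (hypothetical data, for illustration only)
toks  <- c("ala", "i", "kot", "w", "domu")
stops <- c("i", "w")

# Dropping stop words outright makes "ala" and "kot" look adjacent
dropped <- toks[!toks %in% stops]

# Padding keeps an empty placeholder at each removed position
padded <- ifelse(toks %in% stops, "", toks)

# Hypothetical helper: bigrams from neighbouring tokens,
# skipping any pair that contains a pad
bigrams <- function(x) {
  if (length(x) < 2) return(character(0))
  left  <- x[-length(x)]
  right <- x[-1]
  keep  <- left != "" & right != ""
  paste(left[keep], right[keep], sep = "_")
}

print(bigrams(dropped))  # includes pairs that span a removed stop word
print(bigrams(padded))   # no pair survives, since every neighbour is a pad
```

In this toy case the unpadded version invents the pair "ala_kot", while the padded version produces no bigrams at all, which is the behaviour the workaround above relies on.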

      
