Creating Dummy Variables from a List

I'm trying to create dummy variables to bind to a data frame, based on whether a particular column of the frame contains certain words. The column looks something like this:

 dumcol = c("good night moon", "good night room", "good morning room", "hello moon")

      

and I will create dummy variables based on which words each element contains, e.g. the first one contains "good", "night", and "moon", but not "room", "morning", or "hello".

The way I've done it so far, in an extremely primitive way, creates a 0-valued matrix of the appropriate size and then uses a for loop like this:

result = matrix(0, ncol=6, nrow=4)   # start from an all-zero matrix
wordlist = unique(unlist(strsplit(dumcol, " ")))
for (i in 1:6)
{ result[grep(wordlist[i], dumcol), i] = 1 }

      

or something similar. I guess there is a faster, more resource-efficient way to do this. Any advice?

+3




4 answers


You may try:

library(tm)
myCorpus <- Corpus(VectorSource(dumcol))
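# wordLengths = c(1, Inf) keeps words of any length (recent tm versions no longer use minWordLength)
myTDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))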
as.matrix(myTDM)

      

This gives:



#         Docs
#Terms     1 2 3 4
#  good    1 1 1 0
#  hello   0 0 0 1
#  moon    1 0 0 1
#  morning 0 0 1 0
#  night   1 1 0 0
#  room    0 1 1 0

      

If you want the dummy variables in columns, you can use DocumentTermMatrix instead:

#    Terms
#Docs good hello moon morning night room
#   1    1     0    1       0     1    0
#   2    1     0    0       0     1    1
#   3    1     0    0       1     0    1
#   4    0     1    1       0     0    0
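
For reference, a minimal sketch of a call that should produce a matrix like the one above, assuming a reasonably recent version of tm. The names myDTM and dummies are only illustrative, and wordLengths = c(1, Inf) is set so that short words would not be dropped:

```r
library(tm)

dumcol <- c("good night moon", "good night room", "good morning room", "hello moon")

# one document per element of dumcol
myCorpus <- Corpus(VectorSource(dumcol))

# documents in rows, terms in columns
myDTM <- DocumentTermMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

# dense matrix of term counts, ready to cbind() onto a data frame
dummies <- as.matrix(myDTM)
```

Since as.matrix() returns an ordinary matrix, it can be bound to the original data frame with cbind().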

      

+3




Try

 library(qdapTools)
 mtabulate(strsplit(dumcol, ' '))
 #    good hello moon morning night room
 #1    1     0    1       0     1    0
 #2    1     0    0       0     1    1
 #3    1     0    0       1     0    1
 #4    0     1    1       0     0    0

      



or

 library(splitstackshape)
 cSplit_e(as.data.frame(dumcol), 'dumcol', sep=' ', 
                      type='character', fill=0, drop=TRUE)
 #  dumcol_good dumcol_hello dumcol_moon dumcol_morning dumcol_night dumcol_room
 #1           1            0           1              0            1           0
 #2           1            0           0              0            1           1
 #3           1            0           0              1            0           1
 #4           0            1           1              0            0           0

      

+3




I would do

sdum <- strsplit(dumcol," ")
us   <- unique(unlist(sdum))
res  <- sapply(sdum,function(x)table(factor(x,levels=us)))
#         [,1] [,2] [,3] [,4]
# good       1    1    1    0
# night      1    1    0    0
# moon       1    0    0    1
# room       0    1    1    0
# morning    0    0    1    0
# hello      0    0    0    1

      

The result can be transposed using t(res) for dummy variables in columns (R convention).
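
Note that table() returns counts, so a word repeated within one string would give a value above 1. A minimal sketch, assuming strict 0/1 dummies are wanted and that the data frame is called dfr (a hypothetical name), which caps the counts and binds the transposed result:

```r
dumcol <- c("good night moon", "good night room", "good morning room", "hello moon")
dfr <- data.frame(dumcol, stringsAsFactors = FALSE)  # hypothetical data frame

sdum <- strsplit(dumcol, " ")
us   <- unique(unlist(sdum))
res  <- sapply(sdum, function(x) table(factor(x, levels = us)))

# transpose so that words are columns, then turn counts into 0/1 indicators
dummies <- (t(res) > 0) * 1L

# bind the indicator columns to the original data frame
dfr <- cbind(dfr, dummies)
```

The new columns keep the order of first appearance of each word (good, night, moon, room, morning, hello), because the factor levels come from us.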

+2




Put your dummy variables right back into your data frame (I'll call it dfr), and use grepl to get the desired TRUE / FALSE values.

for (word in wordlist) {
  dfr[,paste0(word, ".exists")] <- grepl(word, dfr$dumcol)
}
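
One caveat: grepl matches substrings, so a pattern such as "room" would also match "bedroom". A minimal sketch that wraps each word in \\b word boundaries, assuming the words contain no regex metacharacters and that wordlist is built as in the question:

```r
dfr <- data.frame(
  dumcol = c("good night moon", "good night room", "good morning room", "hello moon"),
  stringsAsFactors = FALSE
)

wordlist <- unique(unlist(strsplit(dfr$dumcol, " ")))

for (word in wordlist) {
  # \\b anchors the match at word boundaries, so only whole words count
  pattern <- paste0("\\b", word, "\\b")
  dfr[[paste0(word, ".exists")]] <- grepl(pattern, dfr$dumcol)
}
```

If 0/1 values are preferred over TRUE / FALSE, wrapping the grepl call in as.integer() would do it.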

      

-1








