How to calculate document frequency in R?

I have a data frame called pertanian:

DOCS <- c(1:5)
TEXT <- c("tanaman jagung seumur jagung " , 
          "tanaman jagung kacang ketimun rusak dimakan kelinci" , 
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan" , 
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan" , 
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak ")
pertanian <- data.frame(DOCS , TEXT)

      

From this data frame, I created the following term-document matrix:

term     DOCS 1  DOCS 2  DOCS 3  DOCS 4  DOCS 5
dimakan    0       1       1       1       0
diserbu    0       0       1       1       1
jagung     2       1       1       1       1
kacang     0       1       1       1       1
kelinci    0       1       1       1       1
ketimun    0       1       1       1       1
ladang     0       0       1       1       1
rusak      0       1       1       1       1
seumur     1       0       0       0       0
tanaman    1       1       1       1       1

      

From the term-document matrix above, I want to compute the document frequency of each term, like this:

Term        DF
dimakan     3 
diserbu     3
jagung      5
kacang      4
kelinci     4
ketimun     4
ladang      3
rusak       4
seumur      1
tanaman     5

      

I tried this code:

library(tm)

myCorpus <- Corpus(VectorSource(pertanian$TEXT))
myCorpus2 <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus2)
temp <- inspect(tdm)
colnames(temp) <- paste("DOCS", pertanian$DOCS)
Doc.Freq <- data.frame(apply(temp, 1, sum))
# rename column name
Doc.Freq <- cbind(Term = rownames(Doc.Freq), Doc.Freq)
row.names(Doc.Freq) <- NULL
names(Doc.Freq)[names(Doc.Freq) == "apply.temp..1..sum."] <- "DF"

      

but the result is the term frequency, not the document frequency: the term "jagung" is counted as 6, whereas its document frequency should be 5.



2 answers


Something like this?

Note: here I am assuming your desired output has an error and kacang is present in 4 of 5 docs.

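library(tm)
library(dplyr)

v <- Corpus(VectorSource(TEXT))

# as.matrix() returns the full dense matrix; in recent tm versions
# inspect() prints only a sample of it
data.frame(as.matrix(TermDocumentMatrix(v))) %>%
  add_rownames() %>%                     # deprecated; tibble::rownames_to_column() is the newer equivalent
  mutate(DF = rowSums(.[-1] >= 1)) %>%   # count the documents in which each term occurs
  select(Term = rowname, DF)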

      

Which gives:



#Source: local data frame [10 x 2]
#
#      Term DF
#1  dimakan  3
#2  diserbu  3
#3   jagung  5
#4   kacang  4
#5  kelinci  4
#6  ketimun  4
#7   ladang  3
#8    rusak  4
#9   seumur  1
#10 tanaman  5

      

Or you could just do:

transform(rowSums(as.matrix(TermDocumentMatrix(v)) >= 1))
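To see why the `>= 1` comparison matters, here is a minimal base R sketch (no tm needed) that rebuilds the count matrix from the question's five texts and contrasts summing the counts with counting the documents. The names `tokens`, `terms`, `counts`, `term_freq` and `doc_freq` are illustrative only, and splitting on whitespace is an assumption that happens to suit this data.

```r
TEXT <- c("tanaman jagung seumur jagung ",
          "tanaman jagung kacang ketimun rusak dimakan kelinci",
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan",
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan",
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak ")

# Split each document into words on runs of whitespace
tokens <- strsplit(trimws(TEXT), "\\s+")
terms  <- sort(unique(unlist(tokens)))

# Term-by-document count matrix: one row per term, one column per document
counts <- vapply(tokens,
                 function(doc) tabulate(match(doc, terms), nbins = length(terms)),
                 integer(length(terms)))
dimnames(counts) <- list(terms, paste("DOCS", seq_along(TEXT)))

term_freq <- rowSums(counts)       # total occurrences across all documents
doc_freq  <- rowSums(counts >= 1)  # number of documents containing the term

term_freq[["jagung"]]  # 6: "jagung" appears twice in the first document
doc_freq[["jagung"]]   # 5: but it appears in only five documents
```

Turning the counts into TRUE/FALSE before summing is what separates document frequency from term frequency: each document contributes at most 1 per term.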

      



Try the following:



dd <- strsplit(as.character(TEXT), ' ')

transform(table(unlist(lapply(dd, unique))))
#      Var1 Freq
#1  dimakan    3
#2  diserbu    3
#3   jagung    5
#4   kacang    4
#5  kelinci    4
#6  ketimun    4
#7   ladang    3
#8    rusak    4
#9   seumur    1
#10 tanaman    5
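The same idea can be wrapped in a small reusable function. This is only a sketch: `doc_frequency` is a hypothetical helper name, lower-casing and splitting on whitespace are assumptions about how terms should be tokenised, and the two short texts in the usage example are made up for illustration.

```r
# Hypothetical helper: document frequency of each term in a character vector
doc_frequency <- function(texts) {
  # Normalise case, trim the ends, and split on runs of whitespace
  tokens <- strsplit(trimws(tolower(as.character(texts))), "\\s+")
  # Keep each term once per document so repeats within a document do not count
  per_doc <- lapply(tokens, unique)
  freq <- table(unlist(per_doc))
  data.frame(Term = names(freq), DF = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Illustrative input (not from the question)
example <- c("jagung jagung kacang", "jagung ladang")
res <- doc_frequency(example)
res
```

Calling `unique()` per document before tabulating is the step that prevents the repeated "jagung" in one text from being counted twice.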

      
