Embedding n-grams for next word prediction
I am trying to use trigrams for next-word prediction.
I was able to download the corpus and identify the most common trigrams by their frequencies, using the packages "ngrams", "RWeka" and "tm" in R. I followed this question for guidance:
What algorithm do I need to find n-grams?
text1 <- readLines("MyText.txt", encoding = "UTF-8")
corpus <- Corpus(VectorSource(text1))
# min = 3, max = 3 yields trigrams, so the tokenizer is named accordingly
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
If the user enters a set of words, how would I generate the next word? For example, if the user types "can of", how would I get the three most likely next words (e.g. beer, soda, paint)?
Here's one way as a starter:
f <- function(queryHistoryTab, query, n = 2) {
  require(tau)
  trigrams <- sort(textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab),
                           method = "string",
                           n = length(scan(text = query, what = "character", quiet = TRUE)) + 1))
  query <- tolower(query)
  idx <- which(substr(names(trigrams), 1, nchar(query)) == query)
  res <- head(names(sort(trigrams[idx], decreasing = TRUE)), n)
  res <- substr(res, nchar(query) + 2, nchar(res))
  return(res)
}
f(c("Can of beer" = 3, "can of Soda" = 2, "A can of water" = 1, "Buy me a can of soda, please" = 2), "Can of")
# [1] "soda" "beer"
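The same idea can be expressed without tau, operating directly on a named vector of trigram frequencies (which could be derived from the asker's term-document matrix, e.g. via slam::row_sums(tdm)). This is a base-R sketch, not part of the original answer; the function name and the sample counts are mine:

```r
# Given a named vector of trigram counts, return the n most frequent
# third words that follow a two-word prefix.
predict_next <- function(counts, prefix, n = 3) {
  prefix <- tolower(trimws(prefix))
  # keep trigrams whose first two words match the typed prefix
  hits <- startsWith(tolower(names(counts)), paste0(prefix, " "))
  if (!any(hits)) return(character(0))
  top <- sort(counts[hits], decreasing = TRUE)
  # strip the prefix, leaving only the predicted third word
  substring(names(head(top, n)), nchar(prefix) + 2)
}

counts <- c("can of beer" = 3, "can of soda" = 2, "can of paint" = 1)
predict_next(counts, "can of")
# [1] "beer"  "soda"  "paint"
```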
I just tried it! Hopefully the commented code below helps, though I would be curious to see how an RNN performs on trigrams. Naive Bayes did not do a decent job, probably because of the rarity of individual trigrams. Gram_12 holds the first two words of each trigram. Consider this a first step, not the ultimate model for your efforts.
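The code below assumes two data frames that are not shown: tri.df with columns Gram_12 (first two words of each trigram) and Gram_3 (the third word), and bi.df with columns Gram_1 and Gram_2. One way they might be built from plain n-gram character vectors, as a base-R sketch (only the column names come from the answer; the construction and sample data are my assumption):

```r
# Trigrams and bigrams as plain character vectors, one element per token,
# e.g. extracted from the term-document matrices built from the corpus
trigrams <- c("can of beer", "can of soda", "a can of")
bigrams  <- c("can of", "of beer", "of soda")

tri.words <- strsplit(trigrams, " ")
tri.df <- data.frame(
  Gram_12 = sapply(tri.words, function(w) paste(w[1], w[2])),  # first two words
  Gram_3  = sapply(tri.words, function(w) w[3]),               # third word
  stringsAsFactors = FALSE
)

bi.words <- strsplit(bigrams, " ")
bi.df <- data.frame(
  Gram_1 = sapply(bi.words, function(w) w[1]),  # first word
  Gram_2 = sapply(bi.words, function(w) w[2]),  # second word
  stringsAsFactors = FALSE
)
```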
library(stringr)
library(qdap)

if (word_count(qry) >= 2) {
  # Take the last two words of the query
  lastwd <- word(qry, -2:-1)
  test <- paste(lastwd[1], lastwd[2])
  # Check whether the last two words match Gram_12 in the trigram table
  index1 <- with(tri.df, grepl(test, Gram_12))
  if (any(index1)) {
    # Subset the trigrams and group by Gram_3
    filtered <- tri.df[index1, ]
    # Frequency of each unique third word
    freq <- data.frame(table(filtered$Gram_3))
    # Order by frequency of Gram_3 and keep the top 5
    freq <- head(freq[order(-freq$Freq), ], 5)
    predict <- as.character(freq[freq$Freq > 0, ]$Var1)
  } else {
    # Not found: back off to the last word only
    lastwd <- word(qry, -1)
    # Search Gram_1 in the bigram table and group by Gram_2
    index2 <- with(bi.df, grepl(lastwd, Gram_1))
    if (any(index2)) {
      filtered <- bi.df[index2, ]
      # Frequency of each unique second word
      freq <- data.frame(table(filtered$Gram_2))
      # Order by frequency of Gram_2 and keep the top 5
      freq <- head(freq[order(-freq$Freq), ], 5)
      predict <- as.character(freq[freq$Freq > 0, ]$Var1)
    } else {
      predict <- "Need more training to predict"
    }
  }
} else {
  # Only one word entered: use the bigram table directly
  lastwd <- word(qry, -1)
  index3 <- with(bi.df, grepl(lastwd, Gram_1))
  if (any(index3)) {
    filtered <- bi.df[index3, ]
    # Frequency of each unique second word
    freq <- data.frame(table(filtered$Gram_2))
    # Order by frequency of Gram_2 and keep the top 5
    freq <- head(freq[order(-freq$Freq), ], 5)
    predict <- as.character(freq[freq$Freq > 0, ]$Var1)
  } else {
    predict <- "Need more training to predict"
  }
}
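The branching above is a simple back-off scheme: try the trigram table on the last two words of the query, then fall back to the bigram table on the last word alone. Wrapped as a reusable base-R function (my sketch; I use exact matching on the context instead of grepl, since grepl also fires on partial-word matches, and I assume lowercased tables):

```r
# Back-off next-word prediction over the tri.df / bi.df tables described
# above: tri.df has columns Gram_12 and Gram_3, bi.df has Gram_1 and Gram_2,
# one row per observed n-gram occurrence.
predict_word <- function(qry, tri.df, bi.df, n = 5) {
  words <- strsplit(tolower(trimws(qry)), "\\s+")[[1]]
  if (length(words) >= 2) {
    # context = last two words; look up matching trigrams first
    ctx <- paste(tail(words, 2), collapse = " ")
    cand <- tri.df$Gram_3[tri.df$Gram_12 == ctx]
    if (length(cand) > 0) {
      return(names(head(sort(table(cand), decreasing = TRUE), n)))
    }
  }
  # back off: last word against the bigram table
  cand <- bi.df$Gram_2[bi.df$Gram_1 == tail(words, 1)]
  if (length(cand) > 0) {
    return(names(head(sort(table(cand), decreasing = TRUE), n)))
  }
  "Need more training to predict"
}
```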