Using a bag-of-words classifier on an out-of-sample dataset

I recently used a bag-of-words approach to build a document-term matrix with a 96% sparsity threshold. I then trained a decision tree on the bag-of-words input to predict whether a sentence is important or not. The model worked really well on the test dataset, but when I apply it to an out-of-sample dataset it cannot predict. Instead, it gives an error.

Here is the model I built in R:


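library(tm)         # Corpus, tm_map, DocumentTermMatrix
library(SnowballC)  # stemmer backend used by stemDocument
library(caTools)    # sample.split
library(rpart)      # CART model

data = read.csv('comments.csv', stringsAsFactors = FALSE)
corpus = Corpus(VectorSource(data$Word))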

# Pre-process data
# Wrap base functions in content_transformer so the documents stay tm objects
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)

# Create matrix
dtm = DocumentTermMatrix(corpus)

# Remove sparse terms
#dtm = removeSparseTerms(dtm, 0.96)
# Create data frame
labeledTerms = as.data.frame(as.matrix(dtm))

# Add in the outcome variable
labeledTerms$IsImp = data$IsImp 

# Split into train and test data using caTools
spl = sample.split(labeledTerms$IsImp, SplitRatio = 0.60)

train = subset(labeledTerms, spl == TRUE)
test = subset(labeledTerms, spl == FALSE)

#Build CART Model
CART = rpart(IsImp ~., data=train, method="class")


This works well on the test dataset, with around 83% accuracy. However, when I use this CART model to predict on a new sample dataset, it gives me an error.

terms A B C D E F..............(n terms)
Freqs 0 1 2 1 3 0..............(n terms)

terms A B C D E F..............(n terms)
Freqs 0 0 1 1 1 0..............(n terms)

data_random = read.csv('comments_random.csv', stringsAsFactors = FALSE)

terms A B D E F H..............(n terms)
Freqs 0 0 1 1 1 0..............(n terms)


The error I am getting is "cannot find C" in data_random. I don't know what I have to do to make this work. Is Laplace smoothing the way to go here?



2 answers

The problem is that C is part of your training set, so the model uses it as a predictor. This means that prediction on any new dataset requires C to have a value.

Your new data does not have a C column. You need to add a C column set to 0, which says that the term does not occur in those documents.
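A minimal sketch of that fix in base R, using toy data shaped like the example in the question. The helper name `align_terms` and the toy values are assumptions for illustration, not part of the original code.

```r
# Hypothetical helper: give `newdata` exactly the predictor columns the model
# was trained on, filling terms that are absent with a count of 0.
align_terms <- function(newdata, train_terms) {
  missing_terms <- setdiff(train_terms, names(newdata))
  for (term in missing_terms) {
    newdata[[term]] <- 0
  }
  # Drop terms the model never saw and restore the training column order
  newdata[, train_terms, drop = FALSE]
}

# Toy data mirroring the question: training has C, the new data has H instead
train_terms  <- c("A", "B", "C", "D", "E", "F")
random_terms <- data.frame(A = 0, B = 0, D = 1, E = 1, F = 1, H = 0)

aligned <- align_terms(random_terms, train_terms)
print(aligned)
```

With the real data, something like `predict(CART, newdata = aligned, type = "class")` should then run, assuming `train_terms` is taken from the training frame, for example `setdiff(names(train), "IsImp")`.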



It is good that this "error" came up. As @Felix suggests, the error occurs simply because a variable is missing from the prediction dataset. The error itself is therefore quite trivial, and its fix has nothing to do with Laplace smoothing or the like. You just need to make sure that your training dataset and your prediction dataset contain the same variables. You can, for example, do:

names(trainingdata) %in% names(predictiondata)

... And some extra code
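As a rough illustration of what that check could look like, here is a sketch on toy data frames. The column values are made up, and `IsImp` is assumed to be the only non-predictor column.

```r
# Toy stand-ins for the two data sets (hypothetical values)
trainingdata   <- data.frame(A = 0, B = 1, C = 2, D = 1, E = 3, F = 0, IsImp = 1)
predictiondata <- data.frame(A = 0, B = 0, D = 1, E = 1, F = 1, H = 0)

# Predictors are every training column except the outcome
predictors <- setdiff(names(trainingdata), "IsImp")

# Which predictors does the prediction data lack?
present       <- predictors %in% names(predictiondata)
missing_terms <- predictors[!present]
print(missing_terms)   # [1] "C"

# Which prediction columns did the model never see?
unseen_terms <- names(predictiondata)[!names(predictiondata) %in% predictors]
print(unseen_terms)    # [1] "H"
```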

Now, the reason I think the error is interesting is that it touches on a fundamental question of how to approach modeling text data. If you just add the variables that are missing from the prediction data (i.e. C) and fill the cells with zeros, you end up with a completely redundant variable that takes up space and memory. This means that you might as well drop the variable from the TRAINING data instead of adding it to the prediction data.
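One way to avoid carrying an all-zero column is to train only on the terms that both data sets share. The sketch below uses randomly generated toy counts; the names `data_random_terms` and `CART_shared` are assumptions for illustration.

```r
library(rpart)  # recommended package shipped with standard R distributions

set.seed(1)
# Toy training frame with term counts A..D and a two-class outcome
train <- data.frame(
  A = rpois(40, 1), B = rpois(40, 1), C = rpois(40, 1), D = rpois(40, 1),
  IsImp = factor(rep(c(0, 1), 20))
)
# Toy prediction frame: no C column, but an extra H column
data_random_terms <- data.frame(A = c(0, 1), B = c(0, 2), D = c(1, 0), H = c(0, 1))

# Keep only predictors that exist in both data sets
shared_terms <- intersect(setdiff(names(train), "IsImp"), names(data_random_terms))

# Refit on the shared terms so the model never asks for C
CART_shared <- rpart(IsImp ~ ., data = train[, c(shared_terms, "IsImp")],
                     method = "class")
pred <- predict(CART_shared, newdata = data_random_terms, type = "class")
print(pred)
```

The trade-off is that the model has to be refit whenever the prediction vocabulary changes.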

However, the best way to approach the problem is to generate the bag of words from both the training and the prediction data, and only then split the data into a training set and a prediction set. This takes care of your problem AND is at the same time more theoretically "correct", because you base your bag of words on a larger fraction of the total population of samples (i.e. texts).
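A sketch of that approach with the tm package, on made-up sentences. The texts and variable names here are assumptions; the point is only that both sets end up with identical columns.

```r
library(tm)

# Toy texts standing in for the training and prediction comments
train_text <- c("great product works well", "terrible support slow reply")
new_text   <- c("great support", "shipping slow")

# Build one document-term matrix over all texts so the vocabulary is shared
all_text   <- c(train_text, new_text)
corpus_all <- VCorpus(VectorSource(all_text))
dtm_all    <- DocumentTermMatrix(corpus_all)
terms_all  <- as.data.frame(as.matrix(dtm_all))

# Split back into training rows and prediction rows by position
n_train     <- length(train_text)
train_terms <- terms_all[seq_len(n_train), , drop = FALSE]
new_terms   <- terms_all[-seq_len(n_train), , drop = FALSE]

# Both frames now have the same columns
print(identical(names(train_terms), names(new_terms)))
```

In the question's setting, `train_text` would be `data$Word` and `new_text` would be the text column of `data_random`, assuming that file has one; the outcome `IsImp` is then added to the training rows only.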

That's my two cents. I hope this helps!


