SMOTE oversampling and cross validation

I am working on a binary classification problem in Weka with a highly imbalanced dataset (90% of instances in one class and 10% in the other). I first applied SMOTE ( http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html ) to the whole dataset to balance the classes, and then ran 10-fold cross-validation on the resulting data. I got (too?) optimistic results, with an F1 score around 90%.

Is this due to the oversampling? Is it a good idea to cross-validate on data that SMOTE has already been applied to? Is there a way around this problem?

+3




2 answers


I think you need to split off the test data first, then apply SMOTE to the training part only, and then evaluate the algorithm on the held-out portion of the dataset, which contains no synthetic examples. That will give you a much better idea of the algorithm's real performance.
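A minimal sketch of this protocol using scikit-learn (not Weka, which the question uses). The `smote_like` helper is a simplified stand-in for full SMOTE, interpolating between a minority sample and one of its k nearest minority neighbors; all names here are illustrative assumptions, not part of any library's SMOTE API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE sketch: each synthetic point lies on the segment
    between a random minority sample and one of its k nearest minority
    neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]  # position 0 is the point itself
        lam = rng.random()
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

# Toy 90/10 imbalanced problem
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample ONLY the training split; the test split stays untouched
X_min = X_tr[y_tr == 1]
n_new = (y_tr == 0).sum() - (y_tr == 1).sum()
X_syn = smote_like(X_min, n_new)
X_bal = np.vstack([X_tr, X_syn])
y_bal = np.concatenate([y_tr, np.ones(len(X_syn), dtype=int)])

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print("F1 on untouched test set:", f1_score(y_te, clf.predict(X_te)))
```

The F1 printed here is honest because no synthetic point can leak into the evaluation.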



+8




In my experience, a single manual split is not the best way to deal with this problem. With one dataset, you should still cross-validate each classifier you use, but in such a way that in each iteration one fold is your test set, to which you do not apply SMOTE, and the other nine folds are your training set, which you balance. Repeat this ten times, once per fold. You will get more reliable estimates than from splitting the dataset by hand once.
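The per-fold procedure above can be sketched as follows with scikit-learn. To keep the example self-contained, random oversampling via `sklearn.utils.resample` stands in for SMOTE; the point being illustrated is *where* the oversampling happens (inside each fold, on the training folds only), not the oversampling method itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

scores = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Balance ONLY the nine training folds (random oversampling stands
    # in for SMOTE here; the protocol is what matters)
    X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
    X_up, y_up = resample(X_min, y_min,
                          n_samples=int((y_tr == 0).sum()), random_state=0)
    X_bal = np.vstack([X_tr[y_tr == 0], X_up])
    y_bal = np.concatenate([y_tr[y_tr == 0], y_up])
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    # The held-out fold contains no synthetic or duplicated samples
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print("mean F1 over 10 folds:", np.mean(scores))
```

With imbalanced-learn, the same guarantee can be obtained by putting a SMOTE step in an imblearn `Pipeline` and passing it to `cross_val_score`, which refits the oversampler inside each fold automatically.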



Obviously, if you apply SMOTE to both the test and the training set, you end up with a partly synthetic test set, which yields artificially high scores that do not reflect real performance.

+2








