SMOTE oversampling and cross validation
I am working on a binary classification problem in Weka with a very imbalanced dataset (90% in one class and 10% in the other). I first applied SMOTE ( http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html ) to the whole dataset to balance the classes, and then ran 10-fold cross-validation on the resulting data. I got (too?) optimistic results, with an F1 around 90%.
Is this due to the oversampling? Is it a good idea to cross-validate data that SMOTE has already been applied to? Is there any way to avoid this problem?
In my experience, applying SMOTE before splitting the data is not a good way to deal with this problem. With a single dataset, you should run cross-validation so that in each iteration one fold is your test set, to which you must not apply SMOTE, and the other nine folds form your training set, which is where you balance the classes. Repeat this for all 10 folds in a loop. This gives more reliable results than oversampling the entire dataset up front.
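The per-fold procedure above can be sketched in Python with scikit-learn. This is a minimal illustration, not Weka's implementation: `smote_oversample` is a hypothetical, naive version of SMOTE (interpolating between minority nearest neighbors), and the synthetic classification data stands in for your real dataset. The key point is that oversampling happens inside the loop, on the training folds only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def smote_oversample(X, y, minority_label, k=5, seed=0):
    """Naive SMOTE sketch: synthesize minority points by interpolating
    between a minority sample and one of its k nearest minority
    neighbors, until the two classes are balanced."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_needed = (y != minority_label).sum() - len(X_min)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is the point itself
    synth = []
    for _ in range(n_needed):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]       # random minority neighbor
        gap = rng.random()
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    X_new = np.vstack([X] + synth)
    y_new = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_new, y_new

# Imbalanced toy data: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

scores = []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y):
    # SMOTE is applied to the training folds ONLY;
    # the test fold stays at the original class ratio.
    X_tr, y_tr = smote_oversample(X[train_idx], y[train_idx], minority_label=1)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean F1 over 10 folds: {np.mean(scores):.3f}")
```

Because every test fold keeps its real 90/10 ratio, the reported F1 reflects performance on data that looks like production data, not on synthetic points.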
Obviously, if you apply SMOTE to both the test and training sets, you end up with a partly synthetic test set, which gives you a high score that is in fact wrong.
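The leakage is easy to demonstrate with a small self-contained sketch. Here plain duplication is used as a stand-in for SMOTE (the effect is the same in kind: near-copies of minority points land in both training and test folds), with a 1-nearest-neighbor classifier to make the leak visible. The dataset and numbers are illustrative, not from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Imbalanced toy data: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# WRONG: oversample (by duplication, as a stand-in for SMOTE) BEFORE
# cross-validation. Copies of the same minority point end up in both
# the training and the test folds.
rng = np.random.default_rng(0)
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=(y == 0).sum() - len(minority))
X_leak = np.vstack([X, X[extra]])
y_leak = np.concatenate([y, y[extra]])

clf = KNeighborsClassifier(n_neighbors=1)
leaky = cross_val_score(clf, X_leak, y_leak, cv=10, scoring="f1").mean()
honest = cross_val_score(clf, X, y, cv=10, scoring="f1").mean()
print(f"leaky F1: {leaky:.3f}  honest F1: {honest:.3f}")
```

The 1-NN classifier finds the exact duplicate of each leaked test point in its training folds, so the "leaky" F1 comes out far above the honest one, which is exactly the kind of inflated score described in the question.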