Why should my training set have the same skewed class distribution as my test set?
My question is why my training set should also be skewed (far fewer instances of the positive class than of the negative class) when my test set is skewed. I read that it is important to keep the class distribution the same for both training and testing to get the most realistic performance estimate. For example, if my test set has a 90%-10% distribution of class instances, should my training set have the same proportions?
I find it difficult to understand why it is important to maintain the proportions of the class instances in the training set that are present in the test set.
What confuses me is this: don't we just want the classifier to learn the patterns in both classes? If so, should it matter whether the training set is skewed just because the test set is skewed?
Any thoughts would be helpful.
IIUC, you are asking about the rationale for stratified sampling (as used, for example, in scikit-learn's StratifiedKFold).
Once you've split your data into train and test sets, you have three datasets to consider:
1. the "real world" set on which your classifier will actually run
2. the train set, from which you will learn patterns
3. the test set, which you will use to evaluate the classifier's performance
(So using 2 + 3 is really just a proxy for how things would work on 1, including possibly any tuning.)
Suppose your data has some class that is far from uniformly represented: say it appears only about 5% of the time, much less often than it would if the classes were uniformly distributed. Moreover, you believe this is not a case of GIGO: in the real world, the probability of this class really is about 5%.
When you split the data into 2 + 3, you run the risk that things will be skewed relative to 1:
- It is quite possible that the class won't appear 5% of the time (in the train or test set), but rather more or less often.
- It is quite possible that the feature values of that class's instances will be skewed in the train or test set relative to 1.
In these cases, when you make decisions based on the 2 + 3 combination, they are likely to be a poor indication of the effect on 1, which is what you are really after.
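To make the first risk concrete, here is a minimal sketch in plain Python. It assumes a hypothetical sample of 1000 labels with 5% positives, and the helper names (`random_split`, `stratified_split`, `positive_rate`) are made up for this illustration. It compares how the positive rate in the test part drifts under a plain random split with what a hand-rolled stratified split gives.

```python
import random

def random_split(labels, test_frac, rng):
    """Plain random split: shuffle all indices, then cut off the test part."""
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(round(len(labels) * test_frac))
    return idx[n_test:], idx[:n_test]

def stratified_split(labels, test_frac, rng):
    """Split each class separately so both parts keep the class proportions."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = int(round(len(members) * test_frac))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def positive_rate(labels, idx):
    """Fraction of the selected instances that belong to the positive class."""
    return sum(labels[i] for i in idx) / len(idx)

# Hypothetical sample (an assumption for this sketch): 1000 instances, 5% positive.
labels = [1] * 50 + [0] * 950
rng = random.Random(0)

# Repeat the plain random split many times and record the test-set positive rate.
random_rates = []
for _ in range(200):
    _, test_idx = random_split(labels, 0.2, rng)
    random_rates.append(positive_rate(labels, test_idx))

# One stratified split for comparison.
train_idx, test_idx = stratified_split(labels, 0.2, rng)
stratified_train_rate = positive_rate(labels, train_idx)
stratified_test_rate = positive_rate(labels, test_idx)

print("random split, test positive rate: min %.3f, max %.3f"
      % (min(random_rates), max(random_rates)))
print("stratified split: train %.3f, test %.3f"
      % (stratified_train_rate, stratified_test_rate))
```

The plain random split lets the rare class's share of the test set wander from run to run, while the stratified split pins both parts to the proportion found in the full sample.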
By the way, I don't think the emphasis is on skewing the train set to match the test set, but rather on making both the train and test sets match the whole sampled data.
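As a rough illustration of that last point, the sketch below uses scikit-learn's `StratifiedKFold` on made-up data (1000 instances, 5% positive, random-noise features; all of these are assumptions for the example, not anything from the question). It prints the positive-class share of every train and test fold next to the share in the full sample.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data for illustration: 5% positives, features are pure noise.
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 950)
X = rng.normal(size=(len(y), 3))

overall_rate = y.mean()
print("full sample positive rate: %.3f" % overall_rate)

# Each fold is built so that its class proportions follow those of y.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    train_rate = y[train_idx].mean()
    test_rate = y[test_idx].mean()
    fold_rates.append((train_rate, test_rate))
    print("fold %d: train %.3f, test %.3f" % (fold, train_rate, test_rate))
```

Neither part is being matched to the other here; both are tied to the class proportions of the data that was sampled, which is the best available stand-in for the real world.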