In classification, how do you validate the model in the case of an imbalanced dataset?

I'm kind of a beginner in machine learning, trying to solve a classification problem. I'm working on a very imbalanced (sequential) dataset (only 2% positives out of 20k records), and I am modeling with LSTM/GRU in Python using TensorFlow.

This is what I am doing: load the data, then divide it into 3 datasets: A for training (70%), B for validation (15%), C for testing (15%). For each dataset (A, B, C), oversample the positive class to raise the positive rate from 2% to 30%. This gives me 3 new, more balanced datasets: A', B', C'. (A rough sketch of this procedure is below.)
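A minimal sketch of this split-and-oversample procedure, assuming the data are already loaded as NumPy arrays `X` (shape `[n, timesteps, features]`) and `y` (0/1 labels); the helper name, the seed, and the variable names are illustrative, not from the original setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample_positives(X, y, target_pos_rate=0.30):
    """Duplicate positive samples until they make up ~target_pos_rate of the set."""
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    # Solve n_pos / (n_pos + n_neg) = target_pos_rate for the required n_pos.
    n_pos = int(target_pos_rate * len(neg_idx) / (1.0 - target_pos_rate))
    resampled_pos = rng.choice(pos_idx, size=n_pos, replace=True)
    idx = rng.permutation(np.concatenate([neg_idx, resampled_pos]))
    return X[idx], y[idx]

# 70% / 15% / 15% split before any resampling.
n = len(y)
perm = rng.permutation(n)
train_idx, val_idx, test_idx = np.split(perm, [int(0.70 * n), int(0.85 * n)])

X_A, y_A = X[train_idx], y[train_idx]   # A: training
X_B, y_B = X[val_idx], y[val_idx]       # B: validation
X_C, y_C = X[test_idx], y[test_idx]     # C: test

X_Ap, y_Ap = oversample_positives(X_A, y_A)  # A' (~30% positives); likewise B', C'
```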

Then I train my model on dataset A' using a GRU.

My goal: to get the maximum F-score on my test set C. (Are there any better metrics out there? From what I've seen, the F-score depends on the distribution of the data, i.e. on how skewed it is: the more skewed the data, the lower the precision (because of the relatively larger number of false positives), while recall stays more or less the same, and hence the overall F-score decreases.)

My questions:

Can cross-entropy be used as my cost function during training? (I am not changing the cost function to be more sensitive to the rare positive class, since I have already oversampled my positives. See the sketch below.)
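For concreteness, a hedged sketch of that setup: a small GRU classifier compiled with plain (unweighted) binary cross-entropy, relying on the oversampling rather than the loss to handle the imbalance. The layer size and feature count are assumptions, not from the original post:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # input_shape = (timesteps, features); 8 features is an assumed placeholder
    tf.keras.layers.GRU(64, input_shape=(None, 8)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Plain binary cross-entropy: class balance is handled by the oversampling,
# not by reweighting the loss.
model.compile(optimizer="adam", loss="binary_crossentropy")
```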

Which dataset should I use for validation, B or B'? And what metric should I use for the validation learning curve, to see how well my model fits? (Currently I track accuracy on both A' (train) and B (validation) to check for overfitting, but the accuracy on B does not seem to correlate with the F-score on B. Ultimately I want a good F-score on C, which means I need a model that gives a good F-score on B. A sketch of tracking the F-score on B directly is below.)
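One way to build that learning curve with the metric you actually care about is to compute F1 on the untouched B after each epoch. A minimal sketch using a Keras callback (the class name and the 0.5 decision threshold are assumptions):

```python
import tensorflow as tf
from sklearn.metrics import f1_score

class F1OnB(tf.keras.callbacks.Callback):
    """Record the F1 score on the validation set B after every epoch."""

    def __init__(self, X_val, y_val, threshold=0.5):
        super().__init__()
        self.X_val, self.y_val, self.threshold = X_val, y_val, threshold
        self.history = []

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.X_val, verbose=0).ravel()
        f1 = f1_score(self.y_val, (probs >= self.threshold).astype(int))
        self.history.append(f1)
        print(f"epoch {epoch}: F1 on B = {f1:.3f}")

# model.fit(X_Ap, y_Ap, epochs=50, callbacks=[F1OnB(X_B, y_B)])
```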

Thanks in advance for your time! Regards.

1 answer


(What follows is more of a long comment than a full answer; I need to think about it more. Hopefully I'll find time to update it tonight or tomorrow.)

Which set should be the test set?

We use a test set to estimate the true score (error / accuracy / F1-score / recall / precision / ...), i.e. the score we would get if we could test the model on all possible samples (an extraordinarily huge number: for 32x32 px grayscale images alone that would be $256^{1024} \approx 10^{2466}$ possible inputs).

Hence, you take C for testing, not C'.
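A small sketch of that final evaluation on the untouched C (which keeps the true 2% positive rate), reusing the names from the sketches above; the 0.5 threshold is again an assumption:

```python
from sklearn.metrics import classification_report

y_pred_C = (model.predict(X_C, verbose=0).ravel() >= 0.5).astype(int)
print(classification_report(y_C, y_pred_C, digits=3))  # precision / recall / F1 per class
```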

Which set should be the validation set?

We use a validation set that does not overlap with the test set, usually for early stopping. If the score is the optimization goal, it should be B (not B'). If the score is something else, you need to think about how the two relate (for example: when the optimization goal improves, does the score also improve?). If in many cases they do not go hand in hand, you should adjust the optimization goal.

You have an F1 score as the target, and you are thinking about using cross-entropy as the optimization objective. Cross-entropy ignores the class imbalance, which is why you are balancing the classes by oversampling.

edit: Thinking about it, I would evaluate F1 on B as the stopping criterion. Other options may also be valid, but this seems the most relevant, since the F1 score is what should be maximized. A sketch of such a stopping rule follows.
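A hedged sketch of that stopping rule: stop training once F1 on B has not improved for a few epochs, then restore the best weights. The class name and patience value are assumptions:

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import f1_score

class EarlyStopOnF1(tf.keras.callbacks.Callback):
    """Stop training when F1 on B stops improving; keep the best weights."""

    def __init__(self, X_val, y_val, patience=5, threshold=0.5):
        super().__init__()
        self.X_val, self.y_val = X_val, y_val
        self.patience, self.threshold = patience, threshold
        self.best_f1, self.wait, self.best_weights = -np.inf, 0, None

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.X_val, verbose=0).ravel()
        f1 = f1_score(self.y_val, (probs >= self.threshold).astype(int))
        if f1 > self.best_f1:
            self.best_f1, self.wait = f1, 0
            self.best_weights = self.model.get_weights()
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.model.stop_training = True
                self.model.set_weights(self.best_weights)  # roll back to best F1
```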

Which set should be the training set?

If you take A, you have the problem that your network learns to always predict the more common class. Therefore, you must take A'.
