Deep learning with an unbalanced dataset
I have two datasets that look like this:
DATASET 1
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 12)
DATASET 2
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 8)
I am trying to build a deep feed-forward neural network in TensorFlow. I am getting accuracy in the 90s and AUC estimates in the 80s. Of course, the dataset is highly imbalanced, so these metrics are useless. My emphasis is on getting a good recall value, and I don't want to over-predict class 1. I played with the complexity of the model to no avail; the best model correctly predicted only 25% of the positive class.
My question is: given the distribution of these datasets, is it futile to build models without getting more data (I can't get more data), or is there a way to work with data that is this imbalanced?
Thanks!
Question
Can I use TensorFlow to learn an imbalanced classification problem with a ratio of about 30:1?
Answer
Yes, and I have. Specifically, TensorFlow provides the ability to feed in a weight matrix. Look at tf.losses.sigmoid_cross_entropy; it has a weights parameter. You can feed in a matrix that matches the shape of Y and, for each Y value, indicate the relative weight that training example should have.
One way to find the correct weights is to start with different balances, run your training, then look at your confusion matrix and compare the precision for each class. Once both classes reach roughly the same precision, they are balanced.
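As a minimal NumPy sketch of that check (the function name per_class_precision and the toy labels are my own for illustration), this computes the confusion matrix and the per-class precision you would compare between balancing runs:

```python
import numpy as np

def per_class_precision(y_true, y_pred):
    # 2x2 confusion matrix: rows = true class, columns = predicted class
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    # precision for class c = correct predictions of c / all predictions of c
    prec = cm.diagonal() / cm.sum(axis=0)
    return cm, prec

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 1])
cm, prec = per_class_precision(y_true, y_pred)
print(cm)    # [[5 1] [1 3]]
print(prec)  # class 0: 5/6 ≈ 0.83, class 1: 3/4 = 0.75
```

If the class-1 precision stays far below the class-0 precision, increase the positive-class weight and retrain.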
Implementation example
Here is an example implementation that converts Y into a weight matrix; it worked very well for me:
import numpy as np

def weightMatrix(matrix, most=0.9):
    # clamp the positive-class frequency to the range [1 - most, most]
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    a = 1. / (b * 2.)
    # positive examples get weight a; negatives get a * b / (1 - b)
    weights = a * (matrix + (1 - matrix) * b / (1 - b))
    return weights
The most parameter caps the greatest fractional difference: 0.9 corresponds to .1:.9 = 1:9, whereas .5 is 1:1. Values below .5 do not work.
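As a quick sanity check, here is what those weights come out to for a label vector mimicking Dataset 1's training split (the tf.losses.sigmoid_cross_entropy call is sketched in a comment only; it assumes the TF 1.x API and a logits tensor you define elsewhere):

```python
import numpy as np

def weightMatrix(matrix, most=0.9):
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    a = 1. / (b * 2.)
    return a * (matrix + (1 - matrix) * b / (1 - b))

# Y mimicking Dataset 1's training split: 8982 negatives, 380 positives
Y = np.concatenate([np.zeros((8982, 1)), np.ones((380, 1))])
W = weightMatrix(Y)

# The positive frequency (~0.04) is clamped to b = 0.1, so positives
# get weight a = 5.0 and negatives a * b / (1 - b) ≈ 0.56, a 9:1 ratio.
print(W[-1, 0], W[0, 0])  # 5.0 0.555...

# In the training graph these would be fed to the loss, e.g. (TF 1.x):
# loss = tf.losses.sigmoid_cross_entropy(Y, logits, weights=W)
```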
You may be interested in this question and its answer. Its scope is a priori more limited than yours, since it asks specifically about class weights, but it seems very relevant to your case.
Also, the AUC is definitely not useless here: it does not actually depend on your data imbalance.
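To see why, recall that AUC is the probability that a random positive example is scored above a random negative one, which is a per-pair quantity. A small NumPy sketch (the auc helper is my own rank-counting implementation, not a library call) shows the value is unchanged when the negatives are replicated tenfold:

```python
import numpy as np

def auc(pos_scores, neg_scores):
    # probability a random positive outranks a random negative (ties count 1/2)
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

pos = [0.8, 0.6]
neg = [0.3, 0.7]
print(auc(pos, neg))       # 0.75
print(auc(pos, neg * 10))  # still 0.75 with 10x more negatives
```

Accuracy, by contrast, would be dominated by the majority class under the same replication, which is why AUC remains informative on your 30:1 data.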