Deep learning with an unbalanced dataset

I have two datasets that look like this:

DATASET 1
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 12)

DATASET 2
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 8)


I am trying to build a deep feedforward neural network in TensorFlow. I am getting accuracy in the 90s and AUC estimates in the 80s. Of course, the dataset is highly imbalanced, so accuracy is useless here. My emphasis is on getting a good recall value, and I don't want to over-predict class 1. I played with the complexity of the model to no avail; the best model correctly predicted only 25% of the positive class.

My question is: given the distribution of these datasets, is it futile to build models without getting more data (I can't get more data), or is there a way to work with data that is this imbalanced?

Thanks!

+3




2 answers


Question

Can I use TensorFlow to learn an imbalanced classification problem with a class ratio of about 30:1?

Answer

Yes, and I have. Specifically, TensorFlow lets you feed in a weight matrix. Look at tf.losses.sigmoid_cross_entropy; it has a weights parameter. You can feed in a matrix that matches the shape of Y, and for each value of Y indicate the relative weight that training example should have.
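As a sketch of what that weights parameter does (in NumPy rather than TensorFlow, so it runs without a TF install; the function name and the 30:1 weighting are illustrative assumptions), each example's sigmoid cross-entropy is simply scaled by its weight before the mean is taken:

```python
import numpy as np

def weighted_sigmoid_ce(labels, logits, weights):
    # Numerically stable per-example sigmoid cross-entropy,
    # the same per-example quantity tf.losses.sigmoid_cross_entropy averages
    ce = np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    # Scale each example's loss by its weight, then average
    return np.mean(weights * ce)

labels = np.array([1., 0., 0., 0.])
logits = np.array([-2., -2., -2., -2.])   # model that ignores the rare class
# Upweight the rare positive class roughly by the 30:1 imbalance
weights = np.where(labels == 1, 30., 1.)

# Misclassifying the rare positive now costs far more than with unit weights
print(weighted_sigmoid_ce(labels, logits, weights) >
      weighted_sigmoid_ce(labels, logits, np.ones(4)))   # True
```

With unit weights this reduces to the ordinary mean cross-entropy, so the weighting is a strict generalization.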

One way to find the correct weights is to try several different balances, run your training, then look at your confusion matrix and compare precision versus recall for each class. Once both classes have a similar precision-to-recall ratio, they are balanced.
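Reading those two numbers off hard predictions is a few lines of NumPy (the helper name is my own, not from any library):

```python
import numpy as np

def precision_recall(y_true, y_pred, cls):
    # Precision and recall for one class, derived from the confusion matrix
    tp = np.sum((y_true == cls) & (y_pred == cls))
    fp = np.sum((y_true != cls) & (y_pred == cls))
    fn = np.sum((y_true == cls) & (y_pred != cls))
    precision = float(tp / (tp + fp)) if tp + fp else 0.0
    recall = float(tp / (tp + fn)) if tp + fn else 0.0
    return precision, recall

y_true = np.array([1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0])
print(precision_recall(y_true, y_pred, cls=1))  # (0.5, 0.5)
```

Track these per class across weight settings instead of overall accuracy, which the majority class dominates.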



Implementation example

Here is an example implementation that converts Y into a weight matrix; it worked very well for me:

import numpy as np

def weightMatrix(matrix, most=0.9):
    # Per-column positive rate, clipped to the range [1 - most, most]
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    a = 1. / (b * 2.)
    # Positives get weight a; negatives get a * b / (1 - b),
    # so both classes contribute equally to the loss
    weights = a * (matrix + (1 - matrix) * b / (1 - b))
    return weights


The parameter most sets the largest fractional imbalance allowed. 0.9 corresponds to 0.1:0.9 = 1:9, whereas 0.5 is 1:1. Values below 0.5 do not work.
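For example, on a roughly 30:1 label column (the function is repeated here so the snippet stands alone), the clip at 1 - most = 0.1 leaves every positive weighted nine times as heavily as a negative:

```python
import numpy as np

def weightMatrix(matrix, most=0.9):
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    a = 1. / (b * 2.)
    return a * (matrix + (1 - matrix) * b / (1 - b))

# One positive out of 30 labels, as a single-column matrix
y = np.array([[1.]] + [[0.]] * 29)
w = weightMatrix(y)
print(round(float(w[0, 0]), 4))   # positive weight: 1 / (2 * 0.1) = 5.0
print(round(float(w[1, 0]), 4))   # negative weight: 5.0 * 0.1 / 0.9 ≈ 0.5556
```

The expected total weight of each class is then equal, which is what balances the gradient between the classes.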

+3




You may be interested in this question and its answer. Its scope is a priori more limited than yours, since it deals specifically with class weights for classification, but it seems very relevant to your case.



Also, the AUC is definitely not useless: it does not actually depend on your class imbalance.

+2








