Keras image preprocessing for unbalanced data

All,

I am trying to use Keras to classify images into two classes. For one class, I have a very limited number of images, say 500. For the other class, I have an almost unlimited number of images. How do I use Keras image preprocessing in this situation? Ideally, I need something like this: for the first class, I load the 500 images and use ImageDataGenerator to generate more images. For the second class, each time I take the next 500 images in sequence from a dataset of 1,000,000 images, and there is probably no need for data augmentation. Looking at an example here as well as the Keras documentation, I found that by default the training folder contains an equal number of images for each class. So my question is: is there an existing API for this? If so, please let me know. If not, is there a workaround?



1 answer


You have several options.

Option 1

Use the class_weight parameter of fit(), which is a dictionary mapping classes to weights. Let's say you have 500 samples of class 0 and 1500 samples of class 1; then you pass class_weight = {0: 3, 1: 1}. This gives class 0 three times the weight of class 1.

train_generator.classes gives you the class index of each sample, which is what you need to work out the weights.

If you want to compute this programmatically, you can use scikit-learn's sklearn.utils.compute_class_weight(): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/class_weight.py

The function looks at the distribution of labels and produces weights that penalize under- and over-represented classes in the training set equally.
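To make the weighting concrete, here is a minimal sketch of the "balanced" heuristic, n_samples / (n_classes * count), written in plain Python rather than calling scikit-learn. The helper name balanced_class_weights and the label list are made up for illustration; the 500/1500 split is the example from above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count).

    Hypothetical helper for illustration, not part of Keras or scikit-learn.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Example from the answer: 500 samples of class 0, 1500 samples of class 1.
labels = [0] * 500 + [1] * 1500
weights = balanced_class_weights(labels)

# Class 0 ends up with three times the weight of class 1,
# the same ratio as the hand-written {0: 3, 1: 1}.
print(weights)
```

With a Keras generator, the labels would presumably come from train_generator.classes.tolist(), and the resulting dictionary would be passed as the class_weight argument of fit(). Calling scikit-learn's compute_class_weight with class_weight="balanced" should produce the same values, returned as an array rather than a dictionary.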



See also this helpful thread here: https://github.com/fchollet/keras/issues/1875

This thread might also help: Is it possible to automatically infer class_weight from flow_from_directory in Keras?

Option 2

You do a dummy training run with a generator in which you apply augmentation to the images, such as rotating, scaling, cropping, flipping, etc., and save the augmented images for the real training. This way you can create a larger, or even balanced, dataset for your under-represented class.

In this dummy run, you set save_to_dir in the flow_from_directory function to a folder of your choice and then only take images from the class you need more samples of. You obviously discard any training results, since you only use this run to get more data.
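A rough sketch of such a run is shown below. It assumes a TensorFlow/Keras version that still ships the legacy ImageDataGenerator, and it simply iterates the generator instead of fitting a model, since the training results would be thrown away anyway. The function names, directory layout, and augmentation settings are assumptions chosen for illustration.

```python
import math
import os


def batches_needed(n_minority, n_target, batch_size):
    """Estimate how many generator batches to draw so that originals plus
    augmented copies reach roughly n_target images.

    The last batch of each pass over the data can be smaller than
    batch_size, so treat the result as an estimate.
    """
    extra = max(0, n_target - n_minority)
    return math.ceil(extra / batch_size)


def save_augmented(data_dir, class_name, out_dir, n_batches, batch_size=50):
    """Write augmented copies of one class to out_dir (hypothetical helper)."""
    # Imported lazily so the helper above works without TensorFlow installed.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # The output folder is created up front in case the generator does not.
    os.makedirs(out_dir, exist_ok=True)

    # Augmentation settings are arbitrary examples.
    datagen = ImageDataGenerator(
        rotation_range=20,
        zoom_range=0.2,
        horizontal_flip=True,
    )
    generator = datagen.flow_from_directory(
        data_dir,
        classes=[class_name],  # only read the under-represented class
        class_mode=None,       # yield images only, no labels
        batch_size=batch_size,
        save_to_dir=out_dir,   # each generated batch is written here
        save_prefix="aug",
        save_format="png",
    )
    # Drawing a batch is what triggers the save; the batch itself is discarded.
    for _ in range(n_batches):
        next(generator)


# 500 minority images, aiming for about 1500 in total, 50 images per batch.
n_batches = batches_needed(500, 1500, 50)
```

Calling save_augmented("data/train", "class0", "data/augmented/class0", n_batches) would then fill the output folder, where both paths and the class folder name are placeholders. The saved files can be copied into the training folder of the under-represented class before the real training run.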
