The choice of training and validation sets for a convolutional neural network has a large impact on testing accuracy.

I am working on a traffic sign recognition task using the German Traffic Sign Recognition Benchmark (GTSRB) dataset. There are 43 classes with at least 400 images in each class. An image can contain up to 3 road signs.

When I randomly split the images into training and validation sets, I get a huge difference in my network's test accuracy. I built two datasets: one with 75% training images and 25% validation images, the other with 70% training images and 30% validation images.
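A minimal sketch of one way such a random, class-balanced split can be made (the directory layout, file extension, and variable names below are assumptions for illustration, not my exact code):

    from pathlib import Path
    from sklearn.model_selection import train_test_split

    # Assumed layout: one subdirectory per class, e.g. .../Images/00000/*.ppm
    image_paths = sorted(Path("GTSRB/Final_Training/Images").rglob("*.ppm"))
    labels = [p.parent.name for p in image_paths]  # class id from the folder name

    train_paths, val_paths, train_labels, val_labels = train_test_split(
        image_paths, labels,
        test_size=0.25,    # 25% validation here; 0.30 for the second dataset
        stratify=labels,   # keep all 43 classes in the same proportions
        random_state=0,    # fixed seed so the split can be reproduced
    )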

I train GoogLeNet with identical hyperparameters on both datasets, including 30 epochs.

After training, I evaluate on a separate set of test images. With the first dataset I get almost 10% lower accuracy than with the second. Can someone explain this?

Maybe the random selection accidentally put "simpler" images into the training set, which is why I get lower results?

PS: for both datasets I am using the same pool of images, just split in different proportions.

Dataset link: http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset

1 answer


How many sign classes do you have? Is GoogLeNet known to perform well on this dataset? Remember, GoogLeNet was designed specifically for the ILSVRC 2012 dataset: almost 1.3M training images, 1000 classes. It fans out into several hundred parallel kernels, which gives it a lot of flexibility for that problem.

If you have a significantly smaller problem - say, 900 images spread over a handful of categories - then GoogLeNet is likely overkill for your application. For example, note that the final fully connected layer with 1000 outputs has more capacity than it needs to memorize each image in the training set individually. The intermediate layers with 128-200+ filters are going to pick up many spurious features, such as a green patch around the speed zone signs.
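To make the size mismatch concrete, here is a minimal sketch (assuming PyTorch and torchvision, which the question does not specify) of defining GoogLeNet with a 43-way head instead of the stock 1000-way ImageNet classifier:

    import torch.nn as nn
    from torchvision import models

    # Stock GoogLeNet ships with a 1000-way ImageNet classifier;
    # aux_logits=False also drops the two auxiliary ImageNet heads.
    model = models.googlenet(num_classes=43, aux_logits=False, init_weights=True)

    # Equivalent after-the-fact fix: swap the final fully connected layer
    # (1024 inputs -> 1000 outputs) for a 43-way traffic-sign layer.
    # model.fc = nn.Linear(model.fc.in_features, 43)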

Also, remember that GoogLeNet was designed to identify a single object in the input image, while the Traffic Sign database can have up to 6 signs per image. That can also interfere with your training, depending on how you label ambiguous images.



If you think you need the complexity of GoogLeNet for this task, I suggest you reduce the width of its layers. For example, while ILSVRC images force a model to learn to identify facial features, vehicle parts, and flower petals, road signs are much more limited in their visuals. So while you might want the first layer (edge and region detection) to stay fairly wide, you won't need nearly as many filters in the middle layers.
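As a rough illustration of "narrower" (my own sketch, assuming PyTorch; the filter counts are guesses, not tuned values):

    import torch.nn as nn

    # A deliberately narrow convnet in the spirit of the advice above:
    # keep the first layer reasonably wide for edge/region detection,
    # but use far fewer filters in the middle than GoogLeNet's 128-200+.
    narrow_net = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 48, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(48, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, 43),  # 43 traffic-sign classes
    )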

No, I can't give you a solid starting point: I haven't done the months of experimentation required to tune a model for your problem. If you want to see an extreme example, run GoogLeNet for the recommended number of iterations, but feed it the MNIST database. Better yet, feed it screenshots of tic-tac-toe boards, classified only as "win", "draw", and "lose".
