Neural network: significantly uneven progress in learning across all input vectors
I am implementing a feedforward neural network that is trained using backpropagation. When I output the error rate after each test case, I notice that after a few epochs it starts to learn certain test cases well, but others very poorly. That is, some test cases have a very low error, but others have a very high error.
Essentially, after several epochs, I notice that the mean squared error gets stuck in the following pattern (each line represents the MSE after one test case).
0.6666666657496451
0.6666666657514261
1.5039854423139616E-10
1.4871467103001578E-10
1.5192940136144856E-10
1.4951558809679557E-10
0.6666521719715195
1.514803547256445E-10
1.5231135866323182E-10
0.6666666657507451
1.539071732985272E-10
What possible reason(s) could there be for this?
I originally thought the cases causing high error rates might just be outliers, but there are too many of them, as the pattern shows. Maybe my learner has just hit a local minimum and needs some way to get out of it?
My answer is directed at a possible solution to the "uneven" progress in training your classifier. As for "why" you are seeing this behavior, I defer. In particular, I hesitate to attribute causes to artifacts observed in the middle of training: is it the data? The MLP implementation? The configuration you chose? The point is that the interaction of your classifier with the data produced this observation, not some inherent feature of either one.
It is not uncommon for a classifier to learn some input vectors well, and quite quickly (i.e., [observed - predicted]^2 becomes very small after only a small number of cycles / epochs), while the same classifier fails repeatedly, and does not improve, on other input vectors.
To complete the training of your classifier successfully, Boosting is the textbook answer to the problem described in your question.
Before going any further, a small mistake in your configuration / setup could also account for the behavior you observed.
In particular, maybe check these items in your config:

Are your input vectors correctly encoded, e.g. so that their range is [-1, 1]?

Have you encoded your response variables correctly (i.e. 1-of-C encoding)?

Have you chosen a reasonable initial learning rate and momentum? And have you tried training with learning rates adjusted to either side of that initial value?
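As an illustration of the first two checks, here is a minimal sketch in plain Python. The helper names scale_to_unit_range and one_of_c are hypothetical, chosen for this example only; your own preprocessing may well look different.

```python
def scale_to_unit_range(column):
    """Linearly rescale a list of numbers so that min -> -1 and max -> +1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in column]
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in column]


def one_of_c(label, classes):
    """1-of-C encode a label: 1.0 at the label's position, 0.0 elsewhere."""
    return [1.0 if label == c else 0.0 for c in classes]


# Hypothetical feature column and class list, for illustration only.
scaled = scale_to_unit_range([3.0, 5.0, 7.0])
encoded = one_of_c("b", ["a", "b", "c"])
print(scaled)   # [-1.0, 0.0, 1.0]
print(encoded)  # [0.0, 1.0, 0.0]
```

Scaling is done per feature column here; whether that or a global scaling suits your data is a judgment call.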
Anyway, assuming these configuration and setup issues are in order, here are the relevant implementation details. Boosting (which is, strictly speaking, a technique for combining multiple classifiers) works like this:
After a number of epochs, examine the results (as you have done). Those data vectors that the classifier failed to learn are assigned a weighting factor that increases their error (some number greater than 1); likewise, those data vectors that the classifier learned well are also assigned a weighting factor, but here the value is less than one, so that the importance of their training error is decreased.
So, for example, suppose that at the end of the first epoch (one pass over all the data vectors in your training set) the total error is 100; in other words, the squared error, (observed value - predicted value)^2, summed over all data vectors in the training set.
These are the two MSE values listed in your question:
0.667 # poorly learned input vector => assign error multiplier > 1
1.5e-10 # well-learned input vector => assign error multiplier < 1
In Boosting, you would find the input vectors that correspond to these two error measurements and associate an error weight with each; this weight will be greater than one in the first case and less than one in the second. Suppose you assign error weights of 1.3 and 0.7, respectively. Further, suppose that after the next epoch your classifier has not improved on the first of these two input vectors, that is, it returns the same predicted values as in the previous epoch. For this iteration / epoch, however, the contribution to the total error from this input vector is not 0.67 but 1.3 x 0.67, or approx. 0.87.
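The arithmetic in this example can be sketched as follows. The threshold used to split "poorly learned" from "well learned" vectors is an assumption made only for this illustration, and the small error is taken to be 1.5e-10; the multipliers 1.3 and 0.7 are the ones from the example.

```python
def assign_error_weights(errors, threshold=0.1, up=1.3, down=0.7):
    """Give poorly learned vectors a weight > 1 and well-learned ones a weight < 1.

    The threshold is a hypothetical cut-off, not part of any standard recipe.
    """
    return [up if e > threshold else down for e in errors]


# Per-vector squared errors: one poorly learned vector, one well-learned vector.
errors = [0.67, 1.5e-10]
weights = assign_error_weights(errors)

# Each vector's contribution to the total error after weighting.
contributions = [w * e for w, e in zip(weights, errors)]
total_weighted_error = sum(contributions)

print(weights)                     # [1.3, 0.7]
print(round(contributions[0], 2))  # 0.87
```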
What is the effect of this magnified error on training?
A larger error means a steeper gradient, and therefore, on the next iteration, a larger adjustment to the relevant weights in the weight matrices. In other words, training is focused more strongly on that particular input vector.
You can imagine that each of these data vectors starts with an implicit error weight of 1.0. Boosting simply increases this error weight for vectors the classifier fails to learn and decreases it for vectors it learns well.
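To make the effect on the gradient concrete, here is a minimal sketch for a single linear unit with a squared-error loss. It stands in for one layer of an MLP and is not your network; the function name weighted_sgd_step, the learning rate, and the toy inputs are all assumptions for this illustration.

```python
def weighted_sgd_step(params, x, target, sample_weight, lr=0.1):
    """One gradient step on sample_weight * (prediction - target)^2 for a linear unit."""
    prediction = sum(p * xi for p, xi in zip(params, x))
    err = prediction - target
    # d/dp_i [sample_weight * err^2] = 2 * sample_weight * err * x_i
    grads = [2.0 * sample_weight * err * xi for xi in x]
    return [p - lr * g for p, g in zip(params, grads)]


params = [0.0, 0.0]
x = [1.0, 0.5]

plain = weighted_sgd_step(params, x, target=1.0, sample_weight=1.0)
boosted = weighted_sgd_step(params, x, target=1.0, sample_weight=1.3)

# Starting from zero parameters, the boosted update is 1.3 times the plain one.
print(plain)
print(boosted)
```

Because the error weight multiplies the loss, it multiplies the gradient by the same factor, which is why a weight of 1.3 gives a step 1.3 times as large for that vector.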
What I just described is the reweighting idea behind AdaBoost, which is probably the best-known Boosting implementation. For guidance and even code for language-specific implementations, take a look at boosting.com (seriously). That site is no longer maintained, so here are a few excellent resources that I have relied on and can highly recommend. The first is an academic site in the form of an annotated bibliography (including links to the papers discussed on the site). The first paper listed on that site (with a PDF link), The Boosting Approach to Machine Learning: An Overview, is an excellent overview and an efficient source for gaining a working knowledge of this family of techniques.
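For reference, the reweighting step of standard discrete AdaBoost for binary labels (+1 / -1) can be sketched as below. Unlike the fixed multipliers 1.3 and 0.7 used in the example above, AdaBoost derives the multipliers from the weighted error of the current classifier. The function name and the toy data are hypothetical, and the sketch assumes the incoming weights sum to 1.

```python
import math


def adaboost_reweight(weights, labels, predictions):
    """One round of AdaBoost sample reweighting for labels/predictions in {-1, +1}."""
    # Weighted error rate of the current classifier.
    eps = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Clamp to avoid division by zero or log of zero in degenerate cases.
    eps = min(max(eps, 1e-12), 1.0 - 1e-12)
    # The classifier's vote in the final ensemble.
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Misclassified samples (y * h = -1) grow; correct ones (y * h = +1) shrink.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]  # the second sample is misclassified

alpha, new_weights = adaboost_reweight(weights, labels, predictions)
print(alpha)
print(new_weights)
```

After normalization the misclassified sample carries half of the total weight, so the next classifier in the ensemble is pushed to get it right.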
There is also a great video tutorial on Boosting and AdaBoost at videolectures.net