How to avoid overfitting with a genetic algorithm

I am facing the following problem. I have a system that ranks operations according to their anomaly scores. To improve its performance, I applied a genetic algorithm to perform feature selection, so that the most anomalous operations appear in the first positions. Strictly speaking it is not feature selection, because I am not using binary variables but floating-point weights between 0 and 1 that sum to 1.

I currently have a population of 200 individuals evolved for 50 generations. I use the system itself as the fitness function and rate the quality of a solution by its true positive rate, i.e. how many abnormal operations appear in the first N positions (where N is the number of abnormal operations). The crossover operator is uniform crossover, and mutation changes the value of a single gene of an individual; after every such operation I repair the individual so that its weights again sum to 1. Finally, I use elitism to keep the best solution over time.
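
For context, my mutation-and-repair step is essentially the following (a simplified sketch, not my actual code; the perturbation size `sigma` and the clipping are illustrative assumptions):

```python
import numpy as np

def mutate(weights, rng, sigma=0.1):
    """Perturb one randomly chosen weight, then repair the individual so the
    weights stay in [0, 1] and again sum to 1."""
    child = weights.copy()
    i = rng.integers(len(child))          # pick one gene
    child[i] += rng.normal(0.0, sigma)    # change its value
    child = np.clip(child, 0.0, 1.0)      # keep every weight in [0, 1]
    return child / child.sum()            # renormalize so the sum is 1

rng = np.random.default_rng(0)
parent = rng.dirichlet(np.ones(10))       # a random weight vector summing to 1
child = mutate(parent, rng)
print(child.sum())                        # 1.0 (up to floating-point error)
```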

I noticed that one feature consistently gets a very high weight. It is often an important feature, but not always, and this leaves very low weights for the other features. I suspect my GA is overfitting. Can you help me find good stopping criteria?

1 answer


Overfitting in genetic algorithms and genetic programming is a big issue that is currently a focus of the GP community, myself included. Most of the research concentrates on genetic programming and the evolution of classification/regression models, but it may well apply to your problem too. There are some papers that might help you (and which I also work with):

  • Gonçalves, Ivo, and Sara Silva. "Experiments on Controlling Overfitting in Genetic Programming." Proceedings of the 15th Portuguese Conference on Artificial Intelligence: Progress in Artificial Intelligence (EPIA). Vol. 84. 2011.
  • Langdon, W. B. "Minimising Testing in Genetic Programming." RN 11/10 (2011): 1.
  • Gonçalves, Ivo, et al. "Random Sampling Technique for Overfitting Control in Genetic Programming." Genetic Programming. Springer Berlin Heidelberg, 2012. 218-229.
  • Gonçalves, Ivo, and Sara Silva. "Balancing Learning and Overfitting in Genetic Programming with Interleaved Sampling of Training Data." Genetic Programming. Springer Berlin Heidelberg, 2013.

You can find the papers (the first two are directly available as PDFs) by searching for their titles on scholar.google.com.



Basically, what all these papers work with is the idea of using only a subset of the training data to guide evolution and (randomly) changing that subset each generation (while using the same subset for all individuals within a generation). Interestingly, the experiments show that the smaller the subset, the less overfitting occurs, up to the extreme of using just a single-element subset. The papers build on this idea and extend it with some tweaks (like interleaving between the full dataset and a subset). But as I said at the beginning, this is all about symbolic regression (more or less), not feature selection.
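
In code, the core of that per-generation subsampling could look roughly like this (a minimal sketch; `fitness` and `make_next_generation` are placeholders for your own scoring function and variation operators, not part of any of the papers):

```python
import random

def evolve(population, data, generations, subset_size,
           fitness, make_next_generation):
    """Sketch of the subsampling idea: every generation draws one random subset
    of the training data, and all individuals of that generation are evaluated
    on the same subset."""
    for _ in range(generations):
        subset = random.sample(data, subset_size)          # new subset each generation
        scored = [(fitness(ind, subset), ind) for ind in population]
        population = make_next_generation(scored)          # selection, crossover, mutation
    return population
```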

I personally once tried a different approach (again for symbolic regression with genetic programming): a subset of the training data (e.g. half) drives evolution (i.e. fitness), but the "best so far" solution is determined using the results on the remaining training data. The overfitting was much less pronounced.
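
Roughly, that scheme looks like this (again just a sketch, assuming a generic `fitness(individual, data)` function and a `make_next_generation` helper, which are not part of your system):

```python
import random

def evolve_with_holdout(population, data, generations,
                        fitness, make_next_generation):
    """Sketch: selection is driven by one half of the training data, while the
    'best so far' individual is chosen on the other, held-out half."""
    shuffled = list(data)
    random.shuffle(shuffled)
    half = len(shuffled) // 2
    fitness_half, holdout_half = shuffled[:half], shuffled[half:]

    best, best_holdout = None, float("-inf")
    for _ in range(generations):
        # Fitness used by selection comes only from the first half.
        scored = [(fitness(ind, fitness_half), ind) for ind in population]
        # "Best so far" is judged on the held-out half instead.
        for _, ind in scored:
            holdout_score = fitness(ind, holdout_half)
            if holdout_score > best_holdout:
                best, best_holdout = ind, holdout_score
        population = make_next_generation(scored)
    return best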
