Separating training and test data

Can someone recommend the best split percentage between training and test data in machine learning? What are the disadvantages of splitting the training and test data 30-70%?



2 answers


There is no "right way" to split your data, unfortunately. People use different values, chosen based on different heuristics, gut feeling and personal experience or preference. The Pareto principle (80-20) is a good starting point.

Sometimes a simple split is not an option because you have too much data, in which case you may need to sample your data or use smaller test sets if your algorithm is computationally expensive. The important part is to select your data randomly. The tradeoff is pretty simple: less test data means your performance estimate will have more variance; less training data means your parameter estimates will have more variance.
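As a rough illustration of a random split, here is a minimal sketch using only the Python standard library. The function name `train_test_split`, its parameters and the toy data are assumptions made for this example, not something from the answer itself.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(items)         # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 100 items, split 80-20.
data = list(range(100))
train, test = train_test_split(data, test_fraction=0.2, seed=42)
print(len(train), len(test))  # 80 20
```

Shuffling before cutting is what makes the selection random; slicing an unshuffled list would just take the first rows, which may be ordered by time or by class.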



More important to me personally than the size of the split is that you should not run your tests only once on a single split, as the result might be biased (you may or may not have been lucky with your split). This is why you should test multiple configurations (for example, run your tests X times, each time choosing a different 20% for testing). Here you may run into the so-called variance of the model: different splits will lead to different values. This is why you should run the tests multiple times and average the results.
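The idea of repeating the split and averaging could be sketched roughly as follows. Everything here is hypothetical and only for illustration: the toy 1-D data, the threshold "model" and the helper names stand in for whatever data and algorithm you actually use.

```python
import random
import statistics

def make_toy_data(n=200, seed=0):
    """Hypothetical 1-D data: class 0 centred at 0.0, class 1 at 2.0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data

def fit_threshold(train):
    """Toy model: a threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, test):
    """Fraction of test points that fall on the correct side of the threshold."""
    hits = sum((x > threshold) == (y == 1) for x, y in test)
    return hits / len(test)

def repeated_holdout(data, n_repeats=20, test_fraction=0.2, seed=0):
    """Draw a fresh random split n_repeats times and score each one."""
    rng = random.Random(seed)
    n_test = max(1, round(len(data) * test_fraction))
    scores = []
    for _ in range(n_repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(accuracy(fit_threshold(train), test))
    return scores

scores = repeated_holdout(make_toy_data())
print(f"mean accuracy: {statistics.mean(scores):.3f}")
print(f"std across splits: {statistics.stdev(scores):.3f}")
```

The standard deviation across splits gives a feel for how much a single lucky or unlucky split could have misled you.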

With the above method, you might find it tedious to test all possible splits. A well-established method of splitting data is so-called cross-validation, and as you can read in the wiki article, there are several types of it (both exhaustive and non-exhaustive). Pay special attention to k-fold cross-validation.
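For reference, k-fold cross-validation can be sketched in a few lines of plain Python. The generator name `k_fold_splits` and the toy data are assumptions for this example; libraries such as scikit-learn ship their own, more complete versions.

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Yield k (train, test) pairs; each item is in exactly one test fold."""
    items = list(items)
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    rng = random.Random(seed)
    rng.shuffle(items)
    # Deal the shuffled items into k folds of (almost) equal size.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical data set of 10 items, 5 folds.
data = list(range(10))
splits = list(k_fold_splits(data, k=5, seed=1))
for train, test in splits:
    print(len(train), len(test))  # 8 2 on every line
```

You would train and score the model once per pair and then average the k scores.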



Read up on the different cross-validation strategies.

The 10%-90% split is popular because it arises from 10-fold cross-validation. But you could also use 3 or 4 folds (33-67 or 25-75).
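The percentages follow directly from the number of folds: with k folds, each test fold holds about 1/k of the data. A tiny sketch of that arithmetic (the helper name `split_percentages` is made up for this example):

```python
def split_percentages(k):
    """Return (test %, train %) for one round of k-fold cross-validation."""
    test_pct = round(100 / k)
    return test_pct, 100 - test_pct

for k in (10, 4, 3):
    test_pct, train_pct = split_percentages(k)
    print(f"{k} folds: {test_pct}-{train_pct}")
# 10 folds: 10-90
# 4 folds: 25-75
# 3 folds: 33-67
```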

Significantly larger errors arise from:



  • having duplicates in both the test set and the training set
  • unbalanced data

Be sure to merge all duplicates first, and use a stratified split if you have unbalanced data.
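A minimal sketch of both steps, assuming samples are hashable `(features, label)` pairs and that "duplicate" means an exact match. The function names and the toy 90/10 data are hypothetical; near-duplicates or duplicates with conflicting labels would need extra handling.

```python
import random
from collections import defaultdict

def deduplicate(samples):
    """Drop exact duplicate samples, keeping the first occurrence of each."""
    return list(dict.fromkeys(samples))

def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split each class separately so both sets keep the class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical unbalanced data: 90 negatives, 10 positives, 5 repeated rows.
samples = [((i,), 0) for i in range(90)] + [((i,), 1) for i in range(90, 100)]
samples += samples[:5]

unique = deduplicate(samples)
train, test = stratified_split(unique, test_fraction=0.2, seed=1)
print(len(unique), len(train), len(test))      # 100 80 20
print(sum(label for _, label in test))         # 2 positives in the test set
```

Deduplicating before the split matters: if it is done afterwards, the same row can already sit on both sides and inflate the test score.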
