How do I create an artificial dataset through a simple simulation model to analyze classification with a binary response and 4-5 predictors?

I need a simulation model that generates an artificial classification dataset with a binary response variable. Then I want to test the performance of various classifiers on this dataset. The dataset can have any number of features (predictors).



2 answers


This is a bad idea and won't tell you anything about the relative merits of classifiers.

I'll explain first how to create the data and then why you won't learn anything by doing this. You need a vector of binary features; there are many ways to do this, but let's take the simplest one: a vector of independent Bernoulli variables. Here's a recipe for creating as many instances as you like (a minimal code sketch follows the list):

  • For each feature i, draw a parameter theta_i, with 0 < theta_i < 1, from the uniform distribution
  • For each desired instance j, generate the i-th feature f_ji by drawing again from the uniform distribution: if the number you drew is less than theta_i, set f_ji = 1, otherwise set it to 0
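
A minimal NumPy sketch of this recipe; the sizes n_features and n_instances and the random seed are arbitrary placeholders, not part of the original answer:

    import numpy as np

    rng = np.random.default_rng(0)  # seed chosen only for reproducibility

    n_features = 10     # assumed number of binary features
    n_instances = 500   # assumed number of instances to generate

    # Step 1: one theta_i per feature, drawn uniformly from (0, 1)
    theta = rng.uniform(size=n_features)

    # Step 2: for each instance j and feature i, draw u ~ Uniform(0, 1)
    # and set f_ji = 1 if u < theta_i, otherwise 0
    u = rng.uniform(size=(n_instances, n_features))
    features = (u < theta).astype(int)

    print(features.shape)         # (500, 10)
    print(features.mean(axis=0))  # column means should be close to theta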

This will let you generate as many instances as you like. However, the problem is that you know the true distribution of the data, so you can derive the Bayes optimal decision rule: the theoretically optimal classifier. Under the generation scheme I gave above, the Naive Bayes classifier is close to optimal (if you used the actual Bayesian version, in which you integrate over the parameters, it would be the optimal classifier).

Does this mean that Naive Bayes is the best classifier? No, of course not: in practice we are interested in classifier performance on datasets where we do not know the true distribution of the data. Indeed, the whole idea of discriminative modeling is that, when the true distribution is unknown, trying to estimate it solves a harder problem than classification requires.

In short: think very carefully about what you want to do. You cannot simulate data and use it to determine which classifier is "best", because which classifier comes out best will depend on the recipe you used to simulate the data. If you want to examine kinds of data on which certain classifiers perform poorly or strangely, you can simulate data like that to support your hypothesis, but I don't think that is what you are trying to do.

EDIT:



I see now that you really want a binary response, not binary features. You can ignore some of what I said above.

The binary responses come from a logistic regression model:

log(p / (1 - p)) = w·x

where w is your weight vector and x is your feature vector. To simulate a response for a given x from this model, take the dot product w·x and apply the inverse logit function:

p = logit^-1(w·x) = 1 / (1 + exp(-w·x))

This gives you a number p between 0 and 1. Then draw the response as a Bernoulli variable with parameter p, i.e. draw a uniform number in [0, 1] and return 1 if it is less than p, otherwise return 0.
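
A minimal NumPy sketch of this response simulation; the weight vector w and the feature matrix X below are arbitrary placeholders that you would choose (or simulate) yourself:

    import numpy as np

    rng = np.random.default_rng(1)

    n_instances = 500
    w = np.array([1.5, -2.0, 0.5, 0.0])         # assumed weight vector
    X = rng.normal(size=(n_instances, w.size))  # placeholder feature matrix

    # Inverse logit: p = 1 / (1 + exp(-w·x)) for every instance
    p = 1.0 / (1.0 + np.exp(-(X @ w)))

    # Bernoulli draw: u ~ Uniform[0, 1), response is 1 if u < p, else 0
    y = (rng.uniform(size=n_instances) < p).astype(int)

    print(y[:10])
    print(p[:10].round(2))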

If you want to simulate the xs too, you can, but then you are back in the territory of my discussion above. Also note that because the data come from a logistic regression model, a logistic regression classifier will have an obvious advantage here, for the reasons I described above...



You need to decide what distribution you want to generate the data from; most likely a normal distribution. Then you need to map the data points to classes (a short sketch follows the links below).

Normal distribution: an example of an algorithm for generating a random value in a normally distributed dataset

Gaussian distribution: C++: generate Gaussian distribution

Generating data in Excel: http://www.databison.com/index.php/how-to-generate-normal-distribution-sample-set-in-excel/
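
A minimal NumPy sketch of that idea: draw points from two normal distributions with different, arbitrarily chosen means, and label each point with the class of the distribution it came from (the sizes and means are placeholders, not part of the original answer):

    import numpy as np

    rng = np.random.default_rng(2)

    n_per_class = 250  # assumed number of points per class

    # Class 0: 2-D Gaussian centred at (0, 0); class 1: centred at (2, 2)
    x0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
    x1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n_per_class, 2))

    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])

    print(X.shape, y.shape)  # (500, 2) (500,)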







