How to extend a model trained on a small sample so it covers all aspects of the training data

I was asked in an interview to solve a use case with machine learning: detect transaction fraud using a machine learning algorithm. My dataset contains 100,200 transactions, of which 100,000 are legitimate and 200 are fraudulent.

I cannot use the dataset as a whole to build the model, because it is so imbalanced that the resulting model would be very bad.

Say, for example, I take a sample of 200 good (legitimate) transactions and the 200 fraudulent transactions and use them as the training data for the model.

The question I was asked is how I can scale from those 200 good transactions back to the whole dataset of 100,000 good records, so that my results apply to all types of transactions. I have never solved a scenario like this, so I didn't know how to approach it.

Any guidance as to how I can accomplish this would be helpful.

2 answers


This is a common question in interviews. The information given about the problem is brief and uncertain (we do not know, for example, the number of features!). The first thing to ask yourself is: what does the interviewer want to get out of my answer? The answer should be framed with that context in mind. This means we do not need to find a definitive "solution", but rather give arguments that show we really know how to approach the problem.

The problem as presented is that the minority class (fraud) makes up only 0.2% of the total. This is obviously a huge imbalance. A predictor that simply labelled every case as "not fraudulent" would have a classification accuracy of 99.8%! So something definitely needs to be done.
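To see the trap concretely, here is a minimal sketch with scikit-learn (the class counts are taken from the question; the feature matrix is a hypothetical placeholder): a baseline that always predicts the majority class scores 99.8% accuracy while catching zero fraud.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels only: 100,000 legitimate (0) and 200 fraudulent (1), as in the question.
y = np.array([0] * 100_000 + [1] * 200)
X = np.zeros((len(y), 1))  # placeholder features; the real features are unknown here

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))  # ~0.998 -> accuracy looks great
print(recall_score(y, pred))    # 0.0    -> but not a single fraud is caught
```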

We define our main task as a binary classification problem, where we want to predict whether a transaction is positive (fraudulent) or negative (not fraudulent).

The first step is to consider what methods we have to reduce the imbalance. This can be done either by shrinking the majority class (undersampling) or by increasing the number of minority samples (oversampling). Both have disadvantages: the former discards potentially useful information from the dataset, while the latter can lead to overfitting. Oversampling methods that mitigate this, such as SMOTE and ADASYN, use strategies to increase diversity when generating new synthetic samples.
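A minimal sketch of what oversampling looks like in practice, assuming the imbalanced-learn library and a made-up feature matrix (the question gives no actual features, so the data below is purely illustrative):

```python
import numpy as np
from imblearn.over_sampling import ADASYN, SMOTE

rng = np.random.default_rng(0)
# Toy stand-in data: 1,000 legitimate and 20 fraudulent transactions, 5 features each.
X = rng.normal(size=(1_020, 5))
y = np.array([0] * 1_000 + [1] * 20)

# SMOTE synthesizes new minority samples by interpolating between nearest neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))  # classes are now balanced, e.g. [1000 1000]

# ADASYN is a drop-in alternative that focuses on harder-to-learn minority points.
X_res2, y_res2 = ADASYN(random_state=0).fit_resample(X, y)
```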



Of course, cross-validation becomes paramount in this case. Also, if we do end up oversampling, it must be "coordinated" with the cross-validation scheme so that we get the most out of both ideas. For more details, see http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation .
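One way to "coordinate" the two ideas (again a hedged sketch, assuming imbalanced-learn and the same kind of toy data) is to put the oversampler inside an imblearn pipeline, so it is re-fitted on each training fold and the validation fold never sees synthetic samples:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, which accepts samplers
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1_020, 5))              # toy stand-in features
y = np.array([0] * 1_000 + [1] * 20)

# SMOTE is re-fitted on the training portion of every fold, so the
# held-out fold never contains synthetic samples.
model = Pipeline([
    ("oversample", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```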

Apart from these sampling ideas, when choosing our learner we should remember that many ML techniques can be trained / optimized for specific metrics. In our case we do not really care about accuracy. Instead, we want to optimize the model for ROC-AUC, or specifically aim for high recall even at the cost of precision, since we want to catch all the fraudulent ("positive") cases, or at least raise alarms for them, even though some of those alarms will be false. Models can adjust internal parameters (thresholds) to find the optimal balance between the two metrics. Take a look at this nice blog to learn more about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
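For instance, the decision threshold can be lowered until recall reaches a target, accepting more false alarms. This sketch uses scikit-learn on the same kind of toy data; the 80% recall target and the classifier choice are illustrative assumptions, not part of the original question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1_020, 5))               # toy stand-in features
y = np.array([0] * 1_000 + [1] * 20)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]        # estimated probability of fraud

# Lower the decision threshold until recall reaches the target,
# trading precision for recall (more false alarms).
precision, recall, thresholds = precision_recall_curve(y_val, proba)
target_recall = 0.80
ok = recall[:-1] >= target_recall             # the last curve point has no threshold
threshold = thresholds[ok].max() if ok.any() else thresholds.min()

alarms = (proba >= threshold).astype(int)     # transactions flagged for review
print(threshold, recall_score(y_val, alarms))
```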

Finally, it is just a matter of evaluating the model empirically to see which techniques and parameters are most appropriate for the dataset at hand. Following these ideas does not guarantee 100% that we will solve the problem, but it does put us in a much better position to learn from the data and catch those villainous scammers out there, and perhaps land the job along the way ;)

In this problem, you want to classify transactions as good or fraudulent. However, your data is heavily imbalanced. This is where you will probably be interested in Anomaly Detection. I'll let you read the whole article for the details, but I'll quote several parts in my answer.

I think this will convince you that this is what you are looking for to solve this problem:

Isn't that just a classification?

The answer is yes if the following three conditions are met.

1. You have labeled training data.
2. Anomalous and normal classes are balanced (say at least 1:5).
3. The data is not autocorrelated (each data point is independent of previous data points; this often breaks down in time-series data).

If all of the above is true, we don't need anomaly detection and we can use an algorithm like Random Forests or Support Vector Machines (SVMs).

However, it is often hard to find training data, and even when you can find it, most anomalies occur at rates of 1:1000 to 1:10^6, so the classes are nowhere near balanced.

Now, to answer your question:

Typically, class imbalances are resolved with an ensemble built by resampling the data many times. The idea is to first create new datasets by taking all the anomalous data points and adding a subset of the normal data points (for example, four times as many normal points as anomalous ones). Then a classifier is built for each dataset using SVM or Random Forest, and those classifiers are combined using ensemble learning. This approach has worked well and produced very good results.
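A rough sketch of that resampling ensemble (my own illustrative code with scikit-learn and toy data, not the article's implementation): each base model sees every anomaly plus a fresh random subset of normal points, and the final prediction averages their votes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy stand-in data: 1,000 normal (0) and 20 anomalous (1) points, 5 features each.
X = rng.normal(size=(1_020, 5))
y = np.array([0] * 1_000 + [1] * 20)

normal_idx = np.where(y == 0)[0]
anomaly_idx = np.where(y == 1)[0]

models = []
for seed in range(10):
    # Each base dataset: every anomaly plus ~4x as many randomly drawn normal points.
    sampled_normals = rng.choice(normal_idx, size=4 * len(anomaly_idx), replace=False)
    idx = np.concatenate([anomaly_idx, sampled_normals])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    models.append(clf.fit(X[idx], y[idx]))

# Combine the base classifiers by averaging their fraud probabilities (soft voting).
fraud_score = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
flagged = fraud_score >= 0.5
print(flagged.sum())
```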

If the data points are autocorrelated with each other, then simple classifiers will not work well. We handle those use cases with time-series classification methods or recurrent neural networks.

I would also suggest a different approach to the problem. In this article, the author says:



If you don't have training data, you can still do anomaly detection using unsupervised learning and semi-supervised learning. However, after building the model you have no idea how well it is doing, since you have nothing to test it against. Hence, the results of these methods must be field-tested before placing them in the critical path.

However, you do have some fraud data, so you can check whether your unsupervised algorithm is working or not; if it does a good enough job, it could be a first solution that helps collect more data for training a supervised classifier later.
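As an illustration of that unsupervised route, here is a hedged sketch using scikit-learn's IsolationForest on toy data (the features, the contamination rate and the class sizes are all made up for the example); the known fraud cases are used only afterwards, to check how many the detector flags.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Toy stand-in data: 1,000 normal and 20 fraudulent transactions, 5 features each.
X_normal = rng.normal(size=(1_000, 5))
X_fraud = rng.normal(loc=4.0, size=(20, 5))  # pretend fraud looks different
X = np.vstack([X_normal, X_fraud])
y = np.array([0] * 1_000 + [1] * 20)         # labels used only for checking, not fitting

# Fit without labels; contamination is the expected share of anomalies
# (0.2% in the question, ~2% in this toy set).
detector = IsolationForest(contamination=0.02, random_state=0).fit(X)
pred = detector.predict(X)                   # -1 = anomaly, +1 = normal

caught = np.sum((pred == -1) & (y == 1))
print(f"flagged {np.sum(pred == -1)} transactions, {caught} of 20 known frauds")
```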


Note that I am not an expert; this is just what I came up with by combining my own knowledge with some articles I recently read on the subject.

For more information on machine learning, I suggest you use this StackExchange community.

Hope this helps you :)
