Linking a machine learning prediction to the original dataset

I am building a POC on retail transaction data, using several machine learning algorithms to come up with a predictive model for stock analysis. My questions may sound silly, but I would really appreciate it if you or someone else could answer them.

So far I have managed: get the dataset ==> convert features to (LabeledPoint, feature vectors) ==> train the ML model ==> run the model on the test dataset ==> get predictions.

Problem 1:

Since I have no experience with Java / Python / Scala, I create my features in a database and export the data as a CSV file for my machine learning algorithm.

How do I create features with Scala from raw data?

Problem 2:

The source dataset consists of many feature sets (Store, Product, Date) and recorded OOS events (the target):

StoreID (text column), ProductID (text column), TranDate, (label / target), Feature1, Feature2, ..., FeatureN

Since feature vectors can only contain numeric values, I build features only from the numeric columns, not the text columns (which form my natural key). When I run the model on a validation set, I get back an array of (Prediction, Label) pairs.

Now, how do I link this result set back to the original dataset and see which particular (Store, Product, Date) might have a possible Out Of Stock event?

I hope the description of the problem was clear enough.




1 answer

Spark Linear Regression Example

Here's a snippet from the Spark MLlib Linear Regression example docs, which is instructive and easy to follow.

It solves both your "problem 1" and "problem 2".

It doesn't need JOINs and doesn't even rely on RDD ordering.

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/")


Here data is an RDD of text strings.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()


Problem 1: Parsing features

It depends on the data. Here we see that each line is split on ',' into fields. This CSV data happens to be entirely numeric.

The first field is treated as the label (the dependent variable), and the rest of the fields are converted from text to Double (floating point) and packed into a vector. This vector holds the features, or independent variables.

In your own project, the part you have to remember is the goal of parsing into an RDD of LabeledPoints, where the first parameter of LabeledPoint, the label, is the true dependent numeric value, and the second parameter, the features, is a vector of numbers.

Getting your data into this shape does require some code; Python may be the easiest way to parse it. You can always use other tools to create a purely numeric CSV, with the dependent variable in the first column, the numeric features in the other columns, and no header row, and then duplicate the example's parsing function.
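If you do want to stay in Scala, a plain function can do the parsing while keeping your natural key. A minimal sketch, assuming the column layout from the question (StoreID, ProductID, TranDate, label, then numeric features); Row and parseLine are hypothetical names, not a Spark API:

```scala
// Assumed layout: StoreID,ProductID,TranDate,label,feature1,...,featureN
case class Row(key: (String, String, String), label: Double, features: Array[Double])

def parseLine(line: String): Row = {
  val parts = line.split(',')
  val key = (parts(0), parts(1), parts(2))     // natural key: Store, Product, Date
  val label = parts(3).toDouble                // target, e.g. the OOS event as 0/1
  val features = parts.drop(4).map(_.toDouble) // remaining numeric feature columns
  Row(key, label, features)
}
```

In Spark you would apply it with sc.textFile(path).map(parseLine), then build LabeledPoint(r.label, Vectors.dense(r.features)) while keeping r.key alongside for later.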

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)


At this point, we have a trained model object. The model object has a predict method that takes a feature vector and returns an estimate of the dependent variable.

Encoding text features

ML routines generally need numeric feature vectors, but you can often translate text or categorical features (color, size, brand name) into a numeric vector in some space. There are many ways to do this, such as bag-of-words for text or one-hot encoding for categorical data, where you code 1.0 or 0.0 for membership in each possible category (watch out for multicollinearity, though). These methods can create large feature vectors, which is why Spark provides iterative training methods. Spark also has a SparseVector() class, where you can easily create vectors with all but certain features set to 0.0.
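As an illustration, one-hot encoding can be sketched in a few lines of Scala (oneHot is a hypothetical helper, not a Spark API):

```scala
// One-hot encoding: each category value becomes a vector with a single 1.0
// at that category's index and 0.0 everywhere else.
def oneHot(value: String, categories: Seq[String]): Array[Double] = {
  val index = categories.indexOf(value)
  require(index >= 0, s"unknown category: $value")
  Array.tabulate(categories.size)(i => if (i == index) 1.0 else 0.0)
}
```

For example, oneHot("green", Seq("red", "green", "blue")) gives (0.0, 1.0, 0.0); with MLlib's sparse representation the same vector is Vectors.sparse(3, Array(1), Array(1.0)).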

Problem 2: Comparing Model Predictions to True Values

The example then tests the model against the training data, but the calls would be the same with external test data, provided the test data is an RDD of LabeledPoint(dependent value, Vector(features)). The input can be changed by pointing the variable parsedData at a different RDD.

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}


Note that this returns, for each row or LabeledPoint, a tuple of the true dependent variable, previously stored in point.label, and the model's prediction from point.features.

We are now ready to compute the mean squared error, since the valuesAndPreds RDD contains (v, p) tuples of the true value v and the prediction p, both of type Double.

The MSE is a single number: first the tuples are mapped to the individual squared distances ||v - p||**2, and then averaged to give one number.

val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()

println("training Mean Squared Error = " + MSE)
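The same computation on plain Scala collections, as a sanity check of what the RDD version does (the numbers here are made up):

```scala
// Each pair is (true value v, prediction p).
val pairs = Seq((3.0, 2.5), (1.0, 0.0), (2.0, 2.0))
// Square each residual, then average: (0.25 + 1.0 + 0.0) / 3
val mse = pairs.map { case (v, p) => math.pow(v - p, 2) }.sum / pairs.size
```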


Spark Logistic Regression Example

It looks similar, but here you can see that the data has already been parsed and split into training and test sets.

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)


This is where the model is trained on the training set.

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)


And it is tested (compared) against the test set. Note that even though it is a different model (Logistic instead of Linear), there is still a model.predict method that takes a point's features as a parameter and returns a prediction for that point.

Again, the prediction is paired with the true value, from the label, in a comparison tuple for the performance metric.

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision)


What about joins? RDD.join comes in when you have two RDDs of (key, value) pairs and need an RDD that combines the values for matching keys. But we don't need that here.
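To tie this back to the original question: if you carry the (Store, Product, Date) key alongside the features through the map, every prediction comes out already paired with its key, with no join and no reliance on ordering. A minimal sketch on plain Scala collections (predict here is a made-up stand-in for model.predict; in Spark the same map works on an RDD of (key, features) pairs):

```scala
// Hypothetical stand-in for model.predict on a trained Spark model.
def predict(features: Array[Double]): Double =
  if (features.sum > 1.0) 1.0 else 0.0

// Keyed rows: ((StoreID, ProductID, TranDate), features)
val keyedRows = Seq(
  (("S1", "P9", "2015-01-02"), Array(0.7, 0.9)),
  (("S2", "P3", "2015-01-02"), Array(0.1, 0.2))
)

// The map keeps the key next to each prediction -- no join needed.
val keyedPreds = keyedRows.map { case (key, features) => (key, predict(features)) }

// Keys of rows predicted as a possible Out Of Stock event:
val possibleOOS = keyedPreds.filter { case (_, p) => p == 1.0 }.map(_._1)
```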


