Spark MLlib LinearRegressionWithSGD returns NaN weights

I am trying to run the Spark MLlib packages in pyspark on a test dataset. I split the dataset into half training data and half test data. Below is the code that builds the model, but it returns NaN for every weight (coefficient), and I can't figure out why. It does work when I standardize the data with StandardScaler.

from pyspark.mllib.regression import LinearRegressionWithSGD

# train a linear regression model with SGD on the training half
model = LinearRegressionWithSGD.train(train_data, step=0.01)
# evaluate the model on the test half
valuesAndPreds = test_data.map(lambda p: (p.label, model.predict(p.features)))
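For reference, the evaluation step can be completed by computing the mean squared error from valuesAndPreds; a minimal sketch, assuming test_data is an RDD of LabeledPoint as above:

# mean squared error over the test set
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
print("test MSE = %s" % MSE)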


Many thanks for the help.

Below is the code I used to scale.

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

# fit the scaler on the feature vectors, then rebuild LabeledPoints from the scaled features
scaler = StandardScaler(withMean=True, withStd=True).fit(data.map(lambda x: x.features))
feature = [scaler.transform(x) for x in data.map(lambda x: x.features).collect()]
label = data.map(lambda x: x.label).collect()
scaledData = [LabeledPoint(l, f) for l, f in zip(label, feature)]
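The same scaling can also stay distributed, since StandardScalerModel.transform accepts an RDD of vectors as well as a single vector; a minimal sketch of that variant, with no collect() needed:

labels = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
# zip the labels back onto the scaled feature RDD and rebuild LabeledPoints
scaledData = labels.zip(scaler.transform(features)).map(lambda lf: LabeledPoint(lf[0], lf[1]))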

      



1 answer


Try scaling the features

StandardScaler standardizes features by scaling them to unit variance and/or removing the mean, using column summary statistics computed on the training set samples. It is a very common preprocessing step.



Standardization can improve the convergence rate during optimization, and it can also prevent features with very large values from dominating the model during training. Since some of your variables are large numbers (for example, income) and some are much smaller (for example, number of customers), this should solve your problem.
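Putting it together, here is a minimal sketch (assuming train_data and test_data are RDDs of LabeledPoint, as in the question) of standardizing the features and then retraining:

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# fit the scaler on the training features only, then apply it to both splits
train_features = train_data.map(lambda p: p.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(train_features)

scaled_train = train_data.map(lambda p: p.label) \
    .zip(scaler.transform(train_features)) \
    .map(lambda lf: LabeledPoint(lf[0], lf[1]))
scaled_test = test_data.map(lambda p: p.label) \
    .zip(scaler.transform(test_data.map(lambda p: p.features))) \
    .map(lambda lf: LabeledPoint(lf[0], lf[1]))

model = LinearRegressionWithSGD.train(scaled_train, step=0.01)
print(model.weights)   # weights should no longer be NaN once the features are standardized

Fitting the scaler on the training split only and reusing it on the test split keeps both sets on the same scale.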
