Spark MLlib LinearRegressionWithSGD returns NaN weights
I am trying to run Spark MLlib's LinearRegressionWithSGD in PySpark on a test dataset, splitting it into half training data and half test data. Below is the code that builds the model. However, the resulting weights are NaN, NaN, ... for every feature, and I can't figure out why. It does work when I standardize the data with the StandardScaler feature transformer first.
from pyspark.mllib.regression import LinearRegressionWithSGD

# train a linear regression model with SGD
model = LinearRegressionWithSGD.train(train_data, step=0.01)

# evaluate the model on the test data set
valuesAndPreds = test_data.map(lambda p: (p.label, model.predict(p.features)))
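For completeness, this is a sketch of how the NaN weights show up and how the test error can be checked (the mean squared error lines are illustrative, not part of the original code):

# inspect the fitted weights -- this is where the NaNs appear
print(model.weights)      # e.g. [nan, nan, ..., nan]
print(model.intercept)

# one possible test metric: mean squared error over the (label, prediction) pairs
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
print("Mean Squared Error = " + str(MSE))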
Many thanks for the help.
Below is the code I used to scale.
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

# fit the scaler on the feature columns, then rebuild the LabeledPoints
scaler = StandardScaler(withMean=True, withStd=True).fit(data.map(lambda x: x.features))
feature = [scaler.transform(x) for x in data.map(lambda x: x.features).collect()]
label = data.map(lambda x: x.label).collect()
scaledData = [LabeledPoint(l, f) for l, f in zip(label, feature)]
Try scaling your features
StandardScaler standardizes features by scaling them to unit variance and/or removing the mean, using column summary statistics computed on the training set samples. This is a very common preprocessing step.

Standardization can improve the convergence rate of the optimization, and it also prevents features with very large values from dominating model training. Since some of your variables are large numbers (for example, income) and some are much smaller (for example, number of customers), this should solve your problem.
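Putting it together, here is a minimal sketch of scaling and then training, assuming (as in your question) that data is an RDD of LabeledPoints with dense feature vectors; the split ratio, seed, and step size are illustrative:

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

labels = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)

# withMean=True requires dense vectors; it centers each column and
# scales it to unit standard deviation
scaler = StandardScaler(withMean=True, withStd=True).fit(features)

# transform() also accepts an RDD of vectors, so no collect() is needed
scaledData = labels.zip(scaler.transform(features)) \
                   .map(lambda lp: LabeledPoint(lp[0], lp[1]))

# half training / half test split, then train on the scaled features
train_data, test_data = scaledData.randomSplit([0.5, 0.5], seed=42)
model = LinearRegressionWithSGD.train(train_data, step=0.01)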