LinearRegressionWithSGD () returns NaN

I'm trying to use LinearRegressionWithSGD on a dataset of millions of songs, and my model returns NaN as the weights and 0.0 as the intercept. What could be causing this? I am using Spark 1.4.0 offline.

Sample data: http://www.filedropper.com/part-00000

Here is my complete code:

// Import dependencies

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

      

// Define RDD

val data = sc.textFile("/home/naveen/Projects/millionSong/YearPredictionMSD.txt")

      

// Convert each line to a LabeledPoint

def parsePoint(line: String): LabeledPoint = {
  val x = line.split(",")
  val head = x.head.toDouble
  val tail = Vectors.dense(x.tail.map(_.toDouble))
  LabeledPoint(head, tail)
}

      

// Find the label range

val parsedDataInit = data.map(x => parsePoint(x))
val onlyLabels = parsedDataInit.map(x => x.label)
val minYear = onlyLabels.min()
val maxYear = onlyLabels.max()

      

// Shift the labels so they start at zero

val parsedData = parsedDataInit.map(x => LabeledPoint(x.label - minYear, x.features))

      

// Train / validation / test split

val splits = parsedData.randomSplit(Array(0.8, 0.1, 0.1), seed = 123)
val parsedTrainData = splits(0).cache()
val parsedValData = splits(1).cache()
val parsedTestData = splits(2).cache()

val nTrain = parsedTrainData.count()
val nVal = parsedValData.count()
val nTest = parsedTestData.count()

      

// RMSE

def squaredError(label: Double, prediction: Double): Double = {
  scala.math.pow(label - prediction, 2)
}

def calcRMSE(labelsAndPreds: RDD[List[Double]]): Double = {
  scala.math.sqrt(labelsAndPreds.map(x => squaredError(x(0), x(1))).mean())
}
val numIterations = 100
val stepSize = 1.0
val regParam = 0.01
val regType = "L2"
val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
  .setNumIterations(numIterations)
  .setStepSize(stepSize)
  .setRegParam(regParam)
val model = algorithm.run(parsedTrainData)

      


2 answers


I am not familiar with this particular SGD implementation, but generally, if a gradient descent solver produces NaN, it means the learning rate is too large (in this case, the stepSize variable).

Try lowering it by an order of magnitude each time until the model begins to converge.
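As a rough sketch of that advice, reusing the RDD names from the question and the MLlib 1.x API (the specific step-size grid here is illustrative, not prescriptive):

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Sweep decreasing step sizes; `parsedTrainData` and `parsedValData`
// are the RDD[LabeledPoint] splits from the question.
for (step <- Seq(1.0, 0.1, 0.01, 0.001)) {
  val algo = new LinearRegressionWithSGD()
  algo.optimizer
    .setNumIterations(100)
    .setStepSize(step)
    .setRegParam(0.01)
  val model = algo.run(parsedTrainData)

  // Check whether the weights diverged, and compute validation RMSE.
  val diverged = model.weights.toArray.exists(_.isNaN)
  val rmse = scala.math.sqrt(
    parsedValData
      .map(p => scala.math.pow(p.label - model.predict(p.features), 2))
      .mean())
  println(s"stepSize=$step diverged=$diverged RMSE=$rmse")
}
```

The first step size whose weights stay finite and whose validation RMSE stops improving is a reasonable place to stop the sweep.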



I think there are two possibilities:

  • stepSize is too big. You should try something like 0.01, 0.03, 0.1, 0.3, 1.0, 3.0 ....
  • Your training data contains NaN. If so, the result will likely be NaN.
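To rule out the second possibility, you can count non-finite values before training; a minimal check against the question's `parsedTrainData` (assuming it is an RDD[LabeledPoint]):

```scala
// Count rows whose label or any feature is NaN or infinite.
val badRows = parsedTrainData.filter { p =>
  p.label.isNaN ||
    p.features.toArray.exists(v => v.isNaN || v.isInfinite)
}.count()
println(s"rows with non-finite values: $badRows")
```

If the count is non-zero, filter those rows out (or impute the bad values) before calling `algorithm.run`.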

