Why does Spark ML ALS algorithm print RMSE = NaN?

Question

Why does Spark ML ALS algorithm print RMSE = NaN?

I am using ALS to predict rating, this is my code:

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("user_id")
  .setItemCol("business_id")
  .setRatingCol("stars")
val model = als.fit(training)

// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(testing)
predictions.sort("user_id").show(1000)
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("stars")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")

But get some negative ratings and the RMSE is Nan:

+-------+-----------+---------+------------+
|user_id|business_id|    stars|  prediction|
+-------+-----------+---------+------------+
|      0|       2175|      4.0|   4.0388923|
|      0|       5753|      3.0|   2.6875196|
|      0|       9199|      4.0|   4.1753435|
|      0|      16416|      2.0|   -2.710618|
|      0|       6063|      3.0|         NaN|
|      0|      23076|      2.0|  -0.8930751|

Root-mean-square error = NaN

How to get a good result?

+3

scala machine-learning apache-spark

Pi Pi Apr 21 17 at 14:06

source to share

3 answers

As of Spark 2.2.0, you can set the parameter coldStartStrategy

to drop

to remove any rows in the Prediction DataFrame that contain NaN values. The scoring metric will then be computed over data other than NaN and will be valid.

model.setColdStartStrategy("drop");

+1

Stanley 03 jan. 18 at 2:36 am

source to share

A small correction will help solve this problem:

prediction.na.drop ()

0

Ajay 11 nov. 17 at 9:34 am

source to share

jamborta · Accepted Answer · 2017-04-21T14:28:43+0000

Negative values are irrelevant, as the RMSE squares the values first. You probably have empty prediction values. You can leave them:

predictions.na().drop(["prediction"])

While this might be a little flawed, you can alternatively fill in these values with the lowest / highest / average rating.

I would also recommend to round x < min_rating

and x > max_rating

to the smallest / highest ratings, which will improve your RMSE.

EDIT:

More info here: https://issues.apache.org/jira/browse/SPARK-14489

Why does Spark ML ALS algorithm print RMSE = NaN?

More articles: