Posted to issues@spark.apache.org by "Krishna Sankar (JIRA)" <ji...@apache.org> on 2016/07/25 17:05:20 UTC

[jira] [Comment Edited] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

    [ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392293#comment-15392293 ] 

Krishna Sankar edited comment on SPARK-14489 at 7/25/16 5:04 PM:
-----------------------------------------------------------------

From my experience in the field and with R, a couple of thoughts:
Both ALS and the evaluator are doing the right thing - with the information they have and without any contextual directives.
1. For the evaluator, as mentioned earlier, a flag similar to R's na.rm (say ignoreNaN=false, to keep the current behavior) would be a good choice. I suspect we would need ignoreNaN elsewhere as well, for example in the CrossValidator.
2. For ALS, in the absence of a directive, we shouldn't substitute a default average recommendation or even 0; the current NaN is the right result. Depending on the context, an application might decide not to recommend anything, to fall back on a default recommendation, or to use a dynamically calculated value, e.g. over a recent window. So a parameter defaultRecommendation="NaN" or "average" or a fixed value would be a good choice to cover all the possibilities. Alternatively, the developer can use na.fill() before other operations. A rough sketch of both options follows the note below.
Note: Saw the coldStartStrategy in Nick's patch. Will dig further.
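For illustration, a minimal sketch of both options against the current DataFrame API. It assumes a predictions DataFrame produced by alsModel.transform(validationData) with placeholder column names "rating" and "prediction"; this is not an actual patch.

{code:title=NaNHandlingSketch.scala|borderStyle=solid}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.functions.avg

// predictions = alsModel.transform(validationData); cold-start users/items come back with prediction = NaN
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")           // placeholder label column name
  .setPredictionCol("prediction")

// What an ignoreNaN=true flag could do internally: score only the rows that have a prediction
val rmseIgnoringNaN = evaluator.evaluate(predictions.na.drop(Seq("prediction")))

// What defaultRecommendation="average" could do: fill NaN with the mean prediction first
val meanPrediction = predictions.na.drop(Seq("prediction"))
  .agg(avg("prediction")).first().getDouble(0)
val rmseWithDefault = evaluator.evaluate(predictions.na.fill(meanPrediction, Seq("prediction")))
{code}

Either way the choice stays with the application instead of being hard-coded into ALS or the evaluator.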


> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon 
>              Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.
> The reason is in CrossValidator.scala line 109. The K-folds are randomly generated. For large and sparse datasets, there is a significant probability that at least one user in the validation set is missing from the training set, which produces a few NaN estimates from the transform method and hence NaN RegressionEvaluator metrics as well.
> Suggestion to fix the bug: remove the NaN values while computing the rmse or the other metrics (i.e., drop users or items in the validation set that are missing from the training set), and log a message when this happens (a sketch follows the code block below).
> Issue SPARK-14153 seems to be the same problem.
> {code:title=CrossValidator.scala|borderStyle=solid}
>     val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
>     splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>       val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>       val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>       // multi-model training
>       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>       trainingDataset.unpersist()
>       var i = 0
>       while (i < numModels) {
>         // TODO: duplicate evaluator to take extra params from input
>         val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>         logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>         metrics(i) += metric
>         i += 1
>       }
>       validationDataset.unpersist()
>     }
> {code}
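A rough sketch of that suggestion against the loop quoted above - not an actual patch. It reuses the names from the snippet (models, epm, validationDataset, eval, splitIndex, and logWarning from the same Logging trait as logDebug) and assumes the default "prediction" column name.

{code:title=DropNaNBeforeEvaluate.scala|borderStyle=solid}
// Inside the while loop above: score only rows that actually have a prediction,
// and log how many validation rows were dropped because of cold-start users/items.
val predicted = models(i).transform(validationDataset, epm(i))
val withoutNaN = predicted.na.drop(Seq("prediction"))
val dropped = predicted.count() - withoutNaN.count()
if (dropped > 0) {
  logWarning(s"Dropped $dropped validation rows with NaN predictions " +
    s"(users/items unseen in training split $splitIndex).")
}
val metric = eval.evaluate(withoutNaN)
{code}

Gating this behind an evaluator-level ignoreNaN flag, as discussed in the comment above, would keep the current behavior as the default.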


