Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2019/11/10 19:21:00 UTC

[jira] [Resolved] (SPARK-29812) Missing persist on predictionAndLabels in MulticlassClassificationEvaluator

     [ https://issues.apache.org/jira/browse/SPARK-29812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-29812.
----------------------------------
    Resolution: Duplicate

> Missing persist on predictionAndLabels in MulticlassClassificationEvaluator
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-29812
>                 URL: https://issues.apache.org/jira/browse/SPARK-29812
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Major
>
> The RDD predictionAndLabels in ml.evaluation.MulticlassClassificationEvaluator.evaluate() needs to be persisted. When MulticlassMetrics uses predictionAndLabels to initialize its fields, at least five actions are executed on predictionAndLabels. Without persistence, each of those actions recomputes the RDD from the input dataset.
> {code:scala}
>   override def evaluate(dataset: Dataset[_]): Double = {
>     val schema = dataset.schema
>     SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
>     SchemaUtils.checkNumericType(schema, $(labelCol))
>     // Needs to be persisted
>     val predictionAndLabels =
>       dataset.select(col($(predictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
>         case Row(prediction: Double, label: Double) => (prediction, label)
>       }
>     // The initialization uses predictionAndLabels multiple times, in different actions.
>     val metrics = new MulticlassMetrics(predictionAndLabels)
>     val metric = $(metricName) match {
>       case "f1" => metrics.weightedFMeasure
>       case "weightedPrecision" => metrics.weightedPrecision
>       case "weightedRecall" => metrics.weightedRecall
>       case "accuracy" => metrics.accuracy
>     }
>     metric
>   }
> {code}
> This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
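
For illustration only, here is a minimal sketch of the kind of persist/unpersist wrapping the report asks for. It is not the change that was applied in Spark (this issue was resolved as a duplicate). The object PersistSketch and the helper accuracyWithPersist are hypothetical names, and the MEMORY_AND_DISK storage level is an assumption, not something the report specifies.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object; not part of Spark.
object PersistSketch {

  // Computes accuracy while keeping predictionAndLabels cached for the
  // duration of the metric computation, so the several actions triggered
  // by MulticlassMetrics reuse the cached partitions.
  def accuracyWithPersist(predictionAndLabels: RDD[(Double, Double)]): Double = {
    // Only persist (and later unpersist) if the caller has not already
    // chosen a storage level for this RDD.
    val needsPersist = predictionAndLabels.getStorageLevel == StorageLevel.NONE
    if (needsPersist) {
      // Assumed storage level; the report does not name one.
      predictionAndLabels.persist(StorageLevel.MEMORY_AND_DISK)
    }
    try {
      // The metric must be read inside the try block: it is what runs the
      // actions on predictionAndLabels.
      new MulticlassMetrics(predictionAndLabels).accuracy
    } finally {
      if (needsPersist) predictionAndLabels.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    // Made-up (prediction, label) pairs, for demonstration only.
    val pairs = spark.sparkContext.parallelize(
      Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)))
    println(accuracyWithPersist(pairs))
    spark.stop()
  }
}
```

Reading the metric inside the try block matters because unpersisting before the metric is computed would discard the cache before any action had a chance to use it.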


