Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2019/05/28 09:46:00 UTC

[jira] [Created] (SPARK-27867) RegressionEvaluator cache latest RegressionMetrics to avoid duplicated computation

zhengruifeng created SPARK-27867:
------------------------------------

             Summary: RegressionEvaluator cache latest RegressionMetrics to avoid duplicated computation
                 Key: SPARK-27867
                 URL: https://issues.apache.org/jira/browse/SPARK-27867
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.0.0
            Reporter: zhengruifeng


In most cases, given a model, we need to obtain multiple metrics for it.

For example, for a regression model we may need R2, MAE and MSE.

However, the current design of `Evaluator` does not support computing multiple metrics at once.

In practice, we usually use RegressionEvaluator like this:
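{code:java}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// df is assumed to be a DataFrame with "prediction" and "label" columns
val evaluator = new RegressionEvaluator()

// Each evaluate call below scans df again
val r2 = evaluator.setMetricName("r2").evaluate(df)
val mae = evaluator.setMetricName("mae").evaluate(df)
val mse = evaluator.setMetricName("mse").evaluate(df)
{code}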
 

However, the current implementation of RegressionEvaluator needs one pass over the whole input dataset to compute each metric, so the example above needs three passes.

This can be optimized, since in `RegressionMetrics` all metrics can be computed at once.
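
As a rough illustration of that point, the sketch below builds a single `RegressionMetrics` (from `org.apache.spark.mllib.evaluation`) and reads several metrics from it. The helper name `allMetrics` and the default column names are assumptions made for this example and are not part of Spark; the claim that R2, MAE, MSE and RMSE share one underlying summary is my understanding of `RegressionMetrics`, not something verified here against a specific Spark version.

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper (not a Spark API): read several metrics
// from one RegressionMetrics instance instead of one evaluate call per metric.
def allMetrics(
    df: DataFrame,
    predictionCol: String = "prediction",
    labelCol: String = "label"): Map[String, Double] = {
  // RegressionMetrics expects an RDD of (prediction, observation) pairs.
  val predictionAndLabels = df
    .select(col(predictionCol).cast(DoubleType), col(labelCol).cast(DoubleType))
    .rdd
    .map { case Row(prediction: Double, label: Double) => (prediction, label) }

  val metrics = new RegressionMetrics(predictionAndLabels)
  Map(
    "r2" -> metrics.r2,
    "mae" -> metrics.meanAbsoluteError,
    "mse" -> metrics.meanSquaredError,
    "rmse" -> metrics.rootMeanSquaredError
  )
}
```

Calling `allMetrics(df)("mae")` would then return the same kind of value as `evaluator.setMetricName("mae").evaluate(df)`, assuming `df` has non-null numeric prediction and label columns.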

If we cache the latest inputs, and the next evaluate call keeps the same inputs (everything except the metricName), then we can obtain the metric directly from the internal intermediate summary.
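
The caching idea can be sketched without Spark. Everything in the block below is hypothetical and only illustrates the proposal: `Summary`, `CachingEvaluator` and the `passes` counter are invented names, the input is a plain `Seq` of (prediction, label) pairs rather than a DataFrame, and the cache key is just the identity of the input. A real change to RegressionEvaluator would also have to compare the other parameters, such as the label and prediction columns.

```scala
// Hypothetical one-pass summary from which several metrics can be derived.
final case class Summary(
    n: Long,
    sumAbsErr: Double,
    sumSqErr: Double,
    sumLabel: Double,
    sumSqLabel: Double) {
  def mae: Double = sumAbsErr / n
  def mse: Double = sumSqErr / n
  def rmse: Double = math.sqrt(mse)
  // r2 = 1 - SSerr / SStot, with SStot = sum(y^2) - (sum(y))^2 / n
  def r2: Double = 1.0 - sumSqErr / (sumSqLabel - sumLabel * sumLabel / n)

  def metric(name: String): Double = name match {
    case "mae"  => mae
    case "mse"  => mse
    case "rmse" => rmse
    case "r2"   => r2
    case other  => throw new IllegalArgumentException(s"Unknown metric: $other")
  }
}

// Hypothetical evaluator that remembers the last input and its summary.
final class CachingEvaluator {
  private var cached: Option[(Seq[(Double, Double)], Summary)] = None
  var passes: Int = 0 // number of full scans, kept only for illustration

  private def summarize(data: Seq[(Double, Double)]): Summary = {
    passes += 1
    data.foldLeft(Summary(0L, 0.0, 0.0, 0.0, 0.0)) { case (s, (pred, label)) =>
      val err = label - pred
      Summary(
        s.n + 1,
        s.sumAbsErr + math.abs(err),
        s.sumSqErr + err * err,
        s.sumLabel + label,
        s.sumSqLabel + label * label)
    }
  }

  def evaluate(data: Seq[(Double, Double)], metricName: String): Double = {
    val summary = cached match {
      // Same input object as last time: reuse the stored summary.
      case Some((prev, s)) if prev eq data => s
      // New input: scan once and remember the result.
      case _ =>
        val s = summarize(data)
        cached = Some((data, s))
        s
    }
    summary.metric(metricName)
  }
}

// Usage: three metrics, one scan of the data.
val data = Seq((1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 5.0)) // (prediction, label)
val cachingEvaluator = new CachingEvaluator
val r2 = cachingEvaluator.evaluate(data, "r2")
val mae = cachingEvaluator.evaluate(data, "mae")
val mse = cachingEvaluator.evaluate(data, "mse")
// cachingEvaluator.passes == 1
```

Keeping only the most recent input keeps the cache small; the trade-off is that alternating between two datasets would still rescan on every call.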
