Posted to issues@spark.apache.org by "yuhao yang (JIRA)" <ji...@apache.org> on 2016/12/04 07:31:58 UTC

[jira] [Commented] (SPARK-18704) CrossValidator should preserve more tuning statistics

    [ https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15719499#comment-15719499 ] 

yuhao yang commented on SPARK-18704:
------------------------------------

One implementation of the tuning summary is available at https://github.com/hhbyyh/spark/tree/tuningsummary/mllib/src/main/scala/org/apache/spark/ml/tuning for anyone interested.

> CrossValidator should preserve more tuning statistics
> -----------------------------------------------------
>
>                 Key: SPARK-18704
>                 URL: https://issues.apache.org/jira/browse/SPARK-18704
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: yuhao yang
>            Priority: Minor
>
> Currently CrossValidator trains (k folds * paramMaps) different models during the training process, yet it passes only the average metrics to CrossValidatorModel. As a result, important information such as the variance across folds for the same paramMap cannot be retrieved, and users cannot tell whether the chosen k is appropriate. Since CrossValidator is relatively expensive, we probably want to get the most from the tuning process.
> Just want to see if this sounds good. In my opinion, this can be done either by passing a metrics matrix to the CrossValidatorModel or by introducing a CrossValidatorSummary. I would vote for introducing a TuningSummary class, which could also be used by TrainValidationSplit. In the summary we can present better statistics for the tuning process, something like a DataFrame:
> +---------------+------------+--------+-----------------+
> |elasticNetParam|fitIntercept|regParam|metrics          |
> +---------------+------------+--------+-----------------+
> |0.0            |true        |0.1     |9.747795248932505|
> |0.0            |true        |0.01    |9.751942357398603|
> |0.0            |false       |0.1     |9.71727627087487 |
> |0.0            |false       |0.01    |9.721149803723822|
> |0.5            |true        |0.1     |9.719358515436005|
> |0.5            |true        |0.01    |9.748121645368501|
> |0.5            |false       |0.1     |9.687771328829479|
> |0.5            |false       |0.01    |9.717304811419261|
> |1.0            |true        |0.1     |9.696769467196487|
> |1.0            |true        |0.01    |9.744325276259957|
> |1.0            |false       |0.1     |9.665822167122172|
> |1.0            |false       |0.01    |9.713484065511892|
> +---------------+------------+--------+-----------------+
> Using the DataFrame, users can better understand the effect of different parameters.
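
To make the "metrics matrix" idea from the description concrete, here is a minimal sketch in plain Python (standard library only, no Spark). It is not the Spark API and not the linked implementation. The function name tuning_summary, the column names avgMetric and stdMetric, and the sample numbers are all assumptions made up for illustration. It takes one metric per fold per parameter combination and produces one summary row per combination, keeping the spread across folds as well as the average.

```python
from itertools import product
from statistics import mean, pstdev


def tuning_summary(param_grid, fold_metrics):
    """Build one summary row per parameter combination (hypothetical helper).

    param_grid:   dict mapping a parameter name to its list of candidate values.
    fold_metrics: one inner list per fold; each inner list holds one metric per
                  parameter combination, ordered like product() over the grid.
    """
    names = list(param_grid)
    combos = list(product(*(param_grid[name] for name in names)))

    for fold in fold_metrics:
        if len(fold) != len(combos):
            raise ValueError("each fold needs one metric per parameter combination")

    rows = []
    for j, combo in enumerate(combos):
        # Column j of the matrix: the metric of this combination in every fold.
        per_fold = [fold[j] for fold in fold_metrics]
        row = dict(zip(names, combo))
        row["avgMetric"] = mean(per_fold)    # what is kept today
        row["stdMetric"] = pstdev(per_fold)  # spread across folds, currently lost
        rows.append(row)
    return rows


# Illustrative numbers only: 3 folds x 4 parameter combinations.
grid = {"regParam": [0.1, 0.01], "fitIntercept": [True, False]}
metrics = [
    [9.5, 9.75, 9.5, 9.75],
    [10.0, 9.75, 9.0, 9.75],
    [10.5, 9.75, 10.0, 9.75],
]
summary = tuning_summary(grid, metrics)
for row in summary:
    print(row)
```

In this sketch a combination whose metric is identical in every fold gets a stdMetric of 0, while one that swings between folds gets a larger value. That spread is the signal the description says is lost when only the average is kept.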


