Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/11/03 14:42:00 UTC

[jira] [Commented] (SPARK-22433) Linear regression R^2 train/test terminology related

    [ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237676#comment-16237676 ] 

Sean Owen commented on SPARK-22433:
-----------------------------------

Likewise, the goal here is not to adopt statistics terminology. As the name implies, MLlib draws more on ML terminology and practices. I completely agree that we should stick to standard language and approaches where there is a clear standard; where there is an ML-oriented standard, we should probably prefer that one for consistency.

I'm not sure I agree with these yet. What is a regression prediction metric, and why can't R^2 be an evaluation metric? I am not sure I'd use it that way, but it's coherent. There's no issue with computing R^2 on a test set rather than on the set the model was trained on; it could be negative, sure.
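
To make the "could be negative" point concrete, here is a minimal sketch in plain Python (not Spark code). It assumes the usual definition R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of the data being evaluated; the helper names fit_line and r_squared and the toy data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(ys, preds):
    """R^2 = 1 - SS_res / SS_tot, computed on whatever data is passed in."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: the test set follows the opposite trend to the train set.
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]
test_x, test_y = [0.0, 1.0, 2.0, 3.0], [3.0, 2.0, 1.0, 0.0]

slope, intercept = fit_line(train_x, train_y)


def predict(x):
    return slope * x + intercept


r2_train = r_squared(train_y, [predict(x) for x in train_x])
r2_test = r_squared(test_y, [predict(x) for x in test_x])

print(r2_train)  # 1.0
print(r2_test)   # -3.0
```

On the training data the fit is perfect, so R^2 is 1. On the held-out data the model does worse than simply predicting the test mean, so R^2 drops below zero. The number is still well defined and still says something about predictive quality.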

I understand your distinction, but "linear regression with L1 reg" still produces a linear model. The cost function is not the same as in linear regression, but the name is also different. LASSO is, I think, less well known than L1 among the people who would use this. While I wouldn't mind LASSO, I don't see a clear motive to change this.
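
As a rough illustration of that point, the sketch below fits a single-feature model with no intercept under an L1 penalty. It assumes the objective (1/(2n)) * sum((y - w*x)^2) + lam * |w|, whose minimizer has a closed form via soft-thresholding. This is a toy, not Spark's solver, and the names soft_threshold and fit_l1 are hypothetical. The penalty changes which coefficient is chosen, but the fitted model is still the linear function x -> w * x.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_l1(xs, ys, lam):
    """Minimize (1/(2n)) * sum((y - w*x)^2) + lam * |w| for one feature, no intercept."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n  # average of x * y
    zz = sum(x * x for x in xs) / n               # average of x^2
    return soft_threshold(rho, lam) / zz


# Hypothetical toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_ols = fit_l1(xs, ys, 0.0)     # no penalty: plain least squares
w_l1 = fit_l1(xs, ys, 1.5)      # moderate penalty: coefficient shrinks
w_zeroed = fit_l1(xs, ys, 20.0) # large penalty: coefficient is driven to 0


def predict(w, x):
    # The model form is the same regardless of how w was estimated.
    return w * x
```

With these numbers the unpenalized coefficient is 2, the moderately penalized one is smaller, and the heavily penalized one is exactly zero. In every case the prediction is w * x, so predict(w, a + b) equals predict(w, a) + predict(w, b) up to floating-point rounding.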


> Linear regression R^2 train/test terminology related 
> -----------------------------------------------------
>
>                 Key: SPARK-22433
>                 URL: https://issues.apache.org/jira/browse/SPARK-22433
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Teng Peng
>            Priority: Minor
>
> Traditional statistics is traditional statistics. Its goals, framework, and terminology are not the same as those of ML. However, in the linear-regression-related components this distinction is not clear, as reflected in the following:
> 1. regressionMetric + regressionEvaluator : 
> * R2 shouldn't be there. 
> * A better name "regressionPredictionMetric".
> 2. LinearRegressionSuite: 
> * Shouldn't test R2 and residuals on test data. 
> * There is no train set and test set in this setting.
> 3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear. Once a penalty term is added, it is no longer linear. Just call it "LASSO" or "ElasticNet".
> There are more. I am working on correcting them.
> They are not breaking anything, but it does not make one feel good to see the basic distinction is blurred.


