Posted to issues@spark.apache.org by "yuhao yang (JIRA)" <ji...@apache.org> on 2017/10/04 23:54:00 UTC

[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

    [ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192217#comment-16192217 ] 

yuhao yang commented on SPARK-3181:
-----------------------------------

Regarding whether to separate Huber loss into an independent Estimator, I don't see a direct conflict.

IMO, LinearRegression should act as an all-in-one Estimator that allows users to combine whichever loss function, optimizer and regularization they want to use. It should target flexibility and also provide some fundamental infrastructure for regression algorithms.

In the meantime, we may also support HuberRegression, RidgeRegression and others as independent Estimators, which are more convenient but less flexible (they can also expose algorithm-specific parameters). As mentioned by Seth, this would require better code abstraction and a plugin interface. Besides loss/prediction/optimizer, we also need to provide infrastructure for model summary and serialization. This should only happen after we can compose an Estimator like HuberRegression without noticeable code duplication.
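
To make the layering concrete, here is a minimal sketch in plain Python (not Spark code) of what I have in mind. Everything in it is hypothetical and only for illustration: the names GeneralLinearRegression, loss_grad, make_huber_grad and the epsilon default are assumptions, the model has a single feature, and the optimizer is plain gradient descent. The point is only that the general Estimator takes the loss as a plugin, and HuberRegression is a thin wrapper that fixes the loss and exposes its own parameter without copying the fitting code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


# Hypothetical loss "plugins": each maps a residual r = prediction - label
# to the derivative of the loss with respect to the prediction.
def squared_error_grad(r: float) -> float:
    return r  # derivative of r^2 / 2


def make_huber_grad(epsilon: float) -> Callable[[float], float]:
    def grad(r: float) -> float:
        # Quadratic zone: gradient is r. Linear zone: clipped to +/- epsilon.
        return max(-epsilon, min(epsilon, r))
    return grad


@dataclass
class GeneralLinearRegression:
    """All-in-one estimator (illustrative): loss and regularization are pluggable."""
    loss_grad: Callable[[float], float] = squared_error_grad
    reg_param: float = 0.0   # L2 penalty strength on the slope
    step_size: float = 0.5
    max_iter: int = 2000

    def fit(self, xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
        """Fit y ~ w * x + b by batch gradient descent; returns (w, b)."""
        w, b = 0.0, 0.0
        n = len(xs)
        for _ in range(self.max_iter):
            gw = gb = 0.0
            for x, y in zip(xs, ys):
                g = self.loss_grad(w * x + b - y)
                gw += g * x
                gb += g
            w -= self.step_size * (gw / n + self.reg_param * w)
            b -= self.step_size * gb / n
        return w, b


class HuberRegression(GeneralLinearRegression):
    """Convenience estimator (illustrative): fixes the loss, exposes epsilon."""

    def __init__(self, epsilon: float = 1.35, **kwargs):
        super().__init__(loss_grad=make_huber_grad(epsilon), **kwargs)


if __name__ == "__main__":
    xs = [i / 10 for i in range(11)]
    ys = [2 * x + 1 for x in xs]
    ys[-1] += 20.0  # one gross outlier in the response
    print("squared error:", GeneralLinearRegression().fit(xs, ys))
    print("huber        :", HuberRegression().fit(xs, ys))
```

On data with a single large outlier in the response, the Huber fit in this sketch stays much closer to the slope of the clean points than the squared-error fit. A real implementation would of course also need the model summary and serialization pieces mentioned above, which the sketch leaves out.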


> Add Robust Regression Algorithm with Huber Estimator
> ----------------------------------------------------
>
>                 Key: SPARK-3181
>                 URL: https://issues.apache.org/jira/browse/SPARK-3181
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Fan Jiang
>            Assignee: Yanbo Liang
>              Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we get various types of data. We need to include Robust Regression to employ a fitting criterion that is not as vulnerable as least squares.
> In 1973, Huber introduced M-estimation for regression, where the "M" stands for "maximum likelihood type". The method is resistant to outliers in the response variable and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
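
For reference, the fitting criterion behind the Huber estimator mentioned in the description can be written as below. The notation is an assumption of this note, not taken from the issue: r is a residual, \delta is the threshold between the quadratic and linear zones, and \beta is the coefficient vector. The scale estimate that usually accompanies the residuals is left out for simplicity.

```latex
\rho_\delta(r) =
\begin{cases}
  \tfrac{1}{2} r^2, & |r| \le \delta, \\
  \delta \left( |r| - \tfrac{1}{2}\delta \right), & |r| > \delta,
\end{cases}
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\delta\!\left( y_i - x_i^{\top}\beta \right)
```

Small residuals are penalized quadratically, as in least squares, while large residuals are penalized only linearly, which is what limits the influence of outliers in the response.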



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
