Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2019/02/01 01:51:00 UTC
[jira] [Resolved] (SPARK-26787) Fix standardization error message in WeightedLeastSquares
[ https://issues.apache.org/jira/browse/SPARK-26787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-26787.
-------------------------------
Resolution: Fixed
Fix Version/s: 3.0.0
Issue resolved by pull request 23705
[https://github.com/apache/spark/pull/23705]
> Fix standardization error message in WeightedLeastSquares
> ---------------------------------------------------------
>
> Key: SPARK-26787
> URL: https://issues.apache.org/jira/browse/SPARK-26787
> Project: Spark
> Issue Type: Documentation
> Components: MLlib
> Affects Versions: 2.3.0, 2.3.1, 2.4.0
> Environment: Tested in Spark 2.4.0 on DataBricks running in 5.1 ML Beta.
>
> Reporter: Brian Scannell
> Assignee: Brian Scannell
> Priority: Trivial
> Fix For: 3.0.0
>
>
> There is an error message in WeightedLeastSquares.scala that is incorrect and therefore unhelpful for diagnosing the issue. The problem arises when fitting a regularized LinearRegression on a constant label. Even when the user sets standardization=False, the error falsely states that standardization was set to True:
> {{The standard deviation of the label is zero. Model cannot be regularized with standardization=true}}
> This is because, under the hood, LinearRegression always sets standardizeLabel=True. That choice was made for consistency with GLMNet. WeightedLeastSquares itself is written to work with standardizeLabel set either way, but the public LinearRegression API does not expose the option.
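> The mismatch can be sketched in a few lines of plain Python. This is a hedged illustration only: ToyLinearRegression, its fields, and the control flow are invented for this sketch and do not reflect the actual Scala implementation, beyond the fact that the internal standardizeLabel flag, not the user-facing standardization parameter, ends up in the message.

```python
# Illustrative sketch only -- names and logic are invented, not Spark internals.
class ToyLinearRegression:
    def __init__(self, standardization=False, reg_param=1e-4):
        self.standardization = standardization  # user-facing flag
        self.reg_param = reg_param

    def fit(self, label_std):
        # Under the hood the solver is always invoked with standardizeLabel=True,
        # regardless of what the user passed for standardization.
        standardize_label = True
        if label_std == 0.0 and self.reg_param != 0.0:
            # The message interpolates the internal flag, not the user's setting.
            raise ValueError(
                "The standard deviation of the label is zero. Model cannot "
                "be regularized with standardization="
                + str(standardize_label).lower()
            )


lr = ToyLinearRegression(standardization=False)
try:
    lr.fit(label_std=0.0)
except ValueError as e:
    print(e)  # mentions standardization=true despite standardization=False
```

> Running the sketch prints a message ending in standardization=true even though the user passed standardization=False, which is exactly the confusion reported above.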
>
> I will submit a pull request with my suggested wording.
>
> Relevant:
> [https://github.com/apache/spark/pull/10702]
> [https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
>
> The following Python code will replicate the error.
> {code:python}
> import pandas as pd
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.regression import LinearRegression
> df = pd.DataFrame({'foo': [1,2,3], 'bar':[4,5,6],'label':[1,1,1]})
> spark_df = spark.createDataFrame(df)
> vectorAssembler = VectorAssembler(inputCols=['foo', 'bar'], outputCol='features')
> train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])
> lr = LinearRegression(featuresCol='features', labelCol='label', fitIntercept=False,
>                       standardization=False, regParam=1e-4)
> lr_model = lr.fit(train_sdf)
> {code}
>
> For context, someone might want to do this when fitting a model to estimate the components of a fixed total. The label indicates the total is always 100%, but the components vary: for example, estimating the unknown per-unit weights of different substances packed in varying quantities into a series of full bins.
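> A Spark-free toy version of that use case shows why a constant label is perfectly sensible here. The data below is invented for illustration: each row holds the known quantities of two substances in a bin, the label is the bin's total (always 1.0, i.e. 100%), and ordinary unregularized least squares recovers per-unit weights without complaint; only the regularized path trips the check discussed above.

```python
# Known quantities of two substances per bin (invented example data).
X = [[1.0, 4.0],
     [2.0, 5.0],
     [3.0, 6.0]]
# Constant label: every bin is full, so the total is always 100%.
y = [1.0, 1.0, 1.0]

# Ordinary least squares with no intercept, via the 2x2 normal equations
# (X^T X) w = X^T y. A zero-variance label poses no problem here.
a11 = sum(r[0] * r[0] for r in X)
a12 = sum(r[0] * r[1] for r in X)
a22 = sum(r[1] * r[1] for r in X)
b1 = sum(r[0] * t for r, t in zip(X, y))
b2 = sum(r[1] * t for r, t in zip(X, y))
det = a11 * a22 - a12 * a12
w = [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]
print(w)  # estimated per-unit weight of each substance
```

> Here the fit is exact: the recovered weights reproduce the constant 100% total on every row, so the constant label is a feature of the problem, not a data error.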
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org