Posted to issues@spark.apache.org by "Pablo J. Villacorta (JIRA)" <ji...@apache.org> on 2018/07/01 18:55:00 UTC

[jira] [Updated] (SPARK-24712) TrainValidationSplit ignores label column name and forces to be "label"

     [ https://issues.apache.org/jira/browse/SPARK-24712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pablo J. Villacorta updated SPARK-24712:
----------------------------------------
    Description: 
When a TrainValidationSplit is fit on a Pipeline containing an ML model, the labelCol property of the model is ignored, and the call to fit() fails unless labelCol equals "label". As an example, the following PySpark code only works when the variable labelColumn is set to "label":
{code:java}
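from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.sql.functions import rand, randn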

labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS

df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), randn(seed=27).alias(labelColumn))
vectorAssembler = VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
mypipeline = Pipeline(stages = [vectorAssembler, lr])

paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.01, 0.1])\
    .build()

trainValidationSplit = TrainValidationSplit()\
    .setEstimator(mypipeline)\
    .setEvaluator(RegressionEvaluator())\
    .setEstimatorParamMaps(paramGrid)\
    .setTrainRatio(0.8)

trainValidationSplit.fit(df)  # FAIL UNLESS labelColumn IS SET TO "label"
{code}


> TrainValidationSplit ignores label column name and forces to be "label"
> -----------------------------------------------------------------------
>
>                 Key: SPARK-24712
>                 URL: https://issues.apache.org/jira/browse/SPARK-24712
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Pablo J. Villacorta
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
