Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2018/02/13 01:11:00 UTC

[jira] [Comment Edited] (SPARK-23377) Bucketizer with multiple columns persistence bug

    [ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361154#comment-16361154 ] 

Joseph K. Bradley edited comment on SPARK-23377 at 2/13/18 1:10 AM:
--------------------------------------------------------------------

[~viirya]'s patch currently changes DefaultParamsWriter to save only the explicitly set Param values.  This means that loading a model into a new version of Spark could use different Param values if the default values have changed.
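
To make the distinction concrete, here is a rough sketch using the public Params API (an illustration of the two sets of values in play, not the actual DefaultParamsWriter internals):
{code:java}
import org.apache.spark.ml.feature.Bucketizer

val b = new Bucketizer().setInputCols(Array("foo1", "foo2"))

// What the patch would persist: only values the user explicitly set.
val explicitOnly = b.params.filter(b.isSet(_))

// What the original design persists: explicitly set values plus defaults, so
// a loaded model behaves identically even if a later release changes a default.
val withDefaults = b.params.filter(p => b.isSet(p) || b.hasDefault(p))
{code}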

In the original design of persistence (see [SPARK-6725]), the goal was to make behavior exactly reproducible.  This means that default Param values do need to be saved.  I recommend that we maintain this guarantee.
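
As a quick illustration of the risk (the default change described here is hypothetical):
{code:java}
import org.apache.spark.ml.feature.Bucketizer

// handleInvalid currently defaults to "error".  If some future release changed
// that default and the saved model had not recorded it, the reloaded model
// would silently pick up the new behavior; saving defaults pins the old one.
val b = new Bucketizer()
println(b.getOrDefault(b.handleInvalid))  // "error" in 2.3
{code}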

I can see a few possibilities:
1. Simplest: Change the loading logic of Bucketizer so that it handles this edge case (by removing the value for inputCol when inputCols is set; see the sketch after this list).  This may be best for Spark 2.3 since it's the fastest fix.
2. Reasonable: Change the saving logic of Bucketizer to handle this case.  This fixes the edge case at its source and is still quick to do.
3. Largest: Change DefaultParamsWriter to separate explicitly set values and default values.  Then update Bucketizer's loading logic to make use of this distinction.  I'm not a fan of this approach since it would involve shoving a huge change into branch-2.3 during late QA.
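
For the 1st option, the idea is roughly the following (a sketch only; the real fix would live in Bucketizer's load path, and the helper name here is made up):
{code:java}
import org.apache.spark.ml.feature.Bucketizer

// Sketch: after loading, drop the single-column Params that a saved default
// would otherwise have set alongside the multi-column ones.
def clearSingleColumnParams(b: Bucketizer): Bucketizer = {
  if (b.isSet(b.inputCols)) {
    b.clear(b.inputCol)
    b.clear(b.outputCol)
  }
  b
}
{code}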

I'd vote strongly for the 2nd option now, and perhaps the 3rd option later on.  Opinions?



> Bucketizer with multiple columns persistence bug
> ------------------------------------------------
>
>                 Key: SPARK-23377
>                 URL: https://issues.apache.org/jira/browse/SPARK-23377
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the default value on write -> read, which causes it to throw an error on transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> import spark.implicits._  // for toDF; automatically in scope in spark-shell
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the inputCols Param set for multi-column transform. The following Params are not applicable and should not be set: outputCol.
> 	at org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
> 	at org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
> 	at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
> 	at org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
> 	at line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)
> {code}
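> The validation that fires here does roughly the following (a paraphrase for illustration only, not the exact params.scala source; the method name is made up):
> {code:java}
> import org.apache.spark.ml.param.{Param, Params}
>
> // If the multi-column Param is set, no corresponding single-column Param may be set.
> def checkExclusive(model: Params, multiParam: Param[_], singleParams: Seq[Param[_]]): Unit = {
>   if (model.isSet(multiParam)) {
>     val offenders = singleParams.filter(model.isSet(_)).map(_.name)
>     require(offenders.isEmpty,
>       s"${model.uid} has the ${multiParam.name} Param set for multi-column transform. " +
>       s"The following Params are not applicable and should not be set: ${offenders.mkString(", ")}")
>   }
> }
> {code}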


