Posted to issues@spark.apache.org by "Bago Amirbekian (JIRA)" <ji...@apache.org> on 2018/02/09 23:14:00 UTC

[jira] [Updated] (SPARK-23377) Bucketizer with multiple columns persistence bug

     [ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-23377:
------------------------------------
    Description: 
A Bucketizer with multiple input/output columns gets "outputCol" set to its default value on write -> read, which causes it to throw an error on transform. Here's an example.


{code:java}
import org.apache.spark.ml.feature._

val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
bucketizer.transform(data)

val path = "/temp/bucketrizer-persist-test"
bucketizer.write.overwrite.save(path)
val bucketizerAfterRead = Bucketizer.read.load(path)
println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
// This line throws an error because "outputCol" is set
bucketizerAfterRead.transform(data)
{code}
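
One caveat about the check in the example: "isDefined" is true when a Param is either explicitly set or merely has a default, so it may not separate the two states on its own. The sketch below prints "isSet", "hasDefault" and "isDefined" for "outputCol" before saving and after loading. It assumes a spark-shell session and that "outputCol" carries a default value on Bucketizer; the path and the "describe" helper are hypothetical names introduced only for illustration.

{code:java}
import org.apache.spark.ml.feature.Bucketizer

// Same multi-column configuration as in the example above.
val splits = Array(Double.NegativeInfinity, 0.0, 10.0, 100.0, Double.PositiveInfinity)
val original = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

// Hypothetical scratch location; any writable path works.
val checkPath = "/tmp/bucketizer-param-check"
original.write.overwrite().save(checkPath)
val reloaded = Bucketizer.load(checkPath)

// Hypothetical helper: reports whether outputCol was set explicitly (isSet),
// has a default (hasDefault), or either of the two (isDefined).
def describe(label: String, b: Bucketizer): Unit =
  println(s"$label: isSet=${b.isSet(b.outputCol)}, " +
    s"hasDefault=${b.hasDefault(b.outputCol)}, " +
    s"isDefined=${b.isDefined(b.outputCol)}")

describe("before save", original)
describe("after load", reloaded)
{code}

If the diagnosis in this issue holds, the "isSet" value is the one expected to differ between the two lines.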

The transform call on the reloaded Bucketizer fails with this trace:

{code:java}
java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the inputCols Param set for multi-column transform. The following Params are not applicable and should not be set: outputCol.
	at org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
	at org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
	at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
	at org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
	at line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)

{code}
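
Until this is fixed, a possible stopgap is to clear the single-column Param on the loaded instance before calling transform. This is only a sketch, not a confirmed fix: it assumes a spark-shell session, that the exclusivity check only rejects Params that are explicitly set, and that the model was saved to the path used in the example above.

{code:java}
import org.apache.spark.ml.feature.Bucketizer
import spark.implicits._

// Same input data as in the example above.
val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")

// Path written by the example above.
val restored = Bucketizer.load("/temp/bucketrizer-persist-test")

// Remove the explicit value for outputCol if one was restored on load.
// clear() only drops the user-supplied value; any default stays in place.
if (restored.isSet(restored.outputCol)) {
  restored.clear(restored.outputCol)
}

// With outputCol no longer explicitly set, the multi-column transform
// should pass the single-vs-multi column check (assumption, see above).
restored.transform(data).show()
{code}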





> Bucketizer with multiple columns persistence bug
> ------------------------------------------------
>
>                 Key: SPARK-23377
>                 URL: https://issues.apache.org/jira/browse/SPARK-23377
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Major


