You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/02/13 05:08:08 UTC

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/20594

    [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug

    ## What changes were proposed in this pull request?
    
    #### Problem:
    
    Since 2.3, `Bucketizer` supports multiple input/output columns. We will check if exclusive params are set during transformation. E.g., if `inputCols` and `outputCol` are both set, an error will be thrown.
    
    However, when we write `Bucketizer`, looks like the default params and user-supplied params are merged during writing. All saved params are loaded back and set to created model instance. So the default `outputCol` param in `HasOutputCol` trait will be set in `paramMap` and become an user-supplied param. That makes the check of exclusive params failed.
    
    #### Fix:
    
    This changes the saving logic of Bucketizer to handle this case. This is a quick fix to catch the time of 2.3. We should consider modify the persistence mechanism later.
    
    Please see the discussion in the JIRA.
    
    Note: The multi-column `QuantileDiscretizer` also has the same issue.
    
    ## How was this patch tested?
    
    Modified tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23377-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20594.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20594
    
----
commit 9cd7c86fad04c814b2c8f5547583122ba12c359b
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-02-13T03:51:41Z

    Remove outputCol default value if inputCols is set.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87392/testReport)** for PR 20594 at commit [`9cd7c86`](https://github.com/apache/spark/commit/9cd7c86fad04c814b2c8f5547583122ba12c359b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20594#discussion_r167914538
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -290,6 +293,27 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
         }
       }
     
    +
    +  private[Bucketizer] class BucketizerWriter(instance: Bucketizer) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      // SPARK-23377: The default params will be saved and loaded as user-supplied params.
    +      // Once `inputCols` is set, the default value of `outputCol` param causes the error
    +      // when checking exclusive params. As a temporary to fix it, we remove the default
    +      // value of `outputCol` if `inputCols` is set before saving.
    +      // TODO: If we modify the persistence mechanism later to better handle default params,
    +      // we can get rid of this.
    +      var removedOutputCol: Option[String] = None
    +      if (instance.isSet(instance.inputCols)) {
    --- End diff --
    
    this can create a lot of issues with the Python API. Please see #20410 for reference. Thus I am against this fix.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87379 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87379/testReport)** for PR 20594 at commit [`9cd7c86`](https://github.com/apache/spark/commit/9cd7c86fad04c814b2c8f5547583122ba12c359b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87458/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/854/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87366/testReport)** for PR 20594 at commit [`9cd7c86`](https://github.com/apache/spark/commit/9cd7c86fad04c814b2c8f5547583122ba12c359b).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  class QuantileDiscretizerWriter(instance: QuantileDiscretizer) extends MLWriter `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87440 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87440/testReport)** for PR 20594 at commit [`3a29039`](https://github.com/apache/spark/commit/3a290392e87e6476b3a9253b902850a078dfc4ea).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Thanks!  LGTM pending tests


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20594#discussion_r167801392
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -290,6 +293,27 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
         }
       }
     
    +
    +  private[Bucketizer] class BucketizerWriter(instance: Bucketizer) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      // SPARK-23377: The default params will be saved and loaded as user-supplied params.
    +      // Once `inputCols` is set, the default value of `outputCol` param causes the error
    +      // when checking exclusive params. As a temporary to fix it, we remove the default
    +      // value of `outputCol` if `inputCols` is set before saving.
    +      // TODO: If we modify the persistence mechanism later to better handle default params,
    +      // we can get rid of this.
    +      var removedOutputCol: Option[String] = None
    --- End diff --
    
    I doubt whether it need a "lock" here, because it is the way "clear default value first, then save model, then restore default value".
    Maybe wrapping the code block here by `synchronized` is safer ?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87379 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87379/testReport)** for PR 20594 at commit [`9cd7c86`](https://github.com/apache/spark/commit/9cd7c86fad04c814b2c8f5547583122ba12c359b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  class QuantileDiscretizerWriter(instance: QuantileDiscretizer) extends MLWriter `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Thanks @jkbradley ! I've updated this based on your comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87392/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20594#discussion_r167762013
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -290,6 +293,27 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
         }
       }
     
    +
    +  private[Bucketizer] class BucketizerWriter(instance: Bucketizer) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      // SPARK-23377: The default params will be saved and loaded as user-supplied params.
    +      // Once `inputCols` is set, the default value of `outputCol` param causes the error
    +      // when checking exclusive params. As a temporary to fix it, we remove the default
    +      // value of `outputCol` if `inputCols` is set before saving.
    +      // TODO: If we modify the persistence mechanism later to better handle default params,
    +      // we can get rid of this.
    +      var removedOutputCol: Option[String] = None
    +      if (instance.isSet(instance.inputCols)) {
    +        removedOutputCol = instance.getDefault(instance.outputCol)
    +        instance.clearDefault(instance.outputCol)
    +      }
    +      DefaultParamsWriter.saveMetadata(instance, path, sc)
    +      // Add the default param back.
    +      removedOutputCol.map(instance.setDefault(instance.outputCol, _))
    --- End diff --
    
    Although the saving logic is the same as `QuantileDiscretizerWriter`, I leave them as duplicate for now since this is a quick fix. If there is strong preference, I can make a common class for it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20594#discussion_r168067865
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -290,6 +293,27 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
         }
       }
     
    +
    +  private[Bucketizer] class BucketizerWriter(instance: Bucketizer) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      // SPARK-23377: The default params will be saved and loaded as user-supplied params.
    +      // Once `inputCols` is set, the default value of `outputCol` param causes the error
    +      // when checking exclusive params. As a temporary to fix it, we remove the default
    +      // value of `outputCol` if `inputCols` is set before saving.
    +      // TODO: If we modify the persistence mechanism later to better handle default params,
    +      // we can get rid of this.
    +      var removedOutputCol: Option[String] = None
    --- End diff --
    
    yep. But I have some new thoughts, see my comments at bottom. -:)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/903/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87379/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    I thought again, instead of "removing default value and restore it again later (which may cause some side effects)", maybe the better way is, add a parameter to `DefaultParamsWriter.saveMetadata`, specify which default param need to skip when saving.
    
    @mgaido91 Yes I agree with you. Either #20410 or #18982 need to be merged to 2.3, the related issue is possible to cause some strange bugs.
    
    cc @jkbradley 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87392 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87392/testReport)** for PR 20594 at commit [`9cd7c86`](https://github.com/apache/spark/commit/9cd7c86fad04c814b2c8f5547583122ba12c359b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  class QuantileDiscretizerWriter(instance: QuantileDiscretizer) extends MLWriter `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    @jkbradley thanks for your answer. I think that the 3rd approach you suggested on the JIRA is the right way to go on a long term plan. Personally, I disagree with you when you say that we should keep the default values. I think that changing a default value doesn't happen often and if it happens it is not a problem: if the user cares about the value of a parameter, he sets it. But this is just my opinion.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20594#discussion_r167812275
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -290,6 +293,27 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
         }
       }
     
    +
    +  private[Bucketizer] class BucketizerWriter(instance: Bucketizer) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      // SPARK-23377: The default params will be saved and loaded as user-supplied params.
    +      // Once `inputCols` is set, the default value of `outputCol` param causes the error
    +      // when checking exclusive params. As a temporary to fix it, we remove the default
    +      // value of `outputCol` if `inputCols` is set before saving.
    +      // TODO: If we modify the persistence mechanism later to better handle default params,
    +      // we can get rid of this.
    +      var removedOutputCol: Option[String] = None
    --- End diff --
    
    I was thinking about this too. But looks like we don't add lock to the places we change params in ML. I guess that we assume the usage of ML models is single-threaded. So I leave it as this. Will add it if others think this is required too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87458 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87458/testReport)** for PR 20594 at commit [`174c114`](https://github.com/apache/spark/commit/174c1148e595b5eda731dbdae2c6fbf1a4dd4837).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87440 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87440/testReport)** for PR 20594 at commit [`3a29039`](https://github.com/apache/spark/commit/3a290392e87e6476b3a9253b902850a078dfc4ea).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/843/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/833/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20594#discussion_r168254258
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -213,6 +217,9 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
       override def copy(extra: ParamMap): Bucketizer = {
         defaultCopy[Bucketizer](extra).setParent(parent)
       }
    +
    +  @Since("2.3.0")
    --- End diff --
    
    No need for this since annotation; the signature isn't changed in 2.3.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Well I succeeded in merging this with master, but the merge script isn't working for branch-2.3.  I wait to see if the read-only repo syncs and fixes the issue.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87366/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/888/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    @WeichenXu123 @viirya as I said in the other PR, I think no default value should be persisted. #20410 and #18982 are not regression: they are problem which have been present in all the release so far, but they are showing up more and more "thanks" to all the models having the dualism inputCol/inputCols.
    
    Every usage of `isSet`on Scala side is a problem with the Python API until either one of them will be merged. And this issue is the same. After persisting, every usage of `isSet` is not working as intended. Therefore I'd be for either not to store any default value or store them writing explicitly that they are default values. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87366/testReport)** for PR 20594 at commit [`9cd7c86`](https://github.com/apache/spark/commit/9cd7c86fad04c814b2c8f5547583122ba12c359b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Because this is a quick fix, my idea is to have a surface patch that doesn't change existing API. The approach of adding parameter to `DefaultParamsWriter.saveMetadata` also sounds good to me, but the parameter seems useless if we get rid of this quick fix in the future.
    
    Instead of adding parameter, I think we can pass the `paramMap` parameter when calling `saveMetadata`.
    
    For #20410 and #18982, I have a question, are they regression? Seems to me they are not new issues to 2.3.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20594#discussion_r168069150
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -290,6 +293,27 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
         }
       }
     
    +
    +  private[Bucketizer] class BucketizerWriter(instance: Bucketizer) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      // SPARK-23377: The default params will be saved and loaded as user-supplied params.
    +      // Once `inputCols` is set, the default value of `outputCol` param causes the error
    +      // when checking exclusive params. As a temporary to fix it, we remove the default
    +      // value of `outputCol` if `inputCols` is set before saving.
    +      // TODO: If we modify the persistence mechanism later to better handle default params,
    +      // we can get rid of this.
    +      var removedOutputCol: Option[String] = None
    +      if (instance.isSet(instance.inputCols)) {
    --- End diff --
    
    Yes I think #20410 is not related to this PR for now. But I am afraid in the future, when we add more functionality, potential bugs will possible to be triggered.
    But I think we don't need to care the order of them to be merged. :)



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merging with master and branch-2.3


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Success!  Merged to branch-2.3 too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    @mgaido91 Thanks for your thoughts.  We do need to persist default values; please check out the JIRA.  For fixing Python, I think the best fix will be to transfer the default & explicitly set Params to Java separately, rather than treating them all as explicitly set.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87440/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    cc @jkbradley 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20594#discussion_r168262315
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -290,6 +297,28 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
         }
       }
     
    +
    +  private[Bucketizer] class BucketizerWriter(instance: Bucketizer) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      // SPARK-23377: The default params will be saved and loaded as user-supplied params.
    +      // Once `inputCols` is set, the default value of `outputCol` param causes the error
    +      // when checking exclusive params. As a temporary to fix it, we skip the default value
    +      // of `outputCol` if `inputCols` is set when saving the metadata.
    +      // TODO: If we modify the persistence mechanism later to better handle default params,
    +      // we can get rid of this.
    +      var paramWithoutOutputCol: Option[JValue] = None
    +      if (instance.isSet(instance.inputCols)) {
    +        val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
    --- End diff --
    
    I don't think this asInstanceOf cast is necessary.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20594


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    **[Test build #87458 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87458/testReport)** for PR 20594 at commit [`174c114`](https://github.com/apache/spark/commit/174c1148e595b5eda731dbdae2c6fbf1a4dd4837).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple columns...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20594
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20594: [SPARK-23377][ML] Fixes Bucketizer with multiple ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20594#discussion_r168057330
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -290,6 +293,27 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
         }
       }
     
    +
    +  private[Bucketizer] class BucketizerWriter(instance: Bucketizer) extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      // SPARK-23377: The default params will be saved and loaded as user-supplied params.
    +      // Once `inputCols` is set, the default value of `outputCol` param causes the error
    +      // when checking exclusive params. As a temporary to fix it, we remove the default
    +      // value of `outputCol` if `inputCols` is set before saving.
    +      // TODO: If we modify the persistence mechanism later to better handle default params,
    +      // we can get rid of this.
    +      var removedOutputCol: Option[String] = None
    +      if (instance.isSet(instance.inputCols)) {
    --- End diff --
    
    Why? I think they are orthogonal and this shouldn't cause the issue in Python side. Besides, as the PySpark multi-column support is not added yet (it's reverted), I think we don't hit the Python API issue. This is a quick fix to deal with the persistence bug. I'm not sure we should be blocked.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org