You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/05/01 14:17:22 UTC

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL][WIP] Add a Bucketizer that...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/17819

    [SPARK-20542][ML][SQL][WIP] Add a Bucketizer that can bin multiple columns

    ## What changes were proposed in this pull request?
    
    Current ML's Bucketizer can only bin a column of continuous features. If a dataset has thousands of of continuous columns needed to bin, we will result in thousands of ML stages. It is inefficient regarding query planning and execution.
    
    We should have a type of bucketizer that can bin a lot of columns all at once. It would need to accept an list of arrays of split points to correspond to the columns to bin, but it might make things more efficient by replacing thousands of stages with just one.
    
    This current approach in this patch is to add a new `MultipleBucketizer` for this purpose.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-20542

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17819
    
----
commit e8f5d89ab2c344d52ae245b1b22cb4425ae6ffa0
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-05-01T14:07:10Z

    Add a Bucketizer that can bin multiple columns.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #76379 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76379/testReport)** for PR 17819 at commit [`38dce8b`](https://github.com/apache/spark/commit/38dce8b30ee3119d5c4c3c761e527fca5c2979f5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @WeichenXu123 Yeah, I'm merging it. I just want to clarify adding trait to a class doesn't necessarily makes java incompatible. :) Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    I've created https://issues.apache.org/jira/browse/SPARK-22397 to track the changes in `QuantileDiscretizer`. The PR can be submitted once we finalize this one.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76379/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @viirya Oh, I am not saying the compatibility against old version scala application. What I say is about new version `Bucketizer`, when spark user use java language(not scala language), call new added API (such as `setInputSplits`), it cannot work. i.e, the new api isn't java-compatible. for example:
    ```
    public class JavaBucketizerExample {
        public static void main(String[] args) {
            Bucketizer bucketizer = new Bucketizer().setInputCols(new String[] {"features", "features2"});
           ...
       } 
    }
    ```
    The above java program (Note not scala program) do not work. It cannot even pass compiling.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81956/testReport)** for PR 17819 at commit [`92ef9bd`](https://github.com/apache/spark/commit/92ef9bde1e048eef7e3b530286723cad5773debc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143730103
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
    @@ -187,6 +188,196 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           }
         }
       }
    +
    +  test("multiple columns: Bucket continuous features, without -inf,inf") {
    +    // Check a set of valid feature values.
    +    val splits = Array(Array(-0.5, 0.0, 0.5), Array(-0.1, 0.3, 0.5))
    +    val validData1 = Array(-0.5, -0.3, 0.0, 0.2)
    +    val validData2 = Array(0.5, 0.3, 0.0, -0.1)
    +    val expectedBuckets1 = Array(0.0, 0.0, 1.0, 1.0)
    +    val expectedBuckets2 = Array(1.0, 1.0, 0.0, 0.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer1: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer1.isBucketizeMultipleColumns())
    +
    +    bucketizer1.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    // Check for exceptions when using a set of invalid feature values.
    +    val invalidData1: Array[Double] = Array(-0.9) ++ validData1
    +    val invalidData2 = Array(0.51) ++ validData1
    +    val badDF1 = invalidData1.zipWithIndex.toSeq.toDF("feature", "idx")
    +
    +    val bucketizer2: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature"))
    +      .setOutputCols(Array("result"))
    +      .setSplitsArray(Array(splits(0)))
    +
    +    assert(bucketizer2.isBucketizeMultipleColumns())
    +
    +    withClue("Invalid feature value -0.9 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF1).collect()
    +      }
    +    }
    +    val badDF2 = invalidData2.zipWithIndex.toSeq.toDF("feature", "idx")
    +    withClue("Invalid feature value 0.51 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF2).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with -inf,inf") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.3, 0.2, 0.5, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9)
    +    val validData2 = Array(-0.1, -0.5, -0.2, 0.0, 0.1, 0.3, 0.5)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0)
    +    val expectedBuckets2 = Array(1.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer.isBucketizeMultipleColumns())
    +
    +    bucketizer.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with NaN data but non-NaN splits") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.1, 0.2, 0.6, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9, Double.NaN, Double.NaN, Double.NaN)
    +    val validData2 = Array(0.2, -0.1, 0.3, 0.0, 0.1, 0.3, 0.5, 0.8, Double.NaN, Double.NaN)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 4.0)
    +    val expectedBuckets2 = Array(2.0, 1.0, 2.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer.isBucketizeMultipleColumns())
    +
    +    bucketizer.setHandleInvalid("keep")
    +    bucketizer.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    bucketizer.setHandleInvalid("skip")
    +    val skipResults1: Array[Double] = bucketizer.transform(dataFrame)
    +      .select("result1").as[Double].collect()
    +    assert(skipResults1.length === 7)
    +    assert(skipResults1.forall(_ !== 4.0))
    +
    +    val skipResults2: Array[Double] = bucketizer.transform(dataFrame)
    +      .select("result2").as[Double].collect()
    +    assert(skipResults2.length === 7)
    +    assert(skipResults2.forall(_ !== 4.0))
    +
    +    bucketizer.setHandleInvalid("error")
    +    withClue("Bucketizer should throw error when setHandleInvalid=error and given NaN values") {
    +      intercept[SparkException] {
    +        bucketizer.transform(dataFrame).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with NaN splits") {
    +    val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity, Double.NaN)
    +    withClue("Invalid NaN split was not caught during Bucketizer initialization") {
    +      intercept[IllegalArgumentException] {
    +        new Bucketizer().setSplitsArray(Array(splits))
    +      }
    +    }
    +  }
    +
    +  test("multiple columns:: read/write") {
    +    val t = new Bucketizer()
    +      .setInputCols(Array("myInputCol"))
    +      .setOutputCols(Array("myOutputCol"))
    +      .setSplitsArray(Array(Array(0.1, 0.8, 0.9)))
    +    assert(t.isBucketizeMultipleColumns())
    +    testDefaultReadWrite(t)
    +  }
    +
    +  test("Bucketizer in a pipeline") {
    --- End diff --
    
    It may be overkill - but we would expect a `Bucketizer` with multi cols set to have precisely the same operation as multiple `Bucketizer`. Perhaps a test comparing them?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r142278697
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2120,6 +2120,19 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset by adding columns with metadata.
    +   */
    +  private[spark] def withColumns(
    --- End diff --
    
    Yeah, I see. I've left comment https://github.com/apache/spark/pull/17819#discussion_r142172037 that I will add test later.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by MarcKaminski <gi...@git.apache.org>.

Github user MarcKaminski commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    If I may throw in that e.g. ChiSqSelector issues a warning rather than throwing an exception in cases that unnecessary parameters are set, see: 
    https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala#L223 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143728145
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -96,9 +99,71 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
       def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
    --- End diff --
    
    We should make it clear that in the multi column case, the invalid handling is applied to all columns (so for `error` it will throw the error if any invalids are found in any column, for `skip` it will skip rows with any invalids in any column, etc)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #77924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77924/testReport)** for PR 17819 at commit [`4301314`](https://github.com/apache/spark/commit/4301314bc56d47ef386c69d26643337134ce783f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81956/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143733664
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -96,9 +99,71 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
       def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
       setDefault(handleInvalid, Bucketizer.ERROR_INVALID)
     
    +  /**
    +   * Parameter for specifying multiple splits parameters. Each element in this array can be used to
    +   * map continuous features into buckets.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val splitsArray: DoubleArrayArrayParam = new DoubleArrayArrayParam(this, "splitsArray",
    +    "The array of split points for mapping continuous features into buckets for multiple " +
    +      "columns. For each input column, with n+1 splits, there are n buckets. A bucket defined by " +
    +      "splits x,y holds values in the range [x,y) except the last bucket, which also includes y. " +
    +      "The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be " +
    +      "explicitly provided to cover all Double values; otherwise, values outside the splits " +
    +      "specified will be treated as errors.",
    +    Bucketizer.checkSplitsArray)
    +
    +  /**
    +   * Param for output column names.
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val outputCols: StringArrayParam = new StringArrayParam(this, "outputCols",
    --- End diff --
    
    I guess similarly to shared params? I think it makes sense to add a shared param since this, `Imputer` and others will use it


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #77935 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77935/testReport)** for PR 17819 at commit [`08cbfac`](https://github.com/apache/spark/commit/08cbfac4b8899d30454622a64bfe89aa537a2277).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick I've updated this. Please take a look. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82632/testReport)** for PR 17819 at commit [`bb19708`](https://github.com/apache/spark/commit/bb19708dcd359c434c0fac779125f949541f9b8c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick Is this ready to go?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81867/testReport)** for PR 17819 at commit [`7c38b77`](https://github.com/apache/spark/commit/7c38b77c30747e316328afbe25cc8bff1a51cd40).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @original-brownbear Thanks for letting me know.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143712667
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -108,26 +173,53 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    -      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
    -    }.withName("bucketizer")
    +    val seqOfSplits = if (isBucketizeMultipleColumns()) {
    +      $(splitsArray).toSeq
    +    } else {
    +      Seq($(splits))
    +    }
     
    -    val newCol = bucketizer(filteredDataset($(inputCol)).cast(DoubleType))
    -    val newField = prepOutputField(filteredDataset.schema)
    -    filteredDataset.withColumn($(outputCol), newCol, newField.metadata)
    +    val bucketizers: Seq[UserDefinedFunction] = seqOfSplits.zipWithIndex.map { case (splits, idx) =>
    +      udf { (feature: Double) =>
    +        Bucketizer.binarySearchForBuckets(splits, feature, keepInvalid)
    +      }.withName(s"bucketizer_$idx")
    +    }
    +
    +    val (inputColumns, outputColumns) = if (isBucketizeMultipleColumns()) {
    +      ($(inputCols).toSeq, $(outputCols).toSeq)
    +    } else {
    +      (Seq($(inputCol)), Seq($(outputCol)))
    +    }
    +    val newCols = inputColumns.zipWithIndex.map { case (inputCol, idx) =>
    +      bucketizers(idx)(filteredDataset(inputCol).cast(DoubleType))
    +    }
    +    val newFields = outputColumns.zipWithIndex.map { case (outputCol, idx) =>
    --- End diff --
    
    Have we not done this already in `transformSchema`? Can we just re-use the result of that?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143707170
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -684,6 +684,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    --- End diff --
    
    I think as `withColumn` case, we can re-implement it with `withColumns` for metadata too. So this test case can cover it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81917/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick Any more comments or thoughts on this I need to address?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r142278563
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2120,6 +2120,19 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset by adding columns with metadata.
    +   */
    +  private[spark] def withColumns(
    +      colNames: Seq[String],
    +      cols: Seq[Column],
    +      metadata: Seq[Metadata]): DataFrame = {
    +    val newCols = colNames.zip(cols).zip(metadata).map { case ((colName, col), metadata) =>
    --- End diff --
    
    Is that possible the number of elements in metadata do not match? Then, the results will be unexpected because of  this impl 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83519/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81634 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81634/testReport)** for PR 17819 at commit [`f8dedd1`](https://github.com/apache/spark/commit/f8dedd1c92a8c48358743626b99c2f2192bc09b1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82388/testReport)** for PR 17819 at commit [`2abca6b`](https://github.com/apache/spark/commit/2abca6bee5bf0e786a4c617f734797e119acdea5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77937/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Thanks. Result does look good.
    
    So the improvement is really coming from the new `withColumns` that avoids a bunch of projections in the plan in favor of one (more or less)? So the same approach can benefit any transformer that could operate on multiple cols (at least at transform time)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Btw, the reason that this change isn't java compatible, is not mainly because adding a trait to `Bucketizer`. Looks like It is because the params setter methods such as `setInputCols`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick I've done a benchmark using the test dataset provided in JIRA SPARK-20392 (blockbuster.csv).
    
    The ML pipeline includes 2 `StringIndexer`s and 1 `MultipleBucketizer` or 137 `Bucketizer`s to bin 137 input columns with the same splits.
    
    MultipleBucketizer: 3352 ms
    Bucketizer: 51512 ms



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143724681
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
    @@ -187,6 +188,196 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           }
         }
       }
    +
    +  test("multiple columns: Bucket continuous features, without -inf,inf") {
    +    // Check a set of valid feature values.
    +    val splits = Array(Array(-0.5, 0.0, 0.5), Array(-0.1, 0.3, 0.5))
    +    val validData1 = Array(-0.5, -0.3, 0.0, 0.2)
    +    val validData2 = Array(0.5, 0.3, 0.0, -0.1)
    +    val expectedBuckets1 = Array(0.0, 0.0, 1.0, 1.0)
    +    val expectedBuckets2 = Array(1.0, 1.0, 0.0, 0.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer1: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer1.isBucketizeMultipleColumns())
    +
    +    bucketizer1.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    // Check for exceptions when using a set of invalid feature values.
    +    val invalidData1: Array[Double] = Array(-0.9) ++ validData1
    +    val invalidData2 = Array(0.51) ++ validData1
    +    val badDF1 = invalidData1.zipWithIndex.toSeq.toDF("feature", "idx")
    +
    +    val bucketizer2: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature"))
    +      .setOutputCols(Array("result"))
    +      .setSplitsArray(Array(splits(0)))
    +
    +    assert(bucketizer2.isBucketizeMultipleColumns())
    +
    +    withClue("Invalid feature value -0.9 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF1).collect()
    +      }
    +    }
    +    val badDF2 = invalidData2.zipWithIndex.toSeq.toDF("feature", "idx")
    +    withClue("Invalid feature value 0.51 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF2).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with -inf,inf") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.3, 0.2, 0.5, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9)
    +    val validData2 = Array(-0.1, -0.5, -0.2, 0.0, 0.1, 0.3, 0.5)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0)
    +    val expectedBuckets2 = Array(1.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    --- End diff --
    
    Same here, `toSeq` unnecessary?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143724079
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
    @@ -187,6 +188,196 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           }
         }
       }
    +
    +  test("multiple columns: Bucket continuous features, without -inf,inf") {
    +    // Check a set of valid feature values.
    +    val splits = Array(Array(-0.5, 0.0, 0.5), Array(-0.1, 0.3, 0.5))
    +    val validData1 = Array(-0.5, -0.3, 0.0, 0.2)
    +    val validData2 = Array(0.5, 0.3, 0.0, -0.1)
    +    val expectedBuckets1 = Array(0.0, 0.0, 1.0, 1.0)
    +    val expectedBuckets2 = Array(1.0, 1.0, 0.0, 0.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer1: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer1.isBucketizeMultipleColumns())
    +
    +    bucketizer1.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    // Check for exceptions when using a set of invalid feature values.
    +    val invalidData1: Array[Double] = Array(-0.9) ++ validData1
    --- End diff --
    
    Is this type annotation required?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick Yea, you're right, only move `setXXX` to concrete class also work fine. The root cause is the `setXXX` return type. But I think the multi / single logic can be merged, because single input column is a special case of multiple input column. What do you think of it ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82388/testReport)** for PR 17819 at commit [`2abca6b`](https://github.com/apache/spark/commit/2abca6bee5bf0e786a4c617f734797e119acdea5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by MarcKaminski <gi...@git.apache.org>.

Github user MarcKaminski commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Hello, 
    
    I found a bug that occurs when putting the new Bucketizer into a Pipeline and calling fit on it. 
    Calling fit on a Pipeline calls the corresponding transformSchema of each PipelineStage in it. 
    
    Therefore, the transformSchema [method of the Bucketizer](https://github.com/viirya/spark-1/blob/f8dedd1c92a8c48358743626b99c2f2192bc09b1/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L146) is called, which checks for the params of the **conventional** Bucketizer (i.e. inputCol). 
    
    Steps to reproduce: 
    ```
    import org.apache.spark.ml._
    import org.apache.spark.ml.feature.Bucketizer
    
    case class data(f1: Double, f2: Double) 
    val datArr = Array(data(0.5, 0.3), data(0.5, -0.4))
    val df = spark.createDataFrame(datArr)
    
    val bucket = new Bucketizer()
      .setInputCols(Array("f1", "f2"))
      .setOutputCols(Array("f1_bu", "f2_bu"))
      .setSplitsArray(Array(Array(-0.5, 0.0, 0.5), Array(-0.5, 0.0, 0.5)))
    
    // Will work
    bucket.transform(df) show
    
    // Will fail catastrophically
    val pl = new Pipeline()
     .setStages(Array(bucket))
     .fit(df)
    ```
    
    Exception thrown by last line: 
    
    ```
    java.util.NoSuchElementException: Failed to find a default value for inputCol
      at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:691)
      at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:691)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:690)
      at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
      at org.apache.spark.ml.param.Params$class.$(params.scala:697)
      at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
      at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:147)
      at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
      at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
      at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
      at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
      at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
      at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
      at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
      at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
      ... 55 elided
    ```
    
    Since this has not yet been merged into Master, maybe you'd be still able to fix this and add a test for?
    Thanks!



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @gatorsmile The related test is added. Please take a look again. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143730302
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
    @@ -187,6 +188,196 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           }
         }
       }
    +
    +  test("multiple columns: Bucket continuous features, without -inf,inf") {
    +    // Check a set of valid feature values.
    +    val splits = Array(Array(-0.5, 0.0, 0.5), Array(-0.1, 0.3, 0.5))
    +    val validData1 = Array(-0.5, -0.3, 0.0, 0.2)
    +    val validData2 = Array(0.5, 0.3, 0.0, -0.1)
    +    val expectedBuckets1 = Array(0.0, 0.0, 1.0, 1.0)
    +    val expectedBuckets2 = Array(1.0, 1.0, 0.0, 0.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer1: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer1.isBucketizeMultipleColumns())
    +
    +    bucketizer1.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    // Check for exceptions when using a set of invalid feature values.
    +    val invalidData1: Array[Double] = Array(-0.9) ++ validData1
    +    val invalidData2 = Array(0.51) ++ validData1
    +    val badDF1 = invalidData1.zipWithIndex.toSeq.toDF("feature", "idx")
    +
    +    val bucketizer2: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature"))
    +      .setOutputCols(Array("result"))
    +      .setSplitsArray(Array(splits(0)))
    +
    +    assert(bucketizer2.isBucketizeMultipleColumns())
    +
    +    withClue("Invalid feature value -0.9 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF1).collect()
    +      }
    +    }
    +    val badDF2 = invalidData2.zipWithIndex.toSeq.toDF("feature", "idx")
    +    withClue("Invalid feature value 0.51 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF2).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with -inf,inf") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.3, 0.2, 0.5, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9)
    +    val validData2 = Array(-0.1, -0.5, -0.2, 0.0, 0.1, 0.3, 0.5)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0)
    +    val expectedBuckets2 = Array(1.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer.isBucketizeMultipleColumns())
    +
    +    bucketizer.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with NaN data but non-NaN splits") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.1, 0.2, 0.6, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9, Double.NaN, Double.NaN, Double.NaN)
    +    val validData2 = Array(0.2, -0.1, 0.3, 0.0, 0.1, 0.3, 0.5, 0.8, Double.NaN, Double.NaN)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 4.0)
    +    val expectedBuckets2 = Array(2.0, 1.0, 2.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer.isBucketizeMultipleColumns())
    +
    +    bucketizer.setHandleInvalid("keep")
    +    bucketizer.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    bucketizer.setHandleInvalid("skip")
    +    val skipResults1: Array[Double] = bucketizer.transform(dataFrame)
    +      .select("result1").as[Double].collect()
    +    assert(skipResults1.length === 7)
    +    assert(skipResults1.forall(_ !== 4.0))
    +
    +    val skipResults2: Array[Double] = bucketizer.transform(dataFrame)
    +      .select("result2").as[Double].collect()
    +    assert(skipResults2.length === 7)
    +    assert(skipResults2.forall(_ !== 4.0))
    +
    +    bucketizer.setHandleInvalid("error")
    +    withClue("Bucketizer should throw error when setHandleInvalid=error and given NaN values") {
    +      intercept[SparkException] {
    +        bucketizer.transform(dataFrame).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with NaN splits") {
    +    val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity, Double.NaN)
    +    withClue("Invalid NaN split was not caught during Bucketizer initialization") {
    +      intercept[IllegalArgumentException] {
    +        new Bucketizer().setSplitsArray(Array(splits))
    +      }
    +    }
    +  }
    +
    +  test("multiple columns:: read/write") {
    +    val t = new Bucketizer()
    +      .setInputCols(Array("myInputCol"))
    +      .setOutputCols(Array("myOutputCol"))
    +      .setSplitsArray(Array(Array(0.1, 0.8, 0.9)))
    +    assert(t.isBucketizeMultipleColumns())
    +    testDefaultReadWrite(t)
    +  }
    +
    +  test("Bucketizer in a pipeline") {
    +    val df = Seq((0.5, 0.3, 1.0, 1.0), (0.5, -0.4, 1.0, 0.0))
    +      .toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucket = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(Array(Array(-0.5, 0.0, 0.5), Array(-0.5, 0.0, 0.5)))
    +
    +    assert(bucket.isBucketizeMultipleColumns())
    +
    +    val pl = new Pipeline()
    +      .setStages(Array(bucket))
    +      .fit(df)
    +    pl.transform(df).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    --- End diff --
    
    This logic is duplicated across a few test cases - perhaps we could factor it out into a utility method.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81935 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81935/testReport)** for PR 17819 at commit [`92ef9bd`](https://github.com/apache/spark/commit/92ef9bde1e048eef7e3b530286723cad5773debc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143947158
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaBucketizerExample.java ---
    @@ -33,6 +33,13 @@
     import org.apache.spark.sql.types.StructType;
     // $example off$
     
    --- End diff --
    
    This is original for testing java pipeline. Added a Scala example.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by leliang65 <gi...@git.apache.org>.

Github user leliang65 commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Is there any python example for this api?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Thanks @MLnick 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by tengpeng <gi...@git.apache.org>.

Github user tengpeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r152891005
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -108,26 +164,53 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    -      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
    -    }.withName("bucketizer")
    +    val seqOfSplits = if (isBucketizeMultipleColumns()) {
    +      $(splitsArray).toSeq
    --- End diff --
    
    I am interested in the motivation of using `.toSeq` and `Seq()` here


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick I have no strong option but @WeichenXu123 seems more preferring merging the new API into current `Bucketizer`. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143706308
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -684,6 +684,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
         }
       }
     
    --- End diff --
    
    I noticed we don't have a test for the single column `withColumn: given metadata`. I wonder if that should be added (though it's not related to this PR)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by original-brownbear <gi...@git.apache.org>.

Github user original-brownbear commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @viirya re `HiveExternalCatalogVersionsSuite`, jup it is https://github.com/apache/spark/commit/dbb824125d4d31166d9a47c330f8d51f5d159515#commitcomment-24354358 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #76406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76406/testReport)** for PR 17819 at commit [`6ff9c79`](https://github.com/apache/spark/commit/6ff9c7998688107a835875ea41e6fe9576a1558c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82402/testReport)** for PR 17819 at commit [`000844a`](https://github.com/apache/spark/commit/000844ab1f0dffef9b51b96f7edc1e1ab9e9e0b7).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #76379 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76379/testReport)** for PR 17819 at commit [`38dce8b`](https://github.com/apache/spark/commit/38dce8b30ee3119d5c4c3c761e527fca5c2979f5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @viirya It is possible I think. A similar example is, `HasRegParam`  trait, do not put `setRegParam` in trait but moved into concrete estimator/transformer class, should be the same reason.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81867/testReport)** for PR 17819 at commit [`7c38b77`](https://github.com/apache/spark/commit/7c38b77c30747e316328afbe25cc8bff1a51cd40).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17819

The bunch of projections will be collapsed in optimization. So it doesn't affect query execution. However, every `withColumn` call creates new `DataFrame` along with a projection on previous logical plan. It is costly by creating new query execution, analyzing logical plan, creating encoder, etc. So the improvement is coming from saving the cost by doing this one time with `withColumns`, instead of multiple `withColumn`.

It can benefit other transformers that could work on multiple cols. I even have an idea to revamp the interface of `Transformer`. Because the transformation in `Transformer` is actually ending with a `withColumn` call to add/replace column. They are actually transforming columns in the dataset. But the performance difference is obvious only when the number of transformation stages is large enough like the example of many `Bucketizer`s. So it may not worth doing that. Just a thought.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    I think we can only check `inputCol` and `inputCols`. If both are set, throw an exception. Namely depends on which one is set, we go single column or multiple columns path.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @viirya can you resolve the conflicts now that #19229 was merged?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143728324
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
    @@ -187,6 +188,196 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           }
         }
       }
    +
    +  test("multiple columns: Bucket continuous features, without -inf,inf") {
    +    // Check a set of valid feature values.
    +    val splits = Array(Array(-0.5, 0.0, 0.5), Array(-0.1, 0.3, 0.5))
    +    val validData1 = Array(-0.5, -0.3, 0.0, 0.2)
    +    val validData2 = Array(0.5, 0.3, 0.0, -0.1)
    +    val expectedBuckets1 = Array(0.0, 0.0, 1.0, 1.0)
    +    val expectedBuckets2 = Array(1.0, 1.0, 0.0, 0.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer1: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer1.isBucketizeMultipleColumns())
    +
    +    bucketizer1.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    // Check for exceptions when using a set of invalid feature values.
    +    val invalidData1: Array[Double] = Array(-0.9) ++ validData1
    +    val invalidData2 = Array(0.51) ++ validData1
    +    val badDF1 = invalidData1.zipWithIndex.toSeq.toDF("feature", "idx")
    +
    +    val bucketizer2: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature"))
    +      .setOutputCols(Array("result"))
    +      .setSplitsArray(Array(splits(0)))
    +
    +    assert(bucketizer2.isBucketizeMultipleColumns())
    +
    +    withClue("Invalid feature value -0.9 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF1).collect()
    +      }
    +    }
    +    val badDF2 = invalidData2.zipWithIndex.toSeq.toDF("feature", "idx")
    +    withClue("Invalid feature value 0.51 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF2).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with -inf,inf") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.3, 0.2, 0.5, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9)
    +    val validData2 = Array(-0.1, -0.5, -0.2, 0.0, 0.1, 0.3, 0.5)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0)
    +    val expectedBuckets2 = Array(1.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer.isBucketizeMultipleColumns())
    +
    +    bucketizer.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with NaN data but non-NaN splits") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.1, 0.2, 0.6, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9, Double.NaN, Double.NaN, Double.NaN)
    +    val validData2 = Array(0.2, -0.1, 0.3, 0.0, 0.1, 0.3, 0.5, 0.8, Double.NaN, Double.NaN)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 4.0)
    +    val expectedBuckets2 = Array(2.0, 1.0, 2.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    --- End diff --
    
    Same here, `toSeq` unnecessary.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81956 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81956/testReport)** for PR 17819 at commit [`92ef9bd`](https://github.com/apache/spark/commit/92ef9bde1e048eef7e3b530286723cad5773debc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick Thanks for leaving the comments. I think I've addressed all of them. Please take a look if you are free. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Yes, fair enough
    
    On Tue, 10 Oct 2017 at 14:09 Liang-Chi Hsieh <no...@github.com>
    wrote:
    
    > *@viirya* commented on this pull request.
    > ------------------------------
    >
    > In sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
    > <https://github.com/apache/spark/pull/17819#discussion_r143707170>:
    >
    > > @@ -684,6 +684,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
    >      }
    >    }
    >
    >
    > I think as withColumn case, we can re-implement it with withColumns for
    > metadata too. So this test case can cover it.
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly, view it on GitHub
    > <https://github.com/apache/spark/pull/17819#discussion_r143707170>, or mute
    > the thread
    > <https://github.com/notifications/unsubscribe-auth/AA_SB1wjHVAXkr6b_xHp5URSmPjvFCd_ks5sq16HgaJpZM4NNEr5>
    > .
    >



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143947510
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
    @@ -187,6 +188,196 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           }
         }
       }
    +
    +  test("multiple columns: Bucket continuous features, without -inf,inf") {
    +    // Check a set of valid feature values.
    +    val splits = Array(Array(-0.5, 0.0, 0.5), Array(-0.1, 0.3, 0.5))
    +    val validData1 = Array(-0.5, -0.3, 0.0, 0.2)
    +    val validData2 = Array(0.5, 0.3, 0.0, -0.1)
    +    val expectedBuckets1 = Array(0.0, 0.0, 1.0, 1.0)
    +    val expectedBuckets2 = Array(1.0, 1.0, 0.0, 0.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer1: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer1.isBucketizeMultipleColumns())
    +
    +    bucketizer1.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    // Check for exceptions when using a set of invalid feature values.
    +    val invalidData1: Array[Double] = Array(-0.9) ++ validData1
    +    val invalidData2 = Array(0.51) ++ validData1
    +    val badDF1 = invalidData1.zipWithIndex.toSeq.toDF("feature", "idx")
    +
    +    val bucketizer2: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature"))
    +      .setOutputCols(Array("result"))
    +      .setSplitsArray(Array(splits(0)))
    +
    +    assert(bucketizer2.isBucketizeMultipleColumns())
    +
    +    withClue("Invalid feature value -0.9 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF1).collect()
    +      }
    +    }
    +    val badDF2 = invalidData2.zipWithIndex.toSeq.toDF("feature", "idx")
    +    withClue("Invalid feature value 0.51 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF2).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with -inf,inf") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.3, 0.2, 0.5, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9)
    +    val validData2 = Array(-0.1, -0.5, -0.2, 0.0, 0.1, 0.3, 0.5)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0)
    +    val expectedBuckets2 = Array(1.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer.isBucketizeMultipleColumns())
    +
    +    bucketizer.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with NaN data but non-NaN splits") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.1, 0.2, 0.6, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9, Double.NaN, Double.NaN, Double.NaN)
    +    val validData2 = Array(0.2, -0.1, 0.3, 0.0, 0.1, 0.3, 0.5, 0.8, Double.NaN, Double.NaN)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 4.0)
    +    val expectedBuckets2 = Array(2.0, 1.0, 2.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer.isBucketizeMultipleColumns())
    +
    +    bucketizer.setHandleInvalid("keep")
    +    bucketizer.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    bucketizer.setHandleInvalid("skip")
    +    val skipResults1: Array[Double] = bucketizer.transform(dataFrame)
    +      .select("result1").as[Double].collect()
    +    assert(skipResults1.length === 7)
    +    assert(skipResults1.forall(_ !== 4.0))
    +
    +    val skipResults2: Array[Double] = bucketizer.transform(dataFrame)
    +      .select("result2").as[Double].collect()
    +    assert(skipResults2.length === 7)
    +    assert(skipResults2.forall(_ !== 4.0))
    +
    +    bucketizer.setHandleInvalid("error")
    +    withClue("Bucketizer should throw error when setHandleInvalid=error and given NaN values") {
    +      intercept[SparkException] {
    +        bucketizer.transform(dataFrame).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with NaN splits") {
    +    val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity, Double.NaN)
    +    withClue("Invalid NaN split was not caught during Bucketizer initialization") {
    +      intercept[IllegalArgumentException] {
    +        new Bucketizer().setSplitsArray(Array(splits))
    +      }
    +    }
    +  }
    +
    +  test("multiple columns:: read/write") {
    +    val t = new Bucketizer()
    +      .setInputCols(Array("myInputCol"))
    +      .setOutputCols(Array("myOutputCol"))
    +      .setSplitsArray(Array(Array(0.1, 0.8, 0.9)))
    +    assert(t.isBucketizeMultipleColumns())
    +    testDefaultReadWrite(t)
    +  }
    +
    +  test("Bucketizer in a pipeline") {
    --- End diff --
    
    Sure. Added a test for it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143723205
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -96,9 +99,71 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
       def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
       setDefault(handleInvalid, Bucketizer.ERROR_INVALID)
     
    +  /**
    +   * Parameter for specifying multiple splits parameters. Each element in this array can be used to
    +   * map continuous features into buckets.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val splitsArray: DoubleArrayArrayParam = new DoubleArrayArrayParam(this, "splitsArray",
    +    "The array of split points for mapping continuous features into buckets for multiple " +
    +      "columns. For each input column, with n+1 splits, there are n buckets. A bucket defined by " +
    +      "splits x,y holds values in the range [x,y) except the last bucket, which also includes y. " +
    +      "The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be " +
    +      "explicitly provided to cover all Double values; otherwise, values outside the splits " +
    +      "specified will be treated as errors.",
    +    Bucketizer.checkSplitsArray)
    +
    +  /**
    +   * Param for output column names.
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val outputCols: StringArrayParam = new StringArrayParam(this, "outputCols",
    +    "output column names")
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getSplitsArray: Array[Array[Double]] = $(splitsArray)
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  final def getOutputCols: Array[String] = $(outputCols)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setSplitsArray(value: Array[Array[Double]]): this.type = set(splitsArray, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
    +
    +  /** @group setParam */
    +  @Since("2.3.0")
    +  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
    +
    +  /**
    +   * Determines whether this `Bucketizer` is going to map multiple columns. If and only if
    +   * `inputCols` is set, it will map multiple columns. Otherwise, it just maps a column specified
    +   * by `inputCol`. A warning will be printed if both are set.
    +   */
    +  private[ml] def isBucketizeMultipleColumns(): Boolean = {
    --- End diff --
    
    `private[feature]`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81864/testReport)** for PR 17819 at commit [`7c38b77`](https://github.com/apache/spark/commit/7c38b77c30747e316328afbe25cc8bff1a51cd40).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @leliang65 The PySpark support is not added yet. Please refer to #19892.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by leliang65 <gi...@git.apache.org>.

Github user leliang65 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r182644949
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaBucketizerExample.java ---
    @@ -33,6 +33,13 @@
     import org.apache.spark.sql.types.StructType;
     // $example off$
     
    --- End diff --
    
    Is there any python example?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL][WIP] Add a Bucketizer that can bi...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #76349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76349/testReport)** for PR 17819 at commit [`e8f5d89`](https://github.com/apache/spark/commit/e8f5d89ab2c344d52ae245b1b22cb4425ae6ffa0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class MultipleBucketizer @Since(\"2.3.0\") (@Since(\"2.3.0\") override val uid: String)`
      * `class DoubleArrayArrayParam(`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by barrybecker4 <gi...@git.apache.org>.

Github user barrybecker4 commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    I don't see support for withColumns in spark 2.1.1. Which version does it first appear? This work seems related to https://issues.apache.org/jira/browse/SPARK-12225.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77935/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81869/testReport)** for PR 17819 at commit [`7c38b77`](https://github.com/apache/spark/commit/7c38b77c30747e316328afbe25cc8bff1a51cd40).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #83519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83519/testReport)** for PR 17819 at commit [`a970723`](https://github.com/apache/spark/commit/a97072321487bc515fe14704021f6533715e27e3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL][WIP] Add a Bucketizer that can bi...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #76349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76349/testReport)** for PR 17819 at commit [`e8f5d89`](https://github.com/apache/spark/commit/e8f5d89ab2c344d52ae245b1b22cb4425ae6ffa0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r114234373
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1883,6 +1883,56 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset by adding columns or replacing the existing columns that has
    +   * the same names.
    +   *
    +   * @group untypedrel
    +   * @since 2.3.0
    +   */
    +  def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
    --- End diff --
    
    I am wondering shall I make an individual PR for this SQL change. cc @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143708258
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -24,20 +24,23 @@ import org.apache.spark.annotation.Since
     import org.apache.spark.ml.Model
     import org.apache.spark.ml.attribute.NominalAttribute
     import org.apache.spark.ml.param._
    -import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasOutputCol}
    +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol}
     import org.apache.spark.ml.util._
     import org.apache.spark.sql._
     import org.apache.spark.sql.expressions.UserDefinedFunction
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
     
     /**
    - * `Bucketizer` maps a column of continuous features to a column of feature buckets.
    + * `Bucketizer` maps a column of continuous features to a column of feature buckets. Since 2.3.0,
    + * `Bucketizer` can also map multiple columns at once. Whether it goes to map a column or multiple
    --- End diff --
    
    Perhaps:
    
    >Since 2.3.0, `Bucketizer` can map multiple columns at once by setting the `inputCols` parameter. Note that when both the `inputCol` and `inputCols` parameters are set, a log warning will be printed and only `inputCol` will take effect, while `inputCols` will be ignored.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17819


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #77937 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77937/testReport)** for PR 17819 at commit [`8386d1e`](https://github.com/apache/spark/commit/8386d1ea52a5daecea24334deb8a122e46333d84).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81963/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @viirya Scala `with trait` is a complex mechanism and `trait` isn't equivalent to java's `interface`. Scala compiler will precompile and generate many other codes, so java-side code cannot directly call methods in `with trait`. In order to test this, you can modify this file and add a line:
    ```
    --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketizerExample.java
    +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketizerExample.java
    @@ -59,7 +59,8 @@ public class JavaBucketizerExample {
         Bucketizer bucketizer = new Bucketizer()
           .setInputCol("features")
           .setOutputCol("bucketedFeatures")
    -      .setSplits(splits);
    +      .setSplits(splits)
    +      .setInputCols(new String[] {"features", "features2"});
     
         // Transform original data into its bucket index.
         Dataset<Row> bucketedData = bucketizer.transform(dataFrame);
    ```
    and you will find this Java class compile failed.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82583/testReport)** for PR 17819 at commit [`1889995`](https://github.com/apache/spark/commit/1889995c12e55b2420726540756b4b0b69b1bb28).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @WeichenXu123 Do you mean we keep both inputCol and inputCols in `Bucketizer`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    retest this please.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82016/testReport)** for PR 17819 at commit [`f70fc2a`](https://github.com/apache/spark/commit/f70fc2a956082a83859313f663649a9b17dbbf36).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82388/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82403/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82403/testReport)** for PR 17819 at commit [`000844a`](https://github.com/apache/spark/commit/000844ab1f0dffef9b51b96f7edc1e1ab9e9e0b7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76406/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r114254089
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1883,6 +1883,56 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset by adding columns or replacing the existing columns that has
    +   * the same names.
    +   *
    +   * @group untypedrel
    +   * @since 2.3.0
    +   */
    +  def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
    --- End diff --
    
    how about we make it `private[sql]`? I'm not sure if this API is good enough.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r142278414
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2120,6 +2120,19 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset by adding columns with metadata.
    +   */
    +  private[spark] def withColumns(
    --- End diff --
    
    This is not being tested in SQL


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by huaxingao <gi...@git.apache.org>.

Github user huaxingao commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @AFractalThought @viirya 
    I have made changes for QuantileDiscretizer based on this PR. Once this PR is merged, I will open a jira to submit the PR for QuantileDiscretizer. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #77924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77924/testReport)** for PR 17819 at commit [`4301314`](https://github.com/apache/spark/commit/4301314bc56d47ef386c69d26643337134ce783f).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81963/testReport)** for PR 17819 at commit [`60d3ba1`](https://github.com/apache/spark/commit/60d3ba1ec3c2c9d767e8f63f43aadda2de4c4e28).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Sorry I have to reply on a phone, so I may not write codes smoothly.
    
    What I mean it doesn't break binary compatibility, is the existing users
    codes using Bucketizer don't need to recompile. If you want to use new
    inputCols stuff, of course you need to recompile your codes. I think this
    is what about binary compatibility, isn't?
    
    
    
    On Sep 19, 2017 9:13 AM, "WeichenXu" <no...@github.com> wrote:
    
    @viirya <https://github.com/viirya> Scala with trait is a complex mechanism
    and trait isn't equivalent to java's interface. Scala compiler will
    precompile and generate many other codes, so java-side code cannot directly
    call methods in with trait. In order to test this, you can modify this file
    and add a line:
    
    --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketizerExample.java
    +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaBucketizerExample.java
    @@ -59,7 +59,8 @@ public class JavaBucketizerExample {
         Bucketizer bucketizer = new Bucketizer()
           .setInputCol("features")
           .setOutputCol("bucketedFeatures")
    -      .setSplits(splits);
    +      .setSplits(splits)
    +      .setInputCols(new String[] {"features", "features2"});
    
         // Transform original data into its bucket index.
         Dataset<Row> bucketedData = bucketizer.transform(dataFrame);
    
    and you will find this Java class compile failed.
    
    —
    You are receiving this because you were mentioned.
    Reply to this email directly, view it on GitHub
    <https://github.com/apache/spark/pull/17819#issuecomment-330399059>, or mute
    the thread
    <https://github.com/notifications/unsubscribe-auth/AAEM9yeUnEowP2PhdXG2XF9DvRcVkg4Tks5sjxUrgaJpZM4NNEr5>
    .



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82402/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143728663
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
    @@ -187,6 +188,196 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           }
         }
       }
    +
    +  test("multiple columns: Bucket continuous features, without -inf,inf") {
    +    // Check a set of valid feature values.
    +    val splits = Array(Array(-0.5, 0.0, 0.5), Array(-0.1, 0.3, 0.5))
    +    val validData1 = Array(-0.5, -0.3, 0.0, 0.2)
    +    val validData2 = Array(0.5, 0.3, 0.0, -0.1)
    +    val expectedBuckets1 = Array(0.0, 0.0, 1.0, 1.0)
    +    val expectedBuckets2 = Array(1.0, 1.0, 0.0, 0.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer1: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer1.isBucketizeMultipleColumns())
    +
    +    bucketizer1.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    // Check for exceptions when using a set of invalid feature values.
    +    val invalidData1: Array[Double] = Array(-0.9) ++ validData1
    +    val invalidData2 = Array(0.51) ++ validData1
    +    val badDF1 = invalidData1.zipWithIndex.toSeq.toDF("feature", "idx")
    +
    +    val bucketizer2: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature"))
    +      .setOutputCols(Array("result"))
    +      .setSplitsArray(Array(splits(0)))
    +
    +    assert(bucketizer2.isBucketizeMultipleColumns())
    +
    +    withClue("Invalid feature value -0.9 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF1).collect()
    +      }
    +    }
    +    val badDF2 = invalidData2.zipWithIndex.toSeq.toDF("feature", "idx")
    +    withClue("Invalid feature value 0.51 was not caught as an invalid feature!") {
    +      intercept[SparkException] {
    +        bucketizer2.transform(badDF2).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with -inf,inf") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.3, 0.2, 0.5, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9)
    +    val validData2 = Array(-0.1, -0.5, -0.2, 0.0, 0.1, 0.3, 0.5)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0)
    +    val expectedBuckets2 = Array(1.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer.isBucketizeMultipleColumns())
    +
    +    bucketizer.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with NaN data but non-NaN splits") {
    +    val splits = Array(
    +      Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
    +      Array(Double.NegativeInfinity, -0.1, 0.2, 0.6, Double.PositiveInfinity))
    +
    +    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9, Double.NaN, Double.NaN, Double.NaN)
    +    val validData2 = Array(0.2, -0.1, 0.3, 0.0, 0.1, 0.3, 0.5, 0.8, Double.NaN, Double.NaN)
    +    val expectedBuckets1 = Array(0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 4.0)
    +    val expectedBuckets2 = Array(2.0, 1.0, 2.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    +
    +    val bucketizer: Bucketizer = new Bucketizer()
    +      .setInputCols(Array("feature1", "feature2"))
    +      .setOutputCols(Array("result1", "result2"))
    +      .setSplitsArray(splits)
    +
    +    assert(bucketizer.isBucketizeMultipleColumns())
    +
    +    bucketizer.setHandleInvalid("keep")
    +    bucketizer.transform(dataFrame).select("result1", "expected1", "result2", "expected2")
    +      .collect().foreach {
    +        case Row(r1: Double, e1: Double, r2: Double, e2: Double) =>
    +          assert(r1 === e1,
    +            s"The feature value is not correct after bucketing. Expected $e1 but found $r1")
    +          assert(r2 === e2,
    +            s"The feature value is not correct after bucketing. Expected $e2 but found $r2")
    +      }
    +
    +    bucketizer.setHandleInvalid("skip")
    +    val skipResults1: Array[Double] = bucketizer.transform(dataFrame)
    +      .select("result1").as[Double].collect()
    +    assert(skipResults1.length === 7)
    +    assert(skipResults1.forall(_ !== 4.0))
    +
    +    val skipResults2: Array[Double] = bucketizer.transform(dataFrame)
    +      .select("result2").as[Double].collect()
    +    assert(skipResults2.length === 7)
    +    assert(skipResults2.forall(_ !== 4.0))
    +
    +    bucketizer.setHandleInvalid("error")
    +    withClue("Bucketizer should throw error when setHandleInvalid=error and given NaN values") {
    +      intercept[SparkException] {
    +        bucketizer.transform(dataFrame).collect()
    +      }
    +    }
    +  }
    +
    +  test("multiple columns: Bucket continuous features, with NaN splits") {
    +    val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity, Double.NaN)
    +    withClue("Invalid NaN split was not caught during Bucketizer initialization") {
    +      intercept[IllegalArgumentException] {
    +        new Bucketizer().setSplitsArray(Array(splits))
    +      }
    +    }
    +  }
    +
    +  test("multiple columns:: read/write") {
    --- End diff --
    
    Nit: two `:` here


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    I will try to take a look soon. My main concern is whether we should really have a new class - it starts to make things really messy if we introduce `Multi` versions of everything (e.g. we may want to add multi col support to `StringIndexer`, `OneHotEncoder` among others).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #76406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76406/testReport)** for PR 17819 at commit [`6ff9c79`](https://github.com/apache/spark/commit/6ff9c7998688107a835875ea41e6fe9576a1558c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    ping @MLnick Do you have more comments on this? Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r114278633
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1883,6 +1883,56 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset by adding columns or replacing the existing columns that has
    +   * the same names.
    +   *
    +   * @group untypedrel
    +   * @since 2.3.0
    +   */
    +  def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
    --- End diff --
    
    Sounds good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #83519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83519/testReport)** for PR 17819 at commit [`a970723`](https://github.com/apache/spark/commit/a97072321487bc515fe14704021f6533715e27e3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    ping @MLnick Can you have time to help review this recently? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81864/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81867/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MarcKaminski Thanks! It is great you can provide a reproducible case! I will fix it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143730685
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -24,20 +24,23 @@ import org.apache.spark.annotation.Since
     import org.apache.spark.ml.Model
     import org.apache.spark.ml.attribute.NominalAttribute
     import org.apache.spark.ml.param._
    -import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasOutputCol}
    +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol}
     import org.apache.spark.ml.util._
     import org.apache.spark.sql._
     import org.apache.spark.sql.expressions.UserDefinedFunction
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
     
     /**
    - * `Bucketizer` maps a column of continuous features to a column of feature buckets.
    + * `Bucketizer` maps a column of continuous features to a column of feature buckets. Since 2.3.0,
    + * `Bucketizer` can also map multiple columns at once. Whether it goes to map a column or multiple
    + * columns, it depends on which parameter of `inputCol` and `inputCols` is set. When both are set,
    + * a log warning will be printed and by default it chooses `inputCol`.
    --- End diff --
    
    We should probably also mention that `splits` is only used for single column and `splitsArray` for multi column


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81869/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81917/testReport)** for PR 17819 at commit [`ccf92cd`](https://github.com/apache/spark/commit/ccf92cdd06f336392dc1177f270bfb98c5c5c0d4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82403/testReport)** for PR 17819 at commit [`000844a`](https://github.com/apache/spark/commit/000844ab1f0dffef9b51b96f7edc1e1ab9e9e0b7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Note: since in `Transformer`, there might be other manipulation to the dataset like dropping NaN values. The idea above won't work under that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82016/testReport)** for PR 17819 at commit [`f70fc2a`](https://github.com/apache/spark/commit/f70fc2a956082a83859313f663649a9b17dbbf36).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82583/testReport)** for PR 17819 at commit [`1889995`](https://github.com/apache/spark/commit/1889995c12e55b2420726540756b4b0b69b1bb28).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @WeichenXu123 I see. That's correct this change is not java compatible. Thanks for pointing out. I'm merging the changes into `Bucketizer`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @viirya could you resolve conflicts?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick That's right. I also have concern about this. However, to keep the original single-column Bucketizer and multiple-column Bucketizer in one class seems also producing a messy code.
    
    I'd rethink it and see if there is a good way to incorporate both.
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged to master. Thanks @viirya and all the reviewers!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143930542
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -96,9 +99,71 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
       def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
       setDefault(handleInvalid, Bucketizer.ERROR_INVALID)
     
    +  /**
    +   * Parameter for specifying multiple splits parameters. Each element in this array can be used to
    +   * map continuous features into buckets.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val splitsArray: DoubleArrayArrayParam = new DoubleArrayArrayParam(this, "splitsArray",
    +    "The array of split points for mapping continuous features into buckets for multiple " +
    +      "columns. For each input column, with n+1 splits, there are n buckets. A bucket defined by " +
    +      "splits x,y holds values in the range [x,y) except the last bucket, which also includes y. " +
    +      "The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be " +
    +      "explicitly provided to cover all Double values; otherwise, values outside the splits " +
    +      "specified will be treated as errors.",
    +    Bucketizer.checkSplitsArray)
    +
    +  /**
    +   * Param for output column names.
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val outputCols: StringArrayParam = new StringArrayParam(this, "outputCols",
    --- End diff --
    
    I will create `HasOutputCols`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    ok to test.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    `HiveExternalCatalogVersionsSuite` seems flaky?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81917/testReport)** for PR 17819 at commit [`ccf92cd`](https://github.com/apache/spark/commit/ccf92cdd06f336392dc1177f270bfb98c5c5c0d4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Yes you can only move `setInputCols` into the outer class to resolve this issue. But I prefer merge it together. I think we can unify the `transform` method. (First we check param `inputCol` and `inputCols`, and construct a col list and pass it to the transform code. it will simplify the code).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143929154
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -96,9 +99,71 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
       def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
       setDefault(handleInvalid, Bucketizer.ERROR_INVALID)
     
    +  /**
    +   * Parameter for specifying multiple splits parameters. Each element in this array can be used to
    +   * map continuous features into buckets.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val splitsArray: DoubleArrayArrayParam = new DoubleArrayArrayParam(this, "splitsArray",
    +    "The array of split points for mapping continuous features into buckets for multiple " +
    +      "columns. For each input column, with n+1 splits, there are n buckets. A bucket defined by " +
    +      "splits x,y holds values in the range [x,y) except the last bucket, which also includes y. " +
    +      "The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be " +
    +      "explicitly provided to cover all Double values; otherwise, values outside the splits " +
    +      "specified will be treated as errors.",
    +    Bucketizer.checkSplitsArray)
    +
    +  /**
    +   * Param for output column names.
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val outputCols: StringArrayParam = new StringArrayParam(this, "outputCols",
    --- End diff --
    
    Ah, I think the `final` is copied from previous multiple bucketizer trait. I'll remove it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82583/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    The issue is the in the trait `setXXX` returns `this.type` which in Java in the concrete class doesn't work, so the `setXXX` methods need to be implemented in the concrete subclass. See the decision tree algos for example of this. The `getXXX` are ok since they typically return a concrete type.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81634 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81634/testReport)** for PR 17819 at commit [`f8dedd1`](https://github.com/apache/spark/commit/f8dedd1c92a8c48358743626b99c2f2192bc09b1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81869/testReport)** for PR 17819 at commit [`7c38b77`](https://github.com/apache/spark/commit/7c38b77c30747e316328afbe25cc8bff1a51cd40).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL][WIP] Add a Bucketizer that can bi...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    About vector bucketizer, seems it might work similarly as multi-col bucketizer. But some behaviors such as `Bucketizer.SKIP_INVALID` need to address.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @viirya Yes. But if there is some better design I will be happy to listen.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @viirya can you post some performance comparisons for this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r139729964
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2102,6 +2102,53 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset by adding columns or replacing the existing columns that has
    +   * the same names.
    +   */
    +  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
    --- End diff --
    
    This is also copied to #19229. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82632/testReport)** for PR 17819 at commit [`bb19708`](https://github.com/apache/spark/commit/bb19708dcd359c434c0fac779125f949541f9b8c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82632/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Thanks @MLnick 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77924/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r142278786
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2120,6 +2120,19 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset by adding columns with metadata.
    +   */
    +  private[spark] def withColumns(
    +      colNames: Seq[String],
    +      cols: Seq[Column],
    +      metadata: Seq[Metadata]): DataFrame = {
    +    val newCols = colNames.zip(cols).zip(metadata).map { case ((colName, col), metadata) =>
    --- End diff --
    
    Yes. We should check the number of metadata too. I'll add it when adding related test.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @WeichenXu123 According to https://docs.oracle.com/javase/specs/jls/se7/html/jls-13.html#jls-13.4.4 and https://wiki.eclipse.org/Evolving_Java-based_APIs_2#Evolving_API_Classes, I think adding an interface to a class doesn't break binary compatibility.
    
    I am not sure the exact reason of implementing setter methods for params in concrete estimators/transformers. Maybe it is because some params should be immutable in models.
    
    Anyway, I will merge the inputCols stuff into existing `Bucketizer` if you prefer this way.
    
    
    
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81935/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @barrybecker4 `withColumns` API is first introduced in this PR. So you won't see it in Spark 2.1.1 or current codebase. Thanks for letting me know SPARK-12225. Yes, it is related.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143713315
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaBucketizerExample.java ---
    @@ -33,6 +33,13 @@
     import org.apache.spark.sql.types.StructType;
     // $example off$
     
    --- End diff --
    
    No Scala example?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r142172037
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2102,6 +2102,55 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset by adding columns or replacing the existing columns that has
    +   * the same names.
    +   */
    +  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
    +    require(colNames.size == cols.size,
    +      s"The size of column names: ${colNames.size} isn't equal to " +
    +        s"the size of columns: ${cols.size}")
    +    require(colNames.distinct.size == colNames.size,
    +      s"It is disallowed to use duplicate column names: $colNames")
    +
    +    val resolver = sparkSession.sessionState.analyzer.resolver
    +    val output = queryExecution.analyzed.output
    +
    +    val columnMap = colNames.zip(cols).toMap
    +
    +    val replacedAndExistingColumns = output.map { field =>
    +      val dupColumn = columnMap.find { case (colName, col) =>
    +        resolver(field.name, colName)
    +      }
    +      if (dupColumn.isDefined) {
    +        val colName = dupColumn.get._1
    +        val col = dupColumn.get._2
    +        col.as(colName)
    +      } else {
    +        Column(field)
    +      }
    +    }
    +
    +    val newColumns = columnMap.filter { case (colName, col) =>
    +      !output.exists(f => resolver(f.name, colName))
    +    }.map { case (colName, col) => col.as(colName) }
    +
    +    select(replacedAndExistingColumns ++ newColumns : _*)
    +  }
    +
    +  /**
    +   * Returns a new Dataset by adding columns with metadata.
    +   */
    +  private[spark] def withColumns(
    --- End diff --
    
    I will to add a test for this later.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MarcKaminski by the way you mentioned a vector bucketizer. I think in principal that might be useful. I'm not sure if it would make sense to add vector type support to the existing `Bucketizer` or if it would need to be separate.
    
    Perhaps you can post a link to code / design doc on this JIRA for the vector type version?  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AFractalThought <gi...@git.apache.org>.

Github user AFractalThought commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Thanks @huaxingao @MLnick @viirya this will be super helpful


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #77937 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77937/testReport)** for PR 17819 at commit [`8386d1e`](https://github.com/apache/spark/commit/8386d1ea52a5daecea24334deb8a122e46333d84).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick I've updated the previous solution.  The new API is implemented in an interface which `Bucketizer` extends now. So you can still use `Bucketizer` class. Depends on what parameters you set, it goes for single column or multiple bucketizing.
    
    Please take a look if you have time. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81634/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL][WIP] Add a Bucketizer that can bi...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76349/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81963/testReport)** for PR 17819 at commit [`60d3ba1`](https://github.com/apache/spark/commit/60d3ba1ec3c2c9d767e8f63f43aadda2de4c4e28).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add a Bucketizer that can bin mul...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick Ok. Let me prepare the comparisons.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @MLnick Conflicts resolved. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81935/testReport)** for PR 17819 at commit [`92ef9bd`](https://github.com/apache/spark/commit/92ef9bde1e048eef7e3b530286723cad5773debc).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143926137
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -24,20 +24,23 @@ import org.apache.spark.annotation.Since
     import org.apache.spark.ml.Model
     import org.apache.spark.ml.attribute.NominalAttribute
     import org.apache.spark.ml.param._
    -import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasOutputCol}
    +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol}
     import org.apache.spark.ml.util._
     import org.apache.spark.sql._
     import org.apache.spark.sql.expressions.UserDefinedFunction
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
     
     /**
    - * `Bucketizer` maps a column of continuous features to a column of feature buckets.
    + * `Bucketizer` maps a column of continuous features to a column of feature buckets. Since 2.3.0,
    + * `Bucketizer` can also map multiple columns at once. Whether it goes to map a column or multiple
    --- End diff --
    
    Ok. Looks better.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143722947
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala ---
    @@ -187,6 +188,196 @@ class BucketizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           }
         }
       }
    +
    +  test("multiple columns: Bucket continuous features, without -inf,inf") {
    +    // Check a set of valid feature values.
    +    val splits = Array(Array(-0.5, 0.0, 0.5), Array(-0.1, 0.3, 0.5))
    +    val validData1 = Array(-0.5, -0.3, 0.0, 0.2)
    +    val validData2 = Array(0.5, 0.3, 0.0, -0.1)
    +    val expectedBuckets1 = Array(0.0, 0.0, 1.0, 1.0)
    +    val expectedBuckets2 = Array(1.0, 1.0, 0.0, 0.0)
    +
    +    val data = (0 until validData1.length).map { idx =>
    +      (validData1(idx), validData2(idx), expectedBuckets1(idx), expectedBuckets2(idx))
    +    }
    +    val dataFrame: DataFrame = data.toSeq.toDF("feature1", "feature2", "expected1", "expected2")
    --- End diff --
    
    `toSeq` not required here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @gatorsmile The SQL change looks good to you? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    @WeichenXu123 I'm ok for that but I think adding an interface doesn't break binary compatibility?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #77935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77935/testReport)** for PR 17819 at commit [`08cbfac`](https://github.com/apache/spark/commit/08cbfac4b8899d30454622a64bfe89aa537a2277).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL][WIP] Add a Bucketizer that can bi...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    cc @MLnick @jkbradley for review. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer t...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17819#discussion_r143710289
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -96,9 +99,71 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
       def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
       setDefault(handleInvalid, Bucketizer.ERROR_INVALID)
     
    +  /**
    +   * Parameter for specifying multiple splits parameters. Each element in this array can be used to
    +   * map continuous features into buckets.
    +   *
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val splitsArray: DoubleArrayArrayParam = new DoubleArrayArrayParam(this, "splitsArray",
    +    "The array of split points for mapping continuous features into buckets for multiple " +
    +      "columns. For each input column, with n+1 splits, there are n buckets. A bucket defined by " +
    +      "splits x,y holds values in the range [x,y) except the last bucket, which also includes y. " +
    +      "The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be " +
    +      "explicitly provided to cover all Double values; otherwise, values outside the splits " +
    +      "specified will be treated as errors.",
    +    Bucketizer.checkSplitsArray)
    +
    +  /**
    +   * Param for output column names.
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val outputCols: StringArrayParam = new StringArrayParam(this, "outputCols",
    --- End diff --
    
    why are we making this `final` (and not others)? (also the `getOutputCols`?)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #81864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81864/testReport)** for PR 17819 at commit [`7c38b77`](https://github.com/apache/spark/commit/7c38b77c30747e316328afbe25cc8bff1a51cd40).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    **[Test build #82402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82402/testReport)** for PR 17819 at commit [`000844a`](https://github.com/apache/spark/commit/000844ab1f0dffef9b51b96f7edc1e1ab9e9e0b7).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82016/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

Posted by AFractalThought <gi...@git.apache.org>.

Github user AFractalThought commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Does this extension exist for QuantileDiscretizer as well?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org