Posted to reviews@spark.apache.org by crackcell <gi...@git.apache.org> on 2017/03/01 15:26:04 UTC

[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

GitHub user crackcell opened a pull request:

    https://github.com/apache/spark/pull/17123

    [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucketizer when handleInvalid is on

    ## What changes were proposed in this pull request?
    
    The original Bucketizer can put NaNs into a special bucket when handleInvalid is on, but it leaves NULLs untouched.
    This PR unifies the handling of NULLs and NaNs.
    
    BTW, this is my first contribution to the Spark codebase, so I'm not sure whether my code or approach is appropriate. Please point out anything I'm doing wrong. :-)
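
    For illustration, a minimal sketch of the intended behavior (the column name, splits, and data below are made up for this example and are not taken from the patch; assumes a spark-shell session where `spark` is the SparkSession):

    ```scala
    import org.apache.spark.ml.feature.Bucketizer
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    // One nullable double column containing a normal value, a NaN and a NULL.
    val schema = StructType(Seq(StructField("features", DoubleType, nullable = true)))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(0.3), Row(Double.NaN), Row(null))), schema)

    val bucketizer = new Bucketizer()
      .setInputCol("features")
      .setOutputCol("buckets")
      .setSplits(Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity))
      .setHandleInvalid("keep")   // NaN (and, with this PR, NULL) go to an extra bucket

    // With 3 regular buckets (indices 0-2), both NaN and NULL get bucket index 3.
    bucketizer.transform(df).show()
    ```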
    
    ## How was this patch tested?
    
    Manual tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/crackcell/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17123
    
----
commit 2b0751428cad5280df47f96412608967b71a7360
Author: Menglong TAN <ta...@gmail.com>
Date:   2017-03-01T12:37:33Z

    add support for null values in Bucketizer

commit b3f98b66e63c9c61c69a1429819feb236fad56c7
Author: Menglong TAN <ta...@gmail.com>
Date:   2017-03-01T15:10:05Z

    fix a typo

----


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r162874801
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -53,7 +53,8 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
        * Values at -inf, inf must be explicitly provided to cover all Double values;
        * otherwise, values outside the splits specified will be treated as errors.
        *
    -   * See also [[handleInvalid]], which can optionally create an additional bucket for NaN values.
    +   * See also [[handleInvalid]], which can optionally create an additional bucket for NaN/NULL
    --- End diff --
    
    This sounds like a behavior change; we should add an item to the migration guide in the ML docs.


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r162703711
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -171,23 +176,23 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
        * Binary searching in several buckets to place each data point.
        * @param splits array of split points
        * @param feature data point
    -   * @param keepInvalid NaN flag.
    -   *                    Set "true" to make an extra bucket for NaN values;
    -   *                    Set "false" to report an error for NaN values
    +   * @param keepInvalid NaN/NULL flag.
    +   *                    Set "true" to make an extra bucket for NaN/NULL values;
    +   *                    Set "false" to report an error for NaN/NULL values
        * @return bucket for each data point
        * @throws SparkException if a feature is < splits.head or > splits.last
        */
     
       private[feature] def binarySearchForBuckets(
           splits: Array[Double],
    -      feature: Double,
    +      feature: java.lang.Double,
    --- End diff --
    
    Also change to `Option[Double]` here.


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r162703633
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    --- End diff --
    
    As @cloud-fan suggested, `Option[Double]` is better. :-)


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r104362536
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    --- End diff --
    
    actually, can we just use `java.lang.Double` as the type for `feature`? Then we don't need to change https://github.com/apache/spark/pull/17123/files#diff-37f2c93b88c73b91cdc9e40fc8c45fc5R121
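
    A sketch of that suggestion, reusing `$(splits)` and `keepInvalid` from the surrounding diff (this is an illustration, not the PR's code); a UDF over the boxed type receives a NULL cell as a `null` reference:

    ```scala
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.expressions.UserDefinedFunction

    // Sketch: the boxed parameter type lets the UDF see NULL inputs as null.
    val bucketizer: UserDefinedFunction = udf { (feature: java.lang.Double) =>
      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
    }
    ```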


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r103949549
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -171,23 +176,23 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
        * Binary searching in several buckets to place each data point.
        * @param splits array of split points
        * @param feature data point
    -   * @param keepInvalid NaN flag.
    -   *                    Set "true" to make an extra bucket for NaN values;
    -   *                    Set "false" to report an error for NaN values
    +   * @param keepInvalid NaN/NULL flag.
    +   *                    Set "true" to make an extra bucket for NaN/NULL values;
    +   *                    Set "false" to report an error for NaN/NULL values
        * @return bucket for each data point
        * @throws SparkException if a feature is < splits.head or > splits.last
        */
     
       private[feature] def binarySearchForBuckets(
           splits: Array[Double],
    -      feature: Double,
    +      feature: java.lang.Double,
    --- End diff --
    
    Use Scala `Double` here as well.


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    Fixed the style errors found during the unit tests.


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    @crackcell thank you for the nice fix. I've added a few comments. Please also add one or more test cases for the change.


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r106339731
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    +    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
    --- End diff --
    
    I believe you should avoid using a UDF on a `Row`, because the serialization costs are higher... hmm, how could we make this perform well and still handle nulls? Does it work with `Option[Double]` instead of `Row`?
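
    For reference, the `Option[Double]` version being asked about could look roughly like this; it is a sketch of the hunk above rather than the code in the PR, and it assumes `binarySearchForBuckets` is changed to accept an `Option[Double]` while reusing `$(splits)`, `$(inputCol)`, `$(outputCol)`, `filteredDataset` and `keepInvalid` from the surrounding diff:

    ```scala
    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.expressions.UserDefinedFunction

    // Sketch: keep the UDF on a single column; Spark surfaces a NULL cell as None.
    val bucketizer: UserDefinedFunction = udf { (feature: Option[Double]) =>
      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
    }
    val bucketed = filteredDataset.withColumn($(outputCol), bucketizer(col($(inputCol))))
    ```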


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    @imatiach-msft @cloud-fan I updated the code and replaced `java.lang.Double` with `isNullAt()` and `getDouble()`.


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    @cloud-fan Would you please review my code again? I'm now using `Option` to handle NULLs. :-)


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r104434899
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    --- End diff --
    
    Mixing Java and Scala types seems less graceful. Would it be better to pass a `Row` to `bucketizer()` and then check for NULLs with `isNullAt()` and `getDouble()`?


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    @srowen @cloud-fan Please review my code. Thanks. :-)


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r106363022
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    +    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
    --- End diff --
    
    Thanks for pointing out the performance problem. Maybe my original approach of using `java.lang.Double` instead of Scala's `Double` to hold NULLs would perform better.


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    **[Test build #3590 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3590/testReport)** for PR 17123 at commit [`b3f98b6`](https://github.com/apache/spark/commit/b3f98b66e63c9c61c69a1429819feb236fad56c7).


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    **[Test build #3590 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3590/testReport)** for PR 17123 at commit [`b3f98b6`](https://github.com/apache/spark/commit/b3f98b66e63c9c61c69a1429819feb236fad56c7).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r104224870
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    -      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
    +    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
    +      Bucketizer.binarySearchForBuckets(
    +        $(splits),
    +        row.getAs[java.lang.Double]($(inputCol)),
    --- End diff --
    
    Ideally we should use `row.getDouble(index)` and `row.isNullAt(index)` together to get values of primitive types, but technically `Row` is just an `Array[Object]`, so there is no performance penalty in using `java.lang.Double`. (This may change in the future; if possible we should prefer `isNullAt` and `getDouble`.)
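
    To make the two access patterns concrete (assuming, for illustration, that the feature sits at index 0 of the `Row`):

    ```scala
    // Preferred: check for null explicitly, then read the primitive value.
    val feature: Option[Double] =
      if (row.isNullAt(0)) None else Some(row.getDouble(0))

    // Also works today, since Row is backed by Array[Object], but relies on boxing:
    val boxed: java.lang.Double = row.getAs[java.lang.Double](0)
    ```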


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r104572696
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    --- End diff --
    
    Thanks a lot. `Option[Double]` is much better. :-)


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    @crackcell I'm not sure about changing the UDF to operate on a row instead of a column; I've found that the serialization costs are much higher and the Spark code performs much worse. Maybe an expert like @cloud-fan can comment more here? Can you keep the UDF on a column instead of a row?


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r103949478
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -171,23 +176,23 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
        * Binary searching in several buckets to place each data point.
        * @param splits array of split points
        * @param feature data point
    -   * @param keepInvalid NaN flag.
    -   *                    Set "true" to make an extra bucket for NaN values;
    -   *                    Set "false" to report an error for NaN values
    +   * @param keepInvalid NaN/NULL flag.
    +   *                    Set "true" to make an extra bucket for NaN/NULL values;
    +   *                    Set "false" to report an error for NaN/NULL values
        * @return bucket for each data point
        * @throws SparkException if a feature is < splits.head or > splits.last
        */
     
       private[feature] def binarySearchForBuckets(
           splits: Array[Double],
    -      feature: Double,
    +      feature: java.lang.Double,
           keepInvalid: Boolean): Double = {
    -    if (feature.isNaN) {
    +    if (feature == null || feature.isNaN) {
           if (keepInvalid) {
             splits.length - 1
           } else {
    -        throw new SparkException("Bucketizer encountered NaN value. To handle or skip NaNs," +
    -          " try setting Bucketizer.handleInvalid.")
    +        throw new SparkException("Bucketizer encountered NaN/NULL values. " +
    +          "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
           }
         } else if (feature == splits.last) {
    --- End diff --
    
    Could you please add some tests to validate that the bucketizer can now handle NULL values in addition to NaN values?


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r104524886
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    --- End diff --
    
    See the documentation of `ScalaUDF`; if you don't like mixing Java and Scala types, you can use `Option[Double]`.


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    @WeichenXu123 sorry for missing the message for two days; I'm working on it.


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r103954857
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    -      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
    +    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
    +      Bucketizer.binarySearchForBuckets(
    +        $(splits),
    +        row.getAs[java.lang.Double]($(inputCol)),
    --- End diff --
    
    Hi, Scala's Double will convert null to zero. For example:
    > scala> val a: Double = null.asInstanceOf[Double]
    > a: Double = 0.0
    So I use Java's `Double` instead to hold NULLs. It feels a little ugly; is there a better way?
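
    For comparison, the boxed type (or an `Option` built from it) does preserve the NULL:
    > scala> val b: java.lang.Double = null
    > b: java.lang.Double = null
    > scala> val c: Option[Double] = Option(b).map(_.doubleValue)
    > c: Option[Double] = None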


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r162935486
  
    --- Diff: docs/ml-guide.md ---
    @@ -122,6 +122,8 @@ There are no deprecations.
     * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
      We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial), in 2.2 and earlier version,
      the `OneVsRest` parallelism would be parallelism of the default threadpool in scala.
    +* [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
    + `Bucketizer` handles NULL values the same way as NaN when handleInvalid is skip or keep.
    --- End diff --
    
    Yep, you are right. :-p


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r162885968
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -53,7 +53,8 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
        * Values at -inf, inf must be explicitly provided to cover all Double values;
        * otherwise, values outside the splits specified will be treated as errors.
        *
    -   * See also [[handleInvalid]], which can optionally create an additional bucket for NaN values.
    +   * See also [[handleInvalid]], which can optionally create an additional bucket for NaN/NULL
    --- End diff --
    
    @viirya done.


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r103949274
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
         transformSchema(dataset.schema)
         val (filteredDataset, keepInvalid) = {
           if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
    -        // "skip" NaN option is set, will filter out NaN values in the dataset
    +        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
             (dataset.na.drop().toDF(), false)
           } else {
             (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
           }
         }
     
    -    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    -      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
    +    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
    +      Bucketizer.binarySearchForBuckets(
    +        $(splits),
    +        row.getAs[java.lang.Double]($(inputCol)),
    --- End diff --
    
    Can you use `Double` instead of `java.lang.Double`? It should be the Scala `Double` type.


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r103955065
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -171,23 +176,23 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
        * Binary searching in several buckets to place each data point.
        * @param splits array of split points
        * @param feature data point
    -   * @param keepInvalid NaN flag.
    -   *                    Set "true" to make an extra bucket for NaN values;
    -   *                    Set "false" to report an error for NaN values
    +   * @param keepInvalid NaN/NULL flag.
    +   *                    Set "true" to make an extra bucket for NaN/NULL values;
    +   *                    Set "false" to report an error for NaN/NULL values
        * @return bucket for each data point
        * @throws SparkException if a feature is < splits.head or > splits.last
        */
     
       private[feature] def binarySearchForBuckets(
           splits: Array[Double],
    -      feature: Double,
    +      feature: java.lang.Double,
           keepInvalid: Boolean): Double = {
    -    if (feature.isNaN) {
    +    if (feature == null || feature.isNaN) {
           if (keepInvalid) {
             splits.length - 1
           } else {
    -        throw new SparkException("Bucketizer encountered NaN value. To handle or skip NaNs," +
    -          " try setting Bucketizer.handleInvalid.")
    +        throw new SparkException("Bucketizer encountered NaN/NULL values. " +
    +          "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
           }
         } else if (feature == splits.last) {
    --- End diff --
    
    My fault! I'll do it now!


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by imatiach-msft <gi...@git.apache.org>.
Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r106339981
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -171,34 +173,34 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
        * Binary searching in several buckets to place each data point.
        * @param splits array of split points
        * @param feature data point
    -   * @param keepInvalid NaN flag.
    -   *                    Set "true" to make an extra bucket for NaN values;
    -   *                    Set "false" to report an error for NaN values
    +   * @param keepInvalid NaN/NULL flag.
    +   *                    Set "true" to make an extra bucket for NaN/NULL values;
    +   *                    Set "false" to report an error for NaN/NULL values
        * @return bucket for each data point
        * @throws SparkException if a feature is < splits.head or > splits.last
        */
     
       private[feature] def binarySearchForBuckets(
           splits: Array[Double],
    -      feature: Double,
    +      feature: Option[Double],
           keepInvalid: Boolean): Double = {
    -    if (feature.isNaN) {
    +    if (feature.getOrElse(Double.NaN).isNaN) {
    --- End diff --
    
    I think you can equivalently write this as:
    if (feature.isEmpty) { ....
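
    As an aside, `feature.isEmpty` on its own only covers the None (NULL) case; a single check covering both a missing value and a present-but-NaN value could be sketched like this (illustration only, not the PR's code):

    ```scala
    // forall is true for None and for Some(Double.NaN), matching
    // feature.getOrElse(Double.NaN).isNaN in the hunk above.
    def isInvalid(feature: Option[Double]): Boolean = feature.forall(_.isNaN)

    assert(isInvalid(None))
    assert(isInvalid(Some(Double.NaN)))
    assert(!isInvalid(Some(1.0)))
    ```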


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    But please resolve conflicts first. :) Bucketizer added multiple-column support, so the code is different now.


---


[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17123#discussion_r162910933
  
    --- Diff: docs/ml-guide.md ---
    @@ -122,6 +122,8 @@ There are no deprecations.
     * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
      We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial), in 2.2 and earlier version,
      the `OneVsRest` parallelism would be parallelism of the default threadpool in scala.
    +* [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
    + `Bucketizer` handles NULL values the same way as NaN when handleInvalid is skip or keep.
    --- End diff --
    
    Hmm, I think for skip, `dataset.na.drop` already drops NULLs, so we didn't change that behavior.
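
    A quick illustration of that point (column name and values are made up; assumes a spark-shell session where `spark` is the SparkSession):

    ```scala
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    val schema = StructType(Seq(StructField("features", DoubleType, nullable = true)))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1.0), Row(Double.NaN), Row(null))), schema)

    // na.drop() removes rows containing null or NaN values, so the "skip"
    // path already filtered NULLs before this change.
    df.na.drop().show()   // only the 1.0 row remains
    ```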


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    @WeichenXu123 I have finished my work; please review it. Any suggestions are welcome. :-)


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by crackcell <gi...@git.apache.org>.
Github user crackcell commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    @imatiach-msft Hi, Ilya. I have added two tests based on the original tests for NaN data. Please review my code again. Thanks for your time. :-)


---


[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17123
  
    cc @WeichenXu123 


---