You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by zhengruifeng <gi...@git.apache.org> on 2017/08/10 06:07:55 UTC

[GitHub] spark pull request #18902: [SPARK-21690][ML] one-pass imputer

GitHub user zhengruifeng opened a pull request:

    https://github.com/apache/spark/pull/18902

    [SPARK-21690][ML] one-pass imputer

    ## What changes were proposed in this pull request?
    parallelize the computation of all columns
    
    ## How was this patch tested?
    existing tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark parallelize_imputer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18902.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18902
    
----
commit c4042ad23bcf94b758ee8c345c15fc85037cbdb9
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2017-06-07T04:26:41Z

    create pr

commit 4c35bda0e073084a608df8e8bc28c4dae5a1fc5b
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2017-06-07T04:30:26Z

    handle missing

commit 9a6ac59d5191a57a9b0b671414a2dbac1a3c3b3d
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2017-06-07T05:33:30Z

    use summary

commit f6f166fef4e17db7e36ccecf41aebe3443e9fef5
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2017-06-07T06:22:58Z

    x

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Jenkis, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80675 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80675/testReport)** for PR 18902 at commit [`5921f51`](https://github.com/apache/spark/commit/5921f514390420c331929c91fe7c8e89b708f7db).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @yanboliang RDD-based impl the (former commit)[https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465] 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80666/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80653/testReport)** for PR 18902 at commit [`8283411`](https://github.com/apache/spark/commit/82834117d7e587e335a599f8d5153e751b524862).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18902: [SPARK-21690][ML] one-pass imputer

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18902


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged into master. Thanks for all.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80663/testReport)** for PR 18902 at commit [`fd1eb43`](https://github.com/apache/spark/commit/fd1eb43d26bb08806ec2deefd86014caf7dcefdd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80478/testReport)** for PR 18902 at commit [`660c2db`](https://github.com/apache/spark/commit/660c2dbc3e800a8f8fe4bc1b36a72ccdc37a778e).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80666/testReport)** for PR 18902 at commit [`2cca623`](https://github.com/apache/spark/commit/2cca623702e599d4b96ab093dfad22d228cfb6d9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80477/testReport)** for PR 18902 at commit [`f6f166f`](https://github.com/apache/spark/commit/f6f166fef4e17db7e36ccecf41aebe3443e9fef5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80477/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80479/testReport)** for PR 18902 at commit [`660c2db`](https://github.com/apache/spark/commit/660c2dbc3e800a8f8fe4bc1b36a72ccdc37a778e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @MLnick Thanks for pinging me.
    
    I go through this quickly. The basic idea is the same, performing the operations on multiple inputs columns at one single Dataset/DataFrame operation.
    
    Unlike `Bucketizer`, `Imputer` has no compatibility concern because it already supports multiple input columns (`HasInputCols`). In `Bucketizer`, we don't want to break its current API so it makes thing more complicated a bit.
    
    Actually I'm noticed by `ImputerModel` which also applies `withColumn` sequentially on each input column. I'd like to address this part with the `withColumns` API proposed in #17819. What do you think @MLnick?
    
    
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18902: [SPARK-21690][ML] one-pass imputer

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18902#discussion_r132939361
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
    @@ -133,23 +134,29 @@ class Imputer @Since("2.2.0") (@Since("2.2.0") override val uid: String)
       override def fit(dataset: Dataset[_]): ImputerModel = {
         transformSchema(dataset.schema, logging = true)
         val spark = dataset.sparkSession
    -    import spark.implicits._
    -    val surrogates = $(inputCols).map { inputCol =>
    -      val ic = col(inputCol)
    -      val filtered = dataset.select(ic.cast(DoubleType))
    -        .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
    -      if(filtered.take(1).length == 0) {
    -        throw new SparkException(s"surrogate cannot be computed. " +
    -          s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
    -      }
    -      val surrogate = $(strategy) match {
    -        case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
    -        case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
    -      }
    -      surrogate
    +
    +    val selected = dataset.select($(inputCols).map(col(_).cast("double")): _*).rdd
    +
    +    val summarizer = $(strategy) match {
    +      case Imputer.mean =>
    +        new Imputer.MeanSummarizer($(inputCols).length, $(missingValue))
    +      case Imputer.median =>
    +        new Imputer.MedianSummarizer($(inputCols).length, $(missingValue))
    +    }
    +
    +    val summary = selected.treeAggregate(summarizer)(
    +      seqOp = { case (sum, row) => sum.update(row) },
    +      combOp = { case (sum1, sum2) => sum1.merge(sum2) }
    +    )
    +
    +    val emptyCols = ($(inputCols) zip summary.counts).filter(_._2 == 0).map(_._1)
    +    if(emptyCols.nonEmpty) {
    --- End diff --
    
    Style: space between `if` and `(`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80479/testReport)** for PR 18902 at commit [`660c2db`](https://github.com/apache/spark/commit/660c2dbc3e800a8f8fe4bc1b36a72ccdc37a778e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @yanboliang Although dispointed by DF's performance, I also approve the choice of DF just for less code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80780/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80653/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18902: [SPARK-21690][ML] one-pass imputer

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18902#discussion_r133647271
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
    @@ -133,23 +133,45 @@ class Imputer @Since("2.2.0") (@Since("2.2.0") override val uid: String)
       override def fit(dataset: Dataset[_]): ImputerModel = {
         transformSchema(dataset.schema, logging = true)
         val spark = dataset.sparkSession
    -    import spark.implicits._
    -    val surrogates = $(inputCols).map { inputCol =>
    -      val ic = col(inputCol)
    -      val filtered = dataset.select(ic.cast(DoubleType))
    -        .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
    -      if(filtered.take(1).length == 0) {
    -        throw new SparkException(s"surrogate cannot be computed. " +
    -          s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
    -      }
    -      val surrogate = $(strategy) match {
    -        case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
    -        case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
    -      }
    -      surrogate
    +
    +    val cols = $(inputCols).map { inputCol =>
    +      when(col(inputCol).equalTo($(missingValue)), null)
    +        .when(col(inputCol).isNaN, null)
    +        .otherwise(col(inputCol))
    +        .cast("double")
    +        .as(inputCol)
    +    }
    +
    +    val results = $(strategy) match {
    +      case Imputer.mean =>
    +        val row = dataset.select(cols.map(avg): _*).head()
    +        Array.range(0, $(inputCols).length).map { i =>
    +          if (row.isNullAt(i)) {
    +            Double.NaN
    +          } else {
    +            row.getDouble(i)
    +          }
    +        }
    +
    +      case Imputer.median =>
    +        dataset.select(cols: _*).stat.approxQuantile($(inputCols), Array(0.5), 0.001)
    --- End diff --
    
    Add API annotation to clarify that the relative error of _median_ is 0.001 if _strategy == median_.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    I test on dataframes containing `null`, both `avg` and `stat.approxQuantile` will ignore `null`. And if one column only contain `null`, `null` and `Array.empty[Double]` will be returned respectively.
    Agree that we add more tests for this dependency.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Hi @zhengruifeng Thanks for the idea and implementation. Definitely something worth exploring.
     
    As I understand, the new implementation improves the locality yet it leverages RDD API instead of Dataset API. Since overall this targets a performance improvement, I'd be interested to see the performance comparison. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80780/testReport)** for PR 18902 at commit [`495d701`](https://github.com/apache/spark/commit/495d70127b31f111362a9774da0eefba2b657e63).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18902: [SPARK-21690][ML] one-pass imputer

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18902#discussion_r133649640
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
    @@ -133,23 +133,45 @@ class Imputer @Since("2.2.0") (@Since("2.2.0") override val uid: String)
       override def fit(dataset: Dataset[_]): ImputerModel = {
         transformSchema(dataset.schema, logging = true)
         val spark = dataset.sparkSession
    -    import spark.implicits._
    -    val surrogates = $(inputCols).map { inputCol =>
    -      val ic = col(inputCol)
    -      val filtered = dataset.select(ic.cast(DoubleType))
    -        .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
    -      if(filtered.take(1).length == 0) {
    -        throw new SparkException(s"surrogate cannot be computed. " +
    -          s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
    -      }
    -      val surrogate = $(strategy) match {
    -        case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
    -        case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
    -      }
    -      surrogate
    +
    +    val cols = $(inputCols).map { inputCol =>
    +      when(col(inputCol).equalTo($(missingValue)), null)
    +        .when(col(inputCol).isNaN, null)
    +        .otherwise(col(inputCol))
    +        .cast("double")
    +        .as(inputCol)
    +    }
    +
    +    val results = $(strategy) match {
    +      case Imputer.mean =>
    +        val row = dataset.select(cols.map(avg): _*).head()
    --- End diff --
    
    Add annotation here to clarify _avg_ function will ignore _null_ automatically.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Any more comments on this PR? It have been about one month since the last modification.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    +1 for using Dataframe-based version code.
    
    @zhengruifeng One thing I want to confirm is that, I check your testing code, both RDD-based version and Dataframe-based version code will both cost on deserialization:
    ```
    ...
    val df = spark.createDataFrame(rows, struct)
    df.persist()
    df.count()
    ...
    // do `imputer.fit`
    ```
    when running `imputer.fit`, it will extract the required columns from the cached input dataframe, and then you compare the perf between `RDD.aggregate` and `dataframe avg`, they both need to deserialize data from input and then do computation, and `dataframe avg` will take advantage of codegen and should be faster. But here the test show that RDD version is slower than Dataframe version, it is not very reasonable, so I want to confirm:
    in your RDD version testing, do you cache again when get `RDD` from the input `Dataframe`?
    If not, your testing has no problem, I will guess there exists other performance issue in SQL layer and cc @cloud-fan to take a look.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80666/testReport)** for PR 18902 at commit [`2cca623`](https://github.com/apache/spark/commit/2cca623702e599d4b96ab093dfad22d228cfb6d9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Sure. I will create JIRA after this perf gap is confirmed. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @hhbyyh @zhengruifeng I'm ok with the _convert to null_ method, I think there is no extra pass for data if we handle it with this way, and the DataFrame/RDD functions to compute _mean/median_ will ignore _null_ . Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18902: [SPARK-21690][ML] one-pass imputer

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18902#discussion_r132939323
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
    @@ -133,23 +134,29 @@ class Imputer @Since("2.2.0") (@Since("2.2.0") override val uid: String)
       override def fit(dataset: Dataset[_]): ImputerModel = {
         transformSchema(dataset.schema, logging = true)
         val spark = dataset.sparkSession
    -    import spark.implicits._
    -    val surrogates = $(inputCols).map { inputCol =>
    -      val ic = col(inputCol)
    -      val filtered = dataset.select(ic.cast(DoubleType))
    -        .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
    -      if(filtered.take(1).length == 0) {
    -        throw new SparkException(s"surrogate cannot be computed. " +
    -          s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
    -      }
    -      val surrogate = $(strategy) match {
    -        case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
    -        case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
    -      }
    -      surrogate
    +
    +    val selected = dataset.select($(inputCols).map(col(_).cast("double")): _*).rdd
    +
    +    val summarizer = $(strategy) match {
    +      case Imputer.mean =>
    +        new Imputer.MeanSummarizer($(inputCols).length, $(missingValue))
    +      case Imputer.median =>
    +        new Imputer.MedianSummarizer($(inputCols).length, $(missingValue))
    +    }
    +
    +    val summary = selected.treeAggregate(summarizer)(
    +      seqOp = { case (sum, row) => sum.update(row) },
    +      combOp = { case (sum1, sum2) => sum1.merge(sum2) }
    +    )
    +
    +    val emptyCols = ($(inputCols) zip summary.counts).filter(_._2 == 0).map(_._1)
    --- End diff --
    
    Style - use dot notation here not infix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18902: [SPARK-21690][ML] one-pass imputer

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18902#discussion_r133649896
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
    @@ -133,23 +133,45 @@ class Imputer @Since("2.2.0") (@Since("2.2.0") override val uid: String)
       override def fit(dataset: Dataset[_]): ImputerModel = {
         transformSchema(dataset.schema, logging = true)
         val spark = dataset.sparkSession
    -    import spark.implicits._
    -    val surrogates = $(inputCols).map { inputCol =>
    -      val ic = col(inputCol)
    -      val filtered = dataset.select(ic.cast(DoubleType))
    -        .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
    -      if(filtered.take(1).length == 0) {
    -        throw new SparkException(s"surrogate cannot be computed. " +
    -          s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
    -      }
    -      val surrogate = $(strategy) match {
    -        case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
    -        case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
    -      }
    -      surrogate
    +
    +    val cols = $(inputCols).map { inputCol =>
    +      when(col(inputCol).equalTo($(missingValue)), null)
    +        .when(col(inputCol).isNaN, null)
    +        .otherwise(col(inputCol))
    +        .cast("double")
    +        .as(inputCol)
    +    }
    +
    +    val results = $(strategy) match {
    +      case Imputer.mean =>
    +        val row = dataset.select(cols.map(avg): _*).head()
    +        Array.range(0, $(inputCols).length).map { i =>
    +          if (row.isNullAt(i)) {
    +            Double.NaN
    +          } else {
    +            row.getDouble(i)
    +          }
    +        }
    +
    +      case Imputer.median =>
    +        dataset.select(cols: _*).stat.approxQuantile($(inputCols), Array(0.5), 0.001)
    --- End diff --
    
    BTW, add annotation here to clarify _approxQuantile_ function will ignore _null_ automatically.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80663/testReport)** for PR 18902 at commit [`fd1eb43`](https://github.com/apache/spark/commit/fd1eb43d26bb08806ec2deefd86014caf7dcefdd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @hhbyyh Good Idea! We can also use this trick to compute median, because method `multipleApproxQuantiles`[https://github.com/apache/spark/blob/0e80ecae300f3e2033419b2d98da8bf092c105bb/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L65]  can handle both `null` and `NaN`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80653 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80653/testReport)** for PR 18902 at commit [`8283411`](https://github.com/apache/spark/commit/82834117d7e587e335a599f8d5153e751b524862).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    hmm... that's interesting. So I found performance gap between dataframe codegen aggregation and the simple RDD aggregation. I will discuss with SQL team for this later. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @hhbyyh Yes, I will test the performance. 
    Btw, the median computation by call `stat.approxQuantile` will also transform df into rdd before aggregation. see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L102


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80780 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80780/testReport)** for PR 18902 at commit [`495d701`](https://github.com/apache/spark/commit/495d70127b31f111362a9774da0eefba2b657e63).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80660/testReport)** for PR 18902 at commit [`fd1eb43`](https://github.com/apache/spark/commit/fd1eb43d26bb08806ec2deefd86014caf7dcefdd).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Thanks for the quick update. The implementation may be improved on some details. But first I'd want to confirm the "convert to null" method does not have any defect. 
    @MLnick @srowen @yanboliang 
    
    And we may need more unit tests to constantly monitor  the SQL behavior (avg and stat) on null. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80479/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80667/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80478/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80667/testReport)** for PR 18902 at commit [`df7a0a3`](https://github.com/apache/spark/commit/df7a0a33002e9341c15820d6856b436df3e5ede2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80477/testReport)** for PR 18902 at commit [`f6f166f`](https://github.com/apache/spark/commit/f6f166fef4e17db7e36ccecf41aebe3443e9fef5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    cc @viirya on the multt-column generation issue - could be similar general solution to #17819?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80478/testReport)** for PR 18902 at commit [`660c2db`](https://github.com/apache/spark/commit/660c2dbc3e800a8f8fe4bc1b36a72ccdc37a778e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Eh, I meant that it may be possible to get the mean values purely using DataFrame API. (convert missingValue/NaN to null) in one pass, so we may need to check the performance comparison. But I guess it looks a little hack.
    
    For the median value, it may be harder so we can use the RDD API. (to be confirmed).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80663/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @zhengruifeng  Could you verify & compare the performance of this new DF-based approach vs your original RDD-based one?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    I test the performance on a small data, the value in the following table is the average duration in seconds:
    
    |numColums| Old Mean | Old Median | New Mean | New Median |
    |------|----------|------------|----------|------------|
    |1|0.0771394713|0.0658712813|0.080779802|0.048165981499999996|
    |10|0.7234340630999999|0.5954440414|0.0867935197|0.13263428659999998|
    |100|7.3756451568|6.2196631259|0.1911931552|0.8625376817000001|
    
    We can see that, even on a small data, the speedup is significant.
    On big dataset that do not fit in memory, we should obtain better speedup.
    
    and the test code is here:
    
    ```
    import org.apache.spark.ml.feature._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import spark.implicits._
    import scala.util.Random
    
    val seed = 123l
    val random = new Random(seed)
    val n = 10000
    val m = 100
    val rows = sc.parallelize(1 to n).map(i=> Row(Array.fill(m)(random.nextDouble): _*))
    val struct = new StructType(Array.range(0,m,1).map(i => StructField(s"c$i",DoubleType,true)))
    val df = spark.createDataFrame(rows, struct)
    df.persist()
    df.count()
    
    for (strategy <- Seq("mean", "median"); k <- Seq(1,10,100)) {
    val imputer = new Imputer().setStrategy(strategy).setInputCols(Array.range(0,k,1).map(i=>s"c$i")).setOutputCols(Array.range(0,k,1).map(i=>s"o$i"))
    var duration = 0.0
    for (i<- 0 until 10) {
    val start = System.nanoTime()
    imputer.fit(df)
    val end = System.nanoTime()
    duration += (end - start) / 1e9
    }
    println((strategy, k, duration/10))
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80660/testReport)** for PR 18902 at commit [`fd1eb43`](https://github.com/apache/spark/commit/fd1eb43d26bb08806ec2deefd86014caf7dcefdd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @zhengruifeng DataFrame-based operation is 2~3x slower than RDD-based operation is a known issue, because of the deserialization cost. If we switch to RDD-based method, we need to implement our own aggregator to calculate _mean_ and _median_, this need much more code than calling DataFrame API. BTW, DF using more compact structure that can reduce memory footprint.
    From my perspective, I'd suggest to keep the current DF-based solution. As it will 5~10 faster than the original implementation. @hhbyyh @MLnick What do you think about it? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Seems fine to me to use the DF version even though it's slower. But we should open a JIRA issue to track where the gap is on the SQL side of things and try to improve the performance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @hhbyyh  I rewrite the impl, and now all `NaN` and `missingValue` will be transform to `null` at first, then current methods are used.
    For columns only containing `null`,  `null` is returned for `avg(col)`, and `Array.empty[Double]` is returned for `median`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80667 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80667/testReport)** for PR 18902 at commit [`df7a0a3`](https://github.com/apache/spark/commit/df7a0a33002e9341c15820d6856b436df3e5ede2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    **[Test build #80675 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80675/testReport)** for PR 18902 at commit [`5921f51`](https://github.com/apache/spark/commit/5921f514390420c331929c91fe7c8e89b708f7db).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80675/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @zhengruifeng What _the RDD-based one_ means? It's the code on master or the code in your former commit? Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80660/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @WeichenXu123 No, I only cache the DataFrame. And the RDD-Version is [here](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465). 
    I use the same testsuit above to test those impls.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    @MLnick  @yanboliang  I update the performance comparison. 
    The DF-based impl is a little slower than the RDD-based one when num of column is small. 
    When num of column is large (100), DF-based impl is about 2~3 X slower than RDD-based one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18902
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org