You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by oliverpierson <gi...@git.apache.org> on 2016/02/23 05:05:03 UTC

[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

GitHub user oliverpierson opened a pull request:

    https://github.com/apache/spark/pull/11319

    [SPARK-13444] [MLlib] QuantileDiscretizer chooses bad splits on large DataFrames

    ## What changes were proposed in this pull request?
    
    Change line 113 of QuantileDiscretizer.scala to
    
    `val requiredSamples = math.max(numBins * numBins, 10000.0)`
    
    so that `requiredSamples` is a `Double`.  This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count`
    
    ## How was the this patch tested?
    Manual tests.  I was having a problems using QuantileDiscretizer with my a dataset and after making this change QuantileDiscretizer behaves as expected.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/oliverpierson/spark SPARK-13444

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11319.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11319
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11319#discussion_r53852686
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
    @@ -103,6 +103,13 @@ final class QuantileDiscretizer(override val uid: String)
     
     @Since("1.6.0")
     object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] with Logging {
    +
    +  /**
    +   * Minimum number of samples required for finding splits, regardless of number of bins.  If
    +   * the dataset has less rows than this value, the entire dataset column will be used.
    +   */
    +  val minSamplesRequired: Int = 10000
    --- End diff --
    
    The test will require creating a dataset at least as large as `minSamplesRequired`, so making its value excessively large could slow down testing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188777292
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51967/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187729848
  
    Yeah, I think I can put a test together.  Also, the value `10000` in line 113:
    
    `val requiredSamples = math.max(numBins * numBins, 10000)`
    
    is hard coded.  Should this be changed to a class attribute?  I ask because I test would required a dataset with dataset.count > 10000.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188778298
  
    @srowen Let me know if you'd like any further changes.  Also, thanks for the review and your help.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187666179
  
    **[Test build #51764 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51764/consoleFull)** for PR 11319 at commit [`635fb4e`](https://github.com/apache/spark/commit/635fb4e78433e8760150d775d41b6af9b3cba976).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-189118248
  
    I didn't revert the changes in master. I have to revert the changes in 1.6 because `setSeed` is not available in branch-1.6. Usually we only backport bug fixes instead of features. So you should open a PR against 1.6.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11319


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188784819
  
    Merged to master/1.6


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188988644
  
    @mengxr No problem.  I'll do it later today when I get to my laptop.  
    
    I thought since random sampling of data was used to find splits, then `setSeed` would be good for ensuring consistent tests.  However, it's not necessary so happy to remove it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11319#discussion_r53762980
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
    @@ -110,7 +110,7 @@ object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] wi
         val totalSamples = dataset.count()
         require(totalSamples > 0,
           "QuantileDiscretizer requires non-empty input dataset but was given an empty input.")
    -    val requiredSamples = math.max(numBins * numBins, 10000)
    +    val requiredSamples = math.max(numBins * numBins, 10000.0)
    --- End diff --
    
    I'd very slightly prefer to cast this to a double in the following line. This makes `requiredSamples` a double itself, which is slightly wrong.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11319#discussion_r53937600
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
    @@ -103,6 +103,13 @@ final class QuantileDiscretizer(override val uid: String)
     
     @Since("1.6.0")
     object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] with Logging {
    +
    +  /**
    +   * Minimum number of samples required for finding splits, regardless of number of bins.  If
    +   * the dataset has less rows than this value, the entire dataset column will be used.
    +   */
    +  val minSamplesRequired: Int = 10000
    --- End diff --
    
    Yeah we have muuuch bigger tests that turn up whole small clusters. Making a little data set is fine. You can expose the 10000 figure as a `private[spark] val` in the `object` so you can reference it from test code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188777159
  
    **[Test build #51967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51967/consoleFull)** for PR 11319 at commit [`abea876`](https://github.com/apache/spark/commit/abea8765f66d061fca1d5358660fb71e9a194cc2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188947987
  
    Ah... I am 0 for 2 today. I am away from keyboard so feel free to revert it if it is a Blocker. This really doesn't have to go in this branch. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188777288
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187719418
  
    I think the explicit cast is better as well; it makes it clear that `requiredSamples` is a `Double`.  I'm a bit new to this whole PR process so I have to ask if I need to do anything to make that change?  I'm happy to do it if so.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187730689
  
    I don't think making a 10K element test set is too big. I wouldn't expose it as a configurable setter just for this, no.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187728709
  
    The point is that `requiredSamples` isn't a double, but needs to be construed as one to get a calculation right. Yes like any github PR you just push more commits to your branch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11319#discussion_r53934003
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
    @@ -103,6 +103,13 @@ final class QuantileDiscretizer(override val uid: String)
     
     @Since("1.6.0")
     object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] with Logging {
    +
    +  /**
    +   * Minimum number of samples required for finding splits, regardless of number of bins.  If
    +   * the dataset has less rows than this value, the entire dataset column will be used.
    +   */
    +  val minSamplesRequired: Int = 10000
    --- End diff --
    
    I don't think you're missing anything.  It's my first time contributing and I just want to be explicit for reviewers of the patch.  I agree that 10K isn't that big, especially out "in the wild".  However, I wasn't sure if there were standards for time/memory consumption for tests so I added the line note so that reviewers with more experience would be aware in case there are established/de facto testing standards.
    
    I'll make the "->" changes you've indicated and push a new commit sometime today.  Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187666498
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-189034568
  
    @mengxr Sorry, was in a rush and missed the last part of your comment.  I've opened a new PR, but against `apache:master`.  However, you'd like a PR against `apache:branch-1.6`?  If that's correct, just let me know and I'll close the other PR and open a new PR against 1.6.  Sorry for the mixup.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188757876
  
    **[Test build #51967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51967/consoleFull)** for PR 11319 at commit [`abea876`](https://github.com/apache/spark/commit/abea8765f66d061fca1d5358660fb71e9a194cc2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188947055
  
    This broke compilation in branch 1.6:
    
    ```
    [error] /home/jenkins/workspace/spark-branch-1.6-compile-maven-pre-yarn-1.2.1/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala:85: value setSeed is not a member of org.apache.spark.ml.feature.QuantileDiscretizer
    [error] possible cause: maybe a semicolon is missing before `value setSeed'?
    [error]       .setSeed(1)
    [error]        ^
    [error] one error found
    [error] Compile failed at Feb 25, 2016 8:49:20 AM [18.290s]
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187652108
  
    **[Test build #51764 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51764/consoleFull)** for PR 11319 at commit [`635fb4e`](https://github.com/apache/spark/commit/635fb4e78433e8760150d775d41b6af9b3cba976).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188968047
  
    @oliverpierson I reverted the change in branch-1.6 so we don't block other builds. Could you remove the `setSeed` line and submit another PR to branch-1.6? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187646991
  
    Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11319#discussion_r53936527
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
    @@ -103,6 +103,13 @@ final class QuantileDiscretizer(override val uid: String)
     
     @Since("1.6.0")
     object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] with Logging {
    +
    +  /**
    +   * Minimum number of samples required for finding splits, regardless of number of bins.  If
    +   * the dataset has less rows than this value, the entire dataset column will be used.
    +   */
    +  val minSamplesRequired: Int = 10000
    --- End diff --
    
    Also, my original reason for asking about removing the hard coded value of 10K was because that value is the cause of the bug and so a regression test would need to know the value. 
    
    I could have hard coded 10k in to my test.  However if a developer increased its value later, say to 100K, without increasing the hard coded test value as well, they could potentially render the test useless since it would always pass.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188949430
  
    Hmmm.... any idea why it's failing?  This line is no different than line 115 of the same value, except for the line breaks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187516773
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-188757124
  
    Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11319#discussion_r53923176
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
    @@ -103,6 +103,13 @@ final class QuantileDiscretizer(override val uid: String)
     
     @Since("1.6.0")
     object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] with Logging {
    +
    +  /**
    +   * Minimum number of samples required for finding splits, regardless of number of bins.  If
    +   * the dataset has less rows than this value, the entire dataset column will be used.
    +   */
    +  val minSamplesRequired: Int = 10000
    --- End diff --
    
    less rows -> fewer rows
    entire dataset column -> entire dataset?
    
    10000 just isn't that big. A dummy data set in a test would be, what, a megabyte in memory? Am I missing a bigger problem there?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187666502
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51764/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by oliverpierson <gi...@git.apache.org>.
Github user oliverpierson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11319#discussion_r54153835
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
    @@ -71,6 +71,26 @@ class QuantileDiscretizerSuite
         }
       }
     
    +  test("Test splits on dataset larger than minSamplesRequired") {
    +    val sqlCtx = SQLContext.getOrCreate(sc)
    +    import sqlCtx.implicits._
    +
    +    val datasetSize = QuantileDiscretizer.minSamplesRequired + 1
    +    val numBuckets = 5
    +    val df = sc.parallelize((1.0 to datasetSize by 1.0).map(Tuple1.apply)).toDF("input")
    +    val discretizer = new QuantileDiscretizer()
    +      .setInputCol("input")
    +      .setOutputCol("result")
    +      .setNumBuckets(numBuckets)
    +      .setSeed(1)
    --- End diff --
    
    This is the offending line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13444] [MLlib] QuantileDiscretizer choo...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/11319#issuecomment-187723377
  
    Could you add a regression test for this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org