Posted to dev@spark.apache.org by holdenk <gi...@git.apache.org> on 2014/02/10 08:42:24 UTC

[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

GitHub user holdenk opened a pull request:

    https://github.com/apache/incubator-spark/pull/572

    MLI-2: Add k-fold cross validation to MLLib

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-spark addkfoldcrossvalidation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/572.patch

----
commit a5a8492fee4265b1a4225a4a89ce942350c76e4f
Author: Holden Karau <ho...@pigscanfly.ca>
Date:   2014-02-05T23:16:54Z

    Add k-fold cross validation to MLLib

----


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-35224270
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. To do so, please top-post your response.
If your project does not have this feature enabled and wishes so, or if the
feature is enabled but not working, please contact infrastructure at
infrastructure@apache.org or file a JIRA ticket with INFRA.
---

[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34792740
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12677/


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34606646
  
    Can one of the admins verify this patch?


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9780337
  
    oops, fixed


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9781015
  
    Removed


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34798894
  
    Merged build finished.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9780345
  
    Changed in both places.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-35225140
  
    Merged build finished.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34797008
  
     Merged build triggered.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34915739
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12695/


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34792737
  
    Merged build finished.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34671873
  
    Sure, I'll take a look at that tonight. In the earlier pull request that
    was abandoned, someone had asked that its PartionedRDD (which only handled
    k=2) live in core rather than mllib.
    
    
    On Mon, Feb 10, 2014 at 11:04 AM, Xiangrui Meng <no...@github.com> wrote:
    
    > @holdenk <https://github.com/holdenk> , the PartitionwiseSampledRDD was
    > designed with this use case in mind. Both the folded RDD and its complement
    > can be represented by PartitionwiseSampledRDD with BernoulliSamplers. Do
    > you mind modifying your code to use it? Also, cross-validation is a machine
    > learning specific operation. spark.rdd.RDD may not be a good place for it.
    >
    > --
    > Reply to this email directly or view it on GitHub<https://github.com/apache/incubator-spark/pull/572#issuecomment-34668194>
    > .
    >
    
    
    


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34915737
  
    Merged build finished.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34606676
  
    Jenkins, add to whitelist. 


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34800110
  
    Merged build finished.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34608209
  
    Merged build finished.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9800999
  
    I'm not sure which style to use. @rxin ? I prefer the following:
    ~~~
    map { fold => (                                           // "((" seems to be unnecessary
      new PartitionwiseSampledRDD ...
          complement = false), seed),                         // indent 2+4 spaces 
      new PartitionwiseSampledRDD ...
          complement = true), seed)
    )}.toList
    ~~~


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9801137
  
    (1 to folds) is preferred. Your style is fine, though we use 2-space wrapped indents instead of 4. Would this be possible, though?
    ```
    (1 to folds).map { fold => (
      new PartitionwiseSampledRDD(rdd, 
        new BernoulliSampler[T]((fold - 1) / foldsF, fold / foldsF, complement = false), seed),
       ...
    )}.toList
    ```
    
    Anyway, it's up to you, but that way avoids breaking a line in a nested expression.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34792614
  
     Merged build triggered.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34792615
  
    Merged build started.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9780341
  
    Done


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-35852737
  
     Merged build triggered.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34901891
  
    Sounds reasonable.
    
    
    On Wed, Feb 12, 2014 at 10:44 AM, Xiangrui Meng <no...@github.com> wrote:
    
    > @holdenk <https://github.com/holdenk> How about splitting this PR into
    > two? One contains the k-fold splitting method in mllib and the fix to
    > BernoulliSampler, and the other contains the crossValidate function which
    > we can discuss more.
    >
    > --
    > Reply to this email directly or view it on GitHub<https://github.com/apache/incubator-spark/pull/572#issuecomment-34901227>
    > .
    >
    
    
    


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-35854651
  
    Merged build finished.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-35225143
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12734/


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34798895
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12678/


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9780330
  
    Done :)


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34795846
  
    Merged build started.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34797009
  
    Merged build started.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34668194
  
    @holdenk , the PartitionwiseSampledRDD was designed with this use case in mind. Both the folded RDD and its complement can be represented by PartitionwiseSampledRDD with BernoulliSamplers. Do you mind modifying your code to use it? Also, cross-validation is a machine learning specific operation. spark.rdd.RDD may not be a good place for it. 
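
The idea in this comment can be illustrated without Spark. The following is a minimal sketch in plain Scala, assuming the sampler draws one uniform random number per element from a seeded generator and keeps the element when the draw falls in a given range. The names `KFoldSketch`, `rangeSample`, and `kFold` are hypothetical and are not the API in this pull request; they only show why a fold and its complement agree when both use the same seed.

```scala
import scala.util.Random

object KFoldSketch {
  // Hypothetical helper, not Spark's API: keeps an element when its random
  // draw falls inside [lb, ub); with complement = true it keeps the others.
  def rangeSample[T](data: Seq[T], lb: Double, ub: Double,
                     complement: Boolean, seed: Long): Seq[T] = {
    val rng = new Random(seed) // same seed => same draw for each element
    data.filter { _ =>
      val x = rng.nextDouble()
      val inRange = x >= lb && x < ub
      inRange != complement
    }
  }

  // Returns numFolds (training, validation) pairs.
  def kFold[T](data: Seq[T], numFolds: Int, seed: Long): Seq[(Seq[T], Seq[T])] = {
    val numFoldsF = numFolds.toDouble // avoid integer division in the bounds
    (1 to numFolds).map { fold =>
      val lb = (fold - 1) / numFoldsF
      val ub = fold / numFoldsF
      val validation = rangeSample(data, lb, ub, complement = false, seed)
      val training = rangeSample(data, lb, ub, complement = true, seed)
      (training, validation)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 100
    val folds = kFold(data, numFolds = 5, seed = 42L)
    folds.foreach { case (training, validation) =>
      // Each pair partitions the input: no overlap, nothing lost.
      assert(training.intersect(validation).isEmpty)
      assert((training ++ validation).sorted == data)
    }
    // Every element lands in exactly one validation fold.
    assert(folds.flatMap(_._2).sorted == data)
  }
}
```

Because both samples replay the same sequence of draws, the validation slice and its complement never overlap, and the k ranges tile [0, 1) so each element is held out exactly once. In Spark the draws would happen per partition rather than over one sequence, which this sketch does not model.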


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34901227
  
    @holdenk How about splitting this PR into two? One contains the k-fold splitting method in mllib and the fix to BernoulliSampler, and the other contains the crossValidate function which we can discuss more.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-35854653
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12824/


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9805702
  
    Basically, you want to make it obvious that this returns a tuple (which can also be done through an explicit type declaration, but this way is probably simpler).


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34912556
  
    Merged build started.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34608210
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12657/


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34912554
  
     Merged build triggered.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34606891
  
    Merged build started.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-35852739
  
    Merged build started.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-35224269
  
     Merged build triggered.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9802309
  
    ```scala
    (1 to numFolds).map { fold =>
      val sampler = new BernoulliSampler[T]((fold - 1) / foldsF, fold / foldsF, complement = false)
      val train = new PartitionwiseSampledRDD(rdd, sampler, seed)
      val test = new PartitionwiseSampledRDD(rdd, sampler, seed.complement)  // might need to create this
      (train, test)
    }
    ```
    
    Make sure you rename folds to numFolds.


[GitHub] incubator-spark pull request: MLI-2: Start adding k-fold cross val...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#discussion_r9780813
  
    SharedSparkContext isn't available inside the mllib util tests.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34795845
  
     Merged build triggered.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34606890
  
     Merged build triggered.


[GitHub] incubator-spark pull request: MLI-2: Add k-fold cross validation t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/incubator-spark/pull/572#issuecomment-34800114
  
    One or more automated tests failed
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12679/