Posted to reviews@spark.apache.org by ilganeli <gi...@git.apache.org> on 2014/12/17 19:22:46 UTC

[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

GitHub user ilganeli opened a pull request:

    https://github.com/apache/spark/pull/3723

    [SPARK-4417] New API: sample RDD to fixed number of items	

    Hi all - I've added an interface to sample an RDD down to a fixed count of elements (instead of simply by a percentage). I've also added new tests to validate this functionality, and I've updated a previously existing function interface to re-use common code.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ilganeli/spark SPARK-4417B

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3723.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3723
    
----
commit 4a8ad6a04a99930dd176dd199f0f26d5ecc80aa8
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2014-12-10T18:34:48Z

    Updated the RDD class to add a sampling function that returns N elements from the RDD as a new RDD.

commit 564bcc14ad3c271876798b723cc84f7d7c4716dd
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2014-12-10T18:37:17Z

    Added testing for new sampling function.

commit 350f039b4be9e9a34104652f5c74b2695254ad91
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2014-12-10T18:47:25Z

    Fixing minor formatting problems

commit 8e625d35ecf1a2fa8a3b1f113196d7d3e56781d1
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2014-12-17T18:13:06Z

    Merge remote-tracking branch 'upstream/master' into SPARK-4417B

commit 26b4b81ff4e3333769b343aa110c2c9112d80113
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2014-12-17T18:17:55Z

    Spacing

commit 8d411c3a2d922028659111c75ed6b0cdc4956608
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2014-12-17T18:19:26Z

    More Spacing

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3723




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-70469176
  
    Okay, sounds like we'll close this issue as a won't-fix.




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-68211996
  
    Hello, could anyone please provide any more feedback on this patch and ideally get this merged? Thanks!




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-67560274
  
    Hi Sean - my concern with using take/collect() as in the previous approach is that memory limitations impose a hard cap on what is tractable. I wanted to build an implementation that is independent of driver memory, even if it is less efficient.
    
    The sampling "over and over" will only happen a very small fraction of the time (when we're at the very tail end of the statistical distribution used to do the sampling). In general, this approach makes only a couple of passes over the data: once to sample it and then, if we end up with too many samples (since the sampling is an approximation), a final pass to pare down to the exact number.
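    The oversample-then-pare-down logic described above can be sketched outside Spark. The following plain-Python simulation uses a list in place of an RDD and a hypothetical helper name; it only illustrates the approach, not the patch's actual code:
    
    ```python
    import random
    
    def sample_to_count(data, n, oversample=1.2, seed=42):
        """Sketch: Bernoulli-sample at a rate slightly above n/count,
        retrying in the rare case the draw comes up short, then pare
        down to exactly n items (hypothetical stand-in for the RDD API)."""
        rng = random.Random(seed)
        count = len(data)          # in Spark this would be a count() action
        if n >= count:
            return list(data)
        rate = min(1.0, oversample * n / count)
        while True:
            # one "pass" over the data: keep each element with probability `rate`
            sampled = [x for x in data if rng.random() < rate]
            if len(sampled) >= n:  # usually true; retry only in the tail case
                return sampled[:n] # pare down to the exact requested count
    
    print(len(sample_to_count(list(range(1000)), 100)))  # -> 100
    ```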




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-67368189
  
      [Test build #24549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24549/consoleFull) for   PR 3723 at commit [`8d411c3`](https://github.com/apache/spark/commit/8d411c3a2d922028659111c75ed6b0cdc4956608).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-68315776
  
    I agree with Mark about this. This method doesn't seem worth adding to the default API, especially if it will be tricky to implement. For extracting small samples, takeSample already lets you specify an exact number, and for downsampling large RDDs, most users probably don't need an exact count (and wouldn't want to pay an extra pass over the data for it). This and other advanced sampling methods could make a good external package, though.




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-68322948
  
    Mark and Matei - I hear you and understand what you're saying. Does it make sense to create a new JIRA to refactor the RDD interface and move the advanced sampling methods into a separate package? This would obviously involve deprecating the existing functions, so I presume it wouldn't see the light of day for a while.




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-67557836
  
    Hello - can anyone please take a look at this patch and provide feedback on the approach?




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-67559137
  
    So the emphasis is on *RDD*, right? You can already sample to an *Array* on the driver. You could make the same argument for several other methods: `take(100)` can't be used to make an `RDD` either. I think the logic is that it's for taking a smallish number of things, and likewise for sampling.
    
    Put differently, how about just taking the `Array` and using `parallelize()` to make an `RDD` again?
    
    If you really want a huge sample of a much larger `RDD`, you probably need to sample by partition, using a different approach, to do it efficiently. Here you're sampling over and over until you get enough.
    
    So I think this may not quite fit with how other similar API methods work, for better or worse, although maybe you don't need this method to do what you want.
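    For the take-an-Array-then-parallelize route suggested above, drawing an exact-size sample in a single pass is classically done with reservoir sampling. A plain-Python sketch of that technique (this is an illustration, not Spark's actual takeSample implementation):
    
    ```python
    import random
    
    def reservoir_sample(iterable, k, seed=0):
        """One-pass exact-size sampling (reservoir sampling, algorithm R),
        similar in spirit to collecting a fixed-size sample to the driver."""
        rng = random.Random(seed)
        reservoir = []
        for i, x in enumerate(iterable):
            if i < k:
                reservoir.append(x)      # fill the reservoir first
            else:
                j = rng.randint(0, i)    # replace with decreasing probability k/(i+1)
                if j < k:
                    reservoir[j] = x
        return reservoir
    
    print(len(reservoir_sample(range(10000), 100)))  # -> 100
    ```
    
    The driver-side result could then be handed to `sc.parallelize(...)` to obtain an RDD again, as suggested.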




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-67382106
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24549/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-68216759
  
    My biggest problem with this is that, while the existing `sample` is an action, `sampleByCount` is another one of those unholy beasts that is neither an action nor a transformation -- meaning that, while it transforms an RDD into another RDD, it isn't lazy while doing so, but rather embeds several actions (`count`) and makes use of another unholy beast (`zipWithIndex`), all of which means that invoking `sampleByCount` eagerly launches several jobs in order to create the new RDD.
    
    This is by no means the only eager transformation (or whatever we end up calling these unholy beasts), since there are a handful of others that already exist in Spark; but I am really hesitant to add another. What we need is a larger strategy and re-organization to properly handle, name and document eager transformations, but that is well beyond the scope of this single PR. In the meantime, eager transformations are just conveniences (inconveniences if you are trying to launch jobs asynchronously) that package up one or more actions. They can always be broken up into multiple explicit, ordinary transformations and actions (as Sean was effectively suggesting earlier), so none of them is strictly necessary to achieve its functionality.
    
    I'm really hesitant to add `sampleByCount` to the Spark API and thereby to the list of eager transformations that we need to somehow fix in the future.  Perhaps a better way to handle such convenience packaging of transformations and actions on RDDs is to include them in [Spark Packages](http://spark-packages.org/). 
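    The "break it into explicit transformations and actions" alternative can be illustrated in miniature. Below is a plain-Python stand-in (hypothetical helper name, lists in place of RDDs) for the explicit count, oversampled draw, and zipWithIndex-style truncation a user could run themselves:
    
    ```python
    import random
    
    def explicit_sample_by_count(data, n, seed=7):
        """Simulate the explicit steps instead of an eager sampleByCount:
        an explicit count action, an oversampled Bernoulli draw, then a
        zipWithIndex-style filter down to n items."""
        rng = random.Random(seed)
        count = len(data)                    # explicit action: count
        rate = min(1.0, 1.5 * n / max(count, 1))
        sampled = [x for x in data if rng.random() < rate]  # sample pass
        indexed = list(enumerate(sampled))   # stand-in for zipWithIndex
        # keep only the first n indices (a real version would retry or
        # top up in the rare case the draw came up short of n)
        return [x for i, x in indexed if i < n]
    
    result = explicit_sample_by_count(list(range(5000)), 200)
    print(len(result))
    ```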




[GitHub] spark pull request: [SPARK-4417] New API: sample RDD to fixed numb...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3723#issuecomment-67382094
  
      [Test build #24549 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24549/consoleFull) for   PR 3723 at commit [`8d411c3`](https://github.com/apache/spark/commit/8d411c3a2d922028659111c75ed6b0cdc4956608).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait ParquetTest `
      * `protected class CaseInsensitiveMap(map: Map[String, String]) extends Map[String, String] `


