Posted to issues@spark.apache.org by "Ilya Ganelin (JIRA)" <ji...@apache.org> on 2014/12/08 23:58:13 UTC

[jira] [Commented] (SPARK-4417) New API: sample RDD to fixed number of items

    [ https://issues.apache.org/jira/browse/SPARK-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238613#comment-14238613 ] 

Ilya Ganelin commented on SPARK-4417:
-------------------------------------

Hi, I'd like to work on this. Can someone please assign it to me? Thank you. 

> New API: sample RDD to fixed number of items
> --------------------------------------------
>
>                 Key: SPARK-4417
>                 URL: https://issues.apache.org/jira/browse/SPARK-4417
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, Spark Core
>            Reporter: Davies Liu
>
> Sometimes we just want a fixed number of items randomly selected from an RDD. For example, before sorting an RDD we need to gather a fixed number of keys from each partition.
> To do this today, we need two passes over the RDD: one to get the total count, and one to sample at the ratio calculated from that count. In fact, we could do this in one pass.
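
For illustration only, one common way to draw a fixed number of items in a single pass is reservoir sampling (Algorithm R). The sketch below is plain Python and is not taken from the issue or from Spark's code; the function name reservoir_sample and its parameters are hypothetical.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Hypothetical helper (not part of Spark): it reads the input once and
    never needs to know the total number of items in advance.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an existing
            # one only if the slot falls inside the reservoir, which
            # happens with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1000), 10, seed=42)
    print(len(sample))
```

Assuming a PySpark RDD named rdd, something like rdd.mapPartitions(lambda it: reservoir_sample(it, k)) would gather up to k items from each partition in one pass. Combining those per-partition samples into a single uniform sample of size k would also need the per-partition counts, which this sketch does not handle.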



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
