Posted to issues@spark.apache.org by "Sandeep Singh (JIRA)" <ji...@apache.org> on 2014/11/23 07:47:12 UTC

[jira] [Commented] (SPARK-4417) New API: sample RDD to fixed number of items

    [ https://issues.apache.org/jira/browse/SPARK-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222332#comment-14222332 ] 

Sandeep Singh commented on SPARK-4417:
--------------------------------------

Can you assign this to me?

> New API: sample RDD to fixed number of items
> --------------------------------------------
>
>                 Key: SPARK-4417
>                 URL: https://issues.apache.org/jira/browse/SPARK-4417
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, Spark Core
>            Reporter: Davies Liu
>
> Sometimes we just want a fixed number of items randomly selected from an RDD. For example, before sorting an RDD we need to gather a fixed number of keys from each partition.
> To do this, we currently need two passes over the RDD: one to get the total count, and another to sample with the ratio calculated from that count. In fact, we could do this in one pass.
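
For reference, one common way to draw a fixed number of items in a single pass is reservoir sampling. The sketch below is plain Python, not Spark code, and is not necessarily the approach this issue will adopt; the function name `reservoir_sample` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Hypothetical helper for illustration: it reads the input exactly once
    and never needs to know the total number of items in advance.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by picking
            # a slot in [0, i] and replacing only if it falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, seed=42)
print(len(sample))
```

Under the assumption that a per-partition sample is enough (as in the sort example above), a function like this could be passed to `rdd.mapPartitions`. Combining per-partition reservoirs into one uniform sample of the whole RDD would additionally need the per-partition counts as weights, which this sketch does not cover.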



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
