Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/04/02 10:36:53 UTC

[jira] [Commented] (SPARK-6665) Randomly Shuffle an RDD

    [ https://issues.apache.org/jira/browse/SPARK-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392382#comment-14392382 ] 

Sean Owen commented on SPARK-6665:
----------------------------------

If you are doing cross-validation, I think a direct random subsample is better. Yes, the first partition of a randomly permuted RDD is also a random subsample, but it sits entirely on one partition, which is bad for distributing computation over it. Stochastic algorithms already pick examples at random, right? They shouldn't be consuming data in some fixed order. So my question is about the use case. The one I can think of is iterating over an RDD serially while wanting to encounter its elements in random order; that makes sense for smallish RDDs, as in streaming perhaps.
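
To illustrate the point about direct subsampling, here is a minimal sketch in plain Python rather than Spark. Nested lists stand in for RDD partitions, and `assign_folds` is a hypothetical helper, not a Spark API. Each element is given a random fold id where it already sits, so every fold ends up spread over all partitions instead of sitting on a single one. In Spark itself, `RDD.sample` and `RDD.randomSplit` are the roughly analogous calls.

```python
import random


def assign_folds(partitions, k, seed):
    """Assign every element a random fold id in [0, k), partition by partition.

    `partitions` is a list of lists standing in for the partitions of an RDD.
    Returns k folds, each a list of (partition_index, element) pairs.
    """
    folds = [[] for _ in range(k)]
    for index, partition in enumerate(partitions):
        # One reproducible generator per partition, derived from the seed.
        rng = random.Random(seed * 1_000_003 + index)
        for element in partition:
            folds[rng.randrange(k)].append((index, element))
    return folds


# Toy stand-in for an ordered RDD: 4 partitions of 250 consecutive integers.
partitions = [list(range(p * 250, (p + 1) * 250)) for p in range(4)]
folds = assign_folds(partitions, k=5, seed=42)

for fold_id, fold in enumerate(folds):
    source_partitions = sorted({index for index, _ in fold})
    print(fold_id, len(fold), source_partitions)
```

No data moves between partitions to build the folds, and work on any one fold can still be distributed, which is the advantage over taking whole partitions of a shuffled RDD as folds.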

> Randomly Shuffle an RDD 
> ------------------------
>
>                 Key: SPARK-6665
>                 URL: https://issues.apache.org/jira/browse/SPARK-6665
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Shell
>            Reporter: Florian Verhein
>            Priority: Minor
>
> *Use case* 
> RDD created in a way that has some ordering, but you need to shuffle it because the ordering would cause problems downstream. E.g.
> - it will be used to train an ML algorithm that makes stochastic assumptions (like SGD)
> - it will be used as input for cross-validation, e.g. after the shuffle you could just grab partitions (or part files, if saved to HDFS) as folds
> Related question in mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html
> *Possible implementation*
> As mentioned by [~sowen] in the above thread, one could sort by a good hash of the element (or the key, if it's paired) combined with a random salt.
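
The implementation idea quoted above can be sketched in plain Python as follows. This is an illustration under assumptions, not Spark code: `salted_key` and `shuffle_by_salted_hash` are hypothetical names, SHA-256 is just one choice of "good hash", and a local `sorted` stands in for a distributed sort such as `RDD.sortBy`.

```python
import hashlib


def salted_key(element, salt):
    """Sort key: a hash of the salt together with the element's repr.

    hashlib is used instead of the built-in hash() so the key is stable
    across processes (hash() is randomized per process for strings).
    """
    data = f"{salt}:{element!r}".encode("utf-8")
    return hashlib.sha256(data).digest()


def shuffle_by_salted_hash(elements, salt):
    """Return the elements ordered by their salted hash."""
    return sorted(elements, key=lambda element: salted_key(element, salt))


ordered = list(range(20))
shuffled = shuffle_by_salted_hash(ordered, salt="salt-1")
reshuffled = shuffle_by_salted_hash(ordered, salt="salt-2")

print(shuffled)
print(reshuffled)
```

The order is deterministic for a given salt, so a recomputed partition would see the same keys, and a new salt gives a new order. One caveat of this sketch: equal elements hash to the same key and therefore stay adjacent, so data with many duplicates would need something extra mixed into the key.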



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
