Posted to issues@spark.apache.org by "Erik Erlandson (JIRA)" <ji...@apache.org> on 2014/09/12 00:20:33 UTC

[jira] [Commented] (SPARK-3250) More Efficient Sampling

    [ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130809#comment-14130809 ] 

Erik Erlandson commented on SPARK-3250:
---------------------------------------

I developed prototype iterator classes for "fast gap sampling" with and without replacement.  The code, testing rig and test results can be seen here:
https://gist.github.com/erikerlandson/05db1f15c8d623448ff6

I also wrote up some discussion of the algorithms here:

Faster Random Samples With Gap Sampling
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
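
For readers who want the idea without following the links, here is a minimal sketch of gap sampling without replacement (Bernoulli sampling). It is not the prototype from the gist. The function name `gap_sample` and its parameters are hypothetical, chosen only for illustration. Instead of drawing one random number per element, it draws the number of elements to skip before the next accepted one from a geometric distribution, so the number of random draws scales with the sample size rather than the input size.

```python
import math
import random
from itertools import islice

_END = object()  # sentinel marking an exhausted iterator


def gap_sample(data, p, rng=None):
    """Yield each element of `data` independently with probability p.

    Hypothetical illustration: the gap between accepted elements is
    drawn from a geometric distribution instead of testing every element.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    rng = rng or random.Random()
    log_q = math.log1p(-p)  # log(1 - p), always negative here
    it = iter(data)
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        gap = int(math.log(u) / log_q)  # elements to skip; P(gap >= k) = (1-p)^k
        # Skip `gap` elements, then take the next one if it exists.
        item = next(islice(it, gap, gap + 1), _END)
        if item is _END:
            return
        yield item
```

A call such as `list(gap_sample(range(1000), 0.05))` would return roughly 50 elements in their original order. In this sketch the skip still walks the iterator, so the saving is in random number generation; a data source with a cheap skip operation would presumably benefit more.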


> More Efficient Sampling
> -----------------------
>
>                 Key: SPARK-3250
>                 URL: https://issues.apache.org/jira/browse/SPARK-3250
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O(n) operation.  A number of stochastic algorithms achieve speed ups by exploiting O(k) sampling, where k is the number of data points to sample.  Examples of such algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient Descent with mini batching.
> More efficient sampling may be achievable by packing partitions with an ArrayBuffer or other data structure supporting random access.  Since many of these stochastic algorithms perform repeated rounds of sampling, it may be feasible to perform a transformation to change the backing data structure followed by multiple rounds of sampling.
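
As a rough illustration of the random-access idea in the issue description above, the following sketch draws k distinct elements from an indexable buffer by picking random positions. A plain Python list stands in for the ArrayBuffer, and the name `sample_k` is hypothetical, not part of Spark. The expected cost is about O(k) when k is much smaller than the buffer length, since repeated positions are then rare.

```python
import random


def sample_k(buf, k, rng=None):
    """Return k distinct elements of the indexable `buf`, chosen uniformly.

    Hypothetical illustration: positions are drawn at random and
    duplicates are rejected, so no pass over the whole buffer is needed.
    """
    n = len(buf)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and len(buf)")
    rng = rng or random.Random()
    chosen = set()
    while len(chosen) < k:
        chosen.add(rng.randrange(n))  # duplicate positions are ignored by the set
    return [buf[i] for i in chosen]
```

Because the buffer is built once and then only indexed, repeated rounds such as `sample_k(points, 32)` could reuse it without another pass over the data.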



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
