Posted to user@spark.apache.org by glxc <r....@gmail.com> on 2014/05/21 23:05:08 UTC

Inconsistent RDD Sample size

I have a graph and am trying to take a random sample of vertices without
replacement, using the RDD.sample() method

verts are the vertices in the graph

>  val verts = graph.vertices

and executing this multiple times in a row 

>  verts.sample(false, 10000.toDouble/verts.count.toDouble,
> System.currentTimeMillis).count

yields a different count almost every time (though always within a small %
of the target).

Why does this happen? I looked at PartitionwiseSampledRDD but can't figure
it out.

Also, is there another method/technique that yields the same result each
time? My understanding is that grabbing random indices may not be the best
use of the RDD model.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Inconsistent-RDD-Sample-size-tp6197.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Inconsistent RDD Sample size

Posted by Xiangrui Meng <me...@gmail.com>.
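RDD.sample() doesn't guarantee an exact sample size; the fraction is only
the expected proportion of elements kept. If you fix the random seed
instead of passing System.currentTimeMillis, it returns the same result
every time. -Xiangrui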
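
To illustrate the point about sample size and seeds, here is a minimal sketch in plain Scala (no Spark required). It assumes that sampling without replacement keeps each element independently with probability equal to the fraction; the object name and the bernoulliSample helper are hypothetical stand-ins for illustration, not Spark's actual implementation.

```scala
import scala.util.Random

object BernoulliSampleSketch {
  // Hypothetical helper: keep each element independently with
  // probability `fraction`, using a generator seeded with `seed`.
  def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val verts = 1 to 100000              // stand-in for the vertex collection
    val fraction = 10000.0 / verts.size  // target of about 10000 elements

    // Different seeds: the counts land near 10000 but rarely match exactly.
    val counts = Seq(1L, 2L, 3L).map(s => bernoulliSample(verts, fraction, s).size)
    println(s"counts with different seeds: $counts")

    // Same seed: the same elements are kept, so the count is identical too.
    val a = bernoulliSample(verts, fraction, 42L)
    val b = bernoulliSample(verts, fraction, 42L)
    println(s"same seed gives same sample: ${a == b}")
  }
}
```

Under that assumption the count follows a binomial distribution, so with 100000 elements and a fraction of 0.1 it varies by roughly +/- 95 (one standard deviation) around 10000. If an exact count matters more than a fixed fraction, RDD.takeSample(withReplacement, num, seed) returns an array of exactly num elements to the driver, provided the RDD holds at least that many.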
