You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by smartnut007 <gi...@git.apache.org> on 2014/04/22 07:28:43 UTC

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

GitHub user smartnut007 opened a pull request:

    https://github.com/apache/spark/pull/477

    SPARK-1438 RDD.sample() make seed param optional

    copying form previous pull request https://github.com/apache/spark/pull/462
    
    Its probably better to let the underlying language implementation take care of the default . This was easier to do with python as the default value for seed in random and numpy random is None.
    
    In Scala/Java side it might mean propagating an Option or null(oh no!) down the chain until where the Random is constructed. But, looks like the convention in some other methods was to use System.nanoTime. So, followed that convention.
    
    Conflict with overloaded method in sql.SchemaRDD.sample which also defines default params.
    sample(fraction, withReplacement=false, seed=math.random)
    Scala does not allow more than one overloaded to have default params. I believe the author intended to override the RDD.sample method and not overload it. So, changed it.
    
    If backward compatible is important, 3 new method can be introduced (without default params) like this
    sample(fraction)
    sample(fraction, withReplacement)
    sample(fraction, withReplacement, seed)
    
    Added some tests for the scala RDD takeSample method. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/smartnut007/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/477.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #477
    
----
commit 0c247dba6084313873b539bcf230371c903f04b3
Author: Arun Ramakrishnan <sm...@gmail.com>
Date:   2014-04-21T07:41:09Z

    SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample

commit 69619c6686cc7ff7113f8ef031f3ed3698bafa25
Author: Arun Ramakrishnan <sm...@gmail.com>
Date:   2014-04-22T04:37:22Z

    SPARK-1438 fix spacing issue

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41342611
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14452/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11839631
  
    --- Diff: python/pyspark/rddsampler.py ---
    @@ -19,7 +19,7 @@
     import random
     
     class RDDSampler(object):
    -    def __init__(self, withReplacement, fraction, seed):
    +    def __init__(self, withReplacement, fraction, seed=None):
    --- End diff --
    
    Don't you need to do something later to deal with seed being None? Does random.seed(None) do the right thing?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by smartnut007 <gi...@git.apache.org>.

Github user smartnut007 commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41128017
  
    scala/java: Replaced System.nanoTime shared instance of Random in Utils object. 
    python: Made use of an independent instance of Random instead of seeding language api global Random instance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/477


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11866824
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -381,13 +382,11 @@ def takeSample(self, withReplacement, num, seed):
             # If the first sample didn't turn out large enough, keep trying to take samples;
             # this shouldn't happen often because we use a big multiplier for their initial size.
             # See: scala/spark/RDD.scala
    +        random.seed(seed)
    --- End diff --
    
    Is this the global random object? Library code should not be setting the seed and then calling randint. Is there no equivalent of java.util.Random that you can create and use here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41342610
  
    Build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11938305
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -465,7 +465,13 @@ class RDDSuite extends FunSuite with SharedSparkContext {
     
       test("takeSample") {
         val data = sc.parallelize(1 to 100, 2)
    -
    +    
    +    for (num <- List(5,20,100)) {
    --- End diff --
    
    Put spaces after the commas here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by smartnut007 <gi...@git.apache.org>.

Github user smartnut007 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11839800
  
    --- Diff: python/pyspark/rddsampler.py ---
    @@ -19,7 +19,7 @@
     import random
     
     class RDDSampler(object):
    -    def __init__(self, withReplacement, fraction, seed):
    +    def __init__(self, withReplacement, fraction, seed=None):
    --- End diff --
    
    @mateiz  both the numpy random and python language random functions should handle None fine.
    
    import numpy
    numpy.random.RandomState(None)
    import random
    random.seed(None)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by advancedxy <gi...@git.apache.org>.

Github user advancedxy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11936467
  
    --- Diff: python/pyspark/rddsampler.py ---
    @@ -38,17 +38,15 @@ def initRandomGenerator(self, split):
             if self._use_numpy:
                 import numpy
                 self._random = numpy.random.RandomState(self._seed)
    -            for _ in range(0, split):
    -                # discard the next few values in the sequence to have a
    -                # different seed for the different splits
    -                self._random.randint(sys.maxint)
             else:
                 import random
    --- End diff --
    
    Since, we have imported random at the beginning. This line is unnecessary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11840059
  
    --- Diff: python/pyspark/rddsampler.py ---
    @@ -19,7 +19,7 @@
     import random
     
     class RDDSampler(object):
    -    def __init__(self, withReplacement, fraction, seed):
    +    def __init__(self, withReplacement, fraction, seed=None):
    --- End diff --
    
    Good point, in that case just use Python's built-in random, or create a Random object if there isn't a global one you can call.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by smartnut007 <gi...@git.apache.org>.

Github user smartnut007 commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41246796
  
    ok. removed redundant 'import random'


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41008236
  
    `(math.random * 1000).toInt` can only produce 1000 values, which is very few. You should use a much bigger number than 1000, e.g. `1e12`, and then do `toLong`. Or you can create a static Random object somewhere and call `nextLong` on it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41252232
  
    Thanks for the changes, this looks pretty good now. Made a few small comments on it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41005327
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by advancedxy <gi...@git.apache.org>.

Github user advancedxy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11840021
  
    --- Diff: python/pyspark/rddsampler.py ---
    @@ -19,7 +19,7 @@
     import random
     
     class RDDSampler(object):
    -    def __init__(self, withReplacement, fraction, seed):
    +    def __init__(self, withReplacement, fraction, seed=None):
    --- End diff --
    
    @smartnut007, @mateiz, if we should not use nanoTime as a seed, we should not pass none to random.seed(None).  random.seed(None) will use time.time().
    
    As reply for your very first comment, I think consistency is important. But it's just my opinion, 
    others may disagree with me. You should do what you think is right.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41006955
  
    Same thing applies in Python, don't use the current time, call their random function.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by smartnut007 <gi...@git.apache.org>.

Github user smartnut007 commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41243430
  
    @mateiz @advancedxy new commit covers all suggestions so far. Any thoughts ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41339559
  
    Build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11840079
  
    --- Diff: python/pyspark/rddsampler.py ---
    @@ -19,7 +19,7 @@
     import random
     
     class RDDSampler(object):
    -    def __init__(self, withReplacement, fraction, seed):
    +    def __init__(self, withReplacement, fraction, seed=None):
    --- End diff --
    
    You can simply add `if seed is None: seed = random.random()` or whatever is required for it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by advancedxy <gi...@git.apache.org>.

Github user advancedxy commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41007103
  
    ```seed: Int = (math.random * 1000).toInt)```
    hi, @mateiz should we use Long instead of Int to avoid collision.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11938312
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -381,13 +382,11 @@ def takeSample(self, withReplacement, num, seed):
             # If the first sample didn't turn out large enough, keep trying to take samples;
             # this shouldn't happen often because we use a big multiplier for their initial size.
             # See: scala/spark/RDD.scala
    +        rand = Random(seed)
             while len(samples) < total:
    -            if seed > sys.maxint - 2:
    -                seed = -1
    -            seed += 1
    -            samples = self.sample(withReplacement, fraction, seed).collect()
    +            samples = self.sample(withReplacement, fraction, rand.randint(0,sys.maxint)).collect()
     
    -        sampler = RDDSampler(withReplacement, fraction, seed+1)
    +        sampler = RDDSampler(withReplacement, fraction, rand.randint(0,sys.maxint))
    --- End diff --
    
    Put spaces after the comma here and in other instances of `randint(0,sys.maxint)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41348437
  
    Thanks Arun! I've merged this in now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by smartnut007 <gi...@git.apache.org>.

Github user smartnut007 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11840558
  
    --- Diff: python/pyspark/rddsampler.py ---
    @@ -19,7 +19,7 @@
     import random
     
     class RDDSampler(object):
    -    def __init__(self, withReplacement, fraction, seed):
    +    def __init__(self, withReplacement, fraction, seed=None):
    --- End diff --
    
    @advancedxy  @mateiz  
    Actually the default python seed in random.seed(None) is the following snippet and picking random.random() will go through these anyway. ( python 2.7.5 )
    
    from binascii import hexlify as _hexlify
    from os import urandom as _urandom
    long(_hexlify(_urandom(16)), 16)
    
    I generally think its better to leave these things to the language implementors. 
    
    But, if we need the code to look similar, then I would follow the last suggestion in RDDSampler.
    if seed is None: seed = random.randint(0, sys.maxint)
    here sys.maxint is arch specific and generally 64 bit on 64 bit machines. 
    
    Can you guys let me know which one ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11839619
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -465,7 +465,13 @@ class RDDSuite extends FunSuite with SharedSparkContext {
     
       test("takeSample") {
         val data = sc.parallelize(1 to 100, 2)
    -
    +    
    +    for (num <- List(5,20,100)) {
    +    	  val sample = data.takeSample(withReplacement=false, num=num)
    +      assert(sample.size === num)        // Got exactly num elements
    +      assert(sample.toSet.size === num)  // Elements are distinct
    +      assert(sample.forall(x => 1 <= x && x <= 100), "elements not in [1, 100]")
    +    	}
    --- End diff --
    
    Indenting seems off here, there seem to be some tabs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41339497
  
    Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by smartnut007 <gi...@git.apache.org>.

Github user smartnut007 commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41263746
  
    Great. Fixed the space formatting as suggested.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by smartnut007 <gi...@git.apache.org>.

Github user smartnut007 commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41005731
  
    @advancedxy If consistency is important then I could set default of  long(time.time() * 1e9) in RDDSampler (python api) constructor like you suggested.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41339545
  
     Build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by advancedxy <gi...@git.apache.org>.

Github user advancedxy commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41244424
  
    hi, @smartnut007, I don't think I could make the call. So, I didn't reply the question which one to use.
    should ask @mateiz !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by smartnut007 <gi...@git.apache.org>.

Github user smartnut007 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11839813
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -465,7 +465,13 @@ class RDDSuite extends FunSuite with SharedSparkContext {
     
       test("takeSample") {
         val data = sc.parallelize(1 to 100, 2)
    -
    +    
    +    for (num <- List(5,20,100)) {
    +    	  val sample = data.takeSample(withReplacement=false, num=num)
    +      assert(sample.size === num)        // Got exactly num elements
    +      assert(sample.toSet.size === num)  // Elements are distinct
    +      assert(sample.forall(x => 1 <= x && x <= 100), "elements not in [1, 100]")
    +    	}
    --- End diff --
    
    will take care of the indent



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41006940
  
    Hey, FYI, it's not a good idea to use System.nanoTime as the seed because multiple RDDs created at the same time (which can easily happen due to lazy evaluation) would have the exact same seed. Use math.random() instead, or the equivalent in PySpark. Math.random is synchronized as far as I know, which is bad for high-performance random number generation but good for getting distinct numbers here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/477#discussion_r11938430
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -334,11 +336,11 @@ abstract class RDD[T: ClassTag](
        * Randomly splits this RDD with the provided weights.
        *
        * @param weights weights for splits, will be normalized if they don't sum to 1
    -   * @param seed random seed, default to System.nanoTime
    +   * @param seed random seed, default to rand.nextLong
    --- End diff --
    
    You don't need to say what the seed defaults to here since users won't understand it; just say `@param seed random seed` and they can guess that if you don't specify it, we will choose one


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---