You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2016/05/23 09:27:18 UTC

[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/13260

    [SPARK-15481][CORE] Prevent `takeSample` from calling `collect` multiple times

    ## What changes were proposed in this pull request?
    
    `takeSample` might call `collect` multiple times at a very low probability. This PR logically prevents that by using `count` instead multiple times and calling `collect` once. Also, this PR fixes some obsolete comments and assertion in `takeSample` testcase of `RDDSuite.scala`.
    
    ## How was this patch tested?
    
    Manual. (The probability is less than 0.0001 according to `SamplingUtils.scala`.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-15481

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13260.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13260
    
----
commit cbabe458e91b56baf04852ec725d91a3c6f4c10b
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-05-23T09:14:33Z

    [SPARK-15481][CORE] Prevent `takeSample` from calling `collect` multiple times

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-221223390
  
    Hi, @andrewor14 .
    Could you review this PR again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-221193011
  
    @andrewor14 .
    Now, it's just about fixing `takeSample` testcase.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64194388
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
    --- End diff --
    
    Shouldn't we break of the loop after a fixed number of attempts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13260


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-221223033
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
GitHub user dongjoon-hyun reopened a pull request:

    https://github.com/apache/spark/pull/13260

    [SPARK-15481][CORE] Prevent `takeSample` from calling `collect` multiple times

    ## What changes were proposed in this pull request?
    
    `takeSample` might call `collect` multiple times at a very low probability. This PR logically prevents that by using `count` instead multiple times and calling `collect` once. Also, this PR fixes some obsolete comments and assertion in `takeSample` testcase of `RDDSuite.scala`.
    
    ## How was this patch tested?
    
    Manual. (The probability is less than 0.0001 according to `SamplingUtils.scala`.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-15481

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13260.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13260
    
----
commit cbabe458e91b56baf04852ec725d91a3c6f4c10b
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-05-23T09:14:33Z

    [SPARK-15481][CORE] Prevent `takeSample` from calling `collect` multiple times

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64335663
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
               }
    -          Utils.randomizeInPlace(samples, rand).take(num)
    +          Utils.randomizeInPlace(samples.collect(), rand).take(num)
    --- End diff --
    
    Ur, right. It'll take one more pass in all cases.
    Hmm, maybe, that might be the main reason not to do this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-220932118
  
    **[Test build #59125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59125/consoleFull)** for PR 13260 at commit [`cbabe45`](https://github.com/apache/spark/commit/cbabe458e91b56baf04852ec725d91a3c6f4c10b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64339447
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
               }
    -          Utils.randomizeInPlace(samples, rand).take(num)
    +          Utils.randomizeInPlace(samples.collect(), rand).take(num)
    --- End diff --
    
    By the way, @andrewor14 . This was not about only an memory issue. `collect` is more heavier than `count` always. Isn't it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64334393
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
    --- End diff --
    
    Yes. I agree with you on the pros and cons of the use of exception for this.
    We can do that, but I hope to avoid that if possible since it'll change the public API signature.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64331578
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
    --- End diff --
    
    We could throw an exception after x iterations. It will be a bit of a pain to test though. I don't feel strongly about this, but it seems like a potential source of problems.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-221356072
  
    OK, now it's pretty minor. LGTM merging into master 2.0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-220954569
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59125/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64334880
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
               }
    -          Utils.randomizeInPlace(samples, rand).take(num)
    +          Utils.randomizeInPlace(samples.collect(), rand).take(num)
    --- End diff --
    
    won't this cause another job to be run? It's more memory efficient but it takes 1 more pass


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64338520
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
               }
    -          Utils.randomizeInPlace(samples, rand).take(num)
    +          Utils.randomizeInPlace(samples.collect(), rand).take(num)
    --- End diff --
    
    Indeed, I agree with you. This PR tried to improve stability, but there is a regression on most common cases. I missed that. I'll close this PR.
    
    Thank you for spending your precious time on this PR and sorry for closing this, @andrewor14 and @hvanhovell .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-221223034
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59192/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-221359274
  
    Thank you, @andrewor14 !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-220954384
  
    **[Test build #59125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59125/consoleFull)** for PR 13260 at commit [`cbabe45`](https://github.com/apache/spark/commit/cbabe458e91b56baf04852ec725d91a3c6f4c10b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64194968
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
    --- End diff --
    
    I mean some kind of exception for that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64335944
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
               }
    -          Utils.randomizeInPlace(samples, rand).take(num)
    +          Utils.randomizeInPlace(samples.collect(), rand).take(num)
    --- End diff --
    
    Anyway, thank you for review, @andrewor14 . 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun closed the pull request at:

    https://github.com/apache/spark/pull/13260


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-221222784
  
    **[Test build #59192 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59192/consoleFull)** for PR 13260 at commit [`eccb673`](https://github.com/apache/spark/commit/eccb673c9528362e63e02cc586e3d2db5b2b603f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64194892
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
    --- End diff --
    
    In this PR, I didn't try to change the in-out functionality.
    Do you think we should change that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-220954567
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64344334
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -678,27 +678,26 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
         }
         {
           val sample = data.takeSample(withReplacement = true, num = 20)
    -      assert(sample.size === 20)        // Got exactly 100 elements
    -      assert(sample.toSet.size <= 20, "sampling with replacement returned all distinct elements")
    +      assert(sample.size === 20)        // Got exactly 20 elements
           assert(sample.forall(x => 1 <= x && x <= n), s"elements not in [1, $n]")
         }
         {
           val sample = data.takeSample(withReplacement = true, num = n)
    -      assert(sample.size === n)        // Got exactly 100 elements
    --- End diff --
    
    n is not 100 anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64337732
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
               }
    -          Utils.randomizeInPlace(samples, rand).take(num)
    +          Utils.randomizeInPlace(samples.collect(), rand).take(num)
    --- End diff --
    
    I think it's important to not have regressions. These samples might be large, in which case it's worth using a little more memory than doing another pass.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64335520
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
    --- End diff --
    
    Yeah you are right about that. Let's not change it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [MINOR][CORE][TEST] Update obsolete `takeSampl...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13260#issuecomment-221192800
  
    **[Test build #59192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59192/consoleFull)** for PR 13260 at commit [`eccb673`](https://github.com/apache/spark/commit/eccb673c9528362e63e02cc586e3d2db5b2b603f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64335756
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
               }
    -          Utils.randomizeInPlace(samples, rand).take(num)
    +          Utils.randomizeInPlace(samples.collect(), rand).take(num)
    --- End diff --
    
    If the situation happens, it will reduce more traffics, too.
    But, the situation happens very rarely. Should I close this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64343091
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
               }
    -          Utils.randomizeInPlace(samples, rand).take(num)
    +          Utils.randomizeInPlace(samples.collect(), rand).take(num)
    --- End diff --
    
    The title of this PR will be changed into `[MINOR][CORE] ...`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64337410
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
    --- End diff --
    
    Thank you, @hvanhovell !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64342963
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
               }
    -          Utils.randomizeInPlace(samples, rand).take(num)
    +          Utils.randomizeInPlace(samples.collect(), rand).take(num)
    --- End diff --
    
    Ur, @andrewor14 . 
    I forget that I fixed the obsolete unit tests in this PR, too.
    May I fix that after reopening and changing this PR?
    Or, should I create another PR for that? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64193997
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -678,27 +678,26 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
         }
         {
           val sample = data.takeSample(withReplacement = true, num = 20)
    -      assert(sample.size === 20)        // Got exactly 100 elements
    -      assert(sample.toSet.size <= 20, "sampling with replacement returned all distinct elements")
    --- End diff --
    
    This is always true.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15481][CORE] Prevent `takeSample` from ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13260#discussion_r64194767
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -550,17 +550,19 @@ abstract class RDD[T: ClassTag](
             } else {
               val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
                 withReplacement)
    -          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt())
    +          var count = samples.count()
     
               // If the first sample didn't turn out large enough, keep trying to take samples;
               // this shouldn't happen often because we use a big multiplier for the initial size
               var numIters = 0
    -          while (samples.length < num) {
    +          while (count < num) {
                 logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    -            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            samples = this.sample(withReplacement, fraction, rand.nextInt())
    +            count = samples.count()
                 numIters += 1
    --- End diff --
    
    Thank you for review, @hvanhovell . 
    According to the testsuite, it was designed to return the exact number always.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org