Posted to reviews@spark.apache.org by burness <gi...@git.apache.org> on 2016/05/14 10:32:44 UTC

[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

GitHub user burness opened a pull request:

    https://github.com/apache/spark/pull/13116

    [SPARK-15324] [SQL] Add the takeSample function to the Dataset

    ## What changes were proposed in this pull request?
    
    In this PR, I add a takeSample function to Dataset that samples a specified number of rows, rather than a fraction as the existing sample function does.
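    
    For illustration, a minimal usage sketch of the proposed API (hypothetical data, count, and seed; the signature matches this PR's diff, and an implicit Encoder is assumed in scope as in `DatasetSuite`):
    
    ```scala
    // Build a small Dataset, as the new DatasetSuite test does.
    val data = sparkContext.parallelize(1 to 1000, 2).toDS()
    
    // Draw exactly 20 rows without replacement, with a fixed seed.
    val sampled = data.takeSample(withReplacement = false, num = 20, seed = 42L)
    ```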
    
    
    ## How was this patch tested?
    
    add a test in `DatasetSuite`
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/burness/spark takeSample

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13116.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13116
    
----
commit c003f24cf402bcf80c0d920d4291f3753cb76ed1
Author: burness <bu...@gmail.com>
Date:   2016-05-14T10:16:56Z

    add takeSample in Dataset

commit 9874686563de7a5cf2bf312481910126f3dc0f12
Author: burness <bu...@gmail.com>
Date:   2016-05-14T10:20:31Z

    modify the format of the comment

----



[GitHub] spark pull request #13116: [SPARK-15324] [SQL] Add the takeSample function t...

Posted by burness <gi...@git.apache.org>.
Github user burness closed the pull request at:

    https://github.com/apache/spark/pull/13116



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/13116#issuecomment-219546005
  
    @burness thanks for working on this. The PR in its current state has some serious potential memory and performance problems (see the comments). What is the use case for this? If you do want to press ahead, you will need to take a look at the physical `SampleExec` operator and add some kind of limit to it.
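    
    For illustration, a rough sketch of the engine-side direction suggested here (hypothetical, not this PR's code; `takeSampleViaLimit` is an invented name, and `fraction` is assumed to be caller-supplied and padded above num / count so the sample rarely comes up short):
    
    ```scala
    import org.apache.spark.sql.Dataset
    
    // Cap the sample inside the query plan with limit, so rows are never
    // collected to the driver.
    def takeSampleViaLimit[T](ds: Dataset[T], withReplacement: Boolean,
        fraction: Double, num: Int, seed: Long): Dataset[T] =
      ds.sample(withReplacement, fraction, seed).limit(num)
    ```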



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13116#discussion_r63274249
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -402,6 +402,76 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           3, 17, 27, 58, 62)
       }
     
    +  test("takeSample") {
    +    val n = 1000
    +    val data = sparkContext.parallelize(1 to n, 2).toDS()
    +    for (num <- List(0, 5, 20, 100)) {
    +      val sample = data.takeSample(withReplacement = false, num = num)
    +      assert(sample.count === num) // Got exactly num elements
    +      assert(sample.distinct.count === num) // Elements are distinct
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    for (seed <- 1 to 5) {
    +      val sample = data.takeSample(withReplacement = false, 20, seed)
    +      assert(sample.count() === 20) // Got exactly 20 elements
    +      assert(sample.distinct.count === 20) // Elements are distinct
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    for (seed <- 1 to 5) {
    +      val sample = data.takeSample(withReplacement = false, 100, seed)
    +      assert(sample.count === 100) // Got exactly 100 elements
    +      assert(sample.distinct.count === 100) // Elements are distinct
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    for (seed <- 1 to 5) {
    +      val sample = data.takeSample(withReplacement = true, 20, seed)
    +      assert(sample.count === 20) // Got exactly 20 elements
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    {
    +      val sample = data.takeSample(withReplacement = true, num = 20)
    +      assert(sample.count === 20) // Got exactly 20 elements
    +    val sampleDisCount = sample.distinct.count
    +      assert(sampleDisCount <= 20, "sampling with replacement returned all distinct elements")
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    {
    +      val sample = data.takeSample(withReplacement = true, num = n)
    +      assert(sample.count === n) // Got exactly n elements
    +      // Chance of getting all distinct elements is astronomically low, so test we got < n
    +      assert(sample.distinct.count < n, "sampling with replacement returned all distinct elements")
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    for (seed <- 1 to 5) {
    +      val sample = data.takeSample(withReplacement = true, n, seed)
    +      assert(sample.count === n) // Got exactly n elements
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    for (seed <- 1 to 5) {
    +      val sample = data.takeSample(withReplacement = true, 2 * n, seed)
    +      assert(sample.count === 2 * n) // Got exactly 2 * n elements
    +      // Chance of getting all distinct elements is still quite low, so test we got < n
    +      assert(sample.distinct.count < n, "sampling with replacement returned all distinct elements")
    +    }
    +    {
    +      val emptySet = sparkContext.parallelize(Seq.empty[Int], 2)
    +      val sample = emptySet.takeSample(false, 20, 1)
    +      assert(sample.length === 0)
    +    }
    --- End diff --
    
    (I think we might not need an extra closure here and below)



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by burness <gi...@git.apache.org>.
Github user burness commented on the pull request:

    https://github.com/apache/spark/pull/13116#issuecomment-221513002
  
    @hvanhovell yeah, I agree with your opinion! Thanks



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13116#discussion_r63422967
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1494,6 +1493,56 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new [[Dataset]] by sampling a fixed number of rows, using a random seed.
    +   *
    +   * @param withReplacement Sample with replacement or not.
    +   * @param num Number of rows to generate.
    +   * @param seed Seed for sampling.
    +   * @return the sampled [[Dataset]]
    +   */
    +  def takeSample(withReplacement: Boolean, num: Int, seed: Long): Dataset[T] = {
    +    val numStDev = 10.0
    +    require(num >= 0, "Negative number of elements requested")
    +    require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    +      "Cannot support a sample size > Int.MaxValue - " +
    +        s"$numStDev * math.sqrt(Int.MaxValue)")
    +    if (num == 0) {
    +      sqlContext.createDataset[T](Seq())
    +    } else {
    +      val initialCount = this.count()
    +      if (initialCount == 0) {
    +        sqlContext.createDataset[T](Seq())
    +      } else {
    +        val rand = new Random(seed)
    +        if (!withReplacement && num >= initialCount) {
    +          sqlContext.createDataset[T](Utils.randomizeInPlace(this.collect(), rand))
    +        } else {
    +          val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
    +            withReplacement)
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var numIters = 0
    +          while (samples.length < num) {
    --- End diff --
    
    Infinite loop if the user requests more rows than in the dataset.
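    
    For illustration, a minimal sketch of one way to guard this loop (hypothetical, not this PR's code; `sampleAtLeast` and `maxAttempts` are invented names): cap the retries so a persistently short sample fails fast instead of looping forever.
    
    ```scala
    import java.util.Random
    
    import org.apache.spark.sql.Dataset
    
    // Retry the sample at most maxAttempts times; fail loudly rather than spin.
    def sampleAtLeast[T](ds: Dataset[T], withReplacement: Boolean, fraction: Double,
        num: Int, seed: Long, maxAttempts: Int = 10): Array[T] = {
      val rand = new Random(seed)
      var samples = ds.sample(withReplacement, fraction, rand.nextInt()).collect()
      var attempts = 1
      while (samples.length < num && attempts < maxAttempts) {
        samples = ds.sample(withReplacement, fraction, rand.nextInt()).collect()
        attempts += 1
      }
      require(samples.length >= num, s"could not draw $num rows after $attempts attempts")
      samples
    }
    ```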



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by burness <gi...@git.apache.org>.
Github user burness commented on the pull request:

    https://github.com/apache/spark/pull/13116#issuecomment-220838511
  
    @hvanhovell  Thank you for the code review. In my project I need to sample a specified number of rows, but DataFrame and Dataset only support sampling by a fraction. I have spent the last few days on this, but it is not easy for me to resolve the serious potential memory and performance problems (beyond adding a `require` that the sample size must not be larger than the dataset's size).



[GitHub] spark issue #13116: [SPARK-15324] [SQL] Add the takeSample function to the D...

Posted by burness <gi...@git.apache.org>.
Github user burness commented on the issue:

    https://github.com/apache/spark/pull/13116
  
    @HyukjinKwon  Ok, I will close this now 😆



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13116#discussion_r63274238
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -402,6 +402,76 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           3, 17, 27, 58, 62)
       }
     
    +  test("takeSample") {
    +    val n = 1000
    +    val data = sparkContext.parallelize(1 to n, 2).toDS()
    +    for (num <- List(0, 5, 20, 100)) {
    +      val sample = data.takeSample(withReplacement = false, num = num)
    +      assert(sample.count === num) // Got exactly num elements
    +      assert(sample.distinct.count === num) // Elements are distinct
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    for (seed <- 1 to 5) {
    +      val sample = data.takeSample(withReplacement = false, 20, seed)
    +      assert(sample.count() === 20) // Got exactly 20 elements
    +      assert(sample.distinct.count === 20) // Elements are distinct
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    for (seed <- 1 to 5) {
    +      val sample = data.takeSample(withReplacement = false, 100, seed)
    +      assert(sample.count === 100) // Got exactly 100 elements
    +      assert(sample.distinct.count === 100) // Elements are distinct
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    for (seed <- 1 to 5) {
    +      val sample = data.takeSample(withReplacement = true, 20, seed)
    +      assert(sample.count === 20) // Got exactly 20 elements
    +      val sampleData = sample.collect()
    +      assert(sampleData.forall(x => 1 <= x && x <= n), s"element not in [1, $n]")
    +    }
    +    {
    +      val sample = data.takeSample(withReplacement = true, num = 20)
    +      assert(sample.count === 20) // Got exactly 20 elements
    +    val sampleDisCount = sample.distinct.count
    --- End diff --
    
    (It seems indentation is not consistent here.)



[GitHub] spark issue #13116: [SPARK-15324] [SQL] Add the takeSample function to the D...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/13116
  
    Ah, that's fine. I just wondered whether this is in progress and whether we should maybe close it for now.



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13116#discussion_r63423222
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1494,6 +1493,56 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new [[Dataset]] by sampling a fixed number of rows, using a random seed.
    +   *
    +   * @param withReplacement Sample with replacement or not.
    +   * @param num Number of rows to generate.
    +   * @param seed Seed for sampling.
    +   * @return the sampled [[Dataset]]
    +   */
    +  def takeSample(withReplacement: Boolean, num: Int, seed: Long): Dataset[T] = {
    +    val numStDev = 10.0
    +    require(num >= 0, "Negative number of elements requested")
    +    require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    +      "Cannot support a sample size > Int.MaxValue - " +
    +        s"$numStDev * math.sqrt(Int.MaxValue)")
    +    if (num == 0) {
    +      sqlContext.createDataset[T](Seq())
    +    } else {
    +      val initialCount = this.count()
    +      if (initialCount == 0) {
    +        sqlContext.createDataset[T](Seq())
    +      } else {
    +        val rand = new Random(seed)
    +        if (!withReplacement && num >= initialCount) {
    +          sqlContext.createDataset[T](Utils.randomizeInPlace(this.collect(), rand))
    +        } else {
    +          val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
    +            withReplacement)
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var numIters = 0
    +          while (samples.length < num) {
    --- End diff --
    
    `count` is a non-trivial action.
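    
    For context, a hedged sketch of one possible mitigation (hypothetical, not this PR's code; `ds` stands for the Dataset being sampled): estimate the size with a time-bounded approximate count instead of a full count.
    
    ```scala
    // countApprox returns after the given timeout with a confidence-bounded
    // estimate, trading exactness for latency.
    val estimate = ds.rdd.countApprox(timeout = 1000L, confidence = 0.95)
    val approxCount: Long = estimate.initialValue.mean.toLong
    ```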



[GitHub] spark issue #13116: [SPARK-15324] [SQL] Add the takeSample function to the D...

Posted by burness <gi...@git.apache.org>.
Github user burness commented on the issue:

    https://github.com/apache/spark/pull/13116
  
    @HyukjinKwon  It is too hard to solve the OOM issue, I'm so sorry.



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13116#discussion_r63274228
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -18,6 +18,9 @@
     package org.apache.spark.sql
     
     import java.io.CharArrayWriter
    +import java.util.Random
    +
    +import org.apache.spark.util.random.SamplingUtils
    --- End diff --
    
    (it seems we need to reorder imports, https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports)



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13116#discussion_r64160441
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -51,6 +49,7 @@ import org.apache.spark.sql.execution.python.EvaluatePython
     import org.apache.spark.sql.types._
     import org.apache.spark.storage.StorageLevel
     import org.apache.spark.util.Utils
    +import org.apache.spark.util.random.SamplingUtils
    --- End diff --
    
    I think the imports could be ordered as below:
    
    ```scala
    
    import java.io.CharArrayWriter
    import java.util.Random
    
    import scala.collection.JavaConverters._
    import scala.language.implicitConversions
    import scala.reflect.runtime.universe.TypeTag
    import scala.util.control.NonFatal
    
    import com.fasterxml.jackson.core.JsonFactory
    import org.apache.commons.lang3.StringUtils
    
    import org.apache.spark.annotation.{DeveloperApi, Experimental}
    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.api.java.function._
    import org.apache.spark.api.python.PythonRDD
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst._
    import org.apache.spark.sql.catalyst.analysis._
    import org.apache.spark.sql.catalyst.encoders._
    import org.apache.spark.sql.catalyst.expressions._
    import org.apache.spark.sql.catalyst.expressions.aggregate._
    import org.apache.spark.sql.catalyst.expressions.objects.Invoke
    import org.apache.spark.sql.catalyst.optimizer.CombineUnions
    import org.apache.spark.sql.catalyst.plans._
    import org.apache.spark.sql.catalyst.plans.logical._
    import org.apache.spark.sql.catalyst.util.usePrettyExpression
    import org.apache.spark.sql.execution.{FileRelation, LogicalRDD, QueryExecution, SQLExecution}
    import org.apache.spark.sql.execution.command.ExplainCommand
    import org.apache.spark.sql.execution.datasources.{CreateTableUsingAsSelect, LogicalRelation}
    import org.apache.spark.sql.execution.datasources.json.JacksonGenerator
    import org.apache.spark.sql.execution.python.EvaluatePython
    import org.apache.spark.sql.types._
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.util.Utils
    import org.apache.spark.util.random.SamplingUtils
    ```



[GitHub] spark issue #13116: [SPARK-15324] [SQL] Add the takeSample function to the D...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/13116
  
    Hi @burness, what's the state of this PR?



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13116#discussion_r63421664
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1494,6 +1493,56 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new [[Dataset]] by sampling a fixed number of rows, using a random seed.
    +   *
    +   * @param withReplacement Sample with replacement or not.
    +   * @param num Number of rows to generate.
    +   * @param seed Seed for sampling.
    +   * @return the sampled [[Dataset]]
    +   */
    +  def takeSample(withReplacement: Boolean, num: Int, seed: Long): Dataset[T] = {
    +    val numStDev = 10.0
    +    require(num >= 0, "Negative number of elements requested")
    +    require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    +      "Cannot support a sample size > Int.MaxValue - " +
    +        s"$numStDev * math.sqrt(Int.MaxValue)")
    +    if (num == 0) {
    +      sqlContext.createDataset[T](Seq())
    +    } else {
    +      val initialCount = this.count()
    +      if (initialCount == 0) {
    +        sqlContext.createDataset[T](Seq())
    +      } else {
    +        val rand = new Random(seed)
    +        if (!withReplacement && num >= initialCount) {
    +          sqlContext.createDataset[T](Utils.randomizeInPlace(this.collect(), rand))
    --- End diff --
    
    This will cause memory problems (OOME) in the driver in case of a non-trivial dataset.
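    
    As a hedged aside (hypothetical, not this PR's code; `ds` and `seed` stand in for the values in the surrounding diff): for the `!withReplacement && num >= initialCount` case, the same result — every row in random order — can stay distributed:
    
    ```scala
    import org.apache.spark.sql.functions.rand
    
    // All rows, shuffled inside the engine; nothing is collected to the driver.
    val shuffled = ds.orderBy(rand(seed))
    ```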



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13116#issuecomment-219213307
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13116#discussion_r63423262
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1494,6 +1493,56 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new [[Dataset]] by sampling a fixed number of rows, using a random seed.
    +   *
    +   * @param withReplacement Sample with replacement or not.
    +   * @param num Number of rows to generate.
    +   * @param seed Seed for sampling.
    +   * @return the sampled [[Dataset]]
    +   */
    +  def takeSample(withReplacement: Boolean, num: Int, seed: Long): Dataset[T] = {
    +    val numStDev = 10.0
    +    require(num >= 0, "Negative number of elements requested")
    +    require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    +      "Cannot support a sample size > Int.MaxValue - " +
    +        s"$numStDev * math.sqrt(Int.MaxValue)")
    +    if (num == 0) {
    +      sqlContext.createDataset[T](Seq())
    +    } else {
    +      val initialCount = this.count()
    +      if (initialCount == 0) {
    +        sqlContext.createDataset[T](Seq())
    +      } else {
    +        val rand = new Random(seed)
    +        if (!withReplacement && num >= initialCount) {
    +          sqlContext.createDataset[T](Utils.randomizeInPlace(this.collect(), rand))
    +        } else {
    +          val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
    +            withReplacement)
    +          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +          var numIters = 0
    +          while (samples.length < num) {
    +            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    +            numIters += 1
    +          }
    +          sqlContext.createDataset[T](Utils.randomizeInPlace(samples, rand).take(num))
    --- End diff --
    
    potential OOME
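    
    For illustration, a hedged sketch of an engine-side alternative (hypothetical, not this PR's code; `takeSampleDistributed` is an invented name): randomize order with a rand() sort and truncate with limit, so the shuffle stays distributed instead of materializing on the driver as this line does.
    
    ```scala
    import org.apache.spark.sql.Dataset
    import org.apache.spark.sql.functions.rand
    
    // Sample, shuffle, and cap entirely inside the query plan.
    def takeSampleDistributed[T](ds: Dataset[T], withReplacement: Boolean,
        fraction: Double, num: Int, seed: Long): Dataset[T] =
      ds.sample(withReplacement, fraction, seed).orderBy(rand(seed)).limit(num)
    ```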



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/13116#issuecomment-220857549
  
    @burness that is no problem. I have taken a look at the current implementation of the `RDD`'s `takeSample`; it is very similar to your implementation. I do still think that implementation has problems, and I am not sure we should add it to the `Dataset`/`DataFrame` API.
    
    You can always use the dataset's underlying RDD and its `takeSample` method, i.e.: `spark.createDataset(dataset.rdd.takeSample(...))`.
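    
    For completeness, a self-contained sketch of that workaround (illustrative only; the app name, sample size, and seed are invented, and a local `SparkSession` is assumed):
    
    ```scala
    import org.apache.spark.sql.{Dataset, SparkSession}
    
    val spark = SparkSession.builder().master("local[*]").appName("takeSampleWorkaround").getOrCreate()
    import spark.implicits._
    
    val ds: Dataset[Int] = spark.createDataset(1 to 1000)
    
    // RDD.takeSample returns an Array on the driver, which is wrapped back
    // into a Dataset; this is only reasonable for small sample sizes.
    val sampled: Dataset[Int] = spark.createDataset(
      ds.rdd.takeSample(withReplacement = false, num = 20, seed = 42L))
    ```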



[GitHub] spark pull request: [SPARK-15324] [SQL] Add the takeSample functio...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13116#discussion_r63421995
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1494,6 +1493,56 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new [[Dataset]] by sampling a fixed num of rows, using a random seed.
    +   *
    +   * @param withReplacement Sample with replacement or not.
    +   * @param num num of rows to generate.
    +   * @param seed Seed for sampling
    +   * @return the sampling [[Dataset]]
    +   */
    +  def takeSample(withReplacement: Boolean, num: Int, seed: Long): Dataset[T] = {
    +    val numStDev = 10.0
    +    require(num >= 0, "Negative number of elements requested")
    +    require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    +      "Cannot support a sample size > Int.MaxValue - " +
    +        s"$numStDev * math.sqrt(Int.MaxValue)")
    +    if (num == 0) {
    +      sqlContext.createDataset[T](Seq())
    +    } else {
    +      val initialCount = this.count()
    --- End diff --
    
    This can be a non-trivial call. It will force us to evaluate a significant part of the query.

