Posted to reviews@spark.apache.org by kaka1992 <gi...@git.apache.org> on 2015/04/27 06:21:43 UTC

[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

GitHub user kaka1992 opened a pull request:

    https://github.com/apache/spark/pull/5711

    [SPARK-7156][SQL] add randomSplit to DataFrame.

    SPARK-7156 add randomSplit to DataFrame.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kaka1992/spark add_randomsplit_to_dataframe

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5711.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5711
    
----
commit e65939a8b47671d9a09c51b6ab18eb525e720a61
Author: 云峤 <ch...@alibaba-inc.com>
Date:   2015-04-27T04:18:08Z

    SPARK-7156 add randomSplit to DataFrame.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5711#issuecomment-96943758
  
    Thanks for working on this, @kaka1992. It would be great if we could do it in a way that doesn't break the existing logical plan for DataFrames.





[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5711#discussion_r29217167
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -17,14 +17,13 @@
     
     package org.apache.spark.sql
     
    -import scala.language.postfixOps
    --- End diff --
    
    Scala imports should come first.




[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

Posted by kaka1992 <gi...@git.apache.org>.
Github user kaka1992 closed the pull request at:

    https://github.com/apache/spark/pull/5711




[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5711#issuecomment-97273687
  
    https://github.com/apache/spark/pull/5761 
    
    Somebody else submitted a PR based on your change and my review feedback.





[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

Posted by kaka1992 <gi...@git.apache.org>.
Github user kaka1992 commented on the pull request:

    https://github.com/apache/spark/pull/5711#issuecomment-97054424
  
    Can I add an InMemoryRelation on top of the base logicalPlan? Then I could create several randomSplit plans over the same data. @rxin I'm not sure whether this would break something.




[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

Posted by kaka1992 <gi...@git.apache.org>.
Github user kaka1992 commented on the pull request:

    https://github.com/apache/spark/pull/5711#issuecomment-97399301
  
    @rxin No problem. I'll close the PR.




[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5711#discussion_r29217153
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -967,6 +969,23 @@ class DataFrame private[sql](
       }
     
       /**
    +   * Randomly splits this DataFrame with the provided weights.
    +   *
    +   * @param weights weights for splits, will be normalized if they don't sum to 1
    +   * @param seed random seed
    +   *
    +   * @return split DataFrames in an array
    +   */
    +  def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[DataFrame] = {
    +    val sum = weights.sum
    +    val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    +    normalizedCumWeights.sliding(2).map { x =>
    +      this.sqlContext.createDataFrame(new PartitionwiseSampledRDD[Row, Row](
    +        rdd, new BernoulliCellSampler[Row](x(0), x(1)), true, seed), schema)
    --- End diff --
    
    This actually breaks the plan. Can we create a logical operator (or generalize the existing Sample operator) so that the returned DataFrame correctly preserves the logical plan?
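
For readers following the diff above, the weight handling can be tried outside Spark. The following is a minimal sketch of the normalization step only, assuming that each split keeps a row when the row's random draw falls in the half-open range [lower, upper). The names SplitBounds, splitBounds and splitIndexFor are hypothetical and do not appear in the pull request.

```scala
// Hypothetical helper, not part of the pull request: it reproduces only the
// weight normalization from the diff, without any Spark dependency.
object SplitBounds {

  // Turn raw weights into adjacent [lower, upper) ranges that cover [0, 1].
  def splitBounds(weights: Array[Double]): Array[(Double, Double)] = {
    require(weights.nonEmpty, "at least one weight is required")
    require(weights.forall(_ >= 0.0), "weights must be non-negative")
    val sum = weights.sum
    require(sum > 0.0, "weights must not all be zero")
    // Same expression as in the diff: normalize, then accumulate from 0.0.
    val cumulative = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    // Each pair of neighbouring cumulative values bounds one split.
    cumulative.sliding(2).map(pair => (pair(0), pair(1))).toArray
  }

  // Assumption: a row belongs to the split whose range contains its draw.
  // Returns -1 if no range contains the draw.
  def splitIndexFor(draw: Double, bounds: Array[(Double, Double)]): Int =
    bounds.indexWhere { case (lower, upper) => draw >= lower && draw < upper }
}
```

Because the ranges are adjacent and do not overlap, a given draw lands in at most one split, which is why the diff reuses the same seed for every split. For example, the weights 1.0, 1.0 and 2.0 give the ranges [0, 0.25), [0.25, 0.5) and [0.5, 1.0).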




[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5711#issuecomment-96766503
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-7156][SQL] add randomSplit to DataFrame...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5711#issuecomment-97339105
  
    @kaka1992 Would you mind closing this PR, since https://github.com/apache/spark/pull/5761 subsumes it?

