You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by maropu <gi...@git.apache.org> on 2017/08/06 16:16:57 UTC

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

GitHub user maropu opened a pull request:

    https://github.com/apache/spark/pull/18861

    [SPARK-19426][SQL] Custom coalescer for Dataset

    ## What changes were proposed in this pull request?
    This pr added a new API for `coalesce` in `Dataset`; users can specify the custom coalescer which reduces an input Dataset into fewer partitions. This coalescer implementation is the same with the one in `RDD#coalesce` added in #11865 (SPARK-14042).
    
    This is the rework of #16766.
    
    ## How was this patch tested?
    Added tests in `DatasetSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark pr16766

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18861.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18861
    
----
commit 9abcd75474f188d7b9f4ed4caab6bb8e98c1da4d
Author: Marius van Niekerk <ma...@maxpoint.com>
Date:   2016-11-07T22:06:38Z

    Implement custom coalesce

commit bb7f9b0ab06ec0be8666e14846e19fe0c28e5143
Author: Takeshi Yamamuro <ya...@apache.org>
Date:   2017-08-06T14:44:51Z

    Add more fixes

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    @gatorsmile ok, fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131808426
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -746,8 +746,20 @@ abstract class RepartitionOperation extends UnaryNode {
      * [[RepartitionByExpression]] as this method is called directly by DataFrame's, because the user
      * asked for `coalesce` or `repartition`. [[RepartitionByExpression]] is used when the consumer
      * of the output requires some specific ordering or distribution of the data.
    + *
    + * If `shuffle` = false (`coalesce` cases), this logical plan can have an user-specified strategy
    + * to coalesce input partitions.
    + *
    + * @param numPartitions How many partitions to use in the output RDD
    + * @param shuffle Whether to shuffle when repartitioning
    + * @param child the LogicalPlan
    + * @param coalescer Optional coalescer that an user specifies
      */
    -case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
    +case class Repartition(
    +    numPartitions: Int,
    +    shuffle: Boolean,
    +    child: LogicalPlan,
    +    coalescer: Option[PartitionCoalescer] = None)
       extends RepartitionOperation {
       require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    --- End diff --
    
    ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80373/testReport)** for PR 18861 at commit [`d7392c4`](https://github.com/apache/spark/commit/d7392c447c6fbfcc8ba7256dc6fa17b709a3fe24).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80328/testReport)** for PR 18861 at commit [`3b4c679`](https://github.com/apache/spark/commit/3b4c679a0f16fcd85e1e11470eea250fd342e898).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class Repartition(`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    @maropu any update on this issue?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80373/testReport)** for PR 18861 at commit [`d7392c4`](https://github.com/apache/spark/commit/d7392c447c6fbfcc8ba7256dc6fa17b709a3fe24).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80327 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80327/testReport)** for PR 18861 at commit [`83ac85f`](https://github.com/apache/spark/commit/83ac85ffa04b738c1489422886ab406579f28be9).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class Repartition(`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80307/testReport)** for PR 18861 at commit [`253bdcc`](https://github.com/apache/spark/commit/253bdcc32ad67cda00223d6bca773253ad94d034).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80327/testReport)** for PR 18861 at commit [`83ac85f`](https://github.com/apache/spark/commit/83ac85ffa04b738c1489422886ab406579f28be9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80307/testReport)** for PR 18861 at commit [`253bdcc`](https://github.com/apache/spark/commit/253bdcc32ad67cda00223d6bca773253ad94d034).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)`
      * `case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131570565
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---
    @@ -571,7 +570,8 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan {
      * current upstream partitions will be executed in parallel (per whatever
      * the current partitioning is).
      */
    -case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
    +case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])
    --- End diff --
    
    ok!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80320/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80307/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131809065
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -596,14 +596,15 @@ object CollapseProject extends Rule[LogicalPlan] {
     object CollapseRepartition extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
         // Case 1: When a Repartition has a child of Repartition or RepartitionByExpression,
    -    // 1) When the top node does not enable the shuffle (i.e., coalesce API), but the child
    -    //   enables the shuffle. Returns the child node if the last numPartitions is bigger;
    -    //   otherwise, keep unchanged.
    +    // 1) When the top node does not enable the shuffle (i.e., coalesce with no user-specified
    +    //   strategy), but the child enables the shuffle. Returns the child node if the last
    +    //   numPartitions is bigger; otherwise, keep unchanged.
         // 2) In the other cases, returns the top node with the child's child
    -    case r @ Repartition(_, _, child: RepartitionOperation) => (r.shuffle, child.shuffle) match {
    -      case (false, true) => if (r.numPartitions >= child.numPartitions) child else r
    -      case _ => r.copy(child = child.child)
    -    }
    +    case r @ Repartition(_, _, child: RepartitionOperation, None) =>
    --- End diff --
    
    sorry, my bad. Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80328/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Yeah, maybe close it first. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80306 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80306/testReport)** for PR 18861 at commit [`bb7f9b0`](https://github.com/apache/spark/commit/bb7f9b0ab06ec0be8666e14846e19fe0c28e5143).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80321/testReport)** for PR 18861 at commit [`c0306d3`](https://github.com/apache/spark/commit/c0306d346e336a3bae6335e27f676c3254d915cb).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80306 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80306/testReport)** for PR 18861 at commit [`bb7f9b0`](https://github.com/apache/spark/commit/bb7f9b0ab06ec0be8666e14846e19fe0c28e5143).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)`
      * `case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80320/testReport)** for PR 18861 at commit [`413b0eb`](https://github.com/apache/spark/commit/413b0eb55659d31cd21fbc1c858d3da1603d2248).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80321/testReport)** for PR 18861 at commit [`c0306d3`](https://github.com/apache/spark/commit/c0306d346e336a3bae6335e27f676c3254d915cb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80306/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131571620
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -753,6 +753,16 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
     }
     
     /**
    + * Returns a new RDD that has at most `numPartitions` partitions. This behavior can be modified by
    + * supplying a `PartitionCoalescer` to control the behavior of the partitioning.
    + */
    +case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)
    +  extends UnaryNode {
    --- End diff --
    
    yea, I think so. I'll try and plz give me days to do so.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    ok, thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    @gatorsmile ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r132881491
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -596,14 +596,17 @@ object CollapseProject extends Rule[LogicalPlan] {
     object CollapseRepartition extends Rule[LogicalPlan] {
    --- End diff --
    
    Please also add new test cases to `CollapseRepartitionSuite` for the changes in this rule.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80320/testReport)** for PR 18861 at commit [`413b0eb`](https://github.com/apache/spark/commit/413b0eb55659d31cd21fbc1c858d3da1603d2248).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131570851
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends PartitionCoalescer with Seria
           totalSum += splitSize
         }
     
    -    while (index < partitions.size) {
    +    while (index < partitions.length) {
           val partition = partitions(index)
    -      val fileSplit =
    -        partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
    -      val splitSize = fileSplit.getLength
    +      val splitSize = getPartitionSize(partition)
           if (currentSum + splitSize < maxSize) {
             addPartition(partition, splitSize)
             index += 1
    -        if (index == partitions.size) {
    -          updateGroups
    +        if (index == partitions.length) {
    +          updateGroups()
             }
           } else {
    -        if (currentGroup.partitions.size == 0) {
    +        if (currentGroup.partitions.isEmpty) {
               addPartition(partition, splitSize)
               index += 1
             } else {
    -          updateGroups
    +          updateGroups()
    --- End diff --
    
    All the above changes are not related to this PR, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131571449
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends PartitionCoalescer with Seria
           totalSum += splitSize
         }
     
    -    while (index < partitions.size) {
    +    while (index < partitions.length) {
           val partition = partitions(index)
    -      val fileSplit =
    -        partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
    -      val splitSize = fileSplit.getLength
    +      val splitSize = getPartitionSize(partition)
           if (currentSum + splitSize < maxSize) {
             addPartition(partition, splitSize)
             index += 1
    -        if (index == partitions.size) {
    -          updateGroups
    +        if (index == partitions.length) {
    +          updateGroups()
             }
           } else {
    -        if (currentGroup.partitions.size == 0) {
    +        if (currentGroup.partitions.isEmpty) {
               addPartition(partition, splitSize)
               index += 1
             } else {
    -          updateGroups
    +          updateGroups()
    --- End diff --
    
    I am fine about this, but it might confuse the others. Maybe just remove them in this PR? You can submit a separate PR later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r132881614
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -596,14 +596,17 @@ object CollapseProject extends Rule[LogicalPlan] {
     object CollapseRepartition extends Rule[LogicalPlan] {
    --- End diff --
    
    ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131766593
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -746,8 +746,20 @@ abstract class RepartitionOperation extends UnaryNode {
      * [[RepartitionByExpression]] as this method is called directly by DataFrame's, because the user
      * asked for `coalesce` or `repartition`. [[RepartitionByExpression]] is used when the consumer
      * of the output requires some specific ordering or distribution of the data.
    + *
    + * If `shuffle` = false (`coalesce` cases), this logical plan can have an user-specified strategy
    + * to coalesce input partitions.
    + *
    + * @param numPartitions How many partitions to use in the output RDD
    + * @param shuffle Whether to shuffle when repartitioning
    + * @param child the LogicalPlan
    + * @param coalescer Optional coalescer that an user specifies
      */
    -case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
    +case class Repartition(
    +    numPartitions: Int,
    +    shuffle: Boolean,
    +    child: LogicalPlan,
    +    coalescer: Option[PartitionCoalescer] = None)
       extends RepartitionOperation {
       require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    --- End diff --
    
    Add a new require here? 
    ```
    require(!shuffle || coalescer.isEmpty, "Custom coalescer is not allowed for repartition(shuffle=true)")
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80373/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80321/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu closed the pull request at:

    https://github.com/apache/spark/pull/18861


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80327/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131768761
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -596,14 +596,15 @@ object CollapseProject extends Rule[LogicalPlan] {
     object CollapseRepartition extends Rule[LogicalPlan] {
       def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
         // Case 1: When a Repartition has a child of Repartition or RepartitionByExpression,
    -    // 1) When the top node does not enable the shuffle (i.e., coalesce API), but the child
    -    //   enables the shuffle. Returns the child node if the last numPartitions is bigger;
    -    //   otherwise, keep unchanged.
    +    // 1) When the top node does not enable the shuffle (i.e., coalesce with no user-specified
    +    //   strategy), but the child enables the shuffle. Returns the child node if the last
    +    //   numPartitions is bigger; otherwise, keep unchanged.
         // 2) In the other cases, returns the top node with the child's child
    -    case r @ Repartition(_, _, child: RepartitionOperation) => (r.shuffle, child.shuffle) match {
    -      case (false, true) => if (r.numPartitions >= child.numPartitions) child else r
    -      case _ => r.copy(child = child.child)
    -    }
    +    case r @ Repartition(_, _, child: RepartitionOperation, None) =>
    --- End diff --
    
    ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131570879
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -753,6 +753,16 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
     }
     
     /**
    + * Returns a new RDD that has at most `numPartitions` partitions. This behavior can be modified by
    + * supplying a `PartitionCoalescer` to control the behavior of the partitioning.
    + */
    +case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)
    +  extends UnaryNode {
    --- End diff --
    
    Adding new logical nodes also needs the updates in multiple different components. (e.g., Optimizer). 
    
    Is that possible to reuse the existing node `Repartition`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    **[Test build #80328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80328/testReport)** for PR 18861 at commit [`3b4c679`](https://github.com/apache/spark/commit/3b4c679a0f16fcd85e1e11470eea250fd342e898).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131570472
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---
    @@ -571,7 +570,8 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan {
      * current upstream partitions will be executed in parallel (per whatever
      * the current partitioning is).
      */
    -case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
    +case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])
    --- End diff --
    
    Could you add the parm description of `coalescer`? also update function descriptions? Thanks~!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    @gatorsmile ping


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131766797
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2662,6 +2662,30 @@ class Dataset[T] private[sql](
       }
     
       /**
    +   * Returns a new Dataset that an user-defined `PartitionCoalescer` reduces into fewer partitions.
    +   * `userDefinedCoalescer` is the same with a coalescer used in the `RDD` coalesce function.
    +   *
    +   * If a larger number of partitions is requested, it will stay at the current
    +   * number of partitions. Similar to coalesce defined on an `RDD`, this operation results in
    +   * a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not
    +   * be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
    +   *
    +   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
    +   * this may result in your computation taking place on fewer nodes than
    +   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
    +   * you can call repartition. This will add a shuffle step, but means the
    +   * current upstream partitions will be executed in parallel (per whatever
    +   * the current partitioning is).
    +   *
    +   * @group typedrel
    +   * @since 2.3.0
    +   */
    +  def coalesce(numPartitions: Int, userDefinedCoalescer: Option[PartitionCoalescer])
    +    : Dataset[T] = withTypedPlan {
    --- End diff --
    
    ```Scala
    def coalesce(
        numPartitions: Int,
        userDefinedCoalescer: Option[PartitionCoalescer]): Dataset[T] = withTypedPlan {
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131571547
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends PartitionCoalescer with Seria
           totalSum += splitSize
         }
     
    -    while (index < partitions.size) {
    +    while (index < partitions.length) {
           val partition = partitions(index)
    -      val fileSplit =
    -        partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
    -      val splitSize = fileSplit.getLength
    +      val splitSize = getPartitionSize(partition)
           if (currentSum + splitSize < maxSize) {
             addPartition(partition, splitSize)
             index += 1
    -        if (index == partitions.size) {
    -          updateGroups
    +        if (index == partitions.length) {
    +          updateGroups()
             }
           } else {
    -        if (currentGroup.partitions.size == 0) {
    +        if (currentGroup.partitions.isEmpty) {
               addPartition(partition, splitSize)
               index += 1
             } else {
    -          updateGroups
    +          updateGroups()
    --- End diff --
    
    ok, I'll drop these from this pr.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18861#discussion_r131571248
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends PartitionCoalescer with Seria
           totalSum += splitSize
         }
     
    -    while (index < partitions.size) {
    +    while (index < partitions.length) {
           val partition = partitions(index)
    -      val fileSplit =
    -        partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
    -      val splitSize = fileSplit.getLength
    +      val splitSize = getPartitionSize(partition)
           if (currentSum + splitSize < maxSize) {
             addPartition(partition, splitSize)
             index += 1
    -        if (index == partitions.size) {
    -          updateGroups
    +        if (index == partitions.length) {
    +          updateGroups()
             }
           } else {
    -        if (currentGroup.partitions.size == 0) {
    +        if (currentGroup.partitions.isEmpty) {
               addPartition(partition, splitSize)
               index += 1
             } else {
    -          updateGroups
    +          updateGroups()
    --- End diff --
    
    Yea, I just left the changes of the original author (probably refactoring stuffs?) ..., better remove this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    @gatorsmile better to close this pr and the jira?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    @gatorsmile ok, could you check again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18861
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org