You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/08/06 16:28:53 UTC

[GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/22008

    [SPARK-24928][SQL] Optimize cross join according to stats

    ## What changes were proposed in this pull request?
    
    The cartesian product of 2 RDDs perform a nested loop. This means that the iterator for the inner RDD is built as many times as the number of rows of the outer one. If the two RDDs have a very different size, the performance difference can be huge.
    
    As there is no way to know which is the best RDD to choose as outer one (since we don't know the sizes), this cannot be addressed at RDD level. Only a comment has been added to warn/help the user to be careful about how they write their code.
    
    The PR proposed to add an optimizer rule which uses statistics collected on tables in order to change the sides of the cartesian product so that the outer table is the smaller one.
    
    ## How was this patch tested?
    
    added test suite

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-24928

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22008.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22008
    
----
commit 1caf2567694c56cca019e6608609b81ac70deefa
Author: Marco Gaido <ma...@...>
Date:   2018-08-06T15:51:58Z

    [SPARK-24928][SQL] Optimize cross join according to stats

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    **[Test build #94365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94365/testReport)** for PR 22008 at commit [`1caf256`](https://github.com/apache/spark/commit/1caf2567694c56cca019e6608609b81ac70deefa).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait JoinHelper extends PredicateHelper `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    **[Test build #94378 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94378/testReport)** for PR 22008 at commit [`1caf256`](https://github.com/apache/spark/commit/1caf2567694c56cca019e6608609b81ac70deefa).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    **[Test build #94293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94293/testReport)** for PR 22008 at commit [`1caf256`](https://github.com/apache/spark/commit/1caf2567694c56cca019e6608609b81ac70deefa).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1906/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22008#discussion_r208352309
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +153,45 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * Swaps right and left logical plans of a join when left is bigger than right. This is useful
    + * because underlying cartesian product performs a nested loop, thus if the outer table is
    + * smaller there are less iterator initialization.
    --- End diff --
    
    I think this only makes sense when building left iterator and right iterator are the same cost. When building right iterator is less costly than building left, swapping them might be performance regression.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    **[Test build #94365 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94365/testReport)** for PR 22008 at commit [`1caf256`](https://github.com/apache/spark/commit/1caf2567694c56cca019e6608609b81ac70deefa).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22008#discussion_r208802495
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -158,8 +158,9 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
           ConvertToLocalRelation,
           PropagateEmptyRelation) :+
         // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
    -    Batch("Check Cartesian Products", Once,
    -      CheckCartesianProducts) :+
    +    Batch("Check and Optimize Cartesian Products", Once,
    +      CheckCartesianProducts,
    +      ReorderCrossJoinOperands) :+
    --- End diff --
    
    IMO this optimization should be located in `CostBasedJoinReorder` or `ReorderJoin`. Why is this an independent rule?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1853/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    cc @wzhfy 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22008#discussion_r208353576
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -158,8 +158,9 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
           ConvertToLocalRelation,
           PropagateEmptyRelation) :+
         // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
    -    Batch("Check Cartesian Products", Once,
    -      CheckCartesianProducts) :+
    +    Batch("Check and Optimize Cartesian Products", Once,
    +      CheckCartesianProducts,
    +      ReorderCrossJoinOperands) :+
    --- End diff --
    
    Will reordering here break join order optimized by `CostBasedJoinReorder`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22008#discussion_r208372709
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -158,8 +158,9 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
           ConvertToLocalRelation,
           PropagateEmptyRelation) :+
         // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
    -    Batch("Check Cartesian Products", Once,
    -      CheckCartesianProducts) :+
    +    Batch("Check and Optimize Cartesian Products", Once,
    +      CheckCartesianProducts,
    +      ReorderCrossJoinOperands) :+
    --- End diff --
    
    This doesn't reorder the joins, but it just swaps the sides of a cartesian join, so there is no difference neither in the join order nor in any other aspect of what is the plan before or after it


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94365/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94293/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22008#discussion_r208831933
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
    @@ -158,8 +158,9 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
           ConvertToLocalRelation,
           PropagateEmptyRelation) :+
         // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
    -    Batch("Check Cartesian Products", Once,
    -      CheckCartesianProducts) :+
    +    Batch("Check and Optimize Cartesian Products", Once,
    +      CheckCartesianProducts,
    +      ReorderCrossJoinOperands) :+
    --- End diff --
    
    Because this doesn't reorder the Join order, but it just swaps the two sides of a cross join.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    **[Test build #94378 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94378/testReport)** for PR 22008 at commit [`1caf256`](https://github.com/apache/spark/commit/1caf2567694c56cca019e6608609b81ac70deefa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait JoinHelper extends PredicateHelper `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    **[Test build #94293 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94293/testReport)** for PR 22008 at commit [`1caf256`](https://github.com/apache/spark/commit/1caf2567694c56cca019e6608609b81ac70deefa).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait JoinHelper extends PredicateHelper `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22008#discussion_r208379788

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -152,3 +153,45 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
}
}
+
+/**
+ * Swaps right and left logical plans of a join when left is bigger than right. This is useful
+ * because underlying cartesian product performs a nested loop, thus if the outer table is
+ * smaller there are less iterator initialization.
--- End diff --

This is indeed an interesting point. I am not sure how/if we can measure the cost in the creation of the involved iterator and the cost of creating it.

Anyway, actually this will optimize not only the initialization cost for the iterator, but also the overall number of record read/processed. Let's take an example. Imagine that we have a table A with 10M record and a table B with 100 records. The total number of record retrieved is:

- if A is the left table, we process: 10M (all the records from A) + 100 * 10M (all the records from B for every record from A) = 101 * 10M
- if B is the left table, we process: 100 (all the records from B) + 100 * 10M (all the records from A for each record from B) = ~ 100 * 10M

So in the second case we process size of A - size B less records (same applies to number of bytes read).

And there is another good point for the second option: ie. Spark is much better at computing/reading 10 times 10M records that 10M times 2 records as it can exploits its parallelism.

That said, your comment still applies, ie. there may be cases in which one side is very onerous despite is the one with less records involved. Do you have any suggestion about how to estimate this? Thanks.

---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1916/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22008#discussion_r208385422
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
    @@ -152,3 +153,45 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
           if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
       }
     }
    +
    +/**
    + * Swaps right and left logical plans of a join when left is bigger than right. This is useful
    + * because underlying cartesian product performs a nested loop, thus if the outer table is
    + * smaller there are less iterator initialization.
    --- End diff --
    
    I have no idea that we have a good way so far to estimate the effort of materializing elements in one RDD, especially before we materialize it. That is why I think this optimization of swapping cross join doesn't always introduce improvement but sometimes regression. Let us see if others have more ideas.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Actually the SQL cartesian product already uses an optimized (but unsafe version) of the cartesian RDD.  So this rule is not needed at all. We might still want tough to expose the optimized version in the RDD API too. I am closing this PR and eventually create a new one for the RDD API. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22008: [SPARK-24928][SQL] Optimize cross join according ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 closed the pull request at:

    https://github.com/apache/spark/pull/22008


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22008: [SPARK-24928][SQL] Optimize cross join according to stat...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22008
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org