Posted to issues@spark.apache.org by "Srinath (JIRA)" <ji...@apache.org> on 2016/09/06 23:05:20 UTC

[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

    [ https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468891#comment-15468891 ] 

Srinath commented on SPARK-16026:
---------------------------------

I have a couple of comments/questions on the proposal.
Regarding the join reordering algorithm: 
One of the big wins we should be able to get is avoiding shuffles/broadcasts. If the costing and dynamic programming algorithm doesn't take into account exchange costs and output partitioning, we may produce some bad plans.
Here's an example. Suppose we start with completely unpartitioned tables A(a), B(b1, b2), C(c) and D(d), in increasing order of size, and assume none of them is small enough to broadcast. Suppose we want to optimize the following join:
(A join B on A.a = B.b1) join C on (B.b2 = C.c) join D on (B.b1 = D.d)
Since A, B, C and D are in increasing order of size and we try to minimize intermediate result size, we end up with the following “cheapest” plan (join order A-B-C-D):
{noformat}
Plan I
Join(B.b1 = D.d)
|-Exchange(b1)
|   Join(B.b2 = C.c)
|   |-Exchange(b2)
|   |   Join(A.a = B.b1)
|   |   |-Exchange(a)
|   |   |   A
|   |   |-Exchange(b1)
|   |       B
|   |-Exchange(c)
|       C
|-Exchange(d)
    D
{noformat}
Ignoring leaf node sizes, the cost according to the proposed model, i.e. the total intermediate data size, is size(AB) + size(ABC). This is also the size of intermediate data exchanged. But a better plan may be to join with D before C (i.e. join order A-B-D-C), because that avoids a re-shuffle:
{noformat}
Plan II
Join(B.b2 = C.c)
|-Exchange(b2)
|   Join(B.b1 = D.d)
|   |-Join(A.a = B.b1)
|   |   |-Exchange(a)
|   |   |   A
|   |   |-Exchange(b1)
|   |       B
|   |-Exchange(d)
|       D
|-Exchange(c)
    C
{noformat}
The cost of this plan, i.e. the total intermediate data size, is size(AB) + size(ABD), which is higher than Plan I's. But the size of intermediate data exchanged is size(ABD), which may be lower than the size(AB) + size(ABC) of Plan I. This plan could be significantly faster as a result.
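To make the comparison concrete, here it is with some assumed sizes (the numbers are purely illustrative):
{noformat}
Assume size(AB) = 100, size(ABC) = 150, size(ABD) = 180 (illustrative only)

           total intermediate data    intermediate data exchanged
Plan I     100 + 150 = 250            100 + 150 = 250
Plan II    100 + 180 = 280            180
{noformat}
With these numbers, Plan II materializes more intermediate data (280 vs 250) but shuffles much less of it (180 vs 250), which is precisely the case a size-only cost model misses.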

It should be relatively painless to incorporate partition-awareness into the dynamic programming proposal for cost-based join ordering, with a couple of tweaks (a rough sketch follows the list):
i) Take into account intermediate data exchanged, not just total intermediate data. For example, a simple starting point would be to use (exchanged-data, total-data) as the cost function, with a preference for the former (i.e. prefer lower exchanged data, and lower total data if the exchanged data is the same). You could certainly have a more complex model, though.
ii) Preserve (i.e. don't prune) partial plans based on output partitioning. For example, consider a partial plan involving A, B and C: A join B join C may have a different output partitioning than A join C join B. If ACB is more expensive but has an output partitioning scheme that is useful for further joins, it's worth preserving.
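To illustrate, here is a minimal Scala sketch of both tweaks. The Cost and PlanKey types are made up for this comment; they are not Spark classes:
{noformat}
// Illustrative only: these types are invented, not Spark internals.
case class Cost(exchanged: BigInt, total: BigInt)

object Cost {
  // Tweak (i): order costs lexicographically by (exchanged, total):
  // prefer less shuffled data, break ties on total intermediate data.
  implicit val ordering: Ordering[Cost] =
    Ordering.by((c: Cost) => (c.exchanged, c.total))
}

// Tweak (ii): memoize DP entries per (relation set, output partitioning),
// so a pricier plan with a useful partitioning is not pruned by a
// cheaper, differently partitioned one.
case class PlanKey(relations: Set[String], outputPartitioning: Seq[String])

import scala.collection.mutable

def record(best: mutable.Map[PlanKey, Cost], key: PlanKey, cost: Cost): Unit =
  // Keep only the cheapest plan per key; plans with different output
  // partitionings live under different keys and never prune each other.
  if (best.get(key).forall(old => Cost.ordering.lt(cost, old)))
    best(key) = cost
{noformat}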

Another question I have is regarding statistics: with separate analyze column / analyze table statements, it's possible for table-level and column-level statistics to reflect two different views of the data, leading to weird results and inconsistent cardinality estimates.
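For example (the FOR COLUMNS syntax below is the one proposed in the design spec, and the interleaved insert is hypothetical):
{noformat}
// Hypothetical scenario; column-stats syntax as proposed in the design spec.
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")                  // table-level: row count, size
// ... new rows are inserted into t here ...
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS a, b") // column-level: ndv, min/max
// The two sets of statistics now describe different snapshots of t,
// so e.g. ndv(a) could exceed the recorded row count.
{noformat}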

For filter factor, what are the default selectivities assumed? We may also want to cap the minimum selectivity, so that a conjunction like C1 && C2 && C3 doesn't lead to ridiculously low cardinality estimates.
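A sketch of the kind of cap I have in mind; the default and floor constants here are made up:
{noformat}
// Illustrative constants; the real defaults would come from the design spec.
val defaultEqualitySelectivity = 0.1   // assumed selectivity of one predicate
val minSelectivity = 0.001             // floor so long conjunctions don't go to ~0

def andSelectivity(sels: Seq[Double]): Double =
  math.max(sels.product, minSelectivity)

// Three default predicates: 0.1 * 0.1 * 0.1 = 0.001, clamped at the floor.
val sel = andSelectivity(Seq.fill(3)(defaultEqualitySelectivity))
{noformat}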

> Cost-based Optimizer framework
> ------------------------------
>
>                 Key: SPARK-16026
>                 URL: https://issues.apache.org/jira/browse/SPARK-16026
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>         Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework beyond broadcast join selection. This framework can be used to implement some useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller logical units. For example, changes to statistics class, system catalog, cost estimation/propagation in expressions, cost estimation/propagation in operators can be done in decoupled pull requests.


