Posted to issues@spark.apache.org by "Zhenhua Wang (JIRA)" <ji...@apache.org> on 2016/09/08 01:06:20 UTC

[jira] [Comment Edited] (SPARK-16026) Cost-based Optimizer framework

    [ https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472323#comment-15472323 ] 

Zhenhua Wang edited comment on SPARK-16026 at 9/8/16 1:06 AM:
--------------------------------------------------------------

1. It's a good point and worth a try, even without phase 2.
We should note, however, that the search space becomes larger for this adapted algorithm, which makes join reordering more time-consuming.
One option is to provide several algorithms for join reordering and leave it to users/DBAs to decide which one to use. Alternatively, we could switch the algorithm internally based on the number of tables to enumerate, e.g. when the number of tables is below some threshold, use the algorithm with the larger search space (see the first sketch below this list).
2. Agree with [~ron8hu].
3. For a condition C1 (on attribute a) && C2 (on attribute b), when estimating C2 we assume b is uniformly distributed with respect to a. The cardinality estimate becomes unreasonable only when that assumption is badly violated, i.e. when a and b are strongly correlated. In that case we need multi-dimensional histograms to improve accuracy (the second sketch below illustrates how the independence assumption breaks down).
On the other hand, defining a reasonable minimum selectivity is difficult; do you have any suggestions?
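
A minimal sketch of the threshold switch from point 1, in Scala. Everything here is hypothetical for illustration: the Relation/Plan classes, the fixed 0.01 join selectivity, and the default threshold are made up, and none of this is Spark's actual join-reordering API.

{code:scala}
// Illustrative sketch only: a threshold-based switch between an exhaustive
// join enumerator and a greedy heuristic. All names and constants are
// hypothetical, not Spark APIs.
object JoinReorderSketch {

  case class Relation(name: String, rows: Long)

  sealed trait Plan { def rows: Long }
  case class Scan(r: Relation) extends Plan { def rows: Long = r.rows }
  case class Join(left: Plan, right: Plan) extends Plan {
    // Toy cardinality model: assume a fixed join selectivity of 0.01.
    def rows: Long = math.max(1L, (left.rows * right.rows * 0.01).toLong)
  }

  // Total intermediate-result size, used as the plan cost.
  def cost(p: Plan): Long = p match {
    case Scan(_)        => 0L
    case j @ Join(l, r) => cost(l) + cost(r) + j.rows
  }

  // Exhaustive left-deep enumeration: O(n!) orders, accurate but slow.
  def exhaustive(rels: Seq[Relation]): Plan =
    rels.permutations
      .map(p => p.map(Scan).reduceLeft[Plan](Join))
      .minBy(cost)

  // Greedy heuristic: repeatedly join the pair with the smallest estimated
  // output. Polynomial work instead of factorial, but can miss the optimum.
  def greedy(rels: Seq[Relation]): Plan = {
    var plans: Vector[Plan] = rels.map(Scan).toVector
    while (plans.size > 1) {
      val pairs = for (i <- plans.indices; j <- plans.indices if i < j)
        yield (i, j, Join(plans(i), plans(j)))
      val (i, j, best) = pairs.minBy(_._3.rows)
      plans = plans.zipWithIndex.collect {
        case (p, k) if k != i && k != j => p
      } :+ best
    }
    plans.head
  }

  // The switch from point 1: below the threshold we can afford the larger
  // search space; above it, fall back to the cheap heuristic.
  def reorder(rels: Seq[Relation], threshold: Int = 8): Plan =
    if (rels.size <= threshold) exhaustive(rels) else greedy(rels)

  def main(args: Array[String]): Unit = {
    val rels = Seq(Relation("a", 1000000L), Relation("b", 100L), Relation("c", 5000L))
    println(reorder(rels))
  }
}
{code}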
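And a small sketch of the estimate discussed in point 3: under the uniformity/independence assumption, the selectivity of C1 && C2 is the product of the individual selectivities, which underestimates badly when the attributes are correlated. The ColumnStats shape and the 0.01 floor are hypothetical, not a recommended minimum.

{code:scala}
// Illustrative sketch of the estimate in point 3. The minSel floor below is
// a hypothetical guard, not a recommendation; as noted above, picking a
// principled minimum selectivity is the hard part.
object SelectivitySketch {

  // Per-column stats a catalog might expose (numbers are made up).
  case class ColumnStats(distinctCount: Long)

  // Equality-predicate selectivity under uniformity: 1 / distinctCount.
  def equalitySelectivity(stats: ColumnStats): Double =
    1.0 / stats.distinctCount

  // Independence assumption: multiply. This is exactly what goes wrong when
  // a and b are strongly correlated (e.g. city and zip code): the true
  // combined selectivity stays near a single predicate's, but the product
  // underestimates it badly.
  def conjunctionSelectivity(sels: Seq[Double], minSel: Double = 1e-2): Double =
    math.max(sels.product, minSel)

  def main(args: Array[String]): Unit = {
    val a = ColumnStats(distinctCount = 100)   // C1 on attribute a
    val b = ColumnStats(distinctCount = 1000)  // C2 on attribute b
    val sel = conjunctionSelectivity(
      Seq(equalitySelectivity(a), equalitySelectivity(b)))
    println(f"estimated selectivity = $sel%.6f") // product 1e-5, floored at 0.01
  }
}
{code}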


> Cost-based Optimizer framework
> ------------------------------
>
>                 Key: SPARK-16026
>                 URL: https://issues.apache.org/jira/browse/SPARK-16026
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>         Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework beyond broadcast join selection. This framework can be used to implement some useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller logical units. For example, changes to the statistics class, the system catalog, and cost estimation/propagation in expressions and in operators can be done in decoupled pull requests.
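
As a hedged illustration of the decomposition mentioned in the description above (a statistics class plus per-operator estimation/propagation), here is one simplified shape it could take. The class names and the Filter rule are illustrative only; they may differ from both Spark's actual Statistics class and the attached design spec.

{code:scala}
// Illustrative only: one way statistics could be attached to operators and
// propagated bottom-up through the logical plan.
object StatsPropagationSketch {

  case class Statistics(rowCount: Long, sizeInBytes: Long)

  sealed trait LogicalPlan { def stats: Statistics }

  // Leaf node: statistics come straight from the catalog.
  case class TableScan(name: String, base: Statistics) extends LogicalPlan {
    def stats: Statistics = base
  }

  // Filter scales its child's statistics by the predicate's selectivity.
  case class Filter(selectivity: Double, child: LogicalPlan) extends LogicalPlan {
    def stats: Statistics = Statistics(
      rowCount    = math.max(1L, (child.stats.rowCount * selectivity).toLong),
      sizeInBytes = math.max(1L, (child.stats.sizeInBytes * selectivity).toLong)
    )
  }

  def main(args: Array[String]): Unit = {
    val scan = TableScan("t", Statistics(rowCount = 1000000L, sizeInBytes = 100L << 20))
    println(Filter(selectivity = 0.1, child = scan).stats)
  }
}
{code}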


