You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Jinfeng Ni <jn...@maprtech.com> on 2015/05/09 02:11:34 UTC

Review Request 34006: DRILL-2958: Move Drill to alternative cost-based planner for Join planning

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34006/
-----------------------------------------------------------

Review request for drill and Aman Sinha.


Repository: drill-git


Description
-------

Drill current use VolcanoPlanner in join planning. This planner has two known issues:

1. The search space is increased exponentially with increased # of tables joined. If query has more than > 10 tables join, the planning time itself could be minutes, if not longer.

2. Drill did not enable a rule to swap both sides of join, due to the search space problem. We only do a swap join afterwards. See DRILL-2236. This means the join order chosen by Drill's VolcanoPlanner might not be optimal.

To address the above two issues, we are going to provide another planner for the purpose of join ordering planning. This planner will use a different optimization rules, and the search space is not increased exponentially with # of table. 

The main logic of this new planner:
1) Let VolcanoPlanner do all the rule transformations same as the current planner's logical planning, except for the join permutation rule.
2) After that, pass to HepPlanner with Calcite LOPT optimization rule, to let it do the join ordering. Feed with the HepPlanner with Drill's RelMetaDataProvider, to leverage the statistics (rowcount) available in Drill's table/files. 
3) Continue with the same physical planning as before.

With the limited statistics available in Drill, the new planner seems to produce better query plan than the current, for several TPCH queries. 

Preliminary performance results show this planner run faster than the existing one, and the join plan seems to be same or better than the plan chosen by the existing planner. 

Will update more in detail about the comparison.


Diffs
-----

  exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillJoinRelBase.java 5ab416c 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillProjectRelBase.java 42ef6ac 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillDefaultRelMetadataProvider.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterRel.java dbd08f4 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillJoinRel.java dcccdb0 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillProjectRel.java 6e132aa 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjIntoScan.java 2981de8 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRelFactories.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java 53e1bff 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java 7d8dd97 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/DrillSqlWorker.java 3c78c08 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java eda1b5f 
  exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java 4d8b034 

Diff: https://reviews.apache.org/r/34006/diff/


Testing
-------

Unit test / Regression suite.


Thanks,

Jinfeng Ni


Re: Review Request 34006: DRILL-2958: Move Drill to alternative cost-based planner for Join planning

Posted by Aman Sinha <as...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34006/#review83150
-----------------------------------------------------------

Ship it!


Ship It!

- Aman Sinha


On May 9, 2015, 12:20 a.m., Jinfeng Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34006/
> -----------------------------------------------------------
> 
> (Updated May 9, 2015, 12:20 a.m.)
> 
> 
> Review request for drill and Aman Sinha.
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> Drill current use VolcanoPlanner in join planning. This planner has two known issues:
> 
> 1. The search space is increased exponentially with increased # of tables joined. If query has more than > 10 tables join, the planning time itself could be minutes, if not longer.
> 
> 2. Drill did not enable a rule to swap both sides of join, due to the search space problem. We only do a swap join afterwards. See DRILL-2236. This means the join order chosen by Drill's VolcanoPlanner might not be optimal.
> 
> To address the above two issues, we are going to provide another planner for the purpose of join ordering planning. This planner will use a different optimization rules, and the search space is not increased exponentially with # of table. 
> 
> The main logic of this new planner:
> 1) Let VolcanoPlanner do all the rule transformations same as the current planner's logical planning, except for the join permutation rule.
> 2) After that, pass to HepPlanner with Calcite LOPT optimization rule, to let it do the join ordering. Feed with the HepPlanner with Drill's RelMetaDataProvider, to leverage the statistics (rowcount) available in Drill's table/files. 
> 3) Continue with the same physical planning as before.
> 
> With the limited statistics available in Drill, the new planner seems to produce better query plan than the current, for several TPCH queries. 
> 
> Preliminary performance results show this planner run faster than the existing one, and the join plan seems to be same or better than the plan chosen by the existing planner. 
> 
> Will update more in detail about the comparison.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillJoinRelBase.java 5ab416c 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillProjectRelBase.java 42ef6ac 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillDefaultRelMetadataProvider.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterRel.java dbd08f4 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillJoinRel.java dcccdb0 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillProjectRel.java 6e132aa 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjIntoScan.java 2981de8 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRelFactories.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java 53e1bff 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java 7d8dd97 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/DrillSqlWorker.java 3c78c08 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java eda1b5f 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java 4d8b034 
> 
> Diff: https://reviews.apache.org/r/34006/diff/
> 
> 
> Testing
> -------
> 
> Unit test / Regression suite.
> 
> 
> Thanks,
> 
> Jinfeng Ni
> 
>


Re: Review Request 34006: DRILL-2958: Move Drill to alternative cost-based planner for Join planning

Posted by Jinfeng Ni <jn...@maprtech.com>.

> On May 8, 2015, 10:57 p.m., Aman Sinha wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java, line 529
> > <https://reviews.apache.org/r/34006/diff/2/?file=954207#file954207line529>
> >
> >     Is the criteria for using Lopt optimizer (in terms of number of tables above a certain threshold) applied internally ? We should have a Drill specific setting for it beyond just a true/false setting.
> 
> Jinfeng Ni wrote:
>     I think it probably makes sense to avoid LOPT planner for single table query. For any query with JOIN, since the current planer does not enable SwapJoin rule in the logical planning phase, it may not find the optimal plan, and rely on a post-planing method to swap join based on rowcount. In that sense, I feel LOPT planner might be a better choice even for 2 or 3 tables join.
>     
>     For single table query, since there is no join, it seems no difference between LOPT / the current planner. That's why I did not add a threashold.

Another reason that I did not add a threashold option is we have added a new option to swtich between the existing planner and new planner. User could simply turn on/off that option, to completely switch to one of them. Having both the switch option and threashold option probably will cause more confusing, IMHO.


- Jinfeng


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34006/#review83134
-----------------------------------------------------------


On May 8, 2015, 5:20 p.m., Jinfeng Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34006/
> -----------------------------------------------------------
> 
> (Updated May 8, 2015, 5:20 p.m.)
> 
> 
> Review request for drill and Aman Sinha.
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> Drill current use VolcanoPlanner in join planning. This planner has two known issues:
> 
> 1. The search space is increased exponentially with increased # of tables joined. If query has more than > 10 tables join, the planning time itself could be minutes, if not longer.
> 
> 2. Drill did not enable a rule to swap both sides of join, due to the search space problem. We only do a swap join afterwards. See DRILL-2236. This means the join order chosen by Drill's VolcanoPlanner might not be optimal.
> 
> To address the above two issues, we are going to provide another planner for the purpose of join ordering planning. This planner will use a different optimization rules, and the search space is not increased exponentially with # of table. 
> 
> The main logic of this new planner:
> 1) Let VolcanoPlanner do all the rule transformations same as the current planner's logical planning, except for the join permutation rule.
> 2) After that, pass to HepPlanner with Calcite LOPT optimization rule, to let it do the join ordering. Feed with the HepPlanner with Drill's RelMetaDataProvider, to leverage the statistics (rowcount) available in Drill's table/files. 
> 3) Continue with the same physical planning as before.
> 
> With the limited statistics available in Drill, the new planner seems to produce better query plan than the current, for several TPCH queries. 
> 
> Preliminary performance results show this planner run faster than the existing one, and the join plan seems to be same or better than the plan chosen by the existing planner. 
> 
> Will update more in detail about the comparison.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillJoinRelBase.java 5ab416c 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillProjectRelBase.java 42ef6ac 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillDefaultRelMetadataProvider.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterRel.java dbd08f4 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillJoinRel.java dcccdb0 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillProjectRel.java 6e132aa 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjIntoScan.java 2981de8 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRelFactories.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java 53e1bff 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java 7d8dd97 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/DrillSqlWorker.java 3c78c08 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java eda1b5f 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java 4d8b034 
> 
> Diff: https://reviews.apache.org/r/34006/diff/
> 
> 
> Testing
> -------
> 
> Unit test / Regression suite.
> 
> 
> Thanks,
> 
> Jinfeng Ni
> 
>


Re: Review Request 34006: DRILL-2958: Move Drill to alternative cost-based planner for Join planning

Posted by Aman Sinha <as...@maprtech.com>.

> On May 9, 2015, 5:57 a.m., Aman Sinha wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java, line 53
> > <https://reviews.apache.org/r/34006/diff/2/?file=954198#file954198line53>
> >
> >     It is not clear why Scan should have a 'distinct' row count of 10% ?  What does 'distinct' row count mean for scan ? (since we are not considering filter push-down or partition pruning here).
> 
> Jinfeng Ni wrote:
>     The distinct row count is for one set of column. It's more like column cardinality.  For instance, GB c1, c2, c3. The ImmutableBitSet groupKey indicates the column set. However, since Drill does not have any column cardinality, we simply use row count to estimate.
>     
>     As a matter of fact, the current VolcanoPlanner uses the default RelMetadaProvider, which will return 10% of row count for AggregateRel. In this sense, this DistinctRowCount MetadataProvider is using the same way for estimation.

Returning 10% as the distinct row count for AggregateRel seems reasonable since it is doing group-by.  I suppose for ScanRel it will be called for individual columns, not a set of columns... in other words if I want the NDV(c1), NDV(c2), NDV(c3), I will call this method 3 times with different columns in the groupKey.


> On May 9, 2015, 5:57 a.m., Aman Sinha wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java, line 529
> > <https://reviews.apache.org/r/34006/diff/2/?file=954207#file954207line529>
> >
> >     Is the criteria for using Lopt optimizer (in terms of number of tables above a certain threshold) applied internally ? We should have a Drill specific setting for it beyond just a true/false setting.
> 
> Jinfeng Ni wrote:
>     I think it probably makes sense to avoid LOPT planner for single table query. For any query with JOIN, since the current planer does not enable SwapJoin rule in the logical planning phase, it may not find the optimal plan, and rely on a post-planing method to swap join based on rowcount. In that sense, I feel LOPT planner might be a better choice even for 2 or 3 tables join.
>     
>     For single table query, since there is no join, it seems no difference between LOPT / the current planner. That's why I did not add a threashold.
> 
> Jinfeng Ni wrote:
>     Another reason that I did not add a threashold option is we have added a new option to swtich between the existing planner and new planner. User could simply turn on/off that option, to completely switch to one of them. Having both the switch option and threashold option probably will cause more confusing, IMHO.

Ok, let's see how the TPC-DS testing goes and see if more control is needed.


- Aman


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34006/#review83134
-----------------------------------------------------------


On May 9, 2015, 12:20 a.m., Jinfeng Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34006/
> -----------------------------------------------------------
> 
> (Updated May 9, 2015, 12:20 a.m.)
> 
> 
> Review request for drill and Aman Sinha.
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> Drill current use VolcanoPlanner in join planning. This planner has two known issues:
> 
> 1. The search space is increased exponentially with increased # of tables joined. If query has more than > 10 tables join, the planning time itself could be minutes, if not longer.
> 
> 2. Drill did not enable a rule to swap both sides of join, due to the search space problem. We only do a swap join afterwards. See DRILL-2236. This means the join order chosen by Drill's VolcanoPlanner might not be optimal.
> 
> To address the above two issues, we are going to provide another planner for the purpose of join ordering planning. This planner will use a different optimization rules, and the search space is not increased exponentially with # of table. 
> 
> The main logic of this new planner:
> 1) Let VolcanoPlanner do all the rule transformations same as the current planner's logical planning, except for the join permutation rule.
> 2) After that, pass to HepPlanner with Calcite LOPT optimization rule, to let it do the join ordering. Feed with the HepPlanner with Drill's RelMetaDataProvider, to leverage the statistics (rowcount) available in Drill's table/files. 
> 3) Continue with the same physical planning as before.
> 
> With the limited statistics available in Drill, the new planner seems to produce better query plan than the current, for several TPCH queries. 
> 
> Preliminary performance results show this planner run faster than the existing one, and the join plan seems to be same or better than the plan chosen by the existing planner. 
> 
> Will update more in detail about the comparison.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillJoinRelBase.java 5ab416c 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillProjectRelBase.java 42ef6ac 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillDefaultRelMetadataProvider.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterRel.java dbd08f4 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillJoinRel.java dcccdb0 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillProjectRel.java 6e132aa 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjIntoScan.java 2981de8 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRelFactories.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java 53e1bff 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java 7d8dd97 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/DrillSqlWorker.java 3c78c08 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java eda1b5f 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java 4d8b034 
> 
> Diff: https://reviews.apache.org/r/34006/diff/
> 
> 
> Testing
> -------
> 
> Unit test / Regression suite.
> 
> 
> Thanks,
> 
> Jinfeng Ni
> 
>


Re: Review Request 34006: DRILL-2958: Move Drill to alternative cost-based planner for Join planning

Posted by Jinfeng Ni <jn...@maprtech.com>.

> On May 8, 2015, 10:57 p.m., Aman Sinha wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillJoinRelBase.java, line 124
> > <https://reviews.apache.org/r/34006/diff/2/?file=954195#file954195line124>
> >
> >     Since the main difference between this and other costing function is the multiplication factor, can you consolidate the two and just provide thin wrappers ?

There is some difference between the two costing function. 

But I'll try to put the common code in one place.


> On May 8, 2015, 10:57 p.m., Aman Sinha wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java, line 53
> > <https://reviews.apache.org/r/34006/diff/2/?file=954198#file954198line53>
> >
> >     It is not clear why Scan should have a 'distinct' row count of 10% ?  What does 'distinct' row count mean for scan ? (since we are not considering filter push-down or partition pruning here).

The distinct row count is for one set of column. It's more like column cardinality.  For instance, GB c1, c2, c3. The ImmutableBitSet groupKey indicates the column set. However, since Drill does not have any column cardinality, we simply use row count to estimate.

As a matter of fact, the current VolcanoPlanner uses the default RelMetadaProvider, which will return 10% of row count for AggregateRel. In this sense, this DistinctRowCount MetadataProvider is using the same way for estimation.


> On May 8, 2015, 10:57 p.m., Aman Sinha wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRelFactories.java, line 37
> > <https://reviews.apache.org/r/34006/diff/2/?file=954203#file954203line37>
> >
> >     This currently has factories for creating logical project, filter and join.  Why only these 3 and not other logical rels ?

This is because LOPT only requires three : project, filter, join. We certainly could create more factories. But I'm not inclined to create more class, if they are not used at all in this patch.

I'll file another JIRA to create more factories for futher use.


> On May 8, 2015, 10:57 p.m., Aman Sinha wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java, line 529
> > <https://reviews.apache.org/r/34006/diff/2/?file=954207#file954207line529>
> >
> >     Is the criteria for using Lopt optimizer (in terms of number of tables above a certain threshold) applied internally ? We should have a Drill specific setting for it beyond just a true/false setting.

I think it probably makes sense to avoid LOPT planner for single table query. For any query with JOIN, since the current planer does not enable SwapJoin rule in the logical planning phase, it may not find the optimal plan, and rely on a post-planing method to swap join based on rowcount. In that sense, I feel LOPT planner might be a better choice even for 2 or 3 tables join.

For single table query, since there is no join, it seems no difference between LOPT / the current planner. That's why I did not add a threashold.


- Jinfeng


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34006/#review83134
-----------------------------------------------------------


On May 8, 2015, 5:20 p.m., Jinfeng Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34006/
> -----------------------------------------------------------
> 
> (Updated May 8, 2015, 5:20 p.m.)
> 
> 
> Review request for drill and Aman Sinha.
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> Drill current use VolcanoPlanner in join planning. This planner has two known issues:
> 
> 1. The search space is increased exponentially with increased # of tables joined. If query has more than > 10 tables join, the planning time itself could be minutes, if not longer.
> 
> 2. Drill did not enable a rule to swap both sides of join, due to the search space problem. We only do a swap join afterwards. See DRILL-2236. This means the join order chosen by Drill's VolcanoPlanner might not be optimal.
> 
> To address the above two issues, we are going to provide another planner for the purpose of join ordering planning. This planner will use a different optimization rules, and the search space is not increased exponentially with # of table. 
> 
> The main logic of this new planner:
> 1) Let VolcanoPlanner do all the rule transformations same as the current planner's logical planning, except for the join permutation rule.
> 2) After that, pass to HepPlanner with Calcite LOPT optimization rule, to let it do the join ordering. Feed with the HepPlanner with Drill's RelMetaDataProvider, to leverage the statistics (rowcount) available in Drill's table/files. 
> 3) Continue with the same physical planning as before.
> 
> With the limited statistics available in Drill, the new planner seems to produce better query plan than the current, for several TPCH queries. 
> 
> Preliminary performance results show this planner run faster than the existing one, and the join plan seems to be same or better than the plan chosen by the existing planner. 
> 
> Will update more in detail about the comparison.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillJoinRelBase.java 5ab416c 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillProjectRelBase.java 42ef6ac 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillDefaultRelMetadataProvider.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterRel.java dbd08f4 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillJoinRel.java dcccdb0 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillProjectRel.java 6e132aa 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjIntoScan.java 2981de8 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRelFactories.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java 53e1bff 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java 7d8dd97 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/DrillSqlWorker.java 3c78c08 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java eda1b5f 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java 4d8b034 
> 
> Diff: https://reviews.apache.org/r/34006/diff/
> 
> 
> Testing
> -------
> 
> Unit test / Regression suite.
> 
> 
> Thanks,
> 
> Jinfeng Ni
> 
>


Re: Review Request 34006: DRILL-2958: Move Drill to alternative cost-based planner for Join planning

Posted by Aman Sinha <as...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34006/#review83134
-----------------------------------------------------------



exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillJoinRelBase.java
<https://reviews.apache.org/r/34006/#comment134006>

    Since the main difference between this and other costing function is the multiplication factor, can you consolidate the two and just provide thin wrappers ?



exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java
<https://reviews.apache.org/r/34006/#comment134007>

    It is not clear why Scan should have a 'distinct' row count of 10% ?  What does 'distinct' row count mean for scan ? (since we are not considering filter push-down or partition pruning here).



exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRelFactories.java
<https://reviews.apache.org/r/34006/#comment134008>

    This currently has factories for creating logical project, filter and join.  Why only these 3 and not other logical rels ?



exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java
<https://reviews.apache.org/r/34006/#comment134009>

    Remove these commented rules.



exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java
<https://reviews.apache.org/r/34006/#comment134010>

    Is the criteria for using Lopt optimizer (in terms of number of tables above a certain threshold) applied internally ? We should have a Drill specific setting for it beyond just a true/false setting.


- Aman Sinha


On May 9, 2015, 12:20 a.m., Jinfeng Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34006/
> -----------------------------------------------------------
> 
> (Updated May 9, 2015, 12:20 a.m.)
> 
> 
> Review request for drill and Aman Sinha.
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> Drill current use VolcanoPlanner in join planning. This planner has two known issues:
> 
> 1. The search space is increased exponentially with increased # of tables joined. If query has more than > 10 tables join, the planning time itself could be minutes, if not longer.
> 
> 2. Drill did not enable a rule to swap both sides of join, due to the search space problem. We only do a swap join afterwards. See DRILL-2236. This means the join order chosen by Drill's VolcanoPlanner might not be optimal.
> 
> To address the above two issues, we are going to provide another planner for the purpose of join ordering planning. This planner will use a different optimization rules, and the search space is not increased exponentially with # of table. 
> 
> The main logic of this new planner:
> 1) Let VolcanoPlanner do all the rule transformations same as the current planner's logical planning, except for the join permutation rule.
> 2) After that, pass to HepPlanner with Calcite LOPT optimization rule, to let it do the join ordering. Feed with the HepPlanner with Drill's RelMetaDataProvider, to leverage the statistics (rowcount) available in Drill's table/files. 
> 3) Continue with the same physical planning as before.
> 
> With the limited statistics available in Drill, the new planner seems to produce better query plan than the current, for several TPCH queries. 
> 
> Preliminary performance results show this planner run faster than the existing one, and the join plan seems to be same or better than the plan chosen by the existing planner. 
> 
> Will update more in detail about the comparison.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillJoinRelBase.java 5ab416c 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillProjectRelBase.java 42ef6ac 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillDefaultRelMetadataProvider.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterRel.java dbd08f4 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillJoinRel.java dcccdb0 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillProjectRel.java 6e132aa 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjIntoScan.java 2981de8 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRelFactories.java PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java 53e1bff 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java 7d8dd97 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/DrillSqlWorker.java 3c78c08 
>   exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java eda1b5f 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java 4d8b034 
> 
> Diff: https://reviews.apache.org/r/34006/diff/
> 
> 
> Testing
> -------
> 
> Unit test / Regression suite.
> 
> 
> Thanks,
> 
> Jinfeng Ni
> 
>


Re: Review Request 34006: DRILL-2958: Move Drill to alternative cost-based planner for Join planning

Posted by Jinfeng Ni <jn...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34006/
-----------------------------------------------------------

(Updated May 8, 2015, 5:20 p.m.)


Review request for drill and Aman Sinha.


Changes
-------

code cleanup.


Repository: drill-git


Description
-------

Drill current use VolcanoPlanner in join planning. This planner has two known issues:

1. The search space is increased exponentially with increased # of tables joined. If query has more than > 10 tables join, the planning time itself could be minutes, if not longer.

2. Drill did not enable a rule to swap both sides of join, due to the search space problem. We only do a swap join afterwards. See DRILL-2236. This means the join order chosen by Drill's VolcanoPlanner might not be optimal.

To address the above two issues, we are going to provide another planner for the purpose of join ordering planning. This planner will use a different optimization rules, and the search space is not increased exponentially with # of table. 

The main logic of this new planner:
1) Let VolcanoPlanner do all the rule transformations same as the current planner's logical planning, except for the join permutation rule.
2) After that, pass to HepPlanner with Calcite LOPT optimization rule, to let it do the join ordering. Feed with the HepPlanner with Drill's RelMetaDataProvider, to leverage the statistics (rowcount) available in Drill's table/files. 
3) Continue with the same physical planning as before.

With the limited statistics available in Drill, the new planner seems to produce better query plan than the current, for several TPCH queries. 

Preliminary performance results show this planner run faster than the existing one, and the join plan seems to be same or better than the plan chosen by the existing planner. 

Will update more in detail about the comparison.


Diffs (updated)
-----

  exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillJoinRelBase.java 5ab416c 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillProjectRelBase.java 42ef6ac 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillDefaultRelMetadataProvider.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillFilterRel.java dbd08f4 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillJoinRel.java dcccdb0 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillProjectRel.java 6e132aa 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjIntoScan.java 2981de8 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRelFactories.java PRE-CREATION 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java 53e1bff 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java 7d8dd97 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/DrillSqlWorker.java 3c78c08 
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java eda1b5f 
  exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java 4d8b034 

Diff: https://reviews.apache.org/r/34006/diff/


Testing
-------

Unit test / Regression suite.


Thanks,

Jinfeng Ni