You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@quickstep.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/04/23 01:34:04 UTC
[jira] [Commented] (QUICKSTEP-89) Add support for common
subexpression elimination.
[ https://issues.apache.org/jira/browse/QUICKSTEP-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980210#comment-15980210 ]
ASF GitHub Bot commented on QUICKSTEP-89:
-----------------------------------------
GitHub user jianqiao opened a pull request:
https://github.com/apache/incubator-quickstep/pull/237
QUICKSTEP-89 Add support for common subexpression.
This PR adds support for common subexpression elimination in Quickstep. In summary we apply two kinds of optimizations:
**(1)** Query rewriting for _aggregate expressions_, e.g.
```
SELECT SUM(x), SUM(x)*2, AVG(x), COUNT(x) FROM s;
```
will be rewritten as
```
SELECT sum_x, sum_x * 2, sum_x / count_x, count_x
FROM (
SELECT SUM(x) AS sum_x, COUNT(x) AS count_x
) t;
```
**(2)** Common subexpression labeling with **per-storage-block memoization table** to cache result column vectors to avoid redundant computation. For example, the evaluation of expressions in
```
SELECT (x + 1) * (y + 2) * 3, (x + 1) * (y + 2), x + 1
FROM s;
```
will essentially look like:
```
t1 = x + 1
t2 = t1 * (y + 2)
Out1 = t2 * 3
Out2 = Reference(t2)
Out3 = Reference(t3)
```
Here `t1`, `t2`, `Out1`, `Out2`, `Out3` are per-storage-block `ColumnVector`'s.
This PR together with a (later) new aggregation hash table implementation will improve the performance of TPC-H Q01 from ~11.5s to ~5.3s, with scale factor 100 on a cloud lab machine.
___
**Note**: For optimization (1), note that it is not free-of-cost to "re-project" the columns, so it may not worth doing the transformation in situations where a group-by aggregation has large number of groups. E.g., it may actually slow down the query to rewrite
```
SELECT SUM(a), SUM(b), SUM(c), SUM(d), SUM(a)
FROM s
GROUP BY x;
```
as
```
SELECT sum_a, sum_b, sum_c, sum_d, sum_a
FROM (
SELECT SUM(a) AS sum_a, SUM(b) AS sum_b, SUM(c) AS sum_c, SUM(d) AS sum_d
FROM s
GROUP BY x;
) t;
```
if the number of distinct values of attribute x is large -- in that case, we avoid one duplicate computation of `SUM(a)`, but introduce 5 extra expensive column copying due to the intermediate relation `t`.
Thus currently we employ a simple heuristic that if either
_(1) The estimated number of groups for the Aggregate node is smaller than the specified `FLAGS_reuse_aggregate_group_size_threshold`._
or
_(2) The ratio of (# of eliminable columns) to (# of original columns) is greater than the specified `FLAGS_reuse_aggregate_ratio_threshold`._
then we perform the transformation.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-quickstep common-subexpression
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-quickstep/pull/237.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #237
----
commit b31f0360a93727ed1250e8e3cf28c242c3d727a1
Author: Jianqiao Zhu <ji...@cs.wisc.edu>
Date: 2017-04-20T20:28:12Z
Add common-subexpression support.
----
> Add support for common subexpression elimination.
> -------------------------------------------------
>
> Key: QUICKSTEP-89
> URL: https://issues.apache.org/jira/browse/QUICKSTEP-89
> Project: Apache Quickstep
> Issue Type: New Feature
> Components: Expressions, Query Optimizer, Relational Operators
> Reporter: Jianqiao Zhu
> Assignee: Jianqiao Zhu
>
> This feature adds support for common subexpression elimination in Quickstep.
> Note that Quickstep features column-at-a-time expression evaluation. So we use a relatively simple *memoization* approach that (1) identifies common subexpressions at query optimization time and assigns unique IDs for each equivalence class of common subexpressions, and (2) uses a memoization table (per block) to memorize and lookup result column vectors at query execution time based on the common subexpression IDs.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)