You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@quickstep.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/04/23 01:34:04 UTC

[jira] [Commented] (QUICKSTEP-89) Add support for common subexpression elimination.

    [ https://issues.apache.org/jira/browse/QUICKSTEP-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980210#comment-15980210 ] 

ASF GitHub Bot commented on QUICKSTEP-89:
-----------------------------------------

GitHub user jianqiao opened a pull request:

    https://github.com/apache/incubator-quickstep/pull/237

    QUICKSTEP-89 Add support for common subexpression.

    This PR adds support for common subexpression elimination in Quickstep. In summary we apply two kinds of optimizations:
    **(1)** Query rewriting for _aggregate expressions_, e.g.
    ```
    SELECT SUM(x), SUM(x)*2, AVG(x), COUNT(x) FROM s;
    ```
    will be rewritten as
    ```
    SELECT sum_x, sum_x * 2, sum_x / count_x, count_x
    FROM (
      SELECT SUM(x) AS sum_x, COUNT(x) AS count_x
    ) t;
    ```
    **(2)** Common subexpression labeling with **per-storage-block memoization table** to cache result column vectors to avoid redundant computation. For example, the evaluation of expressions in
    ```
    SELECT (x + 1) * (y + 2) * 3, (x + 1) * (y + 2), x + 1
    FROM s;
    ```
    will essentially look like:
    ```
    t1 = x + 1
    t2 = t1 * (y + 2)
    Out1 = t2 * 3
    Out2 = Reference(t2)
    Out3 = Reference(t3)
    ```
    Here `t1`, `t2`, `Out1`, `Out2`, `Out3` are per-storage-block `ColumnVector`'s.
    
    This PR together with a (later) new aggregation hash table implementation will improve the performance of TPC-H Q01 from ~11.5s to ~5.3s, with scale factor 100 on a cloud lab machine.
    
    ___
    **Note**: For optimization (1), note that it is not free-of-cost to "re-project" the columns, so it may not worth doing the transformation in situations where a group-by aggregation has large number of groups. E.g., it may actually slow down the query to rewrite
    ```
    SELECT SUM(a), SUM(b), SUM(c), SUM(d), SUM(a)
    FROM s
    GROUP BY x;
    ```
    as
    ```
    SELECT sum_a, sum_b, sum_c, sum_d, sum_a
    FROM (
      SELECT SUM(a) AS sum_a, SUM(b) AS sum_b, SUM(c) AS sum_c, SUM(d) AS sum_d
      FROM s
      GROUP BY x;
    ) t;
    ```
    if the number of distinct values of attribute x is large -- in that case, we avoid one duplicate computation of `SUM(a)`, but introduce 5 extra expensive column copying due to the intermediate relation `t`.
    
    Thus currently we employ a simple heuristic that if either
    _(1) The estimated number of groups for the Aggregate node is smaller than the specified `FLAGS_reuse_aggregate_group_size_threshold`._
    or
    _(2) The ratio of (# of eliminable columns) to (# of original columns) is greater than the specified `FLAGS_reuse_aggregate_ratio_threshold`._
    then we perform the transformation.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-quickstep common-subexpression

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-quickstep/pull/237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #237
    
----
commit b31f0360a93727ed1250e8e3cf28c242c3d727a1
Author: Jianqiao Zhu <ji...@cs.wisc.edu>
Date:   2017-04-20T20:28:12Z

    Add common-subexpression support.

----


> Add support for common subexpression elimination.
> -------------------------------------------------
>
>                 Key: QUICKSTEP-89
>                 URL: https://issues.apache.org/jira/browse/QUICKSTEP-89
>             Project: Apache Quickstep
>          Issue Type: New Feature
>          Components: Expressions, Query Optimizer, Relational Operators
>            Reporter: Jianqiao Zhu
>            Assignee: Jianqiao Zhu
>
> This feature adds support for common subexpression elimination in Quickstep.
> Note that Quickstep features column-at-a-time expression evaluation. So we use a relatively simple *memoization* approach that (1) identifies common subexpressions at query optimization time and assigns unique IDs for each equivalence class of common subexpressions, and (2) uses a memoization table (per block) to memorize and lookup result column vectors at query execution time based on the common subexpression IDs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)