Posted to issues@drill.apache.org by "Paul Rogers (Jira)" <ji...@apache.org> on 2019/11/23 00:31:00 UTC
[jira] [Created] (DRILL-7455) "Renaming" projection operator to avoid physical copies

Paul Rogers created DRILL-7455:
----------------------------------

             Summary: "Renaming" projection operator to avoid physical copies
                 Key: DRILL-7455
                 URL: https://issues.apache.org/jira/browse/DRILL-7455
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Paul Rogers


Drill/Calcite inserts project operators for three main reasons:

1. To rename columns: {{SELECT a AS x ...}}

2. To compute a new column: {{SELECT a + b AS c ...}}

3. To remove columns: {{SELECT a ...}}, when the data source provides columns {{a}} and {{b}}.

Example of case 1:

{code:json}
    "pop" : "project",
    "@id" : 4,
    "exprs" : [ {
      "ref" : "`a0`",
      "expr" : "`a`"
    }, {
      "ref" : "`b0`",
      "expr" : "`b`"
    } ],
{code}

Of these, only case 2 requires row-by-row computation of new values. Case 1 simply creates a new vector that holds the same data under a different name. Case 3 preserves some vectors and drops others.

In cases 1 and 3, a simple data transfer from input to output would be adequate. Yet stepping through the code with code generation enabled shows that Drill steps through each record in all three cases, even calling an empty per-record compute block.
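The difference between a data transfer and a per-record copy can be illustrated with a toy model in which a batch is a dict that maps column names to Python lists standing in for value vectors. This is only a sketch under that assumption: {{project_by_copy}} and {{project_by_transfer}} are hypothetical names, not Drill APIs.

```python
def project_by_copy(batch, renames):
    """Row-by-row projection: visit every record and copy every value."""
    row_count = len(next(iter(batch.values()))) if batch else 0
    out = {new: [] for new in renames.values()}
    for row in range(row_count):
        for old, new in renames.items():
            out[new].append(batch[old][row])
    return out


def project_by_transfer(batch, renames):
    """Column-level projection: hand over each column under its new name.

    Columns not listed in `renames` are simply dropped.
    """
    return {new: batch[old] for old, new in renames.items()}


batch = {"a": [1, 2, 3], "b": [10, 20, 30]}
renames = {"a": "a0", "b": "b0"}

copied = project_by_copy(batch, renames)
transferred = project_by_transfer(batch, renames)
```

Both functions produce the same values, but the transfer version does work proportional to the number of columns rather than the number of records, and its output columns are the very same objects as the input columns.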

A better-performing solution is to separate the renames/drops (cases 1 and 3) from the column computations (case 2). This can be done in either of two ways:

1. At plan time, identify that all columns are renames, and replace the row-by-row project with a column-level project.

2. At run time, identify the column-level projections (cases 1 and 3) and handle those with transfer pairs, doing row-by-row computes only if case 2 exists.
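Either approach starts by classifying each entry of the project's {{exprs}} list. A minimal sketch of that step follows. It assumes that a plain column reference can be recognized as a single backtick-quoted name; the function name {{split_projections}} and the {{add(...)}} expression string are made up for illustration, and real Drill code would inspect parsed expression trees rather than strings.

```python
import re

# Assumption: a plain column reference is a single backtick-quoted name.
_COLUMN_REF = re.compile(r"^`([^`]+)`$")


def split_projections(exprs):
    """Split project expressions into column-level transfers and row-level computes."""
    transfers = []  # (input column, output column) pairs: renames and kept columns
    computes = []   # entries that need per-row evaluation
    for item in exprs:
        ref = _COLUMN_REF.match(item["ref"])
        expr = _COLUMN_REF.match(item["expr"])
        if ref and expr:
            transfers.append((expr.group(1), ref.group(1)))
        else:
            computes.append(item)
    return transfers, computes


exprs = [
    {"ref": "`a0`", "expr": "`a`"},
    {"ref": "`b0`", "expr": "`b`"},
    {"ref": "`c`", "expr": "add(`a`, `b`)"},  # hypothetical computed column
]
transfers, computes = split_projections(exprs)
```

If {{computes}} comes back empty, the whole project can be handled at the column level; otherwise only the entries in {{computes}} need the row-by-row path.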

Since row-by-row copies are among the most expensive operations in Drill, this optimization could improve performance by a decent amount.

Note that a further optimization is to remove "trivial" projects such as the following:

{code:json}
    "pop" : "project",
    "@id" : 2,
    "exprs" : [ {
      "ref" : "`a`",
      "expr" : "`a`"
    }, {
      "ref" : "`b`",
      "expr" : "`b`"
    }, {
      "ref" : "`b0`",
      "expr" : "`b0`"
    } ],
{code}

The only value of such a projection is to say, "remove all vectors except {{a}}, {{b}} and {{b0}}." In fact, the only time such a projection should be needed is:

1. On top of a data source that does not support projection push down.

2. When Calcite knows it wants to discard certain intermediate columns.

Otherwise, Calcite knows which columns emerge from operator x, and should not need to add a project to enforce that schema if it is already what the project will emit.
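The check for a removable project can be sketched as follows. This is an illustration rather than Calcite or Drill code: {{is_trivial_project}} is a hypothetical helper, the input schema is assumed to be available as a list of column names, and treating column order as significant is also an assumption.

```python
def is_trivial_project(exprs, input_columns):
    """Return True if the project emits exactly its input columns, unchanged."""
    outputs = []
    for item in exprs:
        if item["ref"] != item["expr"]:
            return False  # a rename or a computed column
        outputs.append(item["ref"].strip("`"))
    # Trivial only if nothing is dropped or reordered.
    return outputs == list(input_columns)


exprs = [
    {"ref": "`a`", "expr": "`a`"},
    {"ref": "`b`", "expr": "`b`"},
    {"ref": "`b0`", "expr": "`b0`"},
]

removable = is_trivial_project(exprs, ["a", "b", "b0"])
still_needed = not is_trivial_project(exprs, ["a", "b", "b0", "c"])
```

In the second call the project is still doing useful work, because it discards column {{c}}.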

 


