You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "Jorge (Jira)" <ji...@apache.org> on 2020/07/17 18:40:00 UTC

[jira] [Created] (ARROW-9516) [Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes

Jorge created ARROW-9516:
----------------------------

             Summary: [Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes
                 Key: ARROW-9516
                 URL: https://issues.apache.org/jira/browse/ARROW-9516
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust - DataFusion
            Reporter: Jorge


This issue covers three main topics that IMO are addressed as a whole in a refactor of the physical plans and expressions in data fusion. The underlying issues that justify this particular ticket:
h3. We currently assign poor names to the output schema.

Specifically, most names are given based on the last expression's name. Example: {{SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c}} yields the fields names "c, SUM, SUM".
h3. We currently derive the column names from physical expressions, not logical expressions

This implies that logical expressions that perform multiple operations (e.g. an grouped aggregation that performs partitioned aggregations + merge + final aggregation) have their name derived from their physical declaration, not logical. IMO a physical plan is an execution plan and is thus not concerned with naming. It is the logical plan that should be concerned with naming. Conceptually, a given logical plan can have more than one physical plan, e.g. depending on the execution environment (e.g. locally vs distributed).
h3. We currently carry the index of a column read throughout the plans, making it cumbersome to write optimizers.

More details here. In summary, it is possible to remove one of the optimizers and significantly simplify the other if columns do not carry indexing information.
h2. Proposal

I propose that we:
h3. drop {{physical_plan::expressions::Column::index}}

This is a major simplification of the code, and allow us to just ignore the position of the statement on the schema, and instead focus on its name. This is overall a simplification because it allow us to treat columns based solely on their names, and not on their position in the schema. Since SQL does not care about the position of the column on the table anyway (we currently already take the first column with that name), this seems natural.

I already prototyped this [here|https://github.com/jorgecarleitao/arrow/tree/column_names].

The main conclusion of this prototype is that this feasible as long as all our expressions get assigned a unique name, which is against what we currently offer (see example above). This leads me to:
h3. drop {{physical_plan::PhysicalExpr::name()}}

Currently, the name of an expression is derived from its physical plan. However, some operations' names are required to be known before its physical representation. The example I found in our current code is the grouped aggregation described above. If we were to build the name of our aggregation based on its physical plan, the name of a "COUNT(a)" operation would be {{SUM(COUNT(a))}} because, in the physical plan we first count on each partition, then merge, and them sum the counts over all partitions.

Fundamentally, IMO the issue here is that we are mixing responsibilities: the physical plan should not care about naming, because the physical plan corresponds to an execution plan, not a logical description of the column (its name). This leads me to:
h3. add {{logicalplan::Expr::name()}}

This will contain the name of this expression, that will naturally depend on the variation. Its implementation will be based on our current code for {{physical_plan::PhysicalExpr::name()}}.

I can take this work, but before committing, would like to know your thoughts about this. My initial prototyping indicate that all of this is possible and greatly simplifies the code, but I may be missing a design aspect of this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Re: [jira] [Created] (ARROW-9516) [Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes

Posted by Andy Grove <an...@gmail.com>.

Thanks for the detailed write up. That all makes good sense to me. I am not
sure that I had a good reason for having physical expressions determine
their names.

On Fri, Jul 17, 2020, 12:50 PM Jorge (Jira) <ji...@apache.org> wrote:

> Jorge created ARROW-9516:
> ----------------------------
>
>              Summary: [Rust][DataFusion] Refactor physical expressions to
> not care about their names nor indexes
>                  Key: ARROW-9516
>                  URL: https://issues.apache.org/jira/browse/ARROW-9516
>              Project: Apache Arrow
>           Issue Type: Improvement
>           Components: Rust - DataFusion
>             Reporter: Jorge
>
>
> This issue covers three main topics that IMO are addressed as a whole in a
> refactor of the physical plans and expressions in data fusion. The
> underlying issues that justify this particular ticket:
> h3. We currently assign poor names to the output schema.
>
> Specifically, most names are given based on the last expression's name.
> Example: {{SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c}} yields the
> fields names "c, SUM, SUM".
> h3. We currently derive the column names from physical expressions, not
> logical expressions
>
> This implies that logical expressions that perform multiple operations
> (e.g. an grouped aggregation that performs partitioned aggregations + merge
> + final aggregation) have their name derived from their physical
> declaration, not logical. IMO a physical plan is an execution plan and is
> thus not concerned with naming. It is the logical plan that should be
> concerned with naming. Conceptually, a given logical plan can have more
> than one physical plan, e.g. depending on the execution environment (e.g.
> locally vs distributed).
> h3. We currently carry the index of a column read throughout the plans,
> making it cumbersome to write optimizers.
>
> More details here. In summary, it is possible to remove one of the
> optimizers and significantly simplify the other if columns do not carry
> indexing information.
> h2. Proposal
>
> I propose that we:
> h3. drop {{physical_plan::expressions::Column::index}}
>
> This is a major simplification of the code, and allow us to just ignore
> the position of the statement on the schema, and instead focus on its name.
> This is overall a simplification because it allow us to treat columns based
> solely on their names, and not on their position in the schema. Since SQL
> does not care about the position of the column on the table anyway (we
> currently already take the first column with that name), this seems natural.
>
> I already prototyped this [here|
> https://github.com/jorgecarleitao/arrow/tree/column_names].
>
> The main conclusion of this prototype is that this feasible as long as all
> our expressions get assigned a unique name, which is against what we
> currently offer (see example above). This leads me to:
> h3. drop {{physical_plan::PhysicalExpr::name()}}
>
> Currently, the name of an expression is derived from its physical plan.
> However, some operations' names are required to be known before its
> physical representation. The example I found in our current code is the
> grouped aggregation described above. If we were to build the name of our
> aggregation based on its physical plan, the name of a "COUNT(a)" operation
> would be {{SUM(COUNT(a))}} because, in the physical plan we first count on
> each partition, then merge, and them sum the counts over all partitions.
>
> Fundamentally, IMO the issue here is that we are mixing responsibilities:
> the physical plan should not care about naming, because the physical plan
> corresponds to an execution plan, not a logical description of the column
> (its name). This leads me to:
> h3. add {{logicalplan::Expr::name()}}
>
> This will contain the name of this expression, that will naturally depend
> on the variation. Its implementation will be based on our current code for
> {{physical_plan::PhysicalExpr::name()}}.
>
> I can take this work, but before committing, would like to know your
> thoughts about this. My initial prototyping indicate that all of this is
> possible and greatly simplifies the code, but I may be missing a design
> aspect of this.
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>