Posted to dev@arrow.apache.org by Markus Appel <ma...@hotmail.de> on 2023/04/11 15:31:35 UTC

[DataFusion] Projection pushdown and pushed down filters

Hello,

I hope this is the right place to ask this.

While working on a project based on arrow-datafusion, I came across some weird behavior where a projection did not get eliminated as expected, thus breaking a custom optimizer rule's assumption (into which I won't go further, as it's not important for this question).

Specifically, I found an execution plan like:

Projection [col1, col2]
     TableScan projection=[col1, col2, col3], full_filters=[ ... col3 ...]

to not be simplified to:

TableScan projection=[col1, col2], full_filters=[... col3 ...]

This does seem to be intentional, as I have found this snippet in the optimizer rule (<https://github.com/apache/arrow-datafusion/blob/c97048d178594b10b813c6bcd1543f157db4ba3f/datafusion/optimizer/src/push_down_projection.rs#L174>):

...
LogicalPlan::TableScan(scan)
                if !scan.projected_schema.fields().is_empty() =>
            {
                let mut used_columns: HashSet<Column> = HashSet::new();
                // Filter exprs may reference columns that are not in the projection,
                // e.g. TableScan: t1 projection=[bool_col, int_col], full_filters=[t1.id = Int32(1)]
                // where projection=[bool_col, int_col] doesn't contain `t1.id`.
                exprlist_to_columns(&scan.filters, &mut used_columns)?;
...

However, the comment does not explain why we need to keep the extra projection and return the extra column - after all, the filters inside of the scan are internal to that scan, and should not affect the execution plan, right?
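
For reference, the projection and the pushed-down filters are both handed to the table provider when the physical scan is created, which is the sense in which the filters are "internal" to the scan. The trait method looks roughly like the sketch below (reproduced from memory, so the exact signature may differ between DataFusion releases; treat it as an illustration rather than the definitive API):

use std::sync::Arc;

use async_trait::async_trait;
use datafusion::error::Result;
use datafusion::execution::context::SessionState;
use datafusion::logical_expr::Expr;
use datafusion::physical_plan::ExecutionPlan;

// Rough shape of TableProvider::scan (not copied verbatim from the repository):
// the provider receives the projection and the pushed-down filters together
// and is responsible for applying the filters itself.
#[async_trait]
pub trait TableProvider: Sync + Send {
    async fn scan(
        &self,
        state: &SessionState,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>>;
    // ... other methods omitted
}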

I am looking forward to any opinions.

Best regards, Markus Appel



Re: [DataFusion] Projection pushdown and pushed down filters

Posted by Andrew Lamb <al...@influxdata.com>.
I agree it would be better if the projection did not include the columns
needed only for the pushdown filter. As Jie Wen says, it would be a nice
thing to improve.

Would you be willing to file a ticket to track this issue? If not I can do
so.

Thanks,
Andrew



On Wed, Apr 12, 2023 at 2:13 AM vin jake <ja...@gmail.com> wrote:

> Hi,
>
> The current "Projection pushdown" rule was implemented by me previously.
>
> I believe the key issue is the execution order of the internal projection
> and filter of the `TableScan` operator.
> If the scan applies the `projection` before the `filter`, the outer
> `Projection [col1, col2]` is essential: the scan's own projection has to
> keep `col3` so the filter can see it, and the extra `Projection` drops it
> again.
> But if the scan applies the `filter` before the `projection`, I think
> `Projection [col1, col2]` isn't necessary. You can create a new PR to
> enhance it (if you are willing to) and request my review.
>
> Best regards, Jie Wen.
>
> On Wed, Apr 12, 2023 at 1:36 PM Markus Appel <ma...@hotmail.de>
> wrote:
>
> > Hello,
> >
> > I hope this is the right place to ask this.
> >
> > While working on a project based on arrow-datafusion, I came across some
> > weird behavior where a projection did not get eliminated as expected, thus
> > breaking a custom optimizer rule's assumption (into which I won't go
> > further, as it's not important for this question).
> >
> > Specifically, I found an execution plan like:
> >
> > Projection [col1, col2]
> >      TableScan projection=[col1, col2, col3], full_filters=[ ... col3 ...]
> >
> > to not be simplified to:
> >
> > TableScan projection=[col1, col2], full_filters=[... col3 ...]
> >
> > This does seem to be intentional, as I have found this snippet in the
> > optimizer rule
> > (<https://github.com/apache/arrow-datafusion/blob/c97048d178594b10b813c6bcd1543f157db4ba3f/datafusion/optimizer/src/push_down_projection.rs#L174>):
> >
> > ...
> > LogicalPlan::TableScan(scan)
> >                 if !scan.projected_schema.fields().is_empty() =>
> >             {
> >                 let mut used_columns: HashSet<Column> = HashSet::new();
> >                 // Filter exprs may reference columns that are not in the projection,
> >                 // e.g. TableScan: t1 projection=[bool_col, int_col],
> > full_filters=[t1.id = Int32(1)]
> >                 // where projection=[bool_col, int_col] doesn't contain `t1.id`.
> >                 exprlist_to_columns(&scan.filters, &mut used_columns)?;
> > ...
> >
> > However, the comment does not explain why we need to keep the extra
> > projection and return the extra column - after all, the filters inside of
> > the scan are internal to that scan, and should not affect the execution
> > plan, right?
> >
> > I am looking forward to any opinions.
> >
> > Best regards, Markus Appel
> >
> >
> >
>

Re: [DataFusion] Projection pushdown and pushed down filters

Posted by vin jake <ja...@gmail.com>.
Hi,

The current "Projection pushdown" rule was implemented by me previously.

I believe the key issue is the execution order of the internal projection
and filter of the `TableScan` operator.
If the scan applies the `projection` before the `filter`, the outer
`Projection [col1, col2]` is essential: the scan's own projection has to
keep `col3` so the filter can see it, and the extra `Projection` drops it
again.
But if the scan applies the `filter` before the `projection`, I think
`Projection [col1, col2]` isn't necessary. You can create a new PR to
enhance it (if you are willing to) and request my review.
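
To make the difference between the two orderings concrete, here is a
minimal, self-contained sketch (plain Rust, not DataFusion code; the `Row`
type, the column indices and the literal filter value are made up for
illustration):

// col1 = index 0, col2 = index 1, col3 = index 2
type Row = Vec<i64>;

// Order A: the scan evaluates the pushed-down filter against the source row
// first and then projects. It can output only [col1, col2], so the outer
// Projection [col1, col2] is redundant.
fn scan_filter_then_project(source: &[Row]) -> Vec<Row> {
    source
        .iter()
        .filter(|row| row[2] == 1)        // full_filters=[col3 = 1]
        .map(|row| vec![row[0], row[1]])  // projection=[col1, col2]
        .collect()
}

// Order B: the scan projects first and filters afterwards. The projection
// must keep col3 so the filter can still see it, and the extra Projection
// above the scan is needed to drop col3 again.
fn scan_project_then_filter(source: &[Row]) -> Vec<Row> {
    source
        .iter()
        .map(|row| vec![row[0], row[1], row[2]])  // projection=[col1, col2, col3]
        .filter(|row| row[2] == 1)                // filter still needs col3
        .collect()
}

fn main() {
    let source = vec![vec![10, 20, 1], vec![30, 40, 2]];
    assert_eq!(scan_filter_then_project(&source), vec![vec![10, 20]]);
    // Order B leaks col3 out of the scan; something above has to drop it.
    assert_eq!(scan_project_then_filter(&source), vec![vec![10, 20, 1]]);
}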

Best regards, Jie Wen.

On Wed, Apr 12, 2023 at 1:36 PM Markus Appel <ma...@hotmail.de> wrote:

> Hello,
>
> I hope this is the right place to ask this.
>
> While working on a project based on arrow-datafusion, I came across some
> weird behavior where a projection did not get eliminated as expected, thus
> breaking a custom optimizer rule's assumption (into which I won't go
> further, as it's not important for this question).
>
> Specifically, I found an execution plan like:
>
> Projection [col1, col2]
>      TableScan projection=[col1, col2, col3], full_filters=[ ... col3 ...]
>
> to not be simplified to:
>
> TableScan projection=[col1, col2], full_filters=[... col3 ...]
>
> This does seem to be intentional, as I have found this snippet in the
> optimizer rule
> (<https://github.com/apache/arrow-datafusion/blob/c97048d178594b10b813c6bcd1543f157db4ba3f/datafusion/optimizer/src/push_down_projection.rs#L174>):
>
> ...
> LogicalPlan::TableScan(scan)
>                 if !scan.projected_schema.fields().is_empty() =>
>             {
>                 let mut used_columns: HashSet<Column> = HashSet::new();
>                 // Filter exprs may reference columns that are not in the projection,
>                 // e.g. TableScan: t1 projection=[bool_col, int_col],
> full_filters=[t1.id = Int32(1)]
>                 // where projection=[bool_col, int_col] doesn't contain `t1.id`.
>                 exprlist_to_columns(&scan.filters, &mut used_columns)?;
> ...
>
> However, the comment does not explain why we need to keep the extra
> projection and return the extra column - after all, the filters inside of
> the scan are internal to that scan, and should not affect the execution
> plan, right?
>
> I am looking forward to any opinions.
>
> Best regards, Markus Appel
>
>
>