Posted to user@arrow.apache.org by Andrew Campbell <an...@gmail.com> on 2020/11/15 22:52:09 UTC

Predicate pushdown clarification

Hi Arrow community,

I'm new to the project and am trying to understand exactly what is
happening under the hood when I run a filter-collect query on an Arrow
Dataset (backed by Parquet).

Let's say I created a Parquet dataset with no partitioning: I just wrote a
number of separate files to a single directory. Now I want to run a query
that returns the rows falling within a specific range of datetimes in the
dataset's dt column.
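
For concreteness, here is roughly what I mean, sketched in Python with
pyarrow (the paths, column values, and date range are made up):

from datetime import datetime
from pathlib import Path

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

Path("my_dataset").mkdir(exist_ok=True)

# Write a few separate Parquet files into one directory, no partitioning.
for i in range(3):
    table = pa.table({
        "dt": [datetime(2020, 11, day) for day in range(1 + 5 * i, 6 + 5 * i)],
        "value": list(range(5)),
    })
    pq.write_table(table, f"my_dataset/part-{i}.parquet")

# Filter-collect: I only want rows whose dt falls in a given range.
# Here the range matches exactly one of the three files.
dataset = ds.dataset("my_dataset", format="parquet")
result = dataset.to_table(
    filter=(ds.field("dt") >= datetime(2020, 11, 6))
    & (ds.field("dt") < datetime(2020, 11, 11))
)
print(result.num_rows)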

My understanding is that the Dataset API will push the filter down to the
file level, checking each file's footer for the per-row-group min/max
statistics of dt and deciding whether each row group needs to be read at all.
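
For what it's worth, I can see those statistics when I inspect a file's
footer directly; a small sketch below, with the path again made up, showing
the min/max values I assume the scanner consults:

import pyarrow.parquet as pq

# Each row group in the footer carries per-column statistics.
meta = pq.ParquetFile("my_dataset/part-0.parquet").metadata
for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(0)  # "dt" is the first column in my files
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        print(rg, stats.min, stats.max)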

Assuming this is correct, a few questions:

Will every query result in reading all of the file footers? Is there any
caching of these min/max values?

Is there a way to profile query performance? Or a way to view a query plan
before it is executed?

I appreciate your time in helping me understand this better.

Andrew