Posted to dev@drill.apache.org by Adam Gilmore <dr...@gmail.com> on 2015/03/23 02:48:11 UTC

Partition pruning

Hi guys,

I'm trying to work on an issue I've raised with partition pruning:

https://issues.apache.org/jira/browse/DRILL-2287

Basically, because partition pruning is done after DrillPushProjIntoScan,
it seems we can't detect that dir0 (for example) does not actually need to
be projected when it's not in the SELECT clause (or GROUP BY etc.).

Moreover, I've run into a second issue: if I have, for example, 3
directories (2 with valid Parquet files and 1 with an invalid 0-byte
Parquet file), the query fails even when partition pruning restricts it to
only the valid directories, because it still tries to read the footer of
the invalid Parquet file.
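
To make the failure mode concrete, here is a minimal sketch in Python. It
is not Drill's planner code: read_footer, make_table and both plan_*
functions are hypothetical names, and the footer check is a stand-in that
only looks at the trailing "PAR1" magic bytes. It just shows that the order
of "read footers" and "prune by dir0" decides whether the 0-byte file is
ever touched.

```python
import os
import tempfile

MAGIC = b"PAR1"  # Parquet files begin and end with these four bytes


def read_footer(path):
    # Stand-in for real footer parsing: only checks length and trailing magic.
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 2 * len(MAGIC) or not data.endswith(MAGIC):
        raise IOError(f"not a valid Parquet file: {path}")
    return {"path": path, "size": len(data)}


def make_table(root):
    # Hypothetical layout: 2013/ and 2014/ hold "valid" files,
    # 2015/ holds an invalid 0-byte file.
    files = []
    for dir0, payload in [("2013", MAGIC + b"\x00" * 4 + MAGIC),
                          ("2014", MAGIC + b"\x00" * 4 + MAGIC),
                          ("2015", b"")]:
        os.makedirs(os.path.join(root, dir0))
        path = os.path.join(root, dir0, "data.parquet")
        with open(path, "wb") as f:
            f.write(payload)
        files.append((dir0, path))
    return files


def plan_footers_then_prune(files, wanted_dir0):
    # Reads every footer first, so the 0-byte file fails planning
    # even though its directory would be pruned afterwards.
    footers = [(dir0, read_footer(path)) for dir0, path in files]
    return [footer for dir0, footer in footers if dir0 in wanted_dir0]


def plan_prune_then_footers(files, wanted_dir0):
    # Prunes on the directory name first; pruned files are never opened.
    return [read_footer(path) for dir0, path in files if dir0 in wanted_dir0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        table = make_table(root)
        print(len(plan_prune_then_footers(table, {"2013", "2014"})))
        try:
            plan_footers_then_prune(table, {"2013", "2014"})
        except IOError as exc:
            print("planning failed:", exc)
```

In this toy model only the prune-first ordering succeeds, which is the
behaviour I'd expect from the planner.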

It really feels like the partition pruning should be done before the
DrillPushProjIntoScan.

I know Jacques has just done some work on moving the partition pruning, so
I thought I'd open the discussion here first before making too many
inroads into it.

I do feel that if we're partition pruning, we shouldn't even try to read
any of the pruned directories during the planning stage.  Furthermore, it
doesn't make sense to prune the files being scanned but still keep a Filter
operation in the query plan and project dir0 throughout it if it's not
needed.  The latter is why the queries end up being a lot slower.
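
As a rough illustration of the plan shape I have in mind, here is a small
Python sketch. The Scan and Filter classes and the prune function are
hypothetical and far simpler than Drill's real plan nodes; the sketch
assumes the only predicate is an equality on dir0, so pruning satisfies it
completely. Under that assumption the Filter can be dropped, and dir0 only
stays in the scan's columns if the query actually selects it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    files: tuple    # (dir0, path) pairs the scan will read
    columns: tuple  # columns the scan projects


@dataclass(frozen=True)
class Filter:
    dir0_value: str  # the predicate is assumed to be: dir0 = dir0_value
    child: Scan


def prune(plan, selected_columns):
    """Fold a dir0 equality Filter into the Scan's file list.

    Because the predicate only references the partition column, pruning
    satisfies it completely: the Filter is removed, and dir0 is projected
    only if the query itself selects it.
    """
    if not isinstance(plan, Filter):
        return plan
    scan = plan.child
    kept = tuple(f for f in scan.files if f[0] == plan.dir0_value)
    columns = tuple(
        c for c in scan.columns if c != "dir0" or "dir0" in selected_columns
    )
    return Scan(files=kept, columns=columns)


if __name__ == "__main__":
    plan = Filter(
        "2015",
        Scan(files=(("2014", "a.parquet"), ("2015", "b.parquet")),
             columns=("dir0", "amount")),
    )
    print(prune(plan, selected_columns=("amount",)))
```

A predicate that mixes dir0 with regular columns would still need a Filter
for the non-partition part; the sketch deliberately ignores that case.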

Thoughts?

Re: Partition pruning

Posted by Adam Gilmore <dr...@gmail.com>.
Hi Aman,

I've also created a second issue for the invalid 0-length Parquet files not
being pruned out:

https://issues.apache.org/jira/browse/DRILL-2517

I've done a bit of work on resolving it but need some input to see if I'm
going down the right path.

On Mon, Mar 23, 2015 at 12:54 PM, Aman Sinha <as...@maprtech.com> wrote:

> Hi Adam,
> I will update DRILL-2287 with some comments because it has more context
> than this discussion thread.  We can continue the discussion there.  The
> issue of the invalid 0 length parquet files being read sounds like a
> different issue.
>
> Aman

Re: Partition pruning

Posted by Aman Sinha <as...@maprtech.com>.
Hi Adam,
I will update DRILL-2287 with some comments because it has more context
than this discussion thread.  We can continue the discussion there.  The
issue of the invalid 0-length Parquet files being read sounds like a
different issue.

Aman
