Posted to dev@drill.apache.org by Adam Gilmore <dr...@gmail.com> on 2015/12/02 07:10:33 UTC

Parquet pushdown filtering

Hi guys,

I'm trying to (re)implement pushdown filtering for Parquet with the new
Parquet metadata caching implementation.

I've run into a couple of challenges:

   1. Scan batches don't allow empty batches.  This means if a particular
   filter filters out *all* rows, we get an exception.  I haven't read the
   full comments on the relevant JIRA items, but it seems odd that we can't
   query an empty JSON file, for example.  This is a bit of a blocker to
   implement the pushdown filtering properly.
   2. The Parquet metadata doesn't include all the relevant metadata.
   Specifically, the count of values is not included, so the default
   Parquet statistics filter has issues: it compares the count of values
   with the count of nulls to work out whether it can drop a block.  This
   isn't necessarily a blocker, but it feels ugly simulating that there's
   "1" row in a block (just to get around the null comparison).

Also, it feels a bit ugly rehydrating the standard Parquet metadata objects
manually.  I'm not sure I understand why we created our own objects for the
Parquet metadata rather than simply writing a custom serializer for the
objects we store.

Thoughts would be great - I'd love to get a patch out for this.
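The pruning step being proposed in this thread can be sketched as follows (illustrative Python, not the actual Drill implementation; the data shapes and names are invented): keep any file whose min/max statistics could satisfy the filter, and note that the empty-result edge case from point 1 appears exactly when every file is pruned:

```python
def prune_files(files, column, value):
    """files: list of (name, {column: (min, max)}) pairs built from
    cached Parquet metadata.  Keep a file unless its statistics prove
    that no row can equal `value`."""
    kept = []
    for name, stats in files:
        rng = stats.get(column)
        if rng is None:
            kept.append(name)      # no stats: cannot safely eliminate
        else:
            lo, hi = rng
            if lo <= value <= hi:
                kept.append(name)  # range overlaps the literal: keep
    return kept

files = [
    ("part-0.parquet", {"amount": (1, 10)}),
    ("part-1.parquet", {"amount": (20, 30)}),
]
# A filter on amount = 25 eliminates part-0 but keeps part-1.  A filter
# on amount = 50 eliminates every file, leaving the scan with no file to
# take a schema from -- the empty-batch problem in point 1.
```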

Re: Parquet pushdown filtering

Posted by Adam Gilmore <dr...@gmail.com>.
Shall we say 10am my time, 4pm your time?

On Sunday, 13 December 2015, Julien Le Dem <ju...@dremio.com> wrote:

> Tuesday morning in Australia, Monday afternoon in California sounds good
> to me.
>
> On Fri, Dec 11, 2015 at 11:42 AM, Parth Chandra <parthc@apache.org> wrote:
>
> > I'd like to attend as well.  Any time that works for Julien/Jason works
> > for me.
> >
> > On Thu, Dec 10, 2015 at 6:15 PM, Adam Gilmore <dragoncurve@gmail.com>
> > wrote:
> >
> > > Could we say Monday or Tuesday next week?  I'm actually ahead of you
> > > guys by about 18 hours, so Monday morning my time would be Sunday
> > > afternoon/evening for you.  If that doesn't work, what about Tuesday
> > > morning my time - Monday afternoon/evening your time?
> > >
> > > On Fri, Dec 11, 2015 at 1:30 AM, Jason Altekruse
> > > <altekrusejason@gmail.com> wrote:
> > >
> > > > I can also join for this meeting; Julien and I are both on SF time.
> > > > Looks like you are about 5-6 hours behind us, so depending on
> > > > whether you would prefer morning or afternoon we'll just be a
> > > > little further into our days.
> > > >
> > > > On Wed, Dec 9, 2015 at 7:16 PM, Adam Gilmore
> > > > <dragoncurve@gmail.com> wrote:
> > > >
> > > > > Sure - I'm in Australia so I'm not sure how the timezones will
> > > > > work for you guys, but I'm pretty flexible.  Where are you
> > > > > located?
> > > > >
> > > > > On Wed, Dec 9, 2015 at 5:48 AM, Julien Le Dem <julien@dremio.com>
> > > > > wrote:
> > > > >
> > > > > > Adam: do you want to schedule a hangout?
> > > > > >
> > > > > > On Tue, Dec 8, 2015 at 4:59 AM, Adam Gilmore
> > > > > > <dragoncurve@gmail.com> wrote:
> > > > > >
> > > > > > > That makes sense, yep.  The problem, I guess, is with my
> > > > > > > implementation.  I will iterate through all Parquet files
> > > > > > > and try to eliminate ones where the filter conflicts with
> > > > > > > the statistics.  In instances where no files match the
> > > > > > > filter, I end up with an empty set of files for the Parquet
> > > > > > > scan to iterate through.  I suppose I could just pick the
> > > > > > > schema of the first file or something, but that seems like
> > > > > > > a pretty messy rule.
> > > > > > >
> > > > > > > Julien - I'd be happy to have a chat about this.  I've
> > > > > > > pretty much got the implementation down, but need to solve
> > > > > > > a few of these little issues.
> > > > > > >
> > > > > > > On Fri, Dec 4, 2015 at 5:22 AM, Hanifi GUNES
> > > > > > > <hanifigunes@gmail.com> wrote:
> > > > > > >
> > > > > > > > Regarding your point #1: I guess Daniel struggled with
> > > > > > > > this limitation as well.  I merged a few of his patches
> > > > > > > > which addressed empty batch (no data) handling in various
> > > > > > > > places during execution.  That said, however, we still
> > > > > > > > have not had time to develop a solid way to handle empty
> > > > > > > > batches with no schema.
> > > > > > > >
> > > > > > > > *- Scan batches don't allow empty batches.  This means if
> > > > > > > > a particular filter filters out *all* rows, we get an
> > > > > > > > exception.*
> > > > > > > > It looks to me like you are referring to no data rather
> > > > > > > > than no schema here.  I would expect graceful execution
> > > > > > > > in this case.  Do you mind sharing a simple reproduction?
> > > > > > > >
> > > > > > > > -Hanifi
> > > > > > > >
> > > > > > > > 2015-12-03 10:56 GMT-08:00 Julien Le Dem
> > > > > > > > <julien@dremio.com>:
> > > > > > > >
> > > > > > > > > Hey Adam,
> > > > > > > > > If you have questions about the Parquet side of things,
> > > > > > > > > I'm happy to chat.
> > > > > > > > > Julien
> > > > > > > > >
> > > > > > > > > On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra
> > > > > > > > > <parthc@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Parquet metadata has the rowCount for every rowGroup,
> > > > > > > > > > which is also the value count for every column in the
> > > > > > > > > > rowGroup.  Isn't that what you need?
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Julien
> > > > > >
> > > > > > --
> > > > > > Julien
>
> --
> Julien

Re: Parquet pushdown filtering

Posted by Julien Le Dem <ju...@dremio.com>.
Tuesday morning in Australia, Monday afternoon in California sounds good to
me.

-- 
Julien

Re: Parquet pushdown filtering

Posted by Parth Chandra <pa...@apache.org>.
I'd like to attend as well. Any time that works for Julien/Jason works for
me.

Re: Parquet pushdown filtering

Posted by Adam Gilmore <dr...@gmail.com>.
Could we say Monday or Tuesday next week?  I'm actually ahead of you guys
by about 18 hours, so Monday morning my time would be Sunday
afternoon/evening for you.  If that doesn't work, what about Tuesday
morning my time - Monday afternoon/evening your time?

Re: Parquet pushdown filtering

Posted by Jason Altekruse <al...@gmail.com>.
I can also join for this meeting; Julien and I are both on SF time.  Looks
like you are about 5-6 hours behind us, so depending on whether you would
prefer morning or afternoon we'll just be a little further into our days.

Re: Parquet pushdown filtering

Posted by Adam Gilmore <dr...@gmail.com>.
Sure - I'm in Australia so I'm not sure how the timezones will work for
you guys, but I'm pretty flexible.  Where are you located?

> the
> > > new
> > > > > > Parquet metadata caching implementation.
> > > > > >
> > > > > > I've run into a couple of challenges:
> > > > > >
> > > > > >    1. Scan batches don't allow empty batches.  This means if a
> > > > particular
> > > > > >    filter filters out *all* rows, we get an exception.  I haven't
> > > read
> > > > > the
> > > > > >    full comments on the relevant JIRA items, but it seems odd
> that
> > we
> > > > > can't
> > > > > >    query an empty JSON file, for example.  This is a bit of a
> > blocker
> > > > to
> > > > > >    implement the pushdown filtering properly.
> > > > > >    2. The Parquet metadata doesn't include all the relevant
> > metadata.
> > > > > >    Specifically, count of values is not included, therefore the
> > > default
> > > > > >    Parquet statistics filter has issues because it compares the
> > count
> > > > of
> > > > > >    values with count of nulls to work out if it can drop it.
> This
> > > > isn't
> > > > > >    necessarily a blocker, but it feels ugly simulating there's
> "1"
> > > row
> > > > > in a
> > > > > >    block (just to get around the null comparison).
> > > > > >
> > > > > > Also, it feels a bit ugly rehydrating the standard Parquet
> metadata
> > > > > objects
> > > > > > manually.  I'm not sure I understand why we created our own
> objects
> > > for
> > > > > the
> > > > > > Parquet metadata as opposed to simply writing a custom serializer
> > for
> > > > > those
> > > > > > objects which we store.
> > > > > >
> > > > > > Thoughts would be great - I'd love to get a patch out for this.
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Julien
> > > >
> > >
> >
>
>
>
> --
> Julien
>

Re: Parquet pushdown filtering

Posted by Julien Le Dem <ju...@dremio.com>.
Adam: do you want to schedule a hangout?

On Tue, Dec 8, 2015 at 4:59 AM, Adam Gilmore <dr...@gmail.com> wrote:

> That makes sense, yep.  The problem is I guess with my implementation.  I
> will iterate through all Parquet files and try to eliminate ones where the
> filter conflicts with the statistics.  In instances where no files match
> the filter, I end up with an empty set of files for the Parquet scan to
> iterate through.  I suppose I could just pick the schema of the first file
> or something, but that seems like a pretty messy rule.
>
> Julien - I'd be happy to have a chat about this.  I've pretty much got the
> implementation down, but need to solve a few of these little issues.
>
>
> On Fri, Dec 4, 2015 at 5:22 AM, Hanifi GUNES <ha...@gmail.com>
> wrote:
>
> > Regarding your point  #1. I guess Daniel struggled with this limitation
> as
> > well. I merged few of his patches which addressed empty batch(no data)
> > handling in various places during execution. That said, however, we still
> > could not have time to develop a solid way to handle empty batches with
> no
> > schema.
> >
> > *- Scan batches don't allow empty batches.  This means if a
> > particular filter filters out *all* rows, we get an exception.*
> > Looks to me, you are referring to no data rather than no schema here. I
> > would expect graceful execution in this case. Do you mind sharing a
> simple
> > reproduction?
> >
> >
> > -Hanifi
> >
> > 2015-12-03 10:56 GMT-08:00 Julien Le Dem <ju...@dremio.com>:
> >
> > > Hey Adam,
> > > If you have questions about the Parquet side of things, I'm happy to
> > chat.
> > > Julien
> > >
> > > On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <pa...@apache.org>
> > wrote:
> > >
> > > > Parquet metadata has the rowCount for every rowGroup which is also
> the
> > > > value count for every column in the rowGroup. Isn't that what you
> need?
> > > >
> > > > On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dragoncurve@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > I'm trying to (re)implement pushdown filtering for Parquet with the
> > new
> > > > > Parquet metadata caching implementation.
> > > > >
> > > > > I've run into a couple of challenges:
> > > > >
> > > > >    1. Scan batches don't allow empty batches.  This means if a
> > > particular
> > > > >    filter filters out *all* rows, we get an exception.  I haven't
> > read
> > > > the
> > > > >    full comments on the relevant JIRA items, but it seems odd that
> we
> > > > can't
> > > > >    query an empty JSON file, for example.  This is a bit of a
> blocker
> > > to
> > > > >    implement the pushdown filtering properly.
> > > > >    2. The Parquet metadata doesn't include all the relevant
> metadata.
> > > > >    Specifically, count of values is not included, therefore the
> > default
> > > > >    Parquet statistics filter has issues because it compares the
> count
> > > of
> > > > >    values with count of nulls to work out if it can drop it.  This
> > > isn't
> > > > >    necessarily a blocker, but it feels ugly simulating there's "1"
> > row
> > > > in a
> > > > >    block (just to get around the null comparison).
> > > > >
> > > > > Also, it feels a bit ugly rehydrating the standard Parquet metadata
> > > > objects
> > > > > manually.  I'm not sure I understand why we created our own objects
> > for
> > > > the
> > > > > Parquet metadata as opposed to simply writing a custom serializer
> for
> > > > those
> > > > > objects which we store.
> > > > >
> > > > > Thoughts would be great - I'd love to get a patch out for this.
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Julien
> > >
> >
>



-- 
Julien

Re: Parquet pushdown filtering

Posted by Adam Gilmore <dr...@gmail.com>.
That makes sense, yep.  The problem is I guess with my implementation.  I
will iterate through all Parquet files and try to eliminate ones where the
filter conflicts with the statistics.  In instances where no files match
the filter, I end up with an empty set of files for the Parquet scan to
iterate through.  I suppose I could just pick the schema of the first file
or something, but that seems like a pretty messy rule.

Julien - I'd be happy to have a chat about this.  I've pretty much got the
implementation down, but need to solve a few of these little issues.
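
The pruning rule described above can be sketched roughly as follows. The
class and method names here are hypothetical, not Drill's actual planner
types; the point is the min/max check and the edge case where every file
is eliminated:

```java
// Hypothetical sketch of min/max pruning for an equality filter.
// ColumnStats stands in for the per-file statistics kept in the
// Parquet metadata cache; names are illustrative only.
import java.util.ArrayList;
import java.util.List;

class ColumnStats {
    final long min, max;
    ColumnStats(long min, long max) { this.min = min; this.max = max; }
}

class FilePruner {
    // A file can be dropped for the predicate "col = v" when v lies
    // entirely outside the file's [min, max] range for that column.
    static boolean canDrop(ColumnStats stats, long v) {
        return v < stats.min || v > stats.max;
    }

    // Keep only the files whose statistics might match the predicate.
    // Note the edge case from the thread: this list may come back
    // empty, leaving the scan with no file to take a schema from.
    static List<String> prune(List<String> files, List<ColumnStats> stats, long v) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < files.size(); i++) {
            if (!canDrop(stats.get(i), v)) kept.add(files.get(i));
        }
        return kept;
    }
}
```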


On Fri, Dec 4, 2015 at 5:22 AM, Hanifi GUNES <ha...@gmail.com> wrote:

> Regarding your point  #1. I guess Daniel struggled with this limitation as
> well. I merged few of his patches which addressed empty batch(no data)
> handling in various places during execution. That said, however, we still
> could not have time to develop a solid way to handle empty batches with no
> schema.
>
> *- Scan batches don't allow empty batches.  This means if a
> particular filter filters out *all* rows, we get an exception.*
> Looks to me, you are referring to no data rather than no schema here. I
> would expect graceful execution in this case. Do you mind sharing a simple
> reproduction?
>
>
> -Hanifi
>
> 2015-12-03 10:56 GMT-08:00 Julien Le Dem <ju...@dremio.com>:
>
> > Hey Adam,
> > If you have questions about the Parquet side of things, I'm happy to
> chat.
> > Julien
> >
> > On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <pa...@apache.org>
> wrote:
> >
> > > Parquet metadata has the rowCount for every rowGroup which is also the
> > > value count for every column in the rowGroup. Isn't that what you need?
> > >
> > > On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dr...@gmail.com>
> > > wrote:
> > >
> > > > Hi guys,
> > > >
> > > > I'm trying to (re)implement pushdown filtering for Parquet with the
> new
> > > > Parquet metadata caching implementation.
> > > >
> > > > I've run into a couple of challenges:
> > > >
> > > >    1. Scan batches don't allow empty batches.  This means if a
> > particular
> > > >    filter filters out *all* rows, we get an exception.  I haven't
> read
> > > the
> > > >    full comments on the relevant JIRA items, but it seems odd that we
> > > can't
> > > >    query an empty JSON file, for example.  This is a bit of a blocker
> > to
> > > >    implement the pushdown filtering properly.
> > > >    2. The Parquet metadata doesn't include all the relevant metadata.
> > > >    Specifically, count of values is not included, therefore the
> default
> > > >    Parquet statistics filter has issues because it compares the count
> > of
> > > >    values with count of nulls to work out if it can drop it.  This
> > isn't
> > > >    necessarily a blocker, but it feels ugly simulating there's "1"
> row
> > > in a
> > > >    block (just to get around the null comparison).
> > > >
> > > > Also, it feels a bit ugly rehydrating the standard Parquet metadata
> > > objects
> > > > manually.  I'm not sure I understand why we created our own objects
> for
> > > the
> > > > Parquet metadata as opposed to simply writing a custom serializer for
> > > those
> > > > objects which we store.
> > > >
> > > > Thoughts would be great - I'd love to get a patch out for this.
> > > >
> > >
> >
> >
> >
> > --
> > Julien
> >
>

Re: Parquet pushdown filtering

Posted by Hanifi GUNES <ha...@gmail.com>.
Regarding your point #1: I guess Daniel struggled with this limitation as
well. I merged a few of his patches, which addressed empty batch (no data)
handling in various places during execution. That said, however, we still
have not had time to develop a solid way to handle empty batches with no
schema.

*- Scan batches don't allow empty batches.  This means if a
particular filter filters out *all* rows, we get an exception.*
It looks to me like you are referring to no data rather than no schema
here. I would expect graceful execution in this case. Do you mind sharing
a simple reproduction?


-Hanifi
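
To make the no-data vs. no-schema distinction concrete, here is a minimal
sketch. The enum and class are hypothetical, loosely modeled on an
iterator-style scan protocol rather than Drill's actual RecordBatch API:

```java
// Hypothetical sketch of "no data" vs "no schema".
// A scan that pruned every file knows nothing about the schema and has
// nothing to report downstream, whereas a scan over an empty file with a
// known footer schema can emit a zero-row schema batch and end cleanly.
enum Outcome { OK_NEW_SCHEMA, NONE, FAIL }

class Scan {
    final boolean hasSchema;
    private boolean schemaSent = false;
    Scan(boolean hasSchema) { this.hasSchema = hasSchema; }

    Outcome next() {
        if (!hasSchema) return Outcome.FAIL;  // nothing to describe downstream
        if (!schemaSent) {
            schemaSent = true;
            return Outcome.OK_NEW_SCHEMA;     // schema-only, zero rows
        }
        return Outcome.NONE;                  // clean end of stream
    }
}
```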

2015-12-03 10:56 GMT-08:00 Julien Le Dem <ju...@dremio.com>:

> Hey Adam,
> If you have questions about the Parquet side of things, I'm happy to chat.
> Julien
>
> On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <pa...@apache.org> wrote:
>
> > Parquet metadata has the rowCount for every rowGroup which is also the
> > value count for every column in the rowGroup. Isn't that what you need?
> >
> > On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dr...@gmail.com>
> > wrote:
> >
> > > Hi guys,
> > >
> > > I'm trying to (re)implement pushdown filtering for Parquet with the new
> > > Parquet metadata caching implementation.
> > >
> > > I've run into a couple of challenges:
> > >
> > >    1. Scan batches don't allow empty batches.  This means if a
> particular
> > >    filter filters out *all* rows, we get an exception.  I haven't read
> > the
> > >    full comments on the relevant JIRA items, but it seems odd that we
> > can't
> > >    query an empty JSON file, for example.  This is a bit of a blocker
> to
> > >    implement the pushdown filtering properly.
> > >    2. The Parquet metadata doesn't include all the relevant metadata.
> > >    Specifically, count of values is not included, therefore the default
> > >    Parquet statistics filter has issues because it compares the count
> of
> > >    values with count of nulls to work out if it can drop it.  This
> isn't
> > >    necessarily a blocker, but it feels ugly simulating there's "1" row
> > in a
> > >    block (just to get around the null comparison).
> > >
> > > Also, it feels a bit ugly rehydrating the standard Parquet metadata
> > objects
> > > manually.  I'm not sure I understand why we created our own objects for
> > the
> > > Parquet metadata as opposed to simply writing a custom serializer for
> > those
> > > objects which we store.
> > >
> > > Thoughts would be great - I'd love to get a patch out for this.
> > >
> >
>
>
>
> --
> Julien
>

Re: Parquet pushdown filtering

Posted by Julien Le Dem <ju...@dremio.com>.
Hey Adam,
If you have questions about the Parquet side of things, I'm happy to chat.
Julien

On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <pa...@apache.org> wrote:

> Parquet metadata has the rowCount for every rowGroup which is also the
> value count for every column in the rowGroup. Isn't that what you need?
>
> On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > I'm trying to (re)implement pushdown filtering for Parquet with the new
> > Parquet metadata caching implementation.
> >
> > I've run into a couple of challenges:
> >
> >    1. Scan batches don't allow empty batches.  This means if a particular
> >    filter filters out *all* rows, we get an exception.  I haven't read
> the
> >    full comments on the relevant JIRA items, but it seems odd that we
> can't
> >    query an empty JSON file, for example.  This is a bit of a blocker to
> >    implement the pushdown filtering properly.
> >    2. The Parquet metadata doesn't include all the relevant metadata.
> >    Specifically, count of values is not included, therefore the default
> >    Parquet statistics filter has issues because it compares the count of
> >    values with count of nulls to work out if it can drop it.  This isn't
> >    necessarily a blocker, but it feels ugly simulating there's "1" row
> in a
> >    block (just to get around the null comparison).
> >
> > Also, it feels a bit ugly rehydrating the standard Parquet metadata
> objects
> > manually.  I'm not sure I understand why we created our own objects for
> the
> > Parquet metadata as opposed to simply writing a custom serializer for
> those
> > objects which we store.
> >
> > Thoughts would be great - I'd love to get a patch out for this.
> >
>



-- 
Julien

Re: Parquet pushdown filtering

Posted by Parth Chandra <pa...@apache.org>.
Parquet metadata has the rowCount for every rowGroup which is also the
value count for every column in the rowGroup. Isn't that what you need?
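
The null comparison at issue can be sketched like this. The class below is
a simplified stand-in for the per-row-group metadata (row count plus a
column's null count), not the parquet-mr API; field names are illustrative:

```java
// Simplified stand-in for per-row-group Parquet metadata.
class RowGroupMeta {
    final long rowCount;   // also the value count of every column in the group
    final long nullCount;  // nulls recorded for one particular column
    RowGroupMeta(long rowCount, long nullCount) {
        this.rowCount = rowCount;
        this.nullCount = nullCount;
    }

    // A statistics filter can drop a group for a not-null predicate only
    // when every value in the column is null, i.e. the null count equals
    // the value count. Faking a value count of 1 (as described earlier in
    // the thread) defeats exactly this check for any non-trivial group.
    boolean allNulls() { return nullCount == rowCount; }
}
```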

On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <dr...@gmail.com> wrote:

> Hi guys,
>
> I'm trying to (re)implement pushdown filtering for Parquet with the new
> Parquet metadata caching implementation.
>
> I've run into a couple of challenges:
>
>    1. Scan batches don't allow empty batches.  This means if a particular
>    filter filters out *all* rows, we get an exception.  I haven't read the
>    full comments on the relevant JIRA items, but it seems odd that we can't
>    query an empty JSON file, for example.  This is a bit of a blocker to
>    implement the pushdown filtering properly.
>    2. The Parquet metadata doesn't include all the relevant metadata.
>    Specifically, count of values is not included, therefore the default
>    Parquet statistics filter has issues because it compares the count of
>    values with count of nulls to work out if it can drop it.  This isn't
>    necessarily a blocker, but it feels ugly simulating there's "1" row in a
>    block (just to get around the null comparison).
>
> Also, it feels a bit ugly rehydrating the standard Parquet metadata objects
> manually.  I'm not sure I understand why we created our own objects for the
> Parquet metadata as opposed to simply writing a custom serializer for those
> objects which we store.
>
> Thoughts would be great - I'd love to get a patch out for this.
>