Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2020/01/24 04:38:31 UTC

[Format] Array/RowBatch filters

One of the things that I think got overlooked in the conversation on having
a slice offset in the C API was a suggestion from Jacques of perhaps
generalizing the concept to an arbitrary "filter" for arrays/record batches.

I believe this point has been discussed in the past as well.  I'm not
advocating for adding it now, but I'm curious whether people feel we should
add something to Schema.fbs for forward compatibility, in case we wish to
support this use-case in the future.

Thanks,
Micah

Re: [Format] Array/RowBatch filters

Posted by Micah Kornfield <em...@gmail.com>.
Thanks for all the input:

> I think having support for this in some way in the IPC
> protocol makes sense (it seems slightly less important for the C API
> but worth thinking about

The way I read Jacques's e-mail, the opposite might be true
(at least for Dremio).  For IPC I think there is probably a sweet spot
where it doesn't pay to compact the batches, but finding it would likely
take some tuning.


> The question is how mechanically, would it be some extra buffers at
> the start or end of the record batch body (probably have to be at the
> end of the body for forward compatibility reasons)?

I think for RecordBatch it would be an extra buffer either at the beginning
or the end.  It's possible that putting it at the end would allow better
forwards compatibility.  I haven't really given much thought to the design
here.  My main concern is to define appropriate metadata before 1.0.0 to
maintain forwards compatibility.  My thinking is that the metadata would be
an enum or nullable table whose default indicates "no filters".
Implementations could then determine from the metadata whether they know
how to interpret the corresponding buffers correctly.
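
A minimal sketch of that metadata check, in Python for brevity.  The
FilterType enum, its values, and BatchMetadata are hypothetical names used
only for illustration; nothing like them exists in Schema.fbs today.

```python
from dataclasses import dataclass
from enum import IntEnum


class FilterType(IntEnum):
    """Hypothetical placeholder enum; not part of Schema.fbs."""
    NONE = 0                  # default: body is laid out exactly as today
    BITMAP = 1                # illustrative future value
    SELECTION_VECTOR_16 = 2   # illustrative future value


@dataclass
class BatchMetadata:
    """Stand-in for record batch metadata (assumed shape)."""
    length: int
    filter_type: int = FilterType.NONE  # absent field reads as "no filter"


# What this particular (imaginary) reader knows how to interpret.
SUPPORTED = {FilterType.NONE, FilterType.BITMAP}


def can_read(meta, supported=SUPPORTED):
    """True if the reader understands the filter declared in the metadata."""
    return meta.filter_type in supported


unfiltered = BatchMetadata(length=10)
bitmap_filtered = BatchMetadata(length=10, filter_type=FilterType.BITMAP)
from_the_future = BatchMetadata(length=10, filter_type=7)  # unknown value
```

Here can_read() accepts the first two batches and rejects the third, so a
reader can refuse a filter it does not recognize instead of misreading the
buffers.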

I can try to put up a straw-man PR for metadata if we think this is worth
pursuing further.

Thanks,
Micah

P.S. This also raises a slightly related concern about letting applications
negotiate "capabilities" at a finer-grained level (e.g. letting the
transmitter know that the receiver only supports unfiltered values).
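
One way such a negotiation could look, as a rough Python sketch.  The
capability string "selection_vector" and both function names are
assumptions made for this example, not part of any Arrow API.

```python
def compact(values, selection):
    """Materialize only the selected values, in selection order."""
    return [values[i] for i in selection]


def prepare_for_send(values, selection, receiver_capabilities):
    """Return (values, selection) in a form the receiver can consume.

    If the receiver did not advertise filter support, the batch is
    compacted and sent unfiltered.
    """
    if selection is None:
        return values, None
    if "selection_vector" in receiver_capabilities:
        return values, selection           # ship the filter as-is
    return compact(values, selection), None


values = [10, 20, 30, 40]
selection = [3, 0]

kept = prepare_for_send(values, selection, {"selection_vector"})
flattened = prepare_for_send(values, selection, set())
```

With these inputs, kept still carries the selection, while flattened holds
only the two selected values and no filter.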

On Mon, Jan 27, 2020 at 8:34 PM Wes McKinney <we...@gmail.com> wrote:

> hi Micah -- I think having support for this in some way in the IPC
> protocol makes sense (it seems slightly less important for the C API
> but worth thinking about). It's helpful to know that Dremio (a big
> Arrow user) already employs various filters / selection vectors.
>
> The question is how mechanically, would it be some extra buffers at
> the start or end of the record batch body (probably have to be at the
> end of the body for forward compatibility reasons)?
>
> On Sun, Jan 26, 2020 at 1:16 PM Jacques Nadeau <ja...@apache.org> wrote:
> >
> > At Dremio, we use four main types of selection vector/bitmaps:
> >
> > Dense Format (record valid or not, no ordering)
> > - single bit (bitmap)
> >
> > Sparse formats (identifies valid records as well as their order)
> > - 2 byte (for record batches up to 2^16 records).
> > - 4 byte (for 2^16 batches of 2^16 records);
> > - 6 byte (for 2^32 batches of 2^16 records);
> >
> > We've considered introducing a couple more. I imagine for other use
> cases,
> > where people use much larger batches of records, different requirements
> > would be necessary. My reason for sharing is it seems like this may be
> > use-case specific. I'd also note that at the IPC level, you'd generally
> > want to contract batches before dropping them on the wire (or at least
> that
> > is what we typically do).
> >
> > On Fri, Jan 24, 2020 at 11:23 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > I was thinking selection vector/bitmap (possibly with different
> encodings),
> > > but really nothing for now.  Ordinarily, I'd lean towards YAGNI but
> there
> > > isn't a good way to add this in easily in a forward compatible way
> unless
> > > we add a placeholder enum/table for 1.0 (the default option would be no
> > > filter and wouldn't change the packaged data at all).
> > >
> > > On Fri, Jan 24, 2020 at 4:55 AM Francois Saint-Jacques <
> > > fsaintjacques@gmail.com> wrote:
> > >
> > > > By filter, you mean a filter expression, or a selection
> vector/bitmap?
> > > >
> > > > On Thu, Jan 23, 2020 at 11:38 PM Micah Kornfield <
> emkornfield@gmail.com>
> > > > wrote:
> > > > >
> > > > > One of the things that I think got overlooked in the conversation
> on
> > > > having
> > > > > a slice offset in the C API was a suggestion from Jacques of
> perhaps
> > > > > generalizing the concept to an arbitrary "filter" for arrays/record
> > > > batches.
> > > > >
> > > > > I believe this point was also discussed in the past as well.  I'm
> not
> > > > > advocating for adding it now but I'm curious if people feel we
> should
> > > add
> > > > > something to Schema.fbs for forward compatibility,  in case we
> wish to
> > > > > support this use-case in the future.
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > >
> > >
>

Re: [Format] Array/RowBatch filters

Posted by Wes McKinney <we...@gmail.com>.
hi Micah -- I think having support for this in some way in the IPC
protocol makes sense (it seems slightly less important for the C API
but worth thinking about). It's helpful to know that Dremio (a big
Arrow user) already employs various filters / selection vectors.

The question is how this would work mechanically: would it be some extra
buffers at the start or end of the record batch body (they would probably
have to be at the end of the body for forward compatibility reasons)?
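
To illustrate the forward compatibility point, here is a toy Python model
in which the body is just a list of buffers and an older reader takes the
number of buffers its schema tells it to expect.  This is an assumed
simplification, not the actual IPC reader logic.

```python
def read_columns(buffers, expected):
    """Toy older reader: consume only the buffers the schema calls for."""
    return buffers[:expected]


column_buffers = [b"validity", b"offsets", b"data"]
filter_buffer = b"selection"   # hypothetical extra filter buffer

appended = column_buffers + [filter_buffer]
prepended = [filter_buffer] + column_buffers

old_reader_appended = read_columns(appended, len(column_buffers))
old_reader_prepended = read_columns(prepended, len(column_buffers))
```

With the filter appended, the older reader still sees the column buffers at
their usual positions; with it prepended, every position shifts by one.
Note that in the appended case the older reader would return the rows
unfiltered, so the metadata still has to say that a filter is present.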

On Sun, Jan 26, 2020 at 1:16 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> At Dremio, we use four main types of selection vector/bitmaps:
>
> Dense Format (record valid or not, no ordering)
> - single bit (bitmap)
>
> Sparse formats (identifies valid records as well as their order)
> - 2 byte (for record batches up to 2^16 records).
> - 4 byte (for 2^16 batches of 2^16 records);
> - 6 byte (for 2^32 batches of 2^16 records);
>
> We've considered introducing a couple more. I imagine for other use cases,
> where people use much larger batches of records, different requirements
> would be necessary. My reason for sharing is it seems like this may be
> use-case specific. I'd also note that at the IPC level, you'd generally
> want to contract batches before dropping them on the wire (or at least that
> is what we typically do).
>
> On Fri, Jan 24, 2020 at 11:23 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > I was thinking selection vector/bitmap (possibly with different encodings),
> > but really nothing for now.  Ordinarily, I'd lean towards YAGNI but there
> > isn't a good way to add this in easily in a forward compatible way unless
> > we add a placeholder enum/table for 1.0 (the default option would be no
> > filter and wouldn't change the packaged data at all).
> >
> > On Fri, Jan 24, 2020 at 4:55 AM Francois Saint-Jacques <
> > fsaintjacques@gmail.com> wrote:
> >
> > > By filter, you mean a filter expression, or a selection vector/bitmap?
> > >
> > > On Thu, Jan 23, 2020 at 11:38 PM Micah Kornfield <em...@gmail.com>
> > > wrote:
> > > >
> > > > One of the things that I think got overlooked in the conversation on
> > > having
> > > > a slice offset in the C API was a suggestion from Jacques of perhaps
> > > > generalizing the concept to an arbitrary "filter" for arrays/record
> > > batches.
> > > >
> > > > I believe this point was also discussed in the past as well.  I'm not
> > > > advocating for adding it now but I'm curious if people feel we should
> > add
> > > > something to Schema.fbs for forward compatibility,  in case we wish to
> > > > support this use-case in the future.
> > > >
> > > > Thanks,
> > > > Micah
> > >
> >

Re: [Format] Array/RowBatch filters

Posted by Jacques Nadeau <ja...@apache.org>.
At Dremio, we use four main types of selection vector/bitmaps:

Dense Format (record valid or not, no ordering)
- single bit (bitmap)

Sparse formats (identifies valid records as well as their order)
- 2 byte (for record batches up to 2^16 records)
- 4 byte (for 2^16 batches of 2^16 records)
- 6 byte (for 2^32 batches of 2^16 records)

We've considered introducing a couple more. I imagine for other use cases,
where people use much larger batches of records, different requirements
would be necessary. My reason for sharing is it seems like this may be
use-case specific. I'd also note that at the IPC level, you'd generally
want to contract batches before dropping them on the wire (or at least that
is what we typically do).
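
The difference between the dense and sparse formats above can be sketched
in Python as follows.  The function names, the little-endian byte order,
the LSB-first bit order, and the (batch index, record index) layout of the
4-byte entries are all assumptions for illustration, not Dremio's actual
encodings.

```python
import struct


def dense_bitmap(selected, num_records):
    """One bit per record (LSB-first in each byte); no ordering is kept."""
    bits = bytearray((num_records + 7) // 8)
    for i in selected:
        bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)


def sparse_sv2(selected):
    """2-byte unsigned indices; the order of `selected` is preserved."""
    return struct.pack(f"<{len(selected)}H", *selected)


def sparse_sv4(selected):
    """4-byte entries: 2-byte batch index followed by 2-byte record index."""
    out = bytearray()
    for batch, record in selected:
        out += struct.pack("<HH", batch, record)
    return bytes(out)


def apply_sv2(values, sv2):
    """Materialize the selected values in selection order."""
    count = len(sv2) // 2
    return [values[i] for i in struct.unpack(f"<{count}H", sv2)]


values = ["a", "b", "c", "d", "e"]
bitmap = dense_bitmap([1, 3, 4], len(values))  # one byte, bits 1, 3, 4 set
sv2 = sparse_sv2([4, 1, 3])                    # same records, explicit order
picked = apply_sv2(values, sv2)                # ['e', 'b', 'd']
sv4 = sparse_sv4([(0, 4), (1, 2)])             # records from two batches
```

The bitmap costs one bit per record regardless of how many are selected,
while the sparse forms cost 2 or 4 bytes per selected record but can also
express a reordering.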

On Fri, Jan 24, 2020 at 11:23 PM Micah Kornfield <em...@gmail.com>
wrote:

> I was thinking selection vector/bitmap (possibly with different encodings),
> but really nothing for now.  Ordinarily, I'd lean towards YAGNI but there
> isn't a good way to add this in easily in a forward compatible way unless
> we add a placeholder enum/table for 1.0 (the default option would be no
> filter and wouldn't change the packaged data at all).
>
> On Fri, Jan 24, 2020 at 4:55 AM Francois Saint-Jacques <
> fsaintjacques@gmail.com> wrote:
>
> > By filter, you mean a filter expression, or a selection vector/bitmap?
> >
> > On Thu, Jan 23, 2020 at 11:38 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> > >
> > > One of the things that I think got overlooked in the conversation on
> > having
> > > a slice offset in the C API was a suggestion from Jacques of perhaps
> > > generalizing the concept to an arbitrary "filter" for arrays/record
> > batches.
> > >
> > > I believe this point was also discussed in the past as well.  I'm not
> > > advocating for adding it now but I'm curious if people feel we should
> add
> > > something to Schema.fbs for forward compatibility,  in case we wish to
> > > support this use-case in the future.
> > >
> > > Thanks,
> > > Micah
> >
>

Re: [Format] Array/RowBatch filters

Posted by Micah Kornfield <em...@gmail.com>.
I was thinking selection vector/bitmap (possibly with different encodings),
but really nothing for now.  Ordinarily, I'd lean towards YAGNI but there
isn't a good way to add this in easily in a forward compatible way unless
we add a placeholder enum/table for 1.0 (the default option would be no
filter and wouldn't change the packaged data at all).

On Fri, Jan 24, 2020 at 4:55 AM Francois Saint-Jacques <
fsaintjacques@gmail.com> wrote:

> By filter, you mean a filter expression, or a selection vector/bitmap?
>
> On Thu, Jan 23, 2020 at 11:38 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > One of the things that I think got overlooked in the conversation on
> having
> > a slice offset in the C API was a suggestion from Jacques of perhaps
> > generalizing the concept to an arbitrary "filter" for arrays/record
> batches.
> >
> > I believe this point was also discussed in the past as well.  I'm not
> > advocating for adding it now but I'm curious if people feel we should add
> > something to Schema.fbs for forward compatibility,  in case we wish to
> > support this use-case in the future.
> >
> > Thanks,
> > Micah
>

Re: [Format] Array/RowBatch filters

Posted by Francois Saint-Jacques <fs...@gmail.com>.
By filter, you mean a filter expression, or a selection vector/bitmap?

On Thu, Jan 23, 2020 at 11:38 PM Micah Kornfield <em...@gmail.com> wrote:
>
> One of the things that I think got overlooked in the conversation on having
> a slice offset in the C API was a suggestion from Jacques of perhaps
> generalizing the concept to an arbitrary "filter" for arrays/record batches.
>
> I believe this point was also discussed in the past as well.  I'm not
> advocating for adding it now but I'm curious if people feel we should add
> something to Schema.fbs for forward compatibility,  in case we wish to
> support this use-case in the future.
>
> Thanks,
> Micah