Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2021/08/04 00:59:12 UTC

Re: [DISCUSS] next iteration of flatbuffer structures

Another Flatbuffers/Message.fbs project we should rekindle soon, in
addition to the schema evolution/replacement question which has been
raised with Flight, is that of sparse/compressed data (e.g. RLE). I
have a vacation plus some travel coming up so won't be able to devote
meaningful attention to this until the last part of August, but would
like to help it move forward.


On Tue, Jul 27, 2021 at 1:40 PM David Li <li...@apache.org> wrote:
>
> Hey Nate,
>
> For the first two points, semantically I'm tempted to think of it more like the ability to send a "bag of columns" according to some schema (and hence columns could have differing lengths or even be absent). This could be a new structure alongside a record batch, which is semantically like a "slice of a table" (and hence rectangular and complete), instead of exposing existing users of RecordBatch to rather different behavior.
>
> For #3, a different thread was discussing some of the points there - it sounds like it may be possible to relax from map<string, string> to map<string, binary>.
>
> -David
>
> On Mon, Jul 26, 2021, at 11:01, Nate Bauernfeind wrote:
> > Wes suggested that maybe there are enough new ideas that it may make sense
> > to evolve past the existing structures rather than to bolt on new
> > functionality. I would like to learn what requirements exist should new
> > structures be adopted, and if applicable, would like to turn this into a
> > full POC proposal.
> >
> > These are the features that I feel are missing from the existing design:
> > - the ability to notify that the columns are not consistent in length (e.g.
> > setting RecordBatch.length to -1; and give the arrow/flight user the true
> > FieldNode lengths).
> > - the ability to skip top-level field nodes that have length 0 at a small
> > cost (such as in a bitset)
> > - the ability to embed binary payload in the Message flatbuffer wrapper
> > (instead of String payload only)
> > - the ability to concurrently use more than one schema (the most likely API
> > will look like how one identifies a dictionary; ideally, dictionaries could
> > be shared across field nodes in a schema or across schemas in the same
> > flight)
> >
> > What other features, or improvements, could/should be considered? Any
> > strong opinions against the ideas above? (Remember that a goal of mine is
> > to be able to send a RecordBatch containing only the rows that were
> > modified, restricted to the field nodes that have changed (including those
> > with only inner node changes); thus the columns are a subset of the full
> > schema, and the length of each node is independent of the others.)
> >
> > On Fri, Jul 9, 2021 at 9:26 AM Wes McKinney <we...@gmail.com> wrote:
> > > It sounds like we may want to discuss some potential evolutions of the
> > > Arrow binary protocol (for example: new Message types). Certainly a
> > > can of worms but rather than trying to bolt some new functionality
> > > onto the existing structures, it might be better to support the new
> > > use cases through some new structures which will be more clear cut
> > > from a forward compatibility standpoint.
> >
> > Nate
> >
> > --
> >

Re: [DISCUSS] next iteration of flatbuffer structures

Posted by David Li <li...@apache.org>.

Just following up here - what's the status? It looks like there are some unaddressed comments on the PR?

On Tue, Nov 23, 2021, at 13:54, Micah Kornfield wrote:
> Sorry, I just took a closer look and left some comments. I think the one
> substantive issue is that the linked document talks about columns of
> different lengths in the Bag, and this isn't mentioned in the flatbuffers.
> Could you comment on or update the documentation in the flatbuffers accordingly?
>
> Thanks,
> Micah
>

Re: [DISCUSS] next iteration of flatbuffer structures

Posted by Micah Kornfield <em...@gmail.com>.

Sorry, I just took a closer look and left some comments. I think the one
substantive issue is that the linked document talks about columns of
different lengths in the Bag, and this isn't mentioned in the flatbuffers.
Could you comment on or update the documentation in the flatbuffers accordingly?

Thanks,
Micah

On Tue, Nov 23, 2021 at 10:41 AM David Li <li...@apache.org> wrote:

> Thanks for putting that up.
>
> It doesn't look like there's been too much discussion here. If people
> agree it's useful, maybe the next step is to draft an implementation in
> Java or C++ for feedback? There was some discussion on the use cases in the
> document, do we feel like we need to clarify that better?
>
> -David
>

Re: [DISCUSS] next iteration of flatbuffer structures

Posted by David Li <li...@apache.org>.

Thanks for putting that up.

It doesn't look like there's been too much discussion here. If people agree it's useful, maybe the next step is to draft an implementation in Java or C++ for feedback? There was some discussion of the use cases in the document; do we feel like we need to clarify that better?

-David

On Mon, Nov 8, 2021, at 16:46, Nate Bauernfeind wrote:
> I put the draft up here: https://github.com/apache/arrow/pull/11646
> 
> Thanks.
> 

Re: [DISCUSS] next iteration of flatbuffer structures

Posted by Nate Bauernfeind <na...@deephaven.io>.

I put the draft up here: https://github.com/apache/arrow/pull/11646

Thanks.

On Mon, Nov 8, 2021 at 1:57 PM David Li <li...@apache.org> wrote:

> Hey Nate,
>
> Thanks for doing this! Would you be interested in putting that commit up
> as a draft PR for discussion? I think we can discuss there.
>
> I'm not sure anyone is actively working on RLE or other encoding schemes
> at the moment.
>
> -David
>

Re: [DISCUSS] next iteration of flatbuffer structures

Posted by David Li <li...@apache.org>.

Hey Nate,

Thanks for doing this! Would you be interested in putting that commit up as a draft PR for discussion? I think we can discuss there.

I'm not sure anyone is actively working on RLE or other encoding schemes at the moment.

-David

On Mon, Nov 8, 2021, at 13:19, Nate Bauernfeind wrote:
> I've written up the ColumnBag proposal addressing items 1 and 2 on the
> list. I'm open to any and all feedback/suggestions.
> 
> I'd be happy to add item 3 (binary metadata) to the proposed change set.
> Let me know if you want me to whip up the initial suggestion for that
> version (and whether or not to keep it separate from ColumnBag).
> 
> Would RLE-related efforts change the structure of RecordBatch or ColumnBag
> (if accepted)?
> 
> Here is a brief history of the discussion around why ColumnBag:
> https://docs.google.com/document/d/1jsmmqLTyJkU8fx0sUGIqd6yu72N4v9uHFsuGSgB_DfE/
> 
> Here is a brief commit doctoring up the flatbuffer to support this version
> of the proposed change:
> https://github.com/nbauernfeind/arrow/tree/column_bag_demo_v1
> 
> I don't know if it's better to comment in the document or bring comments
> back to the list. If it ends up being document heavy, then I'll summarize
> the main points back on the list.
> 
> I think I'll get started on a Java impl just to learn more even if it ends
> up being extra work.
> 
> Looking forward to your feedback,
> Nate
> 
> On Mon, Aug 9, 2021 at 10:06 PM Micah Kornfield <em...@gmail.com>
> wrote:
> 
> > I'm still interested in RLE related effort, but not sure about my available
> > bandwidth (which is why I haven't made more of an effort there).
> >
> > On Tue, Aug 3, 2021 at 6:00 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > Another Flatbuffers/Message.fbs project we should rekindle soon, in
> > > addition to the schema evolution/replacement question which has been
> > > raised with Flight, is that of sparse/compressed data (e.g. RLE). I
> > > have a vacation plus some travel coming up so won't be able to devote
> > > meaningful attention to this until the last part of August, but would
> > > like to help it move forward.
> > >
> > >
> > > On Tue, Jul 27, 2021 at 1:40 PM David Li <li...@apache.org> wrote:
> > > >
> > > > Hey Nate,
> > > >
> > > > For the first two points, semantically I'm tempted to think of it more
> > > > like the ability to send a "bag of columns" according to some schema (and
> > > > hence columns could have differing lengths or even be absent). This could
> > > > be a new structure alongside a record batch, which is semantically like a
> > > > "slice of a table" (and hence rectangular and complete), instead of
> > > > exposing existing users of RecordBatch to rather different behavior.
> > > >
> > > > For #3, a different thread was discussing some of the points there - it
> > > > sounds like it may be possible to relax from map<string, string> to
> > > > map<string, binary>.
> > > >
> > > > -David
> > > >
> > > > On Mon, Jul 26, 2021, at 11:01, Nate Bauernfeind wrote:
> > > > > Wes suggested that maybe there are enough new ideas that it may make sense
> > > > > to evolve-past the existing structures rather than to bolt-on new
> > > > > functionality. I would like to learn what requirements exist should new
> > > > > structures be adopted, and if applicable, would like to turn this into a
> > > > > full POC proposal.
> > > > >
> > > > > These are the features that I feel are missing from the existing design:
> > > > > - the ability to notify that the columns are not consistent in length (e.g.
> > > > > setting RecordBatch.length to -1; and give the arrow/flight user the true
> > > > > FieldNode lengths).
> > > > > - the ability to skip top-level field nodes that have length 0 at a small
> > > > > cost (such as in a bitset)
> > > > > - the ability to embed binary payload in the Message flatbuffer wrapper
> > > > > (instead of String payload only)
> > > > > - the ability to concurrently use more than one schema (the most likely API
> > > > > will look like how one identifies a dictionary. ideally dictionaries could
> > > > > be shared across field nodes in a schema or across schemas in the same
> > > > > flight)
> > > > >
> > > > > What other features, or improvements, could/should be considered? Any
> > > > > strong opinions against the ideas above? (Remember, that a goal of mine is
> > > > > to be able to send a RecordBatch of rows that were modified intersected
> > > > > only by the field-nodes that have changed (including those with only inner
> > > > > node changes); thus the columns are a subset of the full schema and that
> > > > > the length of each node is independent of the other).
> > > > >
> > > > > On Fri, Jul 9, 2021 at 9:26 AM Wes McKinney <we...@gmail.com> wrote:
> > > > > > It sounds like we may want to discuss some potential evolutions of the
> > > > > > Arrow binary protocol (for example: new Message types). Certainly a
> > > > > > can of worms but rather than trying to bolt some new functionality
> > > > > > onto the existing structures, it might be better to support the new
> > > > > > use cases through some new structures which will be more clear cut
> > > > > > from a forward compatibility standpoint.
> > > > >
> > > > > Nate
> > > > >
> > > > > --
> > > > >
> > >
> >
> 
> 
> --
>

Re: [DISCUSS] next iteration of flatbuffer structures

Posted by Nate Bauernfeind <na...@deephaven.io>.

I've written up the ColumnBag proposal addressing items 1 and 2 on the
list. I'm open to any and all feedback/suggestions.

I'd be happy to add item 3 (binary metadata) to the proposed change set.
Let me know if you want me to whip up the initial suggestion for that
version (and whether or not to keep it separate from ColumnBag).
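
For context on item 3, here is a minimal Python sketch of what string-only
metadata values cost when the payload is binary. It assumes string values
must be valid text, so arbitrary bytes need a text encoding first; base64 and
the key name "app.payload" are illustrative choices only, not part of the
proposal.

```python
import base64
import os

payload = os.urandom(300)  # stand-in for an opaque binary blob

# String-valued metadata: the bytes must be text-encoded before storing.
as_string_metadata = {"app.payload": base64.b64encode(payload).decode("ascii")}

# Binary-valued metadata (the relaxation under discussion): stored as-is.
as_binary_metadata = {"app.payload": payload}

print(len(as_string_metadata["app.payload"]))  # 400 characters
print(len(as_binary_metadata["app.payload"]))  # 300 bytes

# Readers of the string form also pay for a decode step.
assert base64.b64decode(as_string_metadata["app.payload"]) == payload
```

The roughly one-third size overhead and the extra encode/decode step are what
a map<string, binary> would avoid.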

Would RLE-related efforts change the structure of RecordBatch or ColumnBag
(if accepted)?

Here is a brief history of the discussion around why ColumnBag:
https://docs.google.com/document/d/1jsmmqLTyJkU8fx0sUGIqd6yu72N4v9uHFsuGSgB_DfE/

Here is a brief commit doctoring up the flatbuffer to support this version
of the proposed change:
https://github.com/nbauernfeind/arrow/tree/column_bag_demo_v1
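
As a rough illustration of items 1 and 2 (columns of differing lengths, and
skipping empty top-level columns at the cost of one bit each), here is a
small Python sketch. It is not taken from the branch above and does not use
the Arrow libraries; the names ColumnBag, pack_column_bag, and
unpack_column_bag, and the LSB-first bit order, are assumptions made purely
for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ColumnBag:
    """Hypothetical container: a subset of schema columns, each with its own length."""
    presence: bytes        # one bit per top-level field, least significant bit first
    columns: List[list]    # values for present fields only, in schema order


def pack_column_bag(schema: Sequence[str], changed: Dict[str, list]) -> ColumnBag:
    bits = bytearray((len(schema) + 7) // 8)
    columns = []
    for i, name in enumerate(schema):
        values = changed.get(name)
        if values:  # absent or empty columns cost only a zero bit
            bits[i // 8] |= 1 << (i % 8)
            columns.append(list(values))
    return ColumnBag(bytes(bits), columns)


def unpack_column_bag(schema: Sequence[str], bag: ColumnBag) -> Dict[str, list]:
    remaining = iter(bag.columns)
    out = {}
    for i, name in enumerate(schema):
        if (bag.presence[i // 8] >> (i % 8)) & 1:
            out[name] = next(remaining)
    return out


schema = ["id", "price", "qty", "note"]
# Only two of the four columns changed, and they have different lengths.
bag = pack_column_bag(schema, {"price": [1.5, 2.5, 3.0], "note": ["x"]})
print(bag.presence)                    # b'\n' (0x0a: bits 1 and 3 set)
print(unpack_column_bag(schema, bag))  # {'price': [1.5, 2.5, 3.0], 'note': ['x']}
```

A rectangular RecordBatch would instead have to carry all four columns at a
single shared length.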

I don't know if it's better to comment in the document or bring comments
back to the list. If it ends up being document heavy, then I'll summarize
the main points back on the list.

I think I'll get started on a Java impl just to learn more even if it ends
up being extra work.

Looking forward to your feedback,
Nate

On Mon, Aug 9, 2021 at 10:06 PM Micah Kornfield <em...@gmail.com>
wrote:

> I'm still interested in RLE related effort, but not sure about my available
> bandwidth (which is why I haven't made more of an effort there).
>
> On Tue, Aug 3, 2021 at 6:00 PM Wes McKinney <we...@gmail.com> wrote:
>
> > Another Flatbuffers/Message.fbs project we should rekindle soon, in
> > addition to the schema evolution/replacement question which has been
> > raised with Flight, is that of sparse/compressed data (e.g. RLE). I
> > have a vacation plus some travel coming up so won't be able to devote
> > meaningful attention to this until the last part of August, but would
> > like to help it move forward.
> >
> >
> > On Tue, Jul 27, 2021 at 1:40 PM David Li <li...@apache.org> wrote:
> > >
> > > Hey Nate,
> > >
> > > For the first two points, semantically I'm tempted to think of it more
> > > like the ability to send a "bag of columns" according to some schema (and
> > > hence columns could have differing lengths or even be absent). This could
> > > be a new structure alongside a record batch, which is semantically like a
> > > "slice of a table" (and hence rectangular and complete), instead of
> > > exposing existing users of RecordBatch to rather different behavior.
> > >
> > > For #3, a different thread was discussing some of the points there - it
> > > sounds like it may be possible to relax from map<string, string> to
> > > map<string, binary>.
> > >
> > > -David
> > >
> > > On Mon, Jul 26, 2021, at 11:01, Nate Bauernfeind wrote:
> > > > Wes suggested that maybe there are enough new ideas that it may make sense
> > > > to evolve-past the existing structures rather than to bolt-on new
> > > > functionality. I would like to learn what requirements exist should new
> > > > structures be adopted, and if applicable, would like to turn this into a
> > > > full POC proposal.
> > > >
> > > > These are the features that I feel are missing from the existing design:
> > > > - the ability to notify that the columns are not consistent in length (e.g.
> > > > setting RecordBatch.length to -1; and give the arrow/flight user the true
> > > > FieldNode lengths).
> > > > - the ability to skip top-level field nodes that have length 0 at a small
> > > > cost (such as in a bitset)
> > > > - the ability to embed binary payload in the Message flatbuffer wrapper
> > > > (instead of String payload only)
> > > > - the ability to concurrently use more than one schema (the most likely API
> > > > will look like how one identifies a dictionary. ideally dictionaries could
> > > > be shared across field nodes in a schema or across schemas in the same
> > > > flight)
> > > >
> > > > What other features, or improvements, could/should be considered? Any
> > > > strong opinions against the ideas above? (Remember, that a goal of mine is
> > > > to be able to send a RecordBatch of rows that were modified intersected
> > > > only by the field-nodes that have changed (including those with only inner
> > > > node changes); thus the columns are a subset of the full schema and that
> > > > the length of each node is independent of the other).
> > > >
> > > > On Fri, Jul 9, 2021 at 9:26 AM Wes McKinney <we...@gmail.com> wrote:
> > > > > It sounds like we may want to discuss some potential evolutions of the
> > > > > Arrow binary protocol (for example: new Message types). Certainly a
> > > > > can of worms but rather than trying to bolt some new functionality
> > > > > onto the existing structures, it might be better to support the new
> > > > > use cases through some new structures which will be more clear cut
> > > > > from a forward compatibility standpoint.
> > > >
> > > > Nate
> > > >
> > > > --
> > > >
> >
>


--

Re: [DISCUSS] next iteration of flatbuffer structures

Posted by Micah Kornfield <em...@gmail.com>.

I'm still interested in RLE-related efforts, but I'm not sure about my
available bandwidth (which is why I haven't made more of an effort there).

On Tue, Aug 3, 2021 at 6:00 PM Wes McKinney <we...@gmail.com> wrote:

> Another Flatbuffers/Message.fbs project we should rekindle soon, in
> addition to the schema evolution/replacement question which has been
> raised with Flight, is that of sparse/compressed data (e.g. RLE). I
> have a vacation plus some travel coming up so won't be able to devote
> meaningful attention to this until the last part of August, but would
> like to help it move forward.
>
>
> On Tue, Jul 27, 2021 at 1:40 PM David Li <li...@apache.org> wrote:
> >
> > Hey Nate,
> >
> > For the first two points, semantically I'm tempted to think of it more
> > like the ability to send a "bag of columns" according to some schema (and
> > hence columns could have differing lengths or even be absent). This could
> > be a new structure alongside a record batch, which is semantically like a
> > "slice of a table" (and hence rectangular and complete), instead of
> > exposing existing users of RecordBatch to rather different behavior.
> >
> > For #3, a different thread was discussing some of the points there - it
> > sounds like it may be possible to relax from map<string, string> to
> > map<string, binary>.
> >
> > -David
> >
> > On Mon, Jul 26, 2021, at 11:01, Nate Bauernfeind wrote:
> > > Wes suggested that maybe there are enough new ideas that it may make sense
> > > to evolve-past the existing structures rather than to bolt-on new
> > > functionality. I would like to learn what requirements exist should new
> > > structures be adopted, and if applicable, would like to turn this into a
> > > full POC proposal.
> > >
> > > These are the features that I feel are missing from the existing design:
> > > - the ability to notify that the columns are not consistent in length (e.g.
> > > setting RecordBatch.length to -1; and give the arrow/flight user the true
> > > FieldNode lengths).
> > > - the ability to skip top-level field nodes that have length 0 at a small
> > > cost (such as in a bitset)
> > > - the ability to embed binary payload in the Message flatbuffer wrapper
> > > (instead of String payload only)
> > > - the ability to concurrently use more than one schema (the most likely API
> > > will look like how one identifies a dictionary. ideally dictionaries could
> > > be shared across field nodes in a schema or across schemas in the same
> > > flight)
> > >
> > > What other features, or improvements, could/should be considered? Any
> > > strong opinions against the ideas above? (Remember, that a goal of mine is
> > > to be able to send a RecordBatch of rows that were modified intersected
> > > only by the field-nodes that have changed (including those with only inner
> > > node changes); thus the columns are a subset of the full schema and that
> > > the length of each node is independent of the other).
> > >
> > > On Fri, Jul 9, 2021 at 9:26 AM Wes McKinney <we...@gmail.com> wrote:
> > > > It sounds like we may want to discuss some potential evolutions of the
> > > > Arrow binary protocol (for example: new Message types). Certainly a
> > > > can of worms but rather than trying to bolt some new functionality
> > > > onto the existing structures, it might be better to support the new
> > > > use cases through some new structures which will be more clear cut
> > > > from a forward compatibility standpoint.
> > >
> > > Nate
> > >
> > > --
> > >
>