Posted to dev@arrow.apache.org by Matt Topol <zo...@gmail.com> on 2022/12/14 16:27:10 UTC

[VOTE] Add RLE Arrays to Arrow Format

Hello,

I'd like to propose adding the RLE type based on earlier discussions[1][2]
to the Arrow format:
- Columnar Format description:
https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
- Flatbuffers changes:
https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07

There is a proposed implementation available in both C++ (written by Tobias
Zagorni) and Go[3][4]. Both implementations have mostly the same tests
implemented and were tested to be compatible over IPC with an archery test.
In both cases, the implementations are split out among several Draft PRs so
that they can be easily reviewed piecemeal if the vote is approved, with
each Draft PR including the changes of the one before it. The links
provided are the Draft PRs with the entirety of the changes included.

The vote will be open for at least 72 hours.

[ ] +1 add the proposed RLE type to the Apache Arrow format
[ ] -1 do not add the proposed RLE type to the Apache Arrow format
because...

Thanks much, and please let me know if any more information or links are
needed (I've never proposed a vote before on here!)

--Matt

[1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
[2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
[3] https://github.com/apache/arrow/pull/14179
[4] https://github.com/apache/arrow/pull/14223
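
A minimal sketch of the layout question that comes up in the replies, assuming run ends are stored as cumulative, exclusive end offsets (an assumption here; the Columnar Format PR linked above is the authoritative description). The function names and sample data are hypothetical and are not taken from the C++ or Go implementations. It contrasts random access over run ends (binary search) with random access over run lengths (linear scan):

```python
from bisect import bisect_right

# Hypothetical sample data for the logical array ["a", "a", "a", "b", "c", "c"].
values = ["a", "b", "c"]
run_ends = [3, 4, 6]      # assumed: exclusive logical end index of each run
run_lengths = [3, 1, 2]   # the same runs stored as lengths instead


def get_by_run_ends(run_ends, values, i):
    """Binary search over run ends: O(log N) in the number of runs."""
    if not run_ends or not 0 <= i < run_ends[-1]:
        raise IndexError(i)
    # The first run whose end is strictly greater than i contains index i.
    return values[bisect_right(run_ends, i)]


def get_by_run_lengths(run_lengths, values, i):
    """Linear scan over run lengths: O(N) in the number of runs."""
    if i < 0:
        raise IndexError(i)
    end = 0
    for length, value in zip(run_lengths, values):
        end += length  # running total plays the role of a run end
        if i < end:
            return value
    raise IndexError(i)
```

Both functions return the same element for every valid index; only the cost of finding the containing run differs.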

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Matthew Topol <ma...@voltrondata.com.INVALID>.

Huzzah!

That brings us to 3 +1 (binding) votes, and 1 +1 (non-binding) vote!

The vote passes! I've updated the PR for the format changes (on their own)
here: https://github.com/apache/arrow/pull/14176 and will follow it up with
updating the other PRs as I can. If anyone could comment / approve that PR,
I'll merge it to kick this off and start getting the other PRs ready for
review.

Thanks everyone!

On Mon, Dec 19, 2022 at 4:59 PM Ian Cook <ia...@ursacomputing.com> wrote:

> @Matt Topol: Yes, a change of the name to "run-end encoding" changes
> my (non-binding) vote to a +1.

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Ian Cook <ia...@ursacomputing.com>.

@Matt Topol: Yes, a change of the name to "run-end encoding" changes
my (non-binding) vote to a +1.
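
To make the naming distinction concrete, here is a small sketch under the assumption that run ends are cumulative, exclusive end offsets. The helper names are hypothetical and do not correspond to any Arrow API. It shows that the two representations carry the same information but are not bit-for-bit identical, and that either can be expanded losslessly into a plain array:

```python
from itertools import accumulate


def lengths_to_ends(run_lengths):
    """Run ends are the running totals of the run lengths."""
    return list(accumulate(run_lengths))


def ends_to_lengths(run_ends):
    """Each run length is the gap between consecutive run ends."""
    starts = [0] + run_ends[:-1]
    return [end - start for start, end in zip(starts, run_ends)]


def decode(run_ends, values):
    """Expand a run-end encoded pair into a plain list."""
    out = []
    start = 0
    for end, value in zip(run_ends, values):
        out.extend([value] * (end - start))
        start = end
    return out
```

Because every run end is a logical index into the decoded array, the width of the integer type chosen for run ends bounds the logical array length, whereas a run-length layout only needs each individual length to fit.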

On Mon, Dec 19, 2022 at 3:32 PM Matthew Topol
<ma...@voltrondata.com.invalid> wrote:
>
> Okay, slight edit to my previous email: It was brought to my attention that
> we need at least 3 +1 binding votes, so this vote is still open for the
> moment.
>
> @IanCook: With the change of the name to RunEndEncoding is that sufficient
> to change your vote to a +1?

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Matthew Topol <ma...@voltrondata.com.INVALID>.

Okay, slight edit to my previous email: It was brought to my attention that
we need at least 3 +1 binding votes, so this vote is still open for the
moment.

@IanCook: With the change of the name to RunEndEncoding is that sufficient
to change your vote to a +1?

On Mon, Dec 19, 2022 at 12:57 PM Matt Topol <zo...@gmail.com> wrote:

> That leaves us with a total vote of +1.5 so the vote carries with the
> caveat of changing the name to be Run End Encoded rather than Run Length
> Encoded (unless this means I need to do a new vote with the changed name?
> This is my first time doing one of these so please correct me if I need to
> do a new vote!)
>
> Thanks everyone for your feedback and comments!
>
> I'm going to go update the Go and Format specific PRs to make them regular
> PR's (instead of drafts) and get this all moving. Thanks in advance to
> anyone who reviews the upcoming PRs!
>
> --Matt

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.

+1

Thanks a lot for all this. Really exciting!!

On Mon, 19 Dec 2022, 17:56 Matt Topol, <zo...@gmail.com> wrote:

> That leaves us with a total vote of +1.5 so the vote carries with the
> caveat of changing the name to be Run End Encoded rather than Run Length
> Encoded (unless this means I need to do a new vote with the changed name?
> This is my first time doing one of these so please correct me if I need to
> do a new vote!)
>
> Thanks everyone for your feedback and comments!
>
> I'm going to go update the Go and Format specific PRs to make them regular
> PR's (instead of drafts) and get this all moving. Thanks in advance to
> anyone who reviews the upcoming PRs!
>
> --Matt

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Matt Topol <zo...@gmail.com>.

That leaves us with a total vote of +1.5 so the vote carries with the
caveat of changing the name to be Run End Encoded rather than Run Length
Encoded (unless this means I need to do a new vote with the changed name?
This is my first time doing one of these so please correct me if I need to
do a new vote!)

Thanks everyone for your feedback and comments!

I'm going to go update the Go- and Format-specific PRs to make them regular
PRs (instead of drafts) and get this all moving. Thanks in advance to
anyone who reviews the upcoming PRs!

--Matt

On Fri, Dec 16, 2022 at 8:24 PM Weston Pace <we...@gmail.com> wrote:

> +1
>
> I agree that run-end encoding makes more sense but also don't see it
> as a deal breaker.
>
> The most compelling counter-argument I've seen for new types is to
> avoid a schism where some implementations do not support the newer
> types.  However, for the type proposed here I think the risk is low
> because data can be losslessly converted to existing formats for
> compatibility with any system that doesn't support the type.
>
> Another argument I've seen is that we should introduce a more formal
> distinction between "layouts" and "types" (with dictionary and
> run-end-encoding being layouts).  However, this seems like an
> impractical change at this point.  In addition, given that we have
> dictionary as an array type the cat is already out of the bag.
> Furthermore, systems and implementations are still welcome to make
> this distinction themselves.  The spec only needs to specify what the
> buffer layouts should be.  If a particular library chooses to group
> those layouts into two different categories I think that would still
> be feasible.
>
> -Weston
>
> On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > +1 on the proposal as written
> >
> > I think it makes sense and offers exciting opportunities for faster
> > computation, especially for cases where Parquet files can be decoded
> > directly into such an array and avoid unpacking. (RLE-encoded
> > dictionaries are quite compelling.)
> >
> > I would prefer to use the term Run-End-Encoding (which would also follow
> > the naming of the internal fields) but I don't view that as a deal
> > blocker.
> >
> > Thank you for all your work in this matter,
> > Andrew
> >
> > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol <zo...@gmail.com> wrote:
> >
> > > I'm not at all opposed to renaming it as `Run-End-Encoding` if that
> > > would be preferable. Hopefully others will chime in with their feedback.
> > >
> > > --Matt
> > >
> > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook <ia...@ursacomputing.com> wrote:
> > >
> > > > Thank you Matt, Tobias, and others for the great work on this.
> > > >
> > > > I am -0.5 on this proposal in its current form because (pardon the
> > > > pedantry) what we have implemented here is not run-length encoding;
> > > > it is run-end encoding. Based on community input, the choice was made
> > > > to store run ends instead of run lengths because this enables O(log(N))
> > > > random access as opposed to O(N). This is a sensible choice, but it
> > > > comes with some trade-offs, including limitations in array length
> > > > (which may not really be a problem in practice) and lack of bit-for-bit
> > > > equivalence with RLE encodings that use run lengths, like Velox's
> > > > SequenceVector encoding (which I think is a more serious problem in
> > > > practice).
> > > >
> > > > I believe that we should either:
> > > > (a) rename this to "run-end encoding"
> > > > (b) change this to a parameterized type called "run encoding" that
> > > > takes a Boolean parameter specifying whether run lengths or run ends
> > > > are stored.
> > > >
> > > > Ian
> > > >
> > > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <zo...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I'd like to propose adding the RLE type based on earlier
> > > > discussions[1][2]
> > > > > to the Arrow format:
> > > > > - Columnar Format description:
> > > > >
> > > >
> > >
> https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > > > - Flatbuffers changes:
> > > > >
> > > >
> > >
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > > > >
> > > > > There is a proposed implementation available in both C++ (written
> by
> > > > Tobias
> > > > > Zagorni) and Go[3][4]. Both implementations have mostly the same
> tests
> > > > > implemented and were tested to be compatible over IPC with an
> archery
> > > > test.
> > > > > In both cases, the implementations are split out among several
> Draft
> > > PRs
> > > > so
> > > > > that they can be easily reviewed piecemeal if the vote is approved,
> > > with
> > > > > each Draft PR including the changes of the one before it. The links
> > > > > provided are the Draft PRs with the entirety of the changes
> included.
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 add the proposed RLE type to the Apache Arrow format
> > > > > [ ] -1 do not add the proposed RLE type to the Apache Arrow format
> > > > > because...
> > > > >
> > > > > Thanks much, and please let me know if any more information or
> links
> > > are
> > > > > needed (I've never proposed a vote before on here!)
> > > > >
> > > > > --Matt
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> > > > > [2]
> https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> > > > > [3] https://github.com/apache/arrow/pull/14179
> > > > > [4] https://github.com/apache/arrow/pull/14223
> > > >
> > >
>

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Weston Pace <we...@gmail.com>.

+1

I agree that run-end encoding makes more sense but also don't see it
as a deal breaker.

The most compelling counter-argument I've seen for new types is to
avoid a schism where some implementations do not support the newer
types.  However, for the type proposed here I think the risk is low
because data can be losslessly converted to existing formats for
compatibility with any system that doesn't support the type.
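As an illustration of the lossless-conversion point, here is a minimal Python sketch that expands a run-end encoded array into a plain one and re-encodes it. This is not the Arrow API: `decode`, `encode`, and the sample data are hypothetical names chosen for this example, and plain Python lists stand in for Arrow buffers.

```python
def decode(run_ends, values):
    """Expand (run_ends, values) into a plain list of logical values."""
    out = []
    start = 0
    for end, value in zip(run_ends, values):
        # Each run covers logical indices [start, end).
        out.extend([value] * (end - start))
        start = end
    return out


def encode(array):
    """Collapse consecutive equal items into (run_ends, values)."""
    run_ends, values = [], []
    for i, item in enumerate(array):
        if values and values[-1] == item:
            run_ends[-1] = i + 1  # extend the current run
        else:
            values.append(item)   # start a new run
            run_ends.append(i + 1)
    return run_ends, values


plain = [7, 7, 7, None, None, 9]
run_ends, values = encode(plain)
assert decode(run_ends, values) == plain
```

Decoding never loses information, so a system without support for the type can always fall back to the expanded array. The reverse round trip (`encode(decode(x)) == x`) only holds when `x` has no two adjacent runs with the same value.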

Another argument I've seen is that we should introduce a more formal
distinction between "layouts" and "types" (with dictionary and
run-end-encoding being layouts).  However, this seems like an
impractical change at this point.  In addition, given that we have
dictionary as an array type, the cat is already out of the bag.
Furthermore, systems and implementations are still welcome to make
this distinction themselves.  The spec only needs to specify what the
buffer layouts should be.  If a particular library chooses to group
those layouts into two different categories I think that would still
be feasible.

-Weston

On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> +1 on the proposal as written
>
> I think it makes sense and offers exciting opportunities for faster
> computation (especially for cases where Parquet files can be decoded
> directly into such an array, avoiding unpacking; RLE-encoded dictionaries
> are quite compelling).
>
> I would prefer to use the term Run-End-Encoding (which would also follow
> the naming of the internal fields) but I don't view that as a deal blocker.
>
> Thank you for all your work in this matter,
> Andrew
>

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Andrew Lamb <al...@influxdata.com>.

+1 on the proposal as written

I think it makes sense and offers exciting opportunities for faster
computation (especially for cases where Parquet files can be decoded
directly into such an array, avoiding unpacking; RLE-encoded dictionaries
are quite compelling).

I would prefer to use the term Run-End-Encoding (which would also follow
the naming of the internal fields) but I don't view that as a deal blocker.

Thank you for all your work in this matter,
Andrew

On Wed, Dec 14, 2022 at 5:08 PM Matt Topol <zo...@gmail.com> wrote:

> I'm not at all opposed to renaming it as `Run-End-Encoding` if that would
> be preferable. Hopefully others will chime in with their feedback.
>
> --Matt

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Matt Topol <zo...@gmail.com>.

I'm not at all opposed to renaming it as `Run-End-Encoding` if that would
be preferable. Hopefully others will chime in with their feedback.

--Matt

On Wed, Dec 14, 2022 at 12:09 PM Ian Cook <ia...@ursacomputing.com> wrote:

> Thank you Matt, Tobias, and others for the great work on this.
>
> I am -0.5 on this proposal in its current form because (pardon the
> pedantry) what we have implemented here is not run-length encoding; it
> is run-end encoding. Based on community input, the choice was made to
> store run ends instead of run lengths because this enables O(log(N))
> random access as opposed to O(N). This is a sensible choice, but it
> comes with some trade-offs including limitations in array length
> (which may not really be a problem in practice) and lack of bit-for-bit
> equivalence with RLE encodings that use run lengths like Velox's
> SequenceVector encoding (which I think is a more serious problem in
> practice).
>
> I believe that we should either:
> (a) rename this to "run-end encoding"
> (b) change this to a parameterized type called "run encoding" that
> takes a Boolean parameter specifying whether run lengths or run ends
> are stored.
>
> Ian

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Ian Cook <ia...@ursacomputing.com>.

Thank you Matt, Tobias, and others for the great work on this.

I am -0.5 on this proposal in its current form because (pardon the
pedantry) what we have implemented here is not run-length encoding; it
is run-end encoding. Based on community input, the choice was made to
store run ends instead of run lengths because this enables O(log(N))
random access as opposed to O(N). This is a sensible choice, but it
comes with some trade-offs including limitations in array length
(which may not really be a problem in practice) and lack of bit-for-bit
equivalence with RLE encodings that use run lengths like Velox's
SequenceVector encoding (which I think is a more serious problem in
practice).
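
To make the complexity trade-off concrete, here is a minimal Python sketch contrasting random access in the two layouts. It is not the Arrow or Velox API: the function names and sample data are hypothetical. Because run ends are cumulative and strictly increasing, the run containing logical index i can be found by binary search; with run lengths, the lookup has to accumulate lengths from the start. Converting lengths to ends is a prefix sum.

```python
from bisect import bisect_right
from itertools import accumulate


def lengths_to_ends(run_lengths):
    """Prefix sum: run_ends[k] is the logical index just past run k."""
    return list(accumulate(run_lengths))


def get_by_run_ends(run_ends, values, i):
    """O(log N) lookup: binary search for the first run end greater than i."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(i)
    return values[bisect_right(run_ends, i)]


def get_by_run_lengths(run_lengths, values, i):
    """O(N) lookup: scan runs, accumulating lengths until i is covered."""
    if i < 0:
        raise IndexError(i)
    end = 0
    for run_length, value in zip(run_lengths, values):
        end += run_length
        if i < end:
            return value
    raise IndexError(i)


values = ["a", "b", "c"]
run_lengths = [3, 1, 2]                  # logical array: a a a b c c
run_ends = lengths_to_ends(run_lengths)  # [3, 4, 6]
assert get_by_run_ends(run_ends, values, 3) == get_by_run_lengths(run_lengths, values, 3)
```

The two buffers carry the same information but are not bit-for-bit identical, which is the interoperability concern raised above. The last run end also equals the logical array length, so the integer width chosen for run ends bounds how long the array can be.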

I believe that we should either:
(a) rename this to "run-end encoding"
(b) change this to a parameterized type called "run encoding" that
takes a Boolean parameter specifying whether run lengths or run ends
are stored.

Ian

On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <zo...@gmail.com> wrote:
>
> Hello,
>
> I'd like to propose adding the RLE type based on earlier discussions[1][2]
> to the Arrow format:
> - Columnar Format description:
> https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> - Flatbuffers changes:
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
>
> There is a proposed implementation available in both C++ (written by Tobias
> Zagorni) and Go[3][4]. Both implementations have mostly the same tests
> implemented and were tested to be compatible over IPC with an archery test.
> In both cases, the implementations are split out among several Draft PRs so
> that they can be easily reviewed piecemeal if the vote is approved, with
> each Draft PR including the changes of the one before it. The links
> provided are the Draft PRs with the entirety of the changes included.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 add the proposed RLE type to the Apache Arrow format
> [ ] -1 do not add the proposed RLE type to the Apache Arrow format
> because...
>
> Thanks much, and please let me know if any more information or links are
> needed (I've never proposed a vote before on here!)
>
> --Matt
>
> [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> [3] https://github.com/apache/arrow/pull/14179
> [4] https://github.com/apache/arrow/pull/14223

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Matt Topol <zo...@gmail.com>.

Thanks Antoine! I'll go respond to your comments now!

On Mon, Jan 9, 2023 at 11:01 AM Antoine Pitrou <an...@python.org> wrote:

>
> I've commented on the PR. I'm +1 on the principle and on the proposed
> format / layout additions.
>
> Regards
>
> Antoine.

Re: [VOTE] Add RLE Arrays to Arrow Format

Posted by Antoine Pitrou <an...@python.org>.

I've commented on the PR. I'm +1 on the principle and on the proposed 
format / layout additions.

Regards

Antoine.


On 14/12/2022 at 17:27, Matt Topol wrote:
> Hello,
> 
> I'd like to propose adding the RLE type based on earlier discussions[1][2]
> to the Arrow format:
> - Columnar Format description:
> https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> - Flatbuffers changes:
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> 
> There is a proposed implementation available in both C++ (written by Tobias
> Zagorni) and Go[3][4]. Both implementations have mostly the same tests
> implemented and were tested to be compatible over IPC with an archery test.
> In both cases, the implementations are split out among several Draft PRs so
> that they can be easily reviewed piecemeal if the vote is approved, with
> each Draft PR including the changes of the one before it. The links
> provided are the Draft PRs with the entirety of the changes included.
> 
> The vote will be open for at least 72 hours.
> 
> [ ] +1 add the proposed RLE type to the Apache Arrow format
> [ ] -1 do not add the proposed RLE type to the Apache Arrow format
> because...
> 
> Thanks much, and please let me know if any more information or links are
> needed (I've never proposed a vote before on here!)
> 
> --Matt
> 
> [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> [3] https://github.com/apache/arrow/pull/14179
> [4] https://github.com/apache/arrow/pull/14223
>