Posted to dev@arrow.apache.org by Larry White <lj...@gmail.com> on 2022/08/18 17:09:38 UTC

Proposal: A Table Data Structure for Arrow Java

Hi all,

I would like to propose a new Table data structure for Arrow Java that is
similar to the existing VectorSchemaRoot, but has:

- more table functionality (e.g. row-oriented operations)
- a simpler and more general mutability API

It lacks VectorSchemaRoot's buffer-like qualities, making it more like the
common understanding of a table. The hope is that it would complement
VectorSchemaRoot, with one used for batch/pipeline work and the other as a
standard 'table'.

A Google Doc describing the proposal can be found here:
https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit?usp=sharing

All comments are welcome.

Best,

larry

Re: Proposal: A Table Data Structure for Arrow Java

Posted by Antoine Pitrou <an...@python.org>.
On 25/08/2022 at 19:01, Larry White wrote:
> Hi all,
> 
> Thank you, Antoine and everyone for the feedback. It's been very helpful.
> The proposal has been updated to incorporate suggested changes and clarify
> as needed.
> 
> Several people have expressed support for the idea of using a Java version
> of ChunkedArrays as the internal representation. I'm wondering if a
> complete implementation of ChunkedArray is needed to achieve the
> performance benefits that you mention in this thread. In my reading of the
> API, data streamed as RecordBatches are converted to ChunkedArrays in a
> One-RecordBatch-to-One-ChunkedArray fashion.  This suggests that the
> complexity of managing chunks of different shapes isn't strictly required.
> Is that your understanding?

Yes, that is right.  The ability to have chunks of different shapes is a 
C++ design decision, but it doesn't affect other implementations.
So instead you could eschew ChunkedArray and have a Table be a sequence 
of record batches, for example.

Regards

Antoine.
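
For illustration only, here is a minimal sketch of the "Table as a sequence
of record batches" idea from the reply above. It is not part of the original
message and does not use the Arrow Java API: IntBatch and BatchSequenceTable
are hypothetical names, and each batch is reduced to a single int column to
keep the example short. The point it shows is that batches can be kept as
they arrive, with no concatenation, while rows are still addressable by a
table-wide index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one record batch (a single int column). Not an Arrow class. */
final class IntBatch {
    final int[] values;

    IntBatch(int... values) {
        this.values = values;
    }

    int rowCount() {
        return values.length;
    }
}

/** Hypothetical table that stores batches as received instead of concatenating them. */
final class BatchSequenceTable {
    private final List<IntBatch> batches = new ArrayList<>();
    // Table-wide index of the first row of each batch, in append order.
    private final List<Long> startOffsets = new ArrayList<>();
    private long rowCount = 0;

    /** Appends a batch by reference; the batch data is not copied. */
    void append(IntBatch batch) {
        startOffsets.add(rowCount);
        batches.add(batch);
        rowCount += batch.rowCount();
    }

    long rowCount() {
        return rowCount;
    }

    int batchCount() {
        return batches.size();
    }

    /** Returns the value at a table-wide row index. */
    int get(long row) {
        if (row < 0 || row >= rowCount) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        // Binary search for the last batch whose start offset is <= row.
        int lo = 0;
        int hi = batches.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (startOffsets.get(mid) <= row) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return batches.get(lo).values[(int) (row - startOffsets.get(lo))];
    }
}

/** Small usage example: two batches of different sizes, read through one index space. */
final class BatchSequenceTableDemo {
    public static void main(String[] args) {
        BatchSequenceTable table = new BatchSequenceTable();
        table.append(new IntBatch(1, 2, 3));
        table.append(new IntBatch(4, 5));
        for (long row = 0; row < table.rowCount(); row++) {
            System.out.println("row " + row + " = " + table.get(row));
        }
    }
}
```

Because every column in a batch shares the same row count, one offset list
per table is enough in this sketch; per-column chunk boundaries of different
shapes, as in the C++ ChunkedArray, are not needed.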

Re: Proposal: A Table Data Structure for Arrow Java

Posted by Larry White <lj...@gmail.com>.
Hi all,

Thank you, Antoine and everyone for the feedback. It's been very helpful.
The proposal has been updated to incorporate suggested changes and clarify
as needed.

Several people have expressed support for the idea of using a Java version
of ChunkedArrays as the internal representation. I'm wondering if a
complete implementation of ChunkedArray is needed to achieve the
performance benefits that you mention in this thread. In my reading of the
API, data streamed as RecordBatches are converted to ChunkedArrays in a
One-RecordBatch-to-One-ChunkedArray fashion.  This suggests that the
complexity of managing chunks of different shapes isn't strictly required.
Is that your understanding?

I don't have a sense of the effort required to produce a Java version of
ChunkedArrays, so I want to understand what the baseline requirement is.

Thanks again.

Larry



On Wed, Aug 24, 2022 at 11:58 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hi,
>
> Can Java developers please take a look at Larry's proposal below?
>
>
> As for my 2 cents as a non-Java developer:
>
> That's a detailed and well-explained proposal, thank you.
> My only concern is that you're proposing to implement this first as a
> set of contiguous vectors.  The various communication protocols offered
> by the Arrow specifications (IPC, Flight, C Stream Interface...) are all
> based on the notion of a stream of batches.  Minimizing the number of
> copies made is one of the selling points of Arrow, so being able to
> consume such streaming data without materializing a concatenation sounds
> important.
>
> Regards
>
> Antoine.
>
>
> On 18/08/2022 at 19:09, Larry White wrote:
> > Hi all,
> >
> > I would like to propose a new Table data structure for Arrow Java that is
> > similar to the existing VectorSchemaRoot, but has:
> >
> > - more table functionality (e.g. row-oriented operations)
> > - a simpler and more general mutability API
> >
> > It lacks VectorSchemaRoot's buffer-like qualities, making it more like the
> > common understanding of a table. The hope is that it would complement
> > VectorSchemaRoot, with one used for batch/pipeline work and the other as a
> > standard 'table'.
> >
> > A Google Doc describing the proposal can be found here:
> > https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit?usp=sharing
> >
> > All comments are welcome.
> >
> > Best,
> >
> > larry
> >
>

Re: Proposal: A Table Data Structure for Arrow Java

Posted by Antoine Pitrou <an...@python.org>.
Hi,

Can Java developers please take a look at Larry's proposal below?


As for my 2 cents as a non-Java developer:

That's a detailed and well-explained proposal, thank you.
My only concern is that you're proposing to implement this first as a 
set of contiguous vectors.  The various communication protocols offered 
by the Arrow specifications (IPC, Flight, C Stream Interface...) are all 
based on the notion of a stream of batches.  Minimizing the number of 
copies made is one of the selling points of Arrow, so being able to 
consume such streaming data without materializing a concatenation sounds 
important.

Regards

Antoine.


On 18/08/2022 at 19:09, Larry White wrote:
> Hi all,
> 
> I would like to propose a new Table data structure for Arrow Java that is
> similar to the existing VectorSchemaRoot, but has:
> 
> - more table functionality (e.g. row-oriented operations)
> - a simpler and more general mutability API
> 
> It lacks VectorSchemaRoot's buffer-like qualities, making it more like the
> common understanding of a table. The hope is that it would complement
> VectorSchemaRoot, with one used for batch/pipeline work and the other as a
> standard 'table'.
> 
> A Google Doc describing the proposal can be found here:
> https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit?usp=sharing
> 
> All comments are welcome.
> 
> Best,
> 
> larry
>