Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/11/07 07:43:35 UTC

[Java] Append multiple record batches together?

Hi,
A colleague opened https://issues.apache.org/jira/browse/ARROW-7048 asking
for functionality similar to the Python APIs that allow creating one larger
data structure from a series of record batches.  I just wanted to surface it
here to ask:
1.  Does an efficient solution already exist? It seems like the TransferPair
implementations could possibly be improved upon, or have they already been
optimized?
2.  What would the preferred API for doing this be?  Some options I can
think of:

* VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
* VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
* VectorLoader.load(Collection<ArrowRecordBatch>)
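For what it's worth, the closest I've been able to get with the existing
public APIs is a per-value copy along the lines of the sketch below (class
and method names are mine; it assumes all roots share the same schema, and
that copyFromSafe is available on the vector interface in the version at
hand).  This is exactly the kind of code I'd hope option 1 could replace or
optimize:

import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;

public final class ManualConcat {

  /** Copies the rows of several same-schema roots into one newly allocated root. */
  public static VectorSchemaRoot concat(List<VectorSchemaRoot> roots,
                                        BufferAllocator allocator) {
    VectorSchemaRoot result =
        VectorSchemaRoot.create(roots.get(0).getSchema(), allocator);
    result.allocateNew();
    int outRow = 0;
    for (VectorSchemaRoot root : roots) {
      List<FieldVector> in = root.getFieldVectors();
      List<FieldVector> out = result.getFieldVectors();
      for (int row = 0; row < root.getRowCount(); row++) {
        for (int col = 0; col < out.size(); col++) {
          // copyFromSafe re-allocates the destination buffers as needed.
          out.get(col).copyFromSafe(row, outRow, in.get(col));
        }
        outRow++;
      }
    }
    result.setRowCount(outRow);
    return result;
  }
}

A buffer-level copy per column (rather than per value) is presumably what an
optimized concat would do instead.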

Thanks,
Micah

Re: [Java] Append multiple record batches together?

Posted by Fan Liya <li...@gmail.com>.
One use-case for ChunkedArray that comes to mind is external sort for
large vectors.
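
To sketch the idea (illustrative code only; names are made up, nulls and
spilling to disk are ignored): once each oversized chunk has been sorted on
its own, the merge phase only needs sequential reads over the sorted chunks,
which is where a chunked representation would fit naturally.

import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import org.apache.arrow.vector.IntVector;

/** Merge phase of an external sort: k-way merge of already-sorted int chunks. */
public final class SortedChunkMerge {

  private static final class Cursor {
    final IntVector chunk;
    int pos;
    Cursor(IntVector chunk) { this.chunk = chunk; }
  }

  /** Merges sorted, null-free chunks into a pre-created output vector. */
  public static void merge(List<IntVector> sortedChunks, IntVector out) {
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>(Comparator.comparingInt((Cursor c) -> c.chunk.get(c.pos)));
    for (IntVector chunk : sortedChunks) {
      if (chunk.getValueCount() > 0) {
        heap.add(new Cursor(chunk));
      }
    }
    int outIndex = 0;
    while (!heap.isEmpty()) {
      Cursor min = heap.poll();
      out.setSafe(outIndex++, min.chunk.get(min.pos));
      min.pos++;
      if (min.pos < min.chunk.getValueCount()) {
        heap.add(min);  // re-insert so the heap sees the cursor's new position
      }
    }
    out.setValueCount(outIndex);
  }
}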

Best,
Liya Fan

On Fri, Nov 15, 2019 at 2:14 PM Micah Kornfield <em...@gmail.com>
wrote:

> >
> > Maybe Java can add the concept of Tables and ChunkedArrays sometime in
> the
> > future.
>
>
> Is there a concrete use-case here?  It might pay to open up some JIRAs.
> I'm still not 100% clear on the rationale for the way VectorSchemaRoot is
> designed and how that would relate to Table/ChunkedArrays (or maybe they
> are completely separate)?
>
> On Tue, Nov 12, 2019 at 11:28 AM Bryan Cutler <cu...@gmail.com> wrote:
>
> > Yes, you are correct. I think I was mixing up a couple different things.
> I
> > like the way C++/Python distinguishes it where a RecordBatch is
> contiguous
> > memory and a Table can be chunked. So since you are just talking about
> > RecordBatches, I think we should keep it contiguous and concat would
> > require memcpy. Maybe Java can add the concept of Tables and
> ChunkedArrays
> > sometime in the future.
> >
> > On Mon, Nov 11, 2019 at 9:59 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> I think having a chunked array with multiple vector buffers would be
> >>> ideal, similar to C++. It might take a fair amount of work to add this
> but
> >>> would open up a lot more functionality.
> >>
> >>
> >> There are potentially two different use-cases.  ChunkedArray is
> >> logical/lazy concatenation, whereas concat physically rebuilds the
> >> vectors to be a single vector.
> >>
> >> On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler <cu...@gmail.com> wrote:
> >>
> >>> I think having a chunked array with multiple vector buffers would be
> >>> ideal, similar to C++. It might take a fair amount of work to add this
> but
> >>> would open up a lot more functionality. As for the API,
> >>> VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.
> >>>
> >>> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <li...@gmail.com> wrote:
> >>>
> >>>> Hi Micah,
> >>>>
> >>>> Thanks for bringing this up.
> >>>>
> >>>> > 1.  An efficient solution already exists? It seems like TransferPair
> >>>> implementations could possibly be improved upon or have they already
> >>>> been
> >>>> optimized?
> >>>>
> >>>> Fundamentally, memory copy is unavoidable, IMO, because the source
> >>>> and target memory regions are likely to be non-contiguous.
> >>>> An alternative is to make ArrowBuf support a number of non-contiguous
> >>>> memory regions. However, that would harm the performance of ArrowBuf,
> >>>> and ArrowBuf is the core of the Arrow library.
> >>>>
> >>>> > 2.  What the preferred API for doing this would be?  Some options
> >>>> > I can think of:
> >>>>
> >>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
> >>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
> >>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
> >>>>
> >>>> IMO, option 1 is required, as we have scenarios that need to
> >>>> concatenate vectors/VectorSchemaRoots (e.g. restoring the complete
> >>>> dictionary from delta dictionaries).
> >>>> Options 2 and 3 are optional for us.
> >>>>
> >>>> Best,
> >>>> Liya Fan
> >>>>
> >>>> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <emkornfield@gmail.com>
> >>>> wrote:
> >>>>
> >>>> > Hi,
> >>>> > A colleague opened up
> >>>> https://issues.apache.org/jira/browse/ARROW-7048 for
> >>>> > having similar functionality to the Python APIs that allow for
> >>>> creating one
> >>>> > larger data structure from a series of record batches.  I just
> wanted
> >>>> to
> >>>> > surface it here in case:
> >>>> > 1.  An efficient solution already exists? It seems like TransferPair
> >>>> > implementations could possibly be improved upon or have they already
> >>>> been
> >>>> > optimized?
> >>>> > 2.  What the preferred API for doing this would be?  Some options
> >>>> > I can think of:
> >>>> >
> >>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
> >>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
> >>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
> >>>> >
> >>>> > Thanks,
> >>>> > Micah
> >>>> >
> >>>>
> >>>
>

Re: [Java] Append multiple record batches together?

Posted by Micah Kornfield <em...@gmail.com>.
>
> Maybe Java can add the concept of Tables and ChunkedArrays sometime in the
> future.


Is there a concrete use-case here?  It might pay to open up some JIRAs.
I'm still not 100% clear on the rationale for the way VectorSchemaRoot is
designed and how that would relate to Table/ChunkedArrays (or maybe they
are completely separate)?

On Tue, Nov 12, 2019 at 11:28 AM Bryan Cutler <cu...@gmail.com> wrote:

> Yes, you are correct. I think I was mixing up a couple different things. I
> like the way C++/Python distinguishes it where a RecordBatch is contiguous
> memory and a Table can be chunked. So since you are just talking about
> RecordBatches, I think we should keep it contiguous and concat would
> require memcpy. Maybe Java can add the concept of Tables and ChunkedArrays
> sometime in the future.
>
> On Mon, Nov 11, 2019 at 9:59 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> I think having a chunked array with multiple vector buffers would be
>>> ideal, similar to C++. It might take a fair amount of work to add this but
>>> would open up a lot more functionality.
>>
>>
>> There are potentially two different use-cases.  ChunkedArray is
>> logical/lazy concatenation, whereas concat physically rebuilds the vectors
>> to be a single vector.
>>
>> On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler <cu...@gmail.com> wrote:
>>
>>> I think having a chunked array with multiple vector buffers would be
>>> ideal, similar to C++. It might take a fair amount of work to add this but
>>> would open up a lot more functionality. As for the API,
>>> VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.
>>>
>>> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <li...@gmail.com> wrote:
>>>
>>>> Hi Micah,
>>>>
>>>> Thanks for bringing this up.
>>>>
>>>> > 1.  An efficient solution already exists? It seems like TransferPair
>>>> implementations could possibly be improved upon or have they already
>>>> been
>>>> optimized?
>>>>
>>>> Fundamentally, memory copy is unavoidable, IMO, because the source and
>>>> target memory regions are likely to be non-contiguous.
>>>> An alternative is to make ArrowBuf support a number of non-contiguous
>>>> memory regions. However, that would harm the performance of ArrowBuf, and
>>>> ArrowBuf is the core of the Arrow library.
>>>>
>>>> > 2.  What the preferred API for doing this would be?  Some options
>>>> > I can think of:
>>>>
>>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>>>
>>>> IMO, option 1 is required, as we have scenarios that need to
>>>> concatenate vectors/VectorSchemaRoots (e.g. restoring the complete
>>>> dictionary from delta dictionaries).
>>>> Options 2 and 3 are optional for us.
>>>>
>>>> Best,
>>>> Liya Fan
>>>>
>>>> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <em...@gmail.com>
>>>> wrote:
>>>>
>>>> > Hi,
>>>> > A colleague opened up
>>>> https://issues.apache.org/jira/browse/ARROW-7048 for
>>>> > having similar functionality to the Python APIs that allow for
>>>> creating one
>>>> > larger data structure from a series of record batches.  I just wanted
>>>> to
>>>> > surface it here in case:
>>>> > 1.  An efficient solution already exists? It seems like TransferPair
>>>> > implementations could possibly be improved upon or have they already
>>>> been
>>>> > optimized?
>>>> > 2.  What the preferred API for doing this would be?  Some options
>>>> > I can think of:
>>>> >
>>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>>> >
>>>> > Thanks,
>>>> > Micah
>>>> >
>>>>
>>>

Re: [Java] Append multiple record batches together?

Posted by Bryan Cutler <cu...@gmail.com>.
Yes, you are correct. I think I was mixing up a couple of different things. I
like the way C++/Python distinguish the two: a RecordBatch is contiguous
memory and a Table can be chunked. So since you are just talking about
RecordBatches, I think we should keep them contiguous, and concat would
require a memcpy. Maybe Java can add the concept of Tables and ChunkedArrays
sometime in the future.

On Mon, Nov 11, 2019 at 9:59 AM Micah Kornfield <em...@gmail.com>
wrote:

> I think having a chunked array with multiple vector buffers would be
>> ideal, similar to C++. It might take a fair amount of work to add this but
>> would open up a lot more functionality.
>
>
> There are potentially two different use-cases.  ChunkedArray is
> logical/lazy concatenation, whereas concat physically rebuilds the vectors
> to be a single vector.
>
> On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler <cu...@gmail.com> wrote:
>
>> I think having a chunked array with multiple vector buffers would be
>> ideal, similar to C++. It might take a fair amount of work to add this but
>> would open up a lot more functionality. As for the API,
>> VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.
>>
>> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <li...@gmail.com> wrote:
>>
>>> Hi Micah,
>>>
>>> Thanks for bringing this up.
>>>
>>> > 1.  An efficient solution already exists? It seems like TransferPair
>>> implementations could possibly be improved upon or have they already been
>>> optimized?
>>>
>>> Fundamentally, memory copy is unavoidable, IMO, because the source and
>>> target memory regions are likely to be non-contiguous.
>>> An alternative is to make ArrowBuf support a number of non-contiguous
>>> memory regions. However, that would harm the performance of ArrowBuf, and
>>> ArrowBuf is the core of the Arrow library.
>>>
>>> > 2.  What the preferred API for doing this would be?  Some options I can
>>> > think of:
>>>
>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>>
>>> IMO, option 1 is required, as we have scenarios that need to concatenate
>>> vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from
>>> delta dictionaries).
>>> Options 2 and 3 are optional for us.
>>>
>>> Best,
>>> Liya Fan
>>>
>>> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> > A colleague opened up https://issues.apache.org/jira/browse/ARROW-7048
>>> for
>>> > having similar functionality to the Python APIs that allow for
>>> creating one
>>> > larger data structure from a series of record batches.  I just wanted
>>> to
>>> > surface it here in case:
>>> > 1.  An efficient solution already exists? It seems like TransferPair
>>> > implementations could possibly be improved upon or have they already
>>> been
>>> > optimized?
>>> > 2.  What the preferred API for doing this would be?  Some options I can
>>> > think of:
>>> >
>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>>
>>

Re: [Java] Append multiple record batches together?

Posted by Micah Kornfield <em...@gmail.com>.
>
> I think having a chunked array with multiple vector buffers would be
> ideal, similar to C++. It might take a fair amount of work to add this but
> would open up a lot more functionality.


There are potentially two different use-cases.  ChunkedArray is
logical/lazy concatenation, whereas concat physically rebuilds the vectors
into a single vector.
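
Roughly, the lazy version would just be a thin wrapper that keeps the chunk
vectors around and routes a logical index to the right chunk. Nothing like
this exists in Arrow Java today, so the sketch below is purely hypothetical:

import java.util.ArrayList;
import java.util.List;

import org.apache.arrow.vector.ValueVector;

/** Hypothetical sketch of a logical (copy-free) concatenation of vectors. */
public final class ChunkedArraySketch {
  private final List<ValueVector> chunks = new ArrayList<>();
  private long totalLength = 0;

  public void addChunk(ValueVector chunk) {
    chunks.add(chunk);  // lazy concatenation: keep a reference, copy nothing
    totalLength += chunk.getValueCount();
  }

  public long length() {
    return totalLength;
  }

  /** Resolves a logical index to (chunk, local index) and reads the value. */
  public Object get(long index) {
    long remaining = index;
    for (ValueVector chunk : chunks) {
      if (remaining < chunk.getValueCount()) {
        return chunk.getObject((int) remaining);
      }
      remaining -= chunk.getValueCount();
    }
    throw new IndexOutOfBoundsException("index " + index + " >= " + totalLength);
  }
}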

On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler <cu...@gmail.com> wrote:

> I think having a chunked array with multiple vector buffers would be
> ideal, similar to C++. It might take a fair amount of work to add this but
> would open up a lot more functionality. As for the API,
> VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.
>
> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <li...@gmail.com> wrote:
>
>> Hi Micah,
>>
>> Thanks for bringing this up.
>>
>> > 1.  An efficient solution already exists? It seems like TransferPair
>> implementations could possibly be improved upon or have they already been
>> optimized?
>>
>> Fundamentally, memory copy is unavoidable, IMO, because the source and
>> target memory regions are likely to be non-contiguous.
>> An alternative is to make ArrowBuf support a number of non-contiguous
>> memory regions. However, that would harm the performance of ArrowBuf, and
>> ArrowBuf is the core of the Arrow library.
>>
>> > 2.  What the preferred API for doing this would be?  Some options I can
>> > think of:
>>
>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>
>> IMO, option 1 is required, as we have scenarios that need to concatenate
>> vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from delta
>> dictionaries).
>> Options 2 and 3 are optional for us.
>>
>> Best,
>> Liya Fan
>>
>> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>> > Hi,
>> > A colleague opened up https://issues.apache.org/jira/browse/ARROW-7048
>> for
>> > having similar functionality to the Python APIs that allow for creating
>> one
>> > larger data structure from a series of record batches.  I just wanted to
>> > surface it here in case:
>> > 1.  An efficient solution already exists? It seems like TransferPair
>> > implementations could possibly be improved upon or have they already
>> been
>> > optimized?
>> > 2.  What the preferred API for doing this would be?  Some options I can
>> > think of:
>> >
>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>> >
>> > Thanks,
>> > Micah
>> >
>>
>

Re: [Java] Append multiple record batches together?

Posted by Bryan Cutler <cu...@gmail.com>.
I think having a chunked array with multiple vector buffers would be ideal,
similar to C++. It might take a fair amount of work to add this but would
open up a lot more functionality. As for the API,
VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.

On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <li...@gmail.com> wrote:

> Hi Micah,
>
> Thanks for bringing this up.
>
> > 1.  An efficient solution already exists? It seems like TransferPair
> implementations could possibly be improved upon or have they already been
> optimized?
>
> Fundamentally, memory copy is unavoidable, IMO, because the source and
> target memory regions are likely to be non-contiguous.
> An alternative is to make ArrowBuf support a number of non-contiguous
> memory regions. However, that would harm the performance of ArrowBuf, and
> ArrowBuf is the core of the Arrow library.
>
> > 2.  What the preferred API for doing this would be?  Some options I can
> > think of:
>
> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>
> IMO, option 1 is required, as we have scenarios that need to concatenate
> vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from delta
> dictionaries).
> Options 2 and 3 are optional for us.
>
> Best,
> Liya Fan
>
> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > Hi,
> > A colleague opened up https://issues.apache.org/jira/browse/ARROW-7048
> for
> > having similar functionality to the Python APIs that allow for creating
> one
> > larger data structure from a series of record batches.  I just wanted to
> > surface it here in case:
> > 1.  An efficient solution already exists? It seems like TransferPair
> > implementations could possibly be improved upon or have they already been
> > optimized?
> > 2.  What the preferred API for doing this would be?  Some options I can
> > think of:
> >
> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
> > * VectorLoader.load(Collection<ArrowRecordBatch>)
> >
> > Thanks,
> > Micah
> >
>

Re: [Java] Append multiple record batches together?

Posted by Fan Liya <li...@gmail.com>.
Hi Micah,

Thanks for bringing this up.

> 1.  An efficient solution already exists? It seems like TransferPair
implementations could possibly be improved upon or have they already been
optimized?

Fundamentally, memory copy is unavoidable, IMO, because the source and
target memory regions are likely to be non-contiguous.
An alternative is to make ArrowBuf support a number of non-contiguous
memory regions. However, that would harm the performance of ArrowBuf, and
ArrowBuf is the core of the Arrow library.

> 2.  What the preferred API for doing this would be?  Some options I can
> think of:

> * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
> * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
> * VectorLoader.load(Collection<ArrowRecordBatch>)

IMO, option 1 is required, as we have scenarios that need to concatenate
vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from delta
dictionaries).
Options 2 and 3 are optional for us.
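
For reference, this is roughly how a single batch moves between
ArrowRecordBatch and VectorSchemaRoot today (a self-contained sketch; only
the class name is made up). A load(Collection<ArrowRecordBatch>) overload
would have to merge several such batches into one set of vectors, which is
where the copy comes in:

import java.util.Collections;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public final class LoaderRoundTrip {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema(Collections.singletonList(
        Field.nullable("x", new ArrowType.Int(32, true))));
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot source = VectorSchemaRoot.create(schema, allocator);
         VectorSchemaRoot target = VectorSchemaRoot.create(schema, allocator)) {
      IntVector x = (IntVector) source.getVector("x");
      x.allocateNew(3);
      x.set(0, 1);
      x.set(1, 2);
      x.set(2, 3);
      source.setRowCount(3);

      // Unload the root into a single record batch ...
      try (ArrowRecordBatch batch = new VectorUnloader(source).getRecordBatch()) {
        // ... and load that one batch into another root. load() handles exactly
        // one batch per call, so a collection version needs to combine buffers.
        new VectorLoader(target).load(batch);
      }
      System.out.println(target.getRowCount());  // prints 3
    }
  }
}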

Best,
Liya Fan

On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi,
> A colleague opened up https://issues.apache.org/jira/browse/ARROW-7048 for
> having similar functionality to the Python APIs that allow for creating one
> larger data structure from a series of record batches.  I just wanted to
> surface it here in case:
> 1.  An efficient solution already exists? It seems like TransferPair
> implementations could possibly be improved upon or have they already been
> optimized?
> 2.  What the preferred API for doing this would be?  Some options I can
> think of:
>
> * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
> * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
> * VectorLoader.load(Collection<ArrowRecordBatch>)
>
> Thanks,
> Micah
>