Posted to dev@arrow.apache.org by Wenbo Hu <hu...@gmail.com> on 2023/04/03 12:58:50 UTC

Best practice on populating from VectorSchemaRoot to VectorSchemaRoot, ArrowStreamReader to ArrowStreamWriter

Hi,

Consider this situation: when handling doGet for a ticket on an Arrow
Flight RPC server, the server reads several IPC upstreams (Parquet files
read through the Dataset API) and pushes them into the same downstream.
How can this be implemented with fewer copies?
Normally, with a single IPC upstream, I start the ServerStreamListener
directly with the root returned by getVectorSchemaRoot on the reader of
the upstream IPC.
With several upstreams, it seems that I have to deal with
VectorSchemaRoot rather than ArrowRecordBatch directly.
What is the proper implementation for populating one root from another?
Is it correct to use VectorLoader/VectorUnloader?
Does this introduce extra steps by creating an intermediate
ArrowRecordBatch unnecessarily? (ArrowBuf -> VectorSchemaRoot@UpstreamReader ->
ArrowBuf@Loader -> VectorSchemaRoot@DownstreamWriter -> ArrowBuf)

Maybe it relates to the allocator: is there a better implementation when
both sides use the same allocator?
-- 
---------------------
Best Regards,
Wenbo Hu,

Re: Best practice on populating from VectorSchemaRoot to VectorSchemaRoot, ArrowStreamReader to ArrowStreamWriter

Posted by David Dali Susanibar Arce <da...@gmail.com>.
Hi Wenbo Hu,

Sorry to join late. Wenbo, what about the approach shown in the Java
Flight Cookbook (1)? There, the acceptPut method is the upstream side,
where a VectorUnloader is needed, and the getStream method is the
downstream side, where a VectorLoader is needed. Initially the cookbook
used ArrowRecordBatch.cloneWithTransfer, but that did not work for all
scenarios and it was finally changed to VectorLoader.load (2).
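
For illustration, the unload/load hand-off described above might look roughly like the sketch below. This is not the cookbook's code: the class name RootToRoot, the helpers transfer and makeUpstream, and the single-column "id" schema are made up for the example, and the upstream roots are filled by hand instead of coming from real IPC readers. As far as I understand, getRecordBatch and load mostly pass buffers along by reference counting rather than copying bytes, but please treat that as an assumption to verify against your Arrow version.

```java
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class RootToRoot {

    // Hypothetical helper: moves the current batch of `source` into `target`
    // through an intermediate ArrowRecordBatch. Both roots must share a schema.
    static void transfer(VectorSchemaRoot source, VectorSchemaRoot target) {
        VectorUnloader unloader = new VectorUnloader(source);
        VectorLoader loader = new VectorLoader(target);
        // The batch holds references to the source buffers; closing it
        // releases those references once the target has loaded them.
        try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
            loader.load(batch);
        }
    }

    // Hypothetical stand-in for an upstream reader's root, filled by hand.
    static VectorSchemaRoot makeUpstream(Schema schema, BufferAllocator allocator, int... values) {
        VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
        IntVector ids = (IntVector) root.getVector("id");
        ids.allocateNew(values.length);
        for (int i = 0; i < values.length; i++) {
            ids.set(i, values[i]);
        }
        root.setRowCount(values.length);
        return root;
    }

    public static void main(String[] args) {
        Schema schema = new Schema(List.of(Field.nullable("id", new ArrowType.Int(32, true))));
        try (BufferAllocator allocator = new RootAllocator();
             VectorSchemaRoot first = makeUpstream(schema, allocator, 1, 2, 3);
             VectorSchemaRoot second = makeUpstream(schema, allocator, 4, 5);
             VectorSchemaRoot downstream = VectorSchemaRoot.create(schema, allocator)) {
            for (VectorSchemaRoot upstream : List.of(first, second)) {
                transfer(upstream, downstream);
                // In a doGet handler, this is where the listener that was
                // started with `downstream` would send the batch (putNext).
                System.out.println("rows in downstream: " + downstream.getRowCount());
            }
        }
    }
}
```

With real readers, the loop would instead call loadNextBatch on each upstream reader and run transfer once per batch, so the single downstream root is reused for every upstream.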

Please let us know what you think.

(1) https://arrow.apache.org/cookbook/java/flight.html
(2) https://github.com/apache/arrow-cookbook/issues/218

Best regards,

David

On Mon, Apr 3, 2023 at 7:59, Wenbo Hu (<hu...@gmail.com>) wrote:

> Hi,
>
> Consider this situation: when handling doGet for a ticket on an Arrow
> Flight RPC server, the server reads several IPC upstreams (Parquet files
> read through the Dataset API) and pushes them into the same downstream.
> How can this be implemented with fewer copies?
> Normally, with a single IPC upstream, I start the ServerStreamListener
> directly with the root returned by getVectorSchemaRoot on the reader of
> the upstream IPC.
> With several upstreams, it seems that I have to deal with
> VectorSchemaRoot rather than ArrowRecordBatch directly.
> What is the proper implementation for populating one root from another?
> Is it correct to use VectorLoader/VectorUnloader?
> Does this introduce extra steps by creating an intermediate
> ArrowRecordBatch unnecessarily? (ArrowBuf -> VectorSchemaRoot@UpstreamReader ->
> ArrowBuf@Loader -> VectorSchemaRoot@DownstreamWriter -> ArrowBuf)
>
> Maybe it relates to the allocator: is there a better implementation when
> both sides use the same allocator?
> --
> ---------------------
> Best Regards,
> Wenbo Hu,
>