Posted to issues@arrow.apache.org by "Liya Fan (Jira)" <ji...@apache.org> on 2019/11/06 06:57:00 UTC

[jira] [Assigned] (ARROW-7048) [Java] Support for combining multiple vectors under VectorSchemaRoot

     [ https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liya Fan reassigned ARROW-7048:
-------------------------------

    Assignee: Liya Fan

> [Java] Support for combining multiple vectors under VectorSchemaRoot
> --------------------------------------------------------------------
>
>                 Key: ARROW-7048
>                 URL: https://issues.apache.org/jira/browse/ARROW-7048
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Yogesh Tewari
>            Assignee: Liya Fan
>            Priority: Major
>
> Hi,
>  
> pyarrow.Table.combine_chunks provides the useful ability to combine multiple record batches under a single pyarrow.Table.
>  
> I am currently working on a downstream application that reads data from BigQuery. The BigQuery Storage API supports output in Arrow format, but it streams the data in many batches of 1024 rows or fewer.
> It would be really nice if the Arrow Java API provided this functionality under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfield@gmail.com], I tried to write my own implementation by copying data vector by vector using TransferPair's copyValueSafe.
> But, unless I am missing something obvious, it turns out that it only copies one value at a time. That means a lot of looping, calling copyValueSafe for millions of rows to move each value from a source vector index to a target vector index. Ideally, I would want to concatenate or link the underlying buffers rather than copy one cell at a time.
>  
> E.g., if I have:
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) {
>     Schema schema = reader.getVectorSchemaRoot().getSchema();
>     for (int i = 0; i < 5; i++) {
>         // The reader reuses this root; it is loaded with new values on every call to loadNextBatch
>         VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
>         reader.loadNextBatch();
>         batchList.add(readBatch);
>     }
> }
> //VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);{code}
>  
> Could there be a method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the right thing to use here.
>  
>  
> PS. Feel free to update the title of this feature request with more appropriate wording.
>  
> Cheers,
> Yogesh
>  
>  
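
To illustrate the difference described above between copying one value at a time and copying a whole buffer per batch, here is a minimal sketch. It deliberately uses plain Java int arrays as a stand-in for Arrow vectors, so it runs without any Arrow dependency. The class ChunkCombiner and the methods combinePerValue and combineChunks are hypothetical names chosen for illustration; they are not part of the Arrow Java API, and this is not a proposal for how VectorSchemaRoot should implement the feature.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: plain int[] chunks stand in for Arrow vectors.
// Nothing in this class is part of the Arrow Java API.
class ChunkCombiner {

    // Per-value copy: one assignment per row, similar in spirit to
    // looping over copyValueSafe for every source index.
    static int[] combinePerValue(List<int[]> chunks) {
        int[] target = new int[totalLength(chunks)];
        int targetIndex = 0;
        for (int[] chunk : chunks) {
            for (int sourceIndex = 0; sourceIndex < chunk.length; sourceIndex++) {
                target[targetIndex++] = chunk[sourceIndex];
            }
        }
        return target;
    }

    // Bulk copy: one System.arraycopy call per chunk instead of one
    // assignment per row.
    static int[] combineChunks(List<int[]> chunks) {
        int[] target = new int[totalLength(chunks)];
        int offset = 0;
        for (int[] chunk : chunks) {
            System.arraycopy(chunk, 0, target, offset, chunk.length);
            offset += chunk.length;
        }
        return target;
    }

    // Sum of all chunk lengths, so the target is allocated only once.
    private static int totalLength(List<int[]> chunks) {
        int total = 0;
        for (int[] chunk : chunks) {
            total += chunk.length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<int[]> batches =
            Arrays.asList(new int[] {1, 2, 3}, new int[] {4}, new int[] {5, 6});
        // Prints [1, 2, 3, 4, 5, 6]
        System.out.println(Arrays.toString(combineChunks(batches)));
    }
}
```

Both variants still copy the data into a newly allocated target; only the granularity of the copy differs. Linking the underlying buffers without any copy, as wished for in the description, would need a different design than this sketch shows.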



--
This message was sent by Atlassian Jira
(v8.3.4#803005)