Posted to issues@arrow.apache.org by "Liya Fan (Jira)" <ji...@apache.org> on 2019/11/06 06:57:00 UTC
[jira] [Assigned] (ARROW-7048) [Java] Support for combining
multiple vectors under VectorSchemaRoot
[ https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liya Fan reassigned ARROW-7048:
-------------------------------
Assignee: Liya Fan
> [Java] Support for combining multiple vectors under VectorSchemaRoot
> --------------------------------------------------------------------
>
> Key: ARROW-7048
> URL: https://issues.apache.org/jira/browse/ARROW-7048
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Java
> Reporter: Yogesh Tewari
> Assignee: Liya Fan
> Priority: Major
>
> Hi,
>
> pyarrow.Table.combine_chunks provides the useful functionality of combining multiple record batches into a single pyarrow.Table.
>
> I am currently working on a downstream application that reads data from BigQuery. The BigQuery Storage API supports output in Arrow format, but it streams the data in many batches of 1024 rows or fewer.
> It would be really nice to have the Arrow Java API provide this functionality under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfield@gmail.com], I tried to write my own implementation by copying data vector by vector using TransferPair's copyValueSafe.
> But, unless I am missing something obvious, it turns out that it copies only one value at a time. That means a lot of looping, calling copyValueSafe for millions of rows to copy from a source vector index to a target vector index. Ideally, I would want to concatenate or link the underlying buffers rather than copy one cell at a time.
>
> E.g., if I have:
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) {
>     Schema schema = reader.getVectorSchemaRoot().getSchema();
>     for (int i = 0; i < 5; i++) {
>         // This will be loaded with new values on every call to loadNextBatch
>         VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
>         reader.loadNextBatch();
>         batchList.add(readBatch);
>     }
> }
> //VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);{code}
>
> A method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if it's the right thing to use here.
>
>
> P.S. Feel free to update the title of this feature request with more appropriate wording.
>
> Cheers,
> Yogesh
>
>
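The difference the reporter describes, copying one cell at a time versus moving whole buffers, can be illustrated with a minimal plain-Java sketch. It deliberately does not use the Arrow API: each int[] stands in for the data buffer of one fixed-width column chunk, and validity bitmaps, nulls, and variable-width offsets are not modeled. The names ChunkCombiner, combinePerValue, and combineBulk are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for combining fixed-width column chunks (assumption: no Arrow types involved). */
final class ChunkCombiner {

    private ChunkCombiner() {}

    /** Copies one value at a time, analogous to looping over a per-index copy call. */
    static int[] combinePerValue(List<int[]> chunks) {
        int[] target = new int[totalLength(chunks)];
        int targetIndex = 0;
        for (int[] chunk : chunks) {
            for (int sourceIndex = 0; sourceIndex < chunk.length; sourceIndex++) {
                target[targetIndex++] = chunk[sourceIndex];
            }
        }
        return target;
    }

    /** Copies each chunk's backing array with a single bulk operation per chunk. */
    static int[] combineBulk(List<int[]> chunks) {
        int[] target = new int[totalLength(chunks)];
        int offset = 0;
        for (int[] chunk : chunks) {
            System.arraycopy(chunk, 0, target, offset, chunk.length);
            offset += chunk.length;
        }
        return target;
    }

    /** Sums the chunk lengths so the target can be allocated once. */
    private static int totalLength(List<int[]> chunks) {
        int total = 0;
        for (int[] chunk : chunks) {
            total += chunk.length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<int[]> chunks = Arrays.asList(new int[] {1, 2, 3}, new int[] {4, 5}, new int[] {6});
        System.out.println(Arrays.toString(combinePerValue(chunks))); // prints [1, 2, 3, 4, 5, 6]
        System.out.println(Arrays.toString(combineBulk(chunks)));     // prints [1, 2, 3, 4, 5, 6]
    }
}
```

Both methods produce the same result; the bulk variant makes one copy call per chunk instead of one per value. This is the shape of operation the request asks for at the buffer level, though a real implementation would also have to handle validity and offset buffers.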
--
This message was sent by Atlassian Jira
(v8.3.4#803005)