Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2020/09/24 14:00:00 UTC

[jira] [Commented] (ARROW-4713) [C++] Improve C++ Orc Adapter performance and memory footprint

    [ https://issues.apache.org/jira/browse/ARROW-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201546#comment-17201546 ] 

Antoine Pitrou commented on ARROW-4713:
---------------------------------------

[~Yurui Zhou] Are you still planning to work on these issues at some point?

> [C++] Improve C++ Orc Adapter performance and memory footprint
> --------------------------------------------------------------
>
>                 Key: ARROW-4713
>                 URL: https://issues.apache.org/jira/browse/ARROW-4713
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yurui Zhou
>            Assignee: Yurui Zhou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Currently, Arrow C++ provides a naive adapter implementation that allows users to read ORC files into Arrow RecordBatches. However, this implementation has several drawbacks:
>  * Inefficient conversion that incurs huge memcpy overhead
>  ** Currently the ORC adapter performs a byte-by-byte memcpy to move data from ORC VectorBatch to Arrow RecordBatch, even though ORC VectorBatch shares the same memory layout as Arrow for most data types.
>  * Huge memory footprint because of the lack of a TableReader implementation
>  ** The ORC adapter currently only allows users to read data in units of a stripe. However, since ORC is a columnar format with a high compression ratio, data read from a single ORC stripe can potentially take up gigabytes of memory, which makes the ORC adapter hard to use in production environments.
> Here we propose a new ORC adapter implementation to fix the issues mentioned above:
>  * To reduce conversion overhead, instead of performing a naive data copy, the new adapter would take full advantage of the memory layout similarity between ORC VectorBatch and Arrow RecordBatch. Namely, the new adapter will perform pointer manipulation to transfer memory ownership from VectorBatch to Arrow RecordBatch whenever possible.
>  * The new ORC adapter would provide row-level granularity when reading data from an ORC file. The user should be able to specify how many rows to expect in each output RecordBatch, and the ORC adapter should ensure that no more than the requested number of rows is returned.
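
The two ideas proposed above can be sketched in standalone C++. This is only an illustration under stated assumptions, not the adapter's actual code: DecodedColumn, AdoptedArray, AdoptColumn, BatchRange and PlanBatches are hypothetical names, and plain std::vector stands in for ORC and Arrow buffers. The first part hands a buffer over by move instead of memcpy; the second splits stripes into row ranges of at most batch_size rows. In this sketch a batch never spans two stripes, so the last batch of a stripe may be shorter than requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoded ORC column that owns one buffer.
struct DecodedColumn {
  std::vector<int64_t> values;
};

// Hypothetical stand-in for an Arrow-side array that can adopt a buffer.
struct AdoptedArray {
  std::vector<int64_t> values;
};

// Transfer ownership instead of copying: move-constructing the vector hands
// over its heap allocation, so no element is copied.
AdoptedArray AdoptColumn(DecodedColumn&& column) {
  return AdoptedArray{std::move(column.values)};
}

// One planned output batch: rows [offset, offset + length) of one stripe.
struct BatchRange {
  std::size_t stripe;
  int64_t offset;
  int64_t length;
};

// Split each stripe into ranges of at most batch_size rows.
// Assumption of this sketch: a batch does not cross a stripe boundary.
std::vector<BatchRange> PlanBatches(const std::vector<int64_t>& stripe_rows,
                                    int64_t batch_size) {
  assert(batch_size > 0);  // precondition: a batch must hold at least one row
  std::vector<BatchRange> plan;
  for (std::size_t s = 0; s < stripe_rows.size(); ++s) {
    for (int64_t offset = 0; offset < stripe_rows[s]; offset += batch_size) {
      const int64_t length = std::min(batch_size, stripe_rows[s] - offset);
      plan.push_back(BatchRange{s, offset, length});
    }
  }
  return plan;
}
```

For example, PlanBatches({5, 3}, 2) plans five ranges, none longer than two rows, so peak memory is bounded by the requested batch size rather than by the stripe size.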



--
This message was sent by Atlassian Jira
(v8.3.4#803005)