Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/10/11 15:49:00 UTC

[jira] [Resolved] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API

     [ https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-13797.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 11372
[https://github.com/apache/arrow/pull/11372]

> [C++] Implement column projection pushdown to ORC reader in Datasets API
> ------------------------------------------------------------------------
>
>                 Key: ARROW-13797
>                 URL: https://issues.apache.org/jira/browse/ARROW-13797
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, orc, pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support for the ORC file format in the Datasets API, but the reader still reads all columns regardless of the ScanOptions. Since ORC is a columnar format that supports reading only specific fields, we can optimize this step.
> The tricky part is mapping each field name in the Arrow schema to its column index in the ORC schema. Currently, this logic lives in the Python bindings (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59), so it needs to be moved to C++.
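
For illustration only, the name-to-index mapping described above might look roughly like the sketch below. It is not the code from the linked Python bindings or from the pull request. It assumes that ORC numbers columns in pre-order with id 0 reserved for the root struct, and it models a schema as a plain nested dict (struct fields only, no lists, maps or unions). The names traverse, schema_to_indices and columns_for are hypothetical.

```python
from itertools import count


def traverse(fields, counter, prefix=()):
    """Yield (path, column_id) pairs for a struct's fields in pre-order."""
    for name, typ in fields.items():
        path = prefix + (name,)
        # Each field takes the next id before any of its children do.
        yield path, next(counter)
        if isinstance(typ, dict):  # a nested struct, modelled as a dict
            yield from traverse(typ, counter, path)


def schema_to_indices(schema):
    """Map dotted field names to column ids (assumption: root struct is id 0)."""
    return {".".join(path): idx for path, idx in traverse(schema, count(1))}


def columns_for(schema, names):
    """Translate projected field names into column ids, rejecting unknown names."""
    indices = schema_to_indices(schema)
    missing = [name for name in names if name not in indices]
    if missing:
        raise KeyError(f"fields not found in schema: {missing}")
    return [indices[name] for name in names]


# Hypothetical schema: two primitive columns and one nested struct.
schema = {
    "id": "int64",
    "point": {"x": "double", "y": "double"},
    "label": "string",
}
indices = schema_to_indices(schema)
selected = columns_for(schema, ["label", "point.x"])
```

With a mapping like this, a scan that projects only a few fields could hand the reader the matching column ids instead of reading every column.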



--
This message was sent by Atlassian Jira
(v8.3.4#803005)