Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/10/08 15:36:00 UTC

[jira] [Assigned] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API

     [ https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reassigned ARROW-13797:
---------------------------------------------

    Assignee: Joris Van den Bossche

> [C++] Implement column projection pushdown to ORC reader in Datasets API
> ------------------------------------------------------------------------
>
>                 Key: ARROW-13797
>                 URL: https://issues.apache.org/jira/browse/ARROW-13797
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, orc, pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support for ORC file format in the Datasets API, but the reader still reads all columns regardless of the ScanOptions. Since ORC is a columnar format that supports reading only specific fields, we can optimize this step. 
> The tricky part is to convert the field names of the Arrow schema to the corresponding column indices in the ORC schema. Currently, this logic lives in the Python bindings (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59), so it needs to be moved to C++.
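
To illustrate the name-to-index conversion described above, here is a minimal Python sketch. It is not the code from pyarrow/orc.py and not the eventual C++ implementation. It assumes ORC's usual numbering, where the root struct is column 0 and the remaining columns are numbered depth-first in pre-order. The type classes (Primitive, Struct, ListOf) and the helper names are hypothetical stand-ins for the Arrow schema types, chosen so the example runs without pyarrow.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Iterator, List, Tuple


# Hypothetical, simplified stand-ins for Arrow data types.
@dataclass
class Primitive:
    name: str = "int64"


@dataclass
class Struct:
    fields: List[Tuple[str, object]]  # (field name, field type) pairs


@dataclass
class ListOf:
    element: object


def traverse(typ, counter) -> Iterator[Tuple[Tuple[str, ...], int]]:
    """Yield (name path, column id) pairs in pre-order."""
    if isinstance(typ, Struct):
        for name, child in typ.fields:
            col_id = next(counter)  # the field itself takes the next id
            yield (name,), col_id
            for sub_path, sub_id in traverse(child, counter):
                yield (name,) + sub_path, sub_id
    elif isinstance(typ, ListOf):
        # The list element occupies an id of its own but has no name,
        # so it is skipped rather than exposed as a selectable path.
        next(counter)
        yield from traverse(typ.element, counter)
    # Primitive types have no children, so there is nothing to yield.


def schema_to_indices(schema: Struct) -> Dict[str, int]:
    """Map dotted field names to column ids; the root struct is id 0."""
    return {".".join(path): idx for path, idx in traverse(schema, count(1))}


def columns_to_indices(schema: Struct, columns: List[str]) -> List[int]:
    """Translate a projection given as field names into column ids."""
    mapping = schema_to_indices(schema)
    return [mapping[name] for name in columns]


# struct<a: int64, b: struct<x: int64, y: string>, c: list<int64>, d: int64>
schema = Struct([
    ("a", Primitive()),
    ("b", Struct([("x", Primitive()), ("y", Primitive("string"))])),
    ("c", ListOf(Primitive())),
    ("d", Primitive()),
])

indices = schema_to_indices(schema)
projection = columns_to_indices(schema, ["b.y", "d"])
```

Here the list element under "c" consumes id 6, which is why "d" maps to 7 rather than 6. Nested types are where a naive positional lookup by top-level field order would go wrong.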



--
This message was sent by Atlassian Jira
(v8.3.4#803005)