Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/10/08 15:36:00 UTC
[jira] [Assigned] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API
[ https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reassigned ARROW-13797:
---------------------------------------------
Assignee: Joris Van den Bossche
> [C++] Implement column projection pushdown to ORC reader in Datasets API
> ------------------------------------------------------------------------
>
> Key: ARROW-13797
> URL: https://issues.apache.org/jira/browse/ARROW-13797
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: dataset, orc, pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support for the ORC file format in the Datasets API, but the reader still reads all columns regardless of the ScanOptions. Since ORC is a columnar format that supports reading only specific fields, we can optimize this step by pushing the column projection down to the reader.
> The tricky part is converting a field name in the Arrow schema to the corresponding column index in the ORC schema. Currently, this logic lives in the Python bindings (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59), so it needs to be moved to C++.
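To illustrate the name-to-index conversion described above, here is a minimal, self-contained sketch in Python. It is not the code from pyarrow/orc.py or the eventual C++ implementation. It assumes that ORC numbers columns in pre-order, with the root struct as column 0, and it models a schema with plain Python values instead of Arrow types: a dict is a struct, a one-element list is a list type, and a string is a primitive type. The names `traverse` and `schema_to_indices` are hypothetical, and map and union types are left out for brevity.

```python
from itertools import count


def traverse(typ, counter):
    """Yield (path, column_id) pairs for every named field, in pre-order."""
    if isinstance(typ, dict):
        # Struct: each child field takes the next column id, followed
        # immediately by the ids of its own descendants.
        for name, child in typ.items():
            col_id = next(counter)
            yield (name,), col_id
            for sub_path, sub_id in traverse(child, counter):
                yield (name,) + sub_path, sub_id
    elif isinstance(typ, list):
        # List: the element type occupies one column id that cannot be
        # selected by name, so consume it and continue into the element.
        next(counter)
        yield from traverse(typ[0], counter)
    # Primitive types (modelled as strings) have no children.


def schema_to_indices(schema):
    """Map dotted field names to column ids (assumption: root struct is 0)."""
    return {".".join(path): col_id
            for path, col_id in traverse(schema, count(1))}


# Hypothetical example schema, used only for illustration.
schema = {
    "id": "int64",
    "point": {"x": "double", "y": "double"},
    "tags": ["string"],
    "name": "string",
}

indices = schema_to_indices(schema)
# "tags" is column 5 and its element is column 6, so "name" is column 7.
print(indices)
```

With a mapping like this, a projection on `point.x` and `name` would translate to the column ids 3 and 7, which is the kind of information a reader needs in order to skip the remaining columns.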
--
This message was sent by Atlassian Jira
(v8.3.4#803005)