Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/03/23 20:05:00 UTC
[jira] [Assigned] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney reassigned PARQUET-1166:
-------------------------------------
Assignee: Xianjin YE
> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -----------------------------------------------------------------
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Xianjin YE
> Assignee: Xianjin YE
> Priority: Major
> Fix For: cpp-1.5.0
>
>
> Hi, I'd like to propose a new API to better support splittable reading of Parquet files.
> The intent of this API is to allow selectively reading RowGroups (normally contiguous, but any selection is allowed as long as the row_group_indices are sorted and unique, for example [1, 3, 5]).
> The proposed API would be something like this:
> {code:cpp}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
>
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, a Parquet file can be split by RowGroup so that the splits can be processed by multiple tasks (possibly on different hosts, like Map tasks in MapReduce).
> [~wesmckinn][~xhochy] What do you think?
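
To make the splitting idea above concrete, here is a minimal, library-free sketch of how a caller might prepare the row_group_indices argument. It is not part of the proposal or of parquet-cpp: the helper names IsValidRowGroupSelection and SplitRowGroups are hypothetical, and the even, contiguous partitioning policy is only one assumed way to divide the work. Each resulting split would then be passed to the proposed GetRecordBatchReader by a separate task.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true if the indices are strictly increasing
// (i.e. sorted and unique) and all lie in [0, num_row_groups).
bool IsValidRowGroupSelection(const std::vector<int>& row_group_indices,
                              int num_row_groups) {
  for (std::size_t i = 0; i < row_group_indices.size(); ++i) {
    const int idx = row_group_indices[i];
    if (idx < 0 || idx >= num_row_groups) return false;
    if (i > 0 && row_group_indices[i - 1] >= idx) return false;
  }
  return true;
}

// Hypothetical helper: divides row groups 0..num_row_groups-1 into at most
// num_tasks contiguous splits whose sizes differ by at most one.
// Assumption: an even split by row group count, ignoring row group byte sizes.
std::vector<std::vector<int>> SplitRowGroups(int num_row_groups, int num_tasks) {
  assert(num_row_groups >= 0 && num_tasks > 0);
  std::vector<std::vector<int>> splits;
  // Never create more splits than there are row groups.
  const int n = std::min(num_row_groups, num_tasks);
  int next = 0;
  for (int t = 0; t < n; ++t) {
    // The first (num_row_groups % n) splits receive one extra row group.
    const int size = num_row_groups / n + (t < num_row_groups % n ? 1 : 0);
    std::vector<int> split;
    for (int i = 0; i < size; ++i) split.push_back(next++);
    splits.push_back(split);
  }
  return splits;
}
```

For example, a file with 5 row groups and 2 tasks yields the splits [0, 1, 2] and [3, 4]; each one is sorted and unique, so it satisfies the constraint stated in the proposal.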
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)