Posted to dev@parquet.apache.org by "Xianjin YE (JIRA)" <ji...@apache.org> on 2017/11/27 06:01:00 UTC

[jira] [Updated] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

     [ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xianjin YE updated PARQUET-1166:
--------------------------------
    Description: 
Hi, I'd like to propose a new API to better support splittable reading of Parquet files.

The intent of this API is to allow selective reading of RowGroups (normally contiguous, but they can be arbitrary as long as the row group indices are sorted and unique, e.g. [1, 3, 5]).

The proposed API would be something like this:

{code:cpp}
::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);

::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);

{code}

With the new API, a Parquet file can be split by RowGroups and processed by multiple tasks (possibly on different hosts, like Map tasks in MapReduce).
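
For illustration, here is a rough sketch of how a worker task could consume only its assigned row groups via the proposed overload. The surrounding calls (::arrow::io::ReadableFile::Open, parquet::arrow::OpenFile, RecordBatchReader::ReadNext) follow the existing Arrow/parquet-cpp C++ APIs, but exact signatures differ between releases, and GetRecordBatchReader itself is only the proposed addition, so treat this as a sketch rather than a working example:

{code:cpp}
#include <memory>
#include <string>
#include <vector>

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/record_batch.h>
#include <arrow/status.h>
#include <parquet/arrow/reader.h>

// Sketch: read only the row groups assigned to this task.
// GetRecordBatchReader is the API proposed in this issue, not yet in reader.h.
::arrow::Status ProcessAssignedRowGroups(const std::string& path,
                                         const std::vector<int>& row_group_indices) {
  std::shared_ptr<::arrow::io::ReadableFile> file;
  ARROW_RETURN_NOT_OK(::arrow::io::ReadableFile::Open(path, &file));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(file, ::arrow::default_memory_pool(), &reader));

  // Proposed overload: only the listed row groups are read.
  std::shared_ptr<::arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_group_indices, &batch_reader));

  std::shared_ptr<::arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // no more batches in the assigned row groups
    // ... process the batch ...
  }
  return ::arrow::Status::OK();
}
{code}

A driver could then hand disjoint, sorted index sets (e.g. [0, 1], [2, 3], ...) to separate tasks, so each task reads only its share of the file.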

[~wesmckinn]@xch



> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1166
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1166
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Xianjin YE
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)