Posted to jira@arrow.apache.org by "Chao Sun (Jira)" <ji...@apache.org> on 2020/12/29 01:45:00 UTC
[jira] [Comment Edited] (ARROW-11016) [Rust] Parquet ArrayReader should allow reading a subset of row groups

    [ https://issues.apache.org/jira/browse/ARROW-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255765#comment-17255765 ] 

Chao Sun edited comment on ARROW-11016 at 12/29/20, 1:44 AM:
-------------------------------------------------------------

Sorry for the late reply. Yes, I think it should be possible. On the file reader side, we can pass in a (start, end) pair alongside the file handle to indicate that we only want to read a segment of the file. Then, after parsing the file metadata, we can check all the row groups in the file, determine which row group(s) overlap with the segment, and select only those.
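
To make the selection step concrete, here is a rough sketch in Rust. It is only an illustration of the idea, not code from the parquet crate: `RowGroupSpan`, `overlapping_row_groups`, and `row_groups_by_midpoint` are hypothetical names, and the sketch assumes each row group's starting byte offset and byte length have already been extracted from the parsed file metadata. It also assumes that (start, end) is a half-open byte range. The second function is a variant I am adding for comparison: it picks a row group only when its midpoint falls inside the segment, so that two adjacent segments never both select the same row group.

```rust
/// Hypothetical summary of one row group: where its bytes start in the file
/// and how many bytes it occupies. This is NOT a type from the parquet crate.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RowGroupSpan {
    pub start: u64,
    pub len: u64,
}

/// Returns the indices of the row groups whose byte range intersects the
/// half-open segment `[seg_start, seg_end)`.
pub fn overlapping_row_groups(
    groups: &[RowGroupSpan],
    seg_start: u64,
    seg_end: u64,
) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        // Two half-open ranges intersect when each one starts before the
        // other one ends. Empty row groups are never selected.
        .filter(|(_, g)| g.len > 0 && g.start < seg_end && seg_start < g.start + g.len)
        .map(|(i, _)| i)
        .collect()
}

/// Variant (an assumption, not part of the proposal above): select a row
/// group only if its midpoint lies inside `[seg_start, seg_end)`. With
/// non-overlapping segments, each row group is then selected at most once.
pub fn row_groups_by_midpoint(
    groups: &[RowGroupSpan],
    seg_start: u64,
    seg_end: u64,
) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| {
            let mid = g.start + g.len / 2;
            seg_start <= mid && mid < seg_end
        })
        .map(|(i, _)| i)
        .collect()
}
```

With a pure overlap test, a row group that straddles a segment boundary is selected by both neighbouring segments, so a caller splitting one file across threads would probably want the midpoint variant or some other tie-breaking rule.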

You can probably check relevant code in [Spark|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105] and [Parquet|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1223] for reference.

I'm not sure about the file handle sharing issue [~nevi_me] mentioned, though. I thought we used to clone the file handle so that it can be shared, but I haven't looked at the code base for some time :(



> [Rust] Parquet ArrayReader should allow reading a subset of row groups
> ----------------------------------------------------------------------
>
>                 Key: ARROW-11016
>                 URL: https://issues.apache.org/jira/browse/ARROW-11016
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: Andy Grove
>            Priority: Major
>
> Parquet ArrayReader currently only supports reading an entire file from start to finish and does not allow selectively reading a subset of row groups. This prevents us from parallelizing work across threads when processing a single parquet file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)