Posted to jira@arrow.apache.org by "Remi Dettai (Jira)" <ji...@apache.org> on 2020/10/01 08:12:00 UTC

[jira] [Commented] (ARROW-10135) [Rust] [Parquet] Refactor file module to help adding sources

    [ https://issues.apache.org/jira/browse/ARROW-10135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205356#comment-17205356 ] 

Remi Dettai commented on ARROW-10135:
-------------------------------------

Hi [~alamb] ! Thanks for your insight. The problem when you use S3 (or any HTTP blob storage) is that you typically want to fetch large chunks at once, because each call is costly and has high latency. I am not talking about micro-optimization here: requesting tiny ranges from blob storage is just bad ;) (and using a buffered reader is also very sub-optimal).

In the Parquet-from-S3 use case, you'll want to download the footer in one call and then entire column chunks in subsequent calls. This is not compatible with the standard IO traits, which are typically tailored to read only the amount of data needed for the current operation: read the footer length + magic bytes, deserialize the row group header, deserialize a page... The problem is the same in the C++ implementation, but there it is left to the `ParquetFileReader` implementation to fetch those full chunks at once behind the `ReadAt(position, nbytes)` IO interface. This is not the case in Rust, and I don't believe it is a good solution anyway: how the IO API will be called is hidden inside the implementation, and nothing prevents a later change to the `FileReader` from breaking it again.
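To make the mismatch concrete, here is a minimal sketch (my own illustration, not code from the crate) of the kind of small-read sequence the standard `Read`/`Seek` traits encourage when locating the footer; each call below becomes its own range request when the reader is backed by a blob store:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// Illustrative only: locate and read the Parquet footer with plain
/// `Read` + `Seek`. Cheap against a local file, two full round trips
/// against S3.
fn read_footer_bytes<R: Read + Seek>(reader: &mut R) -> Result<Vec<u8>> {
    // Last 8 bytes: 4-byte little-endian footer length + 4-byte magic "PAR1".
    let mut tail = [0u8; 8];
    reader.seek(SeekFrom::End(-8))?;
    reader.read_exact(&mut tail)?;
    let footer_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as i64;

    // Second tiny read for the metadata itself.
    reader.seek(SeekFrom::End(-8 - footer_len))?;
    let mut footer = vec![0u8; footer_len as usize];
    reader.read_exact(&mut footer)?;
    Ok(footer)
}
```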

What I'm working on is introducing an intermediate `ChunkReader` trait object that generates "sliced readers" implementing the standard IO traits. These "sliced readers" will cover the largest useful range (e.g. the entire footer or an entire column chunk), but the rest of the implementation can keep using normal "small reads". This brings together the best of both worlds: the standard FS reader can keep working with its buffered reader with barely any overhead, and blob storage clients can fetch large ranges at once and then read the response stream progressively.
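For illustration, the trait could look roughly like this (names and signatures are my own sketch, not the final API):

```rust
use std::io::Read;

/// Sketch of an intermediate trait: the source hands out a reader over a
/// whole byte range, and the rest of the code keeps doing small reads
/// against that slice.
trait ChunkReader {
    type T: Read;

    /// Return a "sliced reader" covering `length` bytes starting at `start`.
    fn get_read(&self, start: u64, length: usize) -> std::io::Result<Self::T>;
}
```

With something like this, the footer read and each column chunk become a single `get_read` call sized to the full range: a file-backed implementation can hand back a buffered reader positioned at `start`, while an S3-backed implementation issues one ranged GET and returns the response body as the `Read`.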

[~andygrove] It is not entirely clear to me how to introduce async efficiently into this, but I will try to figure it out as I go.

> [Rust] [Parquet] Refactor file module to help adding sources
> ------------------------------------------------------------
>
>                 Key: ARROW-10135
>                 URL: https://issues.apache.org/jira/browse/ARROW-10135
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>    Affects Versions: 1.0.1
>            Reporter: Remi Dettai
>            Priority: Major
>              Labels: parquet, pull-request-available
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, the Parquet reader is very strongly tied to file system reads. This makes it hard to add other sources. For instance, to implement S3, we would need a reader that loads entire columns at once rather than buffered reads of a few kB.
> To improve modularity, we could try to move as much logic as possible to the generic traits (FileReader, RowGroupReader...) and reduce the code in the implementing structs (SerializedFileReader, SerializedRowGroupReader...) to the part that is specific to file/buffered reads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)