Posted to dev@parquet.apache.org by "David Li (Jira)" <ji...@apache.org> on 2020/01/11 15:46:00 UTC

[jira] [Commented] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

    [ https://issues.apache.org/jira/browse/PARQUET-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013527#comment-17013527 ] 

David Li commented on PARQUET-1698:
-----------------------------------

Hey, we are actually investigating improvements in this direction. We have an in-house library that does read coalescing (so: not individual Read calls per column chunk, and not pre-buffering the entire row group - we do a lot of column selection from wide datasets). In essence, given the bandwidth-delay product, it computes an optimal set of read ranges and hints to the S3 file implementation which ranges it will need; the S3 file implementation then fetches and buffers those ranges in parallel.

We're still working out the contribution strategy - but is there a path to go from simple optimizations like this to more complex ones like ours?
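
To make the coalescing idea concrete, here is a rough sketch (illustrative names only, not our in-house library's API) of merging nearby read ranges under a hole-size threshold derived from the bandwidth-delay product:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A byte range to be read from the remote file.
    struct ReadRange {
      int64_t offset;
      int64_t length;
    };

    // Merge ranges whose gaps are at most `hole_size_limit` bytes wide.
    // `hole_size_limit` would typically come from the bandwidth-delay product:
    // if fetching the hole is cheaper than paying for another round trip,
    // the two ranges are coalesced into a single larger read.
    std::vector<ReadRange> CoalesceReadRanges(std::vector<ReadRange> ranges,
                                              int64_t hole_size_limit) {
      if (ranges.empty()) return ranges;
      std::sort(ranges.begin(), ranges.end(),
                [](const ReadRange& a, const ReadRange& b) {
                  return a.offset < b.offset;
                });
      std::vector<ReadRange> coalesced{ranges.front()};
      for (std::size_t i = 1; i < ranges.size(); ++i) {
        ReadRange& last = coalesced.back();
        const int64_t gap = ranges[i].offset - (last.offset + last.length);
        if (gap <= hole_size_limit) {
          // Extend the previous range to also cover this one (and the hole).
          last.length = std::max(last.length,
                                 ranges[i].offset + ranges[i].length - last.offset);
        } else {
          coalesced.push_back(ranges[i]);
        }
      }
      return coalesced;
    }

    // The coalesced ranges would then be handed to the S3 file implementation
    // (e.g. a hypothetical file->WillNeed(coalesced)), which fetches and
    // buffers them in parallel before the column readers call Read().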

> [C++] Add reader option to pre-buffer entire serialized row group into memory
> -----------------------------------------------------------------------------
>
>                 Key: PARQUET-1698
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1698
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Zherui Cao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: cpp-1.6.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In some scenarios (example: reading datasets from Amazon S3), reading columns independently and allowing unbridled {{Read}} calls to the underlying file handle can yield suboptimal performance. In such cases, it may be preferable to first read the entire serialized row group into memory and then deserialize the constituent columns from this buffer.
> Note that such an option would not be appropriate as a default behavior for all file handle types, since low-selectivity reads (example: reading only 3 columns out of a file with 100 columns) will be suboptimal in some cases. I think it would be better for "high latency" file systems to opt into this option.
> cc [~fsaintjacques] [~bkietz] [~apitrou]
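
For concreteness, a minimal sketch of how the byte extent of one serialized row group can be computed from the existing parquet-cpp metadata classes; RowGroupExtent is an illustrative helper here, not an existing API. An opt-in reader path could fetch this extent with a single ReadAt and deserialize the column chunks from the buffered bytes:

    #include <parquet/metadata.h>

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <memory>
    #include <utility>

    // Compute the {offset, length} of one row group's serialized bytes by
    // scanning its column chunk metadata. A reader that opts into
    // pre-buffering could issue one ReadAt() for this extent and then serve
    // the column decoders from the resulting in-memory buffer instead of
    // making many small reads against S3.
    std::pair<int64_t, int64_t> RowGroupExtent(
        const parquet::FileMetaData& metadata, int row_group) {
      std::unique_ptr<parquet::RowGroupMetaData> rg = metadata.RowGroup(row_group);
      int64_t start = std::numeric_limits<int64_t>::max();
      int64_t end = 0;
      for (int i = 0; i < rg->num_columns(); ++i) {
        std::unique_ptr<parquet::ColumnChunkMetaData> col = rg->ColumnChunk(i);
        int64_t col_start = col->data_page_offset();
        if (col->has_dictionary_page()) {
          col_start = std::min(col_start, col->dictionary_page_offset());
        }
        start = std::min(start, col_start);
        end = std::max(end, col_start + col->total_compressed_size());
      }
      return {start, end - start};  // {offset, length} of the row group
    }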



--
This message was sent by Atlassian Jira
(v8.3.4#803005)