Posted to dev@parquet.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/11/21 04:16:00 UTC

[jira] [Created] (PARQUET-1698) [C++] Add reader option to pre-buffer entire serialized row group into memory

Wes McKinney created PARQUET-1698:
-------------------------------------

             Summary: [C++] Add reader option to pre-buffer entire serialized row group into memory
                 Key: PARQUET-1698
                 URL: https://issues.apache.org/jira/browse/PARQUET-1698
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
            Reporter: Wes McKinney
             Fix For: cpp-1.6.0


In some scenarios (example: reading datasets from Amazon S3), reading columns independently and allowing unbridled {{Read}} calls to the underlying file handle can yield suboptimal performance. In such cases, it may be preferable to first read the entire serialized row group into memory, then deserialize the constituent columns from this in-memory buffer.

Note that such an option would not be appropriate as the default behavior for all file handle types, since low-selectivity reads (example: reading only 3 columns out of a file with 100 columns) will be suboptimal in some cases. I think it would be better for "high latency" file systems to opt in to this option.
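
To make the trade-off concrete, here is a minimal, self-contained sketch of what such an opt-in option might look like. It is only an illustration and does not use the parquet-cpp API: {{CountingFile}}, {{ColumnChunkRange}}, {{ReaderOptions}}, {{pre_buffer_row_group}} and {{ReadColumns}} are hypothetical names invented for this example, and the "file" is just a string that counts how many reads and bytes were requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a high-latency file handle (not a parquet-cpp type).
// It records how many reads were issued and how many bytes were transferred.
class CountingFile {
 public:
  explicit CountingFile(std::string data) : data_(std::move(data)) {}

  // Copies up to nbytes starting at offset into out; returns the bytes copied.
  std::size_t ReadAt(std::size_t offset, std::size_t nbytes, char* out) {
    ++num_reads_;
    if (offset >= data_.size()) return 0;
    const std::size_t n = std::min(nbytes, data_.size() - offset);
    std::memcpy(out, data_.data() + offset, n);
    bytes_read_ += n;
    return n;
  }

  int num_reads() const { return num_reads_; }
  std::size_t bytes_read() const { return bytes_read_; }

 private:
  std::string data_;
  int num_reads_ = 0;
  std::size_t bytes_read_ = 0;
};

// Byte range of one serialized column chunk inside the row group (assumed layout).
struct ColumnChunkRange {
  std::size_t offset;
  std::size_t length;
};

// Hypothetical reader option: off by default, so callers must opt in.
struct ReaderOptions {
  bool pre_buffer_row_group = false;
};

// Returns the raw bytes of each requested column chunk.
std::vector<std::string> ReadColumns(CountingFile& file,
                                     const std::vector<ColumnChunkRange>& chunks,
                                     const ReaderOptions& options) {
  std::vector<std::string> columns;
  if (chunks.empty()) return columns;

  if (options.pre_buffer_row_group) {
    // One large read spanning every requested chunk, then slice in memory.
    std::size_t begin = chunks.front().offset;
    std::size_t end = begin;
    for (const auto& c : chunks) {
      begin = std::min(begin, c.offset);
      end = std::max(end, c.offset + c.length);
    }
    std::string buffer(end - begin, '\0');
    const std::size_t got = file.ReadAt(begin, buffer.size(), &buffer[0]);
    assert(got <= buffer.size());
    buffer.resize(got);
    for (const auto& c : chunks) {
      const std::size_t pos = std::min(c.offset - begin, buffer.size());
      columns.push_back(buffer.substr(pos, c.length));
    }
  } else {
    // One read per column chunk: fewer bytes, but more round trips.
    for (const auto& c : chunks) {
      std::string column(c.length, '\0');
      const std::size_t got = file.ReadAt(c.offset, c.length, &column[0]);
      column.resize(got);
      columns.push_back(std::move(column));
    }
  }
  return columns;
}
```

With pre-buffering enabled, the sketch issues a single read no matter how many columns are requested, which is what helps on a store where each request is expensive. The cost shows up when only a few widely separated columns are needed: the single read then also transfers all the bytes in between, which is why the flag defaults to off here.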

cc [~fsaintjacques] [~bkietz] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)