Posted to dev@parquet.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/08/07 17:55:00 UTC

[jira] [Commented] (PARQUET-1634) [C++] Factor out data/dictionary page writes to allow for page buffering

    [ https://issues.apache.org/jira/browse/PARQUET-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902357#comment-16902357 ] 

Wes McKinney commented on PARQUET-1634:
---------------------------------------

To assist with this, it would make sense to first make sure we have a mock "high latency" filesystem for testing and benchmarking.
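
As a rough illustration of what such a mock could look like, here is a minimal sketch. It assumes a hypothetical `Sink` interface and hypothetical `InMemorySink` and `SlowSink` classes; none of these are the Arrow or Parquet C++ APIs. The wrapper charges a fixed delay per write call, which is the property that makes many small writes expensive on stores like S3.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical minimal sink interface; NOT the Arrow/Parquet OutputStream API.
class Sink {
 public:
  virtual ~Sink() = default;
  virtual void Write(const uint8_t* data, size_t size) = 0;
};

// Keeps everything in memory so a test can inspect what was written.
class InMemorySink : public Sink {
 public:
  void Write(const uint8_t* data, size_t size) override {
    bytes_.insert(bytes_.end(), data, data + size);
  }
  const std::vector<uint8_t>& bytes() const { return bytes_; }

 private:
  std::vector<uint8_t> bytes_;
};

// Wraps another sink and sleeps for a fixed duration on every Write call,
// imitating a store where each request has a high fixed cost regardless of
// payload size. Also counts calls so a benchmark can report them.
class SlowSink : public Sink {
 public:
  SlowSink(Sink* wrapped, std::chrono::milliseconds latency)
      : wrapped_(wrapped), latency_(latency) {}

  void Write(const uint8_t* data, size_t size) override {
    std::this_thread::sleep_for(latency_);
    ++num_writes_;
    wrapped_->Write(data, size);
  }

  int num_writes() const { return num_writes_; }

 private:
  Sink* wrapped_;  // not owned
  std::chrono::milliseconds latency_;
  int num_writes_ = 0;
};
```

With a wrapper like this, total time spent grows with the number of write calls rather than the number of bytes, so a benchmark can compare eager and buffered page writes directly.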

> [C++] Factor out data/dictionary page writes to allow for page buffering 
> -------------------------------------------------------------------------
>
>                 Key: PARQUET-1634
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1634
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: cpp-1.6.0
>
>
> Logic that eagerly writes out data pages is hard-coded into the column writer implementation:
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L565
> For higher-latency file systems like Amazon S3, it makes more sense to buffer pages in memory and write them in larger batches (preferably asynchronously). We should refactor this logic so that the behavior is configurable rather than hard-coded.
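
One way to picture the choice described in the issue is the sketch below. It is only an illustration under stated assumptions: `CountingSink`, `BufferedPageWriter`, and the `buffer_threshold` parameter are hypothetical names, not part of parquet-cpp, and the sketch is synchronous (it does not cover the asynchronous variant). A threshold of 0 keeps the eager behavior; any other value accumulates pages and writes them in one larger batch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sink that only counts write calls and bytes written.
struct CountingSink {
  int num_writes = 0;
  size_t num_bytes = 0;

  void Write(const uint8_t* data, size_t size) {
    (void)data;  // payload is ignored in this sketch
    ++num_writes;
    num_bytes += size;
  }
};

// Hypothetical page writer with a selectable policy:
//   buffer_threshold == 0  -> each page is written to the sink immediately
//   buffer_threshold  > 0  -> pages accumulate in memory and are written as
//                             one batch once the buffer reaches the threshold
class BufferedPageWriter {
 public:
  BufferedPageWriter(CountingSink* sink, size_t buffer_threshold)
      : sink_(sink), buffer_threshold_(buffer_threshold) {}

  void WritePage(const std::vector<uint8_t>& page) {
    if (buffer_threshold_ == 0) {
      sink_->Write(page.data(), page.size());
      return;
    }
    buffer_.insert(buffer_.end(), page.begin(), page.end());
    if (buffer_.size() >= buffer_threshold_) Flush();
  }

  // Writes any buffered pages; must be called before the writer is discarded.
  void Flush() {
    if (buffer_.empty()) return;
    sink_->Write(buffer_.data(), buffer_.size());
    buffer_.clear();
  }

 private:
  CountingSink* sink_;  // not owned
  size_t buffer_threshold_;
  std::vector<uint8_t> buffer_;
};
```

For example, four 100-byte pages written eagerly produce four sink writes, while the same pages with a 250-byte threshold produce two: one batch of 300 bytes when the threshold is crossed and one of 100 bytes on the final `Flush()`.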



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)