Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/02/01 22:25:00 UTC
[jira] [Updated] (ARROW-11465) Parquet file writer snapshot API and proper ColumnChunk.file_path utilization
[ https://issues.apache.org/jira/browse/ARROW-11465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-11465:
-----------------------------------
Labels: pull-request-available (was: )
> Parquet file writer snapshot API and proper ColumnChunk.file_path utilization
> -----------------------------------------------------------------------------
>
> Key: ARROW-11465
> URL: https://issues.apache.org/jira/browse/ARROW-11465
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 3.0.0
> Reporter: Radu Teodorescu
> Assignee: Radu Teodorescu
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This is a follow up to the thread:
> [https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3cCDD00783-0FFC-4934-AA24-529FB2A44D88@yahoo.com%3e]
> The specific use case I am targeting is having the ability to partially read a parquet file while it's still being written to.
> This is relevant for any process that records events over a long period of time and writes them to Parquet (tracing data, logging events, or any other live time series).
> The solution relies on the fact that the Parquet specification allows column chunk metadata to point explicitly to the chunk's location in a file, which can in principle be a different file from the one containing the metadata (as covered in other threads, this behavior is not fully supported by the major Parquet implementations).
> My solution centers on adding a method,
>
> {{void ParquetFileWriter::Snapshot(const std::string& data_path,
> std::shared_ptr<::arrow::io::OutputStream>& sink)}}
>
> that writes the metadata for the row groups written so far to the {{sink}} stream and updates every ColumnChunk's {{file_path}} to point to {{data_path}}. This was intended as a minimal change to {{ParquetFileWriter}}.
> On the reading side, I implemented full support for ColumnChunk.file_path by introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}} in the {{ParquetFileReader}} implementation stack. In the PR, one can default to the current behavior by using the {{SingleFile}} class; get full read support for multi-file Parquet, in line with the specification, by using the {{MultiReadableFile}} implementation (which captures the metadata file's base directory and resolves ColumnChunk.file_path against it); or provide a separate implementation for non-POSIX file system storage.
> For an example, see the {{write_parquet_file_with_snapshot}} function in reader-writer.cc, which illustrates the snapshotting write; the {{read_whole_file}} function has been modified to read one of the snapshots (I will roll back that change and provide a separate example before the merge).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)