Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/02/01 22:25:00 UTC
[jira] [Updated] (ARROW-11465) Parquet file writer snapshot API and proper ColumnChunk.file_path utilization
[ https://issues.apache.org/jira/browse/ARROW-11465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-11465:
-----------------------------------
Labels: pull-request-available (was: )
> Parquet file writer snapshot API and proper ColumnChunk.file_path utilization
> -----------------------------------------------------------------------------
>
> Key: ARROW-11465
> URL: https://issues.apache.org/jira/browse/ARROW-11465
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 3.0.0
> Reporter: Radu Teodorescu
> Assignee: Radu Teodorescu
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This is a follow up to the thread:
> [https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3cCDD00783-0FFC-4934-AA24-529FB2A44D88@yahoo.com%3e]
> The specific use case I am targeting is having the ability to partially read a parquet file while it's still being written to.
> This is relevant for any process that records events over a long period of time and writes them to Parquet (tracing data, logging events, or any other live time series).
> The solution relies on the fact that the Parquet specification allows column chunk metadata to point explicitly to the chunk's location in a file, which can in principle be a different file from the one containing the metadata (as covered in other threads, this behavior is not fully supported by the major Parquet implementations).
> My solution centers on adding a method,
>
> {{void ParquetFileWriter::Snapshot(const std::string& data_path,
> std::shared_ptr<::arrow::io::OutputStream>& sink)}}
>
> that writes the metadata for the row groups written so far to the {{sink}} stream and updates every ColumnChunk's {{file_path}} to point to {{data_path}}. This was intended as a minimal change to {{ParquetFileWriter}}.
> On the reading side, I implemented full support for ColumnChunk.file_path by introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}} in the {{ParquetFileReader}} implementation stack. In the PR, one can default to the current behavior by using the {{SingleFile}} class; get full read support for multi-file Parquet, in line with the specification, by using the {{MultiReadableFile}} implementation (which captures the metadata file's base directory and resolves ColumnChunk.file_path against it); or provide a separate implementation for non-POSIX file system storage.
> For an example, see the {{write_parquet_file_with_snapshot}} function in reader-writer.cc, which illustrates the snapshotting write; the {{read_whole_file}} function has been modified to read one of the snapshots (I will roll back that change and provide a separate example before the merge).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)