Posted to jira@arrow.apache.org by "Radu Teodorescu (Jira)" <ji...@apache.org> on 2021/02/01 22:23:00 UTC

[jira] [Created] (ARROW-11465) Parquet file writer snapshot API and proper ColumnChunk.file_path utilization

Radu Teodorescu created ARROW-11465:
---------------------------------------

             Summary: Parquet file writer snapshot API and proper ColumnChunk.file_path utilization
                 Key: ARROW-11465
                 URL: https://issues.apache.org/jira/browse/ARROW-11465
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
    Affects Versions: 3.0.0
            Reporter: Radu Teodorescu
            Assignee: Radu Teodorescu
             Fix For: 4.0.0


This is a follow-up to the thread:
[https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3cCDD00783-0FFC-4934-AA24-529FB2A44D88@yahoo.com%3e]

The specific use case I am targeting is the ability to partially read a Parquet file while it is still being written to.
This is relevant for any process that records events over a long period of time and writes them to Parquet (tracing data, logging events, or any other live time series).
The solution relies on the fact that the Parquet specification allows column chunk metadata to point explicitly to the file containing the chunk's data, which can in theory be a different file from the one containing the metadata (as covered in other threads, this behavior is not fully supported by the major Parquet implementations).
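For context, the hook is already visible in the existing C++ metadata API: {{ColumnChunkMetaData::file_path()}} exposes the per-chunk location. A minimal sketch of inspecting it (the file name is illustrative):

{code:cpp}
// Minimal sketch: inspect which file each column chunk points at, using the
// existing C++ metadata API. "events.snapshot" is an illustrative file name.
#include <iostream>
#include <memory>

#include "arrow/io/file.h"
#include "parquet/exception.h"
#include "parquet/file_reader.h"
#include "parquet/metadata.h"

void InspectChunkPaths() {
  std::shared_ptr<arrow::io::RandomAccessFile> source;
  PARQUET_ASSIGN_OR_THROW(source,
                          arrow::io::ReadableFile::Open("events.snapshot"));
  std::shared_ptr<parquet::FileMetaData> metadata = parquet::ReadMetaData(source);
  for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
    std::unique_ptr<parquet::RowGroupMetaData> row_group = metadata->RowGroup(rg);
    for (int c = 0; c < row_group->num_columns(); ++c) {
      // An empty file_path means the chunk lives in the same file as the metadata.
      std::cout << "row group " << rg << ", column " << c << " -> '"
                << row_group->ColumnChunk(c)->file_path() << "'" << std::endl;
    }
  }
}
{code}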
My solution is centered around adding a method,

{{void ParquetFileWriter::Snapshot(const std::string& data_path,
                                   std::shared_ptr<::arrow::io::OutputStream>& sink)}}

that writes the metadata for the RowGroups written so far to the {{sink}} stream and updates each ColumnChunk's {{file_path}} to point to {{data_path}}. This was intended as a minimalist change to {{ParquetFileWriter}}.
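To illustrate the intended usage, here is a sketch of a long-running writer that publishes periodic snapshots; the file names, batching scheme, and schema argument are illustrative, with {{Snapshot}} being the method proposed above:

{code:cpp}
// Sketch of a long-running writer publishing periodic snapshots. The file
// names and batching are illustrative; Snapshot() is the method proposed
// in this ticket.
#include <memory>

#include "arrow/io/file.h"
#include "parquet/exception.h"
#include "parquet/file_writer.h"

void WriteWithSnapshots(std::shared_ptr<parquet::schema::GroupNode> schema) {
  std::shared_ptr<arrow::io::OutputStream> data_sink;
  PARQUET_ASSIGN_OR_THROW(data_sink,
                          arrow::io::FileOutputStream::Open("events.parquet"));
  std::unique_ptr<parquet::ParquetFileWriter> writer =
      parquet::ParquetFileWriter::Open(data_sink, schema);

  for (int batch = 0; batch < 100; ++batch) {
    parquet::RowGroupWriter* rg_writer = writer->AppendRowGroup();
    // ... append this batch's columns via rg_writer->NextColumn() ...
    (void)rg_writer;

    if (batch % 10 == 9) {
      // Write the metadata accumulated so far to a side file; every
      // ColumnChunk.file_path in it points back at "events.parquet".
      std::shared_ptr<arrow::io::OutputStream> meta_sink;
      PARQUET_ASSIGN_OR_THROW(
          meta_sink, arrow::io::FileOutputStream::Open("events.snapshot"));
      writer->Snapshot("events.parquet", meta_sink);
    }
  }
  writer->Close();
}
{code}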

On the reading side I implemented full support for ColumnChunk.file_path by introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}} in the {{ParquetFileReader}} implementation stack. In the PR, one can default to the current behavior by using the {{SingleFile}} class; get full read support for multi-file Parquet, in line with the specification, by using the {{MultiReadableFile}} implementation (which captures the metadata file's base directory and resolves each ColumnChunk.file_path against it); or provide a separate implementation for non-POSIX file system storage.
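For the reading side, a hedged sketch of the intended usage follows; the {{MultiReadableFile}} constructor and the {{ParquetFileReader::Open}} overload shown are assumptions about the PR's API rather than the final shape:

{code:cpp}
// Hedged sketch of the reading side. MultiReadableFile and the Open()
// overload that accepts it come from the PR; the constructor signature
// shown here is an assumption, not the final API.
#include <memory>

#include "parquet/file_reader.h"

void ReadSnapshot() {
  // Hypothetical: MultiReadableFile captures the snapshot file's base
  // directory and resolves each ColumnChunk.file_path against it, so the
  // chunks below are read from /data/events.parquet.
  auto source =
      std::make_shared<parquet::MultiReadableFile>("/data/events.snapshot");
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::Open(source);
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  // ... read row groups through reader->RowGroup(i) as usual ...
}
{code}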

For an example, see the {{write_parquet_file_with_snapshot}} function in reader-writer.cc, which illustrates the snapshotting write, while the {{read_whole_file}} function has been modified to read one of the snapshots (I will roll back that change and provide a separate example before the merge).


