Posted to jira@arrow.apache.org by "Apache Arrow JIRA Bot (Jira)" <ji...@apache.org> on 2022/10/13 17:52:00 UTC

[jira] [Commented] (ARROW-11465) [C++] Parquet file writer snapshot API and proper ColumnChunk.file_path utilization

    [ https://issues.apache.org/jira/browse/ARROW-11465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617232#comment-17617232 ] 

Apache Arrow JIRA Bot commented on ARROW-11465:
-----------------------------------------------

This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

> [C++] Parquet file writer snapshot API and proper ColumnChunk.file_path utilization
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-11465
>                 URL: https://issues.apache.org/jira/browse/ARROW-11465
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 3.0.0
>            Reporter: Radu Teodorescu
>            Assignee: Radu Teodorescu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to the thread:
> [https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3cCDD00783-0FFC-4934-AA24-529FB2A44D88@yahoo.com%3e]
> The specific use case I am targeting is having the ability to partially read a Parquet file while it is still being written to.
> This is relevant for any process that records events over a long period of time and writes them to Parquet (tracing data, logging events, or any other live time series).
> The solution relies on the fact that the Parquet specification allows column chunk metadata to point explicitly to its location in a file that can, in principle, be different from the file containing the metadata (as covered in other threads, this behavior is not fully supported by major Parquet implementations).
> My solution is centered on adding a method,
>
> {{void ParquetFileWriter::Snapshot(const std::string& data_path,}}
> {{                                 std::shared_ptr<::arrow::io::OutputStream>& sink)}}
> that writes the metadata for the RowGroups written so far to the {{sink}} stream and updates each ColumnChunk's {{file_path}} to point to {{data_path}}. This was intended as a minimalist change to {{ParquetFileWriter}}. (A write-side usage sketch follows this quoted description.)
> On the reading side I implemented full support for ColumnChunk.file_path by introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}} in the {{ParquetFileReader}} implementation stack. With the PR, one can keep the current behavior by using the {{SingleFile}} class; get full read support for multi-file Parquet, in line with the specification, via the {{MultiReadableFile}} implementation (which captures the metadata file's base directory and resolves each ColumnChunk.file_path against it); or provide a separate implementation for non-POSIX file system storage. (A read-side sketch also follows this quoted description.)
> For an example, see the {{write_parquet_file_with_snapshot}} function in reader-writer.cc, which illustrates the snapshotting write; the {{read_whole_file}} function has been modified to read one of the snapshots (I will roll back that change and provide a separate example before the merge).
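
Below is a minimal write-side sketch in C++ of the flow described in the quoted proposal. The Snapshot() call and its signature come from the PR and are hypothetical until it is merged; the schema, file names, and batch loop are illustrative. Everything else is the existing low-level parquet-cpp writer API.

    // Append row groups to a long-lived data file and, after each batch,
    // emit a standalone metadata snapshot that readers can open safely.
    #include <arrow/io/file.h>
    #include <parquet/api/writer.h>

    #include <memory>
    #include <string>

    int main() {
      using parquet::schema::GroupNode;
      using parquet::schema::PrimitiveNode;

      // Illustrative schema: a single required INT64 "ts" column.
      parquet::schema::NodeVector fields;
      fields.push_back(PrimitiveNode::Make("ts", parquet::Repetition::REQUIRED,
                                           parquet::Type::INT64));
      auto schema = std::static_pointer_cast<GroupNode>(
          GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

      // The data file that keeps growing as events arrive.
      std::shared_ptr<arrow::io::FileOutputStream> data_sink =
          arrow::io::FileOutputStream::Open("events.parquet").ValueOrDie();
      std::unique_ptr<parquet::ParquetFileWriter> writer =
          parquet::ParquetFileWriter::Open(data_sink, schema);

      for (int64_t batch = 0; batch < 10; ++batch) {
        parquet::RowGroupWriter* rg = writer->AppendRowGroup();
        auto* ts = static_cast<parquet::Int64Writer*>(rg->NextColumn());
        ts->WriteBatch(1, nullptr, nullptr, &batch);
        rg->Close();

        // Proposed API (hypothetical): write a footer covering all row groups
        // so far, with every ColumnChunk.file_path set to "events.parquet".
        std::shared_ptr<arrow::io::OutputStream> meta_sink =
            arrow::io::FileOutputStream::Open(
                "snapshot_" + std::to_string(batch) + ".parquet").ValueOrDie();
        writer->Snapshot("events.parquet", meta_sink);
      }
      writer->Close();
      return 0;
    }

The appeal of this design is that each snapshot_N.parquet is a small, complete Parquet footer, so a concurrent reader never has to parse the still-open data file.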

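A corresponding read-side sketch. MultiReadableFile is the class named in the PR; the constructor and the ParquetFileReader::Open overload shown here are assumptions based on its description, not released parquet-cpp API. Only the metadata calls at the end exist today.

    // Open a metadata snapshot; row group data is resolved into the data
    // file via each ColumnChunk.file_path, relative to the snapshot's
    // directory (per the PR's MultiReadableFile description).
    #include <iostream>
    #include <memory>
    #include <parquet/api/reader.h>

    int main() {
      // Hypothetical: MultiReadableFile captures the base directory of
      // "snapshot_3.parquet" and follows file_path references from there.
      auto source =
          std::make_shared<parquet::MultiReadableFile>("snapshot_3.parquet");
      std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::Open(source);

      // Only row groups present when the snapshot was taken are visible.
      std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
      std::cout << "row groups in snapshot: " << md->num_row_groups()
                << std::endl;
      return 0;
    }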


--
This message was sent by Atlassian Jira
(v8.20.10#820010)