Posted to dev@parquet.apache.org by "David Lee (JIRA)" <ji...@apache.org> on 2018/05/01 14:56:00 UTC

[jira] [Updated] (PARQUET-1289) Spec for Updateable Parquet

     [ https://issues.apache.org/jira/browse/PARQUET-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Lee updated PARQUET-1289:
-------------------------------
    Description: 
Parquet today is a read-only columnar format, but could we also make it updateable using Apache Arrow's row-filtering methods?

Here's how it would work:

A. Add an insert timestamp to every record in a parquet file.
B. Add a list of modifiable row offsets to the parquet file's footer for records which have been logically deleted. Each offset should also carry a delete timestamp so we can reproduce a snapshot of what the data looked like at any point in time.
C. If a parquet record is ever updated, the new version would be written as a new record, and the old record would be logically deleted by adding its row offset to its parquet file's footer. We would need a service that does this (see the write-side sketch after this list).
D. When reading parquet files, logically deleted rows would be excluded.
E. Alternatively, when reading parquet files as of a snapshot time, any row with an insert timestamp > snapshot time would be excluded, and any row which has been logically flagged for deletion would still be included if its delete timestamp > snapshot time, i.e. it was deleted only after the snapshot was taken (see the read-side sketch after this list).

This way we do not have to reorganize the columnar data in existing parquet files. We just have to modify the metadata footer.



> Spec for Updateable Parquet
> ---------------------------
>
>                 Key: PARQUET-1289
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1289
>             Project: Parquet
>          Issue Type: Wish
>          Components: parquet-format
>            Reporter: David Lee
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)