Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/02/01 23:45:09 UTC

[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features

raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827


   @gszadovszky and @emkornfield, by coincidence I was just looking into cleaning up apache/arrow#8130 when I noticed this thread.
   External column chunk support is one of the key features that attracted me to Parquet in the first place, and I would like the chance to lobby for keeping it and actually expanding its adoption. I already have the complete PR mentioned above, and I can help with supporting it across other implementations.
   There are a few major domains where I see this as a valuable component:
   1. Allowing concurrent reads of fully flushed row groups while a Parquet file is still being appended to. A slight variant of this is allowing subsequent row group appends to a Parquet file without impacting potential readers.
   2. Being able to aggregate multiple data sets in a master Parquet file. One scenario is cumulative recordings, such as stock prices that are collected daily and need to be presented as one unified historical file; another is enrichment, where we want to add new columns to an existing data set.
   3. Allowing for bi-temporal changes to a Parquet file. External column chunks allow one to apply small corrections by creating delta files and new footers that swap out the chunks requiring changes and point to the new ones.
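
To make use case 3 concrete, here is a minimal Python sketch of the footer-swap idea. It is only an illustration under stated assumptions: the classes `ColumnChunkRef`, `RowGroup`, and `Footer`, the helper `swap_chunk`, and the file name `delta-0001.parquet` are all hypothetical and are not the API of the Arrow PR or of any Parquet library. The only detail taken from the format itself is that a column chunk entry in the footer carries an optional `file_path` alongside its `file_offset`.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ColumnChunkRef:
    """Hypothetical stand-in for a footer entry describing one column chunk."""
    column: str
    file_offset: int
    file_path: Optional[str] = None  # None means "same file as the footer"


@dataclass(frozen=True)
class RowGroup:
    chunks: Tuple[ColumnChunkRef, ...]


@dataclass(frozen=True)
class Footer:
    row_groups: Tuple[RowGroup, ...]


def swap_chunk(footer: Footer, row_group_index: int, column: str,
               new_path: str, new_offset: int) -> Footer:
    """Return a new footer in which one chunk points into a delta file.

    The original footer and the original data are left untouched.
    """
    groups = list(footer.row_groups)
    target = groups[row_group_index]
    if not any(c.column == column for c in target.chunks):
        raise KeyError(column)
    new_chunks = tuple(
        replace(c, file_path=new_path, file_offset=new_offset)
        if c.column == column else c
        for c in target.chunks
    )
    groups[row_group_index] = RowGroup(new_chunks)
    return Footer(tuple(groups))


# Two row groups, all chunks stored in the same file as the footer.
base = Footer((
    RowGroup((ColumnChunkRef("price", 4), ColumnChunkRef("volume", 1024))),
    RowGroup((ColumnChunkRef("price", 2048), ColumnChunkRef("volume", 3072))),
))

# Correct the "price" chunk of the second row group via a delta file.
corrected = swap_chunk(base, 1, "price", "delta-0001.parquet", 4)
```

Readers holding the old footer keep seeing the original data, while readers of the new footer see the correction. Keeping both footers around is what gives the bi-temporal view in this sketch.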
   
   If the above use cases are addressed by other Parquet overlays, or they don't line up with the intended usage of Parquet, I can look elsewhere, but this seems like a huge opportunity, and the development cost of supporting it is quite minor by comparison.

