Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/05/18 13:43:02 UTC

[GitHub] [arrow] westonpace commented on issue #35638: Process parquet rowgroups without Arrow conversion

westonpace commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1553077760

   > My use case is writing data to a small parquet file daily, changing the last 3 days. I don’t have exact numbers to support this extra API yet, but wanted to ask first.
   > 
   > I can imagine this is not a common case to keep/drop row groups based on the stats or append new row groups - feel free to close the issue, please
   
   I would say that this is a very common thing for users to want to do.  However, parquet is often not the correct layer of abstraction at which to introduce this capability.  For example, table formats such as Iceberg, Delta Lake, and Hudi have all come up with ways to handle this.
   
   Appending data to parquet files or row groups has been asked for several times.  I've seen arguments that it is simply not possible without rewriting the file, because the thrift-encoded metadata uses a lot of absolute file offsets, and those offsets, even for the portions of the file you are not changing, would become invalid.  However, I have not investigated it thoroughly enough myself.
   
   > Speaking of this... is it a good practice to use row groups instead of hive partitions or is that considered an anti-pattern when speaking of parquet?
   
   There are pros and cons to both.  Row groups can be more flexible than hive partitions: each row group contains statistics for ALL columns, not just the partitioning columns, and row group filters can include things like bloom filters.  However, hive partitions support append operations: you can always add more files to the month=July folder, but you can't add more data to an existing row group.

