
Posted to issues@arrow.apache.org by "alippai (via GitHub)" <gi...@apache.org> on 2023/05/17 15:51:43 UTC

[GitHub] [arrow] alippai opened a new issue, #35638: Process parquet rowgroups without Arrow conversion

alippai opened a new issue, #35638:
URL: https://github.com/apache/arrow/issues/35638

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'd like to read a Parquet file and write a new Parquet file that contains the old file's row groups plus an Arrow table appended as a new row group.
   Can I read the Parquet file row group by row group, decide whether to keep or drop each one, and assemble a new Parquet file without doing the (de)serialization to Arrow?
   
   ### Component(s)
   
   C++, Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alippai commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1551961791

   Yes, something like that. My use case is writing data to a small Parquet file daily, changing the last 3 days. I don’t have exact numbers to support this extra API yet, but wanted to ask first.
   
   I can imagine that keeping/dropping row groups based on the stats, or appending new row groups, is not a common case. Feel free to close the issue, please.



[GitHub] [arrow] westonpace commented on issue #35638: [C++][Parquet] Process parquet rowgroups without Arrow conversion

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1634651200

   > Thanks. @westonpace Any guidance/pointers from someone wanting to take this forward? Does that make sense to add to Arrow?
   
   I am not familiar enough with the code in parquet-c++ to be able to give much advice going forwards (@wgtmac and @mapleFU may have an opinion).  I think it makes sense as a parquet-c++ feature but probably not as an arrow feature (as you wouldn't need any arrow arrays)



[GitHub] [arrow] westonpace commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1553077760

   > My use case is writing data to a small Parquet file daily, changing the last 3 days. I don’t have exact numbers to support this extra API yet, but wanted to ask first.
   > 
   > I can imagine that keeping/dropping row groups based on the stats, or appending new row groups, is not a common case. Feel free to close the issue, please.
   
   I would say that it is a very common thing for users to want to do.  However, parquet is often not the correct layer of abstraction to introduce this capability.  For example, the table formats like Iceberg, Delta Lake, and Hudi have all come up with ways to handle this.
   
   Appending data to parquet groups has been asked for several times.  I've seen arguments that it is simply not possible without rewriting the file (because the thrift-encoded metadata uses a lot of absolute file offsets, and those offsets, even for the portions of the file you are not changing, would become invalid), but I have not investigated it thoroughly enough myself.
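
The offset problem can be illustrated with a toy model in plain Python. This is not a Parquet reader: `ChunkMeta` and `copy_chunk` are hypothetical names that mimic only the `data_page_offset`, `dictionary_page_offset`, and `total_compressed_size` fields of the column chunk metadata in parquet.thrift. The sketch shows that moving a chunk's bytes to a different position requires rewriting its offsets, even though the bytes themselves are untouched.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ChunkMeta:
    """Toy stand-in for the offset fields of a Parquet column chunk."""

    data_page_offset: int                  # absolute position in the file
    dictionary_page_offset: Optional[int]  # absolute position, if present
    total_compressed_size: int             # bytes occupied by the whole chunk

    @property
    def start(self) -> int:
        # In this model a chunk starts at its dictionary page when it has one.
        if self.dictionary_page_offset is not None:
            return self.dictionary_page_offset
        return self.data_page_offset


def copy_chunk(src: bytes, meta: ChunkMeta, dst: bytearray) -> ChunkMeta:
    """Append the chunk's raw bytes to dst; return metadata that is valid for dst."""
    delta = len(dst) - meta.start  # how far the bytes move
    dst.extend(src[meta.start:meta.start + meta.total_compressed_size])
    return replace(
        meta,
        data_page_offset=meta.data_page_offset + delta,
        dictionary_page_offset=(
            None
            if meta.dictionary_page_offset is None
            else meta.dictionary_page_offset + delta
        ),
    )
```

The old metadata cannot be reused for the destination file: every absolute offset has to be shifted by the distance the bytes moved, which is why the footer has to be regenerated.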
   
   > Speaking of this... is it good practice to use row groups instead of hive partitions, or is that considered an anti-pattern in Parquet?
   
   There are pros and cons to both.  Row groups can be more flexible than hive partitions (e.g. each row group contains statistics for ALL columns and not just some and row group filters can include things like bloom filters).  However, hive partitions support append operations (you can always add more files to the month=July folder but you can't add more data to an existing row group).



[GitHub] [arrow] alippai commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1553216387

   ```
   Rowgroup 1:
     ...
     - date: 20230517, value: x
     - date: 20230517, value: x
     - date: 20230517, value: x
     - date: 20230517, value: x
     - date: 20230517, value: x
     - date: 20230517, value: x
   Rowgroup 2:
     - date: 20230518, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
     ...
   ```
   Instead of the current:
   ```
   Rowgroup 1:
     ...
     - date: 20230517, value: x
     - date: 20230517, value: x
     - date: 20230517, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
   Rowgroup 2:
     - date: 20230518, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
     - date: 20230518, value: x
     ...
   ```
   So the question is whether using row groups as partition boundaries makes sense, or whether this is not really useful and row-count-based row grouping is more than OK.
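
To make the difference between the two layouts concrete, here is a toy model in plain Python (no Parquet library involved) of min/max based row group pruning. The helper names are hypothetical, and a real reader would use the statistics stored in the file footer instead of recomputing them from the values.

```python
from typing import List, Tuple


def min_max_stats(row_groups: List[List[int]]) -> List[Tuple[int, int]]:
    """Compute the (min, max) statistics a writer would store per row group."""
    return [(min(rg), max(rg)) for rg in row_groups]


def row_groups_to_scan(stats: List[Tuple[int, int]], value: int) -> List[int]:
    """Return the indices of row groups whose [min, max] range may contain value."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= value <= hi]


# Layout 1: row group boundaries coincide with the date boundaries.
aligned = [[20230517] * 6, [20230518] * 6]
# Layout 2: row groups are cut by row count, so one group mixes two dates.
mixed = [[20230517] * 3 + [20230518] * 3, [20230518] * 6]

scan_aligned = row_groups_to_scan(min_max_stats(aligned), 20230518)
scan_mixed = row_groups_to_scan(min_max_stats(mixed), 20230518)
```

With aligned boundaries a filter on `date == 20230518` only needs the second row group, while the row-count-based layout forces the reader to scan both.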



[GitHub] [arrow] alippai commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1553346253

   @westonpace reading the [parquet thrift doc](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift) the naive approach would be keeping the buffers and statistics only, recreating everything else. I didn't know parquet works like this, thanks for the insight!
   
   My goal is slightly different from Delta Lake and others (and I'm also not a fan of JVM-based setups for this kind of workload). My idea was to rely less on the traditional FS and more on the internal structure of Parquet, for the very reason you've mentioned (filters, statistics). Architecturally, Skyhook would be closer to this, or "simply" storing all the metadata + statistics in TiKV or another KV store.



[GitHub] [arrow] mapleFU commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1551795301

   It seems that you want an "append" operation, and want to avoid read -> convert to Arrow -> write back?
   
   I guess the current Parquet code cannot support this :-(



[GitHub] [arrow] wgtmac commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1586467363

   > @wgtmac In this issue I was looking for a simpler function: appending a new RowGroup (or copying a RowGroup) without merging, or deleting/replacing row groups without materializing the whole file as an Arrow Table.
   > 
   > Overall I think a public RowGroup level (what you have in parquet-mr) and page level API (what @tustvold created for rust) makes sense (without decoding, decompression, statistics and bloom filter re-calculation etc).
   
   Yes, I understand your use case. Appending to or modifying a parquet file would require the file system to support mutation or append operations, which is not a typical use case. So merging several parquet files directly at the row group level seems more generic and can be an alternative solution in your case.



[GitHub] [arrow] alippai commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1572067102

   If I’m right, @tustvold created similar low-level interfaces. I'm still looking for the exact MR, but maybe he can share what level of abstraction worked well in the Rust implementation.



[GitHub] [arrow] wgtmac commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1585589807

   I did similar work in the parquet-mr repo to merge row groups of different parquet files without decompression and decoding into a single parquet file (with some supported transformation like re-compression, encryption or dropping columns).   
   
   https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java
   
   Is that what you would like to have in parquet-cpp? @alippai



[GitHub] [arrow] alippai commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1585983610

   @wgtmac In this issue I was looking for a simpler function: appending a new RowGroup (or copying a RowGroup) without merging, or deleting/replacing row groups without materializing the whole file as an Arrow Table.
   
   Overall I think a public RowGroup level (what you have in parquet-mr) and page level API (what @tustvold created for rust) makes sense (without decoding, decompression, statistics and bloom filter re-calculation etc). 



[GitHub] [arrow] westonpace commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1553078061

   > Would it be a good addition to pyarrow dataset to optionally ensure that each Parquet row group contains only one partition?
   
   I'm not sure I understand what you are suggesting.



[GitHub] [arrow] tustvold commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1572074629

   https://github.com/apache/arrow-rs/pull/4269 is the PR. Not sure how transferable it is, but the basic idea is to allow appending an entire column chunk to a row group 



[GitHub] [arrow] vinothchandar commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "vinothchandar (via GitHub)" <gi...@apache.org>.

vinothchandar commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1633523076

   @wgtmac For the rewriting, is there any advantage to using Arrow over parquet-mr? IIUC, you decode the pages there lazily and write back (with or without modifications). Maybe for vectorized transformations of an entire page, e.g. x = x + 1 on column x?



[GitHub] [arrow] wgtmac commented on issue #35638: [C++][Parquet] Process parquet rowgroups without Arrow conversion

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1634340503

   > Thanks. @westonpace Any guidance/pointers from someone wanting to take this forward? Does that make sense to add to Arrow?
   
   Just curious: is there any plan to add similar optimization to Apache Hudi? Our old friends at Uber have done a great job: https://www.uber.com/en-HK/blog/fast-copy-on-write-within-apache-parquet/. @vinothchandar 



[GitHub] [arrow] westonpace commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1584966456

   +1 to adding this table somewhere (also, yes, big thanks to @mapleFU and @wgtmac for the recent work).  A good first pass would be for each implementation to document what they support locally (e.g. arrow-c++ to add to https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features and arrow-rs to add to somewhere in https://docs.rs/parquet/latest/parquet/arrow/index.html)
   
   If we are going to combine them in a table somewhere then maybe we could add to somewhere on https://parquet.apache.org/docs/overview/
   
   That would allow other parquet implementations to contribute their feature list if they chose and might be more appropriate than https://arrow.apache.org/docs/status.html
   
   Although I have no write privileges over there :shrug: so if we want something more local it would probably be ok.



[GitHub] [arrow] wgtmac commented on issue #35638: [C++][Parquet] Process parquet rowgroups without Arrow conversion

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1637119429

   Sure, that sounds interesting! Let's discuss more about that @vinothchandar 



[GitHub] [arrow] alippai commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1552167381

   Speaking of this... is it good practice to use row groups instead of hive partitions, or is that considered an anti-pattern in Parquet? Would it be a good addition to pyarrow dataset to optionally ensure that each Parquet row group contains only one partition?



[GitHub] [arrow] tustvold commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1572256401

   I think adding documentation of the Parquet support within the various Arrow projects makes sense. https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/ might serve as some inspiration for further features beyond the obvious support for encoding X or data type Y.
   
   I'm less sure that we should endeavor to maintain up-to-date feature support information for readers outside the Arrow umbrella, e.g. parquet-mr, duckdb, arrow2, etc.



[GitHub] [arrow] alippai commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1572268750

   Indeed, I didn't realize that's not covered by the current docs. I also favor less work and more consistency.



[GitHub] [arrow] mapleFU commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.

mapleFU commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1554716446

   @alippai I guess it "can" be a better solution, because splitting partitions into different row groups lets the reader prune unnecessary row groups. But I don't know whether the current implementation supports it.



[GitHub] [arrow] wgtmac commented on issue #35638: [C++][Parquet] Process parquet rowgroups without Arrow conversion

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.

wgtmac commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1633601869

   > @wgtmac For the rewriting, is there any advantage to using Arrow over parquet-mr? IIUC, you decode the pages there lazily and write back (with or without modifications). Maybe for vectorized transformations of an entire page, e.g. x = x + 1 on column x?
   
   I don't think there is a significant difference between Arrow and parquet-mr if pages do not need any modification. When re-compression and/or re-encoding is applied, it would be more performant to go with Arrow.



[GitHub] [arrow] vinothchandar commented on issue #35638: [C++][Parquet] Process parquet rowgroups without Arrow conversion

Posted by "vinothchandar (via GitHub)" <gi...@apache.org>.

vinothchandar commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1634321274

   Thanks. @westonpace Any guidance/pointers from someone wanting to take this forward? Does that make sense to add to Arrow?



[GitHub] [arrow] vinothchandar commented on issue #35638: [C++][Parquet] Process parquet rowgroups without Arrow conversion

Posted by "vinothchandar (via GitHub)" <gi...@apache.org>.

vinothchandar commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1634968924

   @wgtmac We have an implementation using parquet-mr in the community. I am trying to consolidate all these efforts (ours and parquet-mr's) and to understand the plans in Arrow, as we'd like to embrace Arrow (in place of Avro in Hudi 1.0). We can jam more on Hudi Slack if the parquet-mr piece interests you. cc @yihua
   
   Thanks @westonpace. I'll wait to hear more opinions. 



[GitHub] [arrow] westonpace commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1571960854

   Ok, I think I understand better now.  I misread this request originally and didn't fully realize that you want to create a new parquet file.  I thought you were trying to modify the existing parquet file.
   
   Yes, this makes sense.  No, I'm not sure the capability is really there but some of it might be.
   
   The parquet library always decodes its data, as best I can tell.  There are some underlying structures like the PageReader which might not.  However, there is nothing at the level of "read this row group and append it to another file without decoding".



[GitHub] [arrow] alippai commented on issue #35638: Process parquet rowgroups without Arrow conversion

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1572237367

   @tustvold @mapleFU @westonpace (and many others): the speed at which you are adding more and more Parquet features is amazing. Maybe we should start adding a matrix for arrow, arrow-rs, arrow2 (rs), parquet-mr, duckdb to https://arrow.apache.org/docs/status.html so we know which statistics and bloom filters are read and written, and which operations are available.
   
   Would you be supportive, or is it not the right time now? I can start the MR.

