Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/05/14 09:19:49 UTC

[GitHub] [parquet-mr] advancedxy commented on pull request #902: PARQUET-1633 Fix integer overflow

advancedxy commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-841124415


   > @eadwright,
   > 
   > I'll try to summarize the issue, please correct me if I'm wrong. Parquet-mr is not able to write such big row groups (>2GB) because of the `int` array size limitation. Meanwhile, both the format and some other implementations allow such big row groups. So, parquet-mr shall be prepared for this issue in some way.
   > One option is to "simply" read the large row groups. It would require significant effort to use proper memory-handling objects that would correctly support reading the large row groups. (A similar effort would also make parquet-mr able to write row groups larger than 2GB.)
   > 
   > The other option is to handle too-large row groups with a proper error message in parquet-mr instead of allowing silent overflows; this second option is what this effort covers. It is great to handle the potential int overflows, but the main point, I think, would be at the footer conversion (`ParquetMetadataConverter`) where we create our own object structure from the file footer. At that point we can throw proper error messages if the row group is too large to be handled (for now) in parquet-mr.
   > BTW, checking for potential overflows might not be enough to validate whether a row group of a given size can be read. (See e.g. the source code of [ArrayList](https://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229).)
   > 
   > About the lack of unit tests: I can accept some cases where unit tests are not practically feasible to implement. In those cases I usually ask for the code to be validated offline.
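   The fail-fast validation described above could be sketched as follows. This is a minimal illustration, not actual parquet-mr code; `checkRowGroupSize` and `MAX_ROW_GROUP_BYTES` are hypothetical names, and the limit mirrors the JDK's `ArrayList.MAX_ARRAY_SIZE` (`Integer.MAX_VALUE - 8`) mentioned in the linked source.

```java
// Hedged sketch: validating a row-group size at footer-conversion time.
// checkRowGroupSize and MAX_ROW_GROUP_BYTES are hypothetical names,
// not real parquet-mr identifiers.
public class RowGroupSizeCheck {
    // JDK collections cap backing arrays slightly below Integer.MAX_VALUE
    // (see ArrayList.MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8).
    private static final long MAX_ROW_GROUP_BYTES = Integer.MAX_VALUE - 8;

    static long checkRowGroupSize(long[] columnChunkSizes) {
        long total = 0;
        for (long size : columnChunkSizes) {
            // Math.addExact throws ArithmeticException on long overflow.
            total = Math.addExact(total, size);
            if (total > MAX_ROW_GROUP_BYTES) {
                throw new IllegalArgumentException(
                    "Row group too large to be read by parquet-mr: " + total + " bytes");
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // A row group well under 2GB passes.
        System.out.println(checkRowGroupSize(new long[] {1L << 20, 1L << 20})); // 2097152
        // A row group over 2GB fails fast with a clear message instead of
        // silently overflowing an int somewhere deeper in the read path.
        try {
            checkRowGroupSize(new long[] {1L << 31, 1L << 31});
            System.out.println("no error");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected"); // prints "rejected"
        }
    }
}
```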
   
   Hi @gszadovszky, parquet-mr is able to produce big row groups. We found that some files written by Spark (which uses parquet-mr) have this problem. See https://issues.apache.org/jira/browse/PARQUET-2045 for details.
   
   There are two options to fix this problem:
   1. fail on the writer side when creating such a large row group/column chunk
   2. support it on the reader side, which is this PR's approach. It would require a lot of resources, but it's feasible.
   
   Either option is fine with me, WDYT?
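   For option 2, the core difficulty is that data larger than `Integer.MAX_VALUE` bytes cannot live in a single Java `byte[]`. One common workaround is to back a long-addressable buffer with a list of fixed-size arrays. The sketch below is illustrative only; `ChunkedBuffer` and its 1 MiB chunk size are made-up names and values, not the memory-handling objects parquet-mr actually uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the "proper memory handling" idea behind option 2:
// a payload larger than Integer.MAX_VALUE bytes cannot fit in one byte[],
// but it can be split across a list of fixed-size backing arrays.
public class ChunkedBuffer {
    static final int CHUNK = 1 << 20; // 1 MiB per backing array (illustrative)

    // Allocate enough CHUNK-sized arrays to hold totalBytes, with the
    // final array trimmed to the remainder.
    static List<byte[]> allocate(long totalBytes) {
        List<byte[]> chunks = new ArrayList<>();
        long remaining = totalBytes;
        while (remaining > 0) {
            int size = (int) Math.min(remaining, CHUNK);
            chunks.add(new byte[size]);
            remaining -= size;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 2.5 MiB needs three chunks: 1 MiB + 1 MiB + 0.5 MiB.
        List<byte[]> chunks = allocate((1L << 20) * 5 / 2);
        System.out.println(chunks.size()); // 3
    }
}
```

   Reads and writes then translate a long offset into (chunk index, offset within chunk), which is what makes row groups beyond 2GB addressable on the reader side.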


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org