Posted to issues@orc.apache.org by "deshanxiao (via GitHub)" <gi...@apache.org> on 2023/02/09 05:42:39 UTC

[GitHub] [orc] deshanxiao opened a new issue, #1404: Support read large statistics exceed 2GB

deshanxiao opened a new issue, #1404:
URL: https://github.com/apache/orc/issues/1404

   In the Java and C++ readers, we cannot read statistics that exceed 2GB. We should find a new way or design to support reading files with large statistics.
   
   ```
   com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.
   	at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:154)
   	at com.google.protobuf.CodedInputStream$StreamDecoder.readRawBytesSlowPathOneChunk(CodedInputStream.java:2954)
   	at com.google.protobuf.CodedInputStream$StreamDecoder.readBytesSlowPath(CodedInputStream.java:3035)
   	at com.google.protobuf.CodedInputStream$StreamDecoder.readBytes(CodedInputStream.java:2446)
   	at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:2118)
   	at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:2070)
   	at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:3285)
   	at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:3279)
   	at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2423)
   	at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:8172)
   	at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:8093)
   	at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:10494)
   	at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:10488)
   ```
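   For context, here is a minimal sketch (not protobuf-java's actual implementation) of why the cap sits at ~2GB: length-delimited protobuf fields carry their byte length on the wire as a varint, which could in principle describe a larger payload, but the Java reader materializes fields into `byte[]` arrays indexed by `int`. So even after raising `CodedInputStream.setSizeLimit()` (which itself takes an `int`), a single message cannot exceed `Integer.MAX_VALUE` bytes:

   ```java
   // Minimal illustration, NOT the protobuf-java source: a toy varint
   // codec showing that the wire format can express a 3GiB length,
   // while Java's int-indexed arrays cap a single field at ~2GiB.
   public class VarintLimit {
       // Decode a little-endian base-128 varint starting at pos.
       static long decodeVarint(byte[] buf, int pos) {
           long result = 0;
           int shift = 0;
           while (true) {
               byte b = buf[pos++];
               result |= (long) (b & 0x7F) << shift;
               if ((b & 0x80) == 0) return result;
               shift += 7;
           }
       }

       // Encode a non-negative value as a base-128 varint.
       static byte[] encodeVarint(long value) {
           java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
           while ((value & ~0x7FL) != 0) {
               out.write((int) ((value & 0x7F) | 0x80));
               value >>>= 7;
           }
           out.write((int) value);
           return out.toByteArray();
       }

       public static void main(String[] args) {
           long threeGiB = 3L * 1024 * 1024 * 1024;
           long decoded = decodeVarint(encodeVarint(threeGiB), 0);
           // The wire format round-trips the length fine...
           System.out.println("decoded length = " + decoded); // 3221225472
           // ...but a Java byte[] (and protobuf-java's int-based size
           // limit) cannot hold a field that large.
           System.out.println("fits in a Java byte[]? " + (decoded <= Integer.MAX_VALUE));
       }
   }
   ```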


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@orc.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [orc] zabetak commented on issue #1404: Support read large statistics exceed 2GB

Posted by "zabetak (via GitHub)" <gi...@apache.org>.
zabetak commented on issue #1404:
URL: https://github.com/apache/orc/issues/1404#issuecomment-1441842189

   I am rather positive on the idea of enforcing limits on the writer, but I would expect this to live at the protobuf layer (https://github.com/protocolbuffers/protobuf/issues/11729). The limitation comes from protobuf, so it seems natural to add checks there rather than in ORC code.
   
   The reason I brought up the question about maximum size is that as the file grows, so does the metadata, and here we have a hard limit on how far it can go. If there is a compelling use-case for supporting arbitrarily big files (with arbitrarily big metadata), then investing in a new design would make sense.
   
   To be clear, I am not pushing for a new design, since I fully agree with both @deshanxiao and @omalley that with proper schema design + configuration the chances of hitting the problem are rather small. From the ORC perspective, it seems acceptable to settle for the 1/2GB limit on the metadata section.
   
   I didn't mean to imply that having 500GB is a good/normal thing, but if nothing prevents users from doing it, eventually they will get there. :)
   
   Speaking of actual use-cases, in Hive, I have recently seen `OrcSplit` reporting a [fileLength](https://github.com/apache/hive/blob/d079dfbdda61f6e24fa39ecbcfb758f0a7402cf3/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java#L69) of `216488139040` for files under certain table partitions. I am not sure how well this number translates to the actual file size, nor about the configuration that led to this situation, since I am not the end-user myself; I was just investigating the problem by checking the Hive application logs.
   
   Summing up, I don't think a new metadata design is worth it at the moment, and limiting the writer seems more appropriately done at the protobuf layer. From my perspective, the only actionable item for this issue at this point would be to add a brief mention of the metadata size limitation on the website (e.g., https://orc.apache.org/specification/ORCv1/).




[GitHub] [orc] omalley commented on issue #1404: Support read large statistics exceed 2GB

Posted by "omalley (via GitHub)" <gi...@apache.org>.
omalley commented on issue #1404:
URL: https://github.com/apache/orc/issues/1404#issuecomment-1440565884

   Yeah, I've seen ORC files up to 2GB or so, but @deshanxiao's point is accurate that you'll probably get better performance out of Spark & Trino if you keep them smaller than 1-2GB.
   
   That said, can you share how many rows & columns are in the 500GB file? How big did you configure the stripe buffers, and how large are your actual stripes? You should bump up your stripe buffer to get your stripe size to around 256MB. That will still give you 2000 stripes, which, unless you have a crazy number of columns, should keep the metadata under 2GB.
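   The back-of-the-envelope math behind the 256MB suggestion can be sketched in plain Java. The 1KB-of-statistics-per-column-per-stripe figure below is an illustrative assumption, not a measured ORC number:

   ```java
   // Rough estimate of how stripe count and column count drive the
   // total size of serialized stripe statistics.
   public class StripeMath {
       static long stripeCount(long fileSize, long stripeSize) {
           return fileSize / stripeSize;
       }

       // Hypothetical total statistics size for the whole file, assuming a
       // fixed number of serialized bytes per column per stripe.
       static long estimatedStatsBytes(long stripes, int columns, long bytesPerColumnPerStripe) {
           return stripes * columns * bytesPerColumnPerStripe;
       }

       public static void main(String[] args) {
           long fileSize = 500L * 1024 * 1024 * 1024; // the 500GB file from the thread
           long stripeSize = 256L * 1024 * 1024;      // suggested ~256MB stripes
           long stripes = stripeCount(fileSize, stripeSize);
           System.out.println("stripes = " + stripes); // 2000

           // 1KB of statistics per column per stripe: assumption for illustration.
           for (int columns : new int[] {10, 100, 1000}) {
               long bytes = estimatedStatsBytes(stripes, columns, 1024);
               System.out.printf("%4d columns -> ~%d MB of stripe statistics%n",
                       columns, bytes / (1024 * 1024));
           }
       }
   }
   ```

   Under these assumptions even 1000 columns stays just under the 2GB protobuf cap, which matches the comment above; many more columns, or much smaller stripes, would blow past it.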
   
   My guess is that you have a huge number of very small stripes. That will hurt performance, especially if you are using the C++ reader.




[GitHub] [orc] zabetak commented on issue #1404: Support read large statistics exceed 2GB

Posted by "zabetak (via GitHub)" <gi...@apache.org>.
zabetak commented on issue #1404:
URL: https://github.com/apache/orc/issues/1404#issuecomment-1439889316

   At first glance, 2GB of metadata seems big, especially considering the toy example I made in ORC-1361. However, if you have a 500GB ORC file, then 2GB of metadata does not appear too big anymore, so things are relative.
   
   Are there limitations on the maximum size of an ORC file? Do we support such use-cases?
   
   If we add a limit while writing (which, by the way, I also suggested in https://github.com/protocolbuffers/protobuf/issues/11729), then we should define what happens when the limit is exceeded:
   * drop all metadata
   * fail the entire write
   * ignore metadata over the limit (keep partial metadata)
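   The three options could be sketched as a writer-side policy check. Everything here is hypothetical and invented for illustration; nothing like this exists in the ORC codebase:

   ```java
   // Hypothetical writer-side size check for serialized statistics.
   // Class, enum, and method names are invented for illustration.
   public class MetadataLimitSketch {
       enum OverLimitPolicy { DROP_ALL, FAIL_WRITE, KEEP_PARTIAL }

       static byte[] applyPolicy(byte[] serializedStats, long limit, OverLimitPolicy policy) {
           if (serializedStats.length <= limit) {
               return serializedStats; // under the limit: write as-is
           }
           switch (policy) {
               case DROP_ALL:
                   return new byte[0]; // drop all metadata
               case FAIL_WRITE:
                   throw new IllegalStateException("statistics size "
                           + serializedStats.length + " exceeds limit " + limit);
               case KEEP_PARTIAL:
               default:
                   // Keep a prefix up to the limit. A real implementation would
                   // have to truncate on a message boundary, not mid-field.
                   return java.util.Arrays.copyOf(serializedStats, (int) limit);
           }
       }

       public static void main(String[] args) {
           byte[] stats = new byte[100];
           System.out.println(applyPolicy(stats, 1000, OverLimitPolicy.DROP_ALL).length);   // 100
           System.out.println(applyPolicy(stats, 50, OverLimitPolicy.DROP_ALL).length);     // 0
           System.out.println(applyPolicy(stats, 50, OverLimitPolicy.KEEP_PARTIAL).length); // 50
       }
   }
   ```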




[GitHub] [orc] zabetak commented on issue #1404: Support read large statistics exceed 2GB

Posted by "zabetak (via GitHub)" <gi...@apache.org>.
zabetak commented on issue #1404:
URL: https://github.com/apache/orc/issues/1404#issuecomment-1443384312

   @deshanxiao In https://github.com/protocolbuffers/protobuf/issues/11729 I shared a very simple project reproducing/explaining the write/read problem. Please have a look and if you have questions I will be happy to elaborate more.




[GitHub] [orc] deshanxiao commented on issue #1404: Support read large statistics exceed 2GB

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on issue #1404:
URL: https://github.com/apache/orc/issues/1404#issuecomment-1442886016

   @zabetak Could you explain in detail why we can write, but not read, at the protobuf layer?




[GitHub] [orc] deshanxiao commented on issue #1404: Support read large statistics exceed 2GB

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on issue #1404:
URL: https://github.com/apache/orc/issues/1404#issuecomment-1439937946

   @zabetak Could you provide a scenario for a 500GB ORC file? In my experience, columnar storage generally serves big data engines, and each of these files is generally about 256MB/512MB.




[GitHub] [orc] deshanxiao commented on issue #1404: Support read large statistics exceed 2GB

Posted by "deshanxiao (via GitHub)" <gi...@apache.org>.
deshanxiao commented on issue #1404:
URL: https://github.com/apache/orc/issues/1404#issuecomment-1439605709

   In fact, it is not reasonable to have such large statistics for a single ORC file; it requires too much memory. My suggestion is to limit writing such huge statistics on the writer side. What do you think? @zabetak

