You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/04 23:13:24 UTC

[GitHub] [arrow] wesm opened a new pull request #7103: ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages

wesm opened a new pull request #7103:
URL: https://github.com/apache/arrow/pull/7103


   While it's not an ideal use case for Parquet, the 10MB limit for strings was causing a Thrift deserialization failure due to the large "pandas metadata" JSON blob written with the Schema when there are many columns. A 100MB limit should still catch "memory bombs" caused by nefarious input while allowing pretty wide data frames to be stored successfully


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7103: ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7103:
URL: https://github.com/apache/arrow/pull/7103#issuecomment-623757231


   https://issues.apache.org/jira/browse/ARROW-8694


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7103: ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7103:
URL: https://github.com/apache/arrow/pull/7103#discussion_r420262089



##########
File path: cpp/src/parquet/thrift_internal.h
##########
@@ -362,7 +362,7 @@ inline void DeserializeThriftUnencryptedMsg(const uint8_t* buf, uint32_t* len,
       new ThriftBuffer(const_cast<uint8_t*>(buf), *len));
   apache::thrift::protocol::TCompactProtocolFactoryT<ThriftBuffer> tproto_factory;
   // Protect against CPU and memory bombs
-  tproto_factory.setStringSizeLimit(10 * 1000 * 1000);
+  tproto_factory.setStringSizeLimit(100 * 1000 * 1000);

Review comment:
       I briefly took a look at this. It is a lot of refactoring to be able to pass a reader property into this function for little gain. If the problem arises I think we can address it then. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #7103: ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #7103:
URL: https://github.com/apache/arrow/pull/7103#discussion_r419875325



##########
File path: cpp/src/parquet/thrift_internal.h
##########
@@ -362,7 +362,7 @@ inline void DeserializeThriftUnencryptedMsg(const uint8_t* buf, uint32_t* len,
       new ThriftBuffer(const_cast<uint8_t*>(buf), *len));
   apache::thrift::protocol::TCompactProtocolFactoryT<ThriftBuffer> tproto_factory;
   // Protect against CPU and memory bombs
-  tproto_factory.setStringSizeLimit(10 * 1000 * 1000);
+  tproto_factory.setStringSizeLimit(100 * 1000 * 1000);

Review comment:
       could this be added easily to reader properties?  If not 100 MB seems like a good amount of buffer, but in case people really want to abuse it, it would be nice not to get follow-up bug reports.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7103: ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7103:
URL: https://github.com/apache/arrow/pull/7103#issuecomment-624320111


   +1


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7103: ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7103:
URL: https://github.com/apache/arrow/pull/7103#discussion_r420041217



##########
File path: cpp/src/parquet/thrift_internal.h
##########
@@ -362,7 +362,7 @@ inline void DeserializeThriftUnencryptedMsg(const uint8_t* buf, uint32_t* len,
       new ThriftBuffer(const_cast<uint8_t*>(buf), *len));
   apache::thrift::protocol::TCompactProtocolFactoryT<ThriftBuffer> tproto_factory;
   // Protect against CPU and memory bombs
-  tproto_factory.setStringSizeLimit(10 * 1000 * 1000);
+  tproto_factory.setStringSizeLimit(100 * 1000 * 1000);

Review comment:
       More generally, it turns out that there are other types of memory bombs possible with Parquet (for example using compression). So i'm not sure the limit is very useful anyway.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org