Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/31 16:18:37 UTC

[GitHub] [arrow] pitrou opened a new pull request, #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

pitrou opened a new pull request, #13275:
URL: https://github.com/apache/arrow/pull/13275

   In fringe cases, users may have Parquet files whose deserialization exceeds our default Thrift size limits.
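
   For context, a minimal sketch of how the new limits surface on the Python side (assuming a pyarrow build that includes this PR; the keyword names follow the diffs discussed below):

   ```python
   # Sketch (assumes pyarrow >= 9.0, which shipped this PR): raise the Thrift
   # limits when reading a file whose metadata exceeds the defaults.
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   path = os.path.join(tempfile.mkdtemp(), "example.parquet")
   table = pa.table({"x": [1, 2, 3]})
   pq.write_table(table, path)

   # Both limits are applied while deserializing the Thrift-encoded file
   # metadata and page headers: a byte count for strings, an element count
   # for containers.
   result = pq.read_table(
       path,
       thrift_string_size_limit=100 * 1024 * 1024,
       thrift_container_size_limit=1_000_000,
   )
   assert result.equals(table)
   ```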


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r890888658


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -223,14 +223,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
 class SerializedPageReader : public PageReader {
  public:
   SerializedPageReader(std::shared_ptr<ArrowInputStream> stream, int64_t total_num_rows,
-                       Compression::type codec, ::arrow::MemoryPool* pool,
+                       Compression::type codec, const ReaderProperties& properties,

Review Comment:
   Yes, we probably can.





[GitHub] [arrow] emkornfield commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r890883558


##########
cpp/src/parquet/metadata.h:
##########
@@ -126,6 +126,12 @@ class PARQUET_EXPORT ColumnChunkMetaData {
       const ApplicationVersion* writer_version = NULLPTR, int16_t row_group_ordinal = -1,
       int16_t column_ordinal = -1,
       std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR);

Review Comment:
   +1 to deprecation.





[GitHub] [arrow] pitrou commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r890889197


##########
cpp/src/parquet/thrift_internal.h:
##########
@@ -350,74 +351,84 @@ static inline format::EncryptionAlgorithm ToThrift(EncryptionAlgorithm encryptio
 
 using ThriftBuffer = apache::thrift::transport::TMemoryBuffer;
 
-// On Thrift 0.14.0+, we want to use TConfiguration to raise the max message size
-// limit (ARROW-13655).  If we wanted to protect against huge messages, we could
-// do it ourselves since we know the message size up front.
+class ThriftDeserializer {
+ public:
+  explicit ThriftDeserializer(const ReaderProperties& properties)
+      : ThriftDeserializer(properties.thrift_string_size_limit(),
+                           properties.thrift_container_size_limit()) {}
+
+  ThriftDeserializer(int32_t string_size_limit, int32_t container_size_limit)
+      : string_size_limit_(string_size_limit),
+        container_size_limit_(container_size_limit) {}
 
-inline std::shared_ptr<ThriftBuffer> CreateReadOnlyMemoryBuffer(uint8_t* buf,
-                                                                uint32_t len) {
+  // Deserialize a thrift message from buf/len.  buf/len must at least contain
+  // all the bytes needed to store the thrift message.  On return, len will be
+  // set to the actual length of the header.
+  template <class T>
+  void DeserializeMessage(const uint8_t* buf, uint32_t* len, T* deserialized_msg,
+                          const std::shared_ptr<Decryptor>& decryptor = NULLPTR) {

Review Comment:
   It is just the original signature in the previous code, but I can change it.





[GitHub] [arrow] github-actions[bot] commented on pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13275:
URL: https://github.com/apache/arrow/pull/13275#issuecomment-1142363955

   https://issues.apache.org/jira/browse/ARROW-16546




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r888009120


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -654,17 +668,49 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     def pre_buffer(self, bint pre_buffer):
         self.arrow_reader_properties().set_pre_buffer(pre_buffer)
 
+    @property
+    def thrift_string_size_limit(self):
+        return self.reader_properties().thrift_string_size_limit()
+
+    @thrift_string_size_limit.setter
+    def thrift_string_size_limit(self, size):
+        if size <= 0:
+            raise ValueError("size must be larger than zero")
+        self.reader_properties().set_thrift_string_size_limit(size)
+
+    @property
+    def thrift_container_size_limit(self):
+        return self.reader_properties().thrift_container_size_limit()
+
+    @thrift_container_size_limit.setter
+    def thrift_container_size_limit(self, size):
+        if size <= 0:
+            raise ValueError("size must be larger than zero")

Review Comment:
   Should such a check be included in `_parquet.pyx` as well?



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2258,11 +2269,13 @@ class _ParquetDatasetV2:
     1       4  Horse  2022
     """
 
-    def __init__(self, path_or_paths, filesystem=None, filters=None,
+    def __init__(self, path_or_paths, filesystem=None, *, filters=None,
                  partitioning="hive", read_dictionary=None, buffer_size=None,
                  memory_map=False, ignore_prefixes=None, pre_buffer=True,
                  coerce_int96_timestamp_unit=None, schema=None,
-                 decryption_properties=None, **kwargs):
+                 decryption_properties=None, thrift_string_size_limit=None,
+                 thrift_container_size_limit=None,

Review Comment:
   Although if the tests pass ...



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2258,11 +2269,13 @@ class _ParquetDatasetV2:
     1       4  Horse  2022
     """
 
-    def __init__(self, path_or_paths, filesystem=None, filters=None,
+    def __init__(self, path_or_paths, filesystem=None, *, filters=None,
                  partitioning="hive", read_dictionary=None, buffer_size=None,
                  memory_map=False, ignore_prefixes=None, pre_buffer=True,
                  coerce_int96_timestamp_unit=None, schema=None,
-                 decryption_properties=None, **kwargs):
+                 decryption_properties=None, thrift_string_size_limit=None,
+                 thrift_container_size_limit=None,

Review Comment:
   I think you need to add it to the signature of `class ParquetDataset.__new__` as well, to pass it through





[GitHub] [arrow] lidavidm commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r885867888


##########
cpp/src/parquet/metadata.h:
##########
@@ -126,6 +126,12 @@ class PARQUET_EXPORT ColumnChunkMetaData {
       const ApplicationVersion* writer_version = NULLPTR, int16_t row_group_ordinal = -1,
       int16_t column_ordinal = -1,
       std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR);

Review Comment:
   I think we can just keep it. Maybe put a deprecation on it?





[GitHub] [arrow] pitrou commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r890889476


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -654,17 +668,49 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     def pre_buffer(self, bint pre_buffer):
         self.arrow_reader_properties().set_pre_buffer(pre_buffer)
 
+    @property
+    def thrift_string_size_limit(self):
+        return self.reader_properties().thrift_string_size_limit()
+
+    @thrift_string_size_limit.setter
+    def thrift_string_size_limit(self, size):
+        if size <= 0:

Review Comment:
   We don't seem to validate Parquet property values on the C++ side.
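
   Since the checks stay in the bindings for now, the pattern under discussion amounts to validating in the property setter before forwarding to the wrapped C++ properties object. A plain-Python mirror of the Cython code above (class names hypothetical, for illustration only):

   ```python
   # Plain-Python mirror of the Cython setters above, showing where the
   # validation lives when it is not pushed into C++.  _CppReaderProperties
   # is a hypothetical stand-in for the wrapped C++ ReaderProperties object.
   class _CppReaderProperties:
       def __init__(self):
           self._string_limit = 100 * 1024 * 1024  # illustrative default

       def set_thrift_string_size_limit(self, size):
           self._string_limit = size  # no validation on the C++ side

       def thrift_string_size_limit(self):
           return self._string_limit


   class ParquetScanOptions:
       def __init__(self):
           self._props = _CppReaderProperties()

       @property
       def thrift_string_size_limit(self):
           return self._props.thrift_string_size_limit()

       @thrift_string_size_limit.setter
       def thrift_string_size_limit(self, size):
           # The binding layer rejects non-positive limits up front.
           if size <= 0:
               raise ValueError("size must be larger than zero")
           self._props.set_thrift_string_size_limit(size)
   ```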





[GitHub] [arrow] emkornfield commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r890887947


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -654,17 +668,49 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     def pre_buffer(self, bint pre_buffer):
         self.arrow_reader_properties().set_pre_buffer(pre_buffer)
 
+    @property
+    def thrift_string_size_limit(self):
+        return self.reader_properties().thrift_string_size_limit()
+
+    @thrift_string_size_limit.setter
+    def thrift_string_size_limit(self, size):
+        if size <= 0:

Review Comment:
   should these checks be pushed into C++?





[GitHub] [arrow] github-actions[bot] commented on pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13275:
URL: https://github.com/apache/arrow/pull/13275#issuecomment-1148335033

   Revision: c47ddb6ac26d6916e07e0aa84596e81c797bba65
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-1775fb24b1](https://github.com/ursacomputing/crossbow/branches/all?query=actions-1775fb24b1)
   
   |Task|Status|
   |----|------|
   |test-build-cpp-fuzz|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-build-cpp-fuzz)|
   |test-conda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-conda-cpp)|
   |test-conda-cpp-valgrind|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-1775fb24b1-azure-test-conda-cpp-valgrind)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-1775fb24b1-azure-test-conda-cpp-valgrind)|
   |test-debian-10-cpp-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-debian-10-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-debian-10-cpp-amd64)|
   |test-debian-10-cpp-i386|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-debian-10-cpp-i386)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-debian-10-cpp-i386)|
   |test-debian-11-cpp-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-debian-11-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-debian-11-cpp-amd64)|
   |test-debian-11-cpp-i386|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-debian-11-cpp-i386)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-debian-11-cpp-i386)|
   |test-fedora-35-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-fedora-35-cpp)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-fedora-35-cpp)|
   |test-ubuntu-18.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-ubuntu-18.04-cpp)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-ubuntu-18.04-cpp)|
   |test-ubuntu-18.04-cpp-release|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-ubuntu-18.04-cpp-release)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-ubuntu-18.04-cpp-release)|
   |test-ubuntu-18.04-cpp-static|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-ubuntu-18.04-cpp-static)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-ubuntu-18.04-cpp-static)|
   |test-ubuntu-20.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-ubuntu-20.04-cpp)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-ubuntu-20.04-cpp)|
   |test-ubuntu-20.04-cpp-14|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-ubuntu-20.04-cpp-14)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-ubuntu-20.04-cpp-14)|
   |test-ubuntu-20.04-cpp-17|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-ubuntu-20.04-cpp-17)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-ubuntu-20.04-cpp-17)|
   |test-ubuntu-20.04-cpp-bundled|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-ubuntu-20.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-ubuntu-20.04-cpp-bundled)|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-ubuntu-20.04-cpp-thread-sanitizer)|
   |test-ubuntu-22.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1775fb24b1-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1775fb24b1-github-test-ubuntu-22.04-cpp)|




[GitHub] [arrow] pitrou commented on pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13275:
URL: https://github.com/apache/arrow/pull/13275#issuecomment-1148741916

   Sorry, I missed the CI failure introduced by this PR in the Numpydoc checks.
   Quick fix in https://github.com/apache/arrow/pull/13331




[GitHub] [arrow] pitrou commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r888047022


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -654,17 +668,49 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     def pre_buffer(self, bint pre_buffer):
         self.arrow_reader_properties().set_pre_buffer(pre_buffer)
 
+    @property
+    def thrift_string_size_limit(self):
+        return self.reader_properties().thrift_string_size_limit()
+
+    @thrift_string_size_limit.setter
+    def thrift_string_size_limit(self, size):
+        if size <= 0:
+            raise ValueError("size must be larger than zero")
+        self.reader_properties().set_thrift_string_size_limit(size)
+
+    @property
+    def thrift_container_size_limit(self):
+        return self.reader_properties().thrift_container_size_limit()
+
+    @thrift_container_size_limit.setter
+    def thrift_container_size_limit(self, size):
+        if size <= 0:
+            raise ValueError("size must be larger than zero")

Review Comment:
   Will do.





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r888058952


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2258,11 +2269,13 @@ class _ParquetDatasetV2:
     1       4  Horse  2022
     """
 
-    def __init__(self, path_or_paths, filesystem=None, filters=None,
+    def __init__(self, path_or_paths, filesystem=None, *, filters=None,
                  partitioning="hive", read_dictionary=None, buffer_size=None,
                  memory_map=False, ignore_prefixes=None, pre_buffer=True,
                  coerce_int96_timestamp_unit=None, schema=None,
-                 decryption_properties=None, **kwargs):
+                 decryption_properties=None, thrift_string_size_limit=None,
+                 thrift_container_size_limit=None,

Review Comment:
   Ah, yes, I see





[GitHub] [arrow] emkornfield commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r890885939


##########
cpp/src/parquet/thrift_internal.h:
##########
@@ -350,74 +351,84 @@ static inline format::EncryptionAlgorithm ToThrift(EncryptionAlgorithm encryptio
 
 using ThriftBuffer = apache::thrift::transport::TMemoryBuffer;
 
-// On Thrift 0.14.0+, we want to use TConfiguration to raise the max message size
-// limit (ARROW-13655).  If we wanted to protect against huge messages, we could
-// do it ourselves since we know the message size up front.
+class ThriftDeserializer {
+ public:
+  explicit ThriftDeserializer(const ReaderProperties& properties)
+      : ThriftDeserializer(properties.thrift_string_size_limit(),
+                           properties.thrift_container_size_limit()) {}
+
+  ThriftDeserializer(int32_t string_size_limit, int32_t container_size_limit)
+      : string_size_limit_(string_size_limit),
+        container_size_limit_(container_size_limit) {}
 
-inline std::shared_ptr<ThriftBuffer> CreateReadOnlyMemoryBuffer(uint8_t* buf,
-                                                                uint32_t len) {
+  // Deserialize a thrift message from buf/len.  buf/len must at least contain
+  // all the bytes needed to store the thrift message.  On return, len will be
+  // set to the actual length of the header.
+  template <class T>
+  void DeserializeMessage(const uint8_t* buf, uint32_t* len, T* deserialized_msg,
+                          const std::shared_ptr<Decryptor>& decryptor = NULLPTR) {

Review Comment:
   Why `shared_ptr` here? If this is needed for async operations, doesn't a copy of the `shared_ptr` need to be taken? If it isn't, why not pass a plain pointer?





[GitHub] [arrow] pitrou commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r888048009


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2258,11 +2269,13 @@ class _ParquetDatasetV2:
     1       4  Horse  2022
     """
 
-    def __init__(self, path_or_paths, filesystem=None, filters=None,
+    def __init__(self, path_or_paths, filesystem=None, *, filters=None,
                  partitioning="hive", read_dictionary=None, buffer_size=None,
                  memory_map=False, ignore_prefixes=None, pre_buffer=True,
                  coerce_int96_timestamp_unit=None, schema=None,
-                 decryption_properties=None, **kwargs):
+                 decryption_properties=None, thrift_string_size_limit=None,
+                 thrift_container_size_limit=None,

Review Comment:
   Yeah, `pq.read_table` invokes dataset directly.





[GitHub] [arrow] pitrou commented on pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13275:
URL: https://github.com/apache/arrow/pull/13275#issuecomment-1148332322

   > any plans to add a sample file with overrides that would pass/fail to verify end-to-end functionality (sorry if I missed this)
   
   The Python tests generate such a file on the fly.
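
   One way such a file can be generated on the fly (a sketch, assuming a pyarrow build with this PR; the long column name and the tiny limit are illustrative, not the values used by the actual tests):

   ```python
   # Sketch: build a file whose Thrift-encoded footer contains a long string
   # (a very long column name), then read it back with a deliberately tiny
   # thrift_string_size_limit to trip the limit, and with the default to
   # confirm the file is otherwise valid.
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   path = os.path.join(tempfile.mkdtemp(), "wide.parquet")
   long_name = "x" * 10_000  # column name is stored as a Thrift string
   pq.write_table(pa.table({long_name: [1, 2, 3]}), path)

   # Default limits: reads fine.
   assert pq.read_table(path).num_rows == 3

   # Tiny string limit: deserializing the footer should now fail.
   try:
       pq.read_table(path, thrift_string_size_limit=100)
   except Exception as exc:
       print("read failed as expected:", type(exc).__name__)
   ```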




[GitHub] [arrow] pitrou merged pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou merged PR #13275:
URL: https://github.com/apache/arrow/pull/13275




[GitHub] [arrow] emkornfield commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r890888928


##########
cpp/src/parquet/thrift_internal.h:
##########
@@ -350,74 +351,84 @@ static inline format::EncryptionAlgorithm ToThrift(EncryptionAlgorithm encryptio
 
 using ThriftBuffer = apache::thrift::transport::TMemoryBuffer;
 
-// On Thrift 0.14.0+, we want to use TConfiguration to raise the max message size
-// limit (ARROW-13655).  If we wanted to protect against huge messages, we could
-// do it ourselves since we know the message size up front.
+class ThriftDeserializer {
+ public:
+  explicit ThriftDeserializer(const ReaderProperties& properties)
+      : ThriftDeserializer(properties.thrift_string_size_limit(),
+                           properties.thrift_container_size_limit()) {}
+
+  ThriftDeserializer(int32_t string_size_limit, int32_t container_size_limit)
+      : string_size_limit_(string_size_limit),
+        container_size_limit_(container_size_limit) {}
 
-inline std::shared_ptr<ThriftBuffer> CreateReadOnlyMemoryBuffer(uint8_t* buf,
-                                                                uint32_t len) {
+  // Deserialize a thrift message from buf/len.  buf/len must at least contain
+  // all the bytes needed to store the thrift message.  On return, len will be
+  // set to the actual length of the header.
+  template <class T>
+  void DeserializeMessage(const uint8_t* buf, uint32_t* len, T* deserialized_msg,
+                          const std::shared_ptr<Decryptor>& decryptor = NULLPTR) {

Review Comment:
   I see this is just moving around code.





[GitHub] [arrow] emkornfield commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
emkornfield commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r890880747


##########
cpp/src/parquet/column_reader.cc:
##########
@@ -223,14 +223,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
 class SerializedPageReader : public PageReader {
  public:
   SerializedPageReader(std::shared_ptr<ArrowInputStream> stream, int64_t total_num_rows,
-                       Compression::type codec, ::arrow::MemoryPool* pool,
+                       Compression::type codec, const ReaderProperties& properties,

Review Comment:
   Is `properties` a movable type? Should we pass by value and move?





[GitHub] [arrow] pitrou commented on pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13275:
URL: https://github.com/apache/arrow/pull/13275#issuecomment-1148304010

   @github-actions crossbow submit -g cpp




[GitHub] [arrow] pitrou commented on a diff in pull request #13275: ARROW-16546: [Parquet][C++][Python] Make Thrift limits configurable

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13275:
URL: https://github.com/apache/arrow/pull/13275#discussion_r885842956


##########
cpp/src/parquet/metadata.h:
##########
@@ -126,6 +126,12 @@ class PARQUET_EXPORT ColumnChunkMetaData {
       const ApplicationVersion* writer_version = NULLPTR, int16_t row_group_ordinal = -1,
       int16_t column_ordinal = -1,
       std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR);

Review Comment:
   I kept the old `*MetaData::Make` signatures for compatibility, but perhaps that's not important? @emkornfield @lidavidm 


