You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/05 16:16:27 UTC

[GitHub] [arrow] wesm opened a new pull request #7108: PARQUET-1857: [C++] Do not fail on unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

wesm opened a new pull request #7108:
URL: https://github.com/apache/arrow/pull/7108


   While writing files with so many row groups is a bad idea, people will still do it and currently the C++ library will segfault in release builds when trying to read such a file. This removes those potential segfaults and enables reading the many-row-group files again. Files with encrypted row group metadata with that many row groups cannot be read because the Parquet metadata uses an int16 row group ordinal key.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7108: PARQUET-1857: [C++] Do not fail to read unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7108:
URL: https://github.com/apache/arrow/pull/7108#discussion_r420739547



##########
File path: cpp/src/parquet/arrow/reader_internal.h
##########
@@ -76,7 +76,8 @@ class FileColumnIterator {
       return nullptr;
     }
 
-    auto row_group_reader = reader_->RowGroup(row_groups_.front());
+    int row_group_index = row_groups_.front();
+    auto row_group_reader = reader_->RowGroup(row_group_index);

Review comment:
       Without this change I was unable to determine at which row group index the failure occurred in the debugger. I can revert it but it seems useful to be able to know 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7108: PARQUET-1857: [C++] Do not fail to read unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7108:
URL: https://github.com/apache/arrow/pull/7108#discussion_r420688407



##########
File path: cpp/src/parquet/arrow/reader_internal.h
##########
@@ -76,7 +76,8 @@ class FileColumnIterator {
       return nullptr;
     }
 
-    auto row_group_reader = reader_->RowGroup(row_groups_.front());
+    int row_group_index = row_groups_.front();
+    auto row_group_reader = reader_->RowGroup(row_group_index);

Review comment:
       Does this change make any difference? I was expecting a type issue, but it seems like `row_groups_.front()` should be an `int` anyway.

##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -284,6 +284,12 @@ def test_special_chars_filename(tempdir, use_legacy_dataset):
     assert table_read.equals(table)
 
 
+@pytest.mark.slow
+def test_file_with_over_int16_max_row_groups():

Review comment:
       Is this test supposed to succeed? Or is the 16-bit limitation only for encrypted files?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7108: PARQUET-1857: [C++] Do not fail to read unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7108:
URL: https://github.com/apache/arrow/pull/7108#discussion_r420862312



##########
File path: cpp/src/parquet/file_reader.cc
##########
@@ -160,6 +166,12 @@ class SerializedRowGroup : public RowGroupReader::Contents {
       throw ParquetException("RowGroup is noted as encrypted but no file decryptor");
     }
 
+    constexpr auto kEncryptedRowGroupsLimit = 32767;
+    if (i > kEncryptedRowGroupsLimit) {
+      throw ParquetException(
+          "Encrypted files cannot contain more than 32767 row groups;");

Review comment:
       Nope, that's a typo




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7108: PARQUET-1857: [C++] Do not fail to read unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7108:
URL: https://github.com/apache/arrow/pull/7108#discussion_r420745806



##########
File path: cpp/src/parquet/arrow/reader_internal.h
##########
@@ -76,7 +76,8 @@ class FileColumnIterator {
       return nullptr;
     }
 
-    auto row_group_reader = reader_->RowGroup(row_groups_.front());
+    int row_group_index = row_groups_.front();
+    auto row_group_reader = reader_->RowGroup(row_group_index);

Review comment:
       Ah I thought it was pop_front. I'll revert 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7108: PARQUET-1857: [C++] Do not fail to read unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7108:
URL: https://github.com/apache/arrow/pull/7108#issuecomment-624752287


   +1


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7108: PARQUET-1857: [C++] Do not fail to read unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7108:
URL: https://github.com/apache/arrow/pull/7108#issuecomment-624152633


   https://issues.apache.org/jira/browse/PARQUET-1857


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #7108: PARQUET-1857: [C++] Do not fail to read unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #7108:
URL: https://github.com/apache/arrow/pull/7108#discussion_r420740026



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -284,6 +284,12 @@ def test_special_chars_filename(tempdir, use_legacy_dataset):
     assert table_read.equals(table)
 
 
+@pytest.mark.slow
+def test_file_with_over_int16_max_row_groups():

Review comment:
       Correct only for encrypted files. So this worked prior to the encrypted patch landing which introduced the artificial limitation for unencrypted files




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7108: PARQUET-1857: [C++] Do not fail to read unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7108:
URL: https://github.com/apache/arrow/pull/7108#issuecomment-624350469


   @majetideepak or @xhochy would you have a look at this? Will need to get a thumbs up from a Parquet committer


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kiszk commented on a change in pull request #7108: PARQUET-1857: [C++] Do not fail to read unencrypted files with over 32767 row groups. Change some DCHECKs causing segfaults to throw exceptions

Posted by GitBox <gi...@apache.org>.
kiszk commented on a change in pull request #7108:
URL: https://github.com/apache/arrow/pull/7108#discussion_r420860301



##########
File path: cpp/src/parquet/file_reader.cc
##########
@@ -160,6 +166,12 @@ class SerializedRowGroup : public RowGroupReader::Contents {
       throw ParquetException("RowGroup is noted as encrypted but no file decryptor");
     }
 
+    constexpr auto kEncryptedRowGroupsLimit = 32767;
+    if (i > kEncryptedRowGroupsLimit) {
+      throw ParquetException(
+          "Encrypted files cannot contain more than 32767 row groups;");

Review comment:
       nit: is `;` necessary in `groups;`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org