You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/28 15:12:37 UTC

[GitHub] [arrow] ghuls opened a new pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

ghuls opened a new pull request #9349:
URL: https://github.com/apache/arrow/pull/9349


   …iles which it created as default flatbuffers::Verifier verifier number of columns argument is to low when creating a Feather file from a dataframe with 499999 columns
   
   Increase the number of maximum allowed tables when verifying a FlatBuffer
   from 1_000_000 to 10_000_000.
   
   This allows reading back Feather v2 files with more than 1_000_000
   columns (499_999 columns from Pandas dataframe ==> Feather v2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ghuls commented on a change in pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
ghuls commented on a change in pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#discussion_r566334685



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -1053,7 +1053,7 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
         file_->ReadAt(footer_offset_ - footer_length - file_end_size, footer_length));
 
     auto data = footer_buffer_->data();
-    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128);
+    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128, UINT_MAX);

Review comment:
       done
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ghuls commented on a change in pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
ghuls commented on a change in pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#discussion_r566326075



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -1053,7 +1053,7 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
         file_->ReadAt(footer_offset_ - footer_length - file_end_size, footer_length));
 
     auto data = footer_buffer_->data();
-    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128);
+    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128, UINT_MAX);

Review comment:
       Like this?
   ```c++
   // Increase max_tables from default 100000 to UINT_MAX when verfying a flatbuffer.
   flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128, UINT_MAX);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #9349: ARROW-10056: [C++] Increase flatbuffers max_tables parameter in order to read wide tables

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #9349:
URL: https://github.com/apache/arrow/pull/9349


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #9349: ARROW-10056: [C++] Increase flatbuffers max_tables parameter in order to read wide tables

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-775317542


   So there is a downside :) does reducing it to 10000000 make this any better?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #9349: ARROW-10056: [C++] Increase flatbuffers max_tables parameter in order to read wide tables

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-775295349


   Note that this can make timeout bombs more efficient. For example OSS-Fuzz produced a 1536-byte [Footer](https://github.com/apache/arrow/blob/master/format/File.fbs#L26-L37) table whose verification takes 4 minutes in debug mode.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-769205613


   No particular reason according to [this message](https://github.com/google/flatbuffers/issues/4774#issuecomment-395848846).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #9349: ARROW-10056: [C++] Increase flatbuffers max_tables parameter in order to read wide tables

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-775346541


   I've asked for guidance on the flatbuffers mailing-list.
   
   However, I think that we may want to make the `max_tables` a factor of the buffer size. The only point of recursion is the `Field` table. A Field must have a non-empty type (and also, perhaps, a non-empty name). Even with at least 1 bit in the Flatbuffers buffer per Field, a 1536-byte Footer shouldn't contain more than 1536*8 tables.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#discussion_r566305825



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -1053,7 +1053,7 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
         file_->ReadAt(footer_offset_ - footer_length - file_end_size, footer_length));
 
     auto data = footer_buffer_->data();
-    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128);
+    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128, UINT_MAX);

Review comment:
       please still add the parameter comment if possible. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #9349: ARROW-10056: [C++] Increase flatbuffers max_tables parameter in order to read wide tables

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-775349181


   If I use the aforementioned heuristic, verification of the 1.5kB file fails after less than 20 ms (in debug mode).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ghuls commented on a change in pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
ghuls commented on a change in pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#discussion_r566271219



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -58,6 +58,8 @@
 #include "generated/Schema_generated.h"
 #include "generated/SparseTensor_generated.h"
 
+#define FLATBUFFER_MAX_TABLES 10000000

Review comment:
       > According to various sources on the Internet it should be ok to just pass `UINT_MAX`. @wesm @emkornfield What do you think?
   
   That should be fine indeed. I got the solution for this problem from:
   https://groups.google.com/g/flatbuffers/c/JtDGnBPx9is
   
   > You simply need to pass your own max_tables to the FlatBufferBuilder constructor.
   > Pass e.g. INT_MAX and you will guaranteed never hit it :)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#discussion_r566327155



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -1053,7 +1053,7 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
         file_->ReadAt(footer_offset_ - footer_length - file_end_size, footer_length));
 
     auto data = footer_buffer_->data();
-    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128);
+    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128, UINT_MAX);

Review comment:
       No, like this:
   ```c++
   
   flatbuffers::Verifier verifier(data, footer_buffer_->size(), /*max_depth=*/128, /*max_tables=*/UINT_MAX);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-769180624


   https://issues.apache.org/jira/browse/ARROW-10056


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #9349: ARROW-10056: [C++] Increase flatbuffers max_tables parameter in order to read wide tables

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-775361444


   Opened ARROW-11559 to track the verification blowup issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #9349: ARROW-10056: [C++] Increase flatbuffers max_tables parameter in order to read wide tables

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-775334287


   So apparently the 1536-byte footer message contains 268 million flatbuffers tables (and it doesn't fail verifying). I don't know how that works.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #9349: ARROW-10056: [C++] Increase flatbuffers max_tables parameter in order to read wide tables

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-775295349


   Note that this can make timeout bombs more efficient. For example OSS-Fuzz produced a 1536-byte [Footer](https://github.com/apache/arrow/blob/master/format/File.fbs#L26-L37) table whose verification takes 8 minutes in debug mode.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ghuls commented on a change in pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
ghuls commented on a change in pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#discussion_r566271219



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -58,6 +58,8 @@
 #include "generated/Schema_generated.h"
 #include "generated/SparseTensor_generated.h"
 
+#define FLATBUFFER_MAX_TABLES 10000000

Review comment:
       > According to various sources on the Internet it should be ok to just pass `UINT_MAX`. @wesm @emkornfield What do you think?
   
   That should be fine indeed. I got the solution for this problem from:
   https://groups.google.com/g/flatbuffers/c/JtDGnBPx9is
   
   > You simply need to pass your own max_tables to the FlatBufferBuilder constructor.
   > Pass e.g. INT_MAX and you will guaranteed never hit it :)
   
   I am fine with `UINT_MAX`, instead of an arbitrary value like 10000000.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ghuls commented on a change in pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
ghuls commented on a change in pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#discussion_r566326075



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -1053,7 +1053,7 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
         file_->ReadAt(footer_offset_ - footer_length - file_end_size, footer_length));
 
     auto data = footer_buffer_->data();
-    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128);
+    flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128, UINT_MAX);

Review comment:
       Like this?
   ```c++
   // Increase max_tables from default 1000000 to UINT_MAX when verfying a flatbuffer.
   flatbuffers::Verifier verifier(data, footer_buffer_->size(), 128, UINT_MAX);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-769195156


   According to various sources on the Internet it should be ok to just pass `UINT_MAX`. @wesm @emkornfield  What do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#discussion_r566231106



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -58,6 +58,8 @@
 #include "generated/Schema_generated.h"
 #include "generated/SparseTensor_generated.h"
 
+#define FLATBUFFER_MAX_TABLES 10000000

Review comment:
       or simple comment the parameter inline below /*flatbuffer_max_tables=*/100000000




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#discussion_r566230575



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -58,6 +58,8 @@
 #include "generated/Schema_generated.h"
 #include "generated/SparseTensor_generated.h"
 
+#define FLATBUFFER_MAX_TABLES 10000000

Review comment:
       style nit: use a constexpr in an anonymous namespace.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #9349: ARROW-10056: [C++][Python] PyArrow fails to read (valid) Feather v2 f…

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #9349:
URL: https://github.com/apache/arrow/pull/9349#issuecomment-769204571


   uint max seems OK to me.   tables seems like a weird way for protection, is the reason why it was set low in the first place?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org