You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "olwmc (via GitHub)" <gi...@apache.org> on 2023/06/18 15:47:02 UTC

[GitHub] [arrow-rs] olwmc opened a new pull request, #4434: Added verifier options

olwmc opened a new pull request, #4434:
URL: https://github.com/apache/arrow-rs/pull/4434

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #4432.
   
   *This is my first time contributing to this repository. My apologies if I missed anything from the contribution guide.*
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   It seems reasonable that that arrow files with greater than one million columns should be able to be read by `arrow-ipc`.
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   All of the changes take place in `arrow-ipc::reader::FileReader::try_new`. I added `VerifierOptions` that get passed to `crate::root_as_footer_with_opts` as opposed to the call to `crate::root_as_footer`, which uses, downstream, the default `VerifierOptions` that caused the million column limit. This has the effect of setting the `max_tables` option based on the number of columns in the given file. This change is directly mirrored from [arrow2's dealing with this same problem](https://github.com/jorgecarleitao/arrow2/commit/8e146a78f53899dd7f7166512c1b73a51803b0ab)
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   No. The interfaces are otherwise untouched.
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   This should not introduce any breaking changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ghuls commented on a diff in pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "ghuls (via GitHub)" <gi...@apache.org>.
ghuls commented on code in PR #4434:
URL: https://github.com/apache/arrow-rs/pull/4434#discussion_r1247835572


##########
arrow-ipc/src/reader.rs:
##########
@@ -647,9 +649,18 @@ impl<R: Read + Seek> FileReader<R> {
         reader.seek(SeekFrom::End(-10 - footer_len as i64))?;
         reader.read_exact(&mut footer_data)?;
 
-        let footer = crate::root_as_footer(&footer_data[..]).map_err(|err| {
-            ArrowError::IoError(format!("Unable to get root as footer: {err:?}"))
-        })?;
+        // construct verifier options that reflect actual number of columns
+        // in file and avoid an error if the file contains more than 1M rows
+        let verifier_options = VerifierOptions {
+            max_depth: 128,
+            max_tables: footer_len as usize * 8,

Review Comment:
   This number is based comes from this comment: https://github.com/apache/arrow/pull/9349#issuecomment-775346541



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #4434:
URL: https://github.com/apache/arrow-rs/pull/4434#issuecomment-1602767380

   Thanks @olwmc  -- I kicked off the CI tests for this PR. 
   
   I wonder if we have any testcase we can use / create? Maybe we can write a test that creates a file with 1M columns and then tries to read it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #4434:
URL: https://github.com/apache/arrow-rs/pull/4434#issuecomment-1615121403

   I thought the suggestion in https://github.com/apache/arrow-rs/pull/4434/files#r1242167017 was to make overriding the limit an option that people could opt in to . Something like
   
   ```rust
   let file_reader = arrow::ipc::FileReader::try_new(reader, None)
      // Allow reading up to 4M columns because we trust the input
     .with_max_columns(4_000_000_000);
   ```
   
   I am not sure why this PR was closed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ghuls commented on a diff in pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "ghuls (via GitHub)" <gi...@apache.org>.
ghuls commented on code in PR #4434:
URL: https://github.com/apache/arrow-rs/pull/4434#discussion_r1247835572


##########
arrow-ipc/src/reader.rs:
##########
@@ -647,9 +649,18 @@ impl<R: Read + Seek> FileReader<R> {
         reader.seek(SeekFrom::End(-10 - footer_len as i64))?;
         reader.read_exact(&mut footer_data)?;
 
-        let footer = crate::root_as_footer(&footer_data[..]).map_err(|err| {
-            ArrowError::IoError(format!("Unable to get root as footer: {err:?}"))
-        })?;
+        // construct verifier options that reflect actual number of columns
+        // in file and avoid an error if the file contains more than 1M rows
+        let verifier_options = VerifierOptions {
+            max_depth: 128,
+            max_tables: footer_len as usize * 8,

Review Comment:
   This number comes from this comment: https://github.com/apache/arrow/pull/9349#issuecomment-775346541



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #4434:
URL: https://github.com/apache/arrow-rs/pull/4434#discussion_r1238639606


##########
arrow-ipc/src/reader.rs:
##########
@@ -647,9 +649,18 @@ impl<R: Read + Seek> FileReader<R> {
         reader.seek(SeekFrom::End(-10 - footer_len as i64))?;
         reader.read_exact(&mut footer_data)?;
 
-        let footer = crate::root_as_footer(&footer_data[..]).map_err(|err| {
-            ArrowError::IoError(format!("Unable to get root as footer: {err:?}"))
-        })?;
+        // construct verifier options that reflect actual number of columns
+        // in file

Review Comment:
   ```suggestion
           // in file and avoid an error if the file contains more than 1M rows
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4434:
URL: https://github.com/apache/arrow-rs/pull/4434#discussion_r1242061745


##########
arrow-ipc/src/reader.rs:
##########
@@ -647,9 +649,18 @@ impl<R: Read + Seek> FileReader<R> {
         reader.seek(SeekFrom::End(-10 - footer_len as i64))?;
         reader.read_exact(&mut footer_data)?;
 
-        let footer = crate::root_as_footer(&footer_data[..]).map_err(|err| {
-            ArrowError::IoError(format!("Unable to get root as footer: {err:?}"))
-        })?;
+        // construct verifier options that reflect actual number of columns
+        // in file and avoid an error if the file contains more than 1M rows
+        let verifier_options = VerifierOptions {

Review Comment:
   I wonder if we should create a FileReaderOptions, that contains both the projection, and also `VerifierOptions`, to allow people to set these as they deem fit



##########
arrow-ipc/src/reader.rs:
##########
@@ -647,9 +649,18 @@ impl<R: Read + Seek> FileReader<R> {
         reader.seek(SeekFrom::End(-10 - footer_len as i64))?;
         reader.read_exact(&mut footer_data)?;
 
-        let footer = crate::root_as_footer(&footer_data[..]).map_err(|err| {
-            ArrowError::IoError(format!("Unable to get root as footer: {err:?}"))
-        })?;
+        // construct verifier options that reflect actual number of columns
+        // in file and avoid an error if the file contains more than 1M rows
+        let verifier_options = VerifierOptions {
+            max_depth: 128,
+            max_tables: footer_len as usize * 8,

Review Comment:
   I think some explanation of where this number is coming from is probably not remiss



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] olwmc closed pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "olwmc (via GitHub)" <gi...@apache.org>.
olwmc closed pull request #4434: fix: Allow reading of arrow files with more than one million columns
URL: https://github.com/apache/arrow-rs/pull/4434


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #4434:
URL: https://github.com/apache/arrow-rs/pull/4434#issuecomment-1602769665

   Looks like there is no test case in arrow2 (at least not as part of https://github.com/jorgecarleitao/arrow2/pull/240/files) -- @ ghuls as the author of that PR, do you know if you found a way to test this functionality?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] olwmc commented on a diff in pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "olwmc (via GitHub)" <gi...@apache.org>.
olwmc commented on code in PR #4434:
URL: https://github.com/apache/arrow-rs/pull/4434#discussion_r1242167017


##########
arrow-ipc/src/reader.rs:
##########
@@ -647,9 +649,18 @@ impl<R: Read + Seek> FileReader<R> {
         reader.seek(SeekFrom::End(-10 - footer_len as i64))?;
         reader.read_exact(&mut footer_data)?;
 
-        let footer = crate::root_as_footer(&footer_data[..]).map_err(|err| {
-            ArrowError::IoError(format!("Unable to get root as footer: {err:?}"))
-        })?;
+        // construct verifier options that reflect actual number of columns
+        // in file and avoid an error if the file contains more than 1M rows
+        let verifier_options = VerifierOptions {

Review Comment:
   I think this suggestion is a significantly more robust solution and would be preferred over just overriding the verifier options by default.  I'd be happy to close this PR in favor of that being the solution for this issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ghuls commented on pull request #4434: fix: Allow reading of arrow files with more than one million columns

Posted by "ghuls (via GitHub)" <gi...@apache.org>.
ghuls commented on PR #4434:
URL: https://github.com/apache/arrow-rs/pull/4434#issuecomment-1614623064

   > I am a little bit concerned that the flatbuffer table limit exists for a reason, e.g. to prevent a DOS vector. I don't feel confident that we should change the default settings, as I don't feel I have a good enough grasp as to why it is present. I therefore wonder if we can instead allow users to opt-in to looser validation behaviour?
   > 
   > As an aside I would not expect million column schemas to be a good idea in general, in the absence of extremely aggressive projection pushdown the performance will likely be poor. I would definitely encourage people with such schema to perhaps reconsider their schema design...
   
   As far as I understand the pyarrow fix for this issue checks the size of the footer to calculate the maximum number of tables that could be encoded in the footer (https://issues.apache.org/jira/browse/ARROW-11559, on which the arrow2 implementation is based).
   
   I have quite a few (real) files that reach this limit of 1milion columns (max < 2.5 million). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org