Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/11/28 12:44:39 UTC

[PR] Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files [arrow-rs]

Jefffrey opened a new pull request, #5136:
URL: https://github.com/apache/arrow-rs/pull/5136

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #4432
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   Following suggestion as per https://github.com/apache/arrow-rs/pull/4434#discussion_r1242061745
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   A new `FileReaderBuilder` for the arrow-ipc reader that allows configuring the `VerifierOptions`, which makes it possible to read an IPC file with a massive number of columns (over 1 million) if the user configures it correctly.
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #5136:
URL: https://github.com/apache/arrow-rs/pull/5136#issuecomment-1869533595

   I have double-checked that it is only the footer where we need to handle this, as the message encoding is already flattened and doesn't contain nested tables.




Re: [PR] Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5136:
URL: https://github.com/apache/arrow-rs/pull/5136#discussion_r1409328603


##########
arrow-ipc/src/reader.rs:
##########
@@ -498,61 +498,39 @@ pub fn read_dictionary(
     Ok(())
 }
 
-/// Arrow File reader
-pub struct FileReader<R: Read + Seek> {
-    /// Buffered file reader that supports reading and seeking
-    reader: BufReader<R>,
-
-    /// The schema that is read from the file header
-    schema: SchemaRef,
-
-    /// The blocks in the file
-    ///
-    /// A block indicates the regions in the file to read to get data
-    blocks: Vec<crate::Block>,
-
-    /// A counter to keep track of the current block that should be read
-    current_block: usize,
-
-    /// The total number of blocks, which may contain record batches and other types
-    total_blocks: usize,
+/// Build an Arrow [`FileReader`] with custom options.
+#[derive(Debug, Default)]
+pub struct FileReaderBuilder {
+    /// Optional projection for which columns to load (zero-based column indices)
+    projection: Option<Vec<usize>>,
+    /// Flatbuffers options for parsing footer
+    verifier_options: VerifierOptions,
+}
 
-    /// Optional dictionaries for each schema field.
+impl FileReaderBuilder {
+    /// Options for creating a new [`FileReader`].
     ///
-    /// Dictionaries may be appended to in the streaming format.
-    dictionaries_by_id: HashMap<i64, ArrayRef>,
-
-    /// Metadata version
-    metadata_version: crate::MetadataVersion,
-
-    /// User defined metadata
-    custom_metadata: HashMap<String, String>,
+    /// To convert a builder into a reader, call [`FileReaderBuilder::build`].
+    pub fn new() -> Self {
+        Self::default()
+    }
 
-    /// Optional projection and projected_schema
-    projection: Option<(Vec<usize>, Schema)>,
-}
+    /// Optional projection for which columns to load (zero-based column indices).
+    pub fn with_projection(mut self, projection: Vec<usize>) -> Self {
+        self.projection = Some(projection);
+        self
+    }
 
-impl<R: Read + Seek> fmt::Debug for FileReader<R> {
-    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> std::result::Result<(), fmt::Error> {
-        f.debug_struct("FileReader<R>")
-            .field("reader", &"BufReader<..>")
-            .field("schema", &self.schema)
-            .field("blocks", &self.blocks)
-            .field("current_block", &self.current_block)
-            .field("total_blocks", &self.total_blocks)
-            .field("dictionaries_by_id", &self.dictionaries_by_id)
-            .field("metadata_version", &self.metadata_version)
-            .field("projection", &self.projection)
-            .finish()
+    /// Flatbuffers options for parsing the footer. Useful when parsing a file containing
+    /// millions of columns, in which case the value of `max_tables` can be raised to
+    /// accommodate such a file.
+    pub fn with_verifier_options(mut self, verifier_options: VerifierOptions) -> Self {
+        self.verifier_options = verifier_options;
+        self

Review Comment:
   I think I would prefer the max columns option as it both avoids exposing flatbuffer types in our public API, and is more obvious to users why it might be relevant to them





Re: [PR] Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on PR #5136:
URL: https://github.com/apache/arrow-rs/pull/5136#issuecomment-1868928303

   I've refactored to not expose the flatbuffer types, but am still keeping the flatbuffer terminology. Since "tables" refers to both key-value metadata pairs and columns, I figured it would be better to expose the flatbuffer setting and document it, rather than name it `with_approx_max_columns` or something similar, as that could cause confusion given the limit is also affected by the number of metadata key-value pairs.
   
   Also added setting for depth while I was at it
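
   The design settled on above can be sketched without any arrow-rs or flatbuffers dependency. The following is a minimal, illustrative mock of the builder pattern under discussion (the struct and method names mirror the suggestions in this thread, e.g. `with_flatbuffers_max_tables`; the default limit values are assumptions, not the crate's actual defaults):

   ```rust
   // Hypothetical sketch, not the actual arrow-ipc API: a reader builder that
   // wraps flatbuffer verifier limits without exposing flatbuffer types publicly.

   #[derive(Debug, Clone, Copy)]
   struct VerifierOptions {
       /// Caps the total number of flatbuffer tables in the footer
       /// (columns and metadata key-value pairs both count toward this).
       max_tables: usize,
       /// Caps the nesting depth of the footer flatbuffer.
       max_depth: usize,
   }

   impl Default for VerifierOptions {
       fn default() -> Self {
           // Illustrative defaults only.
           Self { max_tables: 1_000_000, max_depth: 64 }
       }
   }

   #[derive(Debug, Default)]
   struct FileReaderBuilder {
       verifier_options: VerifierOptions,
   }

   impl FileReaderBuilder {
       fn new() -> Self {
           Self::default()
       }

       /// Raise the table limit, e.g. for files with millions of columns.
       fn with_flatbuffers_max_tables(mut self, max_tables: usize) -> Self {
           self.verifier_options.max_tables = max_tables;
           self
       }

       /// Raise the depth limit for deeply nested schemas.
       fn with_flatbuffers_max_depth(mut self, max_depth: usize) -> Self {
           self.verifier_options.max_depth = max_depth;
           self
       }
   }

   fn main() {
       let builder = FileReaderBuilder::new()
           .with_flatbuffers_max_tables(10_000_000)
           .with_flatbuffers_max_depth(100);
       println!("{:?}", builder.verifier_options);
   }
   ```

   The point of this shape is that callers tune the two limits through plain `usize` setters, so the flatbuffers verifier type never appears in the public API.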




Re: [PR] Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5136:
URL: https://github.com/apache/arrow-rs/pull/5136#discussion_r1407704042


##########
arrow-ipc/src/reader.rs:
##########
(same diff hunk as quoted above)
Review Comment:
   Wasn't sure whether it's preferable to have this for maximum flexibility, or to have something as suggested here: https://github.com/apache/arrow-rs/pull/4434#issuecomment-1615121403
   
   The latter might be more user friendly, though it abstracts away the inner flatbuffers setting being changed.





Re: [PR] Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5136:
URL: https://github.com/apache/arrow-rs/pull/5136#discussion_r1411897792


##########
arrow-ipc/src/reader.rs:
##########
(same diff hunk as quoted above)
Review Comment:
   Considering that keys in the schema's custom metadata can also contribute to the table count in the footer flatbuffer, I'm not sure naming it something like `with_max_columns()` would be accurate.
   
   Is it possible to simply abstract over those flatbuffer settings without exposing the inner flatbuffer struct, such as
   
   ```rust
   .with_flatbuffers_max_tables(10000000)
   .with_flatbuffers_max_depth(100)
   ```
   
   - In case a user has a file with a deeply nested schema and might want to tune this parameter as well, unlikely as it might be
   
   Can then document these methods to explain what effect tuning them would have on the file reader, etc.





Re: [PR] Add FileReaderBuilder for arrow-ipc to allow reading large no. of column files [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #5136:
URL: https://github.com/apache/arrow-rs/pull/5136

