You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/10/28 07:36:19 UTC

[PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Jefffrey opened a new pull request, #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962

   ## Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #6368
   
   ## Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   See issue
   
   ## What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Adjust behaviour of infer Arrow IPC file schema from stream to infer the schema from the first Schema message in the filer, rather than from the footer of the file, thereby allowing us to only need to read as many bytes as we need to infer schema. Old behaviour was to read entire stream before inferring the schema.
   
   ## Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   Added  new unit tests.
   
   ## Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1379219105


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,177 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];

Review Comment:
   > What do you think about moving this logic upstream into the arrow-rs reader.
   
   Yes, this logic was essentially ripped from `StreamReader` and `FileReader` of `arrow-ipc`, but adjusted to be made compatible with async stream of bytes. We could move this logic to `arrow-ipc`, but we need to keep in mind that though we are getting a stream of bytes, this is a stream of bytes in the IPC file format and not the IPC streaming format. So an `AsyncStreamReader` might not exactly fit our use, whereas an `AsyncFileReader` could but might be limited if we don't read its footer when attempting to decode the rest of the data.
   
   > I think I would rather the approach described on https://github.com/apache/arrow-rs/issues/5021, reading the footer is more generally useful, providing information beyond just the schema
   
   It seems this ticket could be appropriate for that. Just to note, that we can't exactly read the footer in a stream without reverting to the old method of reading the entire stream just to decode the schema.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1379308582


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,177 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];

Review Comment:
   Ah apologies, I took a look at the parquet code and I see what you mean now. That indeed would be a better approach, in line with existing behaviour for `FileReader` 👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#issuecomment-1791396160

   It appears I accidentally merged this PR without a passing CI run. I have filed a PR to fix: https://github.com/apache/arrow-datafusion/pull/8037


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1375273084


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,140 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];
+const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];
+
+async fn infer_schema_from_file_stream(
+    mut stream: BoxStream<'static, object_store::Result<Bytes>>,
+) -> Result<SchemaRef> {
+    // Expected format:
+    // <magic number "ARROW1"> - 6 bytes
+    // <empty padding bytes [to 8 byte boundary]> - 2 bytes
+    // <continutation: 0xFFFFFFFF> - 4 bytes, not present below v0.15.0
+    // <metadata_size: int32> - 4 bytes
+    // <metadata_flatbuffer: bytes>
+    // <rest of file bytes>
+
+    // So in first read we need at least all known sized sections,
+    // which is 6 + 2 + 4 + 4 = 16 bytes.
+    let bytes = collect_at_least_n_bytes(&mut stream, 16, None).await?;
+    if bytes.len() < 16 {
+        return Err(ArrowError::ParseError(
+            "Arrow IPC file stream shorter than expected".to_string(),
+        ))?;
+    }
+
+    // Files should start with these magic bytes
+    if bytes[0..6] != ARROW_MAGIC {
+        return Err(ArrowError::ParseError(
+            "Arrow file does not contian correct header".to_string(),
+        ))?;
+    }
+
+    // Since continuation marker bytes added in later versions
+    let (meta_len, rest_of_bytes_start_index) = if bytes[8..12] == CONTINUATION_MARKER {
+        (&bytes[12..16], 16)
+    } else {
+        (&bytes[8..12], 12)
+    };
+
+    let meta_len = [meta_len[0], meta_len[1], meta_len[2], meta_len[3]];
+    let meta_len = i32::from_le_bytes(meta_len);
+
+    // Read bytes for Schema message
+    let block_data = if bytes[rest_of_bytes_start_index..].len() < meta_len as usize {
+        // Need to read more bytes to decode Message
+        let mut block_data = Vec::with_capacity(meta_len as usize);
+        // In case we had some spare bytes in our initial read chunk
+        block_data.extend_from_slice(&bytes[rest_of_bytes_start_index..]);
+        let size_to_read = meta_len as usize - block_data.len();
+        let block_data =
+            collect_at_least_n_bytes(&mut stream, size_to_read, Some(block_data)).await?;

Review Comment:
   Related to my previous comment, there is currently no check here that we actually did read at least n bytes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1379187910


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,177 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];

Review Comment:
   What do you think about moving this logic upstream into the arrow-rs reader. 
   
   https://github.com/apache/arrow-rs/blob/78735002d99eb0212166924948f95554c4ac2866/arrow-ipc/src/reader.rs#L560
   
   If you agree, I can file an upstream ticket to do so. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#issuecomment-1789537022

   > FWIW for consistency we might want to do something closer to what we do for parquet where:
   > 
   >     * We have an estimate of the size of the footer which we fetch
   > 
   >     * We read the actual footer size from the fetched data
   > 
   >     * We then fetch any extra data needed
   > 
   >     * Once decoded the footer provides information on the schema and where the data blocks are located
   > 
   > 
   > This PR instead appears to read the first RecordBatch, whilst I _think_ this should work (provided the file contains data), the more standard approach might be to read the footer.
   > 
   > Edit: I also filed [apache/arrow-rs#5021](https://github.com/apache/arrow-rs/issues/5021) which outlines some APIs we could add upstream that might help here
   
   We could read the first chunk of the stream of the file similar to reading the last chunk of parquet and hoping it contains all the necessary data to decode the schema. I chose the current approach as we don't have control over the number of bytes each await brings from the stream, whereas when reading parquet we generally have more control over that I believe.
   
   This method shouldn't read any record batches, it simply reads the first flatbuffer message in the IPC file contents which is expected to be a schema message, per the specification stating that an IPC streaming format should have the schema message come first and the IPC file format is simply an encapsulation of the IPC streaming format with some addiitonal wrapping.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1375316172


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,140 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];
+const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];
+
+async fn infer_schema_from_file_stream(
+    mut stream: BoxStream<'static, object_store::Result<Bytes>>,
+) -> Result<SchemaRef> {
+    // Expected format:
+    // <magic number "ARROW1"> - 6 bytes
+    // <empty padding bytes [to 8 byte boundary]> - 2 bytes
+    // <continutation: 0xFFFFFFFF> - 4 bytes, not present below v0.15.0
+    // <metadata_size: int32> - 4 bytes
+    // <metadata_flatbuffer: bytes>
+    // <rest of file bytes>
+
+    // So in first read we need at least all known sized sections,
+    // which is 6 + 2 + 4 + 4 = 16 bytes.
+    let bytes = collect_at_least_n_bytes(&mut stream, 16, None).await?;
+    if bytes.len() < 16 {

Review Comment:
   Good point, fixed to have `collect_at_least_n_bytes()` do the error checking now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1379276288


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,177 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];

Review Comment:
   > reverting to the old method of reading the entire stream just to decode the schema.
   
   The idea would be to do something similar to what we do to read the parquet footer, I provided a few more details on the linked ticket. The trick is to perform ranged reads



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1375272991


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,140 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];
+const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];
+
+async fn infer_schema_from_file_stream(
+    mut stream: BoxStream<'static, object_store::Result<Bytes>>,
+) -> Result<SchemaRef> {
+    // Expected format:
+    // <magic number "ARROW1"> - 6 bytes
+    // <empty padding bytes [to 8 byte boundary]> - 2 bytes
+    // <continutation: 0xFFFFFFFF> - 4 bytes, not present below v0.15.0
+    // <metadata_size: int32> - 4 bytes
+    // <metadata_flatbuffer: bytes>
+    // <rest of file bytes>
+
+    // So in first read we need at least all known sized sections,
+    // which is 6 + 2 + 4 + 4 = 16 bytes.
+    let bytes = collect_at_least_n_bytes(&mut stream, 16, None).await?;
+    if bytes.len() < 16 {

Review Comment:
   Shouldn't this error check happen within `collect_at_least_n_bytes`? I would expect that function to `Err` if it cannot read at least n bytes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#issuecomment-1789347016

   FWIW for consistency we might want to do something closer to what we do for parquet where:
   
   * We have an estimate of the size of the footer which we fetch
   * We read the actual footer size
   * We then fetch any extra data needed
   * Once decoded the footer provides information on the schema and where the data blocks are located
   
   This PR instead appears to read the first RecordBatch, whilst I _think_ this should work (provided the file contains data), the more standard approach might be to read the footer.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1379276288


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,177 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];

Review Comment:
   > reverting to the old method of reading the entire stream just to decode the schema.
   
   The idea would be to do something similar to what we do to read the parquet footer, I provided a few more details on the linked ticket



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1375316184


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,140 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];
+const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];
+
+async fn infer_schema_from_file_stream(
+    mut stream: BoxStream<'static, object_store::Result<Bytes>>,
+) -> Result<SchemaRef> {
+    // Expected format:
+    // <magic number "ARROW1"> - 6 bytes
+    // <empty padding bytes [to 8 byte boundary]> - 2 bytes
+    // <continutation: 0xFFFFFFFF> - 4 bytes, not present below v0.15.0
+    // <metadata_size: int32> - 4 bytes
+    // <metadata_flatbuffer: bytes>
+    // <rest of file bytes>
+
+    // So in first read we need at least all known sized sections,
+    // which is 6 + 2 + 4 + 4 = 16 bytes.
+    let bytes = collect_at_least_n_bytes(&mut stream, 16, None).await?;
+    if bytes.len() < 16 {
+        return Err(ArrowError::ParseError(
+            "Arrow IPC file stream shorter than expected".to_string(),
+        ))?;
+    }
+
+    // Files should start with these magic bytes
+    if bytes[0..6] != ARROW_MAGIC {
+        return Err(ArrowError::ParseError(
+            "Arrow file does not contian correct header".to_string(),
+        ))?;
+    }
+
+    // Since continuation marker bytes added in later versions
+    let (meta_len, rest_of_bytes_start_index) = if bytes[8..12] == CONTINUATION_MARKER {
+        (&bytes[12..16], 16)
+    } else {
+        (&bytes[8..12], 12)
+    };
+
+    let meta_len = [meta_len[0], meta_len[1], meta_len[2], meta_len[3]];
+    let meta_len = i32::from_le_bytes(meta_len);
+
+    // Read bytes for Schema message
+    let block_data = if bytes[rest_of_bytes_start_index..].len() < meta_len as usize {
+        // Need to read more bytes to decode Message
+        let mut block_data = Vec::with_capacity(meta_len as usize);
+        // In case we had some spare bytes in our initial read chunk
+        block_data.extend_from_slice(&bytes[rest_of_bytes_start_index..]);
+        let size_to_read = meta_len as usize - block_data.len();
+        let block_data =
+            collect_at_least_n_bytes(&mut stream, size_to_read, Some(block_data)).await?;

Review Comment:
   ` collect_at_least_n_bytes()` should now do the length checking



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1379304130


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,177 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];
+const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];
+
+async fn infer_schema_from_file_stream(

Review Comment:
   ```suggestion
   /// Custom implementation of inferring schema. Should eventually be moved upstream to arrow-rs. 
   /// See https://github.com/apache/arrow-rs/issues/5021
   async fn infer_schema_from_file_stream(
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Read only enough bytes to infer Arrow IPC file schema via stream [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #7962:
URL: https://github.com/apache/arrow-datafusion/pull/7962#discussion_r1379202501


##########
datafusion/core/src/datasource/file_format/arrow.rs:
##########
@@ -99,7 +102,177 @@ impl FileFormat for ArrowFormat {
     }
 }
 
-fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
-    let reader = FileReader::try_new(reader, None)?;
-    Ok(reader.schema())
+const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];

Review Comment:
   I think I would rather the approach described on https://github.com/apache/arrow-rs/issues/5021, reading the footer is more generally useful, providing information beyond just the schema



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org