You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/26 08:10:51 UTC

[GitHub] [arrow] rdettai opened a new pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

rdettai opened a new pull request #8525:
URL: https://github.com/apache/arrow/pull/8525


   PR for https://issues.apache.org/jira/browse/ARROW-10387
   > When reading a Parquet file, you first need to read its footer and metadata. If you only have "read from start" capability, this means you need the size of the file to read relatively to end. On some storages, getting the size metadata can be expensive (e.g extra http call for blob storage).
   > The proposition is to add the capability to "read from end" to the ChunkReader trait as most file storages will have this feature (file storage as well as blob storages).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] sunchao commented on a change in pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#discussion_r516152984



##########
File path: rust/parquet/src/file/footer.rs
##########
@@ -44,30 +44,31 @@ use crate::schema::types::{self, SchemaDescriptor};
 /// The reader first reads DEFAULT_FOOTER_SIZE bytes from the end of the file.
 /// If it is not enough according to the length indicated in the footer, it reads more bytes.
 pub fn parse_metadata<R: ChunkReader>(chunk_reader: &R) -> Result<ParquetMetaData> {
-    // check file is large enough to hold footer
-    let file_size = chunk_reader.len();
-    if file_size < (FOOTER_SIZE as u64) {
+    // read and cache up to DEFAULT_FOOTER_READ_SIZE bytes from the end and process the footer
+    let mut first_end_read = chunk_reader.get_read(

Review comment:
       yes exactly, I feel the footer reader should not be aware of how the input stream is processed and also the logic can vary depending on the remote storage so the `DEFAULT_FOOTER_READ_SIZE` may not fit for all.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on a change in pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

rdettai commented on a change in pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#discussion_r513246472



##########
File path: rust/parquet/src/util/io.rs
##########
@@ -243,6 +251,19 @@ mod tests {
         assert_eq!(src.pos(), 4);
     }
 
+    #[test]
+    fn test_io_slice_over_limit() {

Review comment:
       When you slice a reader in a way that it exceeds the bounds of the original reader, the slice reader should be clamped to the original reader. This is what is tested here.
   ```
   Original:        ++++++++++++++++++++   
   Requested Slice: ----------------+++++++++++++
   Resulting Slice: ----------------++++
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#issuecomment-721201569

@sunchao with most object storage, you will select the bytes you want to read with the http [Range](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests) header, which can read from end. You can use this to implement `ChunkMode::FromEnd` without knowing the length of the file.

Getting the length is expensive as it adds an extra GET request, and I guess that with HDFS it also implies a network round trip which is not free. But as @alamb mentioned earlier, you will often have the length around beforehand because you get it at the same time as you list your files/objects.

**I would be in favor of stalling this PR until at least one other person expresses his interest for the ability to read without knowing the length of the file.**

Meanwhile, one of your [comments](https://github.com/apache/arrow/pull/8525#discussion_r515420243) on the PR made me think, and it's true that there is some buffering logic that is managed in the footer parser that should be left to the `ChunkReader` implementation. I'll try to see if I can find an interface that better separates concerns.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on a change in pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

rdettai commented on a change in pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#discussion_r513240857



##########
File path: rust/parquet/src/file/footer.rs
##########
@@ -77,24 +78,31 @@ pub fn parse_metadata<R: ChunkReader>(chunk_reader: &R) -> Result<ParquetMetaDat
     let footer_metadata_len = FOOTER_SIZE + metadata_len as usize;
 
     // build up the reader covering the entire metadata
-    let mut default_end_cursor = Cursor::new(default_len_end_buf);
+    let mut first_end_cursor = Cursor::new(first_len_end_buf);
     let metadata_read: Box<dyn Read>;
-    if footer_metadata_len > file_size as usize {
+    if first_end_len < DEFAULT_FOOTER_READ_SIZE

Review comment:
       exactly, I should add a comment to explicit this




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on a change in pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

rdettai commented on a change in pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#discussion_r513243215



##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -32,6 +32,22 @@ use crate::basic::Type;
 
 use crate::column::reader::ColumnReaderImpl;
 
+/// Parquet files must be read from end for the footer then from start for columns
+pub enum ChunkMode {

Review comment:
       `std::io::SeekFrom` is similar but it has also the `SeekFrom::Current` state that we are not interested in. 
   About the name, I agree that `ChunkMode` is not ideal, but I wanted it to explicitly relate to ChunkReader. What about the more verbose but probably also more explicit `ReadChunkFrom` ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#discussion_r511885161



##########
File path: rust/parquet/src/file/footer.rs
##########
@@ -77,24 +78,31 @@ pub fn parse_metadata<R: ChunkReader>(chunk_reader: &R) -> Result<ParquetMetaDat
     let footer_metadata_len = FOOTER_SIZE + metadata_len as usize;
 
     // build up the reader covering the entire metadata
-    let mut default_end_cursor = Cursor::new(default_len_end_buf);
+    let mut first_end_cursor = Cursor::new(first_len_end_buf);
     let metadata_read: Box<dyn Read>;
-    if footer_metadata_len > file_size as usize {
+    if first_end_len < DEFAULT_FOOTER_READ_SIZE

Review comment:
       to check my understanding -- this branch is checking if the total file was smaller than `DEFAULT_FOOTER_READ_SIZE` (b/c it was all read in a single chunk) and the metadata size claims to be larger than this chunk

##########
File path: rust/parquet/src/util/io.rs
##########
@@ -65,12 +66,19 @@ pub struct FileSource<R: ParquetReader> {
 
 impl<R: ParquetReader> FileSource<R> {
     /// Creates new file reader with start and length from a file handle
+    /// If length is larger than the remaining bytes in the source, it is clamped.
+    /// Panics if start is larger than the file size
     pub fn new(fd: &R, start: u64, length: usize) -> Self {
         let reader = RefCell::new(fd.try_clone().unwrap());
+        if start > fd.len() as u64 {

Review comment:
       👍 

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -32,6 +32,22 @@ use crate::basic::Type;
 
 use crate::column::reader::ColumnReaderImpl;
 
+/// Parquet files must be read from end for the footer then from start for columns
+pub enum ChunkMode {

Review comment:
       One suggestion I have is to call this `ReadFrom` rather than `ChunkMode`to make the name more specific.
   
   Alternatively, given its similarity, I wonder if it would make sense to use  `std::io::SeekFrom` here https://doc.rust-lang.org/std/io/enum.SeekFrom.html rather than a custom enum
   
   

##########
File path: rust/parquet/src/util/io.rs
##########
@@ -243,6 +251,19 @@ mod tests {
         assert_eq!(src.pos(), 4);
     }
 
+    #[test]
+    fn test_io_slice_over_limit() {

Review comment:
       Is this testing the case when `buf.len() > src.len()` ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai closed pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

rdettai closed pull request #8525:
URL: https://github.com/apache/arrow/pull/8525


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#issuecomment-737297952


   Closing because nobody showed interest in this feature


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#issuecomment-716396156


   https://issues.apache.org/jira/browse/ARROW-10387


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] sunchao commented on pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

sunchao commented on pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#issuecomment-720683396


   @rdettai I don't know enough about object storage to comment about the necessity of this change, but yeah for HDFS getting the file length is cheap. Also, seems we are still going to read from the end of the file so the length info is still required? how `ChunkMode` is going to be used without `from_start`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#discussion_r513319156



##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -32,6 +32,22 @@ use crate::basic::Type;
 
 use crate::column::reader::ColumnReaderImpl;
 
+/// Parquet files must be read from end for the footer then from start for columns
+pub enum ChunkMode {

Review comment:
       I think`ReadChunkFrom` is reasonable. Or maybe `ChunkFrom` ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#issuecomment-716388352


   @alamb has raised doubts in https://github.com/apache/arrow/pull/8300 about the necessity for this as the file size could be known from an other source (list operation or metadata catalog). I still want to propose this for discussion as I believe it is more flexible than the current interface. 
   
   There is one catch I encountered though when implementing the S3 source: the http range header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range) allows you to read from end, but not from end with a given offset. So implementing the ChunkReader as it is specified here forces you to query more bytes than required (up until the end) and stop reading the stream once you have the data you need.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org