You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/26 11:23:40 UTC

[GitHub] [arrow] alamb commented on a change in pull request #8525: ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer

alamb commented on a change in pull request #8525:
URL: https://github.com/apache/arrow/pull/8525#discussion_r511885161



##########
File path: rust/parquet/src/file/footer.rs
##########
@@ -77,24 +78,31 @@ pub fn parse_metadata<R: ChunkReader>(chunk_reader: &R) -> Result<ParquetMetaDat
     let footer_metadata_len = FOOTER_SIZE + metadata_len as usize;
 
     // build up the reader covering the entire metadata
-    let mut default_end_cursor = Cursor::new(default_len_end_buf);
+    let mut first_end_cursor = Cursor::new(first_len_end_buf);
     let metadata_read: Box<dyn Read>;
-    if footer_metadata_len > file_size as usize {
+    if first_end_len < DEFAULT_FOOTER_READ_SIZE

Review comment:
       to check my understanding -- this branch is checking if the total file was smaller than `DEFAULT_FOOTER_READ_SIZE` (b/c it was all read in a single chunk) and the metadata size claims to be larger than this chunk

##########
File path: rust/parquet/src/util/io.rs
##########
@@ -65,12 +66,19 @@ pub struct FileSource<R: ParquetReader> {
 
 impl<R: ParquetReader> FileSource<R> {
     /// Creates new file reader with start and length from a file handle
+    /// If length is larger than the remaining bytes in the source, it is clamped.
+    /// Panics if start is larger than the file size
     pub fn new(fd: &R, start: u64, length: usize) -> Self {
         let reader = RefCell::new(fd.try_clone().unwrap());
+        if start > fd.len() as u64 {

Review comment:
       👍 

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -32,6 +32,22 @@ use crate::basic::Type;
 
 use crate::column::reader::ColumnReaderImpl;
 
+/// Parquet files must be read from end for the footer then from start for columns
+pub enum ChunkMode {

Review comment:
       One suggestion I have is to call this `ReadFrom` rather than `ChunkMode`to make the name more specific.
   
   Alternatively, given its similarity, I wonder if it would make sense to use  `std::io::SeekFrom` here https://doc.rust-lang.org/std/io/enum.SeekFrom.html rather than a custom enum
   
   

##########
File path: rust/parquet/src/util/io.rs
##########
@@ -243,6 +251,19 @@ mod tests {
         assert_eq!(src.pos(), 4);
     }
 
+    #[test]
+    fn test_io_slice_over_limit() {

Review comment:
       Is this testing the case when `buf.len() > src.len()` ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org