You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/29 14:24:11 UTC

[GitHub] [arrow] rdettai opened a new pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

rdettai opened a new pull request #8300:
URL: https://github.com/apache/arrow/pull/8300


   https://issues.apache.org/jira/browse/ARROW-10135


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r509056820



##########
File path: rust/parquet/src/util/cursor.rs
##########
@@ -0,0 +1,203 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::cmp;
+use std::io::{self, Error, ErrorKind, Read, Seek, SeekFrom};
+use std::rc::Rc;
+
+/// This is object to use if your file is already in memory.
+/// The sliceable cursor is similar to std::io::Cursor, except that it makes it easy to create "cursor slices".
+/// To achieve this, it uses Rc instead of shared references. Indeed reference fields are painfull

Review comment:
       This is not a new feature but an adaptation of what we had before. Without `SliceableCursor` there is no way to read from RAM the way it was done with `Cursor`. Instead of directly implementing the new `ChunkReader` trait for `Cursor` and having the same problem of copying the data all over the place as we did before, I tweeked the Cursor a bit to share its data between the `Read` slices. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-708039466


   @rdettai  --
   
   > This wrapper makes sense because File is cheap to clone and needs to be buffered, but this will not be generally the case for other implems of ParquetReader. 
   
   Yes, actually that was our experience (when we used a buffered implementation, we found that the underlying (large) buffer got copies around a lot. 
   
   I think in general, the cleaner thing (and my preference) would be to simply port my code to use `ChunkReader` (I think it will end up being cleaner and likely more performant). 
   
   > Do you think it is worth it have the ChunkReader be implemented for T:ParquetReader rather than File ?  
   
   The upside is that it would be somewhat more backwards compatible, but unless other reviewers feel strongly I personally think we should just do the breaking change in terms of `ChunkReader` and move on


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nevi-me closed pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

nevi-me closed pull request #8300:
URL: https://github.com/apache/arrow/pull/8300


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-708440082


   An argument in favor of having `ChunkReader` be implemented for `T:ParquetReader` rather than `File`:
   -> this allows to write wrappers around `File` that you can then pass to `SerializedFileReader`. For instance at some points I wanted to debug the number of actual read ops on the `File` object so I wrote a tiny wrapper that incremented a counter at each `read()` call...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r507107269



##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.
+/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
+/// For an object store reader, each read can be mapped to a range request.
+pub trait ChunkReader: Length {
+    type T: Read;
+    /// get a serialy readeable slice of the current reader
+    /// This should fail if the slice exceeds the current bounds
+    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
+}
 
 // ----------------------------------------------------------------------

Review comment:
       I think it is (as the `ParquetWriter` uses `TryClone`) but I may not have tested with the latest changes from @rdettai 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-714317445


   Hi! I did some experimentation with this new interface (with `ChunkMode`). It works but is not flawless. You are right @alamb , we should move on with this PR the way it is currently implemented (with `get_read(&self, start: u64, length: usize)`), and we'll discuss the other possibility in a dedicated thread.
   @sunchao I have responded to all your comments. If you are okay with my answers, I think we can merge this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r502329941



##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.

Review comment:
       ```suggestion
   /// Length should return the total number of bytes in the input source.
   ```

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.
+/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
+/// For an object store reader, each read can be mapped to a range request.
+pub trait ChunkReader: Length {
+    type T: Read;
+    /// get a serialy readeable slice of the current reader
+    /// This should fail if the slice exceeds the current bounds
+    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
+}
 
 // ----------------------------------------------------------------------

Review comment:
       For backwards compatibility in other projects, I think we need to allow  `use parquet::file::reader::TryClone` to keep working (`TryClone` moved to `util::io::TryClone`, and appears not to be public anymore). 
   
   Perhap something like this (untested):
   ```suggestion
   pub use super::TryClone;
   // ----------------------------------------------------------------------
   ```
   
   When I tried to compile an in-house project with this branch I got the following error: 
   
   ```
   
   error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
     --> delorean_parquet/src/lib.rs:13:28
      |
   13 |     file::reader::{Length, TryClone},
      |                            ^^^^^^^^ no `TryClone` in `parquet::file::reader`
   
   error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
   ```
   
   When I tried to use the copy in `util::io` I got: 
   
   ```
   error[E0603]: module `util` is private
     --> delorean_parquet/src/lib.rs:14:5
      |
   14 |     util::io::{TryClone},
      |     ^^^^ private module
      |
   
   ```
   
   When I used the import in `seriailized_reader` I got a similar error: 
   
   ```
   error[E0603]: trait `TryClone` is private
     --> delorean_parquet/src/lib.rs:14:30
      |
   14 |     file::serialized_reader::TryClone,
      |                              ^^^^^^^^ private trait
      |
   ```

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.

Review comment:
       👍 

##########
File path: rust/parquet/src/util/cursor.rs
##########
@@ -0,0 +1,203 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       I discovered above that the entire `util` module is not public -- 
   
   https://github.com/apache/arrow/blob/master/rust/parquet/src/lib.rs#L39
   
   ```
   #[macro_use]
   mod util;
   ```
   
   So none of these pub traits can be used outside this crate at the moment

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.
+/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
+/// For an object store reader, each read can be mapped to a range request.
+pub trait ChunkReader: Length {
+    type T: Read;
+    /// get a serialy readeable slice of the current reader
+    /// This should fail if the slice exceeds the current bounds
+    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
+}
 
 // ----------------------------------------------------------------------

Review comment:
       
   When I tried to compile an in-house project with this branch I got the following error: 
   
   ```
   
   error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
     --> delorean_parquet/src/lib.rs:13:28
      |
   13 |     file::reader::{Length, TryClone},
      |                            ^^^^^^^^ no `TryClone` in `parquet::file::reader`
   
   error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
   ```
   
   When I tried to use the copy in `util::io` I got: 
   
   ```
   error[E0603]: module `util` is private
     --> delorean_parquet/src/lib.rs:14:5
      |
   14 |     util::io::{TryClone},
      |     ^^^^ private module
      |
   
   ```
   
   When I used the import in `seriailized_reader` I got a similar error: 
   
   ```
   error[E0603]: trait `TryClone` is private
     --> delorean_parquet/src/lib.rs:14:30
      |
   14 |     file::serialized_reader::TryClone,
      |                              ^^^^^^^^ private trait
      |
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-700742388


   https://issues.apache.org/jira/browse/ARROW-10135


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] sunchao commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

sunchao commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-712998318


   Thanks @rdettai . I'll take a look at this PR today.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nevi-me commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

nevi-me commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r506962143



##########
File path: rust/parquet/src/util/cursor.rs
##########
@@ -0,0 +1,203 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::cmp;
+use std::io::{self, Error, ErrorKind, Read, Seek, SeekFrom};
+use std::rc::Rc;
+
+/// This is object to use if your file is already in memory.
+/// The sliceable cursor is similar to std::io::Cursor, except that it makes it easy to create "cursor slices".
+/// To achieve this, it uses Rc instead of shared references. Indeed reference fields are painfull

Review comment:
       Should we implement this separately? That'd be my preference




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nevi-me commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

nevi-me commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r506961922



##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.
+/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
+/// For an object store reader, each read can be mapped to a range request.
+pub trait ChunkReader: Length {
+    type T: Read;
+    /// get a serialy readeable slice of the current reader
+    /// This should fail if the slice exceeds the current bounds
+    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
+}
 
 // ----------------------------------------------------------------------

Review comment:
       @alamb is this still an issue?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai edited a comment on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai edited a comment on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-712910596


   > @rdettai are there still more changes that you intend on making, and @alamb are all your queries and concerns addressed? Thanks for the detailed review.
   
   @nevi-me Depends on whether we want `ParquetReader` to remain public or not. If not, I think the PR is fine, otherwise, I can bring it back into `parquet::file::reader`.
   
   @sunchao I tried to restrain myself on this PR 😄. Honestly, I had to move quite a lot of things around because this touches a core API and things were very "monolithic". There are two points were I'm getting a little bit out of the main concern:
   - I added a `Seek` implem to the `FileSource` that ended up not being useful. I am removing it right now.
   - The typo fix in `array_reader`, but I'm sure you can forgive me that one :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] sunchao commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

sunchao commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-712479156


   From the description the changes proposed here makes sense to me as well - but seems the PR itself is messed up and contains many unrelated changes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb edited a comment on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-708039466


   @rdettai  --
   
   > This wrapper makes sense because File is cheap to clone and needs to be buffered, but this will not be generally the case for other implems of ParquetReader. 
   
   Yes, actually that was our experience (when we used a buffered implementation, we found that the underlying (large) buffer got copies around a lot. 
   
   I think in general, the cleaner thing (and my preference) would be to simply port my code to use `ChunkReader` (I think it will end up being cleaner and likely more performant). 
   
   > Do you think it is worth it have the ChunkReader be implemented for T:ParquetReader rather than File ?  
   
   The upside is that it would be somewhat more backwards compatible, but unless other reviewers feel strongly I personally think we should just do the breaking change, merge this PR, (maybe even remove `ParquetReader` too) and I'll rewrite our project in terms of `ChunkReader`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-704796994


   The discussion with @alamb about the need for an intermediate layer when reading a parquet file is discussed on [JIRA](https://issues.apache.org/jira/browse/ARROW-10135)
   
   The highlights of the current implementation:
   - The public API has changed, but keeps working for `File` and `Path` thanks to the corresponding trait implementations. `Cursor` cannot be used any more because it requires data copies when being passed around with `clone()` (this was already the case before in the implem of `TryClone` for `Cursor<Vec<u8>>`).
   - I have added a custom cursor type (`SliceableCursor`) that allows to generate cursor slices without cloning the underlying data. This can be used to read in-memory files. I guess it could be made more generic, but this would be for convenience only and I find it simple and clear as is.
   - I have separated the implem (`SerializedFileReader`, `SerializedRowGroupReader`...) from the traits (`FileReader`, `RowGroupReader`...) for more clarity. I know that this is not how the code base is structured in the rest of the project but I tend to get lost in these huge files with millions of struct/trait/impl blocks. I'm very much open to suggestion about this point!
   - There is nothing about async/parallelism yet, I have to think about it a little bit.
   
   @alamb : can you take a look at the new `ChunckReader` trait and how it is integrated to the rest of the reader? What do you think about it? 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-716585625


   FYI I have filed https://issues.apache.org/jira/browse/ARROW-10390 for the issue with `TryClone` not being public. I'll try and whip up a PR to get it fixed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-712910596


   > @rdettai are there still more changes that you intend on making, and @alamb are all your queries and concerns addressed? Thanks for the detailed review.
   @nevi-me Depends on whether we want `ParquetReader` to remain public or not. If not, I think the PR is fine, otherwise, I can bring it back into `parquet::file::reader`.
   
   @sunchao I tried to restrain myself on this PR 😄. Honestly, I had to move quite a lot of things around because this touches a core API and things were very "monolithic". There are two points were I'm getting a little bit out of the main concern:
   - I added a `Seek` implem to the `FileSource` that ended up not being useful. I am removing it right now.
   - The typo fix in `array_reader`, but I'm sure you can forgive me that one :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb edited a comment on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-701575470






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-713467151


   I agree that this PR is hanging, but as this is an API change, I guess its better to think things through before moving forward! 😄 This should maybe have been thought through in a design document rather than in the PR... 😅
   
   Its true that most of the time you will have the length around from an other source. But shouldn't we prefer a traits that contains exactly the minimum amount of information we need to make the `FileReader` implem work ?
   Proposition:
   ```
   pub enum ChunkMode {
       FromStart(u64),
       FromEnd(u64),
   }
   
   pub trait ChunkReader {
       type T: Read + Length;
       fn get_read(&self, start: ChunkMode, length: usize) -> Result<Self::T>;
   }
   ```
   
   I can have the implem ready by the end of the day!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-707712589


   I'm open to moving traits such as `ParquetReader` back to `reader.rs` or re-exposing them with `pub use crate::`. But I have the feeling these traits are very specific to the implementation of the `util::io::FileSource`, which is private because it is mainly a slicing/buffering wrapper around `File` (hence the name ;-)). This wrapper makes sense because File is cheap to clone and needs to be buffered, but this will not be generally the case for other implems of `ParquetReader`. What about your usecase @alamb ? Do you think it is worth it have the ChunkReader be implemented for `T:ParquetReader` rather than `File` ? This would be more flexible, maybe a little bit too much...
   
   It is a good observation that `TryClone` is also used in the writer. I didn't want to move things around in the writer because it would have made the PR even more massive and also because I've been told that there was an active development effort going on over there :-). Similar changes might apply.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-713455867


   > I'm having second thoughts about the ChunkReader interface.
   
   I personally think the `ChunkReader` interface is good enough as is and this PR has been hanging out for quite a bit. 
   
   By adding more sophistication (FromStart, FromEnd) you may save an S3 or other object store`  metadata request to get the length in theory, but I suspect most applications will already have the length metadata from other sources anyway (e.g.  because they had to list the objects for some other reason) so acquiring the length may not be as expensive as it first appears. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r509071560



##########
File path: rust/parquet/src/util/cursor.rs
##########
@@ -0,0 +1,203 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       this is intended

##########
File path: rust/parquet/src/file/serialized_reader.rs
##########
@@ -0,0 +1,747 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains file reader implementations and provides methods to access file metadata, row group

Review comment:
       I'll try to come up with something more explicit!

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the total number of bytes in the input source.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.
+/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
+/// For an object store reader, each read can be mapped to a range request.
+pub trait ChunkReader: Length {

Review comment:
       I used *Reader* because it does not implem `Read` but generates things that implem `Read`. But I admit that I'm not 100% happy with `ChunkReader` and it would be nice to have something more explicit as this is a very public part of the API. 

##########
File path: rust/parquet/src/util/cursor.rs
##########
@@ -0,0 +1,203 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::cmp;
+use std::io::{self, Error, ErrorKind, Read, Seek, SeekFrom};
+use std::rc::Rc;
+
+/// This is object to use if your file is already in memory.
+/// The sliceable cursor is similar to std::io::Cursor, except that it makes it easy to create "cursor slices".
+/// To achieve this, it uses Rc instead of shared references. Indeed reference fields are painfull

Review comment:
       This is not a new feature but an adaptation of what we had before. Without `SliceableCursor` their is now way to read from RAM the way it was done with `Cursor`. It is just that instead of directly implementing the new `ChunkReader` trait for `Cursor` and having the same problem of copying the data all over the place as we did before, I tweeked the Cursor a bit to share its data between the `Read` slices. 

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the total number of bytes in the input source.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {

Review comment:
       I think not because because it is also used by `ChunkReader` and io.rs contains internal implementations. But I admit that I still don't have it very clear how modules should be structured in Rust 😄 

##########
File path: rust/parquet/src/file/footer.rs
##########
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains file reader API and provides methods to access file metadata, row group

Review comment:
       Thanks, slipped away.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai edited a comment on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai edited a comment on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-713467151


   I agree that this PR is hanging, but as this is an API change, I guess its better to think things through before moving forward! 😄 This should maybe have been prepared in a design document rather than in the PR... 😅
   
   Its true that most of the time you will have the length around from an other source. But shouldn't we prefer a trait that contains exactly the minimum amount of information we need to make the `FileReader` implem work ?
   Proposition:
   ```
   pub enum ChunkMode {
       FromStart(u64),
       FromEnd(u64),
   }
   
   pub trait ChunkReader {
       type T: Read + Length;
       fn get_read(&self, start: ChunkMode, length: usize) -> Result<Self::T>;
   }
   ```
   
   I can have the implem ready by the end of the day!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-701575470


   from https://issues.apache.org/jira/browse/ARROW-10135, it seems like your goal is to support reading parquet files from sources other than files. 
   
   I wonder if you have tried implementing the `ParquetReader` trait for another data source?
   
   Then you can create a file reader from that other source like:
   ```
   let source = ThingThatImplementsParquetReader::new();
   let file_reader = SerializedFileReader::new(source);
   ...
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-713398263


   I'm having second thoughts about the `ChunkReader` interface. It has a length, but what we really want is the capability to "read from end" in order to get the parquet footer. So I'm thinking about removing the Length trait and give to `get_read` the capability to read from end, for instance by making `start: u64` an `enum { FromStart(u64), FromEnd }`. This is particularly interesting when reading from a remote store where getting the size is expensive. What do you think ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-704940018


   Thanks a lot for your time and comments! 
   
   > using a more standard spelling of Chunk rather than Chunck
   
   You mean "a more correct spelling" 😉 
   
   > I wonder how important separating out `get_read` and `get_read_seek` is
   
   Right, I was just looking at this. There might be some simplifications that could be implemented here.
   
   > I think we should begin moving the parquet reader towards using Arc
   
   That is true, but there is a major problem with the way `FileSource` works in that case. The fact that we share a single file handle between instances (because it is the behavior of `File::clone()`) prevents us from accessing concurrently a same file.
   
   Give me a bit of time, I will try to fix these. I'll come back to you if I have further doubts. Thanks again !!!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai edited a comment on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai edited a comment on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-700744261


   I first moved the footer parsing out of the `SerializedFileReader`. I will now try to move as much logic as possible from the `SerializedFileReader` and `SerializedRowGroupReader` implems to the `FileReader` and `RowGroupReader` traits


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] sunchao commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r508895342



##########
File path: rust/parquet/src/file/footer.rs
##########
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains file reader API and provides methods to access file metadata, row group

Review comment:
       I think this needs to be updated?

##########
File path: rust/parquet/src/util/io.rs
##########
@@ -17,11 +17,25 @@
 
 use std::{cell::RefCell, cmp, io::*};
 
-use crate::file::{reader::ParquetReader, writer::ParquetWriter};
+use crate::file::{reader::Length, writer::ParquetWriter};
 
 const DEFAULT_BUF_SIZE: usize = 8 * 1024;
 
 // ----------------------------------------------------------------------
+
+/// TryClone tries to clone the type and should maintain the `Seek` position of the given
+/// instance.
+pub trait TryClone: Sized {
+    /// Clones the type returning a new instance or an error if it's not possible
+    /// to clone it.
+    fn try_clone(&self) -> Result<Self>;
+}
+
+/// ParquetReader is the interface which needs to be fulfilled to be able to parse a
+/// parquet source.
+pub trait ParquetReader: Read + Seek + Length + TryClone {}

Review comment:
       (not related to this PR) I think `ParquetReader` is a little confusing - perhaps a better name can be `ParquetSource` or something. And similarly `ParquetSink` for `ParquetWriter`.

##########
File path: rust/parquet/src/file/footer.rs
##########
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains file reader API and provides methods to access file metadata, row group
+//! readers to read individual column chunks, or access record iterator.
+
+use std::{
+    cmp::min,
+    io::{Cursor, Read, Seek, SeekFrom},
+    rc::Rc,
+};
+
+use byteorder::{ByteOrder, LittleEndian};
+use parquet_format::{ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData};
+use thrift::protocol::TCompactInputProtocol;
+
+use crate::basic::ColumnOrder;
+
+use crate::errors::{ParquetError, Result};
+use crate::file::{
+    metadata::*, reader::ChunkReader, DEFAULT_FOOTER_READ_SIZE, FOOTER_SIZE,
+    PARQUET_MAGIC,
+};
+
+use crate::schema::types::{self, SchemaDescriptor};
+
+/// Layout of Parquet file
+/// +---------------------------+-----+---+
+/// |      Rest of file         |  B  | A |
+/// +---------------------------+-----+---+
+/// where A: parquet footer, B: parquet metadata.
+///
+/// The reader first reads DEFAULT_FOOTER_SIZE bytes from the end of the file.
+/// If it is not enough according to the length indicated in the footer, it reads more bytes.
+pub fn parse_metadata<R: ChunkReader>(chunk_reader: &R) -> Result<ParquetMetaData> {

Review comment:
       +1 - nice to see this separated into another file

##########
File path: rust/parquet/src/file/serialized_reader.rs
##########
@@ -0,0 +1,747 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains file reader implementations and provides methods to access file metadata, row group

Review comment:
       Needs update as well - specifically for serialized reader

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the total number of bytes in the input source.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {

Review comment:
       maybe io.rs is more suitable for this trait?

##########
File path: rust/parquet/src/util/cursor.rs
##########
@@ -0,0 +1,203 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::cmp;
+use std::io::{self, Error, ErrorKind, Read, Seek, SeekFrom};
+use std::rc::Rc;
+
+/// This is object to use if your file is already in memory.
+/// The sliceable cursor is similar to std::io::Cursor, except that it makes it easy to create "cursor slices".
+/// To achieve this, it uses Rc instead of shared references. Indeed reference fields are painfull

Review comment:
       Yes maybe move this into a separate PR? given the current one is already big enough.

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the total number of bytes in the input source.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.
+/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
+/// For an object store reader, each read can be mapped to a range request.
+pub trait ChunkReader: Length {

Review comment:
       nit: maybe `ChunkRead`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r502329941



##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.

Review comment:
       ```suggestion
   /// Length should return the total number of bytes in the input source.
   ```

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.
+/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
+/// For an object store reader, each read can be mapped to a range request.
+pub trait ChunkReader: Length {
+    type T: Read;
+    /// get a serialy readeable slice of the current reader
+    /// This should fail if the slice exceeds the current bounds
+    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
+}
 
 // ----------------------------------------------------------------------

Review comment:
       For backwards compatibility in other projects, I think we need to allow  `use parquet::file::reader::TryClone` to keep working (`TryClone` moved to `util::io::TryClone`, and appears not to be public anymore). 
   
   Perhap something like this (untested):
   ```suggestion
   pub use super::TryClone;
   // ----------------------------------------------------------------------
   ```
   
   When I tried to compile an in-house project with this branch I got the following error: 
   
   ```
   
   error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
     --> delorean_parquet/src/lib.rs:13:28
      |
   13 |     file::reader::{Length, TryClone},
      |                            ^^^^^^^^ no `TryClone` in `parquet::file::reader`
   
   error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
   ```
   
   When I tried to use the copy in `util::io` I got: 
   
   ```
   error[E0603]: module `util` is private
     --> delorean_parquet/src/lib.rs:14:5
      |
   14 |     util::io::{TryClone},
      |     ^^^^ private module
      |
   
   ```
   
   When I used the import in `seriailized_reader` I got a similar error: 
   
   ```
   error[E0603]: trait `TryClone` is private
     --> delorean_parquet/src/lib.rs:14:30
      |
   14 |     file::serialized_reader::TryClone,
      |                              ^^^^^^^^ private trait
      |
   ```

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.

Review comment:
       👍 

##########
File path: rust/parquet/src/util/cursor.rs
##########
@@ -0,0 +1,203 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       I discovered above that the entire `util` module is not public -- 
   
   https://github.com/apache/arrow/blob/master/rust/parquet/src/lib.rs#L39
   
   ```
   #[macro_use]
   mod util;
   ```
   
   So none of these pub traits can be used outside this crate at the moment




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nevi-me commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

nevi-me commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-711046982


   > I am but one opinion -- I wonder if @nevi-me or @sunchao have any thoughts / opinions about where this PR is heading
   
   I'm principally fine with the changes, and the direction of the PR. I'll highly likely be working more on the non-Arrow side of Parquet, so I expect that I might be making a lot of changes on top of this.
   
   There seems to be some demand for supporting sources and sinks other than files, and as this helps with that; I'm pro getting it merged.
   
   @rdettai are there still more changes that you intend on making, and @alamb are all your queries and concerns addressed? Thanks for the detailed review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-711155312


   @nevi-me  -- I think this is good enough to merge; Once it is merged, I'll try and port my internal project asap (which has custom readers and writers) and make a PR to do any touchups needed (like re-exporting TryClone if need be) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-705682739


   @alamb I think I addressed most of your concerns. The only one that remains is the necessity to prepare for async, but I have digged a little bit into and I think that tackling this properly will require work that is not really in the scope of this PR. There is more to it than converting Rc to Arc because of this shared handle to the file. Lets do that work in an other PR associated to a sub-tasks of [ARROW-9464](https://issues.apache.org/jira/browse/ARROW-9464) or simply associated to [ARROW-9674](https://issues.apache.org/jira/browse/ARROW-9674). I will work on it next week, but until then we can validate and merge this one.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r502332349



##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -18,35 +18,37 @@
 //! Contains file reader API and provides methods to access file metadata, row group
 //! readers to read individual column chunks, or access record iterator.
 
-use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
-    rc::Rc,
-};
+use std::{boxed::Box, io::Read, rc::Rc};
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+/// The ChunkReader trait generates readers of chunks of a source.
+/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
+/// For an object store reader, each read can be mapped to a range request.
+pub trait ChunkReader: Length {
+    type T: Read;
+    /// get a serialy readeable slice of the current reader
+    /// This should fail if the slice exceeds the current bounds
+    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
+}
 
 // ----------------------------------------------------------------------

Review comment:
       
   When I tried to compile an in-house project with this branch I got the following error: 
   
   ```
   
   error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
     --> delorean_parquet/src/lib.rs:13:28
      |
   13 |     file::reader::{Length, TryClone},
      |                            ^^^^^^^^ no `TryClone` in `parquet::file::reader`
   
   error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
   ```
   
   When I tried to use the copy in `util::io` I got: 
   
   ```
   error[E0603]: module `util` is private
     --> delorean_parquet/src/lib.rs:14:5
      |
   14 |     util::io::{TryClone},
      |     ^^^^ private module
      |
   
   ```
   
   When I used the import in `seriailized_reader` I got a similar error: 
   
   ```
   error[E0603]: trait `TryClone` is private
     --> delorean_parquet/src/lib.rs:14:30
      |
   14 |     file::serialized_reader::TryClone,
      |                              ^^^^^^^^ private trait
      |
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r500903617



##########
File path: rust/parquet/src/util/cursor.rs
##########
@@ -0,0 +1,203 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::cmp;
+use std::io::{self, Error, ErrorKind, Read, Seek, SeekFrom};
+use std::rc::Rc;
+
+/// This is object to use if your file is already in memory.
+/// The sliceable cursor is similar to std::io::Cursor, except that it makes it easy to create "cursor slices".
+/// To achieve this, it uses Rc instead of shared references. Indeed reference fields are painfull

Review comment:
       I would personally suggest moving to `Arc` here rather than `Rc` across the board in the Parquet reader. I suggest `Arc` to "future proof" the code as `Arc` almost always is needed with `async` and multi-threading (as `Rc` doesn't implement `Send`)

##########
File path: rust/parquet/src/file/reader.rs
##########
@@ -19,34 +19,48 @@
 //! readers to read individual column chunks, or access record iterator.
 
 use std::{
-    convert::TryFrom,
-    fs::File,
-    io::{BufReader, Cursor, Read, Seek, SeekFrom},
-    path::Path,
+    boxed::Box,
+    io::{Read, Seek},
     rc::Rc,
 };
 
-use byteorder::{ByteOrder, LittleEndian};
-use parquet_format::{
-    ColumnOrder as TColumnOrder, FileMetaData as TFileMetaData, PageHeader, PageType,
-};
-use thrift::protocol::TCompactInputProtocol;
-
-use crate::basic::{ColumnOrder, Compression, Encoding, Type};
 use crate::column::page::PageIterator;
-use crate::column::{
-    page::{Page, PageReader},
-    reader::{ColumnReader, ColumnReaderImpl},
-};
-use crate::compression::{create_codec, Codec};
+use crate::column::{page::PageReader, reader::ColumnReader};
 use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::*, statistics, FOOTER_SIZE, PARQUET_MAGIC};
+use crate::file::metadata::*;
+pub use crate::file::serialized_reader::{SerializedFileReader, SerializedPageReader};
 use crate::record::reader::RowIter;
-use crate::record::Row;
-use crate::schema::types::{
-    self, ColumnDescPtr, SchemaDescPtr, SchemaDescriptor, Type as SchemaType,
-};
-use crate::util::{io::FileSource, memory::ByteBufferPtr};
+use crate::schema::types::{ColumnDescPtr, SchemaDescPtr, Type as SchemaType};
+
+use crate::basic::Type;
+
+use crate::column::reader::ColumnReaderImpl;
+
+/// Length should return the amount of bytes that implementor contains.
+/// It's mainly used to read the metadata, which is at the end of the source.
+#[allow(clippy::len_without_is_empty)]
+pub trait Length {
+    /// Returns the amount of bytes of the inner source.
+    fn len(&self) -> u64;
+}
+
+pub trait ReadChunck: Read + Length {}
+pub trait ReadSeekChunck: ReadChunck + Seek {}
+impl<T: Read + Length> ReadChunck for T {}

Review comment:
       I like this pattern (of automatically implementing the combined trait for anything that implements `Read` + `Length` (I struggled doing that with FileReader when working with the Parquet API earlier in my Rust days)

##########
File path: rust/parquet/src/util/io.rs
##########
@@ -17,11 +17,31 @@
 
 use std::{cell::RefCell, cmp, io::*};
 
-use crate::file::{reader::ParquetReader, writer::ParquetWriter};
+use crate::file::{reader::Length, writer::ParquetWriter};
 
 const DEFAULT_BUF_SIZE: usize = 8 * 1024;
 
 // ----------------------------------------------------------------------
+
+/// TryClone tries to clone the type and should maintain the `Seek` position of the given

Review comment:
       👍  for comments




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nevi-me commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

nevi-me commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-706548010


   I'll also only be able to review this in the next 2 week(end)s, and I'd like to go through it before it's merged.
   
   I'm also going to reach out to a few active crates that depend on Parquet, so I can solicit their feedback on these changes.
   I anticipate a lot of changes + improvements for the 3.0.0 release, so if we can manage that without causing a lot of breakage; I think we should go for it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-706546221


   Sorry I don't have more time to spend on this -- I am still struggling to get my existing project code to compile against this branch and this PR is pretty massive to digest. 
   
   One other thing I found this morning is that the `TryClone` trait is still used in `ParquetWriter` so I do think it needs to be publically exported somehow (or we need to refactor the writer)
   
   I have also been trying to figure out if the `ParquetReader` trait is still needed after this refactoring.
   
   So I guess all in all my conclusion is that the ideas behind this PR are good but:
   1. It is hard to review given its size
   2. There will likely be a fairly substantial backwards incompatible change for anyone using the ParquetReader in their project


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-705805870


   > The only one that remains is the necessity to prepare for async, but I have digged a little bit into and I think that tackling this properly will require work that is not really in the scope of this PR
   
   I agree


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-700744261


   I first moved the footer parsing out of the `SerializedFileReader`. I will now try to move as much logic as possible from the `SerializedFileReader` and `SerializedRowGroupReader` traits to the more generic `FileReader` and `RowGroupReader` traits


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-705682739


   @alamb I think I addressed most of your concerns. The only one that remains is the necessity to prepare for async, but I have digged a little bit into and I think that tackling this properly will require work that is not really in the scope of this PR. There is more to it than converting Rc to Arc because of this shared handle to the file. Lets do that work in an other PR associated to a sub-tasks of [ARROW-9464](https://issues.apache.org/jira/browse/ARROW-9464) or simply associated to [ARROW-9674](https://issues.apache.org/jira/browse/ARROW-9674). I will work on it next week, but until then we can validate and merge this one.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-706680795


   I should probably make clear that anyone using the `SerializedFileReader` with `File` will likely have no issues (as @rdettai  has done a great job of keeping the that level interface consistent. Any project that uses a custom input source (as we did in my internal project) by implementing `ParquetReader` and associated traits may have more to change.
   
   I actually think the changes in this PR will make such custom input sources easier to write in the future, but existing code will have to be rejiggered.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] rdettai commented on a change in pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

rdettai commented on a change in pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#discussion_r509056820



##########
File path: rust/parquet/src/util/cursor.rs
##########
@@ -0,0 +1,203 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::cmp;
+use std::io::{self, Error, ErrorKind, Read, Seek, SeekFrom};
+use std::rc::Rc;
+
+/// This is object to use if your file is already in memory.
+/// The sliceable cursor is similar to std::io::Cursor, except that it makes it easy to create "cursor slices".
+/// To achieve this, it uses Rc instead of shared references. Indeed reference fields are painfull

Review comment:
       This is not a new feature but an adaptation of what we had before. Without `SliceableCursor` there is no way to read from RAM the way it was done with `Cursor`. Instead of directly implementing the new `ChunkReader` trait for `Cursor` and having the same problem of copying the data all over the place as we did before, I tweeked the `Cursor` a bit to share its data between the `Read` slices. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-708039865


   I am but one opinion -- I wonder if @nevi-me  or @sunchao  have any thoughts / opinions about where this PR is heading


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org