Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/02 17:17:24 UTC

[GitHub] [arrow-datafusion] yjshen opened a new pull request #1905: Avoid expensive open for one single file and simplify object reader API on the `sync` part

yjshen opened a new pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059729382


   If @tustvold can help get rid of the chunkreader in the short run, then I think that would be the best route forward.
   
   > I cannot do Arc<RefCell<dyn ObjectReader>> since RefCell is not Sync, which is imposed by ChunkReader:
   
   Do we really need RefCell here? Could we just use `Arc::get_mut` instead? But if you can get rid of the Sync requirement, that's even better :P 





[GitHub] [arrow-datafusion] yjshen removed a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen removed a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058143404


   Yes, exactly!





[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#discussion_r818717630



##########
File path: datafusion/src/datasource/object_store/mod.rs
##########
@@ -39,27 +39,34 @@ use crate::error::{DataFusionError, Result};
 /// Note that the dynamic dispatch on the reader might
 /// have some performance impacts.
 #[async_trait]
-pub trait ObjectReader: Send + Sync {
+pub trait ObjectReader: Read + Seek + Send {
     /// Get reader for a part [start, start + length] in the file asynchronously
     async fn chunk_reader(&self, start: u64, length: usize)
         -> Result<Box<dyn AsyncRead>>;
 
-    /// Get reader for a part [start, start + length] in the file
-    fn sync_chunk_reader(
-        &self,
-        start: u64,
-        length: usize,
-    ) -> Result<Box<dyn Read + Send + Sync>>;
-
-    /// Get reader for the entire file
-    fn sync_reader(&self) -> Result<Box<dyn Read + Send + Sync>> {
-        self.sync_chunk_reader(0, self.length() as usize)
-    }

Review comment:
       Hi @rdettai, long time no see! I think 'chunk' is a term only from the parquet reader implementation, and in DataFusion we always read chunks of the same file sequentially.







[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#discussion_r817933625



##########
File path: datafusion/src/datasource/object_store/mod.rs
##########
@@ -39,27 +39,34 @@ use crate::error::{DataFusionError, Result};
 /// Note that the dynamic dispatch on the reader might
 /// have some performance impacts.
 #[async_trait]
-pub trait ObjectReader: Send + Sync {
+pub trait ObjectReader: Read + Seek + Send {
     /// Get reader for a part [start, start + length] in the file asynchronously
     async fn chunk_reader(&self, start: u64, length: usize)
         -> Result<Box<dyn AsyncRead>>;
 
-    /// Get reader for a part [start, start + length] in the file
-    fn sync_chunk_reader(
-        &self,
-        start: u64,
-        length: usize,
-    ) -> Result<Box<dyn Read + Send + Sync>>;
-
-    /// Get reader for the entire file
-    fn sync_reader(&self) -> Result<Box<dyn Read + Send + Sync>> {
-        self.sync_chunk_reader(0, self.length() as usize)
-    }

Review comment:
       After discussions with @houqp and @richox, we agreed that the extra `chunk` semantic in ObjectReader leaks irrelevant file-format details into object stores and incurs needless complexity. Therefore, I introduced these API simplifications.
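
   As a minimal sketch of what this simplification enables (plain `std` traits; the helper name `read_range` is illustrative, not part of the PR), a caller reads a byte range by seeking and reading on the reader itself rather than asking the store for a separate chunk reader:

   ```rust
   use std::io::{Read, Result, Seek, SeekFrom};

   // With `ObjectReader: Read + Seek + Send`, a range read is just seek + read_exact.
   fn read_range<R: Read + Seek>(reader: &mut R, start: u64, length: usize) -> Result<Vec<u8>> {
       reader.seek(SeekFrom::Start(start))?;
       let mut buf = vec![0u8; length];
       reader.read_exact(&mut buf)?;
       Ok(buf)
   }
   ```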

##########
File path: datafusion/src/datasource/object_store/local.rs
##########
@@ -82,23 +112,12 @@ impl ObjectReader for LocalFileReader {
         )
     }
 
-    fn sync_chunk_reader(
-        &self,
-        start: u64,
-        length: usize,
-    ) -> Result<Box<dyn Read + Send + Sync>> {
-        // A new file descriptor is opened for each chunk reader.
-        // This okay because chunks are usually fairly large.
-        let mut file = File::open(&self.file.path)?;

Review comment:
       This is the original purpose of this PR. @liukun4515 initially found that the HDFSObjectStore performed many unnecessary opens. The same happens here for the local store, though an HDFS open is even more expensive.







[GitHub] [arrow-datafusion] yjshen commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1057205658


   Cc @alamb @matthewmturner @seddonm1 @tustvold @rdettai to gather more input as this PR may affect your work.





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058149127


   I think I may also need to ~~tune `chunk_reader` in parquet a little bit~~, since it also imposes immutability:
   
   ```rust
   impl ChunkReader for ObjectReaderWrapper {
       type T = Self;
   
       fn get_read(&self, start: u64, length: usize) -> ParquetResult<Self::T>
   ``` 
   
   In order to change ChunkReader to `&mut self` and cut out the Mutex, it seems I need to change the full code path of parquet reading, since it all shares an immutable interface.





[GitHub] [arrow-datafusion] rdettai commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058140404


   Just to be sure I understand you correctly: if we used `PartitionedFile` for the file slicing, you would obtain parallelism by creating multiple `ObjectReader` instances? 
   
   In the case where we don't want one `ObjectReader` instance to be distributed across threads, `file_reader()` should stop returning `Arc<dyn ObjectReader>` and instead return an owned object (`Box<dyn ObjectReader>`). This would allow you to cut out the mutex.
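
   A minimal sketch of that signature change, with stand-in types rather than the actual DataFusion API:

   ```rust
   use std::io::{Read, Result, Seek};

   pub trait ObjectReader: Read + Seek + Send {}

   pub struct FileMeta; // stand-in for the real file metadata type

   pub trait ObjectStore: Send + Sync {
       // Before: fn file_reader(&self, file: FileMeta) -> Result<Arc<dyn ObjectReader>>;
       // Returning an owned reader gives each thread exclusive access, so no Mutex is needed:
       fn file_reader(&self, file: FileMeta) -> Result<Box<dyn ObjectReader>>;
   }
   ```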





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058121557


   Yes, I'm aware of the parallelizing ability the current API exposes. However, it's hard to express or utilize in the current execution plan: how should I trigger the parallel chunk fetch while maintaining a single-partition serialized read? Instead, we have the `PartitionedFile` abstraction that can be extended with file slicing ability. 
   
   ```rust
   /// A single file that should be read, along with its schema, statistics
   /// and partition column values that need to be appended to each row.
   pub struct PartitionedFile {
       /// Path for the file (e.g. URL, filesystem path, etc)
       pub file_meta: FileMeta,
       /// Values of partition columns to be appended to each row
       pub partition_values: Vec<ScalarValue>,
       // We may include row group range here for a more fine-grained parallel execution
   }
   ```
   
   For example, by enabling the parquet scan with row-group ability (https://github.com/apache/arrow-rs/pull/1389), we could replace the last comment in the PartitionedFile above with real ranges when we want finer-grained fetch and execution. And to control the parallelism of FileScan execution, we could tune a `max_byte_per_partition` configuration and partition all input files into `Vec<Vec<PartitionedFile>>`. 
   
   Each `Vec<PartitionedFile>` could be summed up to the `max_byte_per_partition` size, from many individual parquet files, or one big slice of one big parquet file.
   
   By controlling `max_byte_per_partition`, we could still achieve the parallel fetch of file chunks you mentioned if users choose a smaller per-partition input size, or avoid unexpectedly re-opening a file once per row group per column.
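
   A rough sketch of that grouping step (the `size` field is assumed for illustration; real code would also slice over-sized files into row-group ranges):

   ```rust
   struct PartitionedFile {
       size: u64, // hypothetical byte size of this file or slice
   }

   /// Pack files into partitions so that each partition stays under the byte budget.
   fn split_by_bytes(
       files: Vec<PartitionedFile>,
       max_byte_per_partition: u64,
   ) -> Vec<Vec<PartitionedFile>> {
       let mut partitions = Vec::new();
       let mut current = Vec::new();
       let mut current_bytes = 0u64;
       for file in files {
           if !current.is_empty() && current_bytes + file.size > max_byte_per_partition {
               partitions.push(std::mem::take(&mut current));
               current_bytes = 0;
           }
           current_bytes += file.size;
           current.push(file);
       }
       if !current.is_empty() {
           partitions.push(current);
       }
       partitions
   }
   ```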





[GitHub] [arrow-datafusion] alamb commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1061150235


   FWIW I don't have a strong opinion on this one way or the other (as we have our own ObjectStore reader implementation in IOx for the time being). Maybe @tustvold has some thoughts.





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059731719


   I'm afraid not. 😬
   Arc::get_mut needs self to be a mutable reference.
   ```rust
       pub fn get_mut(this: &mut Self) -> Option<&mut T> {
   ```
   but not the case in ChunkReader:
   ```rust
   pub trait ChunkReader: Length + Send + Sync {
       type T: Read + Send;
       /// get a serialy readeable slice of the current reader
       /// This should fail if the slice exceeds the current bounds
       fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
   }
   ```
   `get_read` takes `&self`, which is immutable.
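
   A tiny illustration of the mismatch (types are made up): `Arc::get_mut` wants `&mut Arc<T>`, while `get_read` only holds `&self`:

   ```rust
   use std::sync::Arc;

   struct Inner;

   struct Wrapper {
       inner: Arc<Inner>,
   }

   impl Wrapper {
       fn get_read(&self, _start: u64, _length: usize) {
           // Does not compile if uncommented: `&self` cannot yield `&mut self.inner`.
           // let inner: &mut Inner = Arc::get_mut(&mut self.inner).unwrap();
       }
   }
   ```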





[GitHub] [arrow-datafusion] yjshen closed pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen closed pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905


   





[GitHub] [arrow-datafusion] yjshen commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058121557


   Yes, I'm aware of the parallelizing ability the current API exposes; however, it's hard to express or fully utilize in the current execution plan: how should I trigger the current parallel chunk fetch while maintaining a single-partition serialized read? Instead, we have the `PartitionedFile` abstraction that can be extended with file slicing ability. 
   
   ```rust
   /// A single file that should be read, along with its schema, statistics
   /// and partition column values that need to be appended to each row.
   pub struct PartitionedFile {
       /// Path for the file (e.g. URL, filesystem path, etc)
       pub file_meta: FileMeta,
       /// Values of partition columns to be appended to each row
       pub partition_values: Vec<ScalarValue>,
       // We may include row group range here for a more fine-grained parallel execution
   }
   ```
   
   For example, by enabling the parquet scan with row-group ability (https://github.com/apache/arrow-rs/pull/1389), we could replace the last comment in the PartitionedFile above with real ranges when we want finer-grained fetch and execution. And in order to control the parallelism of FileScan execution, we could just tune a 'max_byte_per_split' configuration and partition all input files into `Vec<Vec<PartitionedFile>>`; each `Vec<PartitionedFile>` could be summed up to the 'max_byte_per_split' size, from many individual parquet files, or one big slice of one big parquet file.





[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#discussion_r820111433



##########
File path: datafusion/src/datasource/object_store/local.rs
##########
@@ -82,23 +112,12 @@ impl ObjectReader for LocalFileReader {
         )
     }
 
-    fn sync_chunk_reader(
-        &self,
-        start: u64,
-        length: usize,
-    ) -> Result<Box<dyn Read + Send + Sync>> {
-        // A new file descriptor is opened for each chunk reader.
-        // This okay because chunks are usually fairly large.
-        let mut file = File::open(&self.file.path)?;

Review comment:
       I probably misunderstand something here and I am sorry I don't quite follow all the comments on this PR. 
   
   If the issue you are trying to solve is that `File::open` is called too often, would it be possible to "memoize" the open here with a mutex inside of the FileReader?
   
   Something like
   
    ```rust
    use std::fs::File;
    use std::io::Read;
    use std::sync::Mutex;

    struct LocalFileReader {
        // ...
        /// Keep the opened file around to avoid reopening it
        cache: Mutex<Option<File>>,
    }

    impl LocalFileReader {
        // ...
        fn sync_chunk_reader(
            &self,
            start: u64,
            length: usize,
        ) -> Result<Box<dyn Read + Send + Sync>> {
            let mut cache = self.cache.lock().unwrap();
            if cache.is_none() {
                *cache = Some(File::open(&self.file.path)?);
            }
            // `try_clone` shares the underlying descriptor instead of reopening
            Ok(Box::new(cache.as_ref().unwrap().try_clone()?))
        }
    }
    ```
         







[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059729382


   If @tustvold can help get rid of the chunkreader in the parquet crate in the short run, then I think that would be the best route forward.
   
   > I cannot do Arc<RefCell<dyn ObjectReader>> since RefCell is not Sync, which is imposed by ChunkReader:
   
   Do we really need RefCell here? Could we just use `Arc::get_mut` instead? But if you can get rid of the Sync requirement, that's even better :P 





[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#discussion_r820119627



##########
File path: datafusion/src/datasource/object_store/local.rs
##########
@@ -82,23 +112,12 @@ impl ObjectReader for LocalFileReader {
         )
     }
 
-    fn sync_chunk_reader(
-        &self,
-        start: u64,
-        length: usize,
-    ) -> Result<Box<dyn Read + Send + Sync>> {
-        // A new file descriptor is opened for each chunk reader.
-        // This okay because chunks are usually fairly large.
-        let mut file = File::open(&self.file.path)?;

Review comment:
       Yes, the original problem is about too many opens, and we do solve it in our HDFS object store implementation, similar to your suggestion above. 
   
   However, as I think more about the ObjectReader API and its use, I realize I've brought one extra layer of abstraction, the "chunk" reader layer, into ObjectReader without benefit. I'd prefer that the `chunk reader` stay Parquet-related only, and that object readers care only about seeks and reads. 
   
   Therefore the current PR stops creating new readers from `ObjectReader` and instead reads through the `ObjectReader` itself. And if we want the ability to fetch multiple parts of a parquet file simultaneously, we can utilize the `PartitionedFile` struct. 
   
    ```rust
   /// A single file that should be read, along with its schema, statistics
   /// and partition column values that need to be appended to each row.
   pub struct PartitionedFile {
       /// Path for the file (e.g. URL, filesystem path, etc)
       pub file_meta: FileMeta,
       /// Values of partition columns to be appended to each row
       pub partition_values: Vec<ScalarValue>,
       // We may include row group range here for a more fine-grained parallel execution
   }
   ```
   We could have a `max_bytes_per_partition` configuration during query planning: combine multiple parquet files into one partition (like we do now), or split a large parquet file into many ranges and have each partition handle only several ranges, with the help of https://github.com/apache/arrow-rs/issues/158.
   
   And the last several comments are about how to remove the `Mutex` from `ObjectReaderWrapper`. Since we read a parquet file sequentially within a partition, the Mutex may incur unnecessary overhead; we only have to write it this way to achieve interior mutability because the ChunkReader API in parquet-rs requires it:
   
   ```rust
   pub trait ChunkReader: Length + Send + Sync {
       type T: Read + Send;
       /// get a serialy readeable slice of the current reader
       /// This should fail if the slice exceeds the current bounds
    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;   //  `&self` together with `Send` imposes the need for interior mutability
   }
   ```
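
   A minimal sketch (names assumed, not the exact DataFusion code) of the wrapper this forces: because `get_read` takes `&self`, the seeking reader has to hide behind a `Mutex` and every range read locks it:

   ```rust
   use std::io::{Read, Seek, SeekFrom};
   use std::sync::{Arc, Mutex};

   trait ReadSeek: Read + Seek + Send {}
   impl<T: Read + Seek + Send> ReadSeek for T {}

   struct ObjectReaderWrapper {
       inner: Arc<Mutex<Box<dyn ReadSeek>>>,
   }

   impl ObjectReaderWrapper {
       // Mirrors ChunkReader::get_read's `&self`: each range read takes the lock.
       fn read_at(&self, start: u64, length: usize) -> std::io::Result<Vec<u8>> {
           let mut reader = self.inner.lock().unwrap();
           reader.seek(SeekFrom::Start(start))?;
           let mut buf = vec![0u8; length];
           reader.read_exact(&mut buf)?;
           Ok(buf)
       }
   }
   ```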
   







[GitHub] [arrow-datafusion] yjshen commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059731719


   I'm afraid not. 😬
   Arc::get_mut needs self to be a mutable reference.
   ```rust
       pub fn get_mut(this: &mut Self) -> Option<&mut T> {
   ```
   but not the case in ChunkReader:
   ```rust
   pub trait ChunkReader: Length + Send + Sync {
       type T: Read + Send;
       /// get a serialy readeable slice of the current reader
       /// This should fail if the slice exceeds the current bounds
       fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
   }
   ```





[GitHub] [arrow-datafusion] rdettai commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058087303


   The API change you suggest here is not just a simplification; it breaks the way we can currently parallelize the ObjectStore. You are replacing `fn sync_reader(&self) -> Result<Box<dyn Read + Send + Sync>>`, which takes an immutable ref and can thus be called in parallel from different threads, with `fn set_limit(&mut self, limit: usize)`, which requires an exclusive reference and a `Mutex` lock that blocks the entire `ObjectReader` at each operation.





[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#discussion_r817933625



##########
File path: datafusion/src/datasource/object_store/mod.rs
##########
@@ -39,27 +39,34 @@ use crate::error::{DataFusionError, Result};
 /// Note that the dynamic dispatch on the reader might
 /// have some performance impacts.
 #[async_trait]
-pub trait ObjectReader: Send + Sync {
+pub trait ObjectReader: Read + Seek + Send {
     /// Get reader for a part [start, start + length] in the file asynchronously
     async fn chunk_reader(&self, start: u64, length: usize)
         -> Result<Box<dyn AsyncRead>>;
 
-    /// Get reader for a part [start, start + length] in the file
-    fn sync_chunk_reader(
-        &self,
-        start: u64,
-        length: usize,
-    ) -> Result<Box<dyn Read + Send + Sync>>;
-
-    /// Get reader for the entire file
-    fn sync_reader(&self) -> Result<Box<dyn Read + Send + Sync>> {
-        self.sync_chunk_reader(0, self.length() as usize)
-    }

Review comment:
       After discussions with @houqp and @richox, we agreed that the extra `chunk` semantic in ObjectReader leaks irrelevant file-format details into object stores and incurs needless complexity. Hence the API simplifications.







[GitHub] [arrow-datafusion] houqp commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059729382


   If @tustvold can help get rid of the chunkreader in the short run, then I think that would be the best route forward.
   
   > I cannot do Arc<RefCell<dyn ObjectReader>> since RefCell is not Sync, which is imposed by ChunkReader:
   
   Do we really need RefCell here? Could we just use `Arc::get_mut` instead?





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058121557


   Yes, I'm aware of the parallelizing ability the current API exposes. However, it's hard to express or utilize in the current execution plan: how should I trigger the parallel chunk fetch while maintaining a single-partition serialized read? Instead, we have the `PartitionedFile` abstraction that can be extended with file slicing ability. 
   
   ```rust
   /// A single file that should be read, along with its schema, statistics
   /// and partition column values that need to be appended to each row.
   pub struct PartitionedFile {
       /// Path for the file (e.g. URL, filesystem path, etc)
       pub file_meta: FileMeta,
       /// Values of partition columns to be appended to each row
       pub partition_values: Vec<ScalarValue>,
       // We may include row group range here for a more fine-grained parallel execution
   }
   ```
   
   For example, by enabling the parquet scan with row-group ability (https://github.com/apache/arrow-rs/issues/158), we could replace the last comment in the PartitionedFile above with real ranges when we want finer-grained fetch and execution. And to control the parallelism of FileScan execution, we could tune a `max_byte_per_partition` configuration and partition all input files into `Vec<Vec<PartitionedFile>>`. 
   
   Each `Vec<PartitionedFile>` could be summed up to the `max_byte_per_partition` size, from many individual parquet files, or one big slice of one big parquet file.
   
   By controlling `max_byte_per_partition`, we could still achieve the parallel fetch of file chunks you mentioned if users choose a smaller per-partition input size, or avoid unexpectedly re-opening a file once per row group per column.





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059727088


   I cannot do `Arc<RefCell<dyn ObjectReader>>` since RefCell is not `Sync`, which is imposed by ~~ChunkReader~~ Arc:
   ```rust
   pub trait ChunkReader: Length + Send + Sync 
   ```
   ```
   343 | impl ChunkReader for ObjectReaderWrapper {
       |      ^^^^^^^^^^^ `RefCell<(dyn ObjectReader + 'static)>` cannot be shared between threads safely
       |
       = help: the trait `Sync` is not implemented for `RefCell<(dyn ObjectReader + 'static)>`
       = note: required because of the requirements on the impl of `Sync` for `Arc<RefCell<(dyn ObjectReader + 'static)>>`
   ```
   
   I'm not sure I could remove Sync from ChunkReader, but I can try it first.
   
   





[GitHub] [arrow-datafusion] yjshen commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059727088


   I cannot do `Arc<RefCell<dyn ObjectReader>>` since RefCell is not `Sync`, which is imposed by ChunkReader:
   ```rust
   pub trait ChunkReader: Length + Send + Sync 
   ```
   ```
   343 | impl ChunkReader for ObjectReaderWrapper {
       |      ^^^^^^^^^^^ `RefCell<(dyn ObjectReader + 'static)>` cannot be shared between threads safely
       |
       = help: the trait `Sync` is not implemented for `RefCell<(dyn ObjectReader + 'static)>`
       = note: required because of the requirements on the impl of `Sync` for `Arc<RefCell<(dyn ObjectReader + 'static)>>`
   ```
   
   I'm not sure I could remove Sync from ChunkReader, but I can try it first.
   
   





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059727088


   I cannot do `Arc<RefCell<dyn ObjectReader>>` since RefCell is not `Sync`, which is imposed by Arc as well as ChunkReader:
   ```rust
   pub trait ChunkReader: Length + Send + Sync 
   ```
   ```
   343 | impl ChunkReader for ObjectReaderWrapper {
       |      ^^^^^^^^^^^ `RefCell<(dyn ObjectReader + 'static)>` cannot be shared between threads safely
       |
       = help: the trait `Sync` is not implemented for `RefCell<(dyn ObjectReader + 'static)>`
       = note: required because of the requirements on the impl of `Sync` for `Arc<RefCell<(dyn ObjectReader + 'static)>>`
   ```
   
   I'm not sure I could remove Sync from ChunkReader, but I can try it first.
   
   





[GitHub] [arrow-datafusion] tustvold commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
tustvold commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1062096897


   I intend to pick up #1617 tomorrow and so I might have some more thoughts then, but some initial thoughts:
   
   * `ObjectStoreWrapper` has perhaps surprising behaviour with respect to the read position: a call to read (and/or seek) can side-effect on clones
   * https://doc.rust-lang.org/std/fs/struct.File.html#method.try_clone might provide another option, although it shares the same pain as above w.r.t. shared interior mutability. The parquet crate uses this https://github.com/apache/arrow-rs/blob/master/parquet/src/util/io.rs, for better or worse
   * I agree that query parallelism is better handled at a higher level, e.g. in PartitionedFile, vs introducing internal parallelism for row groups within ParquetExec
   * The ChunkReader abstraction used by parquet is extremely confusing, probably not something we want to replicate, and something I may actually remove in the future - https://github.com/apache/arrow-rs/issues/1163
   * The async parquet exec PR as currently written moves away from using the ChunkObjectReader, etc... plumbing in favour of using the tokio io traits directly and so will likely fix this issue - https://github.com/apache/arrow-datafusion/pull/1617/files#diff-4b6766123d48840a81a799938d38bffd72dde49603e0a3fd9d3c0b5b46d2d224R54
   
   In short I think this is working around a slightly funky API exposed by the parquet crate, and to be honest I'd rather fix this API and/or move to the async reader API, which doesn't share the same funkiness.
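
   For reference, a small `std`-only illustration of the `try_clone` caveat above (it assumes a local file `data.bin` exists): the clone duplicates the file descriptor, so both handles share one read position:

   ```rust
   use std::fs::File;
   use std::io::{Read, Seek, SeekFrom};

   fn main() -> std::io::Result<()> {
       let mut a = File::open("data.bin")?;
       let mut b = a.try_clone()?;
       a.seek(SeekFrom::Start(100))?;
       let mut byte = [0u8; 1];
       // Reads at offset 100, not 0: seeking through `a` moved `b`'s position too.
       b.read_exact(&mut byte)?;
       Ok(())
   }
   ```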
   





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059731719


   I'm afraid not. 😬
   Arc::get_mut needs self to be a mutable reference.
   ```rust
       pub fn get_mut(this: &mut Self) -> Option<&mut T> {
   ```
   but not the case in ChunkReader:
   ```rust
   pub trait ChunkReader: Length + Send + Sync {
       type T: Read + Send;
       /// get a serialy readeable slice of the current reader
       /// This should fail if the slice exceeds the current bounds
       fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
   }
   ```
   `get_read` takes `&self`, which is immutable.
   
   And RefCell doesn't work either, since:
   ```
   `RefCell<(dyn ObjectReader + 'static)>` cannot be shared between threads safely
       |
       = help: the trait `Sync` is not implemented for `RefCell<(dyn ObjectReader + 'static)>`
       = note: required because of the requirements on the impl of `std::marker::Send` for `Arc<RefCell<(dyn ObjectReader + 'static)>>`
   
   ```
   Even with `Sync` removed, leaving `pub trait ChunkReader: Length + Send`, the requirement is still imposed by the remaining `Send`.





[GitHub] [arrow-datafusion] rdettai commented on a change in pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#discussion_r818682560



##########
File path: datafusion/src/datasource/object_store/mod.rs
##########
@@ -39,27 +39,34 @@ use crate::error::{DataFusionError, Result};
 /// Note that the dynamic dispatch on the reader might
 /// have some performance impacts.
 #[async_trait]
-pub trait ObjectReader: Send + Sync {
+pub trait ObjectReader: Read + Seek + Send {
     /// Get reader for a part [start, start + length] in the file asynchronously
     async fn chunk_reader(&self, start: u64, length: usize)
         -> Result<Box<dyn AsyncRead>>;
 
-    /// Get reader for a part [start, start + length] in the file
-    fn sync_chunk_reader(
-        &self,
-        start: u64,
-        length: usize,
-    ) -> Result<Box<dyn Read + Send + Sync>>;
-
-    /// Get reader for the entire file
-    fn sync_reader(&self) -> Result<Box<dyn Read + Send + Sync>> {
-        self.sync_chunk_reader(0, self.length() as usize)
-    }

Review comment:
       Hi @yjshen! Do you have a pointer to that discussion so I can get the full context?







[GitHub] [arrow-datafusion] rdettai commented on a change in pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#discussion_r818682560



##########
File path: datafusion/src/datasource/object_store/mod.rs
##########
@@ -39,27 +39,34 @@ use crate::error::{DataFusionError, Result};
 /// Note that the dynamic dispatch on the reader might
 /// have some performance impacts.
 #[async_trait]
-pub trait ObjectReader: Send + Sync {
+pub trait ObjectReader: Read + Seek + Send {
     /// Get reader for a part [start, start + length] in the file asynchronously
     async fn chunk_reader(&self, start: u64, length: usize)
         -> Result<Box<dyn AsyncRead>>;
 
-    /// Get reader for a part [start, start + length] in the file
-    fn sync_chunk_reader(
-        &self,
-        start: u64,
-        length: usize,
-    ) -> Result<Box<dyn Read + Send + Sync>>;
-
-    /// Get reader for the entire file
-    fn sync_reader(&self) -> Result<Box<dyn Read + Send + Sync>> {
-        self.sync_chunk_reader(0, self.length() as usize)
-    }

Review comment:
       Hi @yjshen! Do you have a pointer to that discussion so I can get the full context? Also, why do you say that it introduces "irrelevant file format details"? A chunk is a chunk; it can apply to any file format 😄 







[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058149127


   I think I may also need to ~~tune `chunk_reader` in parquet a little bit~~, since it also imposes immutability:
   
   ```rust
   impl ChunkReader for ObjectReaderWrapper {
       type T = Self;
   
       fn get_read(&self, start: u64, length: usize) -> ParquetResult<Self::T>
   ``` 
   
   In order to change ChunkReader to `&mut self` and cut out the Mutex, it seems I need to change the full code path of parquet reading, since it all shares an immutable interface.





[GitHub] [arrow-datafusion] yjshen commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058143404


   Yes, exactly!





[GitHub] [arrow-datafusion] yjshen commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1062686947


   Thanks @tustvold for the detailed analysis. ❤️
   
   We already have a workaround for the repeated-open issue in the HDFS object store. And I'm changing the object reader API here to keep future object reader implementations from unintentionally falling into the repeated-open pitfall. 
   
   I really like the idea of getting rid of `ChunkReader` APIs and using an async parquet exec. I expect we could also achieve file-handle reuse for the async reading path on top of tokio async io. And I think we could remove this API:
   ```rust
       /// Get reader for a part [start, start + length] in the file
       fn sync_chunk_reader(
           &self,
           start: u64,
           length: usize,
       ) -> Result<Box<dyn Read + Send + Sync>>;
   ```
   
    entirely, since it's misuse-prone and only used by the Parquet exec in your #1617 as well.





[GitHub] [arrow-datafusion] houqp commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059724553


   I also think we should try to avoid forcing mutex at the api level to avoid its overheads. 
   
   > In order to change ChunkReader to &mut self and cut out the Mutex, it seems I need to change the full code path of parquet reading, since it all shares an immutable interface.
   
   What if we only use Arc and call `get_mut`? This is similar to what's happening with FileSource in the parquet crate. In the long run though, I feel like it might be worth looking into getting rid of the `ChunkReader` abstraction in the Parquet crate entirely and replacing it with `Read + Seek`.





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1062686947


   Thanks @tustvold for the detailed analysis. ❤️
   
   We already have a workaround for the repeated-open issue in the HDFS object store. And I'm changing the object reader API here to keep future object reader implementations from unintentionally falling into the repeated-open pitfall. 
   
   I really like the idea of getting rid of `ChunkReader` APIs and using an async parquet exec. I expect we could also achieve file-handle reuse for the async reading path on top of tokio async io. And I think we could remove this API:
   ```rust
       /// Get reader for a part [start, start + length] in the file
       fn sync_chunk_reader(
           &self,
           start: u64,
           length: usize,
       ) -> Result<Box<dyn Read + Send + Sync>>;
   ```
   
    entirely, since it's misuse-prone and only used by the Parquet exec.





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058121557


   Yes, I'm aware of the parallelizing ability the current API exposes. However, it's hard to express or utilize in the current execution plan: how should I trigger the parallel chunk fetch while maintaining a single-partition serialized read? Instead, we have the `PartitionedFile` abstraction that can be extended with file slicing ability. 
   
   ```rust
   /// A single file that should be read, along with its schema, statistics
   /// and partition column values that need to be appended to each row.
   pub struct PartitionedFile {
       /// Path for the file (e.g. URL, filesystem path, etc)
       pub file_meta: FileMeta,
       /// Values of partition columns to be appended to each row
       pub partition_values: Vec<ScalarValue>,
       // We may include row group range here for a more fine-grained parallel execution
   }
   ```
   
   For example, by enabling the parquet scan with row-group ability (https://github.com/apache/arrow-rs/pull/1389), we could replace the last comment in the PartitionedFile above with real ranges when we want finer-grained fetch and execution. And to control the parallelism of FileScan execution, we could tune a `max_byte_per_partition` configuration and partition all input files into `Vec<Vec<PartitionedFile>>`. 
   
   Each `Vec<PartitionedFile>` could be summed up to the `max_byte_per_partition` size, from many individual parquet files, or one big slice of one big parquet file.
   
   By controlling `max_byte_per_partition`, we could still achieve the parallel fetch of file chunks you mentioned if users choose a smaller per-partition input size, or avoid unexpectedly re-opening a file once per row group per column.





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058121557


   Yes, I'm aware of the parallelizing ability the current API exposes; however, it's hard to express or fully utilize in the current execution plan: how should I trigger the current parallel chunk fetch while maintaining a single-partition serialized read? Instead, we have the `PartitionedFile` abstraction that can be extended with file slicing ability. 
   
   ```rust
   /// A single file that should be read, along with its schema, statistics
   /// and partition column values that need to be appended to each row.
   pub struct PartitionedFile {
       /// Path for the file (e.g. URL, filesystem path, etc)
       pub file_meta: FileMeta,
       /// Values of partition columns to be appended to each row
       pub partition_values: Vec<ScalarValue>,
       // We may include row group range here for a more fine-grained parallel execution
   }
   ```
   
   For example, by enabling the parquet scan with row-group ability (https://github.com/apache/arrow-rs/pull/1389), we could replace the last comment in the PartitionedFile above with real ranges when we want finer-grained fetch and execution. And in order to control the parallelism of FileScan execution, we could just tune a 'max_byte_per_split' configuration and partition all input files into `Vec<Vec<PartitionedFile>>`; each `Vec<PartitionedFile>` could be summed up to the 'max_byte_per_split' size, from many individual parquet files, or one big slice of one big parquet file.





[GitHub] [arrow-datafusion] yjshen commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058149127


   I think I may also need to tune `chunk_reader` in parquet a little bit, since it also imposes immutability:
   
   ```rust
   impl ChunkReader for ObjectReaderWrapper {
       type T = Self;
   
       fn get_read(&self, start: u64, length: usize) -> ParquetResult<Self::T>
   ``` 





[GitHub] [arrow-datafusion] tustvold commented on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
tustvold commented on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059725651


   I'm afraid I've not had time to follow along here, but I have a draft PR that moves to an async parquet reader and doesn't use ChunkReader. So if that part is causing problems, I can maybe look to get that PR in sometime next week?





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059731719


   I'm afraid not. 😬
   Arc::get_mut needs self to be a mutable reference.
   ```rust
       pub fn get_mut(this: &mut Self) -> Option<&mut T> {
   ```
   but not the case in ChunkReader:
   ```rust
   pub trait ChunkReader: Length + Send + Sync {
       type T: Read + Send;
       /// get a serialy readeable slice of the current reader
       /// This should fail if the slice exceeds the current bounds
       fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
   }
   ```
   `get_read` takes `&self`, which is immutable.
   
   Even if I remove `Sync` from ChunkReader, I still need `Arc<Mutex<dyn ObjectReader>>`, which is required by `Send`.





[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

Posted by GitBox <gi...@apache.org>.
yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1059731719


   I'm afraid not. 😬
   Arc::get_mut needs self to be a mutable reference.
   ```rust
       pub fn get_mut(this: &mut Self) -> Option<&mut T> {
   ```
   but not the case in ChunkReader:
   ```rust
   pub trait ChunkReader: Length + Send + Sync {
       type T: Read + Send;
       /// get a serialy readeable slice of the current reader
       /// This should fail if the slice exceeds the current bounds
       fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
   }
   ```
   `get_read` takes `&self`, which is immutable.

