Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/04/28 18:10:07 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #4156: Remove length from ChunkReader (#4118)

tustvold opened a new pull request, #4156:
URL: https://github.com/apache/arrow-rs/pull/4156

   # Which issue does this PR close?
   
   
   Closes #4118
   
   # Rationale for this change
    
   
   An alternative to #4147 that simply removes the length parameter, to avoid the confusion it causes.
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   
   
   


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4156: Remove length from ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4156:
URL: https://github.com/apache/arrow-rs/pull/4156#discussion_r1180699150


##########
parquet/src/util/io.rs:
##########
@@ -112,9 +109,6 @@ impl<R: ParquetReader> FileSource<R> {
 
 impl<R: ParquetReader> Read for FileSource<R> {
     fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
-        let bytes_to_read = cmp::min(buf.len(), (self.end - self.start) as usize);
-        let buf = &mut buf[0..bytes_to_read];

Review Comment:
   This could actually benefit performance: it avoids artificially truncating the input buffer, which could prevent a large read from bypassing the FileSource's internal buffer when it otherwise could.
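
A minimal illustrative sketch (editor's example, not code from this PR) of the fast path the comment alludes to: buffered readers commonly forward large reads straight to the underlying source, and pre-shrinking the destination buffer, as the removed lines did, defeats that optimisation.

```rust
use std::io::{Read, Result};

/// Illustrative buffered reader, loosely modelled on `std::io::BufReader`
/// (buffer refill logic omitted for brevity).
struct Buffered<R: Read> {
    inner: R,
    buf: Vec<u8>,
    pos: usize,
}

impl<R: Read> Buffered<R> {
    fn read(&mut self, out: &mut [u8]) -> Result<usize> {
        // Fast path: nothing buffered and the caller's buffer is at least
        // as large as ours, so read directly into it. Truncating `out`
        // beforehand makes this branch rarer than it needs to be.
        if self.pos == self.buf.len() && out.len() >= self.buf.capacity() {
            return self.inner.read(out);
        }
        // Slow path: serve whatever remains in the internal buffer.
        let n = out.len().min(self.buf.len() - self.pos);
        out[..n].copy_from_slice(&self.buf[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```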



[GitHub] [arrow-rs] tustvold commented on pull request #4156: Cleanup ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #4156:
URL: https://github.com/apache/arrow-rs/pull/4156#issuecomment-1553485473

   It was removed precisely to discourage this style of usage (see the linked issues), as the ranges passed to it are fairly arbitrary. I would recommend using the metadata to determine the necessary byte ranges up front, or using the async reader, which handles this for you
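
For example, the ranges can be derived from the footer metadata; a minimal sketch, assuming the parquet crate's `ParquetMetaData` and `ColumnChunkMetaData::byte_range` APIs:

```rust
use parquet::file::metadata::ParquetMetaData;

/// Collect the (offset, length) pair of every column chunk so the caller
/// can fetch exactly those byte ranges up front, before decoding starts.
fn column_chunk_ranges(metadata: &ParquetMetaData) -> Vec<(u64, u64)> {
    metadata
        .row_groups()
        .iter()
        .flat_map(|rg| rg.columns())
        .map(|col| col.byte_range()) // (start offset, length) in the file
        .collect()
}
```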


[GitHub] [arrow-rs] zeevm commented on pull request #4156: Cleanup ChunkReader (#4118)

Posted by "zeevm (via GitHub)" <gi...@apache.org>.
zeevm commented on PR #4156:
URL: https://github.com/apache/arrow-rs/pull/4156#issuecomment-1553762075

   @tustvold IIUC the async reader reads into the Arrow in-memory format and doesn't provide lower-level direct column or page reading, correct?
   
   This doesn't work for me, as we have our own in-memory format for our DB; going through Arrow would incur a lot of expensive allocations and format translation.


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4156: Cleanup ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4156:
URL: https://github.com/apache/arrow-rs/pull/4156#discussion_r1180807171


##########
parquet/src/file/reader.rs:
##########
@@ -44,19 +46,47 @@ pub trait Length {
 }
 
 /// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+///
+/// For more information see [`File::try_clone`]
 pub trait ChunkReader: Length + Send + Sync {
-    type T: Read + Send;
-    /// Get a serially readable slice of the current reader
-    /// This should fail if the slice exceeds the current bounds
-    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
+    type T: Read;
+
+    /// Get a [`Read`] starting at the provided file offset
+    ///
+    /// Subsequent or concurrent calls to [`Self::get_read`] or [`Self::get_bytes`] may

Review Comment:
   FileSource provided protection against subsequent calls to get_read, by calling Seek on every read, but provided no protection against concurrent access. I think it is less risky to clearly not support non-serial usage at all than to only break on concurrent usage.
   
   **TBC there are no safety implications of not synchronising this access**. You will just get interleaved reads, which is no different from reading gibberish.
   
   One option would be to add a `Mutex` to synchronise access; however, this solution is necessarily incomplete, as a user can just call `File::try_clone`. Ultimately there is no reliable way to synchronise file IO. I think if no synchronisation is fine for the standard library, it is fine for the parquet crate.
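
A short sketch of why no library-side lock can help, using only standard library behaviour (`data.parquet` is a placeholder path):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let mut a = File::open("data.parquet")?; // placeholder path
    let mut b = a.try_clone()?;

    // Per the `File::try_clone` docs, both handles reference the same
    // underlying file, including its cursor position.
    a.seek(SeekFrom::Start(100))?;

    // This read therefore starts at offset 100, not 0. A `Mutex` inside
    // the parquet crate could not prevent user code from doing this.
    let mut buf = [0u8; 16];
    b.read_exact(&mut buf)?;
    Ok(())
}
```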



[GitHub] [arrow-rs] zeevm commented on pull request #4156: Cleanup ChunkReader (#4118)

Posted by "zeevm (via GitHub)" <gi...@apache.org>.
zeevm commented on PR #4156:
URL: https://github.com/apache/arrow-rs/pull/4156#issuecomment-1553459656

   @tustvold the length argument was quite useful, I think. My main use case is processing Parquet files from cloud storage (Azure, S3, etc.), so I'm trying to minimize downloaded data to save both time and cost. I used the length argument (extending it somewhat if it was below some threshold) to download only what the reader required in a single 'read' op (cloud storage services usually charge per 10K ops plus bandwidth).
   
   Now I don't know how much to download; I'll have to develop a bunch of heuristics to arrive at sensible values.
   
   Since the reader always needs some specific entity (say a page, an entire column chunk, or the file metadata), it knows what length it needs, so why not provide this as a 'hint' to the implementation?
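
For reference, that kind of heuristic can live entirely inside a custom `ChunkReader` implementation; a rough sketch (the names and the 1 MiB floor are illustrative, not from this PR):

```rust
/// Minimum bytes to fetch per remote read, to amortise per-request cost
/// (illustrative value; tune per storage service).
const MIN_FETCH: u64 = 1024 * 1024;

/// Widen a requested range to at least MIN_FETCH bytes, clamped to the
/// file length; surplus bytes can be cached and served to later reads
/// without issuing another request.
fn widen_range(start: u64, requested: u64, file_len: u64) -> (u64, u64) {
    let len = requested.max(MIN_FETCH).min(file_len.saturating_sub(start));
    (start, len)
}
```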


[GitHub] [arrow-rs] zeevm commented on pull request #4156: Cleanup ChunkReader (#4118)

Posted by "zeevm (via GitHub)" <gi...@apache.org>.
zeevm commented on PR #4156:
URL: https://github.com/apache/arrow-rs/pull/4156#issuecomment-1554231892

   Understood.
   
   My considerations are:
   
   1. Using Arrow, I'll have to allocate double the memory and incur a lot of memory copies into our engine's native in-memory format.
   
   2. Our engine already does predicate pushdown, late materialization, applying predicates to dictionaries, and very fast scanning of raw dictionary IDs (skipping dictionary decoding entirely) with SIMD instructions, among other optimizations.
   

[GitHub] [arrow-rs] tustvold merged pull request #4156: Cleanup ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #4156:
URL: https://github.com/apache/arrow-rs/pull/4156


[GitHub] [arrow-rs] tustvold commented on pull request #4156: Cleanup ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #4156:
URL: https://github.com/apache/arrow-rs/pull/4156#issuecomment-1554200543

   Correct, the high-level API reads into Arrow and supports object stores, in addition to predicate pushdown, late materialization, row filtering, etc. The low-level APIs support advanced usage but are not nearly as batteries-included. I would encourage you to try out the Arrow APIs; you may find they are actually faster 😅
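
For reference, a minimal sketch of the high-level async API being recommended (module paths and builder options may differ between parquet crate versions; `data.parquet` is a placeholder):

```rust
use futures::TryStreamExt;
use parquet::arrow::ParquetRecordBatchStreamBuilder;

async fn read_file() -> Result<(), Box<dyn std::error::Error>> {
    let file = tokio::fs::File::open("data.parquet").await?;
    // The builder reads the footer itself and plans which byte ranges to
    // fetch; projection and row filters can be configured before `build`.
    let mut stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .build()?;
    while let Some(batch) = stream.try_next().await? {
        println!("batch with {} rows", batch.num_rows());
    }
    Ok(())
}
```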

