Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/04/27 17:34:58 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #4147: Document ChunkReader (#4118)

tustvold opened a new pull request, #4147:
URL: https://github.com/apache/arrow-rs/pull/4147

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #4118
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   




[GitHub] [arrow-rs] viirya commented on a diff in pull request #4147: Document ChunkReader (#4118)

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #4147:
URL: https://github.com/apache/arrow-rs/pull/4147#discussion_r1179694453


##########
parquet/src/file/reader.rs:
##########
@@ -43,13 +43,27 @@ pub trait Length {
     fn len(&self) -> u64;
 }
 
-/// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+/// The ChunkReader trait provides synchronous access to contiguous byte ranges of a source
 pub trait ChunkReader: Length + Send + Sync {
     type T: Read + Send;
     /// Get a serially readable slice of the current reader
-    /// This should fail if the slice exceeds the current bounds
+    ///
+    /// # IO Granularity
+    ///
+    /// The `length` parameter provides an upper bound on the amount of bytes that
+    /// will be read, however, it is intended purely as a hint.
+    ///
+    /// It is not guaranteed that `length` bytes will actually be read, nor are any guarantees
+    /// made on the size of `length`, it may be as large as a row group or as small as a couple
+    /// of bytes. It therefore should not be used to solely determine the granularity of
+    /// IO to the underlying storage system.
+    ///
+    /// Systems looking to mask high-IO latency through prefetching, such as encountered with
+    /// object storage, should consider fetching the relevant byte ranges into [`Bytes`]

Review Comment:
   Meaning that `get_read` could possibly read more than `length` bytes?
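
For context, a minimal sketch of the trait shape under discussion, assuming the `get_read` signature arrow-rs had at the time of this PR. The real definition lives in `parquet/src/file/reader.rs` and uses the crate's own `Result` type rather than `std::io::Result`:

```rust
// Sketch only; the actual trait returns parquet::errors::Result and
// carries the doc comments shown in the diff above.
use std::io::Read;

pub trait Length {
    /// Total length of the underlying source in bytes
    fn len(&self) -> u64;
}

pub trait ChunkReader: Length + Send + Sync {
    type T: Read + Send;

    /// Return a reader positioned at `start`; `length` is the hint whose
    /// semantics this review thread is debating
    fn get_read(&self, start: u64, length: usize) -> std::io::Result<Self::T>;
}
```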





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4147: Document ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4147:
URL: https://github.com/apache/arrow-rs/pull/4147#discussion_r1179493427


##########
parquet/src/file/reader.rs:
##########
@@ -43,13 +43,27 @@ pub trait Length {
     fn len(&self) -> u64;
 }
 
-/// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+/// The ChunkReader trait provides synchronous access to contiguous byte ranges of a source
 pub trait ChunkReader: Length + Send + Sync {
     type T: Read + Send;
     /// Get a serially readable slice of the current reader
-    /// This should fail if the slice exceeds the current bounds

Review Comment:
   This isn't actually true; `FileSource` doesn't do this
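
To illustrate the point, a hypothetical file-backed implementation in the spirit of the crate's `FileSource` (names and error handling are simplified assumptions, not the actual arrow-rs code). The returned reader is merely positioned at `start`; nothing bounds it to `length`, and no error is raised if `start + length` exceeds the file size:

```rust
// Hypothetical file-backed reader, simplified for illustration
use std::fs::File;
use std::io::{BufReader, Seek, SeekFrom};

struct LocalFile(File);

impl LocalFile {
    // Same shape as the trait's get_read at the time of this PR
    fn get_read(&self, start: u64, _length: usize) -> std::io::Result<BufReader<File>> {
        // Clone the handle, as the old doc comment described
        let mut file = self.0.try_clone()?;
        file.seek(SeekFrom::Start(start))?;
        // `_length` is ignored entirely: callers can read past it, and
        // nothing fails if the requested slice exceeds the file bounds
        Ok(BufReader::new(file))
    }
}
```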





[GitHub] [arrow-rs] viirya commented on a diff in pull request #4147: Document ChunkReader (#4118)

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #4147:
URL: https://github.com/apache/arrow-rs/pull/4147#discussion_r1180656767


##########
parquet/src/file/reader.rs:
##########
@@ -43,13 +43,28 @@ pub trait Length {
     fn len(&self) -> u64;
 }
 
-/// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+/// The ChunkReader trait provides synchronous access to contiguous byte ranges of a source
 pub trait ChunkReader: Length + Send + Sync {
     type T: Read + Send;
     /// Get a serially readable slice of the current reader
-    /// This should fail if the slice exceeds the current bounds
+    ///
+    /// # IO Granularity
+    ///
+    /// The `length` parameter provides an upper bound on the amount of bytes that
+    /// will be read, however, it is intended purely as a hint.
+    ///
+    /// It is not guaranteed that `length` bytes will actually be read, nor are any guarantees
+    /// made on the size of `length`, it may be as large as a row group or as small as a couple
+    /// of bytes. It therefore should not be used to solely determine the granularity of
+    /// IO to the underlying storage system.
+    ///
+    /// Systems looking to mask high-IO latency through prefetching, such as encountered with
+    /// object storage, should consider instead fetching the relevant parts of the file into
+    /// [`Bytes`], and then feeding this into the synchronous APIs, instead of implementing
+    /// [`ChunkReader`] directly. Arrow users can make use of the [async_reader] which

Review Comment:
   Hmm, let me try to clarify this. So this basically means that systems that want prefetching should not rely on this "characteristic" of `ChunkReader` to implement it, but should instead fetch the data themselves and feed it into the synchronous APIs?





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4147: Document ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4147:
URL: https://github.com/apache/arrow-rs/pull/4147#discussion_r1180668058


##########
parquet/src/file/reader.rs:
##########
@@ -43,13 +43,28 @@ pub trait Length {
     fn len(&self) -> u64;
 }
 
-/// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+/// The ChunkReader trait provides synchronous access to contiguous byte ranges of a source
 pub trait ChunkReader: Length + Send + Sync {
     type T: Read + Send;
     /// Get a serially readable slice of the current reader
-    /// This should fail if the slice exceeds the current bounds
+    ///
+    /// # IO Granularity
+    ///
+    /// The `length` parameter provides an upper bound on the amount of bytes that
+    /// will be read, however, it is intended purely as a hint.
+    ///
+    /// It is not guaranteed that `length` bytes will actually be read, nor are any guarantees
+    /// made on the size of `length`, it may be as large as a row group or as small as a couple
+    /// of bytes. It therefore should not be used to solely determine the granularity of
+    /// IO to the underlying storage system.
+    ///
+    /// Systems looking to mask high-IO latency through prefetching, such as encountered with
+    /// object storage, should consider instead fetching the relevant parts of the file into
+    /// [`Bytes`], and then feeding this into the synchronous APIs, instead of implementing
+    /// [`ChunkReader`] directly. Arrow users can make use of the [async_reader] which

Review Comment:
   Perhaps we should just remove the length parameter? What do you think?





[GitHub] [arrow-rs] tustvold commented on pull request #4147: Document ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #4147:
URL: https://github.com/apache/arrow-rs/pull/4147#issuecomment-1528076770

   Closing in favour of #4156 




[GitHub] [arrow-rs] viirya commented on a diff in pull request #4147: Document ChunkReader (#4118)

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #4147:
URL: https://github.com/apache/arrow-rs/pull/4147#discussion_r1180674018


##########
parquet/src/file/reader.rs:
##########
@@ -43,13 +43,28 @@ pub trait Length {
     fn len(&self) -> u64;
 }
 
-/// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+/// The ChunkReader trait provides synchronous access to contiguous byte ranges of a source
 pub trait ChunkReader: Length + Send + Sync {
     type T: Read + Send;
     /// Get a serially readable slice of the current reader
-    /// This should fail if the slice exceeds the current bounds
+    ///
+    /// # IO Granularity
+    ///
+    /// The `length` parameter provides an upper bound on the amount of bytes that
+    /// will be read, however, it is intended purely as a hint.
+    ///
+    /// It is not guaranteed that `length` bytes will actually be read, nor are any guarantees
+    /// made on the size of `length`, it may be as large as a row group or as small as a couple
+    /// of bytes. It therefore should not be used to solely determine the granularity of
+    /// IO to the underlying storage system.
+    ///
+    /// Systems looking to mask high-IO latency through prefetching, such as encountered with
+    /// object storage, should consider instead fetching the relevant parts of the file into
+    /// [`Bytes`], and then feeding this into the synchronous APIs, instead of implementing
+    /// [`ChunkReader`] directly. Arrow users can make use of the [async_reader] which

Review Comment:
   Yeah, the `length` parameter is a bit confusing. Other than serving as a hint, it doesn't seem to do much here.
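
For reference, one plausible shape for the trait with the hint removed, splitting position-only streaming from exact-range reads. This is a sketch of the direction being discussed, not necessarily what the follow-up PR merged:

```rust
// Sketch: a ChunkReader without the `length` hint on get_read
use bytes::Bytes;
use std::io::Read;

pub trait ChunkReader: Send + Sync {
    type T: Read + Send;

    /// A reader starting at `start`, with no length hint
    fn get_read(&self, start: u64) -> std::io::Result<Self::T>;

    /// Exactly `length` bytes beginning at `start`, or an error
    fn get_bytes(&self, start: u64, length: usize) -> std::io::Result<Bytes>;
}
```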





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4147: Document ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4147:
URL: https://github.com/apache/arrow-rs/pull/4147#discussion_r1179721202


##########
parquet/src/file/reader.rs:
##########
@@ -43,13 +43,27 @@ pub trait Length {
     fn len(&self) -> u64;
 }
 
-/// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+/// The ChunkReader trait provides synchronous access to contiguous byte ranges of a source
 pub trait ChunkReader: Length + Send + Sync {
     type T: Read + Send;
     /// Get a serially readable slice of the current reader
-    /// This should fail if the slice exceeds the current bounds
+    ///
+    /// # IO Granularity
+    ///
+    /// The `length` parameter provides an upper bound on the amount of bytes that
+    /// will be read, however, it is intended purely as a hint.
+    ///
+    /// It is not guaranteed that `length` bytes will actually be read, nor are any guarantees
+    /// made on the size of `length`, it may be as large as a row group or as small as a couple
+    /// of bytes. It therefore should not be used to solely determine the granularity of
+    /// IO to the underlying storage system.
+    ///
+    /// Systems looking to mask high-IO latency through prefetching, such as encountered with
+    /// object storage, should consider fetching the relevant byte ranges into [`Bytes`]

Review Comment:
   I meant more that this should be handled outside of `get_read`, i.e. don't use `ChunkReader` for these use-cases 😅
   
   Will see if I can clarify the wording tomorrow
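
The pattern being suggested, sketched under some assumptions: do the high-latency fetch up front, then hand the resulting `Bytes` to the synchronous reader, since `Bytes` implements `ChunkReader` in the parquet crate. The data source and error handling are simplified for illustration:

```rust
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn read_prefetched(data: Vec<u8>) -> Result<(), Box<dyn std::error::Error>> {
    // In a real system `data` would come from e.g. an object store GET,
    // issued at whatever granularity suits that storage system
    let bytes = Bytes::from(data);
    let reader = ParquetRecordBatchReaderBuilder::try_new(bytes)?.build()?;
    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```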





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4147: Document ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4147:
URL: https://github.com/apache/arrow-rs/pull/4147#discussion_r1180658985


##########
parquet/src/file/reader.rs:
##########
@@ -43,13 +43,28 @@ pub trait Length {
     fn len(&self) -> u64;
 }
 
-/// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+/// The ChunkReader trait provides synchronous access to contiguous byte ranges of a source
 pub trait ChunkReader: Length + Send + Sync {
     type T: Read + Send;
     /// Get a serially readable slice of the current reader
-    /// This should fail if the slice exceeds the current bounds
+    ///
+    /// # IO Granularity
+    ///
+    /// The `length` parameter provides an upper bound on the amount of bytes that
+    /// will be read, however, it is intended purely as a hint.
+    ///
+    /// It is not guaranteed that `length` bytes will actually be read, nor are any guarantees
+    /// made on the size of `length`, it may be as large as a row group or as small as a couple
+    /// of bytes. It therefore should not be used to solely determine the granularity of
+    /// IO to the underlying storage system.
+    ///
+    /// Systems looking to mask high-IO latency through prefetching, such as encountered with
+    /// object storage, should consider instead fetching the relevant parts of the file into
+    /// [`Bytes`], and then feeding this into the synchronous APIs, instead of implementing
+    /// [`ChunkReader`] directly. Arrow users can make use of the [async_reader] which

Review Comment:
   Pretty much: if you want prefetching you can either use the async_reader, which does it for you, or implement something yourself :smile: 
   
   One could reasonably ask why `ChunkReader` exists at all, to which the answer is that I haven't removed it yet, see #1163 :laughing: 
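
A sketch of the async_reader route, which handles the fetching for you. This uses the parquet crate's async API (behind the `async` feature) with a local tokio file for illustration; an object-store-backed `AsyncFileReader` slots in the same way:

```rust
use futures::TryStreamExt;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // tokio::fs::File implements AsyncFileReader out of the box
    let file = tokio::fs::File::open("data.parquet").await?;
    let mut stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .build()?;
    // The stream fetches the byte ranges it needs ahead of decoding
    while let Some(batch) = stream.try_next().await? {
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```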





[GitHub] [arrow-rs] tustvold closed pull request #4147: Document ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed pull request #4147: Document ChunkReader (#4118)
URL: https://github.com/apache/arrow-rs/pull/4147




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4147: Document ChunkReader (#4118)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4147:
URL: https://github.com/apache/arrow-rs/pull/4147#discussion_r1180372093


##########
parquet/src/file/reader.rs:
##########
@@ -43,13 +43,27 @@ pub trait Length {
     fn len(&self) -> u64;
 }
 
-/// The ChunkReader trait generates readers of chunks of a source.
-/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
-/// For an object store reader, each read can be mapped to a range request.
+/// The ChunkReader trait provides synchronous access to contiguous byte ranges of a source
 pub trait ChunkReader: Length + Send + Sync {
     type T: Read + Send;
     /// Get a serially readable slice of the current reader
-    /// This should fail if the slice exceeds the current bounds
+    ///
+    /// # IO Granularity
+    ///
+    /// The `length` parameter provides an upper bound on the amount of bytes that
+    /// will be read, however, it is intended purely as a hint.
+    ///
+    /// It is not guaranteed that `length` bytes will actually be read, nor are any guarantees
+    /// made on the size of `length`, it may be as large as a row group or as small as a couple
+    /// of bytes. It therefore should not be used to solely determine the granularity of
+    /// IO to the underlying storage system.
+    ///
+    /// Systems looking to mask high-IO latency through prefetching, such as encountered with
+    /// object storage, should consider fetching the relevant byte ranges into [`Bytes`]

Review Comment:
   Done, PTAL


