Posted to github@arrow.apache.org by "jp0317 (via GitHub)" <gi...@apache.org> on 2023/06/20 16:48:52 UTC

[GitHub] [arrow] jp0317 opened a new pull request, #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

jp0317 opened a new pull request, #36180:
URL: https://github.com/apache/arrow/pull/36180

   ### Rationale for this change
   
   The current [ReadRangeCache](https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/caching.h#L100) supports a lazy mode, where each range is read upon an [explicit request](https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/caching.cc#L255), and a non-lazy mode, where all ranges are read concurrently [at once](https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/caching.cc#L168). In some scenarios it would help to support a prefetch mode (which is naturally bound to the lazy mode) such that explicitly reading one range will also prefetch one or more of the following ranges.
   
   ### What changes are included in this PR?
   
   Add a new cache option `prefetch_limit` to define the maximum number of ranges to be prefetched when reading one range in lazy mode.
   
   
   ### Are these changes tested?
   
   Yes. A new unit test is included.
   
   ### Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #36180:
URL: https://github.com/apache/arrow/pull/36180#discussion_r1244526828


##########
cpp/src/arrow/io/memory_test.cc:
##########
@@ -845,13 +845,61 @@ TEST(RangeReadCache, Lazy) {
   ASSERT_EQ(3, file->read_count());
 }
 
+TEST(RangeReadCache, LazyWithPrefetching) {
+  std::string data = "abcdefghijklmnopqrstuvwxyz";
+
+  auto file = std::make_shared<CountingBufferReader>(Buffer(data));
+  CacheOptions options = CacheOptions::LazyDefaults();
+  options.hole_size_limit = 1;
+  options.range_size_limit = 3;
+  options.prefetch_limit = 2;
+  internal::ReadRangeCache cache(file, {}, options);
+
+  ASSERT_OK(cache.Cache({{1, 1}, {3, 1}, {5, 2}, {8, 2}, {20, 2}, {25, 0}}));
+
+  // Lazy cache doesn't fetch ranges until requested
+  ASSERT_EQ(0, file->read_count());
+
+  ASSERT_OK_AND_ASSIGN(auto buf, cache.Read({8, 2}));
+  AssertBufferEqual(*buf, "ij");
+  // Read {8, 2} and prefetch {20, 2}
+  ASSERT_EQ(2, file->read_count());
+
+  ASSERT_OK_AND_ASSIGN(buf, cache.Read({20, 2}));
+  AssertBufferEqual(*buf, "uv");
+  // Read count remains 2 as the range {20, 2} has already been prefetched
+  ASSERT_EQ(2, file->read_count());
+
+  ASSERT_OK_AND_ASSIGN(buf, cache.Read({1, 1}));
+  AssertBufferEqual(*buf, "b");
+  // Read {1, 3} and prefetch {5, 2}
+  ASSERT_EQ(4, file->read_count());
+
+  ASSERT_OK_AND_ASSIGN(buf, cache.Read({3, 1}));
+  AssertBufferEqual(*buf, "d");
+  // Already prefetched
+  ASSERT_EQ(4, file->read_count());
+
+  // Requested ranges are still cached
+  ASSERT_OK_AND_ASSIGN(buf, cache.Read({5, 1}));
+  AssertBufferEqual(*buf, "f");
+  // Already prefetched
+  ASSERT_EQ(4, file->read_count());
+
+  // Non-cached ranges
+  ASSERT_RAISES(Invalid, cache.Read({20, 3}));
+  ASSERT_RAISES(Invalid, cache.Read({19, 3}));
+  ASSERT_RAISES(Invalid, cache.Read({0, 3}));
+  ASSERT_RAISES(Invalid, cache.Read({25, 2}));
+}
+
 TEST(CacheOptions, Basics) {
   auto check = [](const CacheOptions actual, const double expected_hole_size_limit_MiB,
                   const double expected_range_size_limit_MiB) -> void {
     const CacheOptions expected = {
         static_cast<int64_t>(std::round(expected_hole_size_limit_MiB * 1024 * 1024)),
         static_cast<int64_t>(std::round(expected_range_size_limit_MiB * 1024 * 1024)),
-        /*lazy=*/false};
+        /*lazy=*/false, /*prefetch_limit=*/0};

Review Comment:
   Ah, I thought you had added the value here because it was necessary, or was it just to be explicit?





[GitHub] [arrow] jp0317 commented on a diff in pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36180:
URL: https://github.com/apache/arrow/pull/36180#discussion_r1244509746


##########
cpp/src/arrow/io/caching.h:
##########
@@ -43,10 +43,14 @@ struct ARROW_EXPORT CacheOptions {
   int64_t range_size_limit;
   /// \brief A lazy cache does not perform any I/O until requested.
   bool lazy;
+  /// \brief The maximum number of ranges to be prefetched. This is only used

Review Comment:
   Added some explanations





[GitHub] [arrow] lidavidm commented on a diff in pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #36180:
URL: https://github.com/apache/arrow/pull/36180#discussion_r1244527104


##########
cpp/src/arrow/io/memory_test.cc:
##########
@@ -845,13 +845,61 @@ TEST(RangeReadCache, Lazy) {
 TEST(CacheOptions, Basics) {
   auto check = [](const CacheOptions actual, const double expected_hole_size_limit_MiB,
                   const double expected_range_size_limit_MiB) -> void {
     const CacheOptions expected = {
         static_cast<int64_t>(std::round(expected_hole_size_limit_MiB * 1024 * 1024)),
         static_cast<int64_t>(std::round(expected_range_size_limit_MiB * 1024 * 1024)),
-        /*lazy=*/false};
+        /*lazy=*/false, /*prefetch_limit=*/0};

Review Comment:
   I suppose from looking it up again that it would be OK to omit this too, which is all I was wondering about.





[GitHub] [arrow] jp0317 commented on a diff in pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36180:
URL: https://github.com/apache/arrow/pull/36180#discussion_r1244553997


##########
cpp/src/arrow/io/memory_test.cc:
##########
@@ -845,13 +845,61 @@ TEST(RangeReadCache, Lazy) {
 TEST(CacheOptions, Basics) {
   auto check = [](const CacheOptions actual, const double expected_hole_size_limit_MiB,
                   const double expected_range_size_limit_MiB) -> void {
     const CacheOptions expected = {
         static_cast<int64_t>(std::round(expected_hole_size_limit_MiB * 1024 * 1024)),
         static_cast<int64_t>(std::round(expected_range_size_limit_MiB * 1024 * 1024)),
-        /*lazy=*/false};
+        /*lazy=*/false, /*prefetch_limit=*/0};

Review Comment:
   I see, I removed this change, thanks!





[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36180:
URL: https://github.com/apache/arrow/pull/36180#issuecomment-1617106861

   Conbench analyzed the 6 benchmark runs on commit `72ec35e6`.
   
   There were 11 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-thinkcentre-m75q` at [2023-07-01 04:37:22Z](http://conbench.ursa.dev/compare/runs/11e3a9a04da44566a344fef8c070f122...caf85be31cf54ba79edaefca6d2cde98/)
     - [params=32768, source=cpp-micro, suite=parquet-encoding-benchmark](http://conbench.ursa.dev/compare/benchmarks/0649f819351277c48000bb45efe0a230...0649fae16bac7b348000a7a3b2ed3455)
     - [params=DecodeArrowNonNull_Dense/4096, source=cpp-micro, suite=parquet-encoding-benchmark](http://conbench.ursa.dev/compare/benchmarks/0649f818a88475d18000a928d06a8ed7...0649fae0b03278c08000df40a40ab2b8)
   - and 9 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14724866811) has more details.




[GitHub] [arrow] github-actions[bot] commented on pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36180:
URL: https://github.com/apache/arrow/pull/36180#issuecomment-1599156296

   :warning: GitHub issue #36178 **has been automatically assigned in GitHub** to PR creator.




[GitHub] [arrow] jp0317 commented on a diff in pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36180:
URL: https://github.com/apache/arrow/pull/36180#discussion_r1244510388


##########
cpp/src/arrow/io/memory_test.cc:
##########
@@ -845,13 +845,61 @@ TEST(RangeReadCache, Lazy) {
 TEST(CacheOptions, Basics) {
   auto check = [](const CacheOptions actual, const double expected_hole_size_limit_MiB,
                   const double expected_range_size_limit_MiB) -> void {
     const CacheOptions expected = {
         static_cast<int64_t>(std::round(expected_hole_size_limit_MiB * 1024 * 1024)),
         static_cast<int64_t>(std::round(expected_range_size_limit_MiB * 1024 * 1024)),
-        /*lazy=*/false};
+        /*lazy=*/false, /*prefetch_limit=*/0};

Review Comment:
   The `prefetch_limit` has a default value, do we still need explicit constructors?





[GitHub] [arrow] lidavidm merged pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm merged PR #36180:
URL: https://github.com/apache/arrow/pull/36180




[GitHub] [arrow] jp0317 commented on pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on PR #36180:
URL: https://github.com/apache/arrow/pull/36180#issuecomment-1611601268

   > I'm curious about the cases where you've found this useful?
   
   I think it may help in cases where someone wants to stream ranges in, asynchronously fetching the following range(s) while processing data from the current range.




[GitHub] [arrow] emkornfield commented on pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on PR #36180:
URL: https://github.com/apache/arrow/pull/36180#issuecomment-1610023061

   @jp0317 could you sync to head to see if CI issues go away?
   
   @lidavidm would you mind taking a look?




[GitHub] [arrow] lidavidm commented on a diff in pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #36180:
URL: https://github.com/apache/arrow/pull/36180#discussion_r1244343178


##########
cpp/src/arrow/io/caching.h:
##########
@@ -43,10 +43,14 @@ struct ARROW_EXPORT CacheOptions {
   int64_t range_size_limit;
   /// \brief A lazy cache does not perform any I/O until requested.
   bool lazy;
+  /// \brief The maximum number of ranges to be prefetched. This is only used

Review Comment:
   Explain what prefetching means here?



##########
cpp/src/arrow/io/memory_test.cc:
##########
@@ -845,13 +845,61 @@ TEST(RangeReadCache, Lazy) {
 TEST(CacheOptions, Basics) {
   auto check = [](const CacheOptions actual, const double expected_hole_size_limit_MiB,
                   const double expected_range_size_limit_MiB) -> void {
     const CacheOptions expected = {
         static_cast<int64_t>(std::round(expected_hole_size_limit_MiB * 1024 * 1024)),
         static_cast<int64_t>(std::round(expected_range_size_limit_MiB * 1024 * 1024)),
-        /*lazy=*/false};
+        /*lazy=*/false, /*prefetch_limit=*/0};

Review Comment:
   For compatibility, we should probably add explicit constructors here now



##########
cpp/src/arrow/io/caching.cc:
##########
@@ -204,6 +206,17 @@ struct ReadRangeCache::Impl {
     if (it != entries.end() && it->range.Contains(range)) {
       auto fut = MaybeRead(&*it);
       ARROW_ASSIGN_OR_RAISE(auto buf, fut.result());
+      if (options.lazy && options.prefetch_limit > 0) {
+        int64_t num_prefetched = 0;
+        for (auto next_it = it + 1;
+             next_it != entries.end() && num_prefetched++ < options.prefetch_limit;

Review Comment:
   nit: instead of confusing postfix, just increment the counter inside the loop
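
   For illustration, the suggested change can be sketched in isolation. The snippet below is a self-contained toy (the `Entry` struct and `PrefetchFollowing` name are invented here, not Arrow's actual types), showing one plausible reading of the fix: the counter is incremented inside the loop body rather than via a postfix expression in the loop condition, so only entries that actually trigger a fetch count against the limit:
   
   ```cpp
   #include <cstdint>
   #include <vector>
   
   // Toy stand-in for a cache entry; illustrative only.
   struct Entry {
     bool fetched = false;
   };
   
   // Start fetches for up to `prefetch_limit` unfetched entries after
   // index `it`. The counter is bumped in the loop body, as suggested.
   int64_t PrefetchFollowing(std::vector<Entry>& entries, std::size_t it,
                             int64_t prefetch_limit) {
     int64_t num_prefetched = 0;
     for (std::size_t next = it + 1;
          next < entries.size() && num_prefetched < prefetch_limit; ++next) {
       if (!entries[next].fetched) {
         entries[next].fetched = true;  // would issue an async read in Arrow
         ++num_prefetched;
       }
     }
     return num_prefetched;
   }
   ```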





[GitHub] [arrow] jp0317 commented on a diff in pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36180:
URL: https://github.com/apache/arrow/pull/36180#discussion_r1244509487


##########
cpp/src/arrow/io/caching.cc:
##########
@@ -204,6 +206,17 @@ struct ReadRangeCache::Impl {
     if (it != entries.end() && it->range.Contains(range)) {
       auto fut = MaybeRead(&*it);
       ARROW_ASSIGN_OR_RAISE(auto buf, fut.result());
+      if (options.lazy && options.prefetch_limit > 0) {
+        int64_t num_prefetched = 0;
+        for (auto next_it = it + 1;
+             next_it != entries.end() && num_prefetched++ < options.prefetch_limit;

Review Comment:
   Thanks for your review! Updated as suggested.





[GitHub] [arrow] mapleFU commented on pull request #36180: GH-36178: [C++]support prefetching for ReadRangeCache lazy mode

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36180:
URL: https://github.com/apache/arrow/pull/36180#issuecomment-1686149779

   So, in this patch, when we have io `[0, 1, ...99]`
   
   1. When 0 is read, `1-9` will be prefetched
   2. When 1 is read, it would wait for the result to be read, and fetch `10-17`?
   
   Am I right?
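
   For what it's worth, the read counts asserted in the unit test quoted above can be reproduced with a small toy model (invented names, not Arrow's API): reading one entry also starts fetches for up to `prefetch_limit` not-yet-fetched entries that follow it:
   
   ```cpp
   #include <cstddef>
   #include <vector>
   
   // Toy model (not Arrow code) of a lazy read cache with prefetching.
   class ToyLazyCache {
    public:
     ToyLazyCache(std::size_t num_entries, int prefetch_limit)
         : fetched_(num_entries, false), prefetch_limit_(prefetch_limit) {}
   
     // Reading entry i fetches it, then prefetches up to prefetch_limit_
     // unfetched entries after i.
     void Read(std::size_t i) {
       Fetch(i);
       int prefetched = 0;
       for (std::size_t j = i + 1;
            j < fetched_.size() && prefetched < prefetch_limit_; ++j) {
         if (!fetched_[j]) {
           Fetch(j);
           ++prefetched;
         }
       }
     }
   
     int read_count() const { return read_count_; }
   
    private:
     void Fetch(std::size_t i) {
       if (!fetched_[i]) {
         fetched_[i] = true;
         ++read_count_;  // stands in for one underlying I/O call
       }
     }
   
     std::vector<bool> fetched_;
     int prefetch_limit_;
     int read_count_ = 0;
   };
   ```
   
   With four entries standing in for the coalesced ranges `{1,3}`, `{5,2}`, `{8,2}`, `{20,2}` from the test and `prefetch_limit = 2`, reading entry 2 issues two underlying reads (itself plus one prefetch), and a later read of entry 0 issues two more, matching the `read_count()` expectations in the diff.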

