You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jp0317 (via GitHub)" <gi...@apache.org> on 2023/06/20 22:15:42 UTC

[GitHub] [arrow] jp0317 opened a new pull request, #36192: PARQUET-36189:[C++]: Allow partial prebuffer in parquet FileReader

jp0317 opened a new pull request, #36192:
URL: https://github.com/apache/arrow/pull/36192

   ### Rationale for this change
   
   The current FileReader can only work inĀ  one of the two modes, coalescing (when Prebuffer is called) and non-coalescing (when Prefufer is not called), due to the if statement [here](https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L203)
   
   Since Prebuffer is basically caching all specified column chunks, it would raise concerns on memory usage for systems with tight memory budget. In such scenarios, one may want to Prebuffer some small chunks while being able to read the rest chunks usingĀ  BufferedInputStream.
   
   ### What changes are included in this PR?
   
   Changes to support partial prebuffer on a subset of column chunks and a unit test
   
   ### Are these changes tested?
   
   Yes.
   
   ### Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1244022508


##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -2382,6 +2382,38 @@ TEST(TestArrowReadWrite, WaitCoalescedReads) {
   ASSERT_EQ(actual_batch->num_rows(), num_rows);
 }
 
+// Use coalesced reads and non-coaleasced reads for different column chunks.
+TEST(TestArrowReadWrite, CoalescedReadsAndNonCoalescedReads) {
+  constexpr int num_columns = 5;
+  constexpr int num_rows = 128;
+
+  std::shared_ptr<Table> expected;
+  ASSERT_NO_FATAL_FAILURE(
+      MakeDoubleTable(num_columns, num_rows, /*nchunks=*/1, &expected));
+
+  std::shared_ptr<Buffer> buffer;
+  ASSERT_NO_FATAL_FAILURE(WriteTableToBuffer(expected, num_rows / 2,
+                                             default_arrow_writer_properties(), &buffer));
+
+  std::unique_ptr<FileReader> reader;
+  ASSERT_OK_NO_THROW(OpenFile(std::make_shared<BufferReader>(buffer),
+                              ::arrow::default_memory_pool(), &reader));
+
+  ASSERT_EQ(2, reader->num_row_groups());
+
+  // Pre-buffer 3 columns in the 2nd row group.
+  const std::vector<int> row_groups = {1};
+  const std::vector<int> column_indices = {0, 1, 4};
+  reader->parquet_reader()->PreBuffer(row_groups, column_indices,
+                                      ::arrow::io::IOContext(),
+                                      ::arrow::io::CacheOptions::Defaults());
+  ASSERT_OK(reader->parquet_reader()->WhenBuffered(row_groups, column_indices).status());
+
+  ASSERT_OK_AND_ASSIGN(auto actual, ReadTableManually(reader.get()));

Review Comment:
   yes, https://github.com/apache/arrow/blob/03bf9a766d50a4bfc7681b560d2f84947ba4e05d/cpp/src/parquet/arrow/arrow_reader_writer_test.cc#L2260



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1243627575


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -297,9 +302,17 @@ class SerializedFile : public ParquetFileReader::Contents {
   }
 
   std::shared_ptr<RowGroupReader> GetRowGroup(int i) override {
+    std::unordered_set<int> prebuffered_column_chunks;
+    // Avoid updating the map as this function can be called concurrently. The map can
+    // only be updated within Prebuffer().
+    auto prebuffered_column_chunks_iter = prebuffered_column_chunks_.find(i);
+    if (prebuffered_column_chunks_iter != prebuffered_column_chunks_.end()) {
+      prebuffered_column_chunks = prebuffered_column_chunks_iter->second;
+    }
+
     std::unique_ptr<SerializedRowGroup> contents = std::make_unique<SerializedRowGroup>(
         source_, cached_source_, source_size_, file_metadata_.get(), i, properties_,
-        file_decryptor_);
+        prebuffered_column_chunks, file_decryptor_);

Review Comment:
   The move here means that if you read this row group again, it won't make use of the pre-buffered data. Is that intentional? If so, it may be clearer to explicitly erase the map entry after moving as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1243936260


##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -2382,6 +2382,38 @@ TEST(TestArrowReadWrite, WaitCoalescedReads) {
   ASSERT_EQ(actual_batch->num_rows(), num_rows);
 }
 
+// Use coalesced reads and non-coaleasced reads for different column chunks.
+TEST(TestArrowReadWrite, CoalescedReadsAndNonCoalescedReads) {
+  constexpr int num_columns = 5;
+  constexpr int num_rows = 128;
+
+  std::shared_ptr<Table> expected;
+  ASSERT_NO_FATAL_FAILURE(
+      MakeDoubleTable(num_columns, num_rows, /*nchunks=*/1, &expected));
+
+  std::shared_ptr<Buffer> buffer;
+  ASSERT_NO_FATAL_FAILURE(WriteTableToBuffer(expected, num_rows / 2,
+                                             default_arrow_writer_properties(), &buffer));
+
+  std::unique_ptr<FileReader> reader;
+  ASSERT_OK_NO_THROW(OpenFile(std::make_shared<BufferReader>(buffer),
+                              ::arrow::default_memory_pool(), &reader));
+
+  ASSERT_EQ(2, reader->num_row_groups());
+
+  // Pre-buffer 3 columns in the 2nd row group.
+  const std::vector<int> row_groups = {1};
+  const std::vector<int> column_indices = {0, 1, 4};
+  reader->parquet_reader()->PreBuffer(row_groups, column_indices,
+                                      ::arrow::io::IOContext(),
+                                      ::arrow::io::CacheOptions::Defaults());
+  ASSERT_OK(reader->parquet_reader()->WhenBuffered(row_groups, column_indices).status());
+
+  ASSERT_OK_AND_ASSIGN(auto actual, ReadTableManually(reader.get()));

Review Comment:
   Does ReadTableManually read the entire table?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1242445314


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -178,13 +179,15 @@ class SerializedRowGroup : public RowGroupReader::Contents {
                      std::shared_ptr<::arrow::io::internal::ReadRangeCache> cached_source,
                      int64_t source_size, FileMetaData* file_metadata,
                      int row_group_number, const ReaderProperties& props,
+                     const std::unordered_set<int>& prebuffered_column_chunks,

Review Comment:
   Done. Thanks!



##########
cpp/src/parquet/file_reader.cc:
##########
@@ -297,9 +302,17 @@ class SerializedFile : public ParquetFileReader::Contents {
   }
 
   std::shared_ptr<RowGroupReader> GetRowGroup(int i) override {
+    std::unordered_set<int> prebuffered_column_chunks;
+    // Avoid updating the map as this function can be called concurrently. The map can
+    // only be updated within Prebuffer().
+    auto prebuffered_column_chunks_iter = prebuffered_column_chunks_.find(i);
+    if (prebuffered_column_chunks_iter != prebuffered_column_chunks_.end()) {
+      prebuffered_column_chunks = prebuffered_column_chunks_iter->second;
+    }
+
     std::unique_ptr<SerializedRowGroup> contents = std::make_unique<SerializedRowGroup>(
         source_, cached_source_, source_size_, file_metadata_.get(), i, properties_,
-        file_decryptor_);
+        prebuffered_column_chunks, file_decryptor_);

Review Comment:
   Done. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1623767491

   Thanks for the reply! @westonpace 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36192: PARQUET-36189:[C++]: Allow partial prebuffer in parquet FileReader

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1599645086

   https://issues.apache.org/jira/browse/PARQUET-36189


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1606338805

   > Thanks for fixing the CI! I can confirm that the only failure is untreated to this patch. I have left some comments in addressing the readability in the test. Rest LGTM
   
   Thanks for your review! Updated as suggested. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1241369271


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -351,8 +364,11 @@ class SerializedFile : public ParquetFileReader::Contents {
     cached_source_ =
         std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, ctx, options);
     std::vector<::arrow::io::ReadRange> ranges;
+    prebuffered_column_chunks_.clear();

Review Comment:
   After double check, I think there could be concurrent issues. If `PreBuffer` and `GetRowGroup` can be called concurrently, `cached_source_` and `prebuffered_column_chunks_` might be in an inconsistent state. For example, `GetRowGroup` reads an old state in `prebuffered_column_chunks_`, then `PreBuffer` creates brand new `cached_source_` and `prebuffered_column_chunks_`. After that, `GetRowGroup` may pass the new `cached_source_` to the RowGroupReader. I believe the followup read will get an error complaining `Status::Invalid("ReadRangeCache did not find matching cache entry")`.
   
   cc @westonpace 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36192: PARQUET-2316:[C++]: Allow partial prebuffer in parquet FileReader

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1602942262

   Could you please fix the lint issue here: https://github.com/apache/arrow/actions/runs/5329892746/jobs/9656017234?pr=36192
   
   There is also a test failure, not sure if it is related: https://github.com/apache/arrow/actions/runs/5329892751/jobs/9656017548?pr=36192
   
   ```
   The following tests FAILED:
   Errors while running CTest
   	 63 - arrow-dataset-file-parquet-test (Timeout)
   Error: Process completed with exit code 8.
   ```
   
   Also cc @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1243719449


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -297,9 +302,17 @@ class SerializedFile : public ParquetFileReader::Contents {
   }
 
   std::shared_ptr<RowGroupReader> GetRowGroup(int i) override {
+    std::unordered_set<int> prebuffered_column_chunks;
+    // Avoid updating the map as this function can be called concurrently. The map can
+    // only be updated within Prebuffer().
+    auto prebuffered_column_chunks_iter = prebuffered_column_chunks_.find(i);
+    if (prebuffered_column_chunks_iter != prebuffered_column_chunks_.end()) {
+      prebuffered_column_chunks = prebuffered_column_chunks_iter->second;
+    }
+
     std::unique_ptr<SerializedRowGroup> contents = std::make_unique<SerializedRowGroup>(
         source_, cached_source_, source_size_, file_metadata_.get(), i, properties_,
-        file_decryptor_);
+        prebuffered_column_chunks, file_decryptor_);

Review Comment:
   I'm not sure why it won't use the prebuffer as this `prebuffered_column_chunks` is a local object. Could you please elaborate?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1241485010


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -351,8 +364,11 @@ class SerializedFile : public ParquetFileReader::Contents {
     cached_source_ =
         std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, ctx, options);
     std::vector<::arrow::io::ReadRange> ranges;
+    prebuffered_column_chunks_.clear();

Review Comment:
   Thanks for bringing this up! If they can be called concurrently, then wouldn't the problem already exists before this PR, i.e., we already have race on the  `cached_source_` (`GetRowGroup` reads it while `Prebuffer` writes it)? 
   
   IIUC the `Prebuffer` is not for concurrent scenarios based on [here](https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L351) and the function [docs](https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.h#L166-L173) seems to imply a strict order between calling `Prebuffer` and creating reader / reading data.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1606978787

   @westonpace @lidavidm Could one of you take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1243771542


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -297,9 +302,17 @@ class SerializedFile : public ParquetFileReader::Contents {
   }
 
   std::shared_ptr<RowGroupReader> GetRowGroup(int i) override {
+    std::unordered_set<int> prebuffered_column_chunks;
+    // Avoid updating the map as this function can be called concurrently. The map can
+    // only be updated within Prebuffer().
+    auto prebuffered_column_chunks_iter = prebuffered_column_chunks_.find(i);
+    if (prebuffered_column_chunks_iter != prebuffered_column_chunks_.end()) {
+      prebuffered_column_chunks = prebuffered_column_chunks_iter->second;
+    }
+
     std::unique_ptr<SerializedRowGroup> contents = std::make_unique<SerializedRowGroup>(
         source_, cached_source_, source_size_, file_metadata_.get(), i, properties_,
-        file_decryptor_);
+        prebuffered_column_chunks, file_decryptor_);

Review Comment:
   argh, you're right, we copied here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36192: PARQUET-2316:[C++]: Allow partial prebuffer in parquet FileReader

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1599645895

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1633303562

   Conbench analyzed the 6 benchmark runs on commit `ec77fab2`.
   
   There were 5 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-thinkcentre-m75q` at [2023-07-09 23:38:50Z](http://conbench.ursa.dev/compare/runs/da6ab12ec469411a9513742177ba559d...a726c6e3d837485baae7bb2103892076/)
     - [params=<false,BooleanType>, source=cpp-micro, suite=parquet-arrow-reader-writer-benchmark](http://conbench.ursa.dev/compare/benchmarks/064ab1aac4ab772f8000b767cbe4bec9...064ab45588b27f7c8000581c2aa6e6f1)
     - [params=<FLBAType>, source=cpp-micro, suite=parquet-bloom-filter-benchmark](http://conbench.ursa.dev/compare/benchmarks/064ab1b148d2751a80001bcf048e81b4...064ab45ac57676148000426ed08f6851)
   - and 3 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14996504155) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1241517898


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -351,8 +364,11 @@ class SerializedFile : public ParquetFileReader::Contents {
     cached_source_ =
         std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, ctx, options);
     std::vector<::arrow::io::ReadRange> ranges;
+    prebuffered_column_chunks_.clear();

Review Comment:
   That sounds reasonable. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac merged pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac merged PR #36192:
URL: https://github.com/apache/arrow/pull/36192


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1609860015

   > I don't see anything wrong with this. I do think it might be hard to use. How is a user going to know which row groups or columns they are supposed to prebuffer? Maybe this is for a situation where a user needs to read the same parquet file over and over again and they can highly tune this operation?
   
   Thanks for your review. I think users can know the column chunk size by reading metadata and make their own choices.  Repeatedly reading and tuning can be a use case but i think this pr might be used for more fine-grained reading on individual column chunk level. 
   
   > Also, you may want to look into clearing the pre-buffer cache after a row group is read. That could help start releasing memory as the file is being read.
   
   Yes, clearing pre-buffer cache can be separate improvement?
   
   > Would this support a use case allowing us to use prebuffering and still read one row group at a time like:
   >
   > PreBuffer row group 0 Read row group 0 PreBuffer row group 1 Read row group 1 ...
   
   Yes, in addition i think it can be: Get column chunk sizes -> Determine prebuffer target chunks -> Call prebuffer -> Read row group -> ...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36192: PARQUET-2316:[C++]: Allow partial prebuffer in parquet FileReader

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1599645801

   https://issues.apache.org/jira/browse/PARQUET-2316


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on a diff in pull request #36192: PARQUET-2316:[C++]: Allow partial prebuffer in parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1236249409


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -268,6 +272,7 @@ class SerializedRowGroup : public RowGroupReader::Contents {
   std::unique_ptr<RowGroupMetaData> row_group_metadata_;
   ReaderProperties properties_;
   int row_group_ordinal_;
+  std::unordered_set<int>& prebuffered_column_chunks_;

Review Comment:
   thanks for the review. Changed as suggested.



##########
cpp/src/parquet/file_reader.cc:
##########
@@ -178,13 +179,15 @@ class SerializedRowGroup : public RowGroupReader::Contents {
                      std::shared_ptr<::arrow::io::internal::ReadRangeCache> cached_source,
                      int64_t source_size, FileMetaData* file_metadata,
                      int row_group_number, const ReaderProperties& props,
+                     std::unordered_set<int>& prebuffered_column_chunks,

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on pull request #36192: PARQUET-2316:[C++]: Allow partial prebuffer in parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1603458421

   > Could you please fix the lint issue here: https://github.com/apache/arrow/actions/runs/5329892746/jobs/9656017234?pr=36192
   > 
   > There is also a test failure, not sure if it is related: https://github.com/apache/arrow/actions/runs/5329892751/jobs/9656017548?pr=36192
   > 
   > ```
   > The following tests FAILED:
   > Errors while running CTest
   > 	 63 - arrow-dataset-file-parquet-test (Timeout)
   > Error: Process completed with exit code 8.
   > ```
   > 
   > Also cc @pitrou
   
   Fixed the link error.  The other error seems to be an irrelevant timeout


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1621069107

   @westonpace Do you still have any concern?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36192: PARQUET-2316:[C++]: Allow partial prebuffer in parquet FileReader

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1236200894


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -268,6 +272,7 @@ class SerializedRowGroup : public RowGroupReader::Contents {
   std::unique_ptr<RowGroupMetaData> row_group_metadata_;
   ReaderProperties properties_;
   int row_group_ordinal_;
+  std::unordered_set<int>& prebuffered_column_chunks_;

Review Comment:
   IMO, this would be better to use `const std::unordered_set<int>`, without reference.



##########
cpp/src/parquet/file_reader.cc:
##########
@@ -178,13 +179,15 @@ class SerializedRowGroup : public RowGroupReader::Contents {
                      std::shared_ptr<::arrow::io::internal::ReadRangeCache> cached_source,
                      int64_t source_size, FileMetaData* file_metadata,
                      int row_group_number, const ReaderProperties& props,
+                     std::unordered_set<int>& prebuffered_column_chunks,

Review Comment:
   `const std::unordered_set<int>&` or `std::unordered_set<int>`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on pull request #36192: PARQUET-2316:[C++]: Allow partial prebuffer in parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1604689401

   > There are some tests failures:
   > 
   > https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47375345
   > 
   > ```
   > [----------] 32 tests from TestScan/TestParquetFileFormatScan
   > [ RUN      ] TestScan/TestParquetFileFormatScan.ScanRecordBatchReader/0Threaded16b1024r
   > unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.
   > [  FAILED  ] TestScan/TestParquetFileFormatScan.ScanRecordBatchReader/0Threaded16b1024r, where GetParam() = Threaded16b1024r (30 ms)
   > ```
   > 
   > It seems that there are ASAN issues: https://github.com/apache/arrow/actions/runs/5352677942/jobs/9707827704?pr=36192
   
   Thanks! Fixed. It seems that the tests are calling `GetRowGroup()` concurrently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1241001888


##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -2382,6 +2382,34 @@ TEST(TestArrowReadWrite, WaitCoalescedReads) {
   ASSERT_EQ(actual_batch->num_rows(), num_rows);
 }
 
+// Use coalesced reads and non-coaleasced reads for different column chunks.
+TEST(TestArrowReadWrite, CoalescedReadsAndNonCoalescedReads) {
+  const int num_columns = 5;
+  const int num_rows = 128;

Review Comment:
   ```suggestion
     constexpr int num_columns = 5;
     constexpr int num_rows = 128;
   ```



##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -2382,6 +2382,34 @@ TEST(TestArrowReadWrite, WaitCoalescedReads) {
   ASSERT_EQ(actual_batch->num_rows(), num_rows);
 }
 
+// Use coalesced reads and non-coaleasced reads for different column chunks.
+TEST(TestArrowReadWrite, CoalescedReadsAndNonCoalescedReads) {
+  const int num_columns = 5;
+  const int num_rows = 128;
+
+  std::shared_ptr<Table> expected;
+  ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(num_columns, num_rows, 1, &expected));

Review Comment:
   ```suggestion
     ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(num_columns, num_rows, /*nchunks=*/1, &expected));
   ```



##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -2382,6 +2382,34 @@ TEST(TestArrowReadWrite, WaitCoalescedReads) {
   ASSERT_EQ(actual_batch->num_rows(), num_rows);
 }
 
+// Use coalesced reads and non-coaleasced reads for different column chunks.
+TEST(TestArrowReadWrite, CoalescedReadsAndNonCoalescedReads) {
+  const int num_columns = 5;
+  const int num_rows = 128;
+
+  std::shared_ptr<Table> expected;
+  ASSERT_NO_FATAL_FAILURE(MakeDoubleTable(num_columns, num_rows, 1, &expected));
+
+  std::shared_ptr<Buffer> buffer;
+  ASSERT_NO_FATAL_FAILURE(WriteTableToBuffer(expected, num_rows / 2,
+                                             default_arrow_writer_properties(), &buffer));
+
+  std::unique_ptr<FileReader> reader;
+  ASSERT_OK_NO_THROW(OpenFile(std::make_shared<BufferReader>(buffer),
+                              ::arrow::default_memory_pool(), &reader));
+
+  ASSERT_EQ(2, reader->num_row_groups());
+
+  // Pre-buffer 3 columns in the 2nd row group.
+  reader->parquet_reader()->PreBuffer({1}, {0, 1, 4}, ::arrow::io::IOContext(),
+                                      ::arrow::io::CacheOptions::Defaults());
+  ASSERT_OK(reader->parquet_reader()->WhenBuffered({1}, {0, 1, 4}).status());

Review Comment:
   ```suggestion
     const std::vector<int> row_groups = {1};
     const std::vector<int> column_indices = {0, 1, 4};
     reader->parquet_reader()->PreBuffer(row_groups, column_indices, ::arrow::io::IOContext(),
                                         ::arrow::io::CacheOptions::Defaults());
     ASSERT_OK(reader->parquet_reader()->WhenBuffered(row_groups, column_indices).status());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36192: PARQUET-2316:[C++]: Allow partial prebuffer in parquet FileReader

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1604197085

   There are some tests failures: 
   
   https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47375345
   
   ```
   [----------] 32 tests from TestScan/TestParquetFileFormatScan
   [ RUN      ] TestScan/TestParquetFileFormatScan.ScanRecordBatchReader/0Threaded16b1024r
   unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.
   [  FAILED  ] TestScan/TestParquetFileFormatScan.ScanRecordBatchReader/0Threaded16b1024r, where GetParam() = Threaded16b1024r (30 ms)
   ```
   
   It seems that there are ASAN issues: https://github.com/apache/arrow/actions/runs/5352677942/jobs/9707827704?pr=36192


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jp0317 commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "jp0317 (via GitHub)" <gi...@apache.org>.
jp0317 commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1241485010


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -351,8 +364,11 @@ class SerializedFile : public ParquetFileReader::Contents {
     cached_source_ =
         std::make_shared<::arrow::io::internal::ReadRangeCache>(source_, ctx, options);
     std::vector<::arrow::io::ReadRange> ranges;
+    prebuffered_column_chunks_.clear();

Review Comment:
   Thanks for bringing this up! If they can be called concurrently, then wouldn't the problem already exists before this PR i.e., we already have race on the  `cached_source_` (`GetRowGroup` reads it while `Prebuffer` writes it)? 
   
   IIUC the `Prebuffer` is not for concurrent scenarios based on [here](https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L351) and the function [docs](https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.h#L166-L173) seems to imply a strict order between calling `Prebuffer` and creating reader / reading data.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1242115582


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -297,9 +302,17 @@ class SerializedFile : public ParquetFileReader::Contents {
   }
 
   std::shared_ptr<RowGroupReader> GetRowGroup(int i) override {
+    std::unordered_set<int> prebuffered_column_chunks;
+    // Avoid updating the map as this function can be called concurrently. The map can
+    // only be updated within Prebuffer().
+    auto prebuffered_column_chunks_iter = prebuffered_column_chunks_.find(i);
+    if (prebuffered_column_chunks_iter != prebuffered_column_chunks_.end()) {
+      prebuffered_column_chunks = prebuffered_column_chunks_iter->second;
+    }
+
     std::unique_ptr<SerializedRowGroup> contents = std::make_unique<SerializedRowGroup>(
         source_, cached_source_, source_size_, file_metadata_.get(), i, properties_,
-        file_decryptor_);
+        prebuffered_column_chunks, file_decryptor_);

Review Comment:
   Here it will copy the `prebuffered_column_chunks` in mapping, would you mind change it to move, and change the argument in constructor to `std::unordered_map`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #36192:
URL: https://github.com/apache/arrow/pull/36192#discussion_r1242105046


##########
cpp/src/parquet/file_reader.cc:
##########
@@ -178,13 +179,15 @@ class SerializedRowGroup : public RowGroupReader::Contents {
                      std::shared_ptr<::arrow::io::internal::ReadRangeCache> cached_source,
                      int64_t source_size, FileMetaData* file_metadata,
                      int row_group_number, const ReaderProperties& props,
+                     const std::unordered_set<int>& prebuffered_column_chunks,

Review Comment:
   nit: take this by move and move in the constructor? (so that the caller chooses whether to copy or not)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #36192: PARQUET-2316: [C++] Allow partial PreBuffer in the parquet FileReader

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #36192:
URL: https://github.com/apache/arrow/pull/36192#issuecomment-1623766184

   @wgtmac no concerns.  We can go ahead and merge this one.  Having more options and different ways to read the files is good I think as long as we document how it works and I think this PR is pretty clear.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org