You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/07/16 14:12:10 UTC

[GitHub] [arrow] wgtmac commented on a diff in pull request #36649: PARQUET-2323: [C++] use bit vector to store prebuffered column chunks

wgtmac commented on code in PR #36649:
URL: https://github.com/apache/arrow/pull/36649#discussion_r1264688959


##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -2415,7 +2415,7 @@ TEST(TestArrowReadWrite, CoalescedReadsAndNonCoalescedReads) {
 
   // Pre-buffer 3 columns in the 2nd row group.
   const std::vector<int> row_groups = {1};
-  const std::vector<int> column_indices = {0, 1, 4};
+  const std::vector<int> column_indices = {0, 3};

Review Comment:
   Why need this change? It does not match the comment above.



##########
cpp/src/parquet/file_reader.cc:
##########
@@ -302,17 +303,17 @@ class SerializedFile : public ParquetFileReader::Contents {
   }
 
   std::shared_ptr<RowGroupReader> GetRowGroup(int i) override {
-    std::unordered_set<int> prebuffered_column_chunks;
+    std::shared_ptr<Buffer> prebuffered_column_chunks_bitmap;
     // Avoid updating the map as this function can be called concurrently. The map can

Review Comment:
   ```suggestion
       // Avoid updating the bitmap as this function can be called concurrently. The bitmap can
   ```



##########
cpp/src/parquet/file_reader.cc:
##########
@@ -366,9 +367,13 @@ class SerializedFile : public ParquetFileReader::Contents {
     std::vector<::arrow::io::ReadRange> ranges;
     prebuffered_column_chunks_.clear();
     for (int row : row_groups) {
-      std::unordered_set<int>& prebuffered = prebuffered_column_chunks_[row];
+      std::shared_ptr<Buffer>& col_bitmap = prebuffered_column_chunks_[row];
+      int num_cols = file_metadata_->num_columns();
+      PARQUET_THROW_NOT_OK(
+          AllocateBitmap(num_cols, properties_.memory_pool()).Value(&col_bitmap));
+      ::arrow::bit_util::ClearBitmap(col_bitmap->mutable_data(), 0, num_cols);

Review Comment:
   Should we just call `::memset(col_bitmap->mutable_data(), 0, col_bitmap->size())` ? It would be less efficient to call `ClearBitmap` if `num_cols` is not a multiply of 8.



##########
cpp/src/parquet/file_reader.cc:
##########
@@ -578,8 +583,9 @@ class SerializedFile : public ParquetFileReader::Contents {
   ReaderProperties properties_;
   std::shared_ptr<PageIndexReader> page_index_reader_;
   std::unique_ptr<BloomFilterReader> bloom_filter_reader_;
-  // Maps a row group to its column chunks that are cached via Prebuffer().
-  std::unordered_map<int, std::unordered_set<int>> prebuffered_column_chunks_;
+  // Maps a row group to a bitmap (stored in the Buffer) that marks its column chunks
+  // cached via Prebuffer().

Review Comment:
   ```suggestion
     // Maps row group ordinal and prebuffer status of its column chunks in the form
     // of a bitmap buffer.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org