Posted to github@arrow.apache.org by "emkornfield (via GitHub)" <gi...@apache.org> on 2023/02/17 05:53:22 UTC

[GitHub] [arrow] emkornfield commented on a diff in pull request #34193: GH-34142: [C++][Parquet] Fix record not to span multiple pages

emkornfield commented on code in PR #34193:
URL: https://github.com/apache/arrow/pull/34193#discussion_r1109334136


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1014,11 +1014,33 @@ template <typename Action>
 inline void DoInBatches(int64_t total, int64_t batch_size, Action&& action) {
   int64_t num_batches = static_cast<int>(total / batch_size);
   for (int round = 0; round < num_batches; round++) {
-    action(round * batch_size, batch_size);
+    action(round * batch_size, batch_size, /*check_page=*/true);
   }
   // Write the remaining values
   if (total % batch_size > 0) {
-    action(num_batches * batch_size, total % batch_size);
+    action(num_batches * batch_size, total % batch_size, /*check_page=*/true);
+  }
+}
+
+template <typename Action>
+inline void DoInBatches(const int16_t* def_levels, const int16_t* rep_levels,
+                        int64_t num_levels, int64_t batch_size, Action&& action,
+                        bool pages_change_on_record_boundaries) {
+  if (!pages_change_on_record_boundaries || !rep_levels) {
+    // If rep_levels is null, then we are writing a non-repeated column.
+    // In this case, every record contains only one level.
+    return DoInBatches(num_levels, batch_size, std::forward<Action>(action));
+  }
+
+  int64_t offset = 0;
+  while (offset < num_levels) {
+    int64_t end_offset = std::min(offset + batch_size, num_levels);
+    // Find next record boundary (i.e. rep_level = 0)
+    while (end_offset < num_levels && rep_levels[end_offset] != 0) {
+      end_offset++;
+    }
+    action(offset, end_offset - offset, /*check_page=*/end_offset < num_levels);

Review Comment:
   This seems to assume that rep_levels always ends at a record boundary? I can't recall off the top of my head whether this is documented, or whether it holds in general, even for the Arrow case. A more robust solution would be to propagate the offset back from this method once it reaches the end, so that callers can confirm.
   
   A less robust solution would be to at least sanity check that, if there is more than one rep_level, `rep_levels[0] == 0`.
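
To make the two options concrete, here is a rough standalone sketch, not a proposed patch. `Batch` and `SplitOnRecordBoundaries` are hypothetical names that do not exist in Arrow. The sketch assumes a non-null `rep_levels`, collects the batches instead of invoking `action`, and returns the start offset of the final batch as one possible reading of "propagate the offset back". The splitting loop mirrors the one in the diff above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical record of one call to `action` (not an Arrow type).
struct Batch {
  int64_t offset;
  int64_t length;
  bool check_page;
};

// Hypothetical helper (not part of Arrow): splits levels into batches that
// end on record boundaries, as in the diff, and returns the start offset of
// the final batch. The end of that batch cannot be confirmed as a record
// boundary from this buffer alone, so the caller gets a chance to check it.
inline int64_t SplitOnRecordBoundaries(const int16_t* rep_levels,
                                       int64_t num_levels, int64_t batch_size,
                                       std::vector<Batch>* out) {
  if (rep_levels == nullptr || batch_size <= 0) {
    throw std::invalid_argument("expected non-null rep_levels and batch_size > 0");
  }
  // The "less robust" option: require the buffer to start a new record.
  if (num_levels > 0 && rep_levels[0] != 0) {
    throw std::invalid_argument("rep_levels must start at a record boundary");
  }

  int64_t offset = 0;
  int64_t last_batch_start = 0;
  while (offset < num_levels) {
    int64_t end_offset = std::min(offset + batch_size, num_levels);
    // Extend the batch to the next record boundary (rep_level == 0).
    while (end_offset < num_levels && rep_levels[end_offset] != 0) {
      ++end_offset;
    }
    out->push_back({offset, end_offset - offset,
                    /*check_page=*/end_offset < num_levels});
    last_batch_start = offset;
    offset = end_offset;
  }
  return last_batch_start;
}
```

With the start-of-buffer check in place, a caller that passes a buffer beginning mid-record fails fast instead of silently letting a record span pages. Whether the final batch really ends a record still depends on what the caller writes next, which is why the offset is handed back.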


