You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "zeroshade (via GitHub)" <gi...@apache.org> on 2023/04/12 16:59:15 UTC

[GitHub] [arrow] zeroshade commented on a diff in pull request #34896: GH-34893: [C++] Fix run-end encoded array iterator issues that manifest on backwards iteration

zeroshade commented on code in PR #34896:
URL: https://github.com/apache/arrow/pull/34896#discussion_r1164406740


##########
cpp/src/arrow/util/ree_util.h:
##########
@@ -231,31 +253,52 @@ class RunEndEncodedArraySpan {
   ///
   /// \param logical_pos is an index in the [0, length()] range
   Iterator iterator(int64_t logical_pos) const {
-    return iterator(logical_pos, logical_pos < length()
-                                     ? PhysicalIndex(logical_pos)
-                                     : RunEndsArray(array_span).length);
+    if (logical_pos < length()) {
+      return iterator(logical_pos, PhysicalIndex(logical_pos));
+    }
+    // If logical_pos is above the valid range, use length() as the logical
+    // position and calculate the physical address right after the last valid
+    // physical position. Which is the physical index of the last logical
+    // position, plus 1.
+    return (length() == 0) ? iterator(0, PhysicalIndex(0))
+                           : iterator(length(), PhysicalIndex(length() - 1) + 1);
   }
 
   /// \brief Create an iterator representing the logical begin of the run-end
   /// encoded array
-  Iterator begin() const { return iterator(0); }
+  Iterator begin() const { return iterator(0, PhysicalIndex(0)); }
 
   /// \brief Create an iterator representing the first invalid logical position
   /// of the run-end encoded array
   ///
-  /// The Iterator returned by end() should not be
+  /// \warning Avoid calling end() in a loop, as it will recompute the physical
+  /// length of the array on each call (O(log N) cost per call).
+  ///
+  /// \par You can write your loops like this instead:
+  /// \code
+  /// for (auto it = array.begin(), end = array.end(); it != end; ++it) {
+  ///   // ...
+  /// }
+  /// \endcode
+  ///
+  /// \par Or this version that does not look like idiomatic C++, but removes
+  /// the need for calling end() completely:
+  /// \code
+  /// for (auto it = array.begin(); !it.is_end(array); ++it) {
+  ///   // ...
+  /// }
+  /// \endcode

Review Comment:
   I'm honestly not a fan of this, but can't think of a better way at the moment yet that wouldn't be just offloading the cost somewhere else.  
   
   Could we potentially use some sentinel value for `physical_pos_` (like <0?) that will have the iterator compute it if `++`/`--` are called on the iterator? That would allow it to remain cheap to call `end()`. Or would that just make things worse?
   
   hmm



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org