You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "felipecrv (via GitHub)" <gi...@apache.org> on 2023/04/05 01:00:47 UTC

[GitHub] [arrow] felipecrv opened a new pull request, #34896: GH-34893: [C++]

felipecrv opened a new pull request, #34896:
URL: https://github.com/apache/arrow/pull/34896

   ### Rationale for this change
   
   The value of `RunEndEncodedArraySpan::Iterator::physical_pos_` from the instance returned by `RunEndEncodedArraySpan::end()` is smaller (by 1) than the results of successive applications of `operator++()` until `logical_pos_` reaches the logical length of the run-end encoded array.
   
   To support backwards iteration, the `Iterator` returned by `RunEndEncodedArraySpan::end()` should be well-formed.
   
   There was no `operator--()` in `RunEndEncodedArraySpan::Iterator` before this PR, so one can't go backwards from `end()` and this latent bug doesn't manifest. 
   
   ### What changes are included in this PR?
   
    - Addition of `RunEndEncodedArraySpan::Iterator::operator--()`
    - Addition of `MergedRunsIterator::operator--()`
    - Rename `isEnd()` to `is_end()` (it was a code-style violation)
    - More test cases
   
   ### Are these changes tested?
   
   By existing and new unit tests.
   
   ### Are there any user-facing changes?
   
   A public function -- `isEnd` -- was renamed, but this code hasn't been released yet, so it's OK.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade merged pull request #34896: GH-34893: [C++] Fix run-end encoded array iterator issues that manifest on backwards iteration

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade merged PR #34896:
URL: https://github.com/apache/arrow/pull/34896


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34896: GH-34893: [C++] Fix run-end encoded array iterator issues that manifest on backwards iteration

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34896:
URL: https://github.com/apache/arrow/pull/34896#issuecomment-1509433468

   Benchmark runs are scheduled for baseline = 88fd5237591c6eb521a8eea1bbe1eb1507b61c3d and contender = 48294b0790e684c704a563cb98cada9fa59d0e2f. 48294b0790e684c704a563cb98cada9fa59d0e2f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/a4f691a7777145308bd62db0b7d25bfe...96aa6f0b8aa74a5586883633d47153fc/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/311053a8d2fc4dd99d65fd6de434b19d...d1e05acbf62a4944abe29caf180cb635/)
   [Finished :arrow_down:5.36% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/3fb7cb48bed14aa8a80b3a48b5274f53...9cbce3a49a2048c584390ebc52fe8fea/)
   [Finished :arrow_down:1.66% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/94bdb64d476e47ca9774e524ff110ef7...cbca52db126746a9a3883fd7af7ad7c1/)
   Buildkite builds:
   [Finished] [`48294b07` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2699)
   [Failed] [`48294b07` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2733)
   [Finished] [`48294b07` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2697)
   [Finished] [`48294b07` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2724)
   [Finished] [`88fd5237` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2698)
   [Failed] [`88fd5237` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2732)
   [Finished] [`88fd5237` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2696)
   [Finished] [`88fd5237` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2723)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #34896: GH-34893: [C++] Fix run-end encoded array iterator issues that manifest on backwards iteration

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #34896:
URL: https://github.com/apache/arrow/pull/34896#discussion_r1164406740


##########
cpp/src/arrow/util/ree_util.h:
##########
@@ -231,31 +253,52 @@ class RunEndEncodedArraySpan {
   ///
   /// \param logical_pos is an index in the [0, length()] range
   Iterator iterator(int64_t logical_pos) const {
-    return iterator(logical_pos, logical_pos < length()
-                                     ? PhysicalIndex(logical_pos)
-                                     : RunEndsArray(array_span).length);
+    if (logical_pos < length()) {
+      return iterator(logical_pos, PhysicalIndex(logical_pos));
+    }
+    // If logical_pos is above the valid range, use length() as the logical
+    // position and calculate the physical address right after the last valid
+    // physical position. Which is the physical index of the last logical
+    // position, plus 1.
+    return (length() == 0) ? iterator(0, PhysicalIndex(0))
+                           : iterator(length(), PhysicalIndex(length() - 1) + 1);
   }
 
   /// \brief Create an iterator representing the logical begin of the run-end
   /// encoded array
-  Iterator begin() const { return iterator(0); }
+  Iterator begin() const { return iterator(0, PhysicalIndex(0)); }
 
   /// \brief Create an iterator representing the first invalid logical position
   /// of the run-end encoded array
   ///
-  /// The Iterator returned by end() should not be
+  /// \warning Avoid calling end() in a loop, as it will recompute the physical
+  /// length of the array on each call (O(log N) cost per call).
+  ///
+  /// \par You can write your loops like this instead:
+  /// \code
+  /// for (auto it = array.begin(), end = array.end(); it != end; ++it) {
+  ///   // ...
+  /// }
+  /// \endcode
+  ///
+  /// \par Or this version that does not look like idiomatic C++, but removes
+  /// the need for calling end() completely:
+  /// \code
+  /// for (auto it = array.begin(); !it.is_end(array); ++it) {
+  ///   // ...
+  /// }
+  /// \endcode

Review Comment:
   I'm honestly not a fan of this, but can't think of a better way at the moment yet that wouldn't be just offloading the cost somewhere else.  
   
   Could we potentially use some sentinel value for `physical_pos_` (like <0?) that will have the iterator compute it if `++`/`--` are called on the iterator? That would allow it to remain cheap to call `end()`. Or would that just make things worse?
   
   hmm



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34896: GH-34893: [C++] Fix run-end encoded array iterator issues that manifest on backwards iteration

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34896:
URL: https://github.com/apache/arrow/pull/34896#issuecomment-1499266536

   @zeroshade this is ready for review and would be nice to get merged before release as I rename `isEnd` to `is_end` in a public interface.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34896: GH-34893: [C++] Fix run-end encoded array iterator issues that manifest on backwards iteration

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34896:
URL: https://github.com/apache/arrow/pull/34896#discussion_r1164438328


##########
cpp/src/arrow/util/ree_util.h:
##########
@@ -231,31 +253,52 @@ class RunEndEncodedArraySpan {
   ///
   /// \param logical_pos is an index in the [0, length()] range
   Iterator iterator(int64_t logical_pos) const {
-    return iterator(logical_pos, logical_pos < length()
-                                     ? PhysicalIndex(logical_pos)
-                                     : RunEndsArray(array_span).length);
+    if (logical_pos < length()) {
+      return iterator(logical_pos, PhysicalIndex(logical_pos));
+    }
+    // If logical_pos is above the valid range, use length() as the logical
+    // position and calculate the physical address right after the last valid
+    // physical position. Which is the physical index of the last logical
+    // position, plus 1.
+    return (length() == 0) ? iterator(0, PhysicalIndex(0))
+                           : iterator(length(), PhysicalIndex(length() - 1) + 1);
   }
 
   /// \brief Create an iterator representing the logical begin of the run-end
   /// encoded array
-  Iterator begin() const { return iterator(0); }
+  Iterator begin() const { return iterator(0, PhysicalIndex(0)); }
 
   /// \brief Create an iterator representing the first invalid logical position
   /// of the run-end encoded array
   ///
-  /// The Iterator returned by end() should not be
+  /// \warning Avoid calling end() in a loop, as it will recompute the physical
+  /// length of the array on each call (O(log N) cost per call).
+  ///
+  /// \par You can write your loops like this instead:
+  /// \code
+  /// for (auto it = array.begin(), end = array.end(); it != end; ++it) {
+  ///   // ...
+  /// }
+  /// \endcode
+  ///
+  /// \par Or this version that does not look like idiomatic C++, but removes
+  /// the need for calling end() completely:
+  /// \code
+  /// for (auto it = array.begin(); !it.is_end(array); ++it) {
+  ///   // ...
+  /// }
+  /// \endcode

Review Comment:
   Things got tricky because I want to be able to land on a valid value after applying `--` to the Iterator returned by `array.end()`. Introducing a sentinel value would add a check for every `--` and give the compiler a hard time optimizing loops. What I had before this PR kinda worked (taking the physical length), but didn't survive the introduction of `operator--`.
   
   
   `for (auto it = array.begin(), end = array.end(); it != end; ++it)` is what performance-conscious C++ developers already use to write loops and the only extra cost incurred is a single O(log n) search. So I don't think adding more fields to the iterator and complicating the loop conditions is a good idea.
   
   `for (auto it = array.begin(); !it.is_end(array); ++it)` is weird, but makes the perfectionists (me) very happy about eliminating that extra O(log n) search completely.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34896: GH-34893: [C++] Fix run-end encoded array iterator issues that manifest on backwards iteration

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34896:
URL: https://github.com/apache/arrow/pull/34896#issuecomment-1509434041

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/3fb7cb48bed14aa8a80b3a48b5274f53...9cbce3a49a2048c584390ebc52fe8fea/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34896: GH-34893: [C++] Fix run-end encoded array iterator issues that manifest on backwards iteration

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34896:
URL: https://github.com/apache/arrow/pull/34896#issuecomment-1496795243

   @zeroshade 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34896: GH-34893: [C++]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34896:
URL: https://github.com/apache/arrow/pull/34896#issuecomment-1496784260

   * Closes: #34893


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org