You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/05/08 18:03:57 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6260: Add more documentation to SortPreservingMergeStream

alamb commented on code in PR #6260:
URL: https://github.com/apache/arrow-datafusion/pull/6260#discussion_r1187729554


##########
datafusion/core/src/physical_plan/sorts/merge.rs:
##########
@@ -242,22 +260,61 @@ impl<C: Cursor> SortPreservingMergeStream<C> {
         }
     }
 
+    /// Find the leaf node index in the loser tree for the given cursor index
+    ///
+    /// Note that this is not necessarily a leaf node in the tree, but it can
+    /// also be a half-node (a node with only one child). This happens when the
+    /// number of cursors/streams is not a power of two. Thus, the loser tree
+    /// will be unbalanced, but it will still work correctly.
+    ///
+    /// For example, with 5 streams, the loser tree will look like this:
+    ///
+    /// ```text
+    ///           0 (winner)
+    ///
+    ///           1
+    ///        /     \
+    ///       2       3
+    ///     /  \     / \
+    ///    4    |   |   |
+    ///   / \   |   |   |
+    /// -+---+--+---+---+---- Below is not a part of loser tree
+    ///  S3 S4 S0   S1  S2
+    /// ```
+    ///
+    /// S0, S1, ... S4 are the streams (read: stream at index 0, stream at
+    /// index 1, etc.)
+    ///
+    /// Zooming in at node 2 in the loser tree as an example, we can see that
+    /// it takes as input the next item at (S0) and the loser of (S3, S4).
+    ///
+    #[inline]
+    fn lt_leaf_node_index(&self, cursor_index: usize) -> usize {
+        (self.cursors.len() + cursor_index) / 2
+    }
+
+    /// Find the parent node index for the given node index

Review Comment:
   👍  I like refactoring of this functionality as a way to document what is going on



##########
datafusion/core/src/physical_plan/sorts/sort_preserving_merge.rs:
##########
@@ -506,6 +506,71 @@ mod tests {
         assert_batches_eq!(exp, collected.as_slice());
     }
 
+    #[tokio::test]
+    async fn test_merge_five_partitions() {

Review Comment:
   I wonder if you can leave some comments about why this test was added / what it is covering to help future readers?
   
   I think the coverage of merging is pretty good with this: https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/merge_fuzz.rs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org