You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/04/06 17:00:23 UTC

[GitHub] [arrow-datafusion] tustvold opened a new pull request, #5897: Specialize Primitive Cursor

tustvold opened a new pull request, #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897

   _Draft as builds on #5895_
   
   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Part of #5882
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   ```
   merge sorted i64        time:   [5.3681 ms 5.3730 ms 5.3784 ms]
                           change: [-43.762% -43.690% -43.621%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   sort merge i64          time:   [5.6187 ms 5.6264 ms 5.6344 ms]
                           change: [-42.354% -42.259% -42.162%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   merge sorted f64        time:   [5.6240 ms 5.6307 ms 5.6386 ms]
                           change: [-40.730% -40.639% -40.548%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   sort merge f64          time:   [5.9045 ms 5.9124 ms 5.9207 ms]
                           change: [-39.173% -39.065% -38.945%] (p = 0.00 < 0.05)
                           Performance has improved.
   ```
   
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #5897: Specialize Primitive Cursor -- make sorts / merges on a single primitive column faster

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897#discussion_r1162453846


##########
datafusion/core/src/physical_plan/sorts/cursor.rs:
##########
@@ -93,3 +96,232 @@ impl Cursor for RowCursor {
         t
     }
 }
+
+/// A cursor over sorted, nullable [`ArrowNativeTypeOp`]
+///
+/// Note: comparing cursors with different `SortOptions` will yield an arbitrary ordering
+#[derive(Debug)]
+pub struct PrimitiveCursor<T: ArrowNativeTypeOp> {
+    values: ScalarBuffer<T>,
+    offset: usize,
+    // If nulls first, the first non-null index
+    // Otherwise, the first null index
+    null_threshold: usize,
+    options: SortOptions,
+}
+
+impl<T: ArrowNativeTypeOp> PrimitiveCursor<T> {
+    /// Create a new [`PrimitiveCursor`] from the provided `values` sorted according to `options`
+    pub fn new(options: SortOptions, values: ScalarBuffer<T>, null_count: usize) -> Self {
+        assert!(null_count <= values.len());
+
+        let null_threshold = match options.nulls_first {
+            true => null_count,
+            false => values.len() - null_count,
+        };
+
+        Self {
+            values,
+            offset: 0,
+            null_threshold,
+            options,
+        }
+    }
+
+    fn is_null(&self) -> bool {
+        (self.offset < self.null_threshold) == self.options.nulls_first
+    }
+
+    fn value(&self) -> T {
+        self.values[self.offset]
+    }
+}
+
+impl<T: ArrowNativeTypeOp> PartialEq for PrimitiveCursor<T> {
+    fn eq(&self, other: &Self) -> bool {
+        self.cmp(other).is_eq()
+    }
+}
+
+impl<T: ArrowNativeTypeOp> Eq for PrimitiveCursor<T> {}
+impl<T: ArrowNativeTypeOp> PartialOrd for PrimitiveCursor<T> {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl<T: ArrowNativeTypeOp> Ord for PrimitiveCursor<T> {
+    fn cmp(&self, other: &Self) -> Ordering {
+        match (self.is_null(), other.is_null()) {
+            (true, true) => Ordering::Equal,
+            (true, false) => match self.options.nulls_first {
+                true => Ordering::Less,
+                false => Ordering::Greater,
+            },
+            (false, true) => match self.options.nulls_first {
+                true => Ordering::Greater,
+                false => Ordering::Less,
+            },
+            (false, false) => {
+                let s_v = self.value();
+                let o_v = other.value();
+
+                match self.options.descending {

Review Comment:
   It's definitely something we could experiment, my expectation is that it won't make a huge difference given how branch heavy merging inherently is



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #5897: Specialize Primitive Cursor -- make sorts / merges on a single primitive column faster

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897#discussion_r1162452933


##########
datafusion/core/src/physical_plan/sorts/cursor.rs:
##########
@@ -93,3 +96,232 @@ impl Cursor for RowCursor {
         t
     }
 }
+
+/// A cursor over sorted, nullable [`ArrowNativeTypeOp`]
+///
+/// Note: comparing cursors with different `SortOptions` will yield an arbitrary ordering
+#[derive(Debug)]
+pub struct PrimitiveCursor<T: ArrowNativeTypeOp> {
+    values: ScalarBuffer<T>,
+    offset: usize,
+    // If nulls first, the first non-null index
+    // Otherwise, the first null index
+    null_threshold: usize,

Review Comment:
   I thought null_index might be confused for the index of the first null, which is not always the case



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5897: Specialize Primitive Cursor -- make sorts / merges on a single primitive column faster

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897#discussion_r1162616186


##########
datafusion/core/src/physical_plan/sorts/cursor.rs:
##########
@@ -93,3 +96,232 @@ impl Cursor for RowCursor {
         t
     }
 }
+
+/// A cursor over sorted, nullable [`ArrowNativeTypeOp`]
+///
+/// Note: comparing cursors with different `SortOptions` will yield an arbitrary ordering
+#[derive(Debug)]
+pub struct PrimitiveCursor<T: ArrowNativeTypeOp> {
+    values: ScalarBuffer<T>,
+    offset: usize,
+    // If nulls first, the first non-null index
+    // Otherwise, the first null index
+    null_threshold: usize,

Review Comment:
   that makes sense. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold merged pull request #5897: Specialize Primitive Cursor -- make sorts / merges on a single primitive column faster

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #5897: Specialize Primitive Cursor

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897#discussion_r1160051954


##########
datafusion/core/Cargo.toml:
##########
@@ -57,6 +57,8 @@ unicode_expressions = ["datafusion-physical-expr/unicode_expressions", "datafusi
 ahash = { version = "0.8", default-features = false, features = ["runtime-rng"] }
 apache-avro = { version = "0.14", optional = true }
 arrow = { workspace = true }
+arrow-schema = { workspace = true }
+arrow-array = { workspace = true }

Review Comment:
   The downcast macros require these crates to be in scope



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on pull request #5897: Specialize Primitive Cursor -- make sorts / merges on a single primitive column faster

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on PR #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897#issuecomment-1502140737

   Thanks @tustvold 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #5897: Specialize Primitive Cursor

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897#discussion_r1160051588


##########
datafusion/core/src/physical_plan/sorts/cursor.rs:
##########
@@ -93,3 +96,232 @@ impl Cursor for RowCursor {
         t
     }
 }
+
+/// A cursor over sorted, nullable [`ArrowNativeTypeOp`]
+///
+/// Note: comparing cursors with different `SortOptions` will yield an arbitrary ordering
+#[derive(Debug)]
+pub struct PrimitiveCursor<T: ArrowNativeTypeOp> {
+    values: ScalarBuffer<T>,
+    offset: usize,
+    // If nulls first, the first non-null index
+    // Otherwise, the first null index
+    null_threshold: usize,
+    options: SortOptions,
+}
+
+impl<T: ArrowNativeTypeOp> PrimitiveCursor<T> {

Review Comment:
   The astute will notice this is parameterised on the underlying Native type, helping to reduce codegen by not forcing SortPreservingMergeStream to be generic over ArrowPrimitiveType



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5897: Specialize Primitive Cursor -- make sorts / merges on a single column faster

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897#discussion_r1161935087


##########
datafusion/core/src/physical_plan/sorts/merge.rs:
##########
@@ -37,8 +52,16 @@ pub(crate) fn streaming_merge(
     tracking_metrics: MemTrackingMetrics,
     batch_size: usize,
 ) -> Result<SendableRecordBatchStream> {
-    let streams = RowCursorStream::try_new(schema.as_ref(), expressions, streams)?;
+    if expressions.len() == 1 {

Review Comment:
   I think it is worth a comment here explaining the rationale for this for future readers. Something like
   
   ```suggestion
       // special case single column primitives to avoid overhead of runtime comparators
       if expressions.len() == 1 {
   ```



##########
datafusion/core/src/physical_plan/sorts/cursor.rs:
##########
@@ -93,3 +96,232 @@ impl Cursor for RowCursor {
         t
     }
 }
+
+/// A cursor over sorted, nullable [`ArrowNativeTypeOp`]
+///
+/// Note: comparing cursors with different `SortOptions` will yield an arbitrary ordering
+#[derive(Debug)]
+pub struct PrimitiveCursor<T: ArrowNativeTypeOp> {
+    values: ScalarBuffer<T>,
+    offset: usize,
+    // If nulls first, the first non-null index
+    // Otherwise, the first null index
+    null_threshold: usize,
+    options: SortOptions,
+}
+
+impl<T: ArrowNativeTypeOp> PrimitiveCursor<T> {
+    /// Create a new [`PrimitiveCursor`] from the provided `values` sorted according to `options`
+    pub fn new(options: SortOptions, values: ScalarBuffer<T>, null_count: usize) -> Self {
+        assert!(null_count <= values.len());
+
+        let null_threshold = match options.nulls_first {
+            true => null_count,
+            false => values.len() - null_count,
+        };
+
+        Self {
+            values,
+            offset: 0,
+            null_threshold,
+            options,
+        }
+    }
+
+    fn is_null(&self) -> bool {
+        (self.offset < self.null_threshold) == self.options.nulls_first
+    }
+
+    fn value(&self) -> T {
+        self.values[self.offset]
+    }
+}
+
+impl<T: ArrowNativeTypeOp> PartialEq for PrimitiveCursor<T> {
+    fn eq(&self, other: &Self) -> bool {
+        self.cmp(other).is_eq()
+    }
+}
+
+impl<T: ArrowNativeTypeOp> Eq for PrimitiveCursor<T> {}
+impl<T: ArrowNativeTypeOp> PartialOrd for PrimitiveCursor<T> {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl<T: ArrowNativeTypeOp> Ord for PrimitiveCursor<T> {
+    fn cmp(&self, other: &Self) -> Ordering {
+        match (self.is_null(), other.is_null()) {
+            (true, true) => Ordering::Equal,
+            (true, false) => match self.options.nulls_first {
+                true => Ordering::Less,
+                false => Ordering::Greater,
+            },
+            (false, true) => match self.options.nulls_first {
+                true => Ordering::Greater,
+                false => Ordering::Less,
+            },
+            (false, false) => {
+                let s_v = self.value();
+                let o_v = other.value();
+
+                match self.options.descending {

Review Comment:
   I wonder if it would be worth specializing on consts `NULLS_FIRST` and `ASC/DESC` as well to avoid what looks like runtime overhead in the hot path.
   
   Maybe it doesn't make any practical difference



##########
datafusion/core/src/physical_plan/sorts/stream.rs:
##########
@@ -139,3 +142,67 @@ impl PartitionedStream for RowCursorStream {
         }))
     }
 }
+
+pub struct PrimitiveCursorStream<T: ArrowPrimitiveType> {

Review Comment:
   ```suggestion
   /// Specialized stream for sorts on single primitive columns
   pub struct PrimitiveCursorStream<T: ArrowPrimitiveType> {
   ```



##########
datafusion/core/src/physical_plan/sorts/cursor.rs:
##########
@@ -93,3 +96,232 @@ impl Cursor for RowCursor {
         t
     }
 }
+
+/// A cursor over sorted, nullable [`ArrowNativeTypeOp`]
+///
+/// Note: comparing cursors with different `SortOptions` will yield an arbitrary ordering
+#[derive(Debug)]
+pub struct PrimitiveCursor<T: ArrowNativeTypeOp> {
+    values: ScalarBuffer<T>,
+    offset: usize,
+    // If nulls first, the first non-null index
+    // Otherwise, the first null index
+    null_threshold: usize,

Review Comment:
   Is there any reason it is called `null_threshold` rather than `null_index` given it is a null index 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #5897: Specialize Primitive Cursor

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5897:
URL: https://github.com/apache/arrow-datafusion/pull/5897#discussion_r1160050921


##########
datafusion/core/src/physical_plan/sorts/cursor.rs:
##########
@@ -93,3 +96,232 @@ impl Cursor for RowCursor {
         t
     }
 }
+
+/// A cursor over sorted, nullable [`ArrowNativeTypeOp`]
+///
+/// Note: comparing cursors with different `SortOptions` will yield an arbitrary ordering
+#[derive(Debug)]
+pub struct PrimitiveCursor<T: ArrowNativeTypeOp> {
+    values: ScalarBuffer<T>,
+    offset: usize,
+    // If nulls first, the first non-null index
+    // Otherwise, the first null index
+    null_threshold: usize,

Review Comment:
   I'm quite pleased with this formulation of nulls, it avoids needing to consult the null mask at all :smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org