You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jaylmiller (via GitHub)" <gi...@apache.org> on 2023/02/16 17:20:19 UTC

[GitHub] [arrow-datafusion] jaylmiller opened a new pull request, #5308: Add SortExec input cases to bench for SortPreservingMergeExec

jaylmiller opened a new pull request, #5308:
URL: https://github.com/apache/arrow-datafusion/pull/5308

   # Which issue does this PR close?
   #5230 (does not close it.)
   
   # Rationale for this change
   The current merge benchmarks take MemoryExec as input. In #5230 we want to modify SortPreservingMergeExec and SortExec so that when row encoding is used within the sort node, it is preserved for usage by the merge node. 
   
   Which means, we need bench cases on main for SortPreservingMergeExec when it has SortExec as input to see if the changes for #5230 are helping perf.
   
   # What changes are included in this PR?
   
   - Added cases with SortExec as input to the merge bench.
   - Fixed minor error I made in the sort bench.
   
   # Are these changes tested?
   # Are there any user-facing changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #5308: Fix SortExec bench case and Add SortExec input cases to bench for SortPreservingMergeExec

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5308:
URL: https://github.com/apache/arrow-datafusion/pull/5308


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #5308: Add SortExec input cases to bench for SortPreservingMergeExec

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on code in PR #5308:
URL: https://github.com/apache/arrow-datafusion/pull/5308#discussion_r1108930715


##########
datafusion/core/benches/merge.rs:
##########
@@ -214,6 +266,26 @@ impl MergeBenchCase {
         }
     }
 
+    fn new_with_sort_input(partitions: &[Vec<RecordBatch>]) -> Self {
+        let runtime = tokio::runtime::Builder::new_multi_thread().build().unwrap();
+        let session_ctx = SessionContext::new();
+        let task_ctx = session_ctx.task_ctx();
+
+        let schema = partitions[0][0].schema();
+        let sort = make_sort_exprs(schema.as_ref());
+
+        let projection = None;
+        let exec = Arc::new(MemoryExec::try_new(partitions, schema, projection).unwrap());

Review Comment:
   Yes, I have been following along, great work!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5308: Add SortExec input cases to bench for SortPreservingMergeExec

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5308:
URL: https://github.com/apache/arrow-datafusion/pull/5308#discussion_r1108929173


##########
datafusion/core/benches/merge.rs:
##########
@@ -214,6 +266,26 @@ impl MergeBenchCase {
         }
     }
 
+    fn new_with_sort_input(partitions: &[Vec<RecordBatch>]) -> Self {
+        let runtime = tokio::runtime::Builder::new_multi_thread().build().unwrap();
+        let session_ctx = SessionContext::new();
+        let task_ctx = session_ctx.task_ctx();
+
+        let schema = partitions[0][0].schema();
+        let sort = make_sort_exprs(schema.as_ref());
+
+        let projection = None;
+        let exec = Arc::new(MemoryExec::try_new(partitions, schema, projection).unwrap());

Review Comment:
   I double checked that the sort exec does actually make multiple output streams. 👍 
   
   FYI @mustafasrepo  and @ozankabak and @mingmwang   -- not sure if you are following this work from @jaylmiller but I think his work may make repartitioned sorts (in parallel) a more compelling option



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jaylmiller commented on a diff in pull request #5308: Fix SortExec bench case and Add SortExec input cases to bench for SortPreservingMergeExec

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.
jaylmiller commented on code in PR #5308:
URL: https://github.com/apache/arrow-datafusion/pull/5308#discussion_r1111122943


##########
datafusion/core/benches/sort.rs:
##########
@@ -199,7 +200,8 @@ impl SortBenchCase {
 
         let projection = None;
         let exec = MemoryExec::try_new(partitions, schema, projection).unwrap();
-        let plan = Arc::new(SortExec::try_new(sort, Arc::new(exec), None).unwrap());
+        let exec = Arc::new(CoalescePartitionsExec::new(Arc::new(exec)));
+        let plan = Arc::new(SortExec::try_new(sort, exec, None).unwrap());

Review Comment:
   without this change it was only ever benchmarking the case where the `execute` receives a single record batch to sort.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jaylmiller commented on a diff in pull request #5308: Add SortExec input cases to bench for SortPreservingMergeExec

Posted by "jaylmiller (via GitHub)" <gi...@apache.org>.
jaylmiller commented on code in PR #5308:
URL: https://github.com/apache/arrow-datafusion/pull/5308#discussion_r1109153838


##########
datafusion/core/benches/merge.rs:
##########
@@ -214,6 +266,26 @@ impl MergeBenchCase {
         }
     }
 
+    fn new_with_sort_input(partitions: &[Vec<RecordBatch>]) -> Self {
+        let runtime = tokio::runtime::Builder::new_multi_thread().build().unwrap();
+        let session_ctx = SessionContext::new();
+        let task_ctx = session_ctx.task_ctx();
+
+        let schema = partitions[0][0].schema();
+        let sort = make_sort_exprs(schema.as_ref());
+
+        let projection = None;
+        let exec = Arc::new(MemoryExec::try_new(partitions, schema, projection).unwrap());

Review Comment:
   Is the preserve partitioning case necessary for the SortExec bench? I'm realizing now it doesn't really measure anything useful in terms of perf 😅 thinking it might make sense to remove those cases in this PR. What do u think?@alamb



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #5308: Fix SortExec bench case and Add SortExec input cases to bench for SortPreservingMergeExec

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #5308:
URL: https://github.com/apache/arrow-datafusion/pull/5308#issuecomment-1437031526

   Benchmark runs are scheduled for baseline = 222205df2e26befa3f7f8d862fefea413fbafc4d and contender = e13c6537c5222a9254a6d614b0edfefd167a5b0e. e13c6537c5222a9254a6d614b0edfefd167a5b0e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e96dd70c5684cb68b71a32eca9fb105...81e2abf725d4400ca25a0bde27d6468f/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/e7187a367fbf484e9a976735b6dc26d0...349dc1693f6c4c519f9f7901cec342c1/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/5e22f64af4cb4c6283ecb8e8aaaaea1b...6b821ceb40dd467fa6702bd8b1ffe7b4/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9991c2d7ab99444781696a5896a327e4...bafe901a2a9244f98d410401762f843d/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org