Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/03 15:14:39 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #2431: WIP Add benchmark for sort preserving merge

alamb opened a new pull request, #2431:
URL: https://github.com/apache/arrow-datafusion/pull/2431

   NOT READY FOR REVIEW
   
   # Which issue does this PR close?
   
   Part of https://github.com/apache/arrow-datafusion/issues/2427
   
   # Rationale for this change
   Add benchmarks for the cases I intended to optimize for in https://github.com/apache/arrow-datafusion/issues/2427
   
   # What changes are included in this PR?
   A new `merge` benchmark.
   
   Run it with:
   
   ```shell
   cargo bench --bench merge
   ```
   
   # Are there any user-facing changes?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2431: Benchmark for sort preserving merge

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2431:
URL: https://github.com/apache/arrow-datafusion/pull/2431#issuecomment-1133147901

   Thanks @andygrove 




[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2431: Benchmark for sort preserving merge

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2431:
URL: https://github.com/apache/arrow-datafusion/pull/2431#discussion_r878291123


##########
datafusion/core/benches/merge.rs:
##########
@@ -0,0 +1,455 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Benchmarks for Merge performance
+//!
+//! Each benchmark:
+//! 1. Creates a sorted RecordBatch of some number of columns
+//!
+//! 2. Divides that `RecordBatch` into some number of "streams"
+//! (`RecordBatch`s with a subset of the rows, still ordered)
+//!
+//! 3. Times how long it takes for [`SortPreservingMergeExec`] to
+//! merge the "streams" back together into the original RecordBatch.
+//!
+//! Pictorially:
+//!
+//! ```
+//!                           Rows are randomly
+//!                          divided into separate
+//!                         RecordBatch "streams",
+//! ┌────┐ ┌────┐ ┌────┐     preserving the order        ┌────┐ ┌────┐ ┌────┐
+//! │    │ │    │ │    │                                 │    │ │    │ │    │
+//! │    │ │    │ │    │ ──────────────┐                 │    │ │    │ │    │
+//! │    │ │    │ │    │               └─────────────▶   │ C1 │ │... │ │ CN │
+//! │    │ │    │ │    │ ───────────────┐                │    │ │    │ │    │
+//! │    │ │    │ │    │               ┌┼─────────────▶  │    │ │    │ │    │
+//! │    │ │    │ │    │               ││                │    │ │    │ │    │
+//! │    │ │    │ │    │               ││                └────┘ └────┘ └────┘
+//! │    │ │    │ │    │               ││                ┌────┐ ┌────┐ ┌────┐
+//! │    │ │    │ │    │               │└───────────────▶│    │ │    │ │    │
+//! │    │ │    │ │    │               │                 │    │ │    │ │    │
+//! │    │ │    │ │    │         ...   │                 │ C1 │ │... │ │ CN │
+//! │    │ │    │ │    │ ──────────────┘                 │    │ │    │ │    │
+//! │    │ │    │ │    │                ┌──────────────▶ │    │ │    │ │    │
+//! │ C1 │ │... │ │ CN │                │                │    │ │    │ │    │
+//! │    │ │    │ │    │───────────────┐│                └────┘ └────┘ └────┘
+//! │    │ │    │ │    │               ││
+//! │    │ │    │ │    │               ││
+//! │    │ │    │ │    │               ││                         ...
+//! │    │ │    │ │    │   ────────────┼┼┐
+//! │    │ │    │ │    │               │││
+//! │    │ │    │ │    │               │││               ┌────┐ ┌────┐ ┌────┐
+//! │    │ │    │ │    │ ──────────────┼┘│               │    │ │    │ │    │
+//! │    │ │    │ │    │               │ │               │    │ │    │ │    │
+//! │    │ │    │ │    │               │ │               │ C1 │ │... │ │ CN │
+//! │    │ │    │ │    │               └─┼────────────▶  │    │ │    │ │    │
+//! │    │ │    │ │    │                 │               │    │ │    │ │    │
+//! │    │ │    │ │    │                 └─────────────▶ │    │ │    │ │    │
+//! └────┘ └────┘ └────┘                                 └────┘ └────┘ └────┘
+//!    Input RecordBatch                                  NUM_STREAMS input
+//!      Columns 1..N                                       RecordBatches
+//! INPUT_SIZE sorted rows                                (still INPUT_SIZE total
+//!     ~10% duplicates                                          rows)

Review Comment:
   :heart: Love the diagram! 
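
   For readers without the full file at hand, the three steps in the module comment above can be sketched in plain Rust. This is only an illustration under stated assumptions: it uses `Vec<i64>` in place of Arrow `RecordBatch`es, a small hand-rolled generator in place of a real random number crate, and a binary-heap k-way merge in place of `SortPreservingMergeExec`. The names `Lcg`, `split_into_streams`, and `merge_streams` are hypothetical and do not come from the PR.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Tiny deterministic pseudo-random generator (an LCG), used only so this
/// sketch needs no external crates. Hypothetical helper, not from the PR.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Steps 1 and 2: assign each row of an already sorted input to a randomly
/// chosen stream. Rows are visited in order, so every stream stays sorted.
fn split_into_streams(sorted: &[i64], num_streams: usize, seed: u64) -> Vec<Vec<i64>> {
    let mut rng = Lcg(seed);
    let mut streams = vec![Vec::new(); num_streams];
    for &row in sorted {
        let target = (rng.next_u64() % num_streams as u64) as usize;
        streams[target].push(row);
    }
    streams
}

/// Step 3: k-way merge of the sorted streams using a min-heap keyed on
/// (value, stream index, position within that stream).
fn merge_streams(streams: &[Vec<i64>]) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    for (idx, stream) in streams.iter().enumerate() {
        if let Some(&first) = stream.first() {
            heap.push(Reverse((first, idx, 0usize)));
        }
    }
    let mut merged = Vec::with_capacity(streams.iter().map(Vec::len).sum());
    while let Some(Reverse((value, idx, pos))) = heap.pop() {
        merged.push(value);
        // Refill the heap with the next row from the stream just consumed.
        if let Some(&next) = streams[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    merged
}

fn main() {
    // Sorted input in which every tenth value repeats its predecessor.
    let input: Vec<i64> = (0..1000)
        .map(|i| if i % 10 == 9 { i - 1 } else { i })
        .collect();
    let streams = split_into_streams(&input, 8, 42);
    let merged = merge_streams(&streams);
    // Merging the streams should reproduce the original sorted input.
    assert_eq!(merged, input);
    println!("merged {} rows from {} streams", merged.len(), streams.len());
}
```

   The benchmark itself times only the merge step; the split is setup. In this sketch that would correspond to timing `merge_streams` alone.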





[GitHub] [arrow-datafusion] andygrove merged pull request #2431: Benchmark for sort preserving merge

Posted by GitBox <gi...@apache.org>.
andygrove merged PR #2431:
URL: https://github.com/apache/arrow-datafusion/pull/2431




[GitHub] [arrow-datafusion] alamb commented on pull request #2431: Benchmark for sort preserving merge

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2431:
URL: https://github.com/apache/arrow-datafusion/pull/2431#issuecomment-1130392672

   cc @tustvold 

