Posted to dev@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/04 03:29:23 UTC
[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2132: WIP: Reduce sort memory usage v1
yjshen commented on code in PR #2132:
URL: https://github.com/apache/arrow-datafusion/pull/2132#discussion_r841342087
datafusion/core/src/physical_plan/sorts/sort.rs:
```diff
@@ -105,13 +107,21 @@ impl ExternalSorter {
         }
     }
-    async fn insert_batch(&self, input: RecordBatch) -> Result<()> {
+    async fn insert_batch(
+        &self,
+        input: RecordBatch,
+        tracking_metrics: &MemTrackingMetrics,
+    ) -> Result<()> {
         if input.num_rows() > 0 {
             let size = batch_byte_size(&input);
             self.try_grow(size).await?;
             self.metrics.mem_used().add(size);
             let mut in_mem_batches = self.in_mem_batches.lock().await;
-            in_mem_batches.push(input);
+            // NB timer records time taken on drop, so there are no
+            // calls to `timer.done()` below.
+            let _timer = tracking_metrics.elapsed_compute().timer();
+            let partial = sort_batch(input, self.schema.clone(), &self.expr)?;
```
Review Comment:
Performance would deteriorate significantly without this change:
```
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 4619.9 ms and returned 6001214 rows
Query 1 iteration 1 took 4561.0 ms and returned 6001214 rows
Query 1 iteration 2 took 4527.7 ms and returned 6001214 rows
```
I think the main cause is random memory access while constructing output batches. Without this per-batch sort, collecting cells from unsorted batches makes the memory access pattern fully random. With the per-batch sort, we access memory linearly for each column in each batch, which results in a much more predictable access pattern and benefits the CPU cache.
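The effect can be sketched with a std-only toy model (hypothetical, not the actual DataFusion code): sort each batch when it is inserted, then k-way merge the sorted batches, so the merge phase only ever reads the next element of each batch sequentially instead of gathering rows from arbitrary positions in unsorted batches.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Stand-in for the per-batch sort done at insert time.
fn sort_batch(mut batch: Vec<i32>) -> Vec<i32> {
    batch.sort_unstable();
    batch
}

// k-way merge of sorted batches. The heap holds (next value, batch index,
// offset within batch); each batch is consumed front-to-back, i.e. linearly.
fn merge_sorted(batches: &[Vec<i32>]) -> Vec<i32> {
    let mut heap: BinaryHeap<Reverse<(i32, usize, usize)>> = batches
        .iter()
        .enumerate()
        .filter(|(_, b)| !b.is_empty())
        .map(|(i, b)| Reverse((b[0], i, 0)))
        .collect();
    let mut out = Vec::with_capacity(batches.iter().map(Vec::len).sum());
    while let Some(Reverse((v, i, off))) = heap.pop() {
        out.push(v);
        if let Some(&next) = batches[i].get(off + 1) {
            heap.push(Reverse((next, i, off + 1)));
        }
    }
    out
}

fn main() {
    let batches: Vec<Vec<i32>> = vec![vec![3, 1, 2], vec![9, 4], vec![5]]
        .into_iter()
        .map(sort_batch)
        .collect();
    assert_eq!(merge_sorted(&batches), vec![1, 2, 3, 4, 5, 9]);
}
```

The real code sorts Arrow columns via sort indices rather than scalar vectors, but the access-pattern argument is the same: the merge touches each sorted batch front-to-back.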
I think the perf counters confirm this speculation:
```
sudo perf stat -a -e cache-misses,cache-references,l3_cache_accesses,l3_misses,dTLB-load-misses,dTLB-loads target/release/tpch benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet --format parquet --query 1 --batch-size 4096
```
Without this per-batch sort:
```
Performance counter stats for 'system wide':
1,340,359,889 cache-misses # 35.817 % of all cache refs
3,742,289,458 cache-references
1,984,089,839 l3_cache_accesses
540,429,658 l3_misses
303,508,234 dTLB-load-misses # 49.51% of all dTLB cache accesses
613,048,439 dTLB-loads
14.222309739 seconds time elapsed
```
With this per-batch sort:
```
Performance counter stats for 'system wide':
1,059,913,512 cache-misses # 30.715 % of all cache refs
3,450,839,405 cache-references
1,388,975,765 l3_cache_accesses
235,570,805 l3_misses
239,390,511 dTLB-load-misses # 51.36% of all dTLB cache accesses
466,141,655 dTLB-loads
8.675278258 seconds time elapsed
```
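To put numbers on the counter deltas (values copied verbatim from the two perf runs above), a quick calculation:

```rust
fn main() {
    // l3_misses and elapsed time from the two `perf stat` runs above.
    let l3_misses_without = 540_429_658_u64;
    let l3_misses_with = 235_570_805_u64;
    let miss_reduction = 100.0 * (1.0 - l3_misses_with as f64 / l3_misses_without as f64);

    let elapsed_without = 14.222309739_f64;
    let elapsed_with = 8.675278258_f64;
    let speedup = elapsed_without / elapsed_with;

    println!("L3 miss reduction: {:.1}%", miss_reduction); // about 56%
    println!("Wall-clock speedup: {:.2}x", speedup); // about 1.64x
}
```

So the per-batch sort cuts L3 misses by more than half, which lines up with the roughly 1.6x wall-clock improvement.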