Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/04 04:27:10 UTC
[GitHub] [arrow-datafusion] yjshen opened a new pull request, #2146: Buffer records in row format in memory for SortExec
yjshen opened a new pull request, #2146:
URL: https://github.com/apache/arrow-datafusion/pull/2146
# Which issue does this PR close?
Closes #.
This PR is based on #2132.
# Rationale for this change
Row format is more cache-friendly for "row logic" operators, at the cost of extra computation when it is used inside a vectorized engine like DataFusion (the extra row <-> columnar conversions).
This PR explores using the row format as the in-memory record buffer layout, for benchmarking, testing, and discussion purposes.
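To make the row <-> columnar conversion concrete, here is a minimal sketch in plain Python. It is not the code in this PR and does not use Arrow or DataFusion types: the helper names `columns_to_rows` and `rows_to_columns` and the sample values are made up for illustration. It only shows the shape of the round trip: transpose a columnar batch into rows, sort the rows as whole units, then transpose back.

```python
from typing import Dict, List, Tuple

Columns = Dict[str, list]  # columnar layout: one list per column
Row = Tuple                # row layout: one tuple per record


def columns_to_rows(columns: Columns) -> List[Row]:
    """Transpose column vectors into row tuples (columnar -> row)."""
    return list(zip(*columns.values()))


def rows_to_columns(names: List[str], rows: List[Row]) -> Columns:
    """Transpose row tuples back into column vectors (row -> columnar)."""
    if not rows:
        return {name: [] for name in names}
    return {name: list(col) for name, col in zip(names, zip(*rows))}


# Hypothetical three-record batch; column names borrowed from lineitem.
batch = {
    "l_extendedprice": [300.0, 100.0, 200.0],
    "l_discount": [0.05, 0.10, 0.00],
    "l_returnflag": ["N", "A", "R"],
}

# In the row layout, the sort moves each record as one unit.
rows = columns_to_rows(batch)
rows.sort(key=lambda r: (r[0], r[1]))  # order by l_extendedprice, l_discount

# Convert back so that downstream vectorized operators see columns again.
sorted_batch = rows_to_columns(list(batch), rows)
```

Both transposes touch every value once, which is the extra computation mentioned above. The potential benefit is that all fields of a record sit together while the sort moves it.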
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #2146: Buffer records in row format in memory for SortExec
Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2146:
URL: https://github.com/apache/arrow-datafusion/pull/2146#issuecomment-1087954980
This PR looks amazing -- I can't wait to review it -- I plan to do so tomorrow. Thank you @yjshen
[GitHub] [arrow-datafusion] yjshen commented on pull request #2146: Buffer records in row format in memory for SortExec
Posted by GitBox <gi...@apache.org>.
yjshen commented on PR #2146:
URL: https://github.com/apache/arrow-datafusion/pull/2146#issuecomment-1088263158
Thanks @alamb! This PR is more of a playground and testbed for using the row format for the sort's payload. It's not mature yet, but I've got some numbers we could discuss.
As the benchmarks in this thread show, row <-> columnar conversion costs extra CPU, and that cost pays off once columnar memory access becomes more expensive on a bigger dataset.
Clearly, we need a clever/adaptive mechanism to choose which memory layout to use, based on the number of payload columns, the size of the input data, etc. So I'm posting this experimental implementation with some micro-benchmark results to gather insights from more brilliant minds.
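As a strawman for what such an adaptive choice could look like, here is a small Python sketch. Everything in it is an assumption for discussion: `SortInput`, `choose_layout`, and both thresholds are hypothetical, not part of DataFusion or this PR, and real cut-offs would have to come from benchmarks.

```python
from dataclasses import dataclass


@dataclass
class SortInput:
    payload_columns: int  # number of non-key columns carried through the sort
    estimated_bytes: int  # rough size of the data to buffer in memory


# Hypothetical thresholds, chosen only to make the sketch runnable.
MIN_PAYLOAD_COLUMNS = 4
CACHE_BUDGET_BYTES = 64 * 1024 * 1024  # e.g. a working set about the size of L3


def choose_layout(inp: SortInput) -> str:
    """Return "row" or "columnar" for the in-memory sort buffer."""
    wide = inp.payload_columns >= MIN_PAYLOAD_COLUMNS
    large = inp.estimated_bytes > CACHE_BUDGET_BYTES
    # Pay the conversion cost only when the payload is wide and the data is
    # too big to stay cache-resident; otherwise stay columnar.
    return "row" if wide and large else "columnar"
```

For example, `choose_layout(SortInput(payload_columns=6, estimated_bytes=1 << 30))` picks the row layout, while the same six columns over 1 MiB of data stay columnar.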
cc @Dandandan @houqp @tustvold you might be interested in this as well.
[GitHub] [arrow-datafusion] yjshen commented on pull request #2146: Buffer records in row format in memory for SortExec
Posted by GitBox <gi...@apache.org>.
yjshen commented on PR #2146:
URL: https://github.com/apache/arrow-datafusion/pull/2146#issuecomment-1087134690
# TPC-H SF=1
`master`:
```
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "../../tpch-parquet/", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 2851.7 ms and returned 6001214 rows
Query 1 iteration 1 took 2817.7 ms and returned 6001214 rows
Query 1 iteration 2 took 2735.9 ms and returned 6001214 rows
Query 1 avg time: 2801.75 ms
```
This PR:
```
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 3174.9 ms and returned 6001214 rows
Query 1 iteration 1 took 3130.8 ms and returned 6001214 rows
Query 1 iteration 2 took 3058.3 ms and returned 6001214 rows
Query 1 avg time: 3121.35 ms
```
The row format comes at the price of more computation: a ~11% slowdown here. Although this PR shows better cache locality, the computation cost outweighs the cache benefits:
```
sudo perf stat -a -e cache-misses,cache-references,l3_cache_accesses,l3_misses,dTLB-load-misses,dTLB-loads target/release/tpch benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet --format parquet --query 1 --batch-size 4096
```
`master`
```
Performance counter stats for 'system wide':
756,702,553 cache-misses # 34.256 % of all cache refs
2,208,936,269 cache-references
1,156,898,644 l3_cache_accesses
362,860,081 l3_misses
215,166,268 dTLB-load-misses # 45.27% of all dTLB cache accesses
475,312,480 dTLB-loads
8.774750150 seconds time elapsed
```
This PR:
```
Performance counter stats for 'system wide':
593,785,538 cache-misses # 25.841 % of all cache refs
2,297,807,480 cache-references
835,622,737 l3_cache_accesses
227,838,803 l3_misses
146,442,785 dTLB-load-misses # 55.76% of all dTLB cache accesses
262,616,456 dTLB-loads
10.556249442 seconds time elapsed
```
Cache access behavior is much better with the row format: cache misses, L3 misses, and dTLB load misses all drop in absolute terms.
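For anyone who wants to re-derive the percentages, the arithmetic can be sketched in a few lines of Python. The helper names are made up; the inputs are the SF=1 iteration times and counters copied from the logs in this comment. Note that the counters were collected system wide (`perf stat -a`), so they include whatever else was running on the machine.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def relative_change(baseline, candidate):
    """Signed change of candidate relative to baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Per-iteration wall-clock times (ms) for TPC-H SF=1, from the logs above.
master_ms = [2851.7, 2817.7, 2735.9]
pr_ms = [3174.9, 3130.8, 3058.3]

# Positive means this PR is slower than master.
slowdown = relative_change(mean(master_ms), mean(pr_ms))

# Negative means this PR has fewer misses than master.
cache_miss_change = relative_change(756_702_553, 593_785_538)
l3_miss_change = relative_change(362_860_081, 227_838_803)

print(f"runtime: {slowdown:+.1f}%")
print(f"cache-misses: {cache_miss_change:+.1f}%")
print(f"l3_misses: {l3_miss_change:+.1f}%")
```

The runtime change comes out at roughly +11%, while cache misses fall by a bit over a fifth and L3 misses by more than a third.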
[GitHub] [arrow-datafusion] alamb commented on pull request #2146: Buffer records in row format in memory for SortExec
Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2146:
URL: https://github.com/apache/arrow-datafusion/pull/2146#issuecomment-1089340239
I didn't quite make it to this PR today, but I plan to do so tomorrow (I want to check it out locally and play around with it / make some flamegraphs)
[GitHub] [arrow-datafusion] alamb commented on pull request #2146: Buffer records in row format in memory for SortExec
Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2146:
URL: https://github.com/apache/arrow-datafusion/pull/2146#issuecomment-1092164658
As I have mentioned, I am very interested in this work, but I have not yet found time to give it the deep study it deserves (I am especially interested in the profiling).
I have been super busy with other work items and keeping things going in IOx and DataFusion, so I am not sure when I will have a chance to study this one carefully.
[GitHub] [arrow-datafusion] yjshen commented on pull request #2146: Buffer records in row format in memory for SortExec
Posted by GitBox <gi...@apache.org>.
yjshen commented on PR #2146:
URL: https://github.com/apache/arrow-datafusion/pull/2146#issuecomment-1087143111
# TPC-H SF=10
`master`
```
target/release/tpch benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet-sf10 --format parquet --query 1 --batch-size 4096
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet-sf10", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 47772.6 ms and returned 59986051 rows
Query 1 iteration 1 took 47899.2 ms and returned 59986051 rows
Query 1 iteration 2 took 48861.9 ms and returned 59986051 rows
Query 1 avg time: 48177.89 ms
```
This PR:
```
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet-sf10", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 38565.1 ms and returned 59986051 rows
Query 1 iteration 1 took 37786.0 ms and returned 59986051 rows
Query 1 iteration 2 took 37056.7 ms and returned 59986051 rows
Query 1 avg time: 37802.62 ms
```
The performance has **improved** by ~21.5% this time. The better memory access pattern more than pays for the extra computation of the row <-> columnar transformation.
`master`
```
Performance counter stats for 'system wide':
9,443,323,018 cache-misses # 41.338 % of all cache refs
22,844,399,240 cache-references
14,787,052,560 l3_cache_accesses
5,753,820,101 l3_misses
3,046,705,364 dTLB-load-misses # 54.75% of all dTLB cache accesses
5,565,251,257 dTLB-loads
147.045336524 seconds time elapsed
```
This PR:
```
Performance counter stats for 'system wide':
6,750,648,518 cache-misses # 30.344 % of all cache refs
22,247,021,905 cache-references
10,821,629,799 l3_cache_accesses
2,122,684,404 l3_misses
2,348,824,410 dTLB-load-misses # 64.09% of all dTLB cache accesses
3,664,743,134 dTLB-loads
115.306819499 seconds time elapsed
```
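One thing worth keeping in mind when reading these counters is the difference between miss counts and miss rates. The following Python sketch recomputes the rates from the raw SF=10 numbers above. The dictionary layout and the `rate` helper are illustrative only; the values are copied from the two perf outputs.

```python
def rate(misses, accesses):
    """Miss rate in percent."""
    return misses / accesses * 100.0


# Raw counters for TPC-H SF=10, copied from the perf outputs above.
counters = {
    "master": {
        "dtlb_misses": 3_046_705_364,
        "dtlb_loads": 5_565_251_257,
        "l3_misses": 5_753_820_101,
        "l3_accesses": 14_787_052_560,
    },
    "pr": {
        "dtlb_misses": 2_348_824_410,
        "dtlb_loads": 3_664_743_134,
        "l3_misses": 2_122_684_404,
        "l3_accesses": 10_821_629_799,
    },
}

# Derive the dTLB and L3 miss rates for each build.
rates = {
    name: {
        "dtlb": rate(c["dtlb_misses"], c["dtlb_loads"]),
        "l3": rate(c["l3_misses"], c["l3_accesses"]),
    }
    for name, c in counters.items()
}

for name, r in rates.items():
    print(f"{name}: dTLB miss rate {r['dtlb']:.2f}%, L3 miss rate {r['l3']:.2f}%")
```

With this PR the dTLB miss rate is higher than on `master`, but the absolute number of dTLB load misses is lower because there are far fewer dTLB loads overall. The L3 miss rate and the L3 miss count both go down.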
[GitHub] [arrow-datafusion] yjshen commented on pull request #2146: Buffer records in row format in memory for SortExec
Posted by GitBox <gi...@apache.org>.
yjshen commented on PR #2146:
URL: https://github.com/apache/arrow-datafusion/pull/2146#issuecomment-1087124793
Hardware Settings:
```
H/W path Device Class Description
====================================================================
system MS-7D53 (To be filled by O.E.M.)
/0 bus MPG X570S EDGE MAX WIFI (MS-7D53)
/0/0 memory 64KiB BIOS
/0/11 memory 32GiB System Memory
/0/11/0 memory 3600 MHz (0.3 ns) [empty]
/0/11/1 memory 16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 3600 MHz (0.3
/0/11/2 memory 3600 MHz (0.3 ns) [empty]
/0/11/3 memory 16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 3600 MHz (0.3
/0/14 memory 1MiB L1 cache
/0/15 memory 8MiB L2 cache
/0/16 memory 64MiB L3 cache
/0/17 processor AMD Ryzen 9 5950X 16-Core Processor
```
## A modified version of TPC-H q1:
```sql
select
    l_returnflag,
    l_linestatus,
    l_quantity,
    l_extendedprice,
    l_discount,
    l_tax
from
    lineitem
order by
    l_extendedprice,
    l_discount;
```
[GitHub] [arrow-datafusion] yjshen closed pull request #2146: Buffer records in row format in memory for SortExec
Posted by GitBox <gi...@apache.org>.
yjshen closed pull request #2146: Buffer records in row format in memory for SortExec
URL: https://github.com/apache/arrow-datafusion/pull/2146
[GitHub] [arrow-datafusion] alamb commented on pull request #2146: Buffer records in row format in memory for SortExec
Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2146:
URL: https://github.com/apache/arrow-datafusion/pull/2146#issuecomment-1100180806
Marking as draft to make it clear this PR isn't waiting on review -- please feel free to mark it as ready for review if that analysis is not correct or if it is ready to review again