Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/03 14:16:11 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #2427: Improve Sorting / Merge performance by Apply the row format, and taking advantage of new JIT framework

alamb opened a new issue, #2427:
URL: https://github.com/apache/arrow-datafusion/issues/2427

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I plan to make sorting / merging faster. My reasons:
   
   1. I find it personally interesting
   2. It is a key piece of technology for bringing DataFusion's performance on par with systems like DuckDB
   3. It is important for my project IOx in the medium term
   
   **Describe the solution you'd like**
   Basically the plan is to follow the advice given by Goetz Graefe in [Implementing sorting in database systems](https://dl.acm.org/doi/10.1145/1132960.1132964), which has been successfully implemented in systems like DuckDB (see [blog post](https://duckdb.org/2021/08/27/external-sorting.html)).
   
   It will likely involve some combination of a specialization of the row format and JIT-compiled comparisons.
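   
   To illustrate the row-format idea, here is a minimal sketch, assuming a two-column sort key of type `(i64, &str)`. Each row is encoded into a byte string whose plain byte-wise order matches the tuple order, so a comparison becomes a single `memcmp`-style pass instead of a per-column dispatch. The function names and the byte layout are hypothetical and are not DataFusion's actual row format.
   
   ```rust
   // Hypothetical sketch: order-preserving byte keys for (i64, &str) rows.
   // Not DataFusion's row format; names and layout are illustrative only.
   
   /// Append an i64 so that unsigned byte order equals signed numeric order:
   /// flip the sign bit, then write the value big-endian.
   fn encode_i64(v: i64, out: &mut Vec<u8>) {
       let flipped = (v as u64) ^ (1u64 << 63);
       out.extend_from_slice(&flipped.to_be_bytes());
   }
   
   /// Append a string so that a string sorts before any longer string it prefixes.
   /// A 0x00 byte is escaped as 0x00 0xFF and the value is terminated by 0x00 0x00.
   fn encode_str(s: &str, out: &mut Vec<u8>) {
       for &b in s.as_bytes() {
           if b == 0 {
               out.extend_from_slice(&[0x00, 0xFF]);
           } else {
               out.push(b);
           }
       }
       out.extend_from_slice(&[0x00, 0x00]);
   }
   
   /// Encode one row; columns are concatenated in sort-key order.
   fn encode_row(id: i64, name: &str) -> Vec<u8> {
       let mut key = Vec::with_capacity(8 + name.len() + 2);
       encode_i64(id, &mut key);
       encode_str(name, &mut key);
       key
   }
   
   fn main() {
       let rows = vec![(2, "b"), (-1, "z"), (2, "a"), (2, "ab"), (-1, "")];
   
       // Reference order: ordinary tuple comparison, column by column.
       let mut by_tuple = rows.clone();
       by_tuple.sort();
   
       // Same order obtained by comparing the encoded bytes only.
       let mut by_key = rows.clone();
       by_key.sort_by_key(|&(id, name)| encode_row(id, name));
   
       assert_eq!(by_tuple, by_key);
       println!("{:?}", by_key);
   }
   ```
   
   A real implementation would also need to handle nulls, descending order, and the remaining data types, and would encode each key once rather than on every comparison.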
   
   Here is my rough plan and a sketch of the kinds of things I want to work on
   - [ ] Benchmarks
   - [ ] POC of comparing using row format
   - [ ] Add full type support for row format comparisons
   - [ ] Turn POC to real
   - [ ] #2150
   - [ ] #2151
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2427: Improve Sorting / Merge performance

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #2427:
URL: https://github.com/apache/arrow-datafusion/issues/2427#issuecomment-1499149743

   I believe @tustvold is actively working on this (e.g., https://github.com/apache/arrow-datafusion/pull/5894, https://github.com/apache/arrow-datafusion/pull/5895, and https://github.com/apache/arrow-datafusion/pull/5886).




[GitHub] [arrow-datafusion] alamb commented on issue #2427: Improve Sorting / Merge performance by Apply the row format, and taking advantage of new JIT framework

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2427:
URL: https://github.com/apache/arrow-datafusion/issues/2427#issuecomment-1140425072

   In case anyone is interested, here are some flame graphs I gathered for the merge benchmarks.
   All were gathered from `alamb/merge_bench`, which is based on master at 7b7edf9c43383c1d3310286b69d2d037db72c967:
   
   ```shell
   alamb@MacBook-Pro-6 arrow-datafusion % git merge-base alamb/alamb/merge_bench apache/master
   7b7edf9c43383c1d3310286b69d2d037db72c967
   ```
   
   ![flamegraph-merge-utf8-tuple](https://user-images.githubusercontent.com/490673/170864629-612ea5cb-1a18-4ec3-ba77-885104f84d84.svg)
   
   ![flamegraph-merge-utf8-low](https://user-images.githubusercontent.com/490673/170864628-5d449964-c2f0-4321-898b-fbc5966452cc.svg)
   
   ![flamegraph-merge-utf8-high](https://user-images.githubusercontent.com/490673/170864623-54b8783a-af8a-431c-a940-0706b7924431.svg)
   
   
   
   ![flamegraph-merge-i64](https://user-images.githubusercontent.com/490673/170864615-4acc62ca-ccdf-41c7-9ebd-5a632cdae5bd.svg)
   ![flamegraph-merge-mixed-tuple](https://user-images.githubusercontent.com/490673/170864618-d0fe094b-646c-4bb9-81a5-7b3dc23d76aa.svg)
   
   
   ![flamegraph-merge-f64](https://user-images.githubusercontent.com/490673/170864613-91b1b63a-7f14-4bc7-b474-6b02bfa9c62e.svg)
   




[GitHub] [arrow-datafusion] alamb commented on issue #2427: Improve Sorting / Merge performance by Apply the row format, and taking advantage of new JIT framework

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2427:
URL: https://github.com/apache/arrow-datafusion/issues/2427#issuecomment-1165903697

   Here is a trace from IOx showing a lot of time being spent sorting batches...
   
   ```
   22.02 s   22.7%	0 s	 	                                                           iox_query::provider::deduplicate::deduplicate::_$u7b$$u7b$closure$u7d$$u7d$::hf3d0d2543c390834
   19.25 s   19.9%	0 s	 	                                                            _$LT$futures_util..stream..stream..next..Next$LT$St$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h6e92074c50cc7f69
   19.25 s   19.9%	0 s	 	                                                             futures_util::stream::stream::StreamExt::poll_next_unpin::h0a7d4570974377a3
   19.25 s   19.9%	0 s	 	                                                              _$LT$core..pin..Pin$LT$P$GT$$u20$as$u20$futures_core..stream..Stream$GT$::poll_next::hee2bdbc79b3020cf
   19.25 s   19.9%	0 s	 	                                                               _$LT$datafusion..physical_plan..union..ObservedStream$u20$as$u20$futures_core..stream..Stream$GT$::poll_next::hf4bd36dcc3490171
   19.25 s   19.9%	0 s	 	                                                                futures_util::stream::stream::StreamExt::poll_next_unpin::h0a7d4570974377a3
   19.25 s   19.9%	0 s	 	                                                                 _$LT$core..pin..Pin$LT$P$GT$$u20$as$u20$futures_core..stream..Stream$GT$::poll_next::hee2bdbc79b3020cf
   19.25 s   19.9%	0 s	 	                                                                  _$LT$datafusion..physical_plan..stream..RecordBatchStreamAdapter$LT$S$GT$$u20$as$u20$futures_core..stream..Stream$GT$::poll_next::h2cb791419b210a88
   19.25 s   19.9%	0 s	 	                                                                   _$LT$futures_util..stream..try_stream..try_flatten..TryFlatten$LT$St$GT$$u20$as$u20$futures_core..stream..Stream$GT$::poll_next::hecbc974d417753bb
   19.25 s   19.9%	0 s	 	                                                                    _$LT$S$u20$as$u20$futures_core..stream..TryStream$GT$::try_poll_next::ha20467981653b215
   19.25 s   19.9%	0 s	 	                                                                     _$LT$futures_util..stream..once..Once$LT$Fut$GT$$u20$as$u20$futures_core..stream..Stream$GT$::poll_next::h95fb74a12fffeea3
   19.25 s   19.9%	0 s	 	                                                                      _$LT$futures_util..future..try_future..MapErr$LT$Fut$C$F$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h3230513ec689705a
   19.25 s   19.9%	0 s	 	                                                                       _$LT$futures_util..future..future..Map$LT$Fut$C$F$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h464940ca2f951951
   19.25 s   19.9%	0 s	 	                                                                        _$LT$futures_util..future..future..map..Map$LT$Fut$C$F$GT$$u20$as$u20$core..future..future..Future$GT$::poll::heb35220cf8b84857
   19.25 s   19.9%	0 s	 	                                                                         _$LT$futures_util..future..try_future..into_future..IntoFuture$LT$Fut$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h55a6d3f5af84763f
   19.25 s   19.9%	0 s	 	                                                                          _$LT$F$u20$as$u20$futures_core..future..TryFuture$GT$::try_poll::hf219144ea170e2c8
   19.25 s   19.9%	0 s	 	                                                                           _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h79164ac5d2444444
   19.25 s   19.9%	0 s	 	                                                                            datafusion::physical_plan::sorts::sort::do_sort::_$u7b$$u7b$closure$u7d$$u7d$::h40cf3c40db84fa17
   19.24 s   19.9%	0 s	 	                                                                             _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hf5dad25a12c01571
   19.24 s   19.9%	0 s	 	                                                                              datafusion::physical_plan::sorts::sort::ExternalSorter::insert_batch::_$u7b$$u7b$closure$u7d$$u7d$::h09bf6567368d0832
   19.24 s   19.9%	0 s	 	                                                                               datafusion::physical_plan::sorts::sort::sort_batch::h7b6b5407071c4f46
   18.64 s   19.2%	0 s	 	                                                                                arrow::compute::kernels::sort::lexsort_to_indices::h4e2ca9dbf2774da6
   18.61 s   19.2%	0 s	 	                                                                                 arrow::compute::kernels::sort::sort_unstable_by::h492ca4aa0dad4bef
   18.61 s   19.2%	0 s	 	                                                                                  core::slice::_$LT$impl$u20$$u5b$T$u5d$$GT$::sort_unstable_by::he9ab2244924b2cf3
   18.61 s   19.2%	0 s	 	                                                                                   core::slice::sort::quicksort::h2837f00ef0d13950
   ```
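   
   In the trace above, nearly all of the time under `sort_batch` is spent in the comparator-driven `sort_unstable_by` call inside `lexsort_to_indices`. As a rough illustration of the alternative, the sketch below sorts row indices for two `u32` columns in two ways: with a comparator that walks the columns on every comparison, and with a single pre-packed `u64` key per row. Both function names are hypothetical; this uses only the standard library and is not the arrow or DataFusion implementation.
   
   ```rust
   use std::cmp::Ordering;
   
   /// Comparator-driven: every comparison indexes into each column in turn.
   fn sort_indices_by_columns(a: &[u32], b: &[u32]) -> Vec<usize> {
       let mut idx: Vec<usize> = (0..a.len()).collect();
       idx.sort_unstable_by(|&i, &j| match a[i].cmp(&a[j]) {
           Ordering::Equal => b[i].cmp(&b[j]),
           other => other,
       });
       idx
   }
   
   /// Key-driven: pack both columns into one u64 per row up front, so the
   /// sort itself only compares plain integers.
   fn sort_indices_by_packed_key(a: &[u32], b: &[u32]) -> Vec<usize> {
       let mut keyed: Vec<(u64, usize)> = a
           .iter()
           .zip(b)
           .enumerate()
           .map(|(i, (&x, &y))| (((x as u64) << 32) | y as u64, i))
           .collect();
       keyed.sort_unstable();
       keyed.into_iter().map(|(_, i)| i).collect()
   }
   
   fn main() {
       // Rows are distinct, so both approaches must yield the same permutation.
       let a = [3u32, 1, 3, 2];
       let b = [1u32, 9, 0, 5];
       let by_columns = sort_indices_by_columns(&a, &b);
       let by_key = sort_indices_by_packed_key(&a, &b);
       assert_eq!(by_columns, by_key);
       println!("{:?}", by_key);
   }
   ```
   
   Whether the key-driven variant is actually faster depends on the data and key width, which is what the benchmarks in this issue are meant to measure.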




[GitHub] [arrow-datafusion] Dandandan commented on issue #2427: Improve Sorting / Merge performance

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #2427:
URL: https://github.com/apache/arrow-datafusion/issues/2427#issuecomment-1615822402

   I think we can close this ticket: https://github.com/apache/arrow-datafusion/pull/5894, https://github.com/apache/arrow-datafusion/pull/5895, https://github.com/apache/arrow-datafusion/pull/5886, and others are merged.




[GitHub] [arrow-datafusion] alamb commented on issue #2427: Improve Sorting / Merge performance

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2427:
URL: https://github.com/apache/arrow-datafusion/issues/2427#issuecomment-1212414027

   I believe @tustvold is working on some aspects of this (specifically, creating faster comparators based on JIT). Is that tracked somewhere, @tustvold?




[GitHub] [arrow-datafusion] tustvold commented on issue #2427: Improve Sorting / Merge performance

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2427:
URL: https://github.com/apache/arrow-datafusion/issues/2427#issuecomment-1212455842

   I'm currently prototyping some stuff and hope to push a draft in the coming days for feedback.




[GitHub] [arrow-datafusion] Dandandan closed issue #2427: Improve Sorting / Merge performance

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan closed issue #2427:  Improve Sorting / Merge performance 
URL: https://github.com/apache/arrow-datafusion/issues/2427

