Posted to github@arrow.apache.org by "jaylmiller (via GitHub)" <gi...@apache.org> on 2023/03/14 15:33:07 UTC

[GitHub] [arrow-datafusion] jaylmiller commented on pull request #5292: use row encoding for SortExec

jaylmiller commented on PR #5292:
URL: https://github.com/apache/arrow-datafusion/pull/5292#issuecomment-1468327766

   Coding-wise, everything is finished and the code is ready for review. I'm not yet 100% confident in the benchmark results, though.
   
   The sort micro-benchmarks look pretty good: significant improvements in the cases where row encoding is actually used, and minor regressions, mostly within error bars, in the cases without row encoding. More experienced contributors will be better placed to judge how significant these regressions actually are (results reposted below):
   
   ```
   group                                                     main-sort                                rows-sort
   -----                                                     ---------                                ---------
   sort f64                                                  1.00     10.8±0.23ms        ? ?/sec      1.04     11.2±0.93ms        ? ?/sec
   sort f64 preserve partitioning                            1.00      4.0±0.27ms        ? ?/sec      1.04      4.1±0.28ms        ? ?/sec
   sort i64                                                  1.00      9.5±0.55ms        ? ?/sec      1.09     10.3±0.74ms        ? ?/sec
   sort i64 preserve partitioning                            1.00      3.3±0.10ms        ? ?/sec      1.06      3.5±0.13ms        ? ?/sec
   sort mixed tuple                                          1.28     28.3±3.35ms        ? ?/sec      1.00     22.2±1.60ms        ? ?/sec
   sort mixed tuple preserve partitioning                    1.00      3.6±0.17ms        ? ?/sec      1.15      4.1±1.09ms        ? ?/sec
   sort mixed utf8 dictionary tuple                          2.84     52.7±8.27ms        ? ?/sec      1.00     18.6±1.29ms        ? ?/sec
   sort mixed utf8 dictionary tuple preserve partitioning    1.02      4.2±0.92ms        ? ?/sec      1.00      4.1±0.55ms        ? ?/sec
   sort utf8 dictionary                                      1.00      3.7±0.21ms        ? ?/sec      1.04      3.9±0.33ms        ? ?/sec
   sort utf8 dictionary preserve partitioning                1.00  1487.2±1444.67µs        ? ?/sec    1.01  1502.8±315.79µs        ? ?/sec
   sort utf8 dictionary tuple                                3.26    57.0±11.35ms        ? ?/sec      1.00     17.5±2.08ms        ? ?/sec
   sort utf8 dictionary tuple preserve partitioning          1.13      4.1±1.08ms        ? ?/sec      1.00      3.6±0.52ms        ? ?/sec
   sort utf8 high cardinality                                1.01     28.0±3.70ms        ? ?/sec      1.00     27.6±3.81ms        ? ?/sec
   sort utf8 high cardinality preserve partitioning          1.00     11.1±1.48ms        ? ?/sec      1.21     13.5±3.38ms        ? ?/sec
   sort utf8 low cardinality                                 1.00     15.3±5.08ms        ? ?/sec      1.10     16.9±6.20ms        ? ?/sec
   sort utf8 low cardinality preserve partitioning           1.03      8.1±2.21ms        ? ?/sec      1.00      7.8±1.75ms        ? ?/sec
   sort utf8 tuple                                           1.96     56.8±8.36ms        ? ?/sec      1.00     29.0±4.82ms        ? ?/sec
   sort utf8 tuple preserve partitioning                     1.02      6.7±0.95ms        ? ?/sec      1.00      6.5±0.46ms        ? ?/sec
   ```
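
   As a rough way to read "within error bars" for the rows above, here is a minimal sketch that checks whether two `center ± spread` intervals overlap. It assumes the `±` value is a symmetric spread around the reported time; the `Measurement` type and both helper functions are hypothetical names used only for illustration, and interval overlap is a quick heuristic rather than a proper significance test.

   ```rust
   /// One benchmark result: a central estimate and its reported spread,
   /// both in milliseconds. (Hypothetical type, for illustration only.)
   #[derive(Debug, Clone, Copy)]
   pub struct Measurement {
       pub center_ms: f64,
       pub spread_ms: f64,
   }

   impl Measurement {
       pub fn new(center_ms: f64, spread_ms: f64) -> Self {
           Measurement { center_ms, spread_ms }
       }

       /// Lower end of the interval, assuming a symmetric spread.
       fn low(&self) -> f64 {
           self.center_ms - self.spread_ms
       }

       /// Upper end of the interval, assuming a symmetric spread.
       fn high(&self) -> f64 {
           self.center_ms + self.spread_ms
       }
   }

   /// Ratio of candidate to baseline; a value above 1.0 means the
   /// candidate is slower than the baseline.
   pub fn slowdown(baseline: Measurement, candidate: Measurement) -> f64 {
       candidate.center_ms / baseline.center_ms
   }

   /// True when the two intervals share at least one point, i.e. the
   /// difference is "within error bars" under this heuristic.
   pub fn intervals_overlap(a: Measurement, b: Measurement) -> bool {
       a.low() <= b.high() && b.low() <= a.high()
   }
   ```

   Under these assumptions, `sort i64` (9.5±0.55ms on main-sort vs 10.3±0.74ms on rows-sort) gives overlapping intervals, while `sort utf8 dictionary tuple` (57.0±11.35ms vs 17.5±2.08ms) does not, which matches the reading that the regressions are small and the row-encoding wins are clear.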
   
   
   In summary, I'd like opinions on these micro-benchmark results. Ideally, we can then also run the e2e bench comparisons (#5561) on `tpch` and `parquet` to get more data on whether this change is worth merging.
   

