You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "rtpsw (via GitHub)" <gi...@apache.org> on 2023/06/17 12:02:39 UTC

[GitHub] [arrow] rtpsw opened a new pull request, #36146: GH-36144: [C++] Improve future as-of-join algorithmic complexity

rtpsw opened a new pull request, #36146:
URL: https://github.com/apache/arrow/pull/36146

   ### What changes are included in this PR?
   
   The `MemoStore` algorithms are improved to be linear-time using updated data-structures. Also, the benchmark is enhanced to cover various tolerance values.
   
   ### Are these changes tested?
   
   Yes, by existing tests. In addition, the benchmark shows improved running time.
   
   ### Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #36146: GH-36144: [C++] Improve future as-of-join algorithmic complexity

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #36146:
URL: https://github.com/apache/arrow/pull/36146#issuecomment-1596099853

   <details>
   
   <summary>Baseline results:</summary>
   
   ```
   Running ./release/arrow-acero-asof-join-benchmark
   Run on (8 X 2304 MHz CPU s)
   CPU Caches:
     L1 Data 32 KiB (x8)
     L1 Instruction 32 KiB (x8)
     L2 Unified 256 KiB (x8)
     L3 Unified 16384 KiB (x8)
   Load Average: 0.07, 0.21, 1.47
   ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   Benchmark                                                                                                                                                                                            Time             CPU   Iterations UserCounters...
   ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   AsOfJoinOverhead/left_freq:200/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:200/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time      211922079 ns       660678 ns            3 bytes_per_second=130.671M/s input_rows_per_second=759.713k/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time      114079272 ns       460190 ns            6 bytes_per_second=122.126M/s input_rows_per_second=710.033k/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:1000/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:1000/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time     48980588 ns       271786 ns           12 bytes_per_second=115.883M/s input_rows_per_second=673.736k/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:400/left_cols:10/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:10/right_ids:500/right_batch_size:4000/tolerance:0/real_time       96047373 ns       337427 ns            6 bytes_per_second=77.5867M/s input_rows_per_second=843.334k/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time      110910426 ns       455473 ns            5 bytes_per_second=125.615M/s input_rows_per_second=730.319k/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:400/left_cols:100/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:100/right_ids:500/right_batch_size:4000/tolerance:0/real_time    230659601 ns      1459631 ns            3 bytes_per_second=285.147M/s input_rows_per_second=351.167k/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:100/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:100/right_batch_size:4000/tolerance:0/real_time        8913949 ns       199234 ns           67 bytes_per_second=312.589M/s input_rows_per_second=1.81738M/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time      116265624 ns       477030 ns            6 bytes_per_second=119.829M/s input_rows_per_second=696.681k/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:1000/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:1000/right_batch_size:4000/tolerance:0/real_time    358051162 ns       745842 ns            2 bytes_per_second=77.8213M/s input_rows_per_second=452.449k/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time      108821232 ns       559738 ns            6 bytes_per_second=128.026M/s input_rows_per_second=744.34k/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:10/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time    1064261524 ns      2046008 ns            1 bytes_per_second=71.9992M/s input_rows_per_second=418.6k/s maximum_peak_memory=83.9027M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:50/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time    5091578748 ns     10242650 ns            1 bytes_per_second=69.7752M/s input_rows_per_second=405.67k/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:1000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:1000/tolerance:0/real_time      116616300 ns      1283586 ns            6 bytes_per_second=119.469M/s input_rows_per_second=694.586k/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time      104528444 ns       547224 ns            6 bytes_per_second=133.284M/s input_rows_per_second=774.909k/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:32000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:32000/tolerance:0/real_time    153004093 ns       249713 ns            4 bytes_per_second=91.0564M/s input_rows_per_second=529.398k/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:-1500/real_time  115176504 ns       469413 ns            6 bytes_per_second=120.962M/s input_rows_per_second=703.268k/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:-500/real_time   114855370 ns       555388 ns            6 bytes_per_second=121.3M/s input_rows_per_second=705.235k/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time      117198144 ns       470146 ns            6 bytes_per_second=118.876M/s input_rows_per_second=691.137k/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:500/real_time    109962907 ns       487912 ns            7 bytes_per_second=126.697M/s input_rows_per_second=736.612k/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:1500/real_time   117599868 ns       501521 ns            6 bytes_per_second=118.47M/s input_rows_per_second=688.776k/s maximum_peak_memory=388.827M
   ```
   
   </details>
   
   <details>
   
   <summary>PR results</summary>
   
   ```
   Running ./release/arrow-acero-asof-join-benchmark
   Run on (8 X 2304 MHz CPU s)
   CPU Caches:
     L1 Data 32 KiB (x8)
     L1 Instruction 32 KiB (x8)
     L2 Unified 256 KiB (x8)
     L3 Unified 16384 KiB (x8)
   Load Average: 1.95, 6.53, 5.09
   ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   Benchmark                                                                                                                                                                                            Time             CPU   Iterations UserCounters...
   ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   AsOfJoinOverhead/left_freq:200/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:200/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time       76385205 ns       691089 ns            7 bytes_per_second=362.531M/s input_rows_per_second=2.10774M/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time       40956330 ns       460700 ns           18 bytes_per_second=340.167M/s input_rows_per_second=1.97772M/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:1000/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:1000/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time     18057196 ns       322803 ns           38 bytes_per_second=314.335M/s input_rows_per_second=1.82753M/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:400/left_cols:10/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:10/right_ids:500/right_batch_size:4000/tolerance:0/real_time       26092684 ns       344036 ns           25 bytes_per_second=285.597M/s input_rows_per_second=3.10432M/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time       39741214 ns       472783 ns           18 bytes_per_second=350.568M/s input_rows_per_second=2.03819M/s maximum_peak_memory=29.0547M
   AsOfJoinOverhead/left_freq:400/left_cols:100/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:100/right_ids:500/right_batch_size:4000/tolerance:0/real_time    150683536 ns      1454621 ns            5 bytes_per_second=436.491M/s input_rows_per_second=537.55k/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:100/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:100/right_batch_size:4000/tolerance:0/real_time        8025929 ns       208403 ns           78 bytes_per_second=347.175M/s input_rows_per_second=2.01846M/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time       39462927 ns       454937 ns           18 bytes_per_second=353.04M/s input_rows_per_second=2.05256M/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:1000/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:1000/right_batch_size:4000/tolerance:0/real_time     76068165 ns       781966 ns            8 bytes_per_second=366.303M/s input_rows_per_second=2.12967M/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time       38436892 ns       483746 ns           18 bytes_per_second=362.464M/s input_rows_per_second=2.10735M/s maximum_peak_memory=72.2598M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:10/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time     235748713 ns      2376327 ns            3 bytes_per_second=325.033M/s input_rows_per_second=1.88972M/s maximum_peak_memory=83.9027M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:50/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time    1183713083 ns     10726631 ns            1 bytes_per_second=300.128M/s input_rows_per_second=1.74493M/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:1000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:1000/tolerance:0/real_time       41843732 ns      1371985 ns           17 bytes_per_second=332.953M/s input_rows_per_second=1.93577M/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time       38562434 ns       458523 ns           18 bytes_per_second=361.284M/s input_rows_per_second=2.10049M/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:32000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:32000/tolerance:0/real_time     76335078 ns       240058 ns            8 bytes_per_second=182.511M/s input_rows_per_second=1061.11k/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:-1500/real_time   38884782 ns       461708 ns           18 bytes_per_second=358.289M/s input_rows_per_second=2.08308M/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:-500/real_time    38414866 ns       497694 ns           18 bytes_per_second=362.672M/s input_rows_per_second=2.10856M/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:0/real_time       39200118 ns       473774 ns           18 bytes_per_second=355.407M/s input_rows_per_second=2.06632M/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:500/real_time     39321300 ns       492033 ns           15 bytes_per_second=354.312M/s input_rows_per_second=2.05995M/s maximum_peak_memory=388.827M
   AsOfJoinOverhead/left_freq:400/left_cols:20/left_ids:500/left_batch_size:4000/num_right_tables:1/right_freq:400/right_cols:20/right_ids:500/right_batch_size:4000/tolerance:1500/real_time    39055427 ns       509834 ns           18 bytes_per_second=356.724M/s input_rows_per_second=2.07398M/s maximum_peak_memory=388.827M
   ```
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on pull request #36146: GH-36144: [C++] Improve future as-of-join algorithmic complexity

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on PR #36146:
URL: https://github.com/apache/arrow/pull/36146#issuecomment-1613443042

   Agree with @pitrou. (We are actually doing some exploring/work internally)
   
   @rtpsw Let us close this for now and open PR when it's ready.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #36146: GH-36144: [C++] Improve future as-of-join algorithmic complexity

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #36146:
URL: https://github.com/apache/arrow/pull/36146#issuecomment-1595724081

   cc @westonpace, @icexelloss


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36146: GH-36144: [C++] Improve future as-of-join algorithmic complexity

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36146:
URL: https://github.com/apache/arrow/pull/36146#issuecomment-1613438367

   While the perf improvements look impressive, can we prioritize making the tests more reliable?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #36146: GH-36144: [C++] Improve future as-of-join algorithmic complexity

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #36146:
URL: https://github.com/apache/arrow/pull/36146#issuecomment-1613471759

   I converted to a draft until it's picked up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org