Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/21 21:32:06 UTC

[GitHub] [arrow-rs] tustvold commented on pull request #2123: Faster parquet DictEncoder

tustvold commented on PR #2123:
URL: https://github.com/apache/arrow-rs/pull/2123#issuecomment-1191955893

   Running benchmarks with just the change to ahash shows no significant performance change. This is not entirely surprising, as the current implementation uses crc32, which is very cheap to compute (although not DoS-resistant).
   
   The change to hashbrown nets a non-trivial improvement in the benchmarks where value encoding is the major bottleneck.
   
   ```
   write_batch primitive/4096 values primitive                                                                             
                           time:   [1.5325 ms 1.5331 ms 1.5338 ms]
                           thrpt:  [115.02 MiB/s 115.07 MiB/s 115.12 MiB/s]
                    change:
                           time:   [-20.677% -20.632% -20.590%] (p = 0.00 < 0.05)
                           thrpt:  [+25.929% +25.995% +26.068%]
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   Benchmarking write_batch primitive/4096 values primitive non-null: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
   write_batch primitive/4096 values primitive non-null                                                                             
                           time:   [1.4838 ms 1.4847 ms 1.4857 ms]
                           thrpt:  [116.44 MiB/s 116.52 MiB/s 116.59 MiB/s]
                    change:
                           time:   [-12.080% -12.017% -11.954%] (p = 0.00 < 0.05)
                           thrpt:  [+13.577% +13.659% +13.739%]
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     2 (2.00%) high mild
     2 (2.00%) high severe
   write_batch primitive/4096 values bool                                                                            
                           time:   [111.01 us 111.09 us 111.19 us]
                           thrpt:  [10.224 MiB/s 10.233 MiB/s 10.240 MiB/s]
                    change:
                           time:   [-0.8794% -0.6831% -0.4488%] (p = 0.00 < 0.05)
                           thrpt:  [+0.4508% +0.6878% +0.8872%]
                           Change within noise threshold.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   write_batch primitive/4096 values bool non-null                                                                            
                           time:   [52.931 us 53.012 us 53.094 us]
                           thrpt:  [21.411 MiB/s 21.444 MiB/s 21.477 MiB/s]
                    change:
                           time:   [-2.2177% -2.1085% -1.9913%] (p = 0.00 < 0.05)
                           thrpt:  [+2.0318% +2.1539% +2.2680%]
                           Performance has improved.
   Found 15 outliers among 100 measurements (15.00%)
     5 (5.00%) high mild
     10 (10.00%) high severe
   write_batch primitive/4096 values string                                                                            
                           time:   [891.20 us 891.52 us 891.88 us]
                           thrpt:  [89.239 MiB/s 89.275 MiB/s 89.306 MiB/s]
                    change:
                           time:   [-8.4838% -8.4391% -8.3955%] (p = 0.00 < 0.05)
                           thrpt:  [+9.1650% +9.2170% +9.2703%]
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   Benchmarking write_batch primitive/4096 values string non-null: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
   write_batch primitive/4096 values string non-null                                                                             
                           time:   [1.0208 ms 1.0213 ms 1.0218 ms]
                           thrpt:  [77.889 MiB/s 77.931 MiB/s 77.970 MiB/s]
                    change:
                           time:   [+0.0730% +0.1746% +0.2545%] (p = 0.00 < 0.05)
                           thrpt:  [-0.2538% -0.1743% -0.0730%]
                           Change within noise threshold.
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   Benchmarking write_batch nested/4096 values primitive list: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.8s, enable flat sampling, or reduce sample count to 50.
   write_batch nested/4096 values primitive list                                                                             
                           time:   [1.9798 ms 2.0064 ms 2.0368 ms]
                           thrpt:  [80.409 MiB/s 81.627 MiB/s 82.725 MiB/s]
                    change:
                           time:   [+0.9435% +1.8832% +3.0013%] (p = 0.00 < 0.05)
                           thrpt:  [-2.9139% -1.8484% -0.9347%]
                           Change within noise threshold.
   Found 19 outliers among 100 measurements (19.00%)
     1 (1.00%) high mild
     18 (18.00%) high severe
   write_batch nested/4096 values primitive list non-null                                                                             
                           time:   [2.4385 ms 2.4696 ms 2.5038 ms]
                           thrpt:  [76.896 MiB/s 77.959 MiB/s 78.952 MiB/s]
                    change:
                           time:   [-0.1096% +1.1302% +2.5102%] (p = 0.10 > 0.05)
                           thrpt:  [-2.4488% -1.1176% +0.1097%]
                           No change in performance detected.
   ```
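
For readers unfamiliar with why the hash table sits on the hot path of value encoding, here is a minimal sketch of dictionary interning. It uses only the standard library's `HashMap`, not the hashbrown-based code in this PR, and the names `DictInterner`, `intern` and `put` are hypothetical, chosen for illustration only. The hasher is a type parameter so that a different `BuildHasher` (for example one from ahash) could be swapped in.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash};

/// Hypothetical, simplified dictionary encoder (not the parquet crate's type):
/// each distinct value is stored once and every input value is replaced by
/// its dictionary index.
struct DictInterner<T, S = RandomState> {
    /// Maps a value to its position in `uniques`.
    lookup: HashMap<T, u32, S>,
    /// Distinct values in first-seen order.
    uniques: Vec<T>,
    /// One dictionary index per encoded input value.
    indices: Vec<u32>,
}

impl<T: Hash + Eq + Clone, S: BuildHasher> DictInterner<T, S> {
    /// Builds an empty interner that hashes with the supplied `BuildHasher`.
    fn with_hasher(hasher: S) -> Self {
        Self {
            lookup: HashMap::with_hasher(hasher),
            uniques: Vec::new(),
            indices: Vec::new(),
        }
    }

    /// Returns the dictionary index for `value`, inserting it if unseen.
    /// Every call hashes the value once or twice, which is why the hash
    /// table dominates when encoding is the bottleneck.
    fn intern(&mut self, value: &T) -> u32 {
        if let Some(&idx) = self.lookup.get(value) {
            return idx;
        }
        let idx = self.uniques.len() as u32;
        self.uniques.push(value.clone());
        self.lookup.insert(value.clone(), idx);
        idx
    }

    /// Encodes a batch of values, appending one index per value.
    fn put(&mut self, values: &[T]) {
        for v in values {
            let idx = self.intern(v);
            self.indices.push(idx);
        }
    }
}
```

Calling `put` with `["a", "b", "a", "c", "b"]` on a fresh interner leaves three entries in `uniques` and five in `indices`. Note that this sketch stores each distinct value twice (once as a map key, once in `uniques`) and may hash a new value twice; avoiding that kind of overhead is one reason to reach for a lower-level table API, though the sketch makes no claim about how the PR itself does so.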
   
   
   
   
   

