Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/04 14:24:07 UTC

[GitHub] [arrow] pitrou opened a new pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3

pitrou opened a new pull request #7098:
URL: https://github.com/apache/arrow/pull/7098


   The AWS SDK creates an auto-growing StringStream by default, entailing multiple memory copies (due to resizes) when transferring large data blocks.  Instead, write directly into the target data area.
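   For context, here is a minimal standalone sketch of the idea: writing through a fixed, preallocated `streambuf` instead of an auto-growing string stream. `FixedBuf` is a hypothetical name for illustration; the actual patch presumably plugs a similar buffer into the SDK's response-stream factory (`GetObjectRequest::SetResponseStreamFactory`).
   ```cpp
   #include <cassert>
   #include <cstring>
   #include <ostream>
   #include <streambuf>
   #include <string>
   #include <vector>
   
   // Illustrative only: a streambuf whose put area is a caller-owned,
   // preallocated buffer. A download of known size written through it
   // incurs no reallocation or extra copy, unlike an auto-growing
   // string stream that resizes (and copies) as data arrives.
   class FixedBuf : public std::streambuf {
    public:
     FixedBuf(char* data, std::size_t size) { setp(data, data + size); }
   };
   
   int main() {
     // Simulate a "download" of known length into a preallocated area.
     const std::string payload = "hello, arrow";
     std::vector<char> target(payload.size());
   
     FixedBuf buf(target.data(), target.size());
     std::ostream sink(&buf);  // the SDK would write the HTTP body here
     sink.write(payload.data(), static_cast<std::streamsize>(payload.size()));
   
     // The bytes landed directly in `target`, with no intermediate buffer.
     assert(std::memcmp(target.data(), payload.data(), payload.size()) == 0);
     return 0;
   }
   ```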
   
   Low-level benchmarks with a local Minio server:
   * before:
   ```
   -----------------------------------------------------------------------------------------------------
   Benchmark                                           Time             CPU   Iterations UserCounters...
   -----------------------------------------------------------------------------------------------------
   MinioFixture/ReadAll500Mib/real_time        434528630 ns    431461370 ns            2 bytes_per_second=1.1237G/s items_per_second=2.30134/s
   MinioFixture/ReadChunked500Mib/real_time    419380389 ns    339293384 ns            2 bytes_per_second=1.16429G/s items_per_second=2.38447/s
   MinioFixture/ReadCoalesced500Mib/real_time  258812283 ns       470149 ns            3 bytes_per_second=1.88662G/s items_per_second=3.8638/s
   ```
   * after:
   ```
   MinioFixture/ReadAll500Mib/real_time        194620947 ns    161227337 ns            4 bytes_per_second=2.50888G/s items_per_second=5.13819/s
   MinioFixture/ReadChunked500Mib/real_time    276437393 ns    183030215 ns            3 bytes_per_second=1.76634G/s items_per_second=3.61746/s
   MinioFixture/ReadCoalesced500Mib/real_time   86693750 ns       448568 ns            6 bytes_per_second=5.63225G/s items_per_second=11.5349/s
   ```
   
   Parquet read benchmarks from a local Minio server show speedups from 1.1x to 1.9x.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623666418


   S3 benchmarks run outside of EC2 aren't likely to be (too) useful





[GitHub] [arrow] lidavidm edited a comment on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3

Posted by GitBox <gi...@apache.org>.
lidavidm edited a comment on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623645720


   Ok, I ran the benchmarks against S3 several times, but performance is wildly inconsistent. This is from an EC2 VM to S3 in the same region.
   
   Before:
   ```
   ----------------------------------------------------------------------------------
   Benchmark                                           Time           CPU Iterations
   ----------------------------------------------------------------------------------
   MinioFixture/ReadAll500Mib/real_time       5850095808 ns 1881569619 ns          7   85.4687MB/s   0.170937 items/s
   MinioFixture/ReadChunked500Mib/real_time   7583846744 ns 1568950938 ns          6   65.9296MB/s   0.131859 items/s
   MinioFixture/ReadCoalesced500Mib/real_time 5935405783 ns     592848 ns          7   84.2402MB/s    0.16848 items/s
   ```
   
   After:
   ```
   ----------------------------------------------------------------------------------
   Benchmark                                           Time           CPU Iterations
   ----------------------------------------------------------------------------------
   MinioFixture/ReadAll500Mib/real_time       10612223830 ns 2214309641 ns          6   47.1155MB/s   0.094231 items/s
   MinioFixture/ReadChunked500Mib/real_time   17048801064 ns 3879733068 ns          2   29.3276MB/s  0.0586552 items/s
   MinioFixture/ReadCoalesced500Mib/real_time 17039251080 ns     655276 ns          2    29.344MB/s   0.058688 items/s
   
   ----------------------------------------------------------------------------------
   Benchmark                                           Time           CPU Iterations
   ----------------------------------------------------------------------------------
   MinioFixture/ReadAll500Mib/real_time       5867569374 ns 1152395630 ns          4   85.2142MB/s   0.170428 items/s
   MinioFixture/ReadChunked500Mib/real_time   6496429473 ns 1172657713 ns          3   76.9654MB/s   0.153931 items/s
   MinioFixture/ReadCoalesced500Mib/real_time 4892376030 ns     575236 ns          4     102.2MB/s     0.2044 items/s
   ```
   
   I think S3 performance is too variable/not high enough for this optimization to be noticeable, at least in this context :/








[GitHub] [arrow] github-actions[bot] commented on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623499876


   https://issues.apache.org/jira/browse/ARROW-8692





[GitHub] [arrow] fsaintjacques commented on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3

Posted by GitBox <gi...@apache.org>.
fsaintjacques commented on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623657503


   Locally:
   ```
   # Before
   $ time cpp/build/conda-release/release/dataset-parquet-scan-example 's3://123:12345678@nyc-tlc/parquet?scheme=http&endpoint_override=localhost:9000'
   Table size: 53747
   
   real    0m6.917s
   user    0m49.500s
   sys     0m3.758s
   # After
   $ time cpp/build/conda-release/release/dataset-parquet-scan-example 's3://123:12345678@nyc-tlc/parquet?scheme=http&endpoint_override=localhost:9000'
   Table size: 53747
   
   real    0m6.456s
   user    0m45.141s
   sys     0m3.322s
   ```
   
   A small improvement.





[GitHub] [arrow] pitrou commented on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623507548


   @lidavidm It would be nice if you could run the benchmarks and post numbers on your setup (perhaps on S3 too?).







