Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/04 14:24:07 UTC
[GitHub] [arrow] pitrou opened a new pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3
pitrou opened a new pull request #7098:
URL: https://github.com/apache/arrow/pull/7098
By default, the AWS SDK creates an auto-growing StringStream, which entails multiple memory copies (from resizes) when transferring large data blocks. Instead, write directly into the target data area.
Low-level benchmarks with a local Minio server:
* before:
```
-----------------------------------------------------------------------------------------------------
Benchmark                                           Time           CPU  Iterations  UserCounters...
-----------------------------------------------------------------------------------------------------
MinioFixture/ReadAll500Mib/real_time        434528630 ns  431461370 ns           2  bytes_per_second=1.1237G/s items_per_second=2.30134/s
MinioFixture/ReadChunked500Mib/real_time    419380389 ns  339293384 ns           2  bytes_per_second=1.16429G/s items_per_second=2.38447/s
MinioFixture/ReadCoalesced500Mib/real_time  258812283 ns     470149 ns           3  bytes_per_second=1.88662G/s items_per_second=3.8638/s
```
* after:
```
MinioFixture/ReadAll500Mib/real_time        194620947 ns  161227337 ns           4  bytes_per_second=2.50888G/s items_per_second=5.13819/s
MinioFixture/ReadChunked500Mib/real_time    276437393 ns  183030215 ns           3  bytes_per_second=1.76634G/s items_per_second=3.61746/s
MinioFixture/ReadCoalesced500Mib/real_time   86693750 ns     448568 ns           6  bytes_per_second=5.63225G/s items_per_second=11.5349/s
```
Parquet read benchmarks from a local Minio server show speedups from 1.1x to 1.9x.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3
wesm commented on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623666418
S3 benchmarks run outside of EC2 aren't likely to be (too) useful
[GitHub] [arrow] lidavidm edited a comment on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3
lidavidm edited a comment on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623645720
Ok, I ran the benchmarks against S3 several times, but performance is wildly inconsistent. This is from an EC2 VM to S3 in the same region.
Before:
```
----------------------------------------------------------------------------------
Benchmark                                            Time            CPU  Iterations
----------------------------------------------------------------------------------
MinioFixture/ReadAll500Mib/real_time        5850095808 ns  1881569619 ns           7  85.4687MB/s  0.170937 items/s
MinioFixture/ReadChunked500Mib/real_time    7583846744 ns  1568950938 ns           6  65.9296MB/s  0.131859 items/s
MinioFixture/ReadCoalesced500Mib/real_time  5935405783 ns      592848 ns           7  84.2402MB/s  0.16848 items/s
```
After:
```
----------------------------------------------------------------------------------
Benchmark                                            Time            CPU  Iterations
----------------------------------------------------------------------------------
MinioFixture/ReadAll500Mib/real_time       10612223830 ns  2214309641 ns           6  47.1155MB/s  0.094231 items/s
MinioFixture/ReadChunked500Mib/real_time   17048801064 ns  3879733068 ns           2  29.3276MB/s  0.0586552 items/s
MinioFixture/ReadCoalesced500Mib/real_time 17039251080 ns      655276 ns           2  29.344MB/s   0.058688 items/s
----------------------------------------------------------------------------------
Benchmark                                            Time            CPU  Iterations
----------------------------------------------------------------------------------
MinioFixture/ReadAll500Mib/real_time        5867569374 ns  1152395630 ns           4  85.2142MB/s  0.170428 items/s
MinioFixture/ReadChunked500Mib/real_time    6496429473 ns  1172657713 ns           3  76.9654MB/s  0.153931 items/s
MinioFixture/ReadCoalesced500Mib/real_time  4892376030 ns      575236 ns           4  102.2MB/s    0.2044 items/s
```
I think S3 performance is too variable/not high enough for this optimization to be noticeable, at least in this context :/
[GitHub] [arrow] github-actions[bot] commented on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3
github-actions[bot] commented on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623499876
https://issues.apache.org/jira/browse/ARROW-8692
[GitHub] [arrow] fsaintjacques commented on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3
fsaintjacques commented on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623657503
Locally:
```
# Before
$ time cpp/build/conda-release/release/dataset-parquet-scan-example 's3://123:12345678@nyc-tlc/parquet?scheme=http&endpoint_override=localhost:9000'
Table size: 53747
real 0m6.917s
user 0m49.500s
sys 0m3.758s
# After
$ time cpp/build/conda-release/release/dataset-parquet-scan-example 's3://123:12345678@nyc-tlc/parquet?scheme=http&endpoint_override=localhost:9000'
Table size: 53747
real 0m6.456s
user 0m45.141s
sys 0m3.322s
```
A small improvement.
[GitHub] [arrow] pitrou commented on pull request #7098: ARROW-8692: [C++] Avoid memory copies when downloading from S3
pitrou commented on pull request #7098:
URL: https://github.com/apache/arrow/pull/7098#issuecomment-623507548
@lidavidm It would be nice if you could run the benchmarks and post numbers from your setup (perhaps on S3 too?).