You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "Tom-Newton (via GitHub)" <gi...@apache.org> on 2024/02/11 13:31:44 UTC

[I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

Tom-Newton opened a new issue, #40035:
URL: https://github.com/apache/arrow/issues/40035

### Describe the enhancement requested

Optimisation to https://github.com/apache/arrow/issues/37511
Child of https://github.com/apache/arrow/issues/18014

When reading from Azure blob storage the bandwidth we get per connection is very dependant on the latency to the filesystem. To achieve good bandwidth with high latency far greater concurrency is needed. For example this is relevant when reading from blob storage in a different region to your compute.

As an example lets consider reading a parquet file. There are 2 levels of parallelism that I'm aware of when using Arrow and the native `AzureFileSystem`:
1. Arrow will make concurrent calls to `ReadAt` for each column and row group combination. At most we can have one concurrent connection per column and row group combination, so for small parquet files this may be less than we would like.
2. Within `ReadAt` the `AzureFileSystem` calls `BlobClient::DownloadTo` which implements some extra concurrency internally https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516. Purpose of this issue is to make the [config options for this parallelism](https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/inc/azure/storage/blobs/blob_options.hpp#L691-L709) configurable by the user.

### Component(s)

C++

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

Posted by "Tom-Newton (via GitHub)" <gi...@apache.org>.

Tom-Newton commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-1939677286

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.

felipecrv commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-1942073463

> These are all good suggestions but they are a lot more complex. Personally I would not be comfortable committing to implement something like that.

Start exposing `read_file_max_concurrency` with a generous value and hardcode values for `initial_chunk_size` and `chunk_size` inside `azurefs.cc` to your liking, then open a separate issue that I can work on.

My goal is to avoid prematurely exposing hard-to-tweak settings that (1) are difficult to tweak in a well-informed way, and (2) preventing future optimizations on our side based on real-world workloads.

The default settings of the SDK seem to be very conservative regarding parallelization because most SDK users arelikely making full-blob downloads inside a system that already manages multiple I/O threads -- that explains their huge threshold for parallelization (`>256MB`). We know Arrow workloads are probably small to medium `ReadAt` calls to the same file on a single thread (a Jupyter notebook driving the calls) so we benefit from the parallelism offered by the SDK: lower `initial_chunk_size` and `chunk_size` with a high `concurrency`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.

felipecrv commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-1941785546

I think exposing all these settings in `AzureOptions` can be premature. They are per-request settings, so allowing a config in `AzureOptions` would force the internal implementation to stick to one set of values of every `DownloadTo` request.

My suggestion: we keep statistics about the `ReadAt` calls in a file handle and adjust the options as the calls come in. After this exercise we might expose settings in `AzureOptions` that expresses which policy should be used (assuming we can't come up with a good adaptive policy). What the policies would be called depends on which workloads we can isolate in the benchmarks: latency vs throughput, sequential vs random.

An alternative to the named policies can be: we expose only `read_file_max_concurrency` [1], and we can keep learning a simple model that sets the best `initial_chunk_size` and `chunk_size` parameters adaptively. The goal becomes minimizing the latency of each individual `ReadAt` call and the user can set a low max concurrency factor for latency-optimized workloads and a high concurrency factor for throughput-optimized. A low concurrency factor would also be useful when the user of the `FileSystem` interface is managing multiple threads themselves.

[1] set it to some multiple of CPU cores by default

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

Posted by "Tom-Newton (via GitHub)" <gi...@apache.org>.

Tom-Newton commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-1939674490

   I can measure a clear performance advantage from tuning these parameters a bit. 
   Defaults: median=9.0 seconds P95=12.5 seconds
   
   Setting
   ```
           initial_chunk_size=8*1024*1024,
           chunk_size=8*1024 * 1024,
           concurrency=5
   ```
   median=5.6 seconds P95=6.7 seconds
   
   Personally I think this is enough to motivate exposing these config options. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

Posted by "Tom-Newton (via GitHub)" <gi...@apache.org>.

Tom-Newton commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-1942014499

   These are all good suggestions but they are a lot more complex. Personally I would not be comfortable committing to implement something like that. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

Posted by "Tom-Newton (via GitHub)" <gi...@apache.org>.

Tom-Newton commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-2008971194

   I unassigned myself because I don't really know when I can work on this, but I might pick it up again at a later date. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

Posted by "Tom-Newton (via GitHub)" <gi...@apache.org>.

Tom-Newton commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-1937824890

   I'm going to run some performance tests for my usecase. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK [arrow]

Posted by "Tom-Newton (via GitHub)" <gi...@apache.org>.

Tom-Newton commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-1939742567

   Given how small this change is I think I will make one PR for C++ and Python. Therefore I will wait for https://github.com/apache/arrow/pull/40021 to merge first. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org