Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/02/10 19:48:30 UTC

[GitHub] [arrow] westonpace commented on issue #34118: Allow configuration of size of AWS event loop thread pool

westonpace commented on issue #34118:
URL: https://github.com/apache/arrow/issues/34118#issuecomment-1426262379

   > When calling DoInitializeS3, arrow initialises the AWS API, which by default creates a thread pool for the background AWS event loop that uses one thread per physical core on the system.
   
   I thought the default behavior was for AWS to [not use a pool at all and spin up a brand new detached thread per-request](https://aws.amazon.com/blogs/developer/using-a-thread-pool-with-the-aws-sdk-for-c/) but that article is pretty old so maybe it is no longer the behavior.
   
   Furthermore, the [docs](https://awslabs.github.io/aws-crt-cpp/class_aws_1_1_crt_1_1_io_1_1_event_loop_group.html) state "which will create one for each processor on the machine."  Perhaps that is a typo on their part, but unless you have a multi-CPU machine (e.g. NUMA) I would expect this to mean a single thread (and it would be weird if their default went against their recommendations).  That said, the linked issue does show a lot of threads and, after further debugging, it does seem to be one thread per physical core on my system.
   
   > This is rather unfriendly when running a multi-process or otherwise parallelised process on a multicore box since it leads to oversubscription.
   
   I wouldn't be terribly worried about this.  I expect these threads will spend the majority of their time blocked and not scheduled by the OS.  I agree there is some minor cost to having more threads than you need, but it is not the more significant cost you pay for oversubscribing CPU-bound threads, which leads to an excess of context switches.
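   
   As a rough illustration of the point about blocked threads (not a benchmark of the AWS event loop itself), here is a minimal Python sketch using only the standard library.  The helper name `cpu_seconds_while_blocked` is made up for this example.  It parks a number of threads on an event and reports how much process CPU time was consumed while they waited.

```python
import threading
import time


def cpu_seconds_while_blocked(num_threads: int, wall_seconds: float) -> float:
    """Return the process CPU time used while `num_threads` threads sit blocked.

    Hypothetical helper for illustration only; it does not touch Arrow or AWS.
    """
    stop = threading.Event()

    # Each thread blocks in Event.wait() until the event is set,
    # so the OS does not need to schedule it in the meantime.
    threads = [threading.Thread(target=stop.wait) for _ in range(num_threads)]
    for t in threads:
        t.start()

    # Measure CPU time (not wall time) over the waiting period.
    start_cpu = time.process_time()
    time.sleep(wall_seconds)
    used = time.process_time() - start_cpu

    # Release the threads and wait for them to exit.
    stop.set()
    for t in threads:
        t.join()
    return used


if __name__ == "__main__":
    used = cpu_seconds_while_blocked(num_threads=64, wall_seconds=0.5)
    print(f"CPU seconds used while 64 threads were blocked: {used:.4f}")
```

   On a typical system I would expect the reported CPU time to be a small fraction of the wall-clock wait, which is the sense in which idle event loop threads are cheap compared to oversubscribed compute threads.  Memory for the thread stacks is a separate cost that this sketch does not measure.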
   
   > It would be nice if there were a way to control the size of this thread pool
   
   Agreed.  There is already `arrow::fs::S3GlobalOptions`, so we have some precedent.  I don't know if there are Python bindings for it, and it seems we would need to add an "event loop thread pool count" option to the mix.
   
   > I think the following diff is kind of a sketch in this direction, although it just unilaterally sets the size of the thread pool available to a single thread.
   
   It sounds like it would be a good idea in general to change the default to 1 anyway, though this could use some benchmarking.
   
   > Aside: AFAICT there's no programmatic way to control arrow's thread pool size, it must be done via environment variables, which is also rather unfriendly
   
   Do you want to open a separate issue for this?  Seems like a reasonable request. 

