Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 18:14:33 UTC

[GitHub] [beam] damccorm opened a new issue, #20574: Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns

damccorm opened a new issue, #20574:
URL: https://github.com/apache/beam/issues/20574

   Beam jobs with slow, memory-hungry, or otherwise resource-intensive DoFn implementations perform poorly (or even run out of memory) because an [`UnboundedThreadPoolExecutor`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/thread_pool_executor.py#L89) is used to spawn workers.
   
   The Python SDK no longer seems to offer any way to control concurrent execution of user code. Resource-intensive DoFns can throttle their own execution by maintaining their own semaphores, but that causes input elements to effectively spool in memory, with one thread created for every new message. If the input rate of data to a worker exceeds the worker's ability to process those messages, an unbounded number of threads will be spawned to handle incoming work.
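
A minimal, stdlib-only sketch of the semaphore workaround described above. `ThrottledFn` and `run_elements` are hypothetical stand-ins (not Beam APIs): the class imitates a DoFn's `process` method, and the helper imitates a pool that starts up to one thread per element. The expensive section stays capped, but every waiting call still occupies a thread and holds its element, which is the spooling problem.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledFn:
    """Hypothetical stand-in for a DoFn that limits its own concurrency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.entered = 0        # calls inside process(), waiting or running
        self.running = 0        # calls inside the expensive section
        self.peak_entered = 0
        self.peak_running = 0

    def process(self, element):
        with self._lock:
            self.entered += 1
            self.peak_entered = max(self.peak_entered, self.entered)
        try:
            with self._slots:  # blocks while max_concurrent calls are active
                with self._lock:
                    self.running += 1
                    self.peak_running = max(self.peak_running, self.running)
                try:
                    time.sleep(0.01)  # placeholder for slow, heavy work
                    return element * 2
                finally:
                    with self._lock:
                        self.running -= 1
        finally:
            with self._lock:
                self.entered -= 1


def run_elements(fn, elements):
    # Allowing one worker thread per element imitates an unbounded pool.
    elements = list(elements)
    with ThreadPoolExecutor(max_workers=max(1, len(elements))) as pool:
        return list(pool.map(fn.process, elements))
```

After `run_elements(fn, range(100))`, `fn.peak_running` never exceeds the semaphore limit, while `fn.peak_entered` can approach the number of elements: the work is throttled, but the threads and buffered inputs are not.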
   
   Versions of Beam before 2.18 allowed specifying the `--worker_threads` experimental flag to control concurrency more effectively, but that was [removed in November of 2019](https://github.com/apache/beam/pull/10123) by [~lukecwik@gmail.com] (see: BEAM-8151).
   
   One possible solution would be to re-introduce a limit on the size of the `_SharedUnboundedThreadPoolExecutor` to ensure that we don't create too many threads, but I'm unsure of what kind of backpressure this would create and what effect it may have on the rest of the harness.
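
To make the idea of a size limit concrete, here is a rough stdlib-only sketch of an executor whose `submit()` blocks once a fixed number of tasks are pending. `BoundedExecutor` is a hypothetical name, and this is not how the Beam harness or `_SharedUnboundedThreadPoolExecutor` is implemented; whether blocking the submitting thread is safe inside the harness is exactly the open question raised above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedExecutor:
    """Hypothetical pool: submit() blocks when max_in_flight tasks are pending."""

    def __init__(self, max_workers, max_in_flight):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def submit(self, fn, *args, **kwargs):
        # Backpressure: the caller waits here until a pending task finishes.
        self._slots.acquire()
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the task completes, fails, or is cancelled.
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def shutdown(self, wait=True):
        self._pool.shutdown(wait=wait)
```

With `max_workers=2` and `max_in_flight=3`, at most two tasks run at once and at most one more is queued; a fourth `submit()` call stalls its caller instead of spawning another thread.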
   
   Imported from Jira [BEAM-11051](https://issues.apache.org/jira/browse/BEAM-11051). Original Jira may contain additional context.
   Reported by: psobotspotify.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] psobot commented on issue #20574: Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns

Posted by GitBox <gi...@apache.org>.
psobot commented on issue #20574:
URL: https://github.com/apache/beam/issues/20574#issuecomment-1188880452

   Hey @tvalentyn! I believe this is fixed now, and the `--number_of_worker_harness_threads` option is indeed reliable. Thanks!




[GitHub] [beam] lukecwik closed issue #20574: Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns

Posted by "lukecwik (via GitHub)" <gi...@apache.org>.
lukecwik closed issue #20574: Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns
URL: https://github.com/apache/beam/issues/20574




[GitHub] [beam] tvalentyn commented on issue #20574: Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #20574:
URL: https://github.com/apache/beam/issues/20574#issuecomment-1164307484

   @psobot is this still an issue? Which runner do you use?
   
   There is a `--number_of_worker_harness_threads` pipeline option:
   
   https://github.com/apache/beam/blob/79c067cc625b2f02d32334564e86a6480dd545ff/sdks/python/apache_beam/options/pipeline_options.py#L1027
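
For readers who want to try the option, a small sketch of passing it from Python. The flag name comes from this thread; `harness_thread_args` and `build_pipeline_options` are hypothetical helper names, and the Beam import is deferred so the first helper works without Beam installed. How strictly a given runner honors the value is not covered here.

```python
def harness_thread_args(num_threads):
    """Return the command-line flag that caps SDK harness threads."""
    if num_threads < 1:
        raise ValueError("num_threads must be a positive integer")
    return [f"--number_of_worker_harness_threads={num_threads}"]


def build_pipeline_options(num_threads, extra_args=()):
    """Build PipelineOptions with the thread cap added to extra_args."""
    # Deferred import: requires the apache_beam package to be installed.
    from apache_beam.options.pipeline_options import PipelineOptions

    return PipelineOptions(list(extra_args) + harness_thread_args(num_threads))
```

The same flag can also be appended directly to the command line used to launch the pipeline.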
   
   

