You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@beam.apache.org by "Beam JIRA Bot (Jira)" <ji...@apache.org> on 2021/02/08 17:16:05 UTC

[jira] [Commented] (BEAM-11051) Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns

    [ https://issues.apache.org/jira/browse/BEAM-11051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281208#comment-17281208 ] 

Beam JIRA Bot commented on BEAM-11051:
--------------------------------------

This issue is P2 but has been unassigned without any comment for 60 days so it has been labeled "stale-P2". If this issue is still affecting you, we care! Please comment and remove the label. Otherwise, in 14 days the issue will be moved to P3.

Please see https://beam.apache.org/contribute/jira-priorities/ for a detailed explanation of what these priorities mean.


> Python SDK harness's UnboundedThreadPoolExecutor performs poorly with slow DoFns
> --------------------------------------------------------------------------------
>
>                 Key: BEAM-11051
>                 URL: https://issues.apache.org/jira/browse/BEAM-11051
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-harness
>    Affects Versions: 2.18.0, 2.19.0, 2.20.0, 2.21.0, 2.22.0, 2.23.0, 2.24.0
>            Reporter: Peter Sobot
>            Priority: P2
>              Labels: stale-P2
>
> Beam jobs with slow, memory-hungry, or otherwise resource-intensive DoFn implementations perform quite poorly (or even OOM) due to the fact that an {{[UnboundedThreadPoolExecutor|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/thread_pool_executor.py#L89]}} is used to spawn workers.
> The Python SDK no longer seems to have any methods by which to control concurrent execution of user code. Resource-intensive DoFns can control their own execution by maintaining their own semaphores, but that causes input elements to effectively spool in-memory, with one thread created for every new message. If the input rate of data to a worker exceeds the worker's ability to process those messages, an unbounded number of threads will be spawned to handle incoming work.
> Versions of Beam before 2.18 allowed specifying the --worker_threads experimental flag to control concurrency more effectively, but that was [removed in November of 2019|https://github.com/apache/beam/pull/10123] by [~lukecwik@gmail.com] (see: BEAM-8151).
> One possible solution would be to re-introduce a limit on the size of the {{_SharedUnboundedThreadPoolExecutor}} to ensure that we don't create too many threads, but I'm unsure of what kind of backpressure this would create and what effect it may have on the rest of the harness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)