Posted to dev@beam.apache.org by Kamil Wasilewski <ka...@polidea.com> on 2020/08/20 14:33:15 UTC

Is there an equivalent for --numberOfWorkerHarnessThreads in Python SDK?

Hi all,

As I stated in the title, is there an equivalent for
--numberOfWorkerHarnessThreads in the Python SDK? I've got a streaming pipeline
in Python that suffers from OutOfMemory exceptions (I'm using Dataflow).
Switching to highmem workers solved the issue, but I wonder if I can set a
limit on the number of threads used in a single worker to decrease memory
usage.

Regards,
Kamil

Re: Is there an equivalent for --numberOfWorkerHarnessThreads in Python SDK?

Posted by Reuven Lax <re...@google.com>.
Streaming Dataflow relies on high thread count for performance. Streaming
threads spend a high percentage of time blocked on IO, so in order to get
decent CPU utilization we need a lot of threads. Limiting the thread count
risks causing performance issues.
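
A toy illustration of this point (plain Python, not Beam code): when each
task mostly waits on IO, running tasks on many threads overlaps the waits
and keeps wall-clock time low, which is why capping the thread count can
hurt throughput even though each thread adds memory overhead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_task(_):
    # Simulate an IO wait (e.g. a blocked RPC); the CPU is idle here.
    time.sleep(0.05)

# Sequential: wall time is roughly 8 * 0.05 s, with the CPU mostly idle.
start = time.monotonic()
for i in range(8):
    io_bound_task(i)
sequential = time.monotonic() - start

# With 8 threads the IO waits overlap, so wall time is roughly 0.05 s.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(io_bound_task, range(8)))
threaded = time.monotonic() - start

print(threaded < sequential)  # True: more threads hide the IO waits
```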

On Fri, Aug 21, 2020 at 8:00 AM Kamil Wasilewski <
kamil.wasilewski@polidea.com> wrote:

> No, I'm not. But thanks anyway, I totally missed that option!
>
> It occurs in a simple pipeline that executes CoGroupByKey over two
> PCollections. Reading from a bounded source, 20 million and 2 million
> elements, respectively. One global window. Here's a link to the code, it's
> one of our tests:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/co_group_by_key_test.py
>
>
> On Thu, Aug 20, 2020 at 6:48 PM Luke Cwik <lc...@google.com> wrote:
>
>> +user <us...@beam.apache.org>
>>
>> On Thu, Aug 20, 2020 at 9:47 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> Are you using Dataflow runner v2[1]?
>>>
>>> If so, then you can use:
>>> --number_of_worker_harness_threads=X
>>>
>>> Do you know where/why the OOM is occurring?
>>>
>>> 1:
>>> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2
>>> 2:
>>> https://github.com/apache/beam/blob/017936f637b119f0b0c0279a226c9f92a2cf4f15/sdks/python/apache_beam/options/pipeline_options.py#L834
>>>
>>> On Thu, Aug 20, 2020 at 7:33 AM Kamil Wasilewski <
>>> kamil.wasilewski@polidea.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> As I stated in the title, is there an equivalent for
>>>> --numberOfWorkerHarnessThreads in Python SDK? I've got a streaming pipeline
>>>> in Python which suffers from OutOfMemory exceptions (I'm using Dataflow).
>>>> Switching to highmem workers solved the issue, but I wonder if I can set a
>>>> limit of threads that will be used in a single worker to decrease memory
>>>> usage.
>>>>
>>>> Regards,
>>>> Kamil
>>>>
>>>>

Re: Is there an equivalent for --numberOfWorkerHarnessThreads in Python SDK?

Posted by Kamil Wasilewski <ka...@polidea.com>.
No, I'm not. But thanks anyway, I totally missed that option!

It occurs in a simple pipeline that executes CoGroupByKey over two
PCollections, reading from a bounded source: 20 million and 2 million
elements, respectively, in one global window. Here's a link to the code; it's
one of our tests:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/co_group_by_key_test.py
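
For readers who don't want to open the test, here is a minimal plain-Python
stand-in that mimics what CoGroupByKey computes (per key, the values from
each input collected into lists); it is not the Beam API itself, and the
tiny inputs are only placeholders for the 20M/2M-element sources:

```python
from collections import defaultdict

def co_group_by_key(pc1, pc2):
    """Group two (key, value) collections into {key: ([pc1 values], [pc2 values])},
    mirroring the per-key result Beam's CoGroupByKey produces."""
    grouped = defaultdict(lambda: ([], []))
    for k, v in pc1:
        grouped[k][0].append(v)
    for k, v in pc2:
        grouped[k][1].append(v)
    return dict(grouped)

result = co_group_by_key(
    [("a", 1), ("b", 2), ("a", 3)],  # stand-in for the 20M-element input
    [("a", "x"), ("c", "y")],        # stand-in for the 2M-element input
)
print(result["a"])  # ([1, 3], ['x'])
```

Note that in one global window the per-key value lists must be materialized,
which is one plausible source of the memory pressure described above.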


On Thu, Aug 20, 2020 at 6:48 PM Luke Cwik <lc...@google.com> wrote:

> +user <us...@beam.apache.org>
>
> On Thu, Aug 20, 2020 at 9:47 AM Luke Cwik <lc...@google.com> wrote:
>
>> Are you using Dataflow runner v2[1]?
>>
>> If so, then you can use:
>> --number_of_worker_harness_threads=X
>>
>> Do you know where/why the OOM is occurring?
>>
>> 1:
>> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2
>> 2:
>> https://github.com/apache/beam/blob/017936f637b119f0b0c0279a226c9f92a2cf4f15/sdks/python/apache_beam/options/pipeline_options.py#L834
>>
>> On Thu, Aug 20, 2020 at 7:33 AM Kamil Wasilewski <
>> kamil.wasilewski@polidea.com> wrote:
>>
>>> Hi all,
>>>
>>> As I stated in the title, is there an equivalent for
>>> --numberOfWorkerHarnessThreads in Python SDK? I've got a streaming pipeline
>>> in Python which suffers from OutOfMemory exceptions (I'm using Dataflow).
>>> Switching to highmem workers solved the issue, but I wonder if I can set a
>>> limit of threads that will be used in a single worker to decrease memory
>>> usage.
>>>
>>> Regards,
>>> Kamil
>>>
>>>

Re: Is there an equivalent for --numberOfWorkerHarnessThreads in Python SDK?

Posted by Luke Cwik <lc...@google.com>.
+user <us...@beam.apache.org>

On Thu, Aug 20, 2020 at 9:47 AM Luke Cwik <lc...@google.com> wrote:

> Are you using Dataflow runner v2[1]?
>
> If so, then you can use:
> --number_of_worker_harness_threads=X
>
> Do you know where/why the OOM is occurring?
>
> 1:
> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2
> 2:
> https://github.com/apache/beam/blob/017936f637b119f0b0c0279a226c9f92a2cf4f15/sdks/python/apache_beam/options/pipeline_options.py#L834
>
> On Thu, Aug 20, 2020 at 7:33 AM Kamil Wasilewski <
> kamil.wasilewski@polidea.com> wrote:
>
>> Hi all,
>>
>> As I stated in the title, is there an equivalent for
>> --numberOfWorkerHarnessThreads in Python SDK? I've got a streaming pipeline
>> in Python which suffers from OutOfMemory exceptions (I'm using Dataflow).
>> Switching to highmem workers solved the issue, but I wonder if I can set a
>> limit of threads that will be used in a single worker to decrease memory
>> usage.
>>
>> Regards,
>> Kamil
>>
>>

Re: Is there an equivalent for --numberOfWorkerHarnessThreads in Python SDK?

Posted by Luke Cwik <lc...@google.com>.
Are you using Dataflow runner v2[1]?

If so, then you can use[2]:
--number_of_worker_harness_threads=X

Do you know where/why the OOM is occurring?

1:
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2
2:
https://github.com/apache/beam/blob/017936f637b119f0b0c0279a226c9f92a2cf4f15/sdks/python/apache_beam/options/pipeline_options.py#L834
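
For reference, a hedged sketch of how that flag might be passed from the
Python SDK. The flag name comes from the option definition linked above;
treat the runner and `use_runner_v2` experiment flags as assumptions about
how Runner v2 is enabled in your setup:

```python
# Sketch: the flag list a Beam Python pipeline could be launched with.
# Beam's PipelineOptions accepts such a list, e.g. PipelineOptions(flags).
flags = [
    "--runner=DataflowRunner",
    "--experiments=use_runner_v2",            # Runner v2 is required for this option
    "--number_of_worker_harness_threads=12",  # cap threads per worker to bound memory
]

# Pull the thread-limit value back out, just to show the flag's shape.
thread_flag = [f for f in flags if f.startswith("--number_of_worker_harness_threads")]
print(thread_flag[0].split("=")[1])  # 12
```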

On Thu, Aug 20, 2020 at 7:33 AM Kamil Wasilewski <
kamil.wasilewski@polidea.com> wrote:

> Hi all,
>
> As I stated in the title, is there an equivalent for
> --numberOfWorkerHarnessThreads in Python SDK? I've got a streaming pipeline
> in Python which suffers from OutOfMemory exceptions (I'm using Dataflow).
> Switching to highmem workers solved the issue, but I wonder if I can set a
> limit of threads that will be used in a single worker to decrease memory
> usage.
>
> Regards,
> Kamil
>
>