Posted to user@flink.apache.org by Tony Wei <to...@gmail.com> on 2023/03/08 05:05:30 UTC

Is there any detrimental side effect if I set the max parallelism to 32768?

Hi experts,

> Setting the maximum parallelism to a very large value can be detrimental to
> performance because some state backends have to keep internal data
> structures that scale with the number of key-groups (which are the internal
> implementation mechanism for rescalable state).
>
> Changing the maximum parallelism explicitly when recovery from original
> job will lead to state incompatibility.
>

I read the section above in the official Flink documentation [1], and I'm
wondering about the details of this side effect.

Suppose I have a Flink SQL job with large state and high parallelism,
using RocksDB as my state backend.
I would like to set the max parallelism to 32768 so that I don't have to
worry about whether the max parallelism is divisible by the parallelism
whenever I want to scale my job, because the number of key groups will
then not differ much between subtasks.

I'm wondering whether this is good practice, because based on the
official documentation it is actually not recommended.
If possible, I would like to understand the details of this side effect:
which state backends have this issue, and why?
Please give me some advice. Thanks in advance.

Best regards,
Tony Wei

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#setting-the-maximum-parallelism
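
As a quick illustration of the divisibility point above, here is a small
sketch (not from the thread). It assumes key group g is assigned to subtask
g * parallelism / maxParallelism, which is how Flink's key-group range
assignment is commonly described; treat the exact rule as an assumption and
check KeyGroupRangeAssignment in your Flink version. In a DataStream job the
max parallelism is typically set with env.setMaxParallelism(32768); for a
SQL job the pipeline.max-parallelism option is the usual knob.

public class KeyGroupSkewSketch {

    // Count how many key groups land on each subtask under the assumed rule.
    static int[] keyGroupsPerSubtask(int maxParallelism, int parallelism) {
        int[] counts = new int[parallelism];
        for (int g = 0; g < maxParallelism; g++) {
            counts[g * parallelism / maxParallelism]++; // assumed assignment rule
        }
        return counts;
    }

    public static void main(String[] args) {
        int parallelism = 12; // divides neither 128 nor 32768
        for (int maxParallelism : new int[] {128, 32768}) {
            int[] counts = keyGroupsPerSubtask(maxParallelism, parallelism);
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int c : counts) {
                min = Math.min(min, c);
                max = Math.max(max, c);
            }
            // maxParallelism=128   -> 10 vs 11 key groups per subtask (~10% skew)
            // maxParallelism=32768 -> 2730 vs 2731 key groups per subtask (negligible)
            System.out.printf("maxParallelism=%d: min=%d, max=%d key groups per subtask%n",
                    maxParallelism, min, max);
        }
    }
}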

Re: Is there any detrimental side effect if I set the max parallelism to 32768?

Posted by Tony Wei <to...@gmail.com>.
Hi Hangxiang, David,

Thank you for your replies. Your responses are very helpful.

Best regards,
Tony Wei

Re: Is there any detrimental side effect if I set the max parallelism to 32768?

Posted by David Anderson <da...@apache.org>.
I believe there is some noticeable overhead if you are using the
heap-based state backend, but with RocksDB I think the difference is
negligible.

David
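
For readers checking which backend their own job uses, here is a minimal
sketch of selecting between the two backends discussed here. It assumes the
Flink 1.13+ classes HashMapStateBackend and EmbeddedRocksDBStateBackend;
check the state backend documentation for your version.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendChoiceSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical toggle for this sketch.
        boolean useRocksDB = true;

        // The heap-based backend keeps state as objects on the JVM heap, where
        // per-key-group bookkeeping is more noticeable at a large max parallelism;
        // the RocksDB backend stores state on disk, where the key group is only a
        // small key prefix (see the rest of this thread).
        env.setStateBackend(useRocksDB
                ? new EmbeddedRocksDBStateBackend()
                : new HashMapStateBackend());

        env.setMaxParallelism(32768);
    }
}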

Re: Is there any detrimental side effect if I set the max parallelism to 32768?

Posted by Hangxiang Yu <ma...@gmail.com>.
Hi, Tony.
"be detrimental to performance" means that some extra space overhead of the
field of the key-group may influence performance.
As we know, Flink will write the key group as the prefix of the key to
speed up rescaling.
So the format will be like: key group | key len | key | ......
You could check the relationship between max parallelism and bytes of key
group as below:
------------------------------------------
max parallelism   bytes of key group
       128                    1
      32768                 2
------------------------------------------
So I think the cost will be very small if the real key length >> 2 bytes.
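
To make the space overhead concrete, here is a rough sketch of the layout
described above (key group | key | ...), not the actual Flink serializer;
the real format also carries the key length and namespace where needed. The
prefix widths below (1 byte for a max parallelism of 128, 2 bytes for 32768)
are taken from the table above.

import java.nio.charset.StandardCharsets;

public class KeyGroupPrefixSketch {

    // Prepend a big-endian key-group prefix of the given width to a serialized key.
    static byte[] composite(int keyGroup, int prefixBytes, byte[] serializedKey) {
        byte[] out = new byte[prefixBytes + serializedKey.length];
        for (int i = 0; i < prefixBytes; i++) {
            out[i] = (byte) (keyGroup >>> (8 * (prefixBytes - 1 - i)));
        }
        System.arraycopy(serializedKey, 0, out, prefixBytes, serializedKey.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] userKey = "user_4711".getBytes(StandardCharsets.UTF_8); // 9-byte key

        int withSmallMax = composite(97, 1, userKey).length;    // max parallelism 128   -> 10 bytes
        int withLargeMax = composite(12345, 2, userKey).length; // max parallelism 32768 -> 11 bytes

        // The larger max parallelism costs exactly one extra byte per stored entry,
        // which is negligible once the real key is much longer than 2 bytes.
        System.out.println(withSmallMax + " bytes vs " + withLargeMax + " bytes");
    }
}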

-- 
Best,
Hangxiang.