Posted to user@flink.apache.org by Jason Liu <ja...@ucla.edu> on 2021/08/31 00:12:29 UTC

Flink performance with multiple operators reshuffling data

Hi there,

    We have a use case where we need multiple keyBy operators, each with
its own MapState and each keyed differently, in a single Flink app. This
obviously means we'll be reshuffling our data a lot.
    Our TPS is around 1-2k, with ~2 KB per event, and we use Kinesis Data
Analytics as the infrastructure (running on roughly 128 KPUs of hardware).
I'm currently in the design phase of this system and wondering whether we
can put the data through 4-5 keyed process functions, each with a different
key, and whether that can scale with a large enough Flink cluster. I don't
think we can get around this requirement much (other than by replicating
data). Alternatively, we could run multiple small Flink clusters, each
with its own keyBy, but I'm not sure if or how much that would help.
     Thanks for any potential insights!

-Jason
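
For a sense of scale, the figures in the question can be turned into a rough
estimate of how much data the keyBy shuffles would move. This is only a
back-of-envelope sketch: it assumes every stage forwards each event unchanged
(1:1, same size), that every record crosses the network at every shuffle, and
it ignores serialization overhead. The function name is made up for
illustration.

```python
# Back-of-envelope network volume for repeated keyBy shuffles.
# Figures from the post: 1-2k events/s, ~2 KB per event, 4-5 keyed stages.
# Assumption: each stage emits exactly one same-sized event per input event.

def shuffle_mb_per_sec(events_per_sec: int, event_kb: float, stages: int) -> float:
    """Total MB/s shuffled, summed over all keyBy stages (1 MB = 1000 KB)."""
    return events_per_sec * event_kb * stages / 1000.0

low = shuffle_mb_per_sec(1_000, 2.0, 4)   # 1k events/s, 4 stages
high = shuffle_mb_per_sec(2_000, 2.0, 5)  # 2k events/s, 5 stages
print(f"{low:.0f}-{high:.0f} MB/s summed over all shuffles")
```

The result is a cluster-wide total, not a per-subtask figure, and it says
nothing about state access cost or skew, which have to be measured separately.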

Re: Flink performance with multiple operators reshuffling data

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

Key-by operations scale with parallelism. Flink routes each record to a
subtask based on the hash of its key: the key is hashed into a key group,
and the key groups are divided among the parallel subtasks. So the more
parallelism you have, the faster Flink can process data, unless there is
data skew.
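
The routing described above can be simulated in a few lines. This is a
sketch, not Flink code: `stable_hash` is a stand-in for Flink's internal
hash, the key names are invented, and the max parallelism of 128 is an
assumed value. It only illustrates why uniform keys spread out as
parallelism grows while a single hot key does not.

```python
import hashlib
from collections import Counter

def stable_hash(key: str) -> int:
    """Stand-in hash (assumption: NOT the hash function Flink uses)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def subtask_for(key: str, parallelism: int, max_parallelism: int = 128) -> int:
    """Map a key to a key group, then map the key group to a subtask index."""
    key_group = stable_hash(key) % max_parallelism
    return key_group * parallelism // max_parallelism

def load_per_subtask(keys, parallelism):
    """Number of records each subtask would receive."""
    counts = Counter(subtask_for(k, parallelism) for k in keys)
    return [counts.get(i, 0) for i in range(parallelism)]

# Hypothetical workloads: 10,000 records each.
uniform = [f"user-{i}" for i in range(10_000)]
skewed = ["hot-user"] * 9_000 + [f"user-{i}" for i in range(1_000)]

for name, keys in (("uniform", uniform), ("skewed", skewed)):
    for p in (4, 8):
        print(name, "parallelism", p, "busiest subtask:", max(load_per_subtask(keys, p)))
```

With the skewed input, the busiest subtask still receives all 9,000 records
of the hot key no matter how high the parallelism is set, which is the case
where adding resources does not help.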

Re: Flink performance with multiple operators reshuffling data

Posted by JING ZHANG <be...@gmail.com>.
Hi Jason,
> In our case, the input/output ratios of these Flink operators are all
> 1 to 1, so I guess it doesn't matter that much..
Yes.
> But I think the keys we are using in general are pretty uniform.
Cool. You could run the job for a while to see whether there is data skew.
If there is, then consider how to solve it.

Best,
JING ZHANG

Jason Liu <ja...@ucla.edu> wrote on Tue, Aug 31, 2021 at 4:23 PM:

> Thanks for the help guys!
>
> Yeah, we can potentially append random strings to the keys and duplicate
> data across them to avoid skew, if necessary. But I think the keys we
> are using in general are pretty uniform.
> The idea of putting the lowest-selectivity function up front is really
> interesting, though. In our case, the input/output ratios of these Flink
> operators are all 1 to 1, so I guess it doesn't matter that much..?
> It's good to know Flink would be scalable in this situation.
>
> -Jason
>
>
>

Re: Flink performance with multiple operators reshuffling data

Posted by Jason Liu <ja...@ucla.edu>.
Thanks for the help, guys!

Yeah, we can potentially append random strings to the keys and duplicate
data across them to avoid skew, if necessary. But I think the keys we are
using in general are pretty uniform.
The idea of putting the lowest-selectivity function up front is really
interesting, though. In our case, the input/output ratios of these Flink
operators are all 1 to 1, so I guess it doesn't matter that much..?
It's good to know Flink would be scalable in this situation.

-Jason
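
The "append random strings to the keys" idea is often called key salting. A
minimal simulation is sketched below, under several assumptions: the hash and
the max parallelism of 128 are stand-ins rather than Flink internals, the
key names and the salt count of 16 are invented, and the per-key state is a
simple count that can be merged in a second stage. State that cannot be
merged this way (for example, a MapState that must see every record for a
key) would need a different approach.

```python
import hashlib
import random
from collections import Counter

def subtask_for(key, parallelism, max_parallelism=128):
    """Stand-in routing: hash -> key group -> subtask (not Flink's real hash)."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return (h % max_parallelism) * parallelism // max_parallelism

def salted(key, num_salts, rng):
    """Append a random salt so one logical key becomes num_salts physical keys."""
    return f"{key}#{rng.randrange(num_salts)}"

def busiest(keys, parallelism):
    """Record count on the most heavily loaded subtask."""
    return max(Counter(subtask_for(k, parallelism) for k in keys).values())

rng = random.Random(0)  # fixed seed so the sketch is repeatable
events = ["hot-user"] * 9_000 + [f"user-{i}" for i in range(1_000)]
salted_events = [salted(k, 16, rng) for k in events]

# Stage 1: partial counts per salted key (spread across subtasks).
partial = Counter(salted_events)
# Stage 2: strip the salt and merge partial counts by the original key.
merged = Counter()
for salted_key, n in partial.items():
    merged[salted_key.rsplit("#", 1)[0]] += n

print("busiest subtask before:", busiest(events, 8))
print("busiest subtask after: ", busiest(salted_events, 8))
```

The trade-off is an extra shuffle for the merge stage, so salting is usually
worth it only once skew has actually been observed.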

Re: Flink performance with multiple operators reshuffling data

Posted by JING ZHANG <be...@gmail.com>.
Hi Jason,
A job that reshuffles data multiple times can scale under normal
circumstances.
But we should be careful to avoid data skew, because if the input stream
is skewed, adding more resources will not help.
Besides that, if the order of the functions can be adjusted, put the keyed
process function with the lowest selectivity first. Selectivity is the
ratio of output records to input records: the lower the ratio, the lower
the selectivity.

Best,
JING ZHANG
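
The ordering advice can be checked with a small calculation. The sketch below
assumes the stages can be reordered freely without changing the result, and
uses made-up selectivities; `records_shuffled` is a hypothetical helper, not
a Flink API. Each stage shuffles every record that reaches it, then passes on
only a fraction of them.

```python
from itertools import permutations

def records_shuffled(input_rate, selectivities):
    """Total records per second shuffled across a chain of keyed stages."""
    total, rate = 0.0, float(input_rate)
    for s in selectivities:
        total += rate  # every record reaching this stage goes through its keyBy
        rate *= s      # only a fraction s continues to the next stage
    return total

stages = [1.0, 0.5, 0.1]  # hypothetical output/input ratios
best = min(permutations(stages), key=lambda order: records_shuffled(1000, order))
print("cheapest order:", best, "->", records_shuffled(1000, best))

# When every stage is 1:1, as in the question, the order makes no difference.
print("all 1:1 ->", records_shuffled(1000, [1.0, 1.0, 1.0]))
```

With these numbers the cheapest order is ascending selectivity, and when all
ratios are 1 to 1 every order shuffles the same number of records.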

