You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Shimin Yang (JIRA)" <ji...@apache.org> on 2018/11/07 16:20:00 UTC
[jira] [Comment Edited] (FLINK-10815) Rethink the rescale operation, can we do it async

    [ https://issues.apache.org/jira/browse/FLINK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678456#comment-16678456 ] 

Shimin Yang edited comment on FLINK-10815 at 11/7/18 4:19 PM:
--------------------------------------------------------------

[~till.rohrmann], I would like to hear your opinion before diving into more detail. Since I am not very sure whether this is a nice feature and is it worthy to work on this. Currently, it's more like a proposal.


was (Author: dangdangdang):
[~till.rohrmann], I would like to hear your opinion before diving into more detail. Since I am not very sure whether this is a nice feature and is it worthy to work on this.

> Rethink the rescale operation, can we do it async
> -------------------------------------------------
>
>                 Key: FLINK-10815
>                 URL: https://issues.apache.org/jira/browse/FLINK-10815
>             Project: Flink
>          Issue Type: Improvement
>          Components: ResourceManager, Scheduler
>            Reporter: Shimin Yang
>            Assignee: Shimin Yang
>            Priority: Major
>
> Currently, the rescale operation is to stop the whole job and restart it with different parrellism. But the rescale operation cost a lot and took lots of time to recover if the state size is quite big. 
> And a long-time rescale might cause other problems like latency increase and back pressure. For some circumstances like a streaming computing cloud service, users may be very sensitive to latency and resource usage. So it would be better to make the rescale a cheaper operation.
> I wonder if we could make it an async operation just like checkpoint. But how to deal with the keyed state would be a pain in the ass. Currently I just want to make some assumption to make things simpler. The asnyc rescale operation can only double the parrellism or make it half.
> In the scale up circumstance, we can copy the state to the newly created worker and change the partitioner of the upstream. The best timing might be get notified of checkpoint completed. But we still need to change the partitioner of upstream. So the upstream should buffer the result or block the computation util the state copy finished. Then make the partitioner to send differnt elements with the same key to the same downstream operator.
> In the scale down circumstance, we can merge the keyed state of two operators and also change the partitioner of upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)