Posted to issues@flink.apache.org by "Yangze Guo (Jira)" <ji...@apache.org> on 2020/12/22 05:03:00 UTC

[jira] [Commented] (FLINK-17295) Refactor the ExecutionAttemptID to consist of ExecutionVertexID and attemptNumber

    [ https://issues.apache.org/jira/browse/FLINK-17295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253275#comment-17253275 ] 

Yangze Guo commented on FLINK-17295:
------------------------------------

Hi there. Now that 1.12 has been released, I'd like to revive this ticket.

Originally, this ticket proposed composing the ExecutionAttemptID of (ExecutionVertexID, attemptNumber) to improve log readability. In FLINK-19264, we found that this change broke the assumption that ExecutionAttemptIDs are unique, because the VertexIDs collide across graphs with the same topology. We then decided to add the JobID to it. However, in FLINK-19805, we found that it still has some bad cases.
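
To make the collision concrete, here is a minimal sketch. The class and field names are hypothetical stand-ins, not Flink's actual classes, and it assumes that two jobs with the same topology end up with identical vertex IDs, as described above.

```java
import java.util.Objects;

// Hypothetical stand-in types for illustration; not Flink's actual classes.
final class CompositeAttemptIdSketch {

    static final class AttemptId {
        final String jobId;      // null models the first proposal, which had no JobID
        final String vertexId;   // stand-in for ExecutionVertexID
        final int attemptNumber;

        AttemptId(String jobId, String vertexId, int attemptNumber) {
            this.jobId = jobId;
            this.vertexId = vertexId;
            this.attemptNumber = attemptNumber;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof AttemptId)) {
                return false;
            }
            AttemptId other = (AttemptId) o;
            return Objects.equals(jobId, other.jobId)
                    && vertexId.equals(other.vertexId)
                    && attemptNumber == other.attemptNumber;
        }

        @Override
        public int hashCode() {
            return Objects.hash(jobId, vertexId, attemptNumber);
        }

        @Override
        public String toString() {
            // A readable form is the motivation for the composite ID.
            return (jobId == null ? "" : jobId + "_") + vertexId + "_" + attemptNumber;
        }
    }

    public static void main(String[] args) {
        // Two different jobs with the same topology: same vertex ID, same attempt number.
        AttemptId first = new AttemptId(null, "vertex0_subtask3", 0);
        AttemptId second = new AttemptId(null, "vertex0_subtask3", 0);
        System.out.println("without JobID, IDs collide: " + first.equals(second));

        // Adding the JobID separates the two jobs.
        AttemptId firstWithJob = new AttemptId("jobA", "vertex0_subtask3", 0);
        AttemptId secondWithJob = new AttemptId("jobB", "vertex0_subtask3", 0);
        System.out.println("with JobID, IDs collide: " + firstWithJob.equals(secondWithJob));
    }
}
```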

To solve the problem in FLINK-19805, we can either:
- Introduce a field that identifies the leader session, or ensure that the attempt number increases monotonically across sessions.
- Introduce a truly random element. This seems to be the safest way to prevent other rare cases as well.
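
The two options can be sketched roughly as follows. This is only an illustration under assumptions: all names are hypothetical, the bad case is modelled as the attempt number restarting at 0 after a leader change, and an in-memory counter stands in for state that would really be persisted outside the JobManager.

```java
import java.security.SecureRandom;
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; names and layout are assumptions, not Flink's implementation.
final class AttemptIdOptionsSketch {

    static final class AttemptId {
        final String jobId;
        final String vertexId;
        final long attemptNumber;
        final int randomElement; // 0 when no random element is used

        AttemptId(String jobId, String vertexId, long attemptNumber, int randomElement) {
            this.jobId = jobId;
            this.vertexId = vertexId;
            this.attemptNumber = attemptNumber;
            this.randomElement = randomElement;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof AttemptId)) {
                return false;
            }
            AttemptId other = (AttemptId) o;
            return jobId.equals(other.jobId)
                    && vertexId.equals(other.vertexId)
                    && attemptNumber == other.attemptNumber
                    && randomElement == other.randomElement;
        }

        @Override
        public int hashCode() {
            return Objects.hash(jobId, vertexId, attemptNumber, randomElement);
        }
    }

    /** In-memory stand-in for a counter that survives a leader change. */
    static final class PersistedCounter {
        private final AtomicLong next = new AtomicLong();

        long nextAttemptNumber() {
            return next.getAndIncrement();
        }
    }

    private static final SecureRandom RANDOM = new SecureRandom();

    // Option 1: attempt numbers come from the counter, so they never restart.
    static AttemptId withPersistedCounter(String jobId, String vertexId, PersistedCounter counter) {
        return new AttemptId(jobId, vertexId, counter.nextAttemptNumber(), 0);
    }

    // Option 2: keep the session-local attempt number and add a 16-bit random element.
    static AttemptId withRandomElement(String jobId, String vertexId, long attemptNumber) {
        return new AttemptId(jobId, vertexId, attemptNumber, RANDOM.nextInt(1 << 16));
    }

    public static void main(String[] args) {
        // Baseline: both leader sessions start at attempt 0, so the IDs are equal.
        AttemptId session1 = new AttemptId("job", "v0", 0, 0);
        AttemptId session2 = new AttemptId("job", "v0", 0, 0);
        System.out.println("restarting attempt number collides: " + session1.equals(session2));

        PersistedCounter counter = new PersistedCounter();
        AttemptId counted1 = withPersistedCounter("job", "v0", counter); // first session
        AttemptId counted2 = withPersistedCounter("job", "v0", counter); // second session
        System.out.println("persisted counter collides: " + counted1.equals(counted2));

        AttemptId randomized = withRandomElement("job", "v0", 0);
        System.out.println("random element: " + randomized.randomElement);
    }
}
```

Note that in this sketch two independently drawn 16-bit values still match with probability 1/65536, so the random element makes a collision unlikely rather than impossible.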

Considering the serialization overhead, an attempt counter (stored in ZK/ConfigMap) might be the better choice: adding a truly random element (16 bits) increased the TDD (TaskDeploymentDescriptor) size by ~25% in my experiment (WordCount with a parallelism of 3000). However, with a counter we can't be sure that no new bad cases will show up in the future. If the increase in TDD size is affordable, I lean towards introducing a truly random element.

WDYT?

> Refactor the ExecutionAttemptID to consist of ExecutionVertexID and attemptNumber
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-17295
>                 URL: https://issues.apache.org/jira/browse/FLINK-17295
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Yangze Guo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
>
> Make the ExecutionAttemptID consist of (ExecutionVertexID, attemptNumber).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)