
Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2019/01/03 09:38:00 UTC

[jira] [Commented] (FLINK-11256) Referencing StreamNode objects directly in StreamEdge causes the sizes of JobGraph and TDD to become unnecessarily large

    [ https://issues.apache.org/jira/browse/FLINK-11256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732809#comment-16732809 ] 

Till Rohrmann commented on FLINK-11256:
---------------------------------------

Good point [~sunhaibotb]. A quick skim over the code suggests that the {{StreamNode}} information is indeed not needed.

An alternative solution could also be to introduce a runtime {{StreamEdge}} type which only contains the required information ({{selectedNames}} and {{StreamPartitioner}}) for the runtime. This would be cleaner because it better separates concerns.
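
To make the idea concrete, here is a minimal, self-contained sketch of what such a runtime edge type might look like. It is not taken from the Flink code base: {{RuntimeEdge}}, {{GraphEdge}}, {{GraphNode}}, {{Partitioner}} and {{ModuloPartitioner}} are hypothetical stand-ins chosen for illustration, and the real {{StreamEdge}} and {{StreamPartitioner}} carry more than is shown here.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RuntimeEdgeSketch {

    /** Hypothetical stand-in for a partitioner; not Flink's StreamPartitioner. */
    interface Partitioner extends Serializable {
        int selectChannel(Object record, int numChannels);
    }

    /** Simple example partitioner that picks a channel from the record's hash. */
    static class ModuloPartitioner implements Partitioner {
        private static final long serialVersionUID = 1L;

        @Override
        public int selectChannel(Object record, int numChannels) {
            return Math.floorMod(record.hashCode(), numChannels);
        }
    }

    /** Hypothetical graph-time node; only needed while building the graph. */
    static class GraphNode {
        final int id;
        final String operatorName;

        GraphNode(int id, String operatorName) {
            this.id = id;
            this.operatorName = operatorName;
        }
    }

    /** Hypothetical graph-time edge that references both endpoint nodes directly. */
    static class GraphEdge {
        final GraphNode source;
        final GraphNode target;
        final List<String> selectedNames;
        final Partitioner partitioner;

        GraphEdge(GraphNode source, GraphNode target,
                  List<String> selectedNames, Partitioner partitioner) {
            this.source = source;
            this.target = target;
            this.selectedNames = selectedNames;
            this.partitioner = partitioner;
        }
    }

    /** Hypothetical runtime edge: keeps only what the running task would need. */
    static final class RuntimeEdge implements Serializable {
        private static final long serialVersionUID = 1L;

        final List<String> selectedNames;
        final Partitioner partitioner;

        RuntimeEdge(List<String> selectedNames, Partitioner partitioner) {
            // Defensive copy so later changes to the graph do not leak into the runtime view.
            this.selectedNames = Collections.unmodifiableList(new ArrayList<>(selectedNames));
            this.partitioner = partitioner;
        }

        /** Derives the runtime view from a graph-time edge, dropping the node references. */
        static RuntimeEdge from(GraphEdge edge) {
            return new RuntimeEdge(edge.selectedNames, edge.partitioner);
        }
    }
}
```

In this sketch only {{RuntimeEdge}} is {{Serializable}}, so the graph-time nodes cannot end up in the serialized configuration by accident.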

> Referencing StreamNode objects directly in StreamEdge causes the sizes of JobGraph and TDD to become unnecessarily large
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-11256
>                 URL: https://issues.apache.org/jira/browse/FLINK-11256
>             Project: Flink
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.7.0, 1.7.1
>            Reporter: Haibo Suen
>            Assignee: Haibo Suen
>            Priority: Major
>
> When a job graph is generated from a StreamGraph, the StreamEdge(s) of the stream graph are serialized into StreamConfig and stored in the job graph. The serialized bytes are then included in the TDD and distributed to the TMs. Because StreamEdge directly references StreamNode objects (sourceVertex and targetVertex), these objects are also written transitively when a StreamEdge is serialized, although they are not needed by the JM or the tasks. For a large topology, this makes the JobGraph/TDD much larger than it actually needs to be, and RPC timeouts become more likely when it is transmitted.
> In StreamEdge, only the ID of StreamNode should be stored to avoid this situation.
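
The effect described in the issue can be reproduced with plain Java serialization. The sketch below is an illustration under stated assumptions, not Flink code: {{Node}}, {{EdgeWithNodes}} and {{EdgeWithIds}} are hypothetical classes, and the {{payload}} array merely stands in for whatever state a real {{StreamNode}} holds.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class EdgeSizeSketch {

    /** Hypothetical node; the payload stands in for operator state, serializers, etc. */
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final byte[] payload;

        Node(int id, int payloadBytes) {
            this.id = id;
            this.payload = new byte[payloadBytes];
        }
    }

    /** Edge that references its nodes directly; serializing it writes both nodes too. */
    static class EdgeWithNodes implements Serializable {
        private static final long serialVersionUID = 1L;
        final Node source;
        final Node target;

        EdgeWithNodes(Node source, Node target) {
            this.source = source;
            this.target = target;
        }
    }

    /** Edge that stores only the node IDs, as the issue proposes. */
    static class EdgeWithIds implements Serializable {
        private static final long serialVersionUID = 1L;
        final int sourceId;
        final int targetId;

        EdgeWithIds(int sourceId, int targetId) {
            this.sourceId = sourceId;
            this.targetId = targetId;
        }
    }

    /** Returns the number of bytes Java serialization produces for the given object. */
    static int serializedSize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Node a = new Node(1, 10_000);
        Node b = new Node(2, 10_000);
        System.out.println("with nodes: " + serializedSize(new EdgeWithNodes(a, b)) + " bytes");
        System.out.println("with ids:   " + serializedSize(new EdgeWithIds(a.id, b.id)) + " bytes");
    }
}
```

With two 10,000-byte node payloads, the edge that references its nodes serializes to more than 20,000 bytes, while the ID-only edge stays small regardless of how large the nodes are.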


