Posted to issues@flink.apache.org by "Ufuk Celebi (JIRA)" <ji...@apache.org> on 2018/09/25 18:25:00 UTC

[jira] [Comment Edited] (FLINK-10292) Generate JobGraph in StandaloneJobClusterEntrypoint only once

    [ https://issues.apache.org/jira/browse/FLINK-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626190#comment-16626190 ] 

Ufuk Celebi edited comment on FLINK-10292 at 9/25/18 6:24 PM:
--------------------------------------------------------------

I understand that non-determinism may be an issue when generating the {{JobGraph}}, but do we have some data about how common that is for applications? Would it be possible to keep a fixed JobGraph in the image instead of persisting one in the {{SubmittedJobGraphStore}}?

I like our current approach because, for image-based deployments such as Kubernetes, it keeps the source of truth *in* the image instead of the {{SubmittedJobGraphStore}}. I'm wondering about the following scenario in particular (it is independent of whether the job runs on Kubernetes and can be reproduced in other ways as well):
 * A user creates a job cluster with high availability enabled (cluster ID for the logical application, e.g. myapp)
 ** This will persist the job with a fixed ID (after FLINK-10291) on first submission
 * The user kills the application *without* cancelling
 ** This will leave all data in the high availability store(s) such as job graphs or checkpoints
 * The user updates the image with a modified application and keeps the high availability configuration (e.g. cluster ID stays myapp)
 ** This will result in the job in the image being ignored, since we already have a job graph with the same (fixed) ID

I think in such a scenario it can be desirable to still have the checkpoints available, but it might be problematic if the job graph is recovered from the {{SubmittedJobGraphStore}} instead of using the job that is part of the image. What do you think about this scenario? Is it the responsibility of the user to handle this? If so, I think that the approach outlined in this ticket makes sense. If not, we may want to consider alternatives or ignore potential non-determinism.
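
To make the scenario above concrete, here is a minimal toy model of it. It is only a sketch and does not use any Flink API: {{HaStore}}, {{start_job_cluster}}, {{run_scenario}}, the {{generate_once}} flag and the ID value are hypothetical names made up for illustration. The flag stands for the behaviour proposed in this ticket (recover a persisted job graph if one exists) versus the current behaviour (always build the job graph from the image).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical fixed job ID, standing in for the fixed ID from FLINK-10291.
FIXED_JOB_ID = "fixed-job-id"


@dataclass(frozen=True)
class JobGraph:
    job_id: str
    description: str  # which application version produced this graph


@dataclass
class HaStore:
    """Toy stand-in for the HA data kept under one cluster ID (e.g. myapp)."""
    job_graphs: Dict[str, JobGraph] = field(default_factory=dict)
    checkpoints: Dict[str, List[str]] = field(default_factory=dict)


def start_job_cluster(store: HaStore,
                      build_from_image: Callable[[], JobGraph],
                      generate_once: bool) -> JobGraph:
    """Return the job graph that the (re)started cluster would run."""
    if generate_once and FIXED_JOB_ID in store.job_graphs:
        # Proposed behaviour: a persisted graph wins over the image.
        return store.job_graphs[FIXED_JOB_ID]
    # Current behaviour (or first start): build from the user code in the image.
    graph = build_from_image()
    store.job_graphs[FIXED_JOB_ID] = graph
    return graph


def run_scenario(generate_once: bool) -> Tuple[str, List[str]]:
    store = HaStore()
    # 1. First submission from the original image.
    start_job_cluster(store, lambda: JobGraph(FIXED_JOB_ID, "v1"), generate_once)
    store.checkpoints.setdefault(FIXED_JOB_ID, []).append("chk-1")
    # 2. Application is killed *without* cancelling: the store is left as-is.
    # 3. Image is updated, HA configuration (cluster ID) stays the same.
    graph = start_job_cluster(store, lambda: JobGraph(FIXED_JOB_ID, "v2"), generate_once)
    return graph.description, store.checkpoints[FIXED_JOB_ID]


if __name__ == "__main__":
    print("generate once:     ", run_scenario(generate_once=True))
    print("generate each time:", run_scenario(generate_once=False))
```

In this model the checkpoints survive the restart in both cases. The difference is only which job graph runs after the image update: the persisted one from the original image, or the one from the updated image.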


was (Author: uce):
I understand that non-determinism may be an issue when generating the {{JobGraph}}, but do we have some data about how common that is for applications? Would it be possible to keep a fixed JobGraph in the image instead of persisting one in the {{SubmittedJobGraphStore}}?

I like our current approach, because it keeps the source of truth for the job in the image instead of the {{SubmittedJobGraphStore}}. I'm wondering about the following scenario:
 * A user creates a job cluster with high availability enabled (cluster ID for the logical application, e.g. myapp)
 ** This will persist the job with a fixed ID (after FLINK-10291) on first submission
 * The user kills the application *without* cancelling
 ** This will leave all data in the high availability store(s) such as job graphs or checkpoints
 * The user updates the image with a modified application and keeps the high availability configuration (e.g. cluster ID stays myapp)
 ** This will result in the job in the image being ignored, since we already have a job graph with the same (fixed) ID

I think in such a scenario it can be desirable to still have the checkpoints available, but it might be problematic if the job graph is recovered from the {{SubmittedJobGraphStore}} instead of using the job that is part of the image. What do you think about this scenario? Is it the responsibility of the user to handle this? If so, I think that the approach outlined in this ticket makes sense. If not, we may want to consider alternatives or ignore potential non-determinism.

> Generate JobGraph in StandaloneJobClusterEntrypoint only once
> -------------------------------------------------------------
>
>                 Key: FLINK-10292
>                 URL: https://issues.apache.org/jira/browse/FLINK-10292
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Assignee: vinoyang
>            Priority: Major
>             Fix For: 1.7.0, 1.6.2
>
>
> Currently the {{StandaloneJobClusterEntrypoint}} generates the {{JobGraph}} from the given user code every time it starts or is restarted. This can be problematic if the {{JobGraph}} generation has side effects. Therefore, it would be better to generate the {{JobGraph}} only once and store it in HA storage, from where it can be retrieved on subsequent starts.


