Posted to issues@beam.apache.org by "Kenneth Knowles (Jira)" <ji...@apache.org> on 2022/04/21 17:39:00 UTC

[jira] [Updated] (BEAM-14284) Server-side Dataflow job idempotence

     [ https://issues.apache.org/jira/browse/BEAM-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Knowles updated BEAM-14284:
-----------------------------------
    Status: Open  (was: Triage Needed)

> Server-side Dataflow job idempotence
> ------------------------------------
>
>                 Key: BEAM-14284
>                 URL: https://issues.apache.org/jira/browse/BEAM-14284
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-dataflow
>            Reporter: tol
>            Priority: P2
>
> *Issue*: retrying a job submission may create duplicate Dataflow jobs. The Dataflow job {{name}} is guaranteed to be unique only among _active_ jobs: if a job with the same name exists but has already completed, the same {{name}} can be used again. We would like job uniqueness regardless of job status.
> The Dataflow API provides a way to ensure unique jobs through the use of {{clientRequestId}}:
> {quote}
> The client's unique identifier of the job, re-used
> across retried attempts. If this field is set, the service will ensure
> its uniqueness. The request to create a job will fail if the service has
> knowledge of a previously submitted job with the same client's ID and
> job name. The caller may use this field to ensure idempotence of job
> creation across retried attempts to create a job. By default, the field
> is empty and, in that case, the service ignores it.
> {quote}
> [https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs]
> In DataflowRunner.java, {{clientRequestId}} is set with [a randomized value|https://github.com/apache/beam/blob/v2.37.0/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1125].
> *Proposed solution*: provide the ability to pass in a {{clientRequestId}} through {{DataflowPipelineOptions}} and set it on the {{Job}} when available, otherwise default to the randomized value.
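
The fallback logic in the proposed solution can be sketched roughly as follows. This is a minimal, standalone illustration, not the actual Beam change: the class {{ClientRequestIds}} and its {{resolve}} method are hypothetical names, and the UUID used for the random fallback is an assumption chosen for brevity (the runner's current randomized value may be generated differently). In the runner, the resolved value would be the one set as {{clientRequestId}} on the {{Job}}.

```java
import java.util.UUID;

/** Hypothetical helper for illustration only; not part of Beam. */
final class ClientRequestIds {
  private ClientRequestIds() {}

  /**
   * Returns the caller-supplied clientRequestId when one is configured,
   * otherwise falls back to a freshly generated random value.
   *
   * @param configured value that would come from the pipeline options (may be null)
   */
  static String resolve(String configured) {
    if (configured != null && !configured.trim().isEmpty()) {
      // Caller-supplied ID: reusing it across retries is what lets the
      // service detect a resubmission of the same job.
      return configured;
    }
    // Assumption: a UUID stands in for the runner's randomized default.
    return UUID.randomUUID().toString();
  }

  public static void main(String[] args) {
    String fromOptions = args.length > 0 ? args[0] : null;
    System.out.println("clientRequestId = " + resolve(fromOptions));
  }
}
```

With a stable value passed in on every retry, a second create request with the same ID and job name should be rejected by the service, as described in the API documentation quoted above. With no value passed, behavior stays as it is today: each submission gets a fresh random ID.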


