Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/05/10 17:01:12 UTC

[GitHub] [airflow] mdnawed2010 opened a new issue #8806: DataProcSparkOperator getting reattached to the previous run - as this feature is not required every time, hence we need a flag in the task itself to override it for a fresh run always

mdnawed2010 opened a new issue #8806:
URL: https://github.com/apache/airflow/issues/8806


   
   
   **Apache Airflow version**: 1.10.6 (the problem appears to affect all versions >= 1.10.4)
   
   **Environment**: Google Cloud Composer (composer-1.10.2-airflow-1.10.6)
   
   Issue: This concerns DataProcSparkOperator. As part of https://issues.apache.org/jira/browse/AIRFLOW-3211, a change was introduced that reattaches a task to a previous instance of its Dataproc job, so that when a DAG is restarted or re-triggered it does not re-run a Dataproc job that an earlier DAG run has already completed or is still running.
   
   I am quoting the comment below from the original JIRA issue, because it explains the problem well and nobody has responded to it there:
   
   
   The functionality added by this story actually broke the behavior of the dataproc hook and made a few 1.10.x releases unusable for dataproc users. The problem is that the hook only uses the task ID part of the dataproc job ID when looking for previous invocations of the job, so if dataproc history still has jobs corresponding to any of the previous dag runs, the dataproc hook doesn't execute the job.
   A proper way to implement this would be to associate dataproc jobs with particular dag runs by e.g. embedding a dag run id hash in the dataproc job id.
   In any case this functionality has to be optional. In our experience, users expect dataproc jobs to be re-executed when they re-execute the task, and this new behavior creates a lot of confusion.
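
The difference between the two matching strategies in the quoted comment can be sketched in plain Python. This is only an illustration: `build_job_id`, `match_by_task_id` and `match_by_run` are hypothetical helpers, not functions from the Airflow Dataproc hook, and the job ID layout (`<task_id>_<hash>`) is an assumption made for the example.

```python
import hashlib


def build_job_id(task_id: str, dag_id: str, run_id: str) -> str:
    """Hypothetical helper: append a short hash of the DAG run to the task ID."""
    run_hash = hashlib.sha1(f"{dag_id}/{run_id}".encode("utf-8")).hexdigest()[:8]
    return f"{task_id}_{run_hash}"


def match_by_task_id(existing_job_ids, task_id):
    """Problem case: any job whose ID starts with the task ID is treated as a match."""
    return [job_id for job_id in existing_job_ids if job_id.startswith(task_id + "_")]


def match_by_run(existing_job_ids, task_id, dag_id, run_id):
    """Suggested fix: only a job submitted for this exact DAG run is a match."""
    wanted = build_job_id(task_id, dag_id, run_id)
    return [job_id for job_id in existing_job_ids if job_id == wanted]


# Job history left behind by two earlier DAG runs of the same task.
history = [
    build_job_id("spark_task", "my_dag", "scheduled__2020-05-09"),
    build_job_id("spark_task", "my_dag", "scheduled__2020-05-10"),
]

# Matching on the task ID alone finds jobs from every previous run.
print(match_by_task_id(history, "spark_task"))
# Matching on the full ID finds nothing for a new run, so the job would be submitted.
print(match_by_run(history, "spark_task", "my_dag", "manual__2020-05-10"))
```

With run-scoped IDs, reattachment would still work for a restart within the same DAG run, while a new run would always submit a fresh job.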
   
   Old issue link: https://issues.apache.org/jira/browse/AIRFLOW-3211
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #8806: DataProcSparkOperator getting reattached to the previous run - as this feature is not required every time, hence we need a flag in the task itself to override it for a fresh run always

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #8806:
URL: https://github.com/apache/airflow/issues/8806#issuecomment-626358015


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] mdnawed2010 closed issue #8806: DataProcSparkOperator getting reattached to the previous run - as this feature is not required every time, hence we need a flag in the task itself to override it for a fresh run always

Posted by GitBox <gi...@apache.org>.
mdnawed2010 closed issue #8806:
URL: https://github.com/apache/airflow/issues/8806


   





[GitHub] [airflow] turbaszek commented on issue #8806: DataProcSparkOperator getting reattached to the previous run - as this feature is not required every time, hence we need a flag in the task itself to override it for a fresh run always

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #8806:
URL: https://github.com/apache/airflow/issues/8806#issuecomment-638839479


   It will be part of backport packages:
   https://github.com/apache/airflow/blob/master/README.md#using-hooks-and-operators-from-master-in-airflow-110
   
   They are not released yet, but should be ready in the next two weeks.





[GitHub] [airflow] turbaszek commented on issue #8806: DataProcSparkOperator getting reattached to the previous run - as this feature is not required every time, hence we need a flag in the task itself to override it for a fresh run always

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #8806:
URL: https://github.com/apache/airflow/issues/8806#issuecomment-630633955


   Hi @mdnawed2010, you are right. Here's another comment on that issue:
   https://github.com/apache/airflow/pull/6371#issuecomment-586502594
   
   The new operators accept `request_id`, which, if I'm correct, works in the same way:
   > > Doesn't request_id work like that?
   
   > If the server receives two SubmitJobRequest requests with the same id, then the second request will be ignored and the first Job created and stored in the backend is returned.
   https://googleapis.dev/python/dataproc/latest/gapic/v1/api.html#google.cloud.dataproc_v1.JobControllerClient.submit_job
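
A minimal sketch of how a deterministic `request_id` could make submission idempotent per DAG run follows. The helper `make_request_id`, its `attempt` parameter, and the `FakeJobController` class are assumptions introduced for illustration; the fake class only imitates the deduplication behaviour quoted above and is not the real `JobControllerClient`.

```python
import uuid


def make_request_id(dag_id: str, task_id: str, run_id: str, attempt: int = 1) -> str:
    """Hypothetical helper: derive a stable UUID from the DAG run identity."""
    name = f"{dag_id}/{task_id}/{run_id}/{attempt}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))


class FakeJobController:
    """Toy stand-in that mimics the documented behaviour for repeated request IDs."""

    def __init__(self):
        self._jobs_by_request_id = {}

    def submit_job(self, job: dict, request_id: str) -> dict:
        # A repeated request_id is ignored and the first stored job is returned.
        return self._jobs_by_request_id.setdefault(request_id, job)


controller = FakeJobController()
same_run = make_request_id("my_dag", "spark_task", "scheduled__2020-05-10")
new_run = make_request_id("my_dag", "spark_task", "scheduled__2020-05-11")

first = controller.submit_job({"name": "first submission"}, same_run)
repeat = controller.submit_job({"name": "restart of the same run"}, same_run)
fresh = controller.submit_job({"name": "next run"}, new_run)

print(repeat is first)  # the restart gets the original job back
print(fresh is first)   # a different run gets its own job
```

Changing the `attempt` value (or any other component of the name) yields a different ID, which is one way a task could force a fresh job when it is deliberately re-executed.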





[GitHub] [airflow] mdnawed2010 commented on issue #8806: DataProcSparkOperator getting reattached to the previous run - as this feature is not required every time, hence we need a flag in the task itself to override it for a fresh run always

Posted by GitBox <gi...@apache.org>.
mdnawed2010 commented on issue #8806:
URL: https://github.com/apache/airflow/issues/8806#issuecomment-638779490


   @turbaszek `request_id` will help. Could you please share the Airflow version in which this change was introduced?
   
   Closing this issue.





[GitHub] [airflow] mik-laj commented on issue #8806: DataProcSparkOperator getting reattached to the previous run - as this feature is not required every time, hence we need a flag in the task itself to override it for a fresh run always

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #8806:
URL: https://github.com/apache/airflow/issues/8806#issuecomment-629828037


   @turbaszek Can you look at it?

