Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/18 18:55:00 UTC

[GitHub] [airflow] jaketf edited a comment on issue #10381: Add on_kill method to DataprocSubmitJobOperator

jaketf edited a comment on issue #10381:
URL: https://github.com/apache/airflow/issues/10381#issuecomment-675654161


   Dataproc jobs are kind of a wild wild west and may have significant side effects. From a documentation perspective, we should call out that `on_kill` simply kills the job and will not "roll back" changes in external systems (GCS, Hive Metastore, BQ, Pub/Sub, etc.) that may already have occurred. Users should handle any such scenarios in the logic of their pipelines.
   
   A few examples:
   - Even before completing, a Spark driver could make arbitrary calls that mutate data on GCS or in a database (e.g. it could write some sort of lock file that ends up being abandoned).
   - If you snipe a MapReduce job in the middle and any intermediate files were flushed to GCS, those will not get cleaned up.
   - A Hive job can contain multiple query statements (e.g. a CREATE TABLE and an INSERT INTO), so killing it between statements may leave a side effect of a new empty table in the Hive Metastore.
   - Sniping a Spark Streaming job subscribed to Pub/Sub may lead to ACKed messages whose corresponding outputs were not committed.
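   
   A minimal sketch of that split of responsibilities, assuming a hypothetical in-memory `FakeDataprocHook` and `SubmitJobSketch` (neither is the real provider API, and the `gs://bucket/locks/...` path is made up for illustration). It only shows that `on_kill` cancels the job while the abandoned lock file stays behind until pipeline logic removes it:

```python
class FakeDataprocHook:
    """Hypothetical stand-in for a Dataproc client; not the real provider API."""

    def __init__(self):
        self.jobs = {}         # job_id -> state
        self.external = set()  # simulated external side effects (e.g. GCS objects)

    def submit_job(self, job_id):
        self.jobs[job_id] = "RUNNING"
        # Simulate a driver that writes a lock file as soon as it starts.
        self.external.add(f"gs://bucket/locks/{job_id}.lock")

    def cancel_job(self, job_id):
        # Cancelling stops the job; it does not undo what the job already wrote.
        self.jobs[job_id] = "CANCELLED"


class SubmitJobSketch:
    """Hypothetical operator-like class; only illustrates the on_kill contract."""

    def __init__(self, hook, job_id):
        self.hook = hook
        self.job_id = job_id
        self.submitted = False

    def execute(self):
        self.hook.submit_job(self.job_id)
        self.submitted = True

    def on_kill(self):
        # Kill the job only; no rollback of external systems is attempted.
        if self.submitted:
            self.hook.cancel_job(self.job_id)


def cleanup_abandoned_lock(hook, job_id):
    """Pipeline-level cleanup that users would have to write themselves."""
    hook.external.discard(f"gs://bucket/locks/{job_id}.lock")


hook = FakeDataprocHook()
task = SubmitJobSketch(hook, "job-1")
task.execute()
task.on_kill()
state_after_kill = hook.jobs["job-1"]
leftovers_after_kill = set(hook.external)

cleanup_abandoned_lock(hook, "job-1")
leftovers_after_cleanup = set(hook.external)
```

   In a real DAG the cleanup step would more likely be a separate task that runs when the submit task fails or is killed, rather than something `on_kill` tries to do itself.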

