Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/18 19:09:46 UTC

[GitHub] [airflow] jaketf commented on issue #10382: Add on_kill method to BigQueryInsertJobOperator

jaketf commented on issue #10382:
URL: https://github.com/apache/airflow/issues/10382#issuecomment-675661153


   I think this is straightforward for import / query / copy jobs, as they are all internal to BigQuery and committed atomically.
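
   Something like this hedged sketch is roughly what I have in mind; the class and attribute names here are made up for illustration, not the operator's real fields:

   ```python
   from google.cloud import bigquery

   class BigQueryInsertJobOperatorSketch:
       """Hedged sketch only; not the actual Airflow operator."""

       def __init__(self, project_id: str, location: str):
           self.project_id = project_id
           self.location = location
           self.job_id = None  # would be set by execute() after submitting the job

       def on_kill(self):
           # Import / query / copy jobs commit atomically inside BigQuery, so
           # cancelling the server-side job should be all the cleanup needed.
           if self.job_id:
               client = bigquery.Client(project=self.project_id)
               client.cancel_job(self.job_id, location=self.location)
   ```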
   
   There may be a corner case with extract (to GCS) jobs. I do not believe exports > 1 GB are atomic, because BigQuery writes the output as sharded GCS files. If the job is killed at just the right moment, only a portion of those sharded files would already be committed to GCS.
   Would our expected `on_kill` behavior be to clean up those files?
   If we were to rerun the same export (with the same destination URIs in the config), those files would likely just be overwritten.
   UNLESS the table has become much smaller or larger between the original (killed) try and the second try, causing the number of shards to change.
   
   For example:
   The original extract commits these files to GCS:
   shard-00-of-5
   shard-01-of-5
   [original extract job killed]
   [we delete a few partitions from the source table]
   [submit a new extract w/ same config]
   shard-00-of-3
   shard-01-of-3
   shard-02-of-3
   
   This will leave the GCS prefix looking like this, with two stale shards left over from the killed job:
   shard-00-of-3
   shard-00-of-5
   shard-01-of-3
   shard-01-of-5
   shard-02-of-3
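
   If the answer is that `on_kill` (or a pre-retry step) should clean this up, a sketch could look like the following. This assumes the destination URIs all live under a single GCS prefix; the function, bucket, and prefix names are illustrative, not an existing API:

   ```python
   from google.cloud import storage

   def cleanup_extract_shards(bucket_name: str, prefix: str) -> None:
       """Delete every object under the export prefix so stale shards from a
       killed extract cannot mix with the output of a later retry."""
       client = storage.Client()
       for blob in client.list_blobs(bucket_name, prefix=prefix):
           blob.delete()

   # Illustrative call; bucket and prefix are made up:
   # cleanup_extract_shards("my-bucket", "exports/my_table/shard-")
   ```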
   

