Posted to commits@airflow.apache.org by "Elad (JIRA)" <ji...@apache.org> on 2019/07/29 14:19:00 UTC

[jira] [Comment Edited] (AIRFLOW-3503) GoogleCloudStorageHook delete return success when nothing was done

    [ https://issues.apache.org/jira/browse/AIRFLOW-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895291#comment-16895291 ] 

Elad edited comment on AIRFLOW-3503 at 7/29/19 2:18 PM:
--------------------------------------------------------

I don't think this example code can ever work.
 hook.delete() deletes a single object. You can't pass a path ending in /* and expect it to delete everything under that path.
 The proper way to achieve this is something like:
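{code:python}
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook


def delete_folder(path_to_delete):
    """
    Delete every object in Google Cloud Storage whose name starts with path_to_delete.
    """
    # CONNECTION_ID and GCS_BUCKET_ID are the constants from the DAG in the issue.
    hook = GoogleCloudStorageHook(
        google_cloud_storage_conn_id=CONNECTION_ID)
    # list() returns the names of the objects that match the prefix.
    object_names = hook.list(
        bucket=GCS_BUCKET_ID,
        prefix=path_to_delete)
    # delete() removes one object per call, so loop over the listing.
    for object_name in object_names:
        hook.delete(
            bucket=GCS_BUCKET_ID,
            object=object_name)
{code}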
Maybe the best way to resolve this is to follow what delete_objects does in [S3Hook|https://github.com/apache/airflow/blob/master/airflow/hooks/S3_hook.py#L520].
 delete_objects treats keys as a single object when it is a string and as multiple objects when it is a list.

With that approach you could pass the output of list() directly as the input to delete().

I think this would simplify the process significantly.
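
A minimal sketch of that idea, not the hook's actual API: a standalone helper that accepts either one object name or a list of names and calls delete() once per name. The helper name delete_objects and its hook argument are hypothetical; the only assumption about the hook is that it exposes delete(bucket, object) as used above.

```python
def delete_objects(hook, bucket, objects):
    """Delete one object or many from a bucket (hypothetical helper).

    `hook` is assumed to expose delete(bucket=..., object=...), the way
    GoogleCloudStorageHook is called earlier in this comment.
    """
    # A plain string means a single object; anything else is treated as
    # an iterable of object names, mirroring S3Hook.delete_objects.
    if isinstance(objects, str):
        objects = [objects]
    deleted = []
    for object_name in objects:
        hook.delete(bucket=bucket, object=object_name)
        deleted.append(object_name)
    return deleted
```

With something like this, delete_objects(hook, GCS_BUCKET_ID, hook.list(bucket=GCS_BUCKET_ID, prefix=path_to_delete)) would replace the explicit loop.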


> GoogleCloudStorageHook  delete return success when nothing was done
> -------------------------------------------------------------------
>
>                 Key: AIRFLOW-3503
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3503
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: gcp
>    Affects Versions: 1.10.1
>            Reporter: lot
>            Assignee: Yohei Onishi
>            Priority: Major
>              Labels: gcp, gcs, hooks
>
> I'm loading files to BigQuery from Storage using:
>  
> {{gcs_export_uri = BQ_TABLE_NAME + '/' + EXEC_TIMESTAMP_PATH + '/*'}}
> {{gcs_to_bigquery_op = GoogleCloudStorageToBigQueryOperator(}}
> {{    dag=dag,}}
> {{    task_id='load_products_to_BigQuery',}}
> {{    bucket=GCS_BUCKET_ID,}}
> {{    destination_project_dataset_table=table_name_template,}}
> {{    source_format='NEWLINE_DELIMITED_JSON',}}
> {{    source_objects=[gcs_export_uri],}}
> {{    src_fmt_configs=\{'ignoreUnknownValues': True},}}
> {{    create_disposition='CREATE_IF_NEEDED',}}
> {{    write_disposition='WRITE_TRUNCATE',}}
> {{    skip_leading_rows=1,}}
> {{    google_cloud_storage_conn_id=CONNECTION_ID,}}
> {{    bigquery_conn_id=CONNECTION_ID)}}
>  
> After that I want to delete the files so I do:
> {{def delete_folder():}}
> {{    """}}
> {{    Delete files Google cloud storage}}
> {{    """}}
> {{    hook = GoogleCloudStorageHook(}}
> {{            google_cloud_storage_conn_id=CONNECTION_ID)}}
> {{    hook.delete(}}
> {{        bucket=GCS_BUCKET_ID,}}
> {{        object=gcs_export_uri)}}
>  
>  
> This runs with PythonOperator.
> The task is marked as Success even though nothing was deleted.
> Log:
> [2018-12-12 11:31:29,247] \{base_task_runner.py:98} INFO - Subtask: [2018-12-12 11:31:29,247] \{transport.py:151} INFO - Attempting refresh to obtain initial access_token
> [2018-12-12 11:31:29,249] \{base_task_runner.py:98} INFO - Subtask: [2018-12-12 11:31:29,249] \{client.py:795} INFO - Refreshing access_token
> [2018-12-12 11:31:29,584] \{base_task_runner.py:98} INFO - Subtask: [2018-12-12 11:31:29,583] \{python_operator.py:90} INFO - Done. Returned value was: None
>  
>  
> I expect the function to fail and return something like "file was not found" if there is nothing to delete, or to let the user decide with a specific flag whether the function should fail or succeed when no files are found.
>  
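
The behaviour requested in the description could be approximated in the task callable itself. The sketch below is only an illustration: delete_or_fail, ObjectNotFound and the fail_if_missing flag are hypothetical names, and it assumes that hook.delete() returns a falsy value when the object does not exist, which should be checked against the installed Airflow version.

```python
class ObjectNotFound(Exception):
    """Raised when there was nothing to delete (hypothetical exception)."""


def delete_or_fail(hook, bucket, object_name, fail_if_missing=True):
    """Delete one object and optionally fail when it was not there.

    Assumption: hook.delete() returns a falsy value if the object is
    missing. Verify this for the hook version in use.
    """
    deleted = bool(hook.delete(bucket=bucket, object=object_name))
    if not deleted and fail_if_missing:
        # An exception raised inside a PythonOperator callable fails the task.
        raise ObjectNotFound(
            "file was not found: {}/{}".format(bucket, object_name))
    return deleted
```

Called with fail_if_missing=False, the function returns False instead of raising, so the caller can decide how to treat a missing object.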


