Posted to commits@airflow.apache.org by "Berislav Lopac (JIRA)" <ji...@apache.org> on 2018/03/16 11:48:00 UTC

[jira] [Updated] (AIRFLOW-2222) GoogleCloudStorageHook.copy fails for large files between locations

     [ https://issues.apache.org/jira/browse/AIRFLOW-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Berislav Lopac updated AIRFLOW-2222:
------------------------------------
    Description: 
When copying large files (confirmed with files of around 3 GB) between buckets in different projects, the operation fails and the Google API returns error [413 Payload Too Large|https://cloud.google.com/storage/docs/json_api/v1/status-codes#413_Payload_Too_Large]. The documentation for the error says:

{quote}The Cloud Storage JSON API supports up to 5 TB objects.

This error may, alternatively, arise if copying objects between locations and/or storage classes can not complete within 30 seconds. In this case, use the [Rewrite|https://cloud.google.com/storage/docs/json_api/v1/objects/rewrite] method instead.{quote}

The reason seems to be that {{GoogleCloudStorageHook.copy}} uses the JSON API's {{copy}} method, which is subject to the 30-second limit described above.
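For illustration, below is a minimal sketch of the failing pattern, assuming a {{google-api-python-client}} storage service like the one the hook builds (credential setup is omitted and the bucket/object names are placeholders, not the hook's exact code):

{code:python}
from googleapiclient.discovery import build

# Build the GCS JSON API client (credentials omitted for brevity).
service = build('storage', 'v1')

# objects().copy is a single blocking call; a cross-location copy of a
# large object that cannot finish within ~30 seconds fails with 413.
service.objects().copy(
    sourceBucket='source-bucket',
    sourceObject='large/file.bin',
    destinationBucket='dest-bucket',
    destinationObject='large/file.bin',
    body=''
).execute()
{code}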

h3. Proposed Solution

There are two potential solutions:

# Implement a {{GoogleCloudStorageHook.rewrite}} method that operators and other objects can call to ensure successful execution. This approach is more flexible, but it requires changes both to the {{GoogleCloudStorageHook}} class and to any other classes that use it for copying files, so that they explicitly call {{rewrite}} when needed (a sketch of such a method follows this list).
# Modify {{GoogleCloudStorageHook.copy}} to decide internally when to use {{rewrite}} instead of {{copy}}. This requires updating only the {{GoogleCloudStorageHook}} class, but the detection logic might not cover all the edge cases and might be difficult to define.
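As a starting point for option 1, here is a minimal sketch of such a {{rewrite}} method. It relies on the JSON API's {{objects().rewrite}} call, which copies in chunks and returns a {{rewriteToken}} that must be passed back until the response reports {{done}}. Parameter names follow the existing {{copy}} method, and error handling is omitted; this is an illustration, not a final implementation:

{code:python}
def rewrite(self, source_bucket, source_object,
            destination_bucket, destination_object):
    """Copy an object via the rewrite API, which supports large
    cross-location/cross-class copies by working in chunks."""
    service = self.get_conn()
    response = service.objects().rewrite(
        sourceBucket=source_bucket,
        sourceObject=source_object,
        destinationBucket=destination_bucket,
        destinationObject=destination_object,
        body='').execute()
    # Each response covers one chunk; keep calling with the returned
    # token until the rewrite reports completion.
    while not response.get('done'):
        response = service.objects().rewrite(
            sourceBucket=source_bucket,
            sourceObject=source_object,
            destinationBucket=destination_bucket,
            destinationObject=destination_object,
            rewriteToken=response['rewriteToken'],
            body='').execute()
    return response
{code}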


> GoogleCloudStorageHook.copy fails for large files between locations
> -------------------------------------------------------------------
>
>                 Key: AIRFLOW-2222
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2222
>             Project: Apache Airflow
>          Issue Type: Bug
>            Reporter: Berislav Lopac
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)