Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/08/19 16:36:42 UTC

[GitHub] [airflow] moreaupascal56 opened a new pull request, #25835: Use the extra_args in get_key()

moreaupascal56 opened a new pull request, #25835:
URL: https://github.com/apache/airflow/pull/25835

   closes: #25833
   
   ---
   
   The problem description is in the issue, but the main idea is to allow the use of encryption keys in the `download_file` function.
   Passing the `extra_args` through to `load()` fixes the bad request error, because `load()` ultimately calls
   ```
   response = client.head_object(
       Bucket='string',
       IfMatch='string',
       IfModifiedSince=datetime(2015, 1, 1),
       IfNoneMatch='string',
       IfUnmodifiedSince=datetime(2015, 1, 1),
       Key='string',
       Range='string',
       VersionId='string',
       SSECustomerAlgorithm='string',
       SSECustomerKey='string',
       RequestPayer='requester',
       PartNumber=123,
       ExpectedBucketOwner='string',
       ChecksumMode='ENABLED'
   )
   ```
   
   with the relevant arguments from `extra_args` (in particular the `SSECustomer*` ones).
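   
   For illustration, a minimal sketch of the intended usage (the SSE-C values are hypothetical placeholders):
   
   ```python
   from airflow.providers.amazon.aws.hooks.s3 import S3Hook

   # Hypothetical SSE-C setup: for this to work, the hook's extra_args must
   # reach the underlying head_object/download calls.
   hook = S3Hook(
       aws_conn_id="aws_default",
       extra_args={
           "SSECustomerAlgorithm": "AES256",
           "SSECustomerKey": "base64-encoded-256-bit-key",  # placeholder
       },
   )
   hook.download_file(key="path/to/object.parquet", bucket_name="my-bucket")
   ```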
   
   
   
   




[GitHub] [airflow] github-actions[bot] commented on pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #25835:
URL: https://github.com/apache/airflow/pull/25835#issuecomment-1276869329

   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.




Re: [PR] Use the extra_args in get_key() [airflow]

Posted by "ferruzzi (via GitHub)" <gi...@apache.org>.
ferruzzi commented on PR #25835:
URL: https://github.com/apache/airflow/pull/25835#issuecomment-1751473029

   This is pretty old, but if it's still an issue, how do we feel about accepting extra_args and then checking it against [boto3.s3.transfer.S3Transfer.ALLOWED_DOWNLOAD_ARGS](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.S3Transfer.ALLOWED_DOWNLOAD_ARGS) and passing only the allowed values into ExtraArgs?
   
   Based on the S3 API docs [here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/download_file.html), load() calls head_object(), and head_object() accepts all of the parameters listed in ALLOWED_DOWNLOAD_ARGS, so that seems to me like the best route without doing a full overhaul of the S3 hook (which, IMHO, is also long overdue).
   
   TL;DR:  my proposal is the same as this PR, except that we ensure that only valid keys are provided to load() by checking them against the provided API list
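   
   A minimal sketch of that filtering, under the assumption it would live in the hook (the function name is illustrative):
   
   ```python
   from boto3.s3.transfer import S3Transfer

   def filter_download_args(extra_args: dict) -> dict:
       """Keep only the keys boto3 allows as download ExtraArgs."""
       return {k: v for k, v in extra_args.items() if k in S3Transfer.ALLOWED_DOWNLOAD_ARGS}
   ```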
   
   
   
   ___
   
   A quick script that checks the Boto API and returns a dict {name: type} of all available parameters:
   
   ```python
   import boto3

   def get_boto_params(service_name, method_name):
       """Return the available (and required) parameters for a boto3 client method."""
       client_meta = boto3.client(service_name).meta

       # Map the python method name (e.g. "head_object") to the API operation
       # name (e.g. "HeadObject") and look up its input shape.
       api_method = client_meta.method_to_api_mapping[method_name]
       service_model = client_meta.service_model
       ops_model = service_model.operation_model(api_method)
       inputs = ops_model.input_shape

       return {
           'available': dict(inputs.members),
           **inputs.metadata,
       }
   ```
   `get_boto_params("s3", "head_object")` returns:
   
   ```
   {'available': {'Bucket': <StringShape(BucketName)>,
     'IfMatch': <StringShape(IfMatch)>,
     'IfModifiedSince': <Shape(IfModifiedSince)>,
     'IfNoneMatch': <StringShape(IfNoneMatch)>,
     'IfUnmodifiedSince': <Shape(IfUnmodifiedSince)>,
     'Key': <StringShape(ObjectKey)>,
     'Range': <StringShape(Range)>,
     'VersionId': <StringShape(ObjectVersionId)>,
     'SSECustomerAlgorithm': <StringShape(SSECustomerAlgorithm)>,
     'SSECustomerKey': <StringShape(SSECustomerKey)>,
     'SSECustomerKeyMD5': <StringShape(SSECustomerKeyMD5)>,
     'RequestPayer': <StringShape(RequestPayer)>,
     'PartNumber': <Shape(PartNumber)>,
     'ExpectedBucketOwner': <StringShape(AccountId)>,
     'ChecksumMode': <StringShape(ChecksumMode)>},
    'required': ['Bucket', 'Key']}
   ```
    




[GitHub] [airflow] Taragolis commented on a diff in pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
Taragolis commented on code in PR #25835:
URL: https://github.com/apache/airflow/pull/25835#discussion_r950571144


##########
airflow/providers/amazon/aws/hooks/s3.py:
##########
@@ -469,7 +469,7 @@ def get_key(self, key: str, bucket_name: Optional[str] = None) -> S3Transfer:
             verify=self.verify,
         )
         obj = s3_resource.Object(bucket_name, key)
-        obj.load()
+        obj.load(**self.extra_args)

Review Comment:
   I would suggest removing this part
   https://github.com/apache/airflow/blob/4a5e247536b2b7cb304f2f8dc5961d94278affe1/airflow/providers/amazon/aws/hooks/s3.py#L869-L877
   
   At the moment, this method first checks whether the object exists before downloading and raises an error if it does not.
   This check could cause issues in some cases.
   
   It might be better not to use `boto3.resource("s3").Object` at all and instead just try to download the object with `boto3.client("s3")`; if any error happens, we can still catch it and inform users.
   
   Example:
   
   ```python
           # Assumes these imports at module level:
           #   from pathlib import Path
           #   from tempfile import NamedTemporaryFile
           #   from botocore.exceptions import ClientError
           # Runs inside S3Hook.download_file(); local_path, bucket_name and key
           # are its arguments.
           with NamedTemporaryFile(dir=local_path, prefix='airflow_tmp_', delete=False) as local_tmp_file:
               err = None
               try:
                   self.get_conn().download_fileobj(
                       bucket_name,
                       key,
                       local_tmp_file,
                       ExtraArgs=self.extra_args,
                       Config=self.transfer_config,
                   )
               except Exception as e:
                   err = True
                   if isinstance(e, ClientError):
                       ...
                       # Do some error handling here
                   raise
               finally:
                   if err:
                       # delete=False means the context manager will not remove the
                       # partial download, so close and unlink it explicitly.
                       if not local_tmp_file.closed:
                           local_tmp_file.close()
                       Path(local_tmp_file.name).unlink()
   ```
   
   WDYT @moreaupascal56 
   
   cc: @o-nikolas @vincbeck @ferruzzi 





[GitHub] [airflow] Taragolis commented on a diff in pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
Taragolis commented on code in PR #25835:
URL: https://github.com/apache/airflow/pull/25835#discussion_r950684114


##########
airflow/providers/amazon/aws/hooks/s3.py:
##########
@@ -469,7 +469,7 @@ def get_key(self, key: str, bucket_name: Optional[str] = None) -> S3Transfer:
             verify=self.verify,
         )
         obj = s3_resource.Object(bucket_name, key)
-        obj.load()
+        obj.load(**self.extra_args)

Review Comment:
   > Just so that I fully understand your proposal, what are some cases you're thinking of?
   
   Both the general case and @moreaupascal56's case. My idea is just to use the more common Pythonic approach: **Easier to Ask Forgiveness than Permission**. 
   
   Most of the code was added about 2 years ago and most probably just migrated from < Airflow 2, and I think the current implementation does not cover all use cases.
   
   Initially the method tries to get information about the object, but it does not pass any of the `boto3.s3.transfer` Download Arguments there:  `'ChecksumMode', 'VersionId', 'SSECustomerAlgorithm', 'SSECustomerKey', 'SSECustomerKeyMD5', 'RequestPayer', 'ExpectedBucketOwner'`.
   
   If the user wants a specific _VersionId_, it actually checks the latest _VersionId_ or a delete marker (needs verification, just my thoughts).
   
   Example:
   ```python
   hook = S3Hook(aws_conn_id="aws_default", extra_args={"VersionId": "AAAAAAAA"})
   hook.download_file(key="some/awesome/key.parquet", bucket_name="awesome-bucket")
   ```
   
   So what might happen here?
   `s3_obj = self.get_key(key, bucket_name)` might return an object with Version ID = "AAAAAAAA" or "BBBBBBBB".
   
   After that we try to download the file with ExtraArgs = `{"VersionId": "AAAAAAAA"}`, and I think it is possible to get some funny error if `s3_obj` references Version ID = "BBBBBBBB".
   
   So my idea is to get rid of the `self.get_key` call and just use [S3.Client.download_fileobj()](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_fileobj) instead, which might handle passing the correct headers for customer-provided encryption and/or Version ID.
   
   > you're explicitly closing and unlinking the temp file; why not just let the context manager do so?
   
   As I understand it, the current code correctly uses `NamedTemporaryFile` to create a unique file inside the Airflow worker, and explicitly sets that the file shouldn't be deleted, because this method returns the path to the downloaded file.
   
   In my snippet I just manually close the file if any error happens (because we are still inside the context manager) and unlink it, because after an error the file is not useful anymore and we can remove it.
   Some sample errors which could leave unwanted files behind, none of them related to permissions:
   - a connection error in the middle of downloading the file
   - the user downloads a 20 GiB file into a Fargate temp directory and gets "No space left on device"
   
   If everything is alright, the context manager closes the file.





[GitHub] [airflow] o-nikolas commented on a diff in pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on code in PR #25835:
URL: https://github.com/apache/airflow/pull/25835#discussion_r951746571


##########
airflow/providers/amazon/aws/hooks/s3.py:
##########
@@ -469,7 +469,7 @@ def get_key(self, key: str, bucket_name: Optional[str] = None) -> S3Transfer:
             verify=self.verify,
         )
         obj = s3_resource.Object(bucket_name, key)
-        obj.load()
+        obj.load(**self.extra_args)

Review Comment:
   The context manager will clean up the file even in the case of an exception (see [here](https://stackoverflow.com/a/28158006/1055702)), so there is no need to do any file handling in a `finally`. This is the beauty of context managers! :smiley: 
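   
   A quick illustration of that behavior with the default `delete=True` (the snippet above uses `delete=False`, where the context manager only closes the file):
   
   ```python
   import os
   from tempfile import NamedTemporaryFile

   path = None
   try:
       with NamedTemporaryFile() as f:  # delete=True by default
           path = f.name
           raise RuntimeError("simulated download failure")
   except RuntimeError:
       pass

   print(os.path.exists(path))  # False: the file was removed on context exit
   ```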
   





[GitHub] [airflow] o-nikolas commented on pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on PR #25835:
URL: https://github.com/apache/airflow/pull/25835#issuecomment-1222741886

   > Hello all, thank you for the responses. Yes, this is about issue #25833 (I was sure I had mentioned it, my bad).
   > 
   > I have no strong opinion, but yes, I am not sure about the usefulness of the `get_key()` and `load()` calls in the case of `S3Hook.download_file()`. But if I summarize, there are 2 options:
   > 
   >     * skip the `get_key()` function
   > 
   >     * filter the `extra_args` to keep only the values compatible with `obj.load()`
   > 
   > 
   > What do you think?
   
   I think, in general, I agree with @Taragolis here. The S3 code is a bit old and unnecessarily complicated in many ways. It might be best to switch away from fetching the S3 object resource and calling load on it, and instead use the client API directly, simply using get_object (though there may be some historical reason it is the way it is now). WDYT @moreaupascal56?
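   
   For illustration, a rough sketch of that direction (bucket and key names are placeholders; real code would forward the hook's `extra_args`):
   
   ```python
   import boto3
   from botocore.exceptions import ClientError

   client = boto3.client("s3")
   try:
       # get_object accepts the download-related parameters (VersionId,
       # SSECustomer*, RequestPayer, ...) directly as keyword arguments.
       response = client.get_object(Bucket="my-bucket", Key="path/to/object")
       body = response["Body"].read()
   except ClientError:
       # Unlike head_object, the error response here carries the full error code.
       raise
   ```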




[GitHub] [airflow] Taragolis commented on pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
Taragolis commented on PR #25835:
URL: https://github.com/apache/airflow/pull/25835#issuecomment-1221299612

   BTW, I'm the person who added ExtraArgs to the `S3Hook.download_file` method and gave users false hope that it would work 🙄 




[GitHub] [airflow] o-nikolas commented on a diff in pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on code in PR #25835:
URL: https://github.com/apache/airflow/pull/25835#discussion_r950615759


##########
airflow/providers/amazon/aws/hooks/s3.py:
##########
@@ -469,7 +469,7 @@ def get_key(self, key: str, bucket_name: Optional[str] = None) -> S3Transfer:
             verify=self.verify,
         )
         obj = s3_resource.Object(bucket_name, key)
-        obj.load()
+        obj.load(**self.extra_args)

Review Comment:
   A few things are getting conflated in this thread.
   #### 1) The PR code change:
   It looks like you're just unpacking the contents of `extra_args` into the `load` call. I agree with @Taragolis that I'm not sure `obj.load` expects all the things it could receive from that dict (or if any of them would even help with encryption). Object.load ([docs](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.load)) basically just calls s3client `head_object` ([docs](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.head_object)) which is one of those weird S3 APIs:
   >A HEAD request has the same options as a GET action on an object. The response is identical to the GET response except that there is no response body. Because of this, if the HEAD request generates an error, it returns a generic 404 Not Found or 403 Forbidden code. It is not possible to retrieve the exact exception beyond these error codes.
   
   To handle encryption when using that API you have to set request headers (see again the previous doc link):
   
   >If you encrypt an object by using server-side encryption with customer-provided encryption keys (SSE-C) when you store the object in Amazon S3, then when you retrieve the metadata from the object, you must use the following headers:
   
       x-amz-server-side-encryption-customer-algorithm
       x-amz-server-side-encryption-customer-key
       x-amz-server-side-encryption-customer-key-MD5
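   
   In boto3 those headers map to the `SSECustomer*` parameters, so a direct call would look roughly like this (bucket, key and key material are placeholders; I believe boto3 fills in the key MD5 automatically if it is omitted):
   
   ```python
   import boto3

   client = boto3.client("s3")
   # boto3 translates these parameters into the
   # x-amz-server-side-encryption-customer-* request headers.
   response = client.head_object(
       Bucket="my-bucket",
       Key="path/to/object",
       SSECustomerAlgorithm="AES256",
       SSECustomerKey="base64-encoded-256-bit-key",  # placeholder
   )
   ```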
   
   #### 2) Andrey's suggested change:
   
   > At the moment, this method first checks whether the object exists before downloading and raises an error if it does not. This check could cause issues in some cases.
   
   Just so that I fully understand your proposal, what are some cases you're thinking of? Also, this change seems out of scope for what @moreaupascal56 is trying to achieve?
   
   Also, for your suggested code snippet: you're explicitly closing and unlinking the temp file; why not just let the context manager do so?





[GitHub] [airflow] o-nikolas commented on pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on PR #25835:
URL: https://github.com/apache/airflow/pull/25835#issuecomment-1221169912

   > The problem description is in the issue, but the main idea is...
   
   Can you link the issue?




[GitHub] [airflow] github-actions[bot] closed pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #25835: Use the extra_args in get_key()
URL: https://github.com/apache/airflow/pull/25835




[GitHub] [airflow] boring-cyborg[bot] commented on pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on PR #25835:
URL: https://github.com/apache/airflow/pull/25835#issuecomment-1220875288

   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, mypy and type annotations). Our [pre-commits](https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature, add useful documentation (in docstrings or in the `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst). Consider adding an example DAG that shows how users should use it.
   - Consider using the [Breeze environment](https://github.com/apache/airflow/blob/main/BREEZE.rst) for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or the final approval from Committers.
   - Please follow the [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication, including (but not limited to) comments on Pull Requests, the mailing list and Slack.
   - Be sure to read the [Airflow Coding style](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better 🚀.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://s.apache.org/airflow-slack
   




[GitHub] [airflow] Taragolis commented on pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
Taragolis commented on PR #25835:
URL: https://github.com/apache/airflow/pull/25835#issuecomment-1221280477

   I think it's this issue: https://github.com/apache/airflow/issues/25833






[GitHub] [airflow] moreaupascal56 commented on pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
moreaupascal56 commented on PR #25835:
URL: https://github.com/apache/airflow/pull/25835#issuecomment-1221974054

   Hello all, thank you for the responses. Yes, this is about issue #25833 (I was sure I had mentioned it, my bad). 
   
   I have no strong opinion, but yes, I am not sure about the usefulness of the `get_key()` and `load()` calls in the case of `S3Hook.download_file()`.
   But if I summarize, there are 2 options: 
   - skip the `get_key()` function 
   - filter the `extra_args` to keep only the values compatible with `obj.load()`
   
   What do you think?
   




[GitHub] [airflow] Taragolis commented on a diff in pull request #25835: Use the extra_args in get_key()

Posted by GitBox <gi...@apache.org>.
Taragolis commented on code in PR #25835:
URL: https://github.com/apache/airflow/pull/25835#discussion_r950511996


##########
airflow/providers/amazon/aws/hooks/s3.py:
##########
@@ -469,7 +469,7 @@ def get_key(self, key: str, bucket_name: Optional[str] = None) -> S3Transfer:
             verify=self.verify,
         )
         obj = s3_resource.Object(bucket_name, key)
-        obj.load()
+        obj.load(**self.extra_args)

Review Comment:
   Are you sure you can pass anything to the [S3.Object.load()](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.load) method?
   
   Also, please have a look at [this](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#s3-transfers); `extra_args` is expected to be used only in these S3 client methods:
   
   - [S3.Client.download_file()](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_file)
   - [S3.Client.download_fileobj()](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.download_fileobj)
   - [S3.Client.upload_file()](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file)
   - [S3.Client.upload_fileobj()](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj)
   
   And I am not sure the `extra_args` values are fully compatible with `obj.load()`.
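   
   For reference, a minimal sketch of where `ExtraArgs` is meant to go (bucket, key and version id are placeholders):
   
   ```python
   import boto3

   client = boto3.client("s3")
   # ExtraArgs is a feature of the managed transfer methods, not of the
   # low-level resource APIs such as Object.load().
   client.download_file(
       "my-bucket",
       "path/to/object",
       "/tmp/local-copy",
       ExtraArgs={"VersionId": "AAAAAAAA"},
   )
   ```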


