Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/02/03 14:09:17 UTC

[GitHub] [airflow] rafalh opened a new pull request #21295: Use temporary file in GCSToS3Operator

rafalh opened a new pull request #21295:
URL: https://github.com/apache/airflow/pull/21295


   Use temporary file in GCSToS3Operator instead of keeping copied file content in the process memory. It allows copying big files on machines with small RAM size.
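
A minimal sketch of the idea described above, not the code from this PR. The `download` and `upload` callables and the `local_bucket` helper are hypothetical stand-ins for the GCS and S3 hook calls, backed here by local directories so the example runs on its own (POSIX assumed, since the temporary file is reopened by name while still open):

```python
import shutil
from pathlib import Path
from tempfile import NamedTemporaryFile


def copy_via_temp_file(download, upload, object_name, dest_key):
    """Stage one object on local disk, then upload it from there.

    `download(object_name, filename)` and `upload(filename, key)` are
    hypothetical stand-ins for the hook calls; they are not Airflow APIs.
    """
    # The temporary file is deleted automatically when the block exits,
    # so the object's content never has to fit in process memory.
    with NamedTemporaryFile() as local_tmp_file:
        download(object_name, local_tmp_file.name)
        upload(local_tmp_file.name, dest_key)
    return dest_key


def local_bucket(root):
    """Build download/upload callables backed by a local directory (demo only)."""
    root = Path(root)

    def download(object_name, filename):
        shutil.copyfile(root / object_name, filename)

    def upload(filename, key):
        shutil.copyfile(filename, root / key)

    return download, upload
```

Memory use stays roughly constant regardless of object size because `shutil.copyfile` moves data in bounded chunks; the cost is temporary disk space equal to the size of the largest object.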


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1029026964


   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, mypy and type annotations). Our [pre-commits](https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst). Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/main/BREEZE.rst) for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
   - Be sure to read the [Airflow Coding style](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better 🚀.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://s.apache.org/airflow-slack
   





[GitHub] [airflow] potiuk commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1030250710


   Some tests are failing (related) though





[GitHub] [airflow] potiuk commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1031189447


   Any PRs for that are most welcome :) 





[GitHub] [airflow] potiuk commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1030915449


   > maybe add an option to the operator, so user have the choice
   > 
   > ```python
   > in_memory:bool = True
   > ```
   
   Is there any drawback to not having it? I believe it is highly unlikely to have less disk than memory.





[GitHub] [airflow] potiuk closed pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
potiuk closed pull request #21295:
URL: https://github.com/apache/airflow/pull/21295


   





[GitHub] [airflow] potiuk commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1031189121


   > But for this operator, since it could be about transferring big files, streaming from GCS to S3 with a multipart upload would be great.
   
   Very much so, but it wasn't doing that: it was reading the whole file into memory and pushing it.





[GitHub] [airflow] mik-laj commented on a change in pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#discussion_r798633735



##########
File path: airflow/providers/amazon/aws/transfers/gcs_to_s3.py
##########
@@ -164,14 +165,18 @@ def execute(self, context: 'Context') -> List[str]:
         if files:
 
             for file in files:
-                file_bytes = hook.download(object_name=file, bucket_name=self.bucket)
-
-                dest_key = self.dest_s3_key + file
-                self.log.info("Saving file to %s", dest_key)
-
-                s3_hook.load_bytes(
-                    file_bytes, key=dest_key, replace=self.replace, acl_policy=self.s3_acl_policy
-                )
+                with NamedTemporaryFile() as local_tmp_file:
+                    hook.download(object_name=file, bucket_name=self.bucket, filename=local_tmp_file.name)

Review comment:
       What do you think about `GCSHook.provide_file`?
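
For readers unfamiliar with it: as far as I know, `GCSHook.provide_file` is a context manager that downloads an object into a temporary file and yields the open file handle. The sketch below only illustrates that pattern with the standard library; it is not Airflow's implementation, and the `download` callable is a hypothetical stand-in for the hook's download call:

```python
import os
from contextlib import contextmanager
from tempfile import NamedTemporaryFile


@contextmanager
def provide_file(download, object_name):
    """Yield a temporary file holding the object; remove it on exit.

    `download(object_name, filename)` is a hypothetical stand-in that
    writes the object's content to the given local path.
    """
    # Keep the object's base name as a suffix so the temporary file is
    # recognisable in listings and keeps its extension.
    with NamedTemporaryFile(suffix=os.path.basename(object_name)) as tmp_file:
        download(object_name, tmp_file.name)
        yield tmp_file
```

The caller then only deals with `with provide_file(...) as f:` and passes `f.name` to the upload call, which keeps the temporary-file handling out of the operator body.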







[GitHub] [airflow] rafalh commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
rafalh commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1030936257


   I was considering adding an `in_memory` argument, but I agree it does not bring much and unnecessarily increases the operator's implementation and API complexity. AFAIK other operators also use temporary files and don't have an option to change it.
   I thought the test failures were unrelated to my changes, but I was wrong. I am going to look into them this week.





[GitHub] [airflow] rafalh edited a comment on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
rafalh edited a comment on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1032878466


   > But for this operator, since it could be about transferring big files, streaming from GCS to S3 with a multipart upload would be great.
   
   It would be cool, but it would require more changes because AFAIK hook classes do not support streaming right now.





[GitHub] [airflow] rafalh commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
rafalh commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1032878466


   > But for this operator, since it could be about transferring big files, streaming from GCS to S3 with a multipart upload would be great.
   
   It would be cool, but it would require more changes because AFAIK hook classes do not support streaming right now.





[GitHub] [airflow] boring-cyborg[bot] commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1036006529


   Awesome work, congrats on your first merged pull request!
   





[GitHub] [airflow] rafalh commented on a change in pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
rafalh commented on a change in pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#discussion_r798653064



##########
File path: airflow/providers/amazon/aws/transfers/gcs_to_s3.py
##########
@@ -164,14 +165,18 @@ def execute(self, context: 'Context') -> List[str]:
         if files:
 
             for file in files:
-                file_bytes = hook.download(object_name=file, bucket_name=self.bucket)
-
-                dest_key = self.dest_s3_key + file
-                self.log.info("Saving file to %s", dest_key)
-
-                s3_hook.load_bytes(
-                    file_bytes, key=dest_key, replace=self.replace, acl_policy=self.s3_acl_policy
-                )
+                with NamedTemporaryFile() as local_tmp_file:
+                    hook.download(object_name=file, bucket_name=self.bucket, filename=local_tmp_file.name)

Review comment:
       Changed the code to use `GCSHook.provide_file`. I didn't use it at first because in the other operators I've seen, `download_file` was used more frequently.







[GitHub] [airflow] github-actions[bot] commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1030063534


   The PR is likely OK to be merged with just a subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.





[GitHub] [airflow] raphaelauv commented on pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
raphaelauv commented on pull request #21295:
URL: https://github.com/apache/airflow/pull/21295#issuecomment-1030940476


   If it's the common pattern to write to a temp file, then you are right, it's better to align the operators.
   
   But for this operator, since it could be about transferring big files, streaming from GCS to S3 with a multipart upload would be great.
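
A rough sketch of what such streaming could look like, under stated assumptions: `source` is any readable binary stream and `upload_part(part_number, data)` is a hypothetical callable standing in for a multipart upload call. Neither is an existing Airflow hook method. The 5 MiB default mirrors the S3 minimum size for every part except the last:

```python
import io


def stream_in_parts(source, upload_part, part_size=5 * 1024 * 1024):
    """Read `source` in fixed-size chunks and hand each chunk to `upload_part`.

    `upload_part(part_number, data)` is a hypothetical stand-in for a
    multipart upload call. Returns the number of parts sent.
    """
    part_number = 0
    while True:
        chunk = source.read(part_size)
        if not chunk:
            break
        # S3 part numbers start at 1.
        part_number += 1
        upload_part(part_number, chunk)
    return part_number


# Usage with an in-memory stream and a list collecting the parts.
parts = []
count = stream_in_parts(io.BytesIO(b"abcdefghij"), lambda n, data: parts.append((n, data)), part_size=4)
```

Only one chunk is held in memory at a time, so this needs neither the RAM of the original approach nor the disk space of the temporary-file approach.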





[GitHub] [airflow] eladkal merged pull request #21295: Use temporary file in GCSToS3Operator

Posted by GitBox <gi...@apache.org>.
eladkal merged pull request #21295:
URL: https://github.com/apache/airflow/pull/21295


   

