You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/13 20:21:09 UTC

[GitHub] [airflow] john-jac opened a new pull request #17609: sftp_to_s3 stream file option

john-jac opened a new pull request #17609:
URL: https://github.com/apache/airflow/pull/17609


   Adds the option to stream the file directly from sftp client to s3 rather than first copy to a local temporary file.  This is required whenever the size of the file exceeds the temporary storage of the worker.
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/main/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17609: sftp_to_s3 stream file option

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17609:
URL: https://github.com/apache/airflow/pull/17609#issuecomment-898912306


   Two things:
   1) Static checks are failing (i heartily recommend installing [pre-commit](https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) to catch those kind of checks at commit time (saves a lot on iterating with the change) 
   
   2) Unit tests should be added. We are very keen on getting all the new functionality covered by unit tests - any change that adds some functionality should have accompanying unit-tests to avoid regressions.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17609: sftp_to_s3 stream file option

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17609:
URL: https://github.com/apache/airflow/pull/17609#issuecomment-899055233


   > Is there any advantage on saving the file locally in a temporary manner? I am wondering if it makes sense to just change the way it uploads the file to S3 without giving the option to store the temporary file in local system
   
   I think the main reason are implementation details of the `upload_fileobj`. It's not really obvious how the data is buffered while `upload_fileobj` runs so there might be significant memory usage during this operation. But the main reason is that from what I see the description of upload_fileobj, whenever possible it will use multiple threads and upload s3 object in parallel (which - I know for a fact) can speed up the s3 upload immensely (this is how S3 upload is designed). However (my guess but quite likely), this cannot be done if the "fileobj" does not provide "seek()" functionality. Looking how sftp get is implemented, it's fileobj does not allow seek, it can only read the file sequentially (this is how sftp protocol works I believe). It could only provide "seek" if it loaded the file entirely in memory first (but this would not be good for huge files).
   
   So if you have a fast (local network) sftp connection, downloading the file first and then uploading the local file might significantly speed up the transfer, as `upload_fileobj` will be able to utilise multiple threads to upload.  That's moslty educated guess, but I think  it's very likely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] JavierLopezT commented on pull request #17609: sftp_to_s3 stream file option

Posted by GitBox <gi...@apache.org>.
JavierLopezT commented on pull request #17609:
URL: https://github.com/apache/airflow/pull/17609#issuecomment-899034684


   Is there any advantage on saving the file locally in a temporary manner? I am wondering if it makes sense to just change the way it uploads the file to S3 without giving the option to store the temporary file in local system


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk merged pull request #17609: sftp_to_s3 stream file option

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #17609:
URL: https://github.com/apache/airflow/pull/17609


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17609: sftp_to_s3 stream file option

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17609:
URL: https://github.com/apache/airflow/pull/17609#issuecomment-899055233


   > Is there any advantage on saving the file locally in a temporary manner? I am wondering if it makes sense to just change the way it uploads the file to S3 without giving the option to store the temporary file in local system
   
   I think the main reason are implementation details of the `upload_fileobj`. It's not really obvious how the data is buffered while `upload_fileobj` runs so there might be significant memory usage during this operation. From what I see the description of upload_fileobj, whenever possible it will use multiple threads and upload s3 object in parallel (which - I know for a fact) can speed up the s3 upload immensely (this is how S3 upload is designed). However (my guess but quite likely), this cannot be done if the "fileobj" does not provide "seek()" functionality. Looking how sftp get is implemented, it's fileobj does not allow seek, it can only read the file sequentially (this is how sftp protocol works I believe). It could only provide "seek" if it loaded the file entirely in memory first (but this would not be good for huge files).
   
   So if you have a fast (local network) sftp connection, downloading the file first and then uploading the local file might significantly speed up the transfer, as `upload_fileobj` will be able to utilise multiple threads to upload.  That's moslty educated guess, but I think  it's very likely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org