Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/06/06 10:44:49 UTC

[GitHub] [airflow] randr97 opened a new issue #16286: SFTPHook cannot download large files

randr97 opened a new issue #16286:
URL: https://github.com/apache/airflow/issues/16286


   **Apache Airflow version**: 2.0.1 (Should apply to previous versions and later ones as well)
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: AWS, EC2 c5.xlarge
   - **OS** (e.g. from /etc/os-release): ubuntu 18.04
   - **Kernel** (e.g. `uname -a`): Linux
   
   **What happened**:
   In `airflow.providers.sftp.hooks.sftp.SFTPHook`, downloading a file larger than 18 MiB hangs: the download runs indefinitely and never completes.
   
   **What you expected to happen**:
   The download should complete in seconds. A file smaller than 18 MiB downloads in a few seconds.
   This looks like an underlying issue in the `paramiko` library.
   Related reports on paramiko's GitHub and on Stack Overflow:
   1. https://github.com/paramiko/paramiko/issues/926
   2. https://stackoverflow.com/questions/12486623/paramiko-fails-to-download-large-files-1gb
   3. https://stackoverflow.com/questions/3459071/paramiko-sftp-hangs-on-get
   
   **How to reproduce it**:
   1. Create a file larger than 18 MiB
   2. Upload it to an SFTP server
   3. Use the Airflow SFTPHook to download it
   4. The task should run forever
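
The steps above can be sketched roughly as follows. This is a minimal sketch, not a verified reproduction: the helper names `make_test_file` and `download_with_hook` are hypothetical, the connection id and paths are placeholders, and the hook's first constructor argument is passed positionally because its keyword name differs between provider versions. Uploading the file to the SFTP server (step 2) is left out.

```python
import os
import tempfile


def make_test_file(path: str, size_mib: int = 20) -> int:
    """Step 1: write `size_mib` MiB of zero bytes to `path`; return its size in bytes."""
    chunk = b"\0" * (1024 * 1024)
    with open(path, "wb") as f:
        for _ in range(size_mib):
            f.write(chunk)
    return os.path.getsize(path)


def download_with_hook(conn_id: str, remote_path: str, local_path: str) -> None:
    """Step 3: fetch the file through SFTPHook (needs the Airflow SFTP provider)."""
    # Imported lazily so the rest of the sketch runs without Airflow installed.
    from airflow.providers.sftp.hooks.sftp import SFTPHook

    hook = SFTPHook(conn_id)  # connection id passed positionally (assumption, see above)
    hook.retrieve_file(remote_path, local_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, "big.bin")
        print(make_test_file(local), "bytes written")
        # After uploading the file to the server (step 2), something like:
        # download_with_hook("sftp_default", "/upload/big.bin", local + ".copy")
```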
   
   **Anything else we need to know**:
   After some exploring, I found a solution to the problem and have applied it in my project, but it would be great if the community could dig deeper.
   The solution is here: https://gist.github.com/vznncv/cb454c21d901438cc228916fbe6f070f
   The gist is by @vznncv; credit to him for coming up with it.
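
The gist is not reproduced here. As a hedged illustration only, one workaround often suggested for hanging paramiko downloads is to skip `SFTPClient.get()` and read the remote file sequentially in fixed-size chunks. This is an assumption about a possible mitigation, not necessarily what the gist does, and it may be slower than a working prefetch. The names `copy_in_chunks`, `sftp_get_sequential` and `CHUNK_SIZE` are hypothetical; `sftp` is assumed to be an already connected `paramiko.SFTPClient`.

```python
import io
from typing import BinaryIO

CHUNK_SIZE = 32768  # assumed chunk size; tune for your server


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy src to dst one chunk at a time; return the number of bytes copied."""
    total = 0
    while True:
        data = src.read(chunk_size)
        if not data:  # empty read means end of file
            break
        dst.write(data)
        total += len(data)
    return total


def sftp_get_sequential(sftp, remote_path: str, local_path: str) -> int:
    """Download remote_path with plain sequential reads (no prefetch)."""
    # `sftp` is assumed to be a connected paramiko.SFTPClient.
    with sftp.open(remote_path, "rb") as remote, open(local_path, "wb") as local:
        return copy_in_chunks(remote, local)


if __name__ == "__main__":
    # Local demonstration with in-memory streams instead of a real server.
    source = io.BytesIO(b"x" * 100_000)
    target = io.BytesIO()
    print(copy_in_chunks(source, target, 4096), "bytes copied")
```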


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] randr97 commented on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
randr97 commented on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-869184869


   I think this solution [Link](https://gist.github.com/vznncv/cb454c21d901438cc228916fbe6f070f) can be a fix without really changing any libraries... Have been using it for some time and haven't run into any issues so far! What do you guys think?





[GitHub] [airflow] randr97 commented on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
randr97 commented on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-855391416


   @potiuk that would definitely be a good place to start. How do you want to go about this? I would definitely want to contribute to this.





[GitHub] [airflow] potiuk commented on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-860650099


   Agree with @ashb. Parallel-ssh looks better as well. It seems rather popular judging by the number of forks, and it has very few dependencies (gevent, python-ssh2). Not super active (but what activity would you expect from such a library?).





[GitHub] [airflow] potiuk commented on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-855389811


   Maybe we should consider rewriting the hook using Twisted? It seems all the APIs needed are there: https://twistedmatrix.com/documents/current/api/twisted.conch.ssh.filetransfer.FileTransferClient.html. Twisted seems a bit more modern with its async approach, and the bug in paramiko seems to have been around since 2017.





[GitHub] [airflow] malthe edited a comment on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
malthe edited a comment on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-860623530


   @potiuk what about https://github.com/ParallelSSH/parallel-ssh#sftp – ?
   
   See also https://github.com/ParallelSSH/parallel-ssh#why-this-library:
   
   > Because other options are either immature, unstable, lacking in performance or all of the aforementioned.
   > 
   > Certain other self-proclaimed leading Python SSH libraries leave a lot to be desired from a performance and stability point of view, as well as suffering from a lack of maintenance with hundreds of open issues, unresolved pull requests and inherent design flaws.





[GitHub] [airflow] randr97 commented on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
randr97 commented on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-855395137


   @potiuk will take some time out and start working on the PR. Thanks





[GitHub] [airflow] potiuk commented on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-861691933


   Hmm :( tough choice :)





[GitHub] [airflow] ashb commented on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-860638926


   I'd probably caution against introducing Twisted: with Python 3.6/3.7+ the built-in asyncio can do most of what Twisted does without the need for a large external dependency.





[GitHub] [airflow] eladkal edited a comment on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
eladkal edited a comment on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-1005970693


   > Have been using it for some time and haven't run into any issues so far! What do you guys think?
   
   If you have a proposed solution - it's better to open a PR and let people review it.





[GitHub] [airflow] malthe edited a comment on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
malthe edited a comment on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-861788138


   @uranusjr you can build it from source, against system-provided libraries.
   
   But there's something wrong in the packaging because when I look at the installed files, there are lots of .c source files:
   
   <img width="507" alt="image" src="https://user-images.githubusercontent.com/26405/122115227-bf0e0680-ce13-11eb-82ce-9fd6f8a85290.png">
   





[GitHub] [airflow] potiuk commented on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-855393404


   > @potiuk that would definitely be a good place to start. How do you want to go about this? I would definitely want to contribute to this.
   
   I do not think anything special is needed. Just rewrite the Hook/Operator, starting with replacing paramiko/sftp with the twisted library in the setup.py deps. Then we can release a major version of the SFTP provider with it. That's pretty much it :)





[GitHub] [airflow] uranusjr edited a comment on issue #16286: SFTPHook cannot download large files

Posted by GitBox <gi...@apache.org>.
uranusjr edited a comment on issue #16286:
URL: https://github.com/apache/airflow/issues/16286#issuecomment-861455393


   Note that parallel-ssh is not small either (mainly because it ships its own copy of `libssh` and `libssh2`).
   
   ```
   $ pip wheel -w tw twisted
   ...
   $ pip wheel -w ps parallel-ssh
   ...
   $ du -s tw ps
   4336    tw
   18132   ps
   ```
   
   Parallel-ssh is over four times larger than Twisted if dependencies are considered.
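
The ratio can be checked from the `du -s` figures above. A small sketch, assuming `du` reported 1 KiB blocks (the usual GNU default, not stated in the output); the constant names are hypothetical:

```python
# Sizes copied from the `du -s` output above; 1 KiB blocks are assumed.
TWISTED_KIB = 4336
PARALLEL_SSH_KIB = 18132

ratio = PARALLEL_SSH_KIB / TWISTED_KIB
print(f"parallel-ssh / twisted = {ratio:.2f}x")
print(f"twisted: {TWISTED_KIB / 1024:.1f} MiB")
print(f"parallel-ssh: {PARALLEL_SSH_KIB / 1024:.1f} MiB")
```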

