Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/07 01:10:37 UTC

[GitHub] [arrow] lukehsiao opened a new issue, #15233: pyarrow copy_files hangs indefinitely

lukehsiao opened a new issue, #15233:
URL: https://github.com/apache/arrow/issues/15233

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We are working on a Python package that calls pyarrow's `copy_files` function to copy a local directory to S3. We've noticed that this hangs indefinitely when copying a directory, even though it works for individual files.
   
   A simple reproducer seems to be:
   ```py
   from ray.air._internal.remote_storage import upload_to_uri
   dir = "/some/directory/local"
   uri = "s3://some/s3/bucket"
   upload_to_uri(dir, uri)
   ```
   
   Here, Ray just wraps `pyarrow.fs.copy_files`: https://github.com/ray-project/ray/blob/d7b2b49a962bf33dae7a50376f159ab15d80800f/python/ray/air/_internal/remote_storage.py#L195
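
   For reference, this is roughly the direct pyarrow call that the Ray helper reduces to (a sketch with the placeholder paths from above; resolving the bucket with `FileSystem.from_uri` approximates what the wrapper does):
   ```py
   import pyarrow.fs

   # Placeholder paths from the reproducer above.
   local_dir = "/some/directory/local"
   dest_fs, dest_path = pyarrow.fs.FileSystem.from_uri("s3://some/s3/bucket")

   # Equivalent direct call; this is what appears to hang for a directory.
   pyarrow.fs.copy_files(local_dir, dest_path, destination_filesystem=dest_fs)
   ```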
   
   This results in the following flamegraph.
   
   ![profile-idle](https://user-images.githubusercontent.com/7573542/211124149-1da9fc70-2dbb-4094-83cb-54a7c6c2bacf.svg)
   
   And an `strace` of that process looks like this:
   
   ```
   stat("/some/directory/local/result.json", {st_mode=S_IFREG|0664, st_size=2345, ...}) = 0
   getpid()                                = 1639901
   futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
   getpid()                                = 1639901
   futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
   getpid()                                = 1639901
   futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
   getpid()                                = 1639901
   futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
   getpid()                                = 1639901
   futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
   getpid()                                = 1639901
   futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
   getpid()                                = 1639901
   futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
   getpid()                                = 1639901
   futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
   getpid()                                = 1639901
   getpid()                                = 1639901
   getpid()                                = 1639901
   futex(0x5580a3d374e8, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
   --- SIGTSTP {si_signo=SIGTSTP, si_code=SI_KERNEL} ---
   --- stopped by SIGTSTP ---
   +++ killed by SIGKILL +++
   ```
   
   This is using
   ```
   .venv ❯ pip show pyarrow
   Name: pyarrow
   Version: 8.0.0
   ```
   On Python 3.10.9, on a machine running Ubuntu 20.04.5 LTS.
   
   ### Component(s)
   
   Python




[GitHub] [arrow] EpsilonPrime commented on issue #15233: [Python] pyarrow.fs.copy_files hangs indefinitely

Posted by GitBox <gi...@apache.org>.
EpsilonPrime commented on issue #15233:
URL: https://github.com/apache/arrow/issues/15233#issuecomment-1376686021

   I'm taking a look at this. In the meantime, I have one debugging question for you. When I look at `_upload_to_uri_with_exclude`, it appears to do a recursive copy. Is there a possibility that you've included the current directory "." or the parent directory ".." and that's why the operation never finishes? One way to determine this would be to insert a log/print at https://github.com/ray-project/ray/blob/d7b2b49a962bf33dae7a50376f159ab15d80800f/python/ray/air/_internal/remote_storage.py#L238 to output the paths that will be copied from.
   




[GitHub] [arrow] krfricke commented on issue #15233: [Python] pyarrow.fs.copy_files hangs indefinitely

Posted by "krfricke (via GitHub)" <gi...@apache.org>.
krfricke commented on issue #15233:
URL: https://github.com/apache/arrow/issues/15233#issuecomment-1421103441

   I think this issue is a duplicate of #32372. I've added more details in that issue, but in a nutshell, `pyarrow.fs.copy_files` hangs for S3 buckets with `use_threads=True` if more files are uploaded than there are CPU cores available:
   
   ```
   # assuming a machine with 8 CPU cores
   mkdir -p /tmp/pa-s3
   cd /tmp/pa-s3 
   for i in {1..7}; do touch $i.txt; done
   # This works
   python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"
   for i in {1..8}; do touch $i.txt; done  
   # This hangs forever
   python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"
   ```
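
   For what it's worth, a user-side workaround sketch (untested in this thread): forcing a serial copy sidesteps the pool contention at the cost of throughput. The bucket name is the same placeholder as above:
   ```py
   import pyarrow.fs

   # Untested workaround sketch: use_threads=False copies the files serially on
   # the calling thread, so copy_files never fork-joins on the shared thread pool.
   pyarrow.fs.copy_files("/tmp/pa-s3", "s3://bucket/folder", use_threads=False)
   ```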




[GitHub] [arrow] westonpace commented on issue #15233: [Python] pyarrow.fs.copy_files hangs indefinitely

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #15233:
URL: https://github.com/apache/arrow/issues/15233#issuecomment-1421342033

   This sounds very similar to nested parallelism deadlocks we have had in the past.
   
    * Outermost call: fork-join on a bunch of items (in this case it looks like we are doing fork-join on files)
    * Inner task: fork-join on something else (e.g. in parquet it would be parquet column decoding)
   
   If the inner task is blocking on a "join" then it is wasting a thread pool thread.  If enough thread pool threads get wasted this way, then every thread in the pool ends up blocked waiting on other thread pool tasks and no thread is free to actually run them.
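
   A toy illustration of that failure mode with a plain Python thread pool (this is just the shape of the problem, not Arrow's code):
   ```py
   from concurrent.futures import ThreadPoolExecutor, TimeoutError
   import threading

   POOL_SIZE = 2
   pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
   all_outer_running = threading.Barrier(POOL_SIZE)

   def inner(i):
       return i * i

   def outer(i):
       all_outer_running.wait()  # make sure every worker holds an outer task
       # Fork inner work onto the *same* pool and join it: once every worker is
       # parked here, no thread is left to actually run inner().
       return pool.submit(inner, i).result()

   outer_futures = [pool.submit(outer, i) for i in range(POOL_SIZE)]
   try:
       print([f.result(timeout=5) for f in outer_futures])
   except TimeoutError:
       print("deadlocked: every pool thread is blocked on a join")
   finally:
       # Unblock the stuck workers so the script can exit (needs Python 3.9+).
       pool.shutdown(wait=False, cancel_futures=True)
   ```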
   
   The solution we adopted was to migrate to an async model so that the "join" step becomes "return a future" instead of "block until done".  This yields roughly the following rules:
   
    * The user thread (the python top-level thread) should block on a top-level future
    * CPU threads should never block (outside of minor blocking on mutex guards to sequence a tiny critical section)
    * I/O threads should only block on OS calls.  They should never block waiting for other tasks.
   
   It seems like the copy_files/S3 combination is violating one of the above rules.  There is an OptionalParallelFor in CopyFiles which blocks, but I think that is called from the user thread, so that is OK.  @EpsilonPrime, if you can reproduce this, I would grab a thread dump from gdb and check what the thread tasks are blocking on.  The fix will probably be to move copy_files over to using async APIs (internally).
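
   And a correspondingly rough sketch of the async direction, again with a plain Python pool rather than Arrow's internal APIs: the outer step hands back the inner future instead of joining it, so only the user thread ever blocks.
   ```py
   from concurrent.futures import ThreadPoolExecutor

   pool = ThreadPoolExecutor(max_workers=2)

   def inner(i):
       return i * i

   def outer(i):
       # "Return a future" instead of "block until done": the worker thread is
       # freed immediately and can go run inner tasks.
       return pool.submit(inner, i)

   # Only the user (top-level) thread blocks, and only on futures.
   inner_futures = [pool.submit(outer, i).result() for i in range(8)]
   print([f.result() for f in inner_futures])
   pool.shutdown()
   ```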




[GitHub] [arrow] lukehsiao commented on issue #15233: [Python] pyarrow.fs.copy_files hangs indefinitely

Posted by GitBox <gi...@apache.org>.
lukehsiao commented on issue #15233:
URL: https://github.com/apache/arrow/issues/15233#issuecomment-1376741426

   > I'm taking a look at this. In the meantime, I have one debugging question for you. When I look at `_upload_to_uri_with_exclude`, it appears to do a recursive copy. Is there a possibility that you've included the current directory "." or the parent directory ".." and that's why the operation never finishes?
   
   In our case, `exclude` is None, so the code path to `_upload_to_uri_with_exclude` is never hit (and you'll notice it doesn't show up in the flamegraph). So it's essentially just the call to `pyarrow.fs.copy_files`.
   




[GitHub] [arrow] EpsilonPrime commented on issue #15233: [Python] pyarrow.fs.copy_files hangs indefinitely

Posted by "EpsilonPrime (via GitHub)" <gi...@apache.org>.
EpsilonPrime commented on issue #15233:
URL: https://github.com/apache/arrow/issues/15233#issuecomment-1477386944

   I have written a reproduction test case that detects the thread contention issue (and it is ready to check in once the fix is ready).  What is happening is that when copying a file (filesystem.cc:613), the CopyStream happens as expected and the stream is then passed to the close routine to complete.  That delegates to CloseAsync, which handles uploading parts (calling UploadPart).  UploadPart then adds its work to the thread pool, which overloads the executor.  For the case of an 8-thread pool with 8 tasks (each small enough to fit in a single part), this ends up being 16 tasks contending for a size-8 executor.
   
   The easy solution is to limit the number of tasks submitted to the pool (merely leaving one extra thread free appears to be enough for the pool to drain, although this needs verification).  The second is to modify the close routine to do the work on the existing thread (i.e., not be asynchronous).  This would require reworking at least 5 functions and might require even more work for the case where there are multiple parts per file (which we do not have a test for yet).
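
   Until a fix lands, a hedged user-level sketch of the "leave headroom in the pool" idea. Which of pyarrow's pools actually needs the headroom is not pinned down in this thread, so both knobs are shown; the bucket and the file count are placeholders:
   ```py
   import pyarrow as pa
   import pyarrow.fs

   n_files = 8  # hypothetical: number of files being copied

   # Assumption: giving the pools more threads than in-flight copy tasks leaves
   # room for the part-upload work queued by CloseAsync. Verify for your workload.
   pa.set_cpu_count(max(pa.cpu_count(), n_files + 1))
   pa.set_io_thread_count(max(pa.io_thread_count(), n_files + 1))

   pyarrow.fs.copy_files("/tmp/pa-s3", "s3://bucket/folder")
   ```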
   

