Posted to jira@arrow.apache.org by "David McGuire (Jira)" <ji...@apache.org> on 2020/09/17 16:10:00 UTC

[jira] [Comment Edited] (ARROW-10029) Deadlock in the interaction of pyarrow FileSystem and ParquetDataset

    [ https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197784#comment-17197784 ] 

David McGuire edited comment on ARROW-10029 at 9/17/20, 4:09 PM:
-----------------------------------------------------------------

Without multi-threading, there is no deadlock. Neither call below passes {{metadata_nthreads}}, and both complete:

{code}
# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False)
print(len(dataset.pieces))

# This also completes
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False)
print(len(dataset.pieces))
{code}

Running:

{code}
$ python repro.py 
6
6
{code}


> Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
> --------------------------------------------------------------------
>
>                 Key: ARROW-10029
>                 URL: https://issues.apache.org/jira/browse/ARROW-10029
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: David McGuire
>            Priority: Major
>         Attachments: repro.py
>
>
> @martindurant good news (for you): I have a repro test case that is 100% {{pyarrow}}, so it looks like {{s3fs}} is not involved.
> @jorisvandenbossche how should I follow up with this, based on {{pyarrow.filesystem.LocalFileSystem}}?
> Viewing the file system *directories* as a tree, one thread is required for every non-leaf node to avoid deadlock. In the listing below, numbered entries are non-leaf directories and starred entries are leaf directories:
> 1) dataset
> 2) dataset/foo=1
> 3) dataset/foo=1/bar=2
> 4) dataset/foo=1/bar=2/baz=0
> 5) dataset/foo=1/bar=2/baz=1
> 6) dataset/foo=1/bar=2/baz=2
> *) dataset/foo=1/bar=2/baz=0/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=true
> *) dataset/foo=1/bar=2/baz=0/qux=true
> *) dataset/foo=1/bar=2/baz=2/qux=false
> *) dataset/foo=1/bar=2/baz=2/qux=true
> {code}
> import pyarrow.parquet as pq
> from pyarrow.filesystem import LocalFileSystem
>
> class LoggingLocalFileSystem(LocalFileSystem):
>     """LocalFileSystem that prints every directory it walks."""
>     def walk(self, path):
>         print(path)
>         return super().walk(path)
>
> fs = LoggingLocalFileSystem()
> dataset_url = "dataset"
>
> # Six metadata threads
> threads = 6
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
>
> # Five metadata threads
> threads = 5
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> {code}
> *_Call with 6 threads completes._*
> *_Call with 5 threads hangs indefinitely._*
> {code}
> $ python repro.py 
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> dataset/foo=1/bar=2/baz=0/qux=false
> dataset/foo=1/bar=2/baz=0/qux=true
> dataset/foo=1/bar=2/baz=1/qux=false
> dataset/foo=1/bar=2/baz=1/qux=true
> dataset/foo=1/bar=2/baz=2/qux=false
> dataset/foo=1/bar=2/baz=2/qux=true
> 6
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> ^C
> ...
> KeyboardInterrupt
> ^C
> ...
> KeyboardInterrupt
> {code}
> **NOTE:** this *also* happens with the un-decorated {{LocalFileSystem}}, and when omitting the {{filesystem}} argument.
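
To make the "one thread per non-leaf node" rule from the description concrete, here is a small sketch that counts the non-leaf directories in a list of directory paths. {{count_non_leaf_dirs}} is a hypothetical helper, not part of pyarrow, and it only applies the rule as stated by the reporter. It does not derive the threshold from pyarrow's code.

```python
def count_non_leaf_dirs(directories):
    """Count directories that have at least one subdirectory in the list."""
    known = set(directories)
    parents = set()
    for path in known:
        parent = path.rpartition("/")[0]  # "" for a top-level directory
        if parent in known:
            parents.add(parent)
    return len(parents)


# Directory layout copied from the issue description.
BAR = "dataset/foo=1/bar=2"
DIRECTORIES = ["dataset", "dataset/foo=1", BAR]
for baz in range(3):
    DIRECTORIES.append(f"{BAR}/baz={baz}")
    for qux in ("false", "true"):
        DIRECTORIES.append(f"{BAR}/baz={baz}/qux={qux}")

print(count_non_leaf_dirs(DIRECTORIES))  # 6
```

For this layout the count is 6, which lines up with the report that {{metadata_nthreads=6}} completes while {{metadata_nthreads=5}} hangs.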



--
This message was sent by Atlassian Jira
(v8.3.4#803005)