Posted to jira@arrow.apache.org by "David McGuire (Jira)" <ji...@apache.org> on 2020/09/23 20:15:00 UTC

[jira] [Comment Edited] (ARROW-10029) [Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset

    [ https://issues.apache.org/jira/browse/ARROW-10029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201068#comment-17201068 ] 

David McGuire edited comment on ARROW-10029 at 9/23/20, 8:14 PM:
-----------------------------------------------------------------

[~jorisvandenbossche] as best I can tell, this impacts all Hive-style partitioned datasets, across both S3 and local storage, going back as many versions as I was able to test (until {{0.10.0}}, where the {{Cython}} build dependency stopped me). This makes me think that supporting this use case was never a high priority, and might not be in the future. Can you help me understand the relative priority of fixing this defect?

Also, if the lack of traction is because this uses the _legacy_ implementation, should I start exploring the new implementation (available starting with {{1.0.0}}) and reporting defects against it? It seems to trip up on something pretty simple to start with (a Spark {{_SUCCESS}} file):

{code}
OSError: Could not open parquet input source 'our-bucket/some/partitioned/dataset/_SUCCESS': Invalid: Parquet file size is 0 bytes
{code}



> [Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10029
>                 URL: https://issues.apache.org/jira/browse/ARROW-10029
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: David McGuire
>            Priority: Major
>         Attachments: repro.py
>
>
> @martindurant good news (for you): I have a repro test case that is 100% {{pyarrow}}, so it looks like {{s3fs}} is not involved.
> @jorisvandenbossche how should I follow up with this, based on {{pyarrow.filesystem.LocalFileSystem}}?
> Viewing the file system *directories* as a tree, one thread is required for every non-leaf node in order to avoid deadlock.
> 1) dataset
> 2) dataset/foo=1
> 3) dataset/foo=1/bar=2
> 4) dataset/foo=1/bar=2/baz=0
> 5) dataset/foo=1/bar=2/baz=1
> 6) dataset/foo=1/bar=2/baz=2
> *) dataset/foo=1/bar=2/baz=0/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=false
> *) dataset/foo=1/bar=2/baz=1/qux=true
> *) dataset/foo=1/bar=2/baz=0/qux=true
> *) dataset/foo=1/bar=2/baz=2/qux=false
> *) dataset/foo=1/bar=2/baz=2/qux=true
> {code}
> import pyarrow.parquet as pq
> import pyarrow.filesystem as fs
> class LoggingLocalFileSystem(fs.LocalFileSystem):
>     def walk(self, path):
>         print(path)
>         return super().walk(path)
> fs = LoggingLocalFileSystem()
> dataset_url = "dataset"
> threads = 6
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> threads = 5
> dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
> print(len(dataset.pieces))
> {code}
> *_Call with 6 threads completes._*
> *_Call with 5 threads hangs indefinitely._*
> {code}
> $ python repro.py 
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> dataset/foo=1/bar=2/baz=0/qux=false
> dataset/foo=1/bar=2/baz=0/qux=true
> dataset/foo=1/bar=2/baz=1/qux=false
> dataset/foo=1/bar=2/baz=1/qux=true
> dataset/foo=1/bar=2/baz=2/qux=false
> dataset/foo=1/bar=2/baz=2/qux=true
> 6
> dataset
> dataset/foo=1
> dataset/foo=1/bar=2
> dataset/foo=1/bar=2/baz=0
> dataset/foo=1/bar=2/baz=1
> dataset/foo=1/bar=2/baz=2
> ^C
> ...
> KeyboardInterrupt
> ^C
> ...
> KeyboardInterrupt
> {code}
> *NOTE:* this *also* happens with the undecorated {{LocalFileSystem}}, and when omitting the {{filesystem}} argument.
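The starvation mechanism described in the quoted issue can be sketched with nothing but the standard library: each non-leaf task parks a pool worker while blocking on its children, so a pool smaller than the number of simultaneously blocked non-leaf tasks (plus one worker for the leaves) can never make progress. The miniature tree and {{walk}} function below are illustrative only, not the actual {{ParquetDataset}} code:

```python
from concurrent.futures import ThreadPoolExecutor

# Miniature of the recursive metadata walk: every non-leaf directory's
# task submits its children to the SAME pool and then blocks on their
# results, holding its worker the whole time.
TREE = {
    "dataset": ["foo=1"],
    "foo=1": ["bar=2"],
    "bar=2": ["baz=0", "baz=1", "baz=2"],
    "baz=0": [], "baz=1": [], "baz=2": [],
}

def walk(pool, node):
    children = TREE[node]
    if not children:
        return 1  # leaf piece found
    # This worker now blocks until every child task gets a worker too.
    futures = [pool.submit(walk, pool, child) for child in children]
    return sum(f.result() for f in futures)

# The root runs on the calling thread, so the pool must cover the two
# blocked non-leaf tasks (foo=1, bar=2) plus one worker to drain the
# leaves: 3 workers complete, while 2 would hang just like the
# 5-thread repro above.
with ThreadPoolExecutor(max_workers=3) as pool:
    pieces = walk(pool, "dataset")
print(pieces)  # 3 leaf "pieces"
```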



--
This message was sent by Atlassian Jira
(v8.3.4#803005)