
Posted to jira@arrow.apache.org by "Jon Rosenberg (Jira)" <ji...@apache.org> on 2022/03/30 20:17:00 UTC

[jira] [Created] (ARROW-16077) [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path

Jon Rosenberg created ARROW-16077:
-------------------------------------

             Summary: [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
                 Key: ARROW-16077
                 URL: https://issues.apache.org/jira/browse/ARROW-16077
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0
            Reporter: Jon Rosenberg


Reading a partitioned Parquet dataset from adlfs with pyarrow (through pandas) raises an unnecessary exception when the forward slashes in the file paths listed by adlfs do not match those of the base directory, i.e.:

 
{code:python}
import pandas as pd

pd.read_parquet("adl://resource/path/to/parquet/files"){code}
results in an exception of the form
{code:bash}
pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'path/to/parquet/files/part-0001.parquet', which is outside base dir '/path/to/parquet/files/'{code}
 

When testing with a modified adlfs method that prepends a slash to every returned file path, we still get an error for paths that would otherwise be handled correctly, namely those containing a double slash where there should be a single one, i.e.:



 
{code:python}
import pandas as pd

pd.read_parquet("adl://resource/path/to//parquet/files") {code}
would throw
{code:bash}
pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path '/path/to/parquet/files/part-0001.parquet', which is outside base dir '/path/to//parquet/files/' {code}
In both cases the ls call returned correctly from adlfs, since it discovered the file part-0001.parquet, but the pyarrow exception stops what could otherwise be successful processing.
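
To illustrate the mismatch, here is a minimal sketch of a plain string-prefix check next to a slash-tolerant variant. This is not pyarrow's actual implementation; the helpers `normalize` and `is_under_base` are hypothetical names made up for this example, and the sketch assumes that collapsing repeated slashes and forcing a single leading slash is acceptable for these paths.

```python
def normalize(path: str) -> str:
    """Hypothetical helper: collapse repeated slashes, force one leading slash."""
    parts = [p for p in path.split("/") if p]  # drop empty segments from '//' or edges
    return "/" + "/".join(parts)


def is_under_base(path: str, base_dir: str, *, tolerant: bool = False) -> bool:
    """Hypothetical check: is `path` located under `base_dir`?

    With tolerant=False this is a raw prefix comparison, which fails on the
    paths from this report. With tolerant=True both sides are normalized first.
    """
    if tolerant:
        path = normalize(path)
        base_dir = normalize(base_dir).rstrip("/") + "/"  # keep exactly one trailing slash
    return path.startswith(base_dir)


# Case 1: adlfs lists the file without a leading slash.
listed_1 = "path/to/parquet/files/part-0001.parquet"
base_1 = "/path/to/parquet/files/"

# Case 2: the requested base dir contains a double slash.
listed_2 = "/path/to/parquet/files/part-0001.parquet"
base_2 = "/path/to//parquet/files/"

for listed, base in [(listed_1, base_1), (listed_2, base_2)]:
    print(is_under_base(listed, base), is_under_base(listed, base, tolerant=True))
```

With the raw comparison both cases are rejected, while the tolerant variant accepts them, which is the behaviour this report is asking for.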



--
This message was sent by Atlassian Jira
(v8.20.1#820001)