Posted to jira@arrow.apache.org by "Jon Rosenberg (Jira)" <ji...@apache.org> on 2022/03/30 20:17:00 UTC
[jira] [Created] (ARROW-16077) [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
Jon Rosenberg created ARROW-16077:
-------------------------------------
Summary: [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
Key: ARROW-16077
URL: https://issues.apache.org/jira/browse/ARROW-16077
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 7.0.0
Reporter: Jon Rosenberg
Reading a partitioned parquet dataset from adlfs with pyarrow through pandas throws an unnecessary exception when the file paths listed by adlfs lack the leading forward slash that the base directory has, i.e.:
{code:python}
import pandas as pd
pd.read_parquet("adl://resource/path/to/parquet/files"){code}
results in an exception of the form
{code:bash}
pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'path/to/parquet/files/part-0001.parquet', which is outside base dir '/path/to/parquet/files/'{code}
After modifying the adlfs listing method to prepend a slash to every returned file, we still get an error on paths that would otherwise be handled correctly, namely those with a double slash where there should be a single one, i.e.:
{code:python}
import pandas as pd
pd.read_parquet("adl://resource/path/to//parquet/files") {code}
would throw
{code:bash}
pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path '/path/to/parquet/files/part-0001.parquet', which is outside base dir '/path/to//parquet/files/' {code}
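Both messages are consistent with a literal prefix comparison between the listed path and the base directory. The sketch below is only an illustration of that kind of check, not pyarrow's actual implementation; the helper name `is_under_base` is hypothetical, and the path strings are copied from the two error messages.

```python
def is_under_base(path: str, base_dir: str) -> bool:
    """Literal prefix test with no normalization of either argument.

    Hypothetical helper for illustration; not pyarrow's real code.
    """
    return path.startswith(base_dir)


# Case 1: the listing omits the leading slash that the base dir has.
case_missing_slash = is_under_base(
    "path/to/parquet/files/part-0001.parquet",
    "/path/to/parquet/files/",
)

# Case 2: the base dir keeps the doubled slash from the URL,
# while the listed path contains a single slash.
case_double_slash = is_under_base(
    "/path/to/parquet/files/part-0001.parquet",
    "/path/to//parquet/files/",
)

print(case_missing_slash, case_double_slash)  # False False
```

In both cases the strings differ only in slashes, yet a literal prefix test treats the file as lying outside the base directory.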
In both cases the ls call returned correctly from adlfs, since it discovered the file part-0001.parquet, but the pyarrow exception stops what could otherwise be successful processing.
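One way the comparison could tolerate these differences is to normalize both paths before testing containment. The following is a minimal sketch under that assumption, using only the standard library; `normalize` and `is_under_base_normalized` are hypothetical names, and this is not a proposed patch to pyarrow or adlfs.

```python
import posixpath


def normalize(path: str) -> str:
    """Return an absolute, slash-collapsed form of `path` (hypothetical helper).

    Leading and trailing slashes are stripped first, so the result always has
    exactly one leading slash. Note that normpath also resolves '.' and '..'.
    """
    return posixpath.normpath("/" + path.strip("/"))


def is_under_base_normalized(path: str, base_dir: str) -> bool:
    """Containment test on normalized paths instead of raw strings."""
    base = normalize(base_dir)
    candidate = normalize(path)
    prefix = base.rstrip("/") + "/"
    return candidate == base or candidate.startswith(prefix)


# The two path pairs from the reported error messages.
print(is_under_base_normalized(
    "path/to/parquet/files/part-0001.parquet", "/path/to/parquet/files/"))
print(is_under_base_normalized(
    "/path/to/parquet/files/part-0001.parquet", "/path/to//parquet/files/"))
```

With this normalization both reported pairs are treated as inside the base directory, while a path under a different directory is still rejected.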
--
This message was sent by Atlassian Jira
(v8.20.1#820001)