Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/16 16:19:00 UTC
[jira] [Commented] (ARROW-10937) ArrowInvalid error on reading
partitioned parquet files from S3 (arrow-2.0.0)
[ https://issues.apache.org/jira/browse/ARROW-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250423#comment-17250423 ]
Joris Van den Bossche commented on ARROW-10937:
-----------------------------------------------
Not necessarily related, but ARROW-10264 has a similar error message.
[~Filimonov] Thanks for the report!
Could you try something like this:
{code}
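import s3fs
from pyarrow.fs import FileSelector, FSSpecHandler, PyFileSystem
# Wrap the fsspec-based S3 filesystem so Arrow can use it
filesystem = s3fs.S3FileSystem()
fs = PyFileSystem(FSSpecHandler(filesystem))
# List everything under the dataset root, recursing into partition directories
selector = FileSelector("s3://bucket/test_pyarrow.parquet", recursive=True)
fs.get_file_info(selector)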
{code}
and tell us what it returns?
(Note: I didn't try that code, so there might be a mistake in it, but I think it _should_ work.)
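To illustrate what the error message in the report below suggests: the path that comes back has no "s3://" scheme while the base dir does, so a plain prefix comparison between the two fails. The following is only a sketch under that assumption. The helpers `strip_scheme` and `is_under_base_dir` are hypothetical names made up for this example and are not how Arrow actually performs the check.

```python
# Hypothetical helpers for illustration only; not part of pyarrow or s3fs.

def strip_scheme(path: str) -> str:
    """Drop a leading 'scheme://' prefix, if there is one."""
    scheme, sep, rest = path.partition("://")
    return rest if sep else path


def is_under_base_dir(path: str, base_dir: str) -> bool:
    """Naive check: is `path` equal to, or nested under, `base_dir`?"""
    base = base_dir.rstrip("/")
    return path == base or path.startswith(base + "/")


# Both strings are taken from the error message in the report.
base_dir = "s3://bucket/test_pyarrow.parquet"
yielded = "bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet"

# Compared as-is, the scheme-less path does not start with the base dir.
mismatch = is_under_base_dir(yielded, base_dir)
# With the scheme removed from the base dir, the same path is inside it.
match = is_under_base_dir(yielded, strip_scheme(base_dir))
print(mismatch, match)
```

If the selector output shows scheme-less paths like the one in the error, that would be consistent with this kind of mismatch.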
> ArrowInvalid error on reading partitioned parquet files from S3 (arrow-2.0.0)
> -----------------------------------------------------------------------------
>
> Key: ARROW-10937
> URL: https://issues.apache.org/jira/browse/ARROW-10937
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Vladimir
> Priority: Major
> Fix For: 3.0.0
>
>
> Hello
> It looks like pyarrow-2.0.0 cannot read partitioned datasets from S3 buckets:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> import s3fs
> filesystem = s3fs.S3FileSystem()
> d = pd.date_range('1990-01-01', freq='D', periods=10000)
> vals = np.random.randn(len(d), 4)
> x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
> x['Year'] = x.index.year
> table = pa.Table.from_pandas(x, preserve_index=True)
> pq.write_to_dataset(table, root_path='s3://bucket/test_pyarrow.parquet', partition_cols=['Year'], filesystem=filesystem)
> {code}
>
> Now, reading it via pq.read_table:
> {code:java}
> pq.read_table('s3://bucket/test_pyarrow.parquet', filesystem=filesystem, use_pandas_metadata=True)
> {code}
> raises an exception:
> {code:java}
> ArrowInvalid: GetFileInfo() yielded path 'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet', which is outside base dir 's3://bucket/test_pyarrow.parquet'
> {code}
>
> Direct read in pandas:
> {code:java}
> pd.read_parquet('s3://bucket/test_pyarrow.parquet')
> {code}
> returns an empty DataFrame.
>
> The issue does not exist in pyarrow-1.0.1.