Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/16 08:49:00 UTC

[jira] [Commented] (ARROW-10923) Failure to read parquet from s3 after uploading file to s3-object

    [ https://issues.apache.org/jira/browse/ARROW-10923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250175#comment-17250175 ] 

Joris Van den Bossche commented on ARROW-10923:
-----------------------------------------------

[~dazza] Would you be able to show the actual code you ran to get this error, and also the full error message? (The final error message seems to be missing from the traceback.)
Otherwise it is hard to help or to know what could be going on.

> Failure to read parquet from s3 after uploading file to s3-object
> -----------------------------------------------------------------
>
>                 Key: ARROW-10923
>                 URL: https://issues.apache.org/jira/browse/ARROW-10923
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Darren Weber
>            Priority: Major
>
> After a parquet file is copied to an s3-bucket and s3-key, pyarrow fails to read it from s3. The desired behavior is that a parquet s3-object should be self-contained: it should not depend on, or track, any substantial metadata about the storage engine or file-system location it was saved to in a way that prevents relocating the object. To reproduce the problem, save any parquet file on a Linux file system (ext4), use the aws-cli to copy that file to any s3-object, and then try to use geopandas.read_parquet to load that s3-object.
> ```
> File "/opt/conda/envs/project/lib/python3.7/site-packages/geopandas/io/arrow.py", line 404, in _read_parquet
>     table = parquet.read_table(path, columns=columns, **kwargs)
> File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/parquet.py", line 1573, in read_table
>     ignore_prefixes=ignore_prefixes,
> File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/parquet.py", line 1434, in __init__
>     ignore_prefixes=ignore_prefixes)
> File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/dataset.py", line 667, in dataset
>     return _filesystem_dataset(source, **kwargs)
> File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/dataset.py", line 424, in _filesystem_dataset
>     fs, paths_or_selector = _ensure_single_source(source, filesystem)
> File "/opt/conda/envs/project/lib/python3.7/site-packages/pyarrow/dataset.py", line 391, in _ensure_single_source
>     file_info = filesystem.get_file_info([path])[0]
> File "pyarrow/_fs.pyx", line 429, in pyarrow._fs.FileSystem.get_file_info
> File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
> File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> ```
> ```
> $ poetry show pyarrow
> name         : pyarrow
> version      : 1.0.1
> description  : Python library for Apache Arrow
> dependencies
>  - numpy >=1.14
> $ poetry show geopandas
> name         : geopandas
> version      : 0.8.1
> description  : Geographic pandas extensions
> dependencies
>  - fiona *
>  - pandas >=0.23.0
>  - pyproj >=2.2.0
>  - shapely *
> $ poetry show s3fs
> name         : s3fs
> version      : 0.4.2
> description  : Convenient Filesystem interface over S3
> dependencies
>  - botocore >=1.12.91
>  - fsspec >=0.6.0
> $ poetry show fsspec
> name         : fsspec
> version      : 0.8.4
> description  : File-system specification
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)