Posted to issues@arrow.apache.org by "William Tardio (Jira)" <ji...@apache.org> on 2019/11/22 21:05:00 UTC

[jira] [Commented] (ARROW-7244) [Python] Inconsistent behavior with reading in S3 parquet objects

    [ https://issues.apache.org/jira/browse/ARROW-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980504#comment-16980504 ] 

William Tardio commented on ARROW-7244:
---------------------------------------

Thanks [~wesm] for taking a look. Does the stack trace give any hints? I tried to follow the code through _make_manifest() but couldn't see anything obvious.
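
To narrow this down, something like the sketch below could be run in the same Lambda. It assumes the "Passed non-file path" error is raised when the filesystem reports the path as neither a directory nor a file (inferred from the error text and traceback, not confirmed), and that a stale listing cache in s3fs might be involved (also an assumption). classify_path and probe_path are hypothetical helper names, not part of pyarrow or s3fs.

```python
import time


def classify_path(fs, path):
    """Report how a filesystem object sees `path`.

    `fs` is any object exposing isdir() and isfile(), which
    s3fs.S3FileSystem does.
    """
    if fs.isdir(path):
        return "directory"
    if fs.isfile(path):
        return "file"
    return "neither"


def probe_path(fs, path, attempts=3, delay=0.0):
    """Classify `path` several times, dropping cached listings between tries.

    Assumption: if the answer changes after the cache is dropped, the
    problem is in the listing, not in the Parquet data itself.
    """
    results = []
    for attempt in range(attempts):
        invalidate = getattr(fs, "invalidate_cache", None)
        if attempt > 0 and callable(invalidate):
            invalidate()  # discard any cached directory listings
        results.append(classify_path(fs, path))
        if attempt < attempts - 1:
            time.sleep(delay)
    return results
```

Calling probe_path(s3, full_URL_stage) right before ParquetDataset(full_URL_stage, filesystem=s3) and logging the result would show whether the path is ever seen as "neither". If it flips to "file" after the cache is dropped, that would point at the filesystem listing and not at the file contents.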

> [Python] Inconsistent behavior with reading in S3 parquet objects
> -----------------------------------------------------------------
>
>                 Key: ARROW-7244
>                 URL: https://issues.apache.org/jira/browse/ARROW-7244
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: running in a lambda, compiled on an EC2 using linux
>            Reporter: William Tardio
>            Priority: Major
>
> We are piloting pyarrow for reading Parquet files from AWS S3.
>  
> We got it working in combination with s3fs as the filesystem. However, we are seeing very inconsistent results when reading Parquet objects with:
> s3=s3fs.S3FileSystem()
> ParquetDataset(url, filesystem=s3)
>  
> The read intermittently throws this error:
>  
> [ERROR] OSError: Passed non-file path: s3://bucket/schedule/sxaup/fms_db_aub/adn_master/trunc/20191122024436.parquet
> Traceback (most recent call last):
>   File "/var/task/file_check.py", line 35, in lambda_handler
>     main(event, context)
>   File "/var/task/file_check.py", line 260, in main
>     validate_resp['object_type'])
>   File "/opt/python/utils.py", line 80, in schema_check
>     stage_pya_dataset = ParquetDataset(full_URL_stage, filesystem=s3)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1030, in __init__
>     open_file_func=partial(_open_dataset_file, self._metadata)
>   File "/opt/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1229, in _make_manifest
>     .format(path))
>  
> As you can see, the path is valid. The read sometimes works and other times does not, with no modification of the file between the successful and failing runs. Does ParquetDataset actually open the file and validate it, so that the error relates to the data?
>  
> Willing to do any troubleshooting to get this solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)