Posted to issues@arrow.apache.org by "Siddharth (JIRA)" <ji...@apache.org> on 2019/07/29 15:24:00 UTC
[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895353#comment-16895353 ]
Siddharth edited comment on ARROW-6058 at 7/29/19 3:23 PM:
-----------------------------------------------------------
hey [~wesmckinn], thanks for your prompt reply. I checked each part file and found the one that gives the error. However, the entire partition can be read by Redshift and Athena without any errors; I can run a count(*) on the partition and it works, so I am not sure the part file is actually corrupted. I can dig in to see if there is an issue with it. Any idea how we can fix this if the part file was produced as expected?
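For anyone who needs to reproduce this isolation step, a small helper like the one below can loop over the part files and record which ones fail to read. This is a minimal sketch, not the exact code used here; the S3 listing in the commented usage and the "s3://location" prefix are placeholders.

{code:java}
def find_unreadable_parts(paths, read_fn):
    """Apply read_fn to each path; collect (path, error) for the ones that raise."""
    failures = []
    for path in paths:
        try:
            read_fn(path)
        except Exception as exc:  # e.g. pyarrow.lib.ArrowIOError
            failures.append((path, str(exc)))
    return failures


# Hypothetical usage against S3 (the bucket/prefix is a placeholder):
# from s3fs import S3FileSystem
# import pyarrow.parquet as pq
# s3 = S3FileSystem()
# parts = s3.ls("s3://location")
# bad = find_unreadable_parts(
#     parts, lambda p: pq.ParquetDataset(p, filesystem=s3).read()
# )
# for path, err in bad:
#     print(path, err)
{code}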
was (Author: sid88in):
hey [~wesmckinn], thanks for your prompt reply. I checked each part file and found the one that gives the error. However, the entire partition can be read by Redshift and Athena without any errors; I can run a count(*) on the partition and it works, so I am not sure the part file is actually corrupted. I can dig in to see if there is an issue with it. Any idea how we can fix this?
> [Python][Parquet] Failure when reading Parquet file from S3
> ------------------------------------------------------------
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.14.1
> Reporter: Siddharth
> Priority: Major
> Labels: parquet
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files, 90 MB each (approx 3 GB total)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>
> *Note: Same code works on relatively smaller dataset (approx < 50M records)*
>
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)