Posted to issues@arrow.apache.org by "Siddharth (JIRA)" <ji...@apache.org> on 2019/07/29 09:33:00 UTC

[jira] [Updated] (ARROW-6058) pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller than expected

     [ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth updated ARROW-6058:
-----------------------------
    Description: 
I am reading Parquet data from S3 and get an ArrowIOError.

Size of the data: 32 part files of about 90 MB each (approx. 3 GB in total)

Number of records: approx. 100M

Code Snippet:
{code:python}
from s3fs import S3FileSystem
import pyarrow.parquet as pq

s3 = S3FileSystem()

dataset = pq.ParquetDataset("s3://location", filesystem=s3)

df = dataset.read_pandas().to_pandas()
{code}
Stack Trace:

```
    df = dataset.read_pandas().to_pandas()
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
    return self.read(use_pandas_metadata=True, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
    table = reader.read(**options)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
 ```
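
One way to narrow this down might be to read each part file on its own and record which ones raise, instead of reading the whole dataset in one call. The following is only a sketch under assumptions: the helper names `find_unreadable_files` and `read_parquet_from_s3` are hypothetical, and the `*.parquet` glob pattern in the usage comment is a guess at how the part files are named. It has not been run against the data in this report.

```python
from typing import Callable, Dict, Iterable


def find_unreadable_files(
    paths: Iterable[str], read_one: Callable[[str], object]
) -> Dict[str, str]:
    """Try to read each file separately and collect the ones that fail.

    Returns a mapping of path -> "ExceptionType: message" for failed reads.
    """
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            read_one(path)
        except Exception as exc:  # broad on purpose: we only report, never recover
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures


def read_parquet_from_s3(path: str) -> object:
    """Read a single Parquet file through s3fs (hypothetical helper).

    Imports are local so that find_unreadable_files stays dependency-free.
    """
    import pyarrow.parquet as pq
    from s3fs import S3FileSystem

    s3 = S3FileSystem()
    with s3.open(path, "rb") as f:
        return pq.read_table(f)


# Possible usage (assumes the part files match "*.parquet"):
#   from s3fs import S3FileSystem
#   paths = S3FileSystem().glob("s3://location/*.parquet")
#   print(find_unreadable_files(paths, read_parquet_from_s3))
```

If only some files show up in the result, the problem is probably tied to those files or to how they are fetched; if every file reads fine on its own, the failure may depend on reading the dataset as a whole.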

  was:
I am reading parquet data from S3 and get  ArrowIOError error.

Size of the data: 32 part files 90 MB each (3GB approx)

Number of records: Approx 100M

Code Snippet:

```

from s3fs import S3FileSystem
import pyarrow.parquet as pq

s3 = S3FileSystem()

dataset = pq.ParquetDataset("s3://location", filesystem=s3)

df = dataset.read_pandas().to_pandas()

```

Stack Trace:

```
df = dataset.read_pandas().to_pandas()
File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
return self.read(use_pandas_metadata=True, **kwargs)
File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
use_pandas_metadata=use_pandas_metadata)
File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
table = reader.read(**options)
File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
use_threads=use_threads)
File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
```


> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller than expected
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-6058
>                 URL: https://issues.apache.org/jira/browse/ARROW-6058
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.1
>            Reporter: Siddharth
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)