Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/09/17 18:41:00 UTC

[jira] [Updated] (ARROW-3238) [Python] Can't read pyarrow string columns in fastparquet

     [ https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3238:
--------------------------------
    Summary: [Python] Can't read pyarrow string columns in fastparquet  (was: Can't read pyarrow string columns in fastparquet)

> [Python] Can't read pyarrow string columns in fastparquet
> ---------------------------------------------------------
>
>                 Key: ARROW-3238
>                 URL: https://issues.apache.org/jira/browse/ARROW-3238
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Theo Walker
>            Priority: Major
>              Labels: parquet
>
> Writing very long strings from pyarrow produces a file that raises an exception when read back with fastparquet.
> {code:java}
> Traceback (most recent call last):
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
>     read_fastparquet()
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
>     dff = pf.to_pandas(['A'])
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
>     index=index, assign=parts)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
>     scheme=self.file_scheme)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
>     cats, selfmade, assign=assign)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
>     catdef=out.get(name+'-catdef', None))
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
>     skip_nulls, selfmade=selfmade)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
>     raw_bytes = _read_page(f, header, metadata)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
>     page_header.uncompressed_page_size)
> AssertionError: found 175532 raw bytes (expected 200026){code}
> If the file is written with compression, fastparquet reports decompression errors instead:
> {code:java}
> SNAPPY: snappy.UncompressError: Error while decompressing: invalid input
> GZIP: zlib.error: Error -3 while decompressing data: incorrect header check{code}
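>
> For what it's worth, this looks consistent with the reader slicing the page payload at the wrong offsets: handing an arbitrary byte range to either decompressor raises exactly these error classes. A quick standalone sketch (no parquet code involved, just the decompressors):
> {code:java}
> import snappy  # python-snappy
> import zlib
>
> # Bytes that do not form a valid compressed stream raise the same error
> # classes as the failures above.
> garbage = b'\x00' * 64
>
> try:
>     snappy.decompress(garbage)
> except snappy.UncompressError as e:
>     print 'SNAPPY:', e
>
> try:
>     zlib.decompress(garbage)
> except zlib.error as e:
>     print 'GZIP:', e
> {code}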
>
> Minimal code to reproduce:
> {code:java}
> import os
> import pandas as pd
> import pyarrow
> import pyarrow.parquet as arrow_pq
> from fastparquet import ParquetFile
>
> # data to generate
> ROW_LENGTH = 40000  # decreasing below 32750ish eliminates exception
> N_ROWS = 10
>
> # file write params
> ROW_GROUP_SIZE = 5  # Lower numbers eliminate exception, but strange data is read (e.g. Nones)
> FILENAME = 'test.parquet'
>
> def write_arrow():
>     df = pd.DataFrame({'A': ['A' * ROW_LENGTH for _ in range(N_ROWS)]})
>     if os.path.isfile(FILENAME):
>         os.remove(FILENAME)
>     arrow_table = pyarrow.Table.from_pandas(df)
>     arrow_pq.write_table(arrow_table,
>                          FILENAME,
>                          use_dictionary=False,
>                          compression='NONE',
>                          row_group_size=ROW_GROUP_SIZE)
>
> def read_arrow():
>     print "arrow:"
>     table2 = arrow_pq.read_table(FILENAME)
>     print table2.to_pandas().head()
>
> def read_fastparquet():
>     print "fastparquet:"
>     pf = ParquetFile(FILENAME)
>     dff = pf.to_pandas(['A'])
>     print dff.head()
>
> write_arrow()
> read_arrow()
> read_fastparquet(){code}
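>
> To sanity-check the file itself, the declared column-chunk sizes can be dumped with pyarrow's metadata API. This is a rough diagnostic sketch only: it shows chunk totals, not the per-page uncompressed_page_size that the fastparquet assertion actually checks, and it assumes the metadata accessors below are available in pyarrow 0.10.
> {code:java}
> import pyarrow.parquet as arrow_pq
>
> # Declared sizes for column 'A' in each row group of the file written above.
> meta = arrow_pq.ParquetFile('test.parquet').metadata
> for rg in range(meta.num_row_groups):
>     col = meta.row_group(rg).column(0)
>     print rg, col.total_compressed_size, col.total_uncompressed_size
> {code}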
>  
> Versions:
> {code:java}
> fastparquet==0.1.6
> pyarrow==0.10.0
> pandas==0.22.0
> sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'{code}
> Also opened an issue here: https://github.com/dask/fastparquet/issues/375



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)