Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/09/15 01:14:00 UTC

[jira] [Commented] (ARROW-3238) Can't read pyarrow string columns in fastparquet

    [ https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615544#comment-16615544 ] 

Wes McKinney commented on ARROW-3238:
-------------------------------------

[~mdurant] If I had to guess, the dictionary grew too large during encoding and the Parquet writer switched from dictionary to plain encoding partway through the ColumnChunk. Have you implemented support for this case? Both the Java and C++ Parquet libraries have a size threshold at which they switch midstream from dictionary to plain encoding.
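
A quick way to check for that fallback (a minimal sketch, assuming a pyarrow build that exposes column-chunk metadata; the exact accessors may differ across versions): dump the encodings recorded for the column in each row group. A chunk that switched midstream lists both a dictionary encoding and PLAIN.

{code:java}
from __future__ import print_function

import pyarrow.parquet as pq

# Sketch: inspect the encodings recorded in the file footer for column 'A'
# of the repro file below. A midstream dictionary-to-plain fallback shows
# up as both a dictionary encoding (e.g. PLAIN_DICTIONARY) and PLAIN being
# listed on the same column chunk.
md = pq.ParquetFile('test.parquet').metadata
for rg in range(md.num_row_groups):
    col = md.row_group(rg).column(0)  # column 'A'
    print('row group', rg, 'encodings:', col.encodings)
{code}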

> Can't read pyarrow string columns in fastparquet
> ------------------------------------------------
>
>                 Key: ARROW-3238
>                 URL: https://issues.apache.org/jira/browse/ARROW-3238
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Theo Walker
>            Priority: Major
>
> Writing really long strings from pyarrow causes exception in fastparquet read.
> {code:java}
> Traceback (most recent call last):
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
>     read_fastparquet()
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
>     dff = pf.to_pandas(['A'])
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
>     index=index, assign=parts)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
>     scheme=self.file_scheme)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
>     cats, selfmade, assign=assign)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
>     catdef=out.get(name+'-catdef', None))
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
>     skip_nulls, selfmade=selfmade)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
>     raw_bytes = _read_page(f, header, metadata)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
>     page_header.uncompressed_page_size)
> AssertionError: found 175532 raw bytes (expected 200026){code}
> If written with compression, it reports compression errors instead:
> {code:java}
> SNAPPY: snappy.UncompressError: Error while decompressing: invalid input
> GZIP: zlib.error: Error -3 while decompressing data: incorrect header check{code}
>  
>  
> Minimal code to reproduce:
> {code:java}
> import os
> import pandas as pd
> import pyarrow
> import pyarrow.parquet as arrow_pq
> from fastparquet import ParquetFile
>
> # data to generate
> ROW_LENGTH = 40000  # decreasing below 32750ish eliminates exception
> N_ROWS = 10
>
> # file write params
> ROW_GROUP_SIZE = 5  # Lower numbers eliminate exception, but strange data is read (e.g. Nones)
> FILENAME = 'test.parquet'
>
>
> def write_arrow():
>     df = pd.DataFrame({'A': ['A' * ROW_LENGTH for _ in range(N_ROWS)]})
>     if os.path.isfile(FILENAME):
>         os.remove(FILENAME)
>     arrow_table = pyarrow.Table.from_pandas(df)
>     arrow_pq.write_table(arrow_table,
>                          FILENAME,
>                          use_dictionary=False,
>                          compression='NONE',
>                          row_group_size=ROW_GROUP_SIZE)
>
>
> def read_arrow():
>     print "arrow:"
>     table2 = arrow_pq.read_table(FILENAME)
>     print table2.to_pandas().head()
>
>
> def read_fastparquet():
>     print "fastparquet:"
>     pf = ParquetFile(FILENAME)
>     dff = pf.to_pandas(['A'])
>     print dff.head()
>
>
> write_arrow()
> read_arrow()
> read_fastparquet(){code}
>  
> Versions:
> {code:java}
> fastparquet==0.1.6
> pyarrow==0.10.0
> pandas==0.22.0
> sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'{code}
> Also opened an issue here: https://github.com/dask/fastparquet/issues/375



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)