Posted to jira@arrow.apache.org by "Nicolas Elie (Jira)" <ji...@apache.org> on 2020/06/24 09:57:00 UTC

[jira] [Commented] (ARROW-7939) [Python] crashes when reading parquet file compressed with snappy

    [ https://issues.apache.org/jira/browse/ARROW-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143714#comment-17143714 ] 

Nicolas Elie commented on ARROW-7939:
-------------------------------------

Hello,

I just faced the same issue with pyarrow 0.17.1 installed from conda-forge on 64-bit Windows 7 with Python 3.8.3.

The example code given in the description crashes with arrays_nok and compression='snappy', but not with the other compression algorithms and not with arrays_ok. Everything works fine with pyarrow 0.15.1.
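
For reference, this is roughly what I ran to check the other codecs (a minimal sketch rather than the exact script; codec names as accepted by pq.write_table):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Same failing array as in the description
table = pa.Table.from_arrays([[0, 1, 2]], names=['a'])

for codec in ['none', 'gzip', 'brotli', 'snappy']:
    path = 'foo_{}.parquet'.format(codec)
    pq.write_table(table, path, compression=codec)
    pq.read_table(path)  # the interpreter crashes here for 'snappy' only
    print(codec, 'ok')
{code}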

I also tried in a clean environment with pyarrow 0.16, as suggested in the first comment, and the problem is the same.

I also tried to open the generated snappy-compressed parquet files (the ones pyarrow can't read) with [fastparquet|https://github.com/dask/fastparquet] and [ParquetViewer|https://github.com/mukunku/ParquetViewer]: neither of them can read the file either.
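
For the fastparquet attempt, this is essentially what I did (a short sketch assuming the standard fastparquet API):

{code:python}
from fastparquet import ParquetFile

# 'foo.parquet' is the snappy-compressed file written by pyarrow 0.17.1
pf = ParquetFile('foo.parquet')
df = pf.to_pandas()  # reading this file fails as well
{code}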

What additional information could I give you?
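
For a start, here are the version details I can collect (a minimal sketch; I can also provide a full conda list if that helps):

{code:python}
import platform
import sys

import pyarrow as pa

print('python  :', sys.version)
print('platform:', platform.platform())
print('pyarrow :', pa.__version__)
{code}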

> [Python] crashes when reading parquet file compressed with snappy
> -----------------------------------------------------------------
>
>                 Key: ARROW-7939
>                 URL: https://issues.apache.org/jira/browse/ARROW-7939
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: Windows 7
> python 3.6.9
> pyarrow 0.16 from conda-forge
>            Reporter: Marc Bernot
>            Assignee: Wes McKinney
>            Priority: Major
>             Fix For: 0.17.0
>
>
> When I installed pyarrow 0.16, some parquet files created with pyarrow 0.15.1 would make Python crash, so I drilled down to the simplest example I could find.
> It turns out that some parquet files created with pyarrow 0.16 cannot be read back either. The example below works fine with arrays_ok, but Python crashes with arrays_nok (apparently as soon as the array contains at least three distinct values).
> It also works fine with 'none', 'gzip' and 'brotli' compression; the problem seems to occur only with snappy.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> # Either of these arrays reads back fine:
> arrays_ok = [[0, 1]]
> arrays_ok = [[0, 1, 1]]
> # This one makes pq.read_table crash the interpreter with snappy:
> arrays_nok = [[0, 1, 2]]
>
> table = pa.Table.from_arrays(arrays_nok, names=['a'])
> pq.write_table(table, 'foo.parquet', compression='snappy')
> pq.read_table('foo.parquet')
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)