Posted to jira@arrow.apache.org by "Michael Peleshenko (Jira)" <ji...@apache.org> on 2021/01/12 17:33:00 UTC

[jira] [Commented] (ARROW-7939) [Python] crashes when reading parquet file compressed with snappy

    [ https://issues.apache.org/jira/browse/ARROW-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263542#comment-17263542 ] 

Michael Peleshenko commented on ARROW-7939:
-------------------------------------------

We seem to be facing the same crash with the sample code here, but with the pyarrow 2.0.0 pip wheel for Windows and Python 3.8.
On an Intel Xeon Silver 4114 CPU, I have no issues.
On an Intel Xeon E5-2620, my Python crashes.

According to https://ark.intel.com/content/www/us/en/ark/products/64594/intel-xeon-processor-e5-2620-15m-cache-2-00-ghz-7-20-gt-s-intel-qpi.html, the Xeon E5-2620 does not support AVX2, while the other one does, so I suspect we are hitting the same issue here.

Assuming this is the same issue, has this snappy fix been included in the pyarrow pip wheel build for Windows?
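In case it helps others triage, here is a minimal sketch for checking whether a CPU advertises AVX2. This is an assumption on my part, not anything from pyarrow: it parses a Linux-style /proc/cpuinfo, so on Windows something like the third-party py-cpuinfo package would be needed instead, and the helper name cpu_has_avx2 is purely illustrative.

```python
# Sketch: detect AVX2 support from a /proc/cpuinfo dump (Linux-style).
# Assumptions: the "flags" line lists instruction-set extensions; the
# helper names here are illustrative and not part of any library.

def cpu_has_avx2(cpuinfo_text: str) -> bool:
    """Return True if an 'avx2' flag appears in the cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "avx2" in line.split()
    return False

def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read /proc/cpuinfo if available; return '' on other platforms."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

if __name__ == "__main__":
    print("AVX2 supported:", cpu_has_avx2(read_cpuinfo()))
```

Running this on both machines would confirm whether the crash correlates with missing AVX2 support.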

> [Python] crashes when reading parquet file compressed with snappy
> -----------------------------------------------------------------
>
>                 Key: ARROW-7939
>                 URL: https://issues.apache.org/jira/browse/ARROW-7939
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: Windows 7
> python 3.6.9
> pyarrow 0.16 from conda-forge
>            Reporter: Marc Bernot
>            Assignee: Uwe Korn
>            Priority: Major
>             Fix For: 1.0.0
>
>
> When I installed pyarrow 0.16, some parquet files created with pyarrow 0.15.1 would make Python crash. I drilled down to the simplest example I could find.
> It turns out that some parquet files created with pyarrow 0.16 cannot be read back either. The example below works fine with arrays_ok, but Python crashes with arrays_nok (apparently as soon as there are at least three distinct values).
> Besides, it works fine with 'none', 'gzip' and 'brotli' compression. The problem seems to happen only with snappy.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> arrays_ok = [[0, 1]]       # reads back fine
> arrays_ok = [[0, 1, 1]]    # also reads back fine (overwrites the line above)
> arrays_nok = [[0, 1, 2]]   # three distinct values: crashes on read
> table = pa.Table.from_arrays(arrays_nok, names=['a'])
> pq.write_table(table, 'foo.parquet', compression='snappy')
> pq.read_table('foo.parquet')  # crash happens here
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)