Posted to jira@arrow.apache.org by "Andrus (Jira)" <ji...@apache.org> on 2020/09/28 18:28:00 UTC

[jira] [Comment Edited] (ARROW-8385) [Python][Parquet] Crash on parquet.read_table on windows python 3.82

    [ https://issues.apache.org/jira/browse/ARROW-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203447#comment-17203447 ] 

Andrus edited comment on ARROW-8385 at 9/28/20, 6:27 PM:
---------------------------------------------------------

Same issue here. Calling read_table produces an exit with no exception. Feather works flawlessly. Python 3.8.5/Win10.

CPU is an old (2014) dual core Pentium G3258.

No problem when using this df:
 df = pd.DataFrame([11, 22], columns=['col'])
 Failure with:
 df = pd.DataFrame([11, 22, 33], columns=['col'])


 SO reference with example: [https://stackoverflow.com/questions/64106111/can-only-read-empty-pandas-dataframes-with-parquet/]
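
Because the failure is a silent exit rather than an exception, one way to observe it is to run the read in a child interpreter and inspect how that child ended. The following is a minimal sketch, not a confirmed diagnosis path: the helper name run_isolated, the file name repro.parquet, and the choice to write and read through pandas are all assumptions made for illustration. Whether faulthandler can actually print a traceback for this particular crash is not known.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=60):
    """Run `snippet` in a fresh interpreter and report how it ended.

    Hypothetical helper: a hard crash in the child cannot take down the
    parent, so the exit code and any stderr output survive for inspection.
    """
    proc = subprocess.run(
        # -X faulthandler asks CPython to dump a Python traceback to stderr
        # on a fatal fault, where the platform allows it.
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


# Assumed reproduction, mirroring the three-row DataFrame reported as failing.
SNIPPET = """
import pandas as pd
df = pd.DataFrame([11, 22, 33], columns=['col'])
df.to_parquet('repro.parquet')
print(pd.read_parquet('repro.parquet'))
"""
```

Calling run_isolated(SNIPPET) returns the child's exit code together with its captured stdout and stderr. A non-zero exit code with empty stdout would match the "exit with no exception" behaviour described above.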


was (Author: misantroop):
Same issue here. Calling read_table produces an exit with no exception. Feather works flawlessly. Python 3.8.5/Win10.

CPU is an old (2014) dual core Pentium G3258.

No problem when using this df:
df = pd.DataFrame([11, 22], columns=['col'])
Failure with:
df = pd.DataFrame([11, 22, 33], columns=['col'])
SO reference with example: https://stackoverflow.com/questions/64106111/can-only-read-empty-pandas-dataframes-with-parquet/

> [Python][Parquet] Crash on parquet.read_table on windows python 3.82
> --------------------------------------------------------------------
>
>                 Key: ARROW-8385
>                 URL: https://issues.apache.org/jira/browse/ARROW-8385
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: Windows 10
> python 3.8.2 pip 20.0.2
> pip freeze ->
> numpy==1.18.2
> pandas==1.0.3
> pyarrow==0.16.0
> python-dateutil==2.8.1
> pytz==2019.3
> six==1.14.0
>            Reporter: Geoff Quested-Jones
>            Priority: Major
>         Attachments: crash.parquet
>
>
> Reading a Parquet file with pyarrow makes the program exit spontaneously, with no exception thrown. This happens on Windows only: the same setup on Linux (Debian 10 in Docker) reads the same Parquet file without issue.
> The following reproduces the crash in a Python 3.8.2 environment. The full environment is listed in the Environment field, but it is essentially pip install pandas and pyarrow.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> def test_pandas_write_read():
>     df_out = pd.DataFrame.from_dict([{"A":i} for i in range(3)])
>     df_out.to_parquet("crash.parquet")
>     df_in = pd.read_parquet("crash.parquet")
>     print(df_in)
> def test_arrow_write_read():
>     df = pd.DataFrame.from_dict([{"A":i} for i in range(3)])
>     table_out = pa.Table.from_pandas(df)
>     pq.write_table(table_out, 'crash.parquet')
>     table_in = pq.read_table('crash.parquet')
>     print(table_in)
> if __name__ == "__main__":
>     test_pandas_write_read()
>     test_arrow_write_read()
> {code}
>  The interpreter never reaches the print statements. It crashes somewhere in the call on line 252 of {{parquet.py}}; no error is thrown, the program just exits spontaneously.
> {code:python}
>     self.reader.read_all(...
> {code}
> In contrast, running the same code in the same Python environment on Debian 10 produces no error when reading the Parquet files generated by the same code on Windows. The sha2sum values of the crash.parquet files generated on Debian and on Windows compare equal, so the problem appears to be in the read. Attached is the crash.parquet file generated on my machine.
> Oddly, changing the {{range(3)}} to {{range(2)}} gets rid of the crash on Windows.
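
The checksum comparison mentioned in the description can be repeated from Python on both machines. This is a small sketch that assumes "sha2sum" refers to SHA-256, which the report does not state; the function name file_sha256 is hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    Assumption: SHA-256 stands in for the unspecified "sha2sum" tool.
    The file is read in chunks so large files are not loaded at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If file_sha256("crash.parquet") gives the same digest on Windows and on Debian, the written bytes are identical, which points at the read path rather than the write path.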


