Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/04/08 23:37:00 UTC

[jira] [Commented] (ARROW-8378) [Python] "empty" dtype metadata leads to wrong Parquet column type

    [ https://issues.apache.org/jira/browse/ARROW-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078813#comment-17078813 ] 

Wes McKinney commented on ARROW-8378:
-------------------------------------

I'm showing {{"pandas_type": "empty"}} for both {{df_1}} and {{df_2}}. How are we supposed to know that the column contains unicode values? pandas converts the unicode dtype to object dtype, so that information is gone by the time the data reaches Arrow.
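
For reference, the inferred metadata can be inspected directly from Python without writing a file. A quick sketch (it assumes {{Schema.pandas_metadata}} is populated by {{Table.from_pandas}}, as it is in recent releases):

{code}
import json

import pyarrow as pa

# Inspect the pandas metadata that pyarrow attaches when converting the frame;
# for an all-None object column the column entry shows "pandas_type": "empty".
meta = pa.Table.from_pandas(df_1).schema.pandas_metadata
print(json.dumps(meta["columns"], indent=2))
{code}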

In situations like this, the safest approach is to specify the exact Arrow schema you want when writing the Parquet file.

Note:

{code}
In [15]: pa.table(df_1, schema=pa.schema([pa.field("col", "string")]))          
Out[15]: 
pyarrow.Table
col: string

In [16]: pa.table(df_1)                                                         
Out[16]: 
pyarrow.Table
col: null
{code}

Without any actual values in {{df_1['col']}}, Arrow infers the null type for "col".
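
To carry that through to a file, something like the following should produce the expected string column (a sketch; the output filename is just an example):

{code}
import pyarrow as pa
import pyarrow.parquet as pq

# Build the table against an explicit schema instead of relying on type
# inference from the all-None column, then write it to Parquet.
schema = pa.schema([pa.field("col", "string")])
table = pa.table(df_1, schema=schema)
pq.write_table(table, "col_as_string.parq")  # example output path
{code}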

> [Python] "empty" dtype metadata leads to wrong Parquet column type
> ------------------------------------------------------------------
>
>                 Key: ARROW-8378
>                 URL: https://issues.apache.org/jira/browse/ARROW-8378
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: Python: 3.7.6
> Pandas: 0.24.1, 0.25.3, 1.0.3
> Pyarrow: 0.16.0
> OS: OSX 10.15.3
>            Reporter: Diego Argueta
>            Priority: Major
>             Fix For: 0.17.0
>
>
> Run the following code with Pandas 0.24.x-1.0.x, and PyArrow 0.16.0 on Python 3.7:
> {code:python}
> import pandas as pd
> import numpy as np
> df_1 = pd.DataFrame({'col': [None, None, None]})
> df_1.col = df_1.col.astype(np.unicode_)
> df_1.to_parquet('right.parq', engine='pyarrow')
> series = pd.Series([None, None, None], dtype=np.unicode_)
> df_2 = pd.DataFrame({'col': series})
> df_2.to_parquet('wrong.parq', engine='pyarrow')
> {code}
> Examine the Parquet column type for each file (I use [parquet-tools|https://github.com/wesleypeck/parquet-tools]). {{right.parq}} has the expected UTF-8 string type; {{wrong.parq}} has an {{INT32}} column instead.
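> If you don't have parquet-tools handy, roughly the same information can be read back with pyarrow itself (a sketch of an equivalent check, not the tool I actually used):
> {code:python}
> import pyarrow.parquet as pq
> # The Arrow-level type of each column shows up in the printed schema.
> print(pq.read_schema('right.parq'))
> print(pq.read_schema('wrong.parq'))
> {code}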
> The following metadata is stored in the Parquet files:
> {{right.parq}}
> {code:json}
> {
>   "column_indexes": [],
>   "columns": [
>     {
>       "field_name": "col",
>       "metadata": null,
>       "name": "col",
>       "numpy_type": "object",
>       "pandas_type": "unicode"
>     }
>   ],
>   "index_columns": [],
>   "pandas_version": "0.24.1"
> }
> {code}
> {{wrong.parq}}
> {code:json}
> {
>   "column_indexes": [],
>   "columns": [
>     {
>       "field_name": "col",
>       "metadata": null,
>       "name": "col",
>       "numpy_type": "object",
>       "pandas_type": "empty"
>     }
>   ],
>   "index_columns": [],
>   "pandas_version": "0.24.1"
> }
> {code}
> The difference between the two is that the {{pandas_type}} for the incorrect file is "empty" rather than the expected "unicode". PyArrow misinterprets this and defaults to a 32-bit integer column.
> The incorrect datatype will cause Redshift to reject the file when we try to read it because the column type in the file doesn't match the column type in the database table.
> I originally filed this as a bug in Pandas (see [this ticket|https://github.com/pandas-dev/pandas/issues/25326]) but they punted me over here because the dtype conversion is handled in PyArrow. I'm not sure how you'd handle this here.


