Posted to issues@arrow.apache.org by "Uwe L. Korn (JIRA)" <ji...@apache.org> on 2018/09/08 16:09:00 UTC
[jira] [Resolved] (ARROW-2799) [Python] Add safe option to Table.from_pandas to avoid unsafe casts

     [ https://issues.apache.org/jira/browse/ARROW-2799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn resolved ARROW-2799.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 2504
[https://github.com/apache/arrow/pull/2504]

> [Python] Add safe option to Table.from_pandas to avoid unsafe casts
> -------------------------------------------------------------------
>
>                 Key: ARROW-2799
>                 URL: https://issues.apache.org/jira/browse/ARROW-2799
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Dave Hirschfeld
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Ported over from [https://github.com/apache/arrow/issues/2217]
> ```python
> In [8]: import numpy as np
>    ...: import pandas as pd
>    ...: import pyarrow as arw
> In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
>    ...: df
> Out[9]:
>    A  B
> 0  a  0
> 1  b  1
> 2  c  2
> In [10]: schema = arw.schema([
>     ...:     arw.field('A', arw.string()),
>     ...:     arw.field('B', arw.int32()),
>     ...: ])
> In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
>     ...: tbl
> Out[11]:
> pyarrow.Table
> A: string
> B: int32
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
>             b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
>             b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
>             b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
>             b', "pandas_version": "0.23.1"}'}
> In [12]: tbl.to_pandas().equals(df)
> Out[12]: True
> ```
> ...so if the `schema` matches the pandas datatypes all is well - we can roundtrip the DataFrame.
> Now, say we have some bad data such that column 'B' is of type float64. The datatypes of the DataFrame no longer match the explicitly supplied `schema` object, but rather than raising a `TypeError`, the data is silently truncated: the roundtripped DataFrame doesn't match our input DataFrame, and not even a warning is raised!
> ```python
> In [13]: df['B'].iloc[0] = 1.23
>     ...: df
> Out[13]:
>    A     B
> 0  a  1.23
> 1  b  1.00
> 2  c  2.00
> In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
>     ...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
>     ...: tbl
> Out[14]:
> pyarrow.Table
> A: string
> B: int32
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
>             b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
>             b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
>             b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
>             b'}], "pandas_version": "0.23.1"}'}
> In [15]: tbl.to_pandas()  # <-- SILENT TRUNCATION!!!
> Out[15]:
>    A  B
> 0  a  1
> 1  b  1
> 2  c  2
> ```
> To be clear, I would really like `Table.from_pandas` to raise a `TypeError` if the DataFrame types don't match an explicitly supplied schema and would hope this current behaviour would be considered a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)