Posted to issues@arrow.apache.org by "Uwe L. Korn (JIRA)" <ji...@apache.org> on 2018/09/08 16:09:00 UTC
[jira] [Resolved] (ARROW-2799) [Python] Add safe option to Table.from_pandas to avoid unsafe casts
[ https://issues.apache.org/jira/browse/ARROW-2799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe L. Korn resolved ARROW-2799.
--------------------------------
Resolution: Fixed
Issue resolved by pull request 2504
[https://github.com/apache/arrow/pull/2504]
> [Python] Add safe option to Table.from_pandas to avoid unsafe casts
> -------------------------------------------------------------------
>
> Key: ARROW-2799
> URL: https://issues.apache.org/jira/browse/ARROW-2799
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Dave Hirschfeld
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Ported over from [https://github.com/apache/arrow/issues/2217]
> ```python
> In [8]: import numpy as np
> ...: import pandas as pd
> ...: import pyarrow as arw
> In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
> ...: df
> Out[9]:
>    A  B
> 0  a  0
> 1  b  1
> 2  c  2
> In [10]: schema = arw.schema([
> ...: arw.field('A', arw.string()),
> ...: arw.field('B', arw.int32()),
> ...: ])
> In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
> ...: tbl
> Out[11]:
> pyarrow.Table
> A: string
> B: int32
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
> b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
> b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
> b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
> b', "pandas_version": "0.23.1"}'}
> In [12]: tbl.to_pandas().equals(df)
> Out[12]: True
> ```
> ...so if the `schema` matches the pandas datatypes all is well - we can roundtrip the DataFrame.
> Now, say we have some bad data such that column 'B' is now of type float64. The datatypes of the DataFrame no longer match the explicitly supplied `schema` object, but rather than raising a `TypeError`, the data is silently truncated. The roundtripped DataFrame doesn't match our input DataFrame, and not even a warning is raised!
> ```python
> In [13]: df['B'].iloc[0] = 1.23
> ...: df
> Out[13]:
>    A     B
> 0  a  1.23
> 1  b  1.00
> 2  c  2.00
> In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
> ...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
> ...: tbl
> Out[14]:
> pyarrow.Table
> A: string
> B: int32
> metadata
> --------
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
> b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
> b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
> b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
> b'}], "pandas_version": "0.23.1"}'}
> In [15]: tbl.to_pandas() # <-- SILENT TRUNCATION!!!
> Out[15]:
>    A  B
> 0  a  1
> 1  b  1
> 2  c  2
> ```
> To be clear, I would really like `Table.from_pandas` to raise a `TypeError` if the DataFrame types don't match an explicitly supplied schema and would hope this current behaviour would be considered a bug.
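For readers following along, the kind of check the report asks for can be sketched in plain Python. This is only an illustration of a checked float-to-int32 conversion and is not taken from the pull request or from pyarrow's implementation. The helper name `safe_cast_to_int32` and the choice of exception types are assumptions made for this example. The sample values mirror column 'B' from the report.

```python
# Illustrative sketch only: not pyarrow's implementation.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def safe_cast_to_int32(values):
    """Hypothetical helper: convert floats to ints, refusing lossy conversions."""
    result = []
    for position, value in enumerate(values):
        as_int = int(value)  # truncates toward zero, like an unchecked cast
        if as_int != value:
            # A fractional part would be lost, so refuse instead of truncating.
            raise TypeError(
                f"value {value!r} at position {position} would be truncated to {as_int}"
            )
        if not INT32_MIN <= as_int <= INT32_MAX:
            raise OverflowError(
                f"value {value!r} at position {position} does not fit in int32"
            )
        result.append(as_int)
    return result


# Column 'B' as whole-number floats, and with the bad value from the report.
clean_column = [0.0, 1.0, 2.0]
bad_column = [1.23, 1.0, 2.0]

print(safe_cast_to_int32(clean_column))

try:
    safe_cast_to_int32(bad_column)
except TypeError as exc:
    print(f"rejected: {exc}")
```

The design point is that a lossy value is rejected at conversion time, so a mismatch between the data and the declared schema surfaces as an error rather than as a roundtrip that quietly differs from the input.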
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)