Posted to issues@arrow.apache.org by "Krisztian Szucs (Jira)" <ji...@apache.org> on 2020/10/07 11:23:00 UTC
[jira] [Created] (ARROW-10211) [Python] Storing negative and positive zeros in dictionary array
Krisztian Szucs created ARROW-10211:
---------------------------------------
Summary: [Python] Storing negative and positive zeros in dictionary array
Key: ARROW-10211
URL: https://issues.apache.org/jira/browse/ARROW-10211
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Krisztian Szucs
Hypothesis has discovered a corner case when converting a dictionary array with float values to a pandas series:
{code:python}
import pyarrow as pa

# Dictionary-encoded float32 array whose dictionary holds both +0.0 and -0.0
arr = pa.array([0., -0.], type=pa.dictionary(pa.int8(), pa.float32()))
arr.to_pandas()
{code}
raises:
{code:python}
categories = Float64Index([0.0, -0.0], dtype='float64'), fastpath = False

    @staticmethod
    def validate_categories(categories, fastpath: bool = False):
        """
        Validates that we have good categories

        Parameters
        ----------
        categories : array-like
        fastpath : bool
            Whether to skip nan and uniqueness checks

        Returns
        -------
        categories : Index
        """
        from pandas.core.indexes.base import Index

        if not fastpath and not is_list_like(categories):
            raise TypeError(
                f"Parameter 'categories' must be list-like, was {repr(categories)}"
            )
        elif not isinstance(categories, ABCIndexClass):
            categories = Index(categories, tupleize_cols=False)

        if not fastpath:
            if categories.hasnans:
                raise ValueError("Categorical categories cannot be null")
            if not categories.is_unique:
>               raise ValueError("Categorical categories must be unique")
E               ValueError: Categorical categories must be unique
{code}
The arrow array looks like the following:
{code}
-- dictionary:
[
0,
-0
]
-- indices:
[
0,
1
]
{code}
So we hash the negative and positive zeros to different values, which means pandas/numpy is unable to convert the array to a categorical series because the categories are not unique:
{code}
In [2]: np.array(-0.) == np.array(0.)
Out[2]: True
In [3]: -0.0 == 0.0
Out[3]: True
In [4]: np.unique(np.array([0.0, -0.0]))
Out[4]: array([0.])
{code}
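The comparisons above hold even though the two zeros have different bit patterns. The following is a minimal sketch using only the Python standard library; the helper {{float32_bits}} is a hypothetical name introduced for illustration, and the idea that a hash over the raw bits would separate the two zeros is an assumption, not a description of Arrow's actual hashing code.

```python
import math
import struct


def float32_bits(x: float) -> int:
    """Return the raw IEEE 754 binary32 bit pattern of x as an integer.

    Hypothetical helper for illustration; not part of pyarrow or pandas.
    """
    return struct.unpack("<I", struct.pack("<f", x))[0]


pos_zero, neg_zero = 0.0, -0.0

# Value comparison treats the two zeros as equal.
values_equal = pos_zero == neg_zero

# The bit patterns differ only in the sign bit.
bits_equal = float32_bits(pos_zero) == float32_bits(neg_zero)

# copysign exposes the sign that == ignores.
signs = (math.copysign(1.0, pos_zero), math.copysign(1.0, neg_zero))

print(values_equal, bits_equal, signs)  # True False (1.0, -1.0)
print(hex(float32_bits(pos_zero)), hex(float32_bits(neg_zero)))  # 0x0 0x80000000
```

Any deduplication keyed on the bit pattern would therefore keep both zeros, while a uniqueness check based on {{==}} sees a duplicate.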
Although {{0.0}} and {{-0.0}} have different bit representations, they compare equal under the IEEE 754 floating-point standard.
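One possible way to make such an array convertible is to merge dictionary entries that compare equal and remap the indices accordingly. The sketch below works on plain Python lists rather than Arrow arrays; {{merge_equal_categories}} is a hypothetical helper, not an existing pyarrow or pandas function, and it assumes there are no null indices. It relies on Python dicts treating {{0.0}} and {{-0.0}} as the same key, since they compare equal and hash identically.

```python
def merge_equal_categories(dictionary, indices):
    """Collapse dictionary entries that compare equal and remap the indices.

    Hypothetical helper for illustration: plain lists in, plain lists out.
    The first occurrence of each value is kept, so the sign of a later
    zero is lost. NaN entries are not merged because NaN != NaN.
    """
    first_position = {}  # value -> position in the deduplicated dictionary
    remap = []           # old dictionary position -> new position
    unique_values = []
    for value in dictionary:
        if value not in first_position:
            first_position[value] = len(unique_values)
            unique_values.append(value)
        remap.append(first_position[value])
    return unique_values, [remap[i] for i in indices]


# The dictionary and indices from the array shown in this issue.
new_dictionary, new_indices = merge_equal_categories([0.0, -0.0], [0, 1])
print(new_dictionary, new_indices)  # [0.0] [0, 0]
```

The trade-off is that the result no longer distinguishes the two zeros, which may or may not be acceptable depending on whether the sign carries meaning for the data.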