Posted to issues@arrow.apache.org by "Krisztian Szucs (Jira)" <ji...@apache.org> on 2020/10/07 11:23:00 UTC

[jira] [Created] (ARROW-10211) [Python] Storing negative and positive zeros in dictionary array

Krisztian Szucs created ARROW-10211:
---------------------------------------

             Summary: [Python] Storing negative and positive zeros in dictionary array
                 Key: ARROW-10211
                 URL: https://issues.apache.org/jira/browse/ARROW-10211
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Krisztian Szucs


Hypothesis has discovered a corner case when converting a dictionary array with float values to a pandas series:

{code:python}
import pyarrow as pa
# Dictionary-encoded float32 array holding both positive and negative zero
arr = pa.array([0., -0.], type=pa.dictionary(pa.int8(), pa.float32()))
arr.to_pandas()
{code}

This raises a {{ValueError}} from pandas:

{code:python}
categories = Float64Index([0.0, -0.0], dtype='float64'), fastpath = False

    @staticmethod
    def validate_categories(categories, fastpath: bool = False):
        """
        Validates that we have good categories

        Parameters
        ----------
        categories : array-like
        fastpath : bool
            Whether to skip nan and uniqueness checks

        Returns
        -------
        categories : Index
        """
        from pandas.core.indexes.base import Index

        if not fastpath and not is_list_like(categories):
            raise TypeError(
                f"Parameter 'categories' must be list-like, was {repr(categories)}"
            )
        elif not isinstance(categories, ABCIndexClass):
            categories = Index(categories, tupleize_cols=False)

        if not fastpath:

            if categories.hasnans:
                raise ValueError("Categorical categories cannot be null")

            if not categories.is_unique:
>               raise ValueError("Categorical categories must be unique")
E               ValueError: Categorical categories must be unique
{code}

The Arrow array looks like the following:

{code}
-- dictionary:
  [
    0,
    -0
  ]
-- indices:
  [
    0,
    1
  ]
{code}

So we hash negative and positive zero to different values, and both end up in the dictionary. pandas/NumPy is then unable to convert the array to a categorical series, because the two categories compare equal and are therefore not unique:

{code}
In [1]: import numpy as np

In [2]: np.array(-0.) == np.array(0.)
Out[2]: True

In [3]: -0.0 == 0.0
Out[3]: True

In [4]: np.unique(np.array([0.0, -0.0]))
Out[4]: array([0.])
{code}
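
The mismatch can be reproduced without Arrow or pandas. The following is only a toy model of dictionary encoding, not Arrow's actual implementation: {{dictionary_encode}} and {{float_bits}} are hypothetical helpers, and keying on the raw float32 bytes is an assumption standing in for "hashing the zeros to different values". Deduplicating by bit pattern keeps both zeros, while deduplicating by value equality keeps one.

```python
import struct


def float_bits(x: float) -> bytes:
    """Return the big-endian IEEE 754 binary32 bytes of x."""
    return struct.pack(">f", x)


def dictionary_encode(values, key):
    """Toy dictionary encoder: values with the same key(value) share an index.

    Hypothetical helper for illustration only.
    """
    dictionary, indices, seen = [], [], {}
    for v in values:
        k = key(v)
        if k not in seen:
            seen[k] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[k])
    return dictionary, indices


values = [0.0, -0.0]

# Keyed on the bit pattern: the sign bit differs, so two entries are created.
by_bits = dictionary_encode(values, key=float_bits)

# Keyed on the value itself: -0.0 == 0.0 and both hash alike, so one entry.
by_value = dictionary_encode(values, key=lambda v: v)

print(by_bits[1])   # indices when deduplicating by bit pattern
print(by_value[1])  # indices when deduplicating by value
```

The first variant matches the dictionary shown above (two entries, indices 0 and 1); the second matches what {{np.unique}} reports.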

Although {{0.0}} and {{-0.0}} have different bit representations (they differ only in the sign bit), they are considered equal according to the IEEE 754 standard.
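
For reference, a minimal sketch of how the two zeros can be told apart and folded into one in plain Python. {{is_negative_zero}} and {{canonicalize_zero}} are hypothetical names, and canonicalizing before dictionary encoding is only one possible approach, not necessarily what Arrow should or does do.

```python
import math


def is_negative_zero(x: float) -> bool:
    """True only for -0.0: equal to zero but carrying a negative sign bit."""
    return x == 0.0 and math.copysign(1.0, x) < 0.0


def canonicalize_zero(x: float) -> float:
    """Map -0.0 to +0.0 and leave every other value unchanged.

    Relies on IEEE 754 addition under the default rounding mode,
    where (-0.0) + (+0.0) yields +0.0.
    """
    return x + 0.0


print(is_negative_zero(-0.0), is_negative_zero(0.0))
print(is_negative_zero(canonicalize_zero(-0.0)))
```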



--
This message was sent by Atlassian Jira
(v8.3.4#803005)