Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/10/07 11:51:00 UTC

[jira] [Comment Edited] (ARROW-10211) [Python] Storing negative and positive zeros in dictionary array

    [ https://issues.apache.org/jira/browse/ARROW-10211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209495#comment-17209495 ] 

Joris Van den Bossche edited comment on ARROW-10211 at 10/7/20, 11:50 AM:
--------------------------------------------------------------------------

> If we treat 0 and -0 as equal, then the categorization will lose information

To be clear, we already treat 0 and -0 as equal in other situations:

{code}
In [25]: a1 = pa.array([0., -0.])

In [26]: a2 = pa.array([-0., 0.])

In [27]: a1.equals(a2)
Out[27]: True

In [28]: import pyarrow.compute as pc

In [29]: pc.equal(a1, a2)
Out[29]: 
<pyarrow.lib.BooleanArray object at 0x7f633bad7288>
[
  true,
  true
]
{code}

(Of course those are other operations, so that doesn't mean we need to use the same semantics for encoding/unique.)

I don't think many people use floating point values in dictionaries/categoricals. And personally, I don't care that much about the pandas conversion / python roundtrip in this case. It can perfectly well be one of the exceptions on roundtrip (in the end it's pandas that is more strict than arrow in this case).  
I think it is rather the underlying issue that this test case brought up that is interesting: should our hashing code regard 0 and -0 as equal or not? (That impacts actual pyarrow functionality: dictionary encoding, unique, ..., independent from the arrow<->python conversions.)
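
To make that concrete, a small sketch of the pyarrow behaviour in question (outputs written as comments; the dictionary matches what the issue description below shows, the exact reprs may differ per version):

{code:python}
import pyarrow as pa

arr = pa.array([0., -0.])

# The hash-based kernels currently treat the two zeros as distinct,
# so both survive unique() ...
arr.unique()             # -> [0, -0]

# ... and dictionary_encode() produces a two-entry dictionary
arr.dictionary_encode()  # -> dictionary: [0, -0], indices: [0, 1]
{code}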

Now, I don't have a strong opinion on this last aspect, though. I was mainly pointing out that python/numpy/pandas also treat them as equal in hash/unique contexts. But, for example, I checked with Julia, and it keeps 0 and -0 as distinct values (while still evaluating them as equal in {{==}}, i.e. the same behaviour as Arrow currently has).
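
To illustrate the python/numpy/pandas side (a minimal sketch; the numpy result is the same one quoted in the issue description below, and the pandas line is my reading of its hash table behaviour, consistent with the uniqueness error in the reported traceback):

{code:python}
import numpy as np
import pandas as pd

# Python itself hashes both zeros to the same value, so sets collapse them
hash(0.0) == hash(-0.0)            # True
{0.0, -0.0}                        # {0.0}

# numpy's unique collapses them as well (same output as in the issue below)
np.unique(np.array([0.0, -0.0]))   # array([0.])

# pandas' hash table does the same, which is why the Categorical
# uniqueness check in the reported traceback fails
pd.unique(np.array([0.0, -0.0]))   # array([0.])
{code}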




> [Python] Storing negative and positive zeros in dictionary array
> ----------------------------------------------------------------
>
>                 Key: ARROW-10211
>                 URL: https://issues.apache.org/jira/browse/ARROW-10211
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Krisztian Szucs
>            Priority: Major
>
> Hypothesis has discovered a corner case when converting a dictionary array with float values to a pandas series:
> {code:python}
> arr = pa.array([0., -0.], type=pa.dictionary(pa.int8(), pa.float32()))
> arr.to_pandas()
> {code}
> raises: 
> {code:python}
> categories = Float64Index([0.0, -0.0], dtype='float64'), fastpath = False
>     @staticmethod
>     def validate_categories(categories, fastpath: bool = False):
>         """
>         Validates that we have good categories
>         Parameters
>         ----------
>         categories : array-like
>         fastpath : bool
>             Whether to skip nan and uniqueness checks
>         Returns
>         -------
>         categories : Index
>         """
>         from pandas.core.indexes.base import Index
>         if not fastpath and not is_list_like(categories):
>             raise TypeError(
>                 f"Parameter 'categories' must be list-like, was {repr(categories)}"
>             )
>         elif not isinstance(categories, ABCIndexClass):
>             categories = Index(categories, tupleize_cols=False)
>         if not fastpath:
>             if categories.hasnans:
>                 raise ValueError("Categorical categories cannot be null")
>             if not categories.is_unique:
> >               raise ValueError("Categorical categories must be unique")
> E               ValueError: Categorical categories must be unique
> {code}
> The arrow array looks like the following:
> {code}
> -- dictionary:
>   [
>     0,
>     -0
>   ]
> -- indices:
>   [
>     0,
>     1
>   ]
> {code}
> So we hash the negative and positive zeros to different values, and pandas/numpy is unable to convert it to a categorical series since the values are not unique:
> {code}
> In [2]: np.array(-0.) == np.array(0.)
> Out[2]: True
> In [3]: -0.0 == 0.0
> Out[3]: True
> In [4]: np.unique(np.array([0.0, -0.0]))
> Out[4]: array([0.])
> {code}
> Although {{0.0}} and {{-0.0}} are different values, they are considered equal according to the IEEE 754 standard.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)