Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2020/10/07 11:29:00 UTC
[jira] [Commented] (ARROW-10211) [Python] Storing negative and positive zeros in dictionary array
[ https://issues.apache.org/jira/browse/ARROW-10211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209484#comment-17209484 ]
Antoine Pitrou commented on ARROW-10211:
----------------------------------------
What is {{validate_categories}}? Is it in PyArrow?
> [Python] Storing negative and positive zeros in dictionary array
> ----------------------------------------------------------------
>
> Key: ARROW-10211
> URL: https://issues.apache.org/jira/browse/ARROW-10211
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Krisztian Szucs
> Priority: Major
>
> Hypothesis has discovered a corner case when converting a dictionary array with float values to a pandas series:
> {code:python}
> import pyarrow as pa
> arr = pa.array([0., -0.], type=pa.dictionary(pa.int8(), pa.float32()))
> arr.to_pandas()
> {code}
> raises:
> {code:python}
> categories = Float64Index([0.0, -0.0], dtype='float64'), fastpath = False
>     @staticmethod
>     def validate_categories(categories, fastpath: bool = False):
>         """
>         Validates that we have good categories
>         Parameters
>         ----------
>         categories : array-like
>         fastpath : bool
>             Whether to skip nan and uniqueness checks
>         Returns
>         -------
>         categories : Index
>         """
>         from pandas.core.indexes.base import Index
>         if not fastpath and not is_list_like(categories):
>             raise TypeError(
>                 f"Parameter 'categories' must be list-like, was {repr(categories)}"
>             )
>         elif not isinstance(categories, ABCIndexClass):
>             categories = Index(categories, tupleize_cols=False)
>         if not fastpath:
>             if categories.hasnans:
>                 raise ValueError("Categorical categories cannot be null")
>             if not categories.is_unique:
> >               raise ValueError("Categorical categories must be unique")
> E               ValueError: Categorical categories must be unique
> {code}
> The arrow array looks like the following:
> {code}
> -- dictionary:
> [
> 0,
> -0
> ]
> -- indices:
> [
> 0,
> 1
> ]
> {code}
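The layout above is what a dictionary encoder produces when its lookup table is keyed on the raw bit pattern of each value rather than on value equality. The following is a toy, pure-Python sketch of that idea, not Arrow's actual implementation; `dictionary_encode` and `float32_bits` are hypothetical names introduced only for illustration.

```python
import struct


def dictionary_encode(values, key):
    """Toy dictionary encoder (hypothetical helper, not a PyArrow API).

    `key` maps each value to the lookup key used to detect duplicates.
    Returns a (dictionary, indices) pair.
    """
    memo = {}
    dictionary = []
    indices = []
    for value in values:
        k = key(value)
        if k not in memo:
            # First time this key is seen: append a new dictionary entry.
            memo[k] = len(dictionary)
            dictionary.append(value)
        indices.append(memo[k])
    return dictionary, indices


def float32_bits(value):
    """Raw little-endian float32 bytes of `value` (sign bit included)."""
    return struct.pack("<f", value)


values = [0.0, -0.0]

# Keyed on bit patterns: the two zeros stay distinct, as in the array above.
by_bits = dictionary_encode(values, key=float32_bits)

# Keyed on value equality: Python treats 0.0 and -0.0 as the same dict key.
by_value = dictionary_encode(values, key=lambda v: v)

print(by_bits[1])   # indices when keyed on bits
print(by_value[1])  # indices when keyed on equality
```

Keyed on bits, the sketch yields two dictionary entries with indices 0 and 1; keyed on equality, it yields a single entry that both indices point to, which is the form a uniqueness check based on `==` would accept.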
> So we hash negative and positive zero to different values, and pandas/numpy is then unable to convert the array to a categorical series because the dictionary values are not unique under equality comparison:
> {code}
> In [2]: np.array(-0.) == np.array(0.)
> Out[2]: True
> In [3]: -0.0 == 0.0
> Out[3]: True
> In [4]: np.unique(np.array([0.0, -0.0]))
> Out[4]: array([0.])
> {code}
> Although {{0.0}} and {{-0.0}} have different bit representations, they compare equal under the IEEE 754 floating-point standard.
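The distinction between the two zeros can be checked with the Python standard library alone. This is a minimal sketch, assuming single-precision (float32) encoding to match the array type in the report; the variable names are illustrative only.

```python
import math
import struct

pos_zero, neg_zero = 0.0, -0.0

# Equality comparison treats the two zeros as the same value.
equal = pos_zero == neg_zero

# Their big-endian float32 encodings differ only in the leading sign bit.
pos_bits = struct.pack(">f", pos_zero).hex()
neg_bits = struct.pack(">f", neg_zero).hex()

# copysign exposes the sign that == ignores.
neg_sign = math.copysign(1.0, neg_zero)

print(equal, pos_bits, neg_bits, neg_sign)
```

Any deduplication that works on the encoded bytes will therefore see two distinct values, while anything based on `==` will see one.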
--
This message was sent by Atlassian Jira
(v8.3.4#803005)