Posted to dev@arrow.apache.org by Bryan Cutler <cu...@gmail.com> on 2020/08/05 23:11:31 UTC

change in pyarrow scalar equality?

Hi all,

I came across a behavior change since 0.17.1 when comparing array scalar
values with Python objects. The comparison worked in 0.17.1 and earlier, but
in 1.0.0 "==" always returns False. I saw there was a previous discussion on
Python equality semantics, but I'm not sure whether its conclusion is the
behavior I'm seeing. For example:

In [4]: a = pa.array([1,2,3])

In [5]: a[0] == 1
Out[5]: False

In [6]: a[0].as_py() == 1
Out[6]: True

I know the scalars can be converted with `as_py()`, but it does seem a
little strange to return False when compared with a python object. Is this
the expected behavior for 1.0.0+?

Thanks,
Bryan

Re: change in pyarrow scalar equality?

Posted by Bryan Cutler <cu...@gmail.com>.

Thanks for the detailed response and background on this, Joris! In my case
there was no real need to compare pyarrow scalars directly, so raising an
error would have been better, but there are probably other cases where
that wouldn't be preferred. Anyway, I think it would be a good idea to
document this since I'm sure others will hit it. I opened
https://issues.apache.org/jira/browse/ARROW-9750 to add some docs.

Re: change in pyarrow scalar equality?

Posted by Joris Van den Bossche <jo...@gmail.com>.

Hi Bryan,

This indeed changed in 1.0. The full scalar implementation in pyarrow was
refactored (there were two types of scalars before, see
https://issues.apache.org/jira/browse/ARROW-9017 /
https://github.com/apache/arrow/pull/7519).

That PR prompted a discussion about what "==" should mean. It was
originally triggered by comparison with Null returning Null, but then
expanded to comparison in general; see the mailing list thread "Extremely
dubious Python equality semantics":
https://lists.apache.org/thread.html/rdd11d3635c751a3a626e14106f1a95f3cddba4dd3bf44247edefde49%40%3Cdev.arrow.apache.org%3E

The question for "==" is whether it should be a strict "data structure /
object" equality (like the '.equals(..)' method) or an "analytical/semantic"
equality (like the element-wise 'equal' compute function).

In the end, we opted for object equality and made it strict: a scalar only
compares equal to an actual pyarrow scalar, with no automatic conversion of
Python scalars to pyarrow scalars. Note that even pyarrow scalars of
different types do not compare equal at the moment:

>>> import pyarrow as pa
>>> a = pa.array([1,2,3], type="int64")
>>> b = pa.array([1,2,3], type="int32")
>>> a[0] == b[0]
False
>>> a[0] == 1
False
>>> a[0].equals(1)
...
TypeError: Argument 'other' has incorrect type (expected
pyarrow.lib.Scalar, got int)

Using the pyarrow.compute module, you _should_ get the analytical equality
as you expected in this case. However, it seems that the "equal" kernel is
not yet implemented for differing types (I suppose an automatic casting
step is still missing):

>>> import pyarrow.compute as pc
>>> pc.equal(a[0], b[0])
...
ArrowNotImplementedError: Function equal has no kernel matching input types
(scalar[int64], scalar[int32])
>>> pc.equal(a[0], 1)
...
TypeError: Got unexpected argument type <class 'int'> for compute function

For this last one, we should probably attempt to convert the Python
scalar to a pyarrow scalar, and maybe for the "a[0] == 1" case as well.
However, which type should it be coerced to if there are multiple
possibilities (e.g. int64 vs int32)?

I agree the new behaviour might be confusing (if you expect semantic
equality), but on the other hand it is clear and avoids dubious cases. I
don't think this is set in stone yet, so more feedback is certainly
welcome.

Joris
