Posted to dev@arrow.apache.org by Fan Liya <li...@gmail.com> on 2019/11/14 12:08:02 UTC

[Discuss][Java] Appropriate semantics for comparing values in UnionVector

Dear all,

The problem arises from the discussion in a PR:
https://github.com/apache/arrow/pull/5544#discussion_r338394941.

We are trying to come up with proper semantics for comparing values in
UnionVectors.

According to the current logic in the code base, two values from two
UnionVectors are compared in two steps:

1. Child vectors for the two UnionVectors are compared, to make sure both
vectors have the same types of child vectors.
2. If step 1 passes, we continue to compare values in the corresponding
slots in the two union vectors.
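As a rough sketch of these two steps (this is a toy model, not the actual
Arrow code; ToyUnion, StrictUnionEquality and their fields are hypothetical
names, and values are simplified to long):

```java
import java.util.List;

// Hypothetical toy model of a union vector; not the Arrow Java API.
final class ToyUnion {
    final List<String> childTypes; // types of the child vectors the union carries
    final String[] slotTypes;      // type of the value held in each slot
    final long[] slotValues;       // value held in each slot (simplified to long)

    ToyUnion(List<String> childTypes, String[] slotTypes, long[] slotValues) {
        this.childTypes = childTypes;
        this.slotTypes = slotTypes;
        this.slotValues = slotValues;
    }
}

final class StrictUnionEquality {
    static boolean valueEquals(ToyUnion left, int li, ToyUnion right, int ri) {
        // Step 1: both unions must have the same types of child vectors.
        if (!left.childTypes.equals(right.childTypes)) {
            return false;
        }
        // Step 2: compare the values in the corresponding slots.
        // Comparing the slot type as well is an assumption of this sketch.
        return left.slotTypes[li].equals(right.slotTypes[ri])
            && left.slotValues[li] == right.slotValues[ri];
    }
}
```

Note that step 1 depends only on the two vectors as a whole, not on the
slots being compared.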

This is a legitimate equality semantics (it is reflexive, symmetric, and
transitive). However, we think it is overly strict for determining
equality, because it compares child vectors first, and this may lead to
unexpected results.

Here is an example related to dictionary-encoding a UnionVector. Suppose
our dictionary is a union vector with 3 elements: {Int(0), Long(1),
Byte(2)}. This dictionary vector has 3 child vectors: an IntVector, a
BigIntVector, and a SmallIntVector.

We want to encode another union vector with 2 elements: {Int(0), Byte(2)}.
The encoded vector should be an integer vector {0, 2}.

However, since the vector to encode has only 2 children: an IntVector and a
SmallIntVector, the check for child vectors will always fail, so no value
will be considered equal to any value in the dictionary, and dictionary
encoding will always fail.
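The failure can be reproduced with a small toy model (not the Arrow API;
DictUnion, StrictDictionaryEncoder, and the use of -1 for "no match" are
assumptions made for this sketch):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of a union vector; not the Arrow Java API.
final class DictUnion {
    final List<String> childTypes; // types of the child vectors the union carries
    final String[] slotTypes;      // type of the value held in each slot
    final long[] slotValues;       // value held in each slot (simplified to long)

    DictUnion(List<String> childTypes, String[] slotTypes, long[] slotValues) {
        this.childTypes = childTypes;
        this.slotTypes = slotTypes;
        this.slotValues = slotValues;
    }
}

final class StrictDictionaryEncoder {
    // Strict comparison: the child vector lists must match before slots are compared.
    static boolean strictEquals(DictUnion a, int i, DictUnion b, int j) {
        return a.childTypes.equals(b.childTypes)
            && a.slotTypes[i].equals(b.slotTypes[j])
            && a.slotValues[i] == b.slotValues[j];
    }

    // Returns the dictionary index for each input slot, or -1 if nothing matches.
    static int[] encode(DictUnion input, DictUnion dictionary) {
        int[] indices = new int[input.slotValues.length];
        Arrays.fill(indices, -1);
        for (int i = 0; i < indices.length; i++) {
            for (int d = 0; d < dictionary.slotValues.length; d++) {
                if (strictEquals(input, i, dictionary, d)) {
                    indices[i] = d;
                    break;
                }
            }
        }
        return indices;
    }
}
```

With the dictionary and input described above, the child lists differ
(3 children versus 2), so encode returns -1 for both slots.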

So our proposed change is: we no longer compare child vectors, and only
compare value slots for UnionVectors. That is, we still compare values in 2
steps:

1. Make sure the slots in both vectors are of the same type (e.g. both are
IntVectors).
2. Compare values stored in the slots.
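A minimal sketch of the proposed slot-wise comparison, again on a toy model
rather than the real Arrow classes (SlotUnion, SlotWiseDictionaryEncoder,
and the -1 "no match" convention are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of a union vector; not the Arrow Java API.
final class SlotUnion {
    final List<String> childTypes; // kept only to show that it is ignored below
    final String[] slotTypes;      // type of the value held in each slot
    final long[] slotValues;       // value held in each slot (simplified to long)

    SlotUnion(List<String> childTypes, String[] slotTypes, long[] slotValues) {
        this.childTypes = childTypes;
        this.slotTypes = slotTypes;
        this.slotValues = slotValues;
    }
}

final class SlotWiseDictionaryEncoder {
    static boolean slotEquals(SlotUnion a, int i, SlotUnion b, int j) {
        // Step 1: the two slots must hold the same type.
        if (!a.slotTypes[i].equals(b.slotTypes[j])) {
            return false;
        }
        // Step 2: compare the values stored in the slots.
        return a.slotValues[i] == b.slotValues[j];
    }

    // Returns the dictionary index for each input slot, or -1 if nothing matches.
    static int[] encode(SlotUnion input, SlotUnion dictionary) {
        int[] indices = new int[input.slotValues.length];
        Arrays.fill(indices, -1);
        for (int i = 0; i < indices.length; i++) {
            for (int d = 0; d < dictionary.slotValues.length; d++) {
                if (slotEquals(input, i, dictionary, d)) {
                    indices[i] = d;
                    break;
                }
            }
        }
        return indices;
    }
}
```

Under this comparison, the dictionary and input from the example above
encode to {0, 2}, even though the two unions have different child vectors.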

This is the *problem one* we want to discuss. What do you think?

*Problem two* was raised by Micah Kornfield. Should we consider any of the
following semantics for comparing UnionVectors?

1. Is it OK for unions to compare against any other vector? (For example,
if the value slot of a union vector has type IntVector, is it valid to
compare it with a real IntVector?)
2. Can we compare a dense union vector against a sparse union vector?
3. Is it only OK to compare unions that have exactly the same metadata?

Please give your valuable feedback. Thank you in advance.

Best,
Liya Fan