Posted to dev@arrow.apache.org by Daniel Robinson <da...@gmail.com> on 2016/03/08 01:08:55 UTC

"Empty" slots

Am I correct that under the current draft spec, the values in meaningless
slots (such as slots corresponding to a 1 in an associated null bitmask, or
unused slots in a sparse union) are undefined?

If so, might it be worth considering requiring that they be zeroed out?
This would add some time overhead to array building (for sparse unions,
memory could be zeroed out when it is allocated; for nullable arrays, it
may be more efficient to do so when nulls are added), but would remove most
of the complexity from hash calculations and equality comparisons. (See
ARROW-32 and ARROW-38.)  For example, comparison of nullable primitive
arrays could be done with 2 calls to memcmp.  As a bonus, it would speed up
operations for which null and 0 are equivalent or near-equivalent (such as
sum, or sort on unsigned integers).

Other than the overhead, one potential downside is that it would make it
impossible to just add a null_bitmask over an existing array without
copying all of the memory. Perhaps this would best be done with a
separately defined masking vector, which would not require that all masked
values be set to 0 (and would thus require more complex hashing algorithms).
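
A minimal sketch of the two-memcmp comparison described above, assuming a hypothetical layout in which null value slots are required to be zeroed. The struct and function names are illustrative, not Arrow's actual types:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical nullable primitive array layout: a null bitmask
// (1 bit per slot, 1 = null per this draft's convention) plus a
// values buffer in which null slots are assumed to be zeroed.
struct Int32Array {
  std::vector<uint8_t> null_bitmask;  // packed, LSB-first
  std::vector<int32_t> values;        // null slots zeroed
};

// With null slots guaranteed zero, equality reduces to two memcmp
// calls: one over the bitmask, one over the values buffer.
bool Equals(const Int32Array& a, const Int32Array& b) {
  if (a.values.size() != b.values.size() ||
      a.null_bitmask.size() != b.null_bitmask.size()) {
    return false;
  }
  return std::memcmp(a.null_bitmask.data(), b.null_bitmask.data(),
                     a.null_bitmask.size()) == 0 &&
         std::memcmp(a.values.data(), b.values.data(),
                     a.values.size() * sizeof(int32_t)) == 0;
}
```

Without the zeroing requirement, the comparison would instead have to walk the bitmask and skip null slots individually.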

Re: "Empty" slots

Posted by Wes McKinney <we...@cloudera.com>.
Thank you for catching this inconsistency with the memory
representation -- I created
https://issues.apache.org/jira/browse/ARROW-62 to track the issue so
we can make a decision (and IMHO we should amend the format documents
to match the Java memory representation).


Re: "Empty" slots

Posted by Daniel Robinson <da...@gmail.com>.
Thanks, Wes. For hashing, my thought was that you could just hash the null
bitmask concatenated with the values array (this should be doable in
constant space for ordinary one-pass hash functions). If nulls are zeroed
out, this will guarantee that equal primitive arrays hash to the same
value. But if the empty slots in the value array are undefined and could
have any value, this method would give you false negatives, and you would
need to do something more complex and slower.
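
A minimal sketch of that one-pass approach: hash the null bitmask and then the values buffer as if the two were concatenated, in constant extra space. FNV-1a stands in here for whatever hash function is actually used, and the buffer layout is illustrative; the guarantee only holds if null value slots are zeroed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One-pass FNV-1a over the null bitmask followed by the values
// buffer, as if concatenated. Constant extra space; assumes null
// value slots are zeroed so equal arrays hash identically.
uint64_t HashArray(const std::vector<uint8_t>& null_bitmask,
                   const std::vector<int32_t>& values) {
  uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  auto mix = [&h](const uint8_t* p, size_t n) {
    for (size_t i = 0; i < n; ++i) {
      h ^= p[i];
      h *= 1099511628211ULL;  // FNV-1a 64-bit prime
    }
  };
  mix(null_bitmask.data(), null_bitmask.size());
  mix(reinterpret_cast<const uint8_t*>(values.data()),
      values.size() * sizeof(int32_t));
  return h;
}
```

If undefined bytes could appear in null slots, two logically equal arrays could feed different bytes into the hash, producing the false negatives described above.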

I was also curious about how this is treated in Drill ValueVectors. The
example illustration in the introductory explanation at
https://drill.apache.org/docs/value-vectors/ uses 0 for nulled values,
whatever that's worth.

This is a tangent, but I noticed Drill (and thus for the moment Java Arrow:
https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/templates/NullableValueVectors.java#L369)
uses 0 in the null bitmask to represent null values, while you use the NumPy
MaskedArray convention of having 1 represent null values. Is there an
advantage?


Re: "Empty" slots

Posted by Wes McKinney <we...@cloudera.com>.
hey Dan,

You bring up a good point. It's unclear whether this would make
hashing much less complex. (If you ignore the null bits when computing
the hash function, then it probably would, especially on repeated
fields. We'd need to do some experiments to see if there are cases
where skipping the null bits would yield bad hashes. I'm not sure.)

For the memory comparison question, I'd like to hear from the Drill
team on their experience with ValueVectors and how they handle the
"null" space in arrays. In the case of list / repeated values, you
*do* need to initialize the null offset slot to the previous value so
that offset computations always work. For example, the list<int32>
array

[[0, 1, 2], null, null, [4, 5, 6]]

would need to have

nulls [LSB right-to-left ordering]: 0 0 0 0 0 1 1 0
offsets: 0 3 3 3 6
values 0 1 2 4 5 6
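
The offset initialization above could be sketched as follows. This is a hypothetical builder helper, not Arrow code: a null slot repeats the previous offset, so `offsets[i+1] - offsets[i]` is always a valid length (0 for nulls):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build the offsets buffer for a list<int32> array. A null entry is
// represented here by a null pointer and repeats the previous offset,
// so offset arithmetic stays well defined for every slot.
std::vector<int32_t> BuildOffsets(
    const std::vector<const std::vector<int32_t>*>& lists) {
  std::vector<int32_t> offsets{0};
  for (const auto* list : lists) {
    int32_t last = offsets.back();
    offsets.push_back(list ? last + static_cast<int32_t>(list->size())
                           : last);
  }
  return offsets;
}
```

For [[0, 1, 2], null, null, [4, 5, 6]] this yields the offsets 0 3 3 3 6 shown above.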

- Wes
