Posted to dev@arrow.apache.org by Wes McKinney <we...@cloudera.com> on 2016/03/04 00:14:50 UTC

Re: Format: storing null count + required/non-nullable types

Returning to this discussion. I did some C++ prototyping

https://github.com/apache/arrow/pull/9
https://github.com/apache/arrow/pull/10

A handful of thoughts:

1) It is most useful for compatibility with other systems (e.g. Parquet --
see ARROW-22) to have required/optional in the type metadata, but not in
the array data containers. For example, see some Schema unit tests here:
https://github.com/apache/arrow/pull/10/files#diff-3204cc418d726cfc6b7bc68f6ace30f5R18

2) The number of types and code branches to consider is definitely smaller
if you store only the null count in the data structures (in other words
"let the data be data").

3) The requirement that required types can only associate with data having
null_count == 0 is a contract that may be best enforced at IO / IPC
boundaries (but this may vary from implementation to implementation). In
the absence of restrictions, optional/nullable is probably a sensible
default.
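
To make points 1-3 concrete, here is a minimal C++ sketch of the split being
described. The type and function names are illustrative only and are not the
classes from the pull requests above: nullability lives in the field metadata,
the array container carries only a null count, and the required/null_count
contract is checked at a boundary.

  #include <cstdint>
  #include <stdexcept>
  #include <string>

  // Point 1: nullability ("required" vs "optional") is a property of the
  // field metadata, as in Parquet, not of the array container.
  struct Field {
    std::string name;
    bool nullable;  // required <=> nullable == false
  };

  // Point 2: the data container only records how many nulls it actually
  // holds; it does not know whether its field was declared required.
  struct Array {
    int64_t length;
    int64_t null_count;
    const uint8_t* null_bitmap;  // may be absent when null_count == 0
  };

  // Point 3: the "required implies null_count == 0" contract is checked at
  // an IO / IPC boundary rather than being baked into the array types.
  void ValidateOnReceive(const Field& field, const Array& array) {
    if (!field.nullable && array.null_count > 0) {
      throw std::runtime_error("field '" + field.name +
                               "' is required but the data contain nulls");
    }
  }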

Scrutiny welcome! This is important in the context of the metadata
definition process.

best,
Wes

On Sat, Feb 20, 2016 at 6:47 PM, Wes McKinney <we...@cloudera.com> wrote:

> Some inline thoughts
>
> On Sat, Feb 20, 2016 at 5:16 PM, Julian Hyde <jh...@apache.org> wrote:
> > Have you considered representing nulls using a placeholder value? For
> values with 4 or more bits, there is at least one value that doesn’t occur
> in a given batch. That value could be used as the null placeholder. Then
> you don’t need a null vector.
> >
>
> Using any kind of null sentinel / placeholder is tricky. Using a
> constant sentinel is not desirable as it limits your ability to fully
> represent the full range of values within primitive C types (also,
> what sentinel do you use for variable-length / list types?). Having a
> dynamic sentinel is interesting, but consider:
>
> - How do you know what value to use in incremental builder classes?
> - If two arrays of the same type use different sentinels,
> concatenation may require non-trivial computation beyond a memcpy
> - Bitmasks allow you to use hardware popcount to determine if a range
> of values contains any nulls at all, which is much faster than
> value-by-value comparisons
>
> I haven't done a microbenchmark to understand the performance
> implications of primitive value comparison versus bit checking; that
> would be useful information to have.
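
As a rough illustration of the bitmask point above (a sketch, not code from
the Arrow implementation), checking whether a range contains any nulls
reduces to popcounts over the validity bitmap. This assumes bit i of
valid_bits is 1 when value i is non-null, bits are packed LSB-first, and the
range starts at offset 0; GCC/Clang's __builtin_popcountll stands in for a
hardware popcount instruction.

  #include <cstdint>
  #include <cstring>

  // Returns true if any of the first `length` values is null, by counting
  // set validity bits instead of comparing values one at a time.
  bool RangeHasNulls(const uint8_t* valid_bits, int64_t length) {
    int64_t popcount = 0;
    int64_t i = 0;
    // Count 64 validity bits at a time with a hardware popcount.
    for (; i + 64 <= length; i += 64) {
      uint64_t word;
      std::memcpy(&word, valid_bits + i / 8, sizeof(word));
      popcount += __builtin_popcountll(word);
    }
    // Count any remaining bits one by one.
    for (; i < length; ++i) {
      popcount += (valid_bits[i / 8] >> (i % 8)) & 1;
    }
    return popcount < length;  // fewer set bits than values => some nulls
  }
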
>
> > For small values such as byte and short, it’s possible that the value
> space is filled up, in which case you could move to the next size up.
> >
>
> In general, many applications may not find this palatable, as data
> already stored in a particular integer type (e.g. int8) would have to
> be upcast if it is deemed not to "fit" (determining this could itself
> be computationally intensive).
>
> > Representing nulls this way would save allocating an extra piece of
> memory, using up another cache line, and, I imagine could quite often
> reduce the number of branch instructions used: you could check ‘v[i] < 10’
> rather than ‘v[i] < 10 and !isnull[i]’.
> >
> > Or maybe this kind of null compression (and other kinds of compression
> like dictionary encoding and using non-multiples of 8 bits) belongs at the
> application layer?
> >
>
> Perhaps so -- several people have asked me about dictionary encoding
> in the context of Arrow. In that particular case, a dictionary-encoded
> array is really two arrays -- an integer index array plus the
> dictionary array -- interpreted as an array of the logical type of the
> dictionary. That seems to me like a problem of the metadata or
> application domain. Of course, we can provide fast algorithms within
> Arrow for dictionary encoding and other such standard operations.
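
As an illustration of the two-array view (hypothetical types, not an Arrow
API), a dictionary-encoded string column could be sketched as:

  #include <cstdint>
  #include <string>
  #include <vector>

  // Two physical arrays interpreted together as one logical string column.
  struct DictionaryArray {
    std::vector<int32_t> indices;         // one entry per row
    std::vector<std::string> dictionary;  // distinct values, referenced by index

    // The logical value of row i is the dictionary entry its index points to.
    const std::string& Value(int64_t i) const { return dictionary[indices[i]]; }
  };

Keeping the indices and the dictionary as ordinary arrays means the
"dictionary-ness" lives entirely in how the pair is interpreted, which matches
the point above about this being a metadata or application-level concern.
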
>
> > Julian
> >
>

Re: Format: storing null count + required/non-nullable types

Posted by Wes McKinney <we...@cloudera.com>.
This thread was about adding null count to the data structures (and
making nullability a property of the metadata, if at all). I'll reply
to the other thread about implementation matters.

On Fri, Mar 4, 2016 at 6:50 AM, Daniel Robinson
<da...@gmail.com> wrote:
> Wes,
>
> Thanks for soliciting so much input on these questions, and sharing the new
> prototypes.
>
> In response to point 2 and your e-mail from last week, I created some
> prototypes to illustrate what I think could be useful about having a
> Nullable<T> template in the C++ implementation.
>

Format: storing null count + required/non-nullable types

Posted by Daniel Robinson <da...@gmail.com>.
Wes,

Thanks for soliciting so much input on these questions, and sharing the new
prototypes.

In response to point 2 and your e-mail from last week, I created some
prototypes to illustrate what I think could be useful about having a
Nullable<T> template in the C++ implementation.

As far as code complexity goes, I think having a Nullable<T> type might simplify
the user interface (including the definitions of algorithms) by enabling
more generic programming, at the cost of some template-wrestling on the
developer side. You mentioned Take (
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.take.html).
As a somewhat silly illustration, nullable typing could allow Take to be
implemented in one line (
https://github.com/danrobinson/arrow-demo/blob/master/take.h#L9):

  return map<GetOperation>(index_array, array);

Behind the scenes, map() does a runtime check that short-circuits the
null-handling logic (
https://github.com/danrobinson/arrow-demo/blob/master/map.h#L55-L81; this
code seems to be irreducibly ugly but at least it's somewhat generic). It
then runs the array through an algorithm written in continuation-passing
style (https://github.com/danrobinson/arrow-demo/blob/master/map.h#L20-L22),
which in turn constructs an operation pipeline where each operation can
either call "step" (to yield a value) or "skip" (to yield a null value) on
the next operation. Thanks to the nullable type, there are two versions of
the get operation: one that checks for nulls, and one that knows it doesn't
have to. (
https://github.com/danrobinson/arrow-demo/blob/master/operations.h#L8-L55).
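
A minimal sketch of the general idea, using hypothetical PlainArray<T> and
NullableArray<T> types rather than the Nullable<T> wrapper and map() machinery
in the linked prototype: when the has-nulls/no-nulls distinction is carried in
the type, the compiler instantiates a nulls-free path with no per-element
check, and a nulls-aware overload is selected only when needed.

  #include <cstdint>
  #include <vector>

  // Array with no null information at the type level: algorithms written
  // against it never compile in a null check.
  template <typename T>
  struct PlainArray {
    std::vector<T> values;
  };

  // Nullable counterpart: carries per-value validity and a null count.
  template <typename T>
  struct NullableArray {
    std::vector<T> values;
    std::vector<bool> valid;  // true where the value is non-null
    int64_t null_count = 0;
  };

  // Nulls-free Take: a plain gather, no per-element branch.
  template <typename T>
  PlainArray<T> Take(const PlainArray<T>& a, const std::vector<int64_t>& idx) {
    PlainArray<T> out;
    for (int64_t i : idx) out.values.push_back(a.values[i]);
    return out;
  }

  // Nulls-aware Take: propagates ("skips") null slots element by element.
  template <typename T>
  NullableArray<T> Take(const NullableArray<T>& a, const std::vector<int64_t>& idx) {
    NullableArray<T> out;
    for (int64_t i : idx) {
      out.values.push_back(a.values[i]);
      out.valid.push_back(a.valid[i]);
      if (!a.valid[i]) ++out.null_count;
    }
    return out;
  }

In this sketch, the runtime check described above would amount to a single
branch that decides which of the two overloads to call for a given input.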

I'm not actually trying to push any of these half-baked functional
paradigms, let alone the hacky template metaprogramming tricks used to
implement some of them in the prototype.  The point I'm trying to
illustrate is that these kinds of abstractions would be more difficult to
implement without Nullable<T> typing, because without types, you can't
efficiently pass the information about whether an array has nulls or not
from function to function (and ultimately to the function that processes
each row). (Perhaps I'm missing something!)

Here's Take implemented as a single monolithic function that isn't aware of
nullability: https://github.com/danrobinson/arrow-demo/blob/master/take.h.
In my tests this is about 5-10% faster than the map() version and I expect
it would maintain an advantage if both were better optimized. Maybe a
45-line function like this is worth it for the core functions, but it might
be useful to expose higher-order functions like map() to C++ developers.
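
For contrast with the sketch above, a rough hypothetical version of the
monolithic, runtime-branching approach (again, not the code at the link) might
look like the following.

  #include <cstdint>
  #include <vector>

  // One function handles both cases itself, branching on null_count at
  // runtime instead of relying on the type system.
  struct Int32Array {
    std::vector<int32_t> values;
    std::vector<bool> valid;  // ignored when null_count == 0
    int64_t null_count = 0;
  };

  Int32Array Take(const Int32Array& a, const std::vector<int64_t>& idx) {
    Int32Array out;
    if (a.null_count == 0) {
      // Fast path: no validity bookkeeping at all.
      for (int64_t i : idx) out.values.push_back(a.values[i]);
      out.valid.assign(idx.size(), true);
    } else {
      // Slow path: carry validity through and recount nulls.
      for (int64_t i : idx) {
        out.values.push_back(a.values[i]);
        out.valid.push_back(a.valid[i]);
        if (!a.valid[i]) ++out.null_count;
      }
    }
    return out;
  }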

As for performance, code generation, and static polymorphism—is the issue
roughly that we need compiled instantiations of every function that might
be called, with every possible type, because at compile time we don't know
the structure of the data or what functions people may want to call from
(say) interpreted languages? I hadn't appreciated that, and it does seem
like a risk of using templates, but I think it actually increases the
upside of factoring out logic into abstractions like map().


