Posted to dev@arrow.apache.org by Wes McKinney <we...@cloudera.com> on 2016/03/04 17:14:15 UTC

Re: Should Nullable be a nested type?

Moving this thread over from the discussion about adding null count to
the physical format.

I never said that what you're describing is an invalid approach, only
that it will yield more complexity for both library developers and
users without any clear performance or net productivity benefits. This
is the kind of C++ codebase I would personally choose not to be
involved with. At this point it's fairly hypothetical; perhaps we can
revisit in a few months after Arrow gets used for some real-world
applications.

It's probably a philosophical divide but I like C++ as a tool
(compared with plain old C) for several reasons:

- High performance C tends to encourage much more macro use (manual
code generation, basically)

- As a code generation tool, templates are more sane and give better
compiler errors than C macros (a brief sketch follows this list).

- Object-oriented programming in C requires a lot of boilerplate. See,
for example, a C codebase (https://github.com/torch/TH) that uses an
opinionated flavor of OOP: you end up with a half-reimplementation of
C++ classes!

- Memory management using RAII and smart pointers makes me personally
a lot more productive, with fewer mistakes
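
To make the first two points concrete, here is a minimal sketch
(hypothetical names, not Arrow code) contrasting macro-based code
generation in C with the equivalent function template in C++:

  // The C-style approach: a macro stamps out one sum function per type,
  // but the macro body is opaque to the compiler until it is expanded.
  #define DEFINE_SUM(TYPE)                              \
    static TYPE sum_##TYPE(const TYPE* values, int n) { \
      TYPE total = 0;                                   \
      for (int i = 0; i < n; ++i) total += values[i];   \
      return total;                                     \
    }
  DEFINE_SUM(int)     /* expands to sum_int */
  DEFINE_SUM(double)  /* expands to sum_double */

  // The C++ approach: one template expresses the same code generation,
  // with type checking and error messages that point at real code.
  template <typename T>
  T Sum(const T* values, int n) {
    T total = 0;
    for (int i = 0; i < n; ++i) total += values[i];
    return total;
  }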

In particular, the Google C++ guide cautions against complicated
template metaprogramming:
https://google.github.io/styleguide/cppguide.html#Template_metaprogramming

"The techniques used in template metaprogramming are often obscure to
anyone but language experts. Code that uses templates in complicated
ways is often unreadable, and is hard to debug or maintain.

Template metaprogramming often leads to extremely poor compile time
error messages: even if an interface is simple, the complicated
implementation details become visible when the user does something
wrong."

The great part of Arrow is that the memory layout specification is
what really matters, so there is nothing stopping anyone from creating
alternate implementations that suit their needs, and if you need to
use functions from different implementations in an application, you
can do that because the memory is binary interoperable.

My intent for the C++ codebase is to make it the fastest reference
code available for these data structures while also readable and
accessible for a wide variety of programmers to contribute to, so
adding template metaprogramming constructs (as opposed to using
templates primarily for code generation) might drive away certain
kinds of contributors. I would like for many of the algorithms to not
end up too dissimilar from the ones you would write in C.

- Wes

On Fri, Mar 4, 2016 at 6:50 AM, Daniel Robinson
<da...@gmail.com> wrote:
> Wes,
>
> Thanks for soliciting so much input on these questions, and sharing the new
> prototypes.
>
> In response to point 2 and your e-mail from last week, I created some
> prototypes to illustrate what I think could be useful about having a
> Nullable<T> template in the C++ implementation.
>
> As far as code complexity, I think having a Nullable<T> type might simplify
> the user interface (including the definitions of algorithms) by enabling
> more generic programming, at the cost of some template-wrestling on the
> developer side. You mentioned Take (
> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.take.html).
> As a somewhat silly illustration, nullable typing could allow Take to be
> implemented in one line (
> https://github.com/danrobinson/arrow-demo/blob/master/take.h#L9):
>
>   return map<GetOperation>(index_array, array);
>
> Behind the scenes, map() does a runtime check that short-circuits the
> null-handling logic (
> https://github.com/danrobinson/arrow-demo/blob/master/map.h#L55-L81; this
> code seems to be irreducibly ugly but at least it's somewhat generic). It
> then runs the array through an algorithm written in continuation-passing
> style (https://github.com/danrobinson/arrow-demo/blob/master/map.h#L20-L22),
> which in turn constructs an operation pipeline where each operation can
> either call "step" (to yield a value) or "skip" (to yield a null value) on
> the next operation. Thanks to the nullable type, there are two versions of
> the get operation: one that checks for nulls, and one that knows it doesn't
> have to. (
> https://github.com/danrobinson/arrow-demo/blob/master/operations.h#L8-L55).
>
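
As a hedged illustration of the "two versions of the get operation" idea,
here is a minimal sketch with invented names (PlainArray, NullableArray,
Get, Step, and Skip mirror the description above, not the actual code in
the linked prototype):

  // An array statically known to contain no nulls.
  template <typename T>
  struct PlainArray {
    const T* values;
    int length;
  };

  // An array whose values may be null, tracked by a validity bitmap.
  template <typename T>
  struct NullableArray {
    const T* values;
    const unsigned char* valid_bits;
    int length;
  };

  // Two overloads of the get operation. Overload resolution happens at
  // compile time, so the no-null path pays no per-element branch.
  template <typename T, typename Next>
  void Get(const PlainArray<T>& arr, int i, Next& next) {
    next.Step(arr.values[i]);                    // value is always present
  }

  template <typename T, typename Next>
  void Get(const NullableArray<T>& arr, int i, Next& next) {
    if (arr.valid_bits[i / 8] & (1 << (i % 8))) {
      next.Step(arr.values[i]);                  // value present
    } else {
      next.Skip();                               // yield a null downstream
    }
  }
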
> I'm not actually trying to push any of these half-baked functional
> paradigms, let alone the hacky template metaprogramming tricks used to
> implement some of them in the prototype.  The point I'm trying to
> illustrate is that these kinds of abstractions would be more difficult to
> implement without Nullable<T> typing, because without types, you can't
> efficiently pass the information about whether an array has nulls or not
> from function to function (and ultimately to the function that processes
> each row). (Perhaps I'm missing something!)
>
> Here's Take implemented as a single monolithic function that isn't aware of
> nullability: https://github.com/danrobinson/arrow-demo/blob/master/take.h.
> In my tests this is about 5-10% faster than the map() version and I expect
> it would maintain an advantage if both were better optimized. Maybe a
> 45-line function like this is worth it for the core functions, but it might
> be useful to expose higher-order functions like map() to C++ developers.
>
> As for performance, code generation, and static polymorphism—is the issue
> roughly that we need compiled instantiations of every function that might
> be called, with every possible type, because at compile time we don't know
> the structure of the data or what functions people may want to call from
> (say) interpreted languages? I hadn't appreciated that, and it does seem
> like a risk of using templates, but I think it actually increases the
> upside of factoring out logic into abstractions like map().
>
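
To make the instantiation concern concrete, here is a hedged sketch
(hypothetical names, not Arrow code) of the pattern under discussion:
when the element type is only known at runtime, every templated kernel
needs one compiled instantiation per supported type, selected by a
runtime switch.

  #include <cstdint>
  #include <stdexcept>

  enum class TypeId { INT32, INT64, DOUBLE };

  // A templated kernel: gather values at the given indices.
  template <typename T>
  void TakeImpl(const void* values, const int* indices, int n, void* out) {
    const T* in = static_cast<const T*>(values);
    T* dst = static_cast<T*>(out);
    for (int i = 0; i < n; ++i) dst[i] = in[indices[i]];
  }

  // The caller (say, an interpreted-language binding) only supplies a
  // runtime type tag, so one instantiation per supported type must be
  // compiled into the binary.
  void Take(TypeId type, const void* values, const int* indices, int n,
            void* out) {
    switch (type) {
      case TypeId::INT32:  return TakeImpl<std::int32_t>(values, indices, n, out);
      case TypeId::INT64:  return TakeImpl<std::int64_t>(values, indices, n, out);
      case TypeId::DOUBLE: return TakeImpl<double>(values, indices, n, out);
    }
    throw std::runtime_error("unsupported type");
  }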

Re: Should Nullable be a nested type?

Posted by Daniel Robinson <da...@gmail.com>.
That's convincing, thanks for the response.
