Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2016/07/14 05:28:14 UTC

Discussion: Should we make string/binary types first class Arrow Array types?

Today String and Binary types are represented in memory as list<byte> [1],
and we use logical types to distinguish between a list of bytes and a string
type [2].

The question of whether this is sufficient, or whether we should make a
first class string/binary type, has come up tangentially on a few threads,
and we should try to come to a conclusion on whether we want to add it as
part of the spec.  I think the current proposal is that the String type
would consist of a null-bitmap buffer, an offset buffer, and a buffer
containing bytes (for strings, the bytes would be UTF-8 encoded).  The main
difference from the list representation is that individual bytes cannot be
marked as null, because there isn't a nested Array.
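
As a rough illustration of the proposed three-buffer layout, here is a
minimal sketch.  StringColumn and make_example are hypothetical names for
illustration only, not the Arrow C++ API, and the bit order of the validity
bitmap (least-significant bit first) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration type, not an Arrow C++ class.
struct StringColumn {
    std::vector<uint8_t> validity;  // one bit per slot (assumed LSB-first); 1 = valid
    std::vector<int32_t> offsets;   // length + 1 entries indexing into `data`
    std::vector<uint8_t> data;      // concatenated UTF-8 bytes; no per-byte nulls

    int64_t length() const { return static_cast<int64_t>(offsets.size()) - 1; }

    // A slot is null when its bit in the validity bitmap is cleared.
    bool is_null(int64_t i) const {
        return (validity[i / 8] & (1u << (i % 8))) == 0;
    }

    // Copy out the bytes between two consecutive offsets.
    std::string value(int64_t i) const {
        return std::string(data.begin() + offsets[i], data.begin() + offsets[i + 1]);
    }
};

// Builds the column ["ab", null, "cde"].
StringColumn make_example() {
    StringColumn col;
    col.validity = {0x05};  // bits 0 and 2 set: slots 0 and 2 are valid
    col.offsets = {0, 2, 2, 5};
    col.data = {'a', 'b', 'c', 'd', 'e'};
    return col;
}
```

Because the bytes live directly in the data buffer rather than in a nested
Array, there is simply nowhere to record a null for an individual byte.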

To quote Jacques for the pros of this approach:

 My main argument is that the most basic types most people need come in
this order from my experience:

Int
String
Float
Decimal
Binary

Note that I'm not focused on width here, just generally "what people use".
So I think a string comes second in priority and ease of
use/approachability necessitate this as a first class concept. This is
beyond the fact that it has specialized rules that are separate from a
List<Byte>.



The main argument for not doing this is it adds additional types that need
to be implemented and can lead to some amount of redundant code.  For
instance, in the current C++ implementation we are able to have a String
Array that extends a List Type and re-use already defined equality methods
[3].

What do people think?

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Layout.md
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string.h#L68

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Wes McKinney <we...@gmail.com>.

See ARROW-262

On Mon, Aug 15, 2016 at 3:38 PM, Wes McKinney <we...@gmail.com> wrote:

> These IPC details we should definitely document outside of the code.
>
> For the String/Binary type question, I want to start a document that
> explains the logical data types in Message.fbs in terms of
>
> - what Arrow memory layout they use (for example: Int32 uses fixed bit
> width 32 bits), and String uses List<UInt8> (with the restriction that
> the inner buffer must not have any nulls, and the validity bitmap is
> omitted)
>
> - any type-specific custom logic around deconstructing or
> reconstructing an Arrow container in an IPC/RPC setting. What we have
> been debating in this e-mail thread is altering the appearance of
> String/Binary representation in a record batch
> (https://github.com/apache/arrow/blob/master/format/Message.fbs#L147)
> from being identical to List<UInt8> (4 buffers in the flattened buffer
> list -- 2 for the list node, 2 for the UInt8 node) to its own
> "collapsed" form (3 buffers: bitmap, offsets, data). This means that
> any code that is sending/receiving a record batch will need separate
> code paths to handle List and String respectively (in the C++ code, we
> are currently using the same code path for both)
>
> For example, changes like ARROW-253
> (https://github.com/apache/arrow/commit/dc01f099d966b92f4de7679b4a1caf97c363e08e)
> would be documented outside of the code and message IDL.
>
> I will open one or more JIRAs and write a patch to try to close the
> loop on this.
>
> - Wes
>
> On Tue, Aug 16, 2016 at 6:04 AM, Julien Le Dem <ju...@dremio.com> wrote:
> > There's ARROW-258 which is about clarifying difference (if any) in metadata
> > across RPC (sockets), IPC (shared memory) and files.
> > The vector layout is the same except in RPC or files they get concatenated
> > together when copied over.
> > The metadata should be mostly the same (ideally the same). Buffer offsets
> > are relative to the beginning of the body in the context of RPC and file
> > start in files. In the context of IPC it looks like we need an extra page id
> > (from Message.fbs). Is this correct?
> >
> > On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield <emkornfield@gmail.com>
> > wrote:
> >>
> >> Thanks Wes,
> >> This makes sense.  +1 on the "Logical Types / IPC layout
> >> document"  is there a JIRA open for this?
> >>
> >> I'll open a JIRA item to change the inheritance of string/binary in the
> >> C++ code base.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <we...@gmail.com>
> >> wrote:
> >>>
> >>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <emkornfield@gmail.com>
> >>> wrote:
> >>> > Sorry for the late reply.
> >>> >
> >>> > This all sounds reasonable to me.  But I'm not sure I understand
> >>> > exactly what you mean by
> >>> >
> >>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
> >>> >> would be a single array unit in the buffer stream and flattened Field
> >>> >> metadata rather than nested types (2 array units as they are
> >>> >> presently).
> >>> >
> >>> >
> >>> > The way I read it this seems to me to contradict the
> >>> > cross-implementation as "List<UInt8-not null>"?
> >>> >
> >>> > Thanks,
> >>> > Micah
> >>> >
> >>>
> >>> I think we can resolve this by starting a "Logical Types and IPC/RPC
> >>> layout" specification document.
> >>>
> >>> The schema metadata
> >>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
> >>> as I understand it, strictly the domain of logical types. I believe
> >>> there is some minor conflation of the notions of primitive physical
> >>> types and primitive logical types.
> >>>
> >>> While String / Binary have identical physical layouts to List<UInt8
> >>> not null>, in the domain of logical types and IPC, what we are saying
> >>> is that these types are:
> >>>
> >>> - logically speaking: primitive, non-nested types
> >>> - their IPC layout is the flattened version of the nested List<UInt8>
> >>> counterpart -- a single Field node having String type (with a null
> >>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
> >>> data. Structurally on the wire / in shared memory (compared with
> >>> List<UInt8 not null>) the only difference is the Field metadata (since
> >>> if null count is 0 for the inner UInt8 values, then there is only a
> >>> single buffer) -- one node versus two
> >>>
> >>> Let me know if this does not make sense.
> >>>
> >>> To move this forward I propose to begin a Logical Types / IPC layout
> >>> document and begin to document the mapping between logical types and
> >>> their physical in-memory representation and layout on the wire.
> >>>
> >>> - Wes
> >>
> >>
> >
> >
> >
> > --
> > Julien
>

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Wes McKinney <we...@gmail.com>.

These IPC details we should definitely document outside of the code.

For the String/Binary type question, I want to start a document that
explains the logical data types in Message.fbs in terms of

- what Arrow memory layout they use (for example, Int32 uses a fixed bit
width of 32 bits, and String uses List<UInt8>, with the restriction that
the inner buffer must not have any nulls, and the validity bitmap is
omitted)

- any type-specific custom logic around deconstructing or
reconstructing an Arrow container in an IPC/RPC setting. What we have
been debating in this e-mail thread is altering the appearance of
String/Binary representation in a record batch
(https://github.com/apache/arrow/blob/master/format/Message.fbs#L147)
from being identical to List<UInt8> (4 buffers in the flattened buffer
list -- 2 for the list node, 2 for the UInt8 node) to its own
"collapsed" form (3 buffers: bitmap, offsets, data). This means that
any code that is sending/receiving a record batch will need separate
code paths to handle List and String respectively (in the C++ code, we
are currently using the same code path for both)

For example, changes like ARROW-253
(https://github.com/apache/arrow/commit/dc01f099d966b92f4de7679b4a1caf97c363e08e)
would be documented outside of the code and message IDL.

I will open one or more JIRAs and write a patch to try to close the
loop on this.

- Wes

On Tue, Aug 16, 2016 at 6:04 AM, Julien Le Dem <ju...@dremio.com> wrote:
> There's ARROW-258 which is about clarifying difference (if any) in metadata
> across RPC (sockets), IPC (shared memory) and files.
> The vector layout is the same except in RPC or files they get concatenated
> together when copied over.
> The metadata should be mostly the same (ideally the same). Buffer offsets
> are relative to the beginning of the body in the context of RPC and file
> start in files. In the context of IPC it looks like we need an extra page id
> (from Message.fbs). Is this correct?
>
> On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield <em...@gmail.com>
> wrote:
>>
>> Thanks Wes,
>> This makes sense.  +1 on the "Logical Types / IPC layout
>> document"  is there a JIRA open for this?
>>
>> I'll open a JIRA item to change the inheritance of string/binary in the
>> C++ code base.
>>
>> Thanks,
>> Micah
>>
>> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>>>
>>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <em...@gmail.com>
>>> wrote:
>>> > Sorry for the late reply.
>>> >
>>> > This all sounds reasonable to me.  But I'm not sure I understand
>>> > exactly
>>> > what you mean by
>>> >
>>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>>> >> would be a single array unit in the buffer stream and flattened Field
>>> >> metadata rather than nested types (2 array units as they are
>>> >> presently).
>>> >
>>> >
>>> > The way I read it this seems to me to contradict the
>>> > cross-implementation as
>>> > "List<UInt8-not null>"?
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>>
>>> I think we can resolve this by starting a "Logical Types and IPC/RPC
>>> layout" specification document.
>>>
>>> The schema metadata
>>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
>>> as I understand it, strictly the domain of logical types. I believe
>>> there is some minor conflation of the notions of primitive physical
>>> types and primitive logical types.
>>>
>>> While String / Binary have identical physical layouts to List<UInt8
>>> not null>, in the domain of logical types and IPC, what we are saying
>>> is that these types are:
>>>
>>> - logically speaking: primitive, non-nested types
>>> - their IPC layout is the flattened version of the nested List<UInt8>
>>> counterpart -- a single Field node having String type (with a null
>>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
>>> data. Structurally on the wire / in shared memory (compared with
>>> List<UInt8 not null>) the only difference is the Field metadata (since
>>> if null count is 0 for the inner UInt8 values, then there is only a
>>> single buffer) -- one node versus two
>>>
>>> Let me know if this does not make sense.
>>>
>>> To move this forward I propose to begin a Logical Types / IPC layout
>>> document and begin to document the mapping between logical types and
>>> their physical in-memory representation and layout on the wire.
>>>
>>> - Wes
>>
>>
>
>
>
> --
> Julien

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Julien Le Dem <ju...@dremio.com>.

There's ARROW-258 which is about clarifying difference (if any) in metadata
across RPC (sockets), IPC (shared memory) and files.
The vector layout is the same except in RPC or files they get concatenated
together when copied over.
The metadata should be mostly the same (ideally the same). Buffer offsets
are relative to the beginning of the body in the context of RPC and file
start in files. In the context of IPC it looks like we need an extra page
id (from Message.fbs). Is this correct?
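
As a rough sketch of how the two offset conventions relate, assuming the
reading above is right: a body-relative offset becomes a file-relative one
by adding the position at which the body starts in the file.  BufferLocation
and both helper functions are hypothetical names for illustration; they are
not taken from Message.fbs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical descriptor for one buffer; field names are illustrative only.
struct BufferLocation {
    int64_t offset;  // byte offset of the buffer
    int64_t length;  // byte length of the buffer
};

// Rebase a body-relative offset (the RPC convention described above) to a
// file-relative one, given where the message body starts within the file.
BufferLocation to_file_relative(BufferLocation in_body, int64_t body_start) {
    return BufferLocation{in_body.offset + body_start, in_body.length};
}

// Inverse direction: file-relative back to body-relative.
BufferLocation to_body_relative(BufferLocation in_file, int64_t body_start) {
    return BufferLocation{in_file.offset - body_start, in_file.length};
}
```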

On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield <em...@gmail.com>
wrote:

> Thanks Wes,
> This makes sense.  +1 on the "Logical Types / IPC layout
> document"  is there a JIRA open for this?
>
> I'll open a JIRA item to change the inheritance of string/binary in the
> C++ code base.
>
> Thanks,
> Micah
>
> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <we...@gmail.com>
> wrote:
>
>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <em...@gmail.com>
>> wrote:
>> > Sorry for the late reply.
>> >
>> > This all sounds reasonable to me.  But I'm not sure I understand exactly
>> > what you mean by
>> >
>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>> >> would be a single array unit in the buffer stream and flattened Field
>> >> metadata rather than nested types (2 array units as they are
>> >> presently).
>> >
>> >
> >> > The way I read it this seems to me to contradict the
> >> > cross-implementation as "List<UInt8-not null>"?
>> >
>> > Thanks,
>> > Micah
>> >
>>
>> I think we can resolve this by starting a "Logical Types and IPC/RPC
>> layout" specification document.
>>
>> The schema metadata
>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
>> as I understand it, strictly the domain of logical types. I believe
>> there is some minor conflation of the notions of primitive physical
>> types and primitive logical types.
>>
>> While String / Binary have identical physical layouts to List<UInt8
>> not null>, in the domain of logical types and IPC, what we are saying
>> is that these types are:
>>
>> - logically speaking: primitive, non-nested types
>> - their IPC layout is the flattened version of the nested List<UInt8>
>> counterpart -- a single Field node having String type (with a null
>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
>> data. Structurally on the wire / in shared memory (compared with
>> List<UInt8 not null>) the only difference is the Field metadata (since
>> if null count is 0 for the inner UInt8 values, then there is only a
>> single buffer) -- one node versus two
>>
>> Let me know if this does not make sense.
>>
>> To move this forward I propose to begin a Logical Types / IPC layout
>> document and begin to document the mapping between logical types and
>> their physical in-memory representation and layout on the wire.
>>
>> - Wes
>>
>
>


-- 
Julien

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Micah Kornfield <em...@gmail.com>.

Thanks Wes,
This makes sense.  +1 on the "Logical Types / IPC layout
document".  Is there a JIRA open for this?

I'll open a JIRA item to change the inheritance of string/binary in the C++
code base.

Thanks,
Micah

On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <we...@gmail.com> wrote:

> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <em...@gmail.com>
> wrote:
> > Sorry for the late reply.
> >
> > This all sounds reasonable to me.  But I'm not sure I understand exactly
> > what you mean by
> >
> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
> >> would be a single array unit in the buffer stream and flattened Field
> >> metadata rather than nested types (2 array units as they are
> >> presently).
> >
> >
> > The way I read it this seems to me to contradict the
> > cross-implementation as "List<UInt8-not null>"?
> >
> > Thanks,
> > Micah
> >
>
> I think we can resolve this by starting a "Logical Types and IPC/RPC
> layout" specification document.
>
> The schema metadata
> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
> as I understand it, strictly the domain of logical types. I believe
> there is some minor conflation of the notions of primitive physical
> types and primitive logical types.
>
> While String / Binary have identical physical layouts to List<UInt8
> not null>, in the domain of logical types and IPC, what we are saying
> is that these types are:
>
> - logically speaking: primitive, non-nested types
> - their IPC layout is the flattened version of the nested List<UInt8>
> counterpart -- a single Field node having String type (with a null
> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
> data. Structurally on the wire / in shared memory (compared with
> List<UInt8 not null>) the only difference is the Field metadata (since
> if null count is 0 for the inner UInt8 values, then there is only a
> single buffer) -- one node versus two
>
> Let me know if this does not make sense.
>
> To move this forward I propose to begin a Logical Types / IPC layout
> document and begin to document the mapping between logical types and
> their physical in-memory representation and layout on the wire.
>
> - Wes
>

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Wes McKinney <we...@gmail.com>.

On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <em...@gmail.com> wrote:
> Sorry for the late reply.
>
> This all sounds reasonable to me.  But I'm not sure I understand exactly
> what you mean by
>
>> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>> would be a single array unit in the buffer stream and flattened Field
>> metadata rather than nested types (2 array units as they are
>> presently).
>
>
> The way I read it this seems to me to contradict the cross-implementation as
> "List<UInt8-not null>"?
>
> Thanks,
> Micah
>

I think we can resolve this by starting a "Logical Types and IPC/RPC
layout" specification document.

The schema metadata
(https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
as I understand it, strictly the domain of logical types. I believe
there is some minor conflation of the notions of primitive physical
types and primitive logical types.

While String / Binary have identical physical layouts to List<UInt8
not null>, in the domain of logical types and IPC, what we are saying
is that these types are:

- logically speaking: primitive, non-nested types
- their IPC layout is the flattened version of the nested List<UInt8>
counterpart -- a single Field node having String type (with a null
count, etc.), and 3 memory buffers: validity bitmap, offsets, and
data. Structurally on the wire / in shared memory (compared with
List<UInt8 not null>) the only difference is the Field metadata (since
if null count is 0 for the inner UInt8 values, then there is only a
single buffer) -- one node versus two
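
To make the node and buffer counts concrete, here is a minimal sketch that
flattens a type tree depth-first.  TypeNode, FlatLayout and the helper
functions are hypothetical stand-ins for illustration, not the Message.fbs
schema or the Arrow C++ classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type-tree node for illustration only.
struct TypeNode {
    std::string name;
    int num_buffers;                 // buffers this node contributes by itself
    std::vector<TypeNode> children;
};

struct FlatLayout {
    std::size_t field_nodes = 0;
    std::size_t buffers = 0;
};

// Depth-first walk: each node yields one field node plus its own buffers.
void flatten(const TypeNode& node, FlatLayout& out) {
    out.field_nodes += 1;
    out.buffers += static_cast<std::size_t>(node.num_buffers);
    for (const TypeNode& child : node.children) flatten(child, out);
}

FlatLayout layout_of(const TypeNode& root) {
    FlatLayout out;
    flatten(root, out);
    return out;
}

// List<UInt8>: the list node carries validity + offsets; the UInt8 child
// carries values, plus its own validity bitmap unless it is "not null".
TypeNode nested_list_of_uint8(bool child_has_validity) {
    TypeNode child{"UInt8", child_has_validity ? 2 : 1, {}};
    return TypeNode{"List", 2, {child}};
}

// Collapsed String: a single node with validity, offsets and data.
TypeNode collapsed_string() {
    return TypeNode{"String", 3, {}};
}
```

Under these assumptions, List<UInt8> with a nullable child flattens to two
nodes and four buffers, List<UInt8 not null> to two nodes and three buffers,
and the collapsed String to one node and three buffers.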

Let me know if this does not make sense.

To move this forward I propose to begin a Logical Types / IPC layout
document and begin to document the mapping between logical types and
their physical in-memory representation and layout on the wire.

- Wes

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Micah Kornfield <em...@gmail.com>.

Sorry for the late reply.

This all sounds reasonable to me.  But I'm not sure I understand exactly
what you mean by

> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
> would be a single array unit in the buffer stream and flattened Field
> metadata rather than nested types (2 array units as they are
> presently).


The way I read it, this seems to contradict the cross-implementation
convention of "List<UInt8-not null>"?

Thanks,
Micah


On Tue, Aug 9, 2016 at 4:20 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Micah
>
> I'm sorry for dropping the ball on this discussion. copying Julien as
> he's been looking at the metadata recently.
>
> My thinking is that we should indicate in the format document that the
> String and Binary logical types, as a matter of cross-implementation
> convention, will have List<UInt8-not null> memory layout.
>
> In the C++ library at least, we can collapse the class structure to
> make BinaryArray and StringArray not a subclass of ListArray,
> factoring out common code that can be reused into helper inline
> functions.
>
> Class hierarchy aside the main impact is adding entries to the Type
> union in the Flatbuffers metadata
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L63
>
> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
> would be a single array unit in the buffer stream and flattened Field
> metadata rather than nested types (2 array units as they are
> presently).
>
> Separately, I am very interested in discussing a form of logical
> Binary/StringArray in the C++ implementation that is internally
> dictionary encoded. I'm proposing this as a possible new UTF-8
> representation for pandas in the future:
> https://wesm.github.io/pandas2-design/strings.html#possible-solution-new-non-numpy-string-memory-layout
>
> Hopefully this isn't too incoherent, but it would be good to arrive at
> some conclusion in this discussion if we need to implement the
> changes.
>
> Thanks
> Wes
>
> On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <em...@gmail.com>
> wrote:
> > Wes, Jacques, others...
> >
> > Any thoughts on this?   Let me know if you would like to clarify something,
> > I think I was a little long winded.  It would be good to come to a
> > consensus one way or another.
> >
> > Thanks,
> > Micah
> >
> > On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> Hi Wes and Jacques,
> >>
> >> Thanks for the thorough analysis.  I agree that Strings should be easy
> >> to work with.  I'm just trying to understand how making a distinct
> >> string type defined in the memory layout spec [1] brings a lot of
> >> additional utility.
> >>
> >> I think of there being two distinct concerns with Arrow:
> >>
> >> 1.  Layout - What metadata and data elements are required to represent
> >> a specific type in a flat address space.
> >>
> >> 2.  Manipulation - How we build interfaces for working with the memory
> >> layout.
> >>
> >> With respect to Memory Layout, introducing a new string type seems to
> >> add redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
> >> represent the layout for strings.  So the main benefit of introducing
> >> a new memory layout for a string type is an optimization.  By
> >> introducing the new type we avoid invalid string construction (having
> >> uint8 elements marked as null in the nested array) and save a few
> >> bytes/an extra function call when "(de)serializing" a string column.
> >>
> >> With respect to manipulation, I agree that having the right
> >> API/modeling to treat strings as first class objects makes a lot of
> >> sense.  But I don't think that the specification needs to explicitly
> >> make allowances for it.  Once you have constructed a Java/C++ wrapper
> >> around the memory layout you can choose to expose the right convenience
> >> APIs through OO abstraction.  The construction of the correct object
> >> wrapper is governed by Metadata defined in [2] and an understanding of
> >> how the logical type maps to the appropriate memory layout.  At the
> >> moment metadata doesn't specify any sort of class hierarchy, which I
> >> believe is the correct thing to do from a specification perspective.
> >>
> >> The C++ implementation currently has StringArrays inheriting from
> >> ListArrays, which was an implementation convenience and something we
> >> should revisit (I agree with Wes's point on not relying on C++'s type
> >> system for casting).
> >> The primary argument for changing the existing implementation seems to
> >> be that strings should be considered "non-nested" types.  Whether
> >> strings are nested or not seems to fall squarely into the manipulation
> >> concern (except for the optimizations mentioned above) and is
> >> therefore, IMO, an implementation detail.  When thinking about how this
> >> plays out in code I imagine a visitor pattern.  I've provided some
> >> pseudo-code below for two possible visitor classes that make
> >> StringArrays first class objects but wouldn't require updates to the
> >> specification.
> >>
> >> I've tried to think where testing a particular object for "nested"-ness
> >> makes sense by itself and couldn't come up with something off the top
> >> of my head.  It seems once you determine an Array is non-nested you
> >> still want to test for the exact primitive type you are dealing with.
> >>
> >> Given these points I'm still ambivalent about adding a new
> >> string/binary type to the spec.  It would be an improvement but it
> >> seems like a somewhat minor improvement.  If people can provide
> >> stronger use-cases for adding the new type I'd be less ambivalent, but
> >> at the moment this seems like more of an implementation concern.
> >>
> >> Thanks,
> >> Micah
> >>
> >> // Visitor patterns for arrays that do not require any updates to the
> >> // memory layout.
> >> class ClassVisitor {
> >>     void visit(Int32Array );
> >>     void visit(UInt32Array );
> >>     void visit(DoubleArray );
> >>     void visit(ListArray );
> >>     // If we changed the hierarchy, this would be sufficient to treat
> >>     // strings as a first class type.
> >>     void visit(StringArray );
> >>     // Other types elided
> >> }
> >>
> >> or
> >>
> >> // Type disambiguation happens by calling the correctly overloaded method.
> >> class BufferVisitor {
> >>     void visit_numeric(TypeMetadata, null_bitmap, value_buffer);
> >>     void visit_list(TypeMetadata, null_bitmap, offset_buffer,
> >>                     Array nested_type);
> >>     // Sufficient for treating string types as non-nested.
> >>     void visit_string(TypeMetadata, null_bitmap, offset_buffer,
> >>                       byte_buffer);
> >>     // Other types elided.
> >> }
> >>
> >> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> >> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
> >>
> >>
>

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Julien Le Dem <ju...@dremio.com>.

Yes

On Wed, Aug 10, 2016 at 11:24 AM, Wes McKinney <we...@gmail.com> wrote:

> I see the primary point of discussion on this to be whether String/Binary
> have the same layout on the wire as List<uint8-not null> (i.e. one
> Field/Node in the type tree versus two). I think what we are working
> towards is a single field rather than a List field and an Int field (bit
> width 8).
>
> On Aug 10, 2016, at 10:46 AM, Julien Le Dem <ju...@dremio.com> wrote:
>
> Hi,
> Agreed.
> To paraphrase/complement what has been said:
> The types in format/Message.fbs [1] are "Logical types" or "user facing
> types", close to SQL types (they include String, Timestamp, Decimal, ...)
> and are related to Parquet's logical types [2][3].
> For each of those types there's a corresponding physical layout that is
> formally specified (example discussed here: String => List<UInt8-not null>).
> I'm going to open a couple of JIRA's to finalise the types and clarify the
> layout.
>
> [1] https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L63
> [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
> [3] https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L48
>
> Julien
>
> On Tue, Aug 9, 2016 at 4:20 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Micah
>>
>> I'm sorry for dropping the ball on this discussion. copying Julien as
>> he's been looking at the metadata recently.
>>
>> My thinking is that we should indicate in the format document that the
>> String and Binary logical types, as a matter of cross-implementation
>> convention, will have List<UInt8-not null> memory layout.
>>
>> In the C++ library at least, we can collapse the class structure to
>> make BinaryArray and StringArray not a subclass of ListArray,
>> factoring out common code that can be reused into helper inline
>> functions.
>>
>> Class hierarchy aside the main impact is adding entries to the Type
>> union in the Flatbuffers metadata
>> https://github.com/apache/arrow/blob/master/format/Message.fbs#L63
>>
>> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>> would be a single array unit in the buffer stream and flattened Field
>> metadata rather than nested types (2 array units as they are
>> presently).
>>
>> Separately, I am very interested in discussing a form of logical
>> Binary/StringArray in the C++ implementation that is internally
>> dictionary encoded. I'm proposing this as a possible new UTF-8
>> representation for pandas in the future:
>> https://wesm.github.io/pandas2-design/strings.html#possible-solution-new-non-numpy-string-memory-layout
>>
>> Hopefully this isn't too incoherent, but it would be good to arrive at
>> some conclusion in this discussion if we need to implement the
>> changes.
>>
>> Thanks
>> Wes
>>
>> On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <em...@gmail.com>
>> wrote:
>> > Wes, Jacques, others...
>> >
>> > Any thoughts on this?   Let me know if you would like to clarify something,
>> > I think I was a little long winded.  It would be good to come to a
>> > consensus one way or another.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <emkornfield@gmail.com>
>> > wrote:
>> >
>> >> Hi Wes and Jacques,
>> >>
>> >> Thanks for the thorough analysis.  I agree that Strings should be easy
>> >> to work with.  I'm just trying to understand how making a distinct
>> >> string type defined in the memory layout spec [1] brings a lot of
>> >> additional utility.
>> >>
>> >> I think of there being two distinct concerns with Arrow:
>> >>
>> >> 1.  Layout - What metadata and data elements are required to represent
>> >> a specific type in a flat address space.
>> >>
>> >> 2.  Manipulation - How we build interfaces for working with the memory
>> >> layout.
>> >>
>> >> With respect to Memory Layout, introducing a new string type seems to
>> >> add redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
>> >> represent the layout for strings.  So the main benefit of introducing
>> >> a new memory layout for a string type is an optimization.  By
>> >> introducing the new type we avoid invalid string construction (having
>> >> uint8 elements marked as null in the nested array) and save a few
>> >> bytes/an extra function call when "(de)serializing" a string column.
>> >>
>> >> With respect to manipulation, I agree that having the right
>> >> API/modeling to treat strings as first class objects makes a lot of
>> >> sense.  But I don't think that the specification needs to explicitly
>> >> make allowances for it.  Once you have constructed a Java/C++ wrapper
>> >> around the memory layout you can choose to expose the right convenience
>> >> APIs through OO abstraction.  The construction of the correct object
>> >> wrapper is governed by Metadata defined in [2] and an understanding of
>> >> how the logical type maps to the appropriate memory layout.  At the
>> >> moment metadata doesn't specify any sort of class hierarchy, which I
>> >> believe is the correct thing to do from a specification perspective.
>> >>
>> >> The C++ implementation currently has StringArrays inheriting from
>> >> ListArrays, which was an implementation convenience and something we
>> >> should revisit (I agree with Wes's point on not relying on C++'s type
>> >> system for casting).
>> >> The primary argument for changing the existing implementation seems to
>> >> be that strings should be considered "non-nested" types.  Whether
>> >> strings are nested or not seems to fall squarely into the manipulation
>> >> concern (except for the optimizations mentioned above) and is
>> >> therefore, IMO, an implementation detail.  When thinking about how this
>> >> plays out in code I imagine a visitor pattern.  I've provided some
>> >> pseudo-code below for two possible visitor classes that make
>> >> StringArrays first class objects but wouldn't require updates to the
>> >> specification.
>> >>
>> >> I've tried to think where testing a particular object for "nested"-ness
>> >> makes sense by itself and couldn't come up with something off the top
>> >> of my head.  It seems once you determine an Array is non-nested you
>> >> still want to test for the exact primitive type you are dealing with.
>> >>
>> >> Given these points I'm still ambivalent about adding a new
>> >> string/binary type to the spec.  It would be an improvement but it
>> >> seems like a somewhat minor improvement.  If people can provide
>> >> stronger use-cases for adding the new type I'd be less ambivalent, but
>> >> at the moment this seems like more of an implementation concern.
>> >>
>> >> Thanks,
>> >> Micah
>> >>
>> >> // Visitor patterns for arrays, that do not require any updates to the
>> >> memory layout.
>> >> class ClassVisitor {
>> >>     void visit(Int32Array );
>> >>     void visit(UInt32Array );
>> >>     void visit(DoubleArray );
>> >>     void visit(ListArray );
>> >>     void visit(StringArray ); // if we changed the hierarchy, this
>> would
>> >> be sufficient to treat strings as a first class type
>> >>     // Other types elided
>> >> }
>> >>
>> >> or
>> >>
>> >> class BufferVisitor { // type disambiguation happens by calling the
>> >> correctly
>> >>                                 // overloaded method
>> >>     void visit_numeric(TypeMedata, null_bitmap, value_buffer);
>> >>     void visit_list(TypeMedata, null_bitmap, offset_buffer, Array
>> >> nested_type);
>> >>     void visit_string(TypeMetadata, null_bitmap, offset_buffer,
>> >> byte_buffer); // sufficient for treating string types as non-nested.
>> >>     // Other types elided.
>> >> }
>> >>
>> >> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>> >> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>> >>
>> >>
>>
>
>
>
> --
> Julien
>
>


-- 
Julien

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Wes McKinney <we...@gmail.com>.

I see the primary point of discussion on this to be whether String/Binary have the same layout on the wire as List<uint8-not null> (i.e. one Field/Node in the type tree versus two). I think what we are working towards is a single field rather than a List field and an Int field (bit width 8). 

> On Aug 10, 2016, at 10:46 AM, Julien Le Dem <ju...@dremio.com> wrote:
> 
> Hi,
> Agreed. 
> To paraphrase/complement what has been said:
> The types in format/Message.fbs [1] are "Logical types" or "user facing types", close to SQL types (they include String, Timestamp, Decimal, ...) and are related to Parquet's logical types [2][3].
> For each of those types there's a corresponding physical layout that is formally specified (example discussed here: String => List<UInt8-not null>).
> I'm going to open a couple of JIRA's to finalise the types and clarify the layout.
> 
> [1] https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L63
> [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
> [3] https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L48
> 
> Julien

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Julien Le Dem <ju...@dremio.com>.

Hi,
Agreed.
To paraphrase/complement what has been said:
The types in format/Message.fbs [1] are "Logical types" or "user facing
types", close to SQL types (they include String, Timestamp, Decimal, ...)
and are related to Parquet's logical types [2][3].
For each of those types there's a corresponding physical layout that is
formally specified (example discussed here: String => List<UInt8-not null>).
I'm going to open a couple of JIRAs to finalise the types and clarify the
layout.

[1]
https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L63
[2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
[3]
https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L48

Julien

On Tue, Aug 9, 2016 at 4:20 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Micah
>
> I'm sorry for dropping the ball on this discussion. copying Julien as
> he's been looking at the metadata recently.
>
> My thinking is that we should indicate in the format document that the
> String and Binary logical types, as a matter of cross-implementation
> convention, will have List<UInt8-not null> memory layout.
>
> In the C++ library at least, we can collapse the class structure to
> make BinaryArray and StringArray not a subclass of ListArray,
> factoring out common code that can be reused into helper inline
> functions.
>
> Class hierarchy aside the main impact is adding entries to the Type
> union in the Flatbuffers metadata
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L63
>
> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
> would be a single array unit in the buffer stream and flattened Field
> metadata rather than nested types (2 array units as they are
> presently).
>
> Separately, I am very interested in discussing a form of logical
> Binary/StringArray in the C++ implementation that is internally
> dictionary encoded. I'm proposing this as a possible new UTF-8
> representation for pandas in the future:
> https://wesm.github.io/pandas2-design/strings.html#
> possible-solution-new-non-numpy-string-memory-layout
>
> Hopefully this isn't too incoherent, but it would be good to arrive at
> some conclusion in this discussion if we need to implement the
> changes.
>
> Thanks
> Wes



-- 
Julien

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Wes McKinney <we...@gmail.com>.

hi Micah

I'm sorry for dropping the ball on this discussion. copying Julien as
he's been looking at the metadata recently.

My thinking is that we should indicate in the format document that the
String and Binary logical types, as a matter of cross-implementation
convention, will have List<UInt8-not null> memory layout.

In the C++ library at least, we can collapse the class structure to
make BinaryArray and StringArray not a subclass of ListArray,
factoring out common code that can be reused into helper inline
functions.

Class hierarchy aside the main impact is adding entries to the Type
union in the Flatbuffers metadata
https://github.com/apache/arrow/blob/master/format/Message.fbs#L63

Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
would be a single array unit in the buffer stream and flattened Field
metadata rather than nested types (2 array units as they are
presently).
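
To illustrate the "one array unit versus two" point, here is a minimal
C++ sketch. The Field struct and the helper names are hypothetical and
much simpler than the real Flatbuffers metadata in Message.fbs; the sketch
only counts nodes in a type tree under that assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a Field in the type tree.
struct Field {
  std::string type;
  bool nullable;
  std::vector<Field> children;
};

// Counts the nodes (array units) that a field contributes to the tree.
int CountNodes(const Field& field) {
  int count = 1;
  for (const Field& child : field.children) {
    count += CountNodes(child);
  }
  return count;
}

// Present representation: a List field with a non-nullable UInt8 child.
Field NestedString() {
  return Field{"List", true, {Field{"UInt8", false, {}}}};
}

// Proposed representation: a single String field with no children.
Field FlatString() {
  return Field{"String", true, {}};
}
```

Under these assumptions CountNodes(NestedString()) is 2 and
CountNodes(FlatString()) is 1, which is the difference a reader or writer
of the buffer stream would see.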

Separately, I am very interested in discussing a form of logical
Binary/StringArray in the C++ implementation that is internally
dictionary encoded. I'm proposing this as a possible new UTF-8
representation for pandas in the future:
https://wesm.github.io/pandas2-design/strings.html#possible-solution-new-non-numpy-string-memory-layout
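
As a rough illustration of what "internally dictionary encoded" could
mean, here is a minimal C++ sketch. It is not the pandas2 or Arrow layout;
DictionaryEncoded and DictionaryEncode are hypothetical names, and plain
std::string values stand in for the offsets-plus-bytes buffers.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container: distinct values plus one index per input slot.
struct DictionaryEncoded {
  std::vector<std::string> dictionary;  // distinct values, first-seen order
  std::vector<int32_t> indices;         // position in `dictionary` per slot
};

// Replaces each value by the index of its first occurrence.
DictionaryEncoded DictionaryEncode(const std::vector<std::string>& values) {
  DictionaryEncoded out;
  std::unordered_map<std::string, int32_t> seen;
  for (const std::string& value : values) {
    auto it = seen.find(value);
    if (it == seen.end()) {
      const int32_t index = static_cast<int32_t>(out.dictionary.size());
      it = seen.emplace(value, index).first;
      out.dictionary.push_back(value);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```

Repeated strings are stored once in the dictionary, and slot i is
recovered as dictionary[indices[i]].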

Hopefully this isn't too incoherent, but it would be good to arrive at
some conclusion in this discussion if we need to implement the
changes.

Thanks
Wes

On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <em...@gmail.com> wrote:
> Wes, Jacques, others...
>
> Any thoughts on this?   Let me know if you would like to clarify something,
> I think I was a little long winded.  It would be good to come to a
> consensus one way or another.
>
> Thanks,
> Micah

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Micah Kornfield <em...@gmail.com>.

Wes, Jacques, others...

Any thoughts on this?   Let me know if you would like me to clarify something;
I think I was a little long-winded.  It would be good to come to a
consensus one way or another.

Thanks,
Micah

On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <em...@gmail.com>
wrote:

> Hi Wes and Jacques,
>
> Thanks for the thorough analysis.  I agree that Strings should be easy to
> work with.  I'm just trying to understand how making a distinct string type
> defined in the memory layout spec [1] brings a lot of additional utility.
>
> I think of there being two distinct concerns with Arrow:
>
> 1.  Layout - What metadata and data elements are required to represent a
> specific type in a flat address space.
>
> 2.  Manipulation - How we build interfaces for working with the memory
> layout.
>
> With respect to Memory Layout, introducing a new string type seems to add
> redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
> represent the layout for strings.  So the main benefits for introducing a
> new memory layout for a string type is an optimization.  By introducing the
> new type we avoid invalid string construction (having uint_t elements
> marked as null in the nested array) and to save a few bytes/extra function
> call when "(de)serializing" a string column.
>
> With respect to manipulation, I agree, that having the right API/modeling
> to treat strings as first class objects makes a lot of sense.   But I don't
> think that the specification needs to explicitly make allowances for it.
> Once you have constructed a Java/C++ wrapper around the memory layout you
> can choose to expose the right convenience APIs through OO abstraction.
> The construction of the correct object wrapper is governed by Metadata
> defined in [2] and an understanding of how the logical type maps to the
> appropriate memory layout.  At the moment metadata doesn't specify any sort
> of class hierarchy which I believe is the correct thing to do from a
> specification perspective.
>
> The C++ implementation currently has StringArrays inheriting from a
> ListArrays which was an implementation convenience and something we should
> revisit (I agree with Wes's point on not relying on  C++'s type system for
> casting).
> The primary argument for changing the existing implementation seems to be
> that strings should be considered "non-nested" types.  Whether strings are
> nested or not seems to fall squarely into the manipulation concern (except
> for the optimizations mentioned above) and therefore, IMO, an
> implementation detail.     When thinking about how this plays out in code
> I imagine a visitor pattern.  I've provided some pseudo-code below for two
> possible visitor classes make StringArrays first class objects but wouldn't
> require updates to the specification.
>
> I've tried to think where testing a particular object for "nested"-ness
> makes sense by itself and couldn't come up with something off the top of my
> head.  It seems once you determine an Array is non-nested you still want to
> test for exact primitive type you are dealing with.
>
> Given these points I'm still ambivalent about adding a new string/binary
> type to the spec. It would be an improvement but it seems like a somewhat
> minor improvement.  If people can provide stronger use-cases for adding the
> new type I'd be less ambivalent, but at the moment this seems like more of
> an implementation concern.
>
> Thanks,
> Micah
>
> // Visitor patterns for arrays, that do not require any updates to the
> memory layout.
> class ClassVisitor {
>     void visit(Int32Array );
>     void visit(UInt32Array );
>     void visit(DoubleArray );
>     void visit(ListArray );
>     void visit(StringArray ); // if we changed the hierarchy, this would
> be sufficient to treat strings as a first class type
>     // Other types elided
> }
>
> or
>
> class BufferVisitor { // type disambiguation happens by calling the
> correctly
>                                 // overloaded method
>     void visit_numeric(TypeMedata, null_bitmap, value_buffer);
>     void visit_list(TypeMedata, null_bitmap, offset_buffer, Array
> nested_type);
>     void visit_string(TypeMetadata, null_bitmap, offset_buffer,
> byte_buffer); // sufficient for treating string types as non-nested.
>     // Other types elided.
> }
>
> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>
>

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Micah Kornfield <em...@gmail.com>.

Hi Wes and Jacques,

Thanks for the thorough analysis.  I agree that Strings should be easy to
work with.  I'm just trying to understand how making a distinct string type
defined in the memory layout spec [1] brings a lot of additional utility.

I think of there being two distinct concerns with Arrow:

1.  Layout - What metadata and data elements are required to represent a
specific type in a flat address space.

2.  Manipulation - How we build interfaces for working with the memory
layout.

With respect to Memory Layout, introducing a new string type seems to add
redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
represent the layout for strings.  So the main benefit of introducing a
new memory layout for a string type is an optimization.  By introducing the
new type we avoid invalid string construction (having uint8 elements
marked as null in the nested array) and save a few bytes and an extra
function call when "(de)serializing" a string column.
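
To make the layout point concrete, here is a minimal C++ sketch. These are
not Arrow's actual classes: StringColumn, ListOfUInt8Column and the helper
functions are hypothetical, and validity is kept as one bool per slot
rather than a packed bitmap for readability.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// First-class string column: slot validity, offsets, and raw bytes.
struct StringColumn {
  std::vector<bool> valid;          // slot-level nulls only
  std::vector<int32_t> offsets{0};  // offsets[i]..offsets[i + 1] is slot i
  std::vector<uint8_t> bytes;       // UTF-8 data, no per-byte validity
};

// List<uint8> column: the same buffers plus validity for the child array,
// which is what allows the "null byte inside a string" state.
struct ListOfUInt8Column {
  std::vector<bool> valid;
  std::vector<int32_t> offsets{0};
  std::vector<bool> child_valid;    // one entry per byte
  std::vector<uint8_t> child_values;
};

void Append(StringColumn* col, const std::string& value) {
  col->valid.push_back(true);
  col->bytes.insert(col->bytes.end(), value.begin(), value.end());
  col->offsets.push_back(static_cast<int32_t>(col->bytes.size()));
}

void AppendNull(StringColumn* col) {
  col->valid.push_back(false);
  col->offsets.push_back(col->offsets.back());  // zero-length slot
}

std::string GetString(const StringColumn& col, std::size_t i) {
  return std::string(col.bytes.begin() + col.offsets[i],
                     col.bytes.begin() + col.offsets[i + 1]);
}

// True if slot i of the list column contains a byte marked as null,
// i.e. a value that has no meaning as a string.
bool SlotHasNullByte(const ListOfUInt8Column& col, std::size_t i) {
  for (int32_t j = col.offsets[i]; j < col.offsets[i + 1]; ++j) {
    if (!col.child_valid[j]) return true;
  }
  return false;
}
```

In this sketch the string column simply has no place to record a null
byte, while the list column needs a check such as SlotHasNullByte, or a
convention that the child is never null.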

With respect to manipulation, I agree that having the right API/modeling
to treat strings as first class objects makes a lot of sense.   But I don't
think that the specification needs to explicitly make allowances for it.
Once you have constructed a Java/C++ wrapper around the memory layout you
can choose to expose the right convenience APIs through OO abstraction.
The construction of the correct object wrapper is governed by Metadata
defined in [2] and an understanding of how the logical type maps to the
appropriate memory layout.  At the moment metadata doesn't specify any sort
of class hierarchy which I believe is the correct thing to do from a
specification perspective.

The C++ implementation currently has StringArray inheriting from
ListArray, which was an implementation convenience and something we should
revisit (I agree with Wes's point on not relying on C++'s type system for
casting).
The primary argument for changing the existing implementation seems to be
that strings should be considered "non-nested" types.  Whether strings are
nested or not seems to fall squarely into the manipulation concern (except
for the optimizations mentioned above) and therefore, IMO, an
implementation detail.  When thinking about how this plays out in code I
imagine a visitor pattern.  I've provided some pseudo-code below for two
possible visitor classes that make StringArrays first-class objects but
wouldn't require updates to the specification.

I've tried to think where testing a particular object for "nested"-ness
makes sense by itself and couldn't come up with something off the top of my
head.  It seems that once you determine an Array is non-nested you still want
to test for the exact primitive type you are dealing with.

Given these points I'm still ambivalent about adding a new string/binary
type to the spec. It would be an improvement but it seems like a somewhat
minor improvement.  If people can provide stronger use-cases for adding the
new type I'd be less ambivalent, but at the moment this seems like more of
an implementation concern.

Thanks,
Micah

// Visitor patterns for arrays that do not require any updates to the
// memory layout.
class ClassVisitor {
    void visit(Int32Array);
    void visit(UInt32Array);
    void visit(DoubleArray);
    void visit(ListArray);
    // If we changed the hierarchy, this would be sufficient to treat
    // strings as a first-class type.
    void visit(StringArray);
    // Other types elided.
};

or

// Type disambiguation happens by calling the correct method.
class BufferVisitor {
    void visit_numeric(TypeMetadata, null_bitmap, value_buffer);
    void visit_list(TypeMetadata, null_bitmap, offset_buffer,
                    Array nested_type);
    // Sufficient for treating string types as non-nested.
    void visit_string(TypeMetadata, null_bitmap, offset_buffer,
                      byte_buffer);
    // Other types elided.
};
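
For illustration, the ClassVisitor idea above can be sketched as a small
self-contained C++ example. This is only a sketch under assumptions: the
Array, ArrayVisitor and TypeNameVisitor classes here are hypothetical and
are not the classes in the Arrow C++ library. StringArray derives from
Array directly rather than from ListArray, and the visitor still treats it
as its own type.

```cpp
#include <memory>
#include <string>
#include <utility>

class Int32Array;
class ListArray;
class StringArray;

// Double-dispatch visitor: one overload per concrete array class.
class ArrayVisitor {
 public:
  virtual ~ArrayVisitor() = default;
  virtual void Visit(const Int32Array& array) = 0;
  virtual void Visit(const ListArray& array) = 0;
  virtual void Visit(const StringArray& array) = 0;
};

class Array {
 public:
  virtual ~Array() = default;
  virtual void Accept(ArrayVisitor* visitor) const = 0;
};

class Int32Array : public Array {
 public:
  void Accept(ArrayVisitor* visitor) const override { visitor->Visit(*this); }
};

class ListArray : public Array {
 public:
  explicit ListArray(std::shared_ptr<Array> values)
      : values_(std::move(values)) {}
  const Array& values() const { return *values_; }
  void Accept(ArrayVisitor* visitor) const override { visitor->Visit(*this); }

 private:
  std::shared_ptr<Array> values_;
};

// Not a subclass of ListArray: strings are a non-nested type here.
class StringArray : public Array {
 public:
  void Accept(ArrayVisitor* visitor) const override { visitor->Visit(*this); }
};

// Example visitor that builds a printable type name.
class TypeNameVisitor : public ArrayVisitor {
 public:
  void Visit(const Int32Array&) override { name_ = "int32"; }
  void Visit(const ListArray& array) override {
    TypeNameVisitor child;
    array.values().Accept(&child);  // recurse into the nested array
    name_ = "list<" + child.name() + ">";
  }
  void Visit(const StringArray&) override { name_ = "string"; }
  const std::string& name() const { return name_; }

 private:
  std::string name_;
};

std::string TypeName(const Array& array) {
  TypeNameVisitor visitor;
  array.Accept(&visitor);
  return visitor.name();
}
```

No dynamic_pointer_cast or inheritance from ListArray is needed for the
visitor to handle strings; the dispatch depends only on the concrete class.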

[1] https://github.com/apache/arrow/blob/master/format/Layout.md
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Jacques Nadeau <ja...@apache.org>.

I'm +1 on what Wes said.

(and +1 on what I said... jk :)

I'll actually be offline most of next week but would like to continue to be
part of the discussion, so I'll do my best to check in.  Let's hold off on
making any formal decisions until next week if that is okay.

-j

On Fri, Jul 15, 2016 at 8:19 AM, Wes McKinney <we...@gmail.com> wrote:

> There are 3 distinct issues here:
>
> 1) Physical memory representation
> 2) Metadata
> 3) Implementation details
>
> On these
>
> 1) I think no one will dispute that String/Binary have the same memory
> representation as List<uint8 [not-null]>, and that, regardless of the
> implementation, you can perform a zero-copy cast, changing only the
> array container metadata without copying or duplicating buffers.
>
> 2) I'm +1 on String/Binary being logically first-class primitive
> types, with the intent that they are not considered logically nested
> types (but you can perform the cast described in #1 if you want to get
> nested data without copying).
>
> 3) The C++ code sharing / duplication issue feels slightly orthogonal
> to the above two items, which are about user semantics and metadata.
> Effectively what would change is that
> std::dynamic_pointer_cast<ListArray>(string_data) would no longer be
> valid, as in the class hierarchy we would have
>
>
> - Primitive
>   - Integer
>   - Floating
>   - String
>   - ...
> - List
> - Struct
> - Union
>
> rather than the present
>
> - List
>   - String (with the type metadata always set to List<uint8 [not-null]>)
>
> From a coding point of view, I should think we would eventually want
> explicit casts that do not presume a certain C++ inheritance
> hierarchy, which might cause downstream code brittleness. Hard to
> predict this precisely at this moment.
>
> - Wes
>

Re: Discussion: Should we make string/binary types first class Arrow Array types?

Posted by Wes McKinney <we...@gmail.com>.

There are 3 distinct issues here:

1) Physical memory representation
2) Metadata
3) Implementation details

On these

1) I think no one will dispute that String/Binary have the same memory
representation as List<uint8 [not-null]>, and that, regardless of the
implementation, you can perform a zero-copy cast, changing only the
array container metadata without copying or duplicating buffers.

2) I'm +1 on String/Binary being logically first-class primitive
types, with the intent that they are not considered logically nested
types (but you can perform the cast described in #1 if you want to get
nested data without copying).

3) The C++ code sharing / duplication issue feels slightly orthogonal
to the above two items, which are about user semantics and metadata.
Effectively what would change is that
std::dynamic_pointer_cast<ListArray>(string_data) would no longer be
valid, as in the class hierarchy we would have

- Primitive
  - Integer
  - Floating
  - String
  - ...
- List
- Struct
- Union

rather than the present

- List
  - String (with the type metadata always set to List<uint8 [not-null]>)

From a coding point of view, I should think we would eventually want
explicit casts that do not presume a certain C++ inheritance
hierarchy, which might cause downstream code brittleness. Hard to
predict this precisely at this moment.
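
As a rough illustration of the zero-copy cast in #1 and the explicit casts
mentioned in #3, here is a minimal sketch.  Every name in it (Buffer,
UInt8Array, ListOfUInt8Array, StringArray, AsList, AsString) is a
hypothetical, simplified container made up for this example and not the
Arrow C++ API, and a null null_bitmap pointer is assumed to mean "no
nulls".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical simplified containers; not the Arrow C++ API.
using Buffer = std::vector<uint8_t>;
using Offsets = std::vector<int32_t>;

struct UInt8Array {  // child values of the list, all non-null
  std::shared_ptr<Buffer> data;
};

struct ListOfUInt8Array {  // List<uint8 [not-null]>
  std::shared_ptr<Buffer> null_bitmap;  // assumption: nullptr means no nulls
  std::shared_ptr<Offsets> offsets;
  UInt8Array values;
};

struct StringArray {  // first-class string: no child array
  std::shared_ptr<Buffer> null_bitmap;  // assumption: nullptr means no nulls
  std::shared_ptr<Offsets> offsets;
  std::shared_ptr<Buffer> bytes;

  // Copies out element i as a std::string (for inspection only).
  std::string Get(std::size_t i) const {
    const int32_t begin = (*offsets)[i];
    const int32_t end = (*offsets)[i + 1];
    return std::string(bytes->begin() + begin, bytes->begin() + end);
  }
};

// Explicit casts: all buffers are shared, only the container changes.
// Neither direction relies on StringArray inheriting from a list class.
ListOfUInt8Array AsList(const StringArray& s) {
  return ListOfUInt8Array{s.null_bitmap, s.offsets, UInt8Array{s.bytes}};
}

StringArray AsString(const ListOfUInt8Array& l) {
  return StringArray{l.null_bitmap, l.offsets, l.values.data};
}
```

Because the casts are free functions over shared buffers, downstream code
would not need to know where String sits in the class hierarchy.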

- Wes
