Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2016/05/01 09:58:52 UTC

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

I'm not exactly sure of my availability, but if I am available on spark, I
can likely make the hangout.

On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <we...@cloudera.com> wrote:
> I was traveling today but I can do a hangout about this next week.
>
> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <ja...@apache.org> wrote:
>> Let's do a quick hangout on this. I'd like to better understand as I'm not
>> sure we're all talking about the same thing.
>>
>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> I'm -1 on making a new primitive type in the memory layout spec [1].
>>>
>>> +1 on clarifying [2], to indicate it is expected that the "Values
>>> array" for Utf8 and Binary types should never contain null elements.
>>>
>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>
>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <we...@cloudera.com> wrote:
>>> > Bumping this conversation.
>>> >
>>> > I'm +0 on making VARBINARY and String (identical to VARBINARY but with a
>>> > UTF8 guarantee) primitive types in the spec. Let me know what others
>>> > think.
>>> >
>>> > Thanks
>>> >
>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <we...@cloudera.com> wrote:
>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <we...@cloudera.com> wrote:
>>> >>>
>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <emkornfield@gmail.com> wrote:
>>> >>>> > I like the current scheme of making String (UTF8) a primitive type
>>> >>>> > with regard to RPC but not modeling it as a special Array type.  I
>>> >>>> > think the key is formally describing how logical types map to
>>> >>>> > physical types, either in the Flatbuffer schema or in a separate
>>> >>>> > document.
>>> >>>> >
>>> >>>> > I think there are two use-cases here:
>>> >>>> > 1.  Reconstructing Arrays off the wire.
>>> >>>> > 2.  Writing algorithms/builders to deal with specific logical types
>>> >>>> > built on Arrays.
>>> >>>> >
>>> >>>> > For case 1, I think it is simpler to not special-case string types
>>> >>>> > as primitives.  Understanding that a logical String type maps to a
>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
>>> >>>> > serialization code for ListArrays for these types.
>>> >>>> >
>>> >>>>
>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques comment
>>> >>>> but one downside of having strings as a nested type is that there are
>>> >>>> certain code paths (for example: Parquet-related) which deal with the
>>> >>>> flat table case. To make a Parquet analogy, there is the special
>>> >>>> BYTE_ARRAY primitive type, even though you could technically represent
>>> >>>> variable-length binary data using a repeated field and using
>>> >>>> repetition/definition levels (but the encoding/decoding overhead for
>>> >>>> this in Parquet is much more significant than Arrow). There may be
>>> >>>> other reasons.
>>> >>>>
>>> >>>
>>> >>> I'm a bit confused about what everyone means. I didn't actually realize
>>> >>> that this [1] had been merged yet but I'm generally on board with how it
>>> >>> is constructed.
>>> >>>
>>> >>> With regards to the c++ implementation of the items at [1], abstracting
>>> >>> shared physical representations out seems fine to me but I don't think we
>>> >>> should necessitate effective 3NF for [1].
>>> >>>
>>> >>> One of the key points that I'm focused on in the Java space is that I'd
>>> >>> like to move to an always nullable pattern. This is vastly simplifying
>>> >>> from a code generation, casting and complexity perspective and is a
>>> >>> nominal cost when using column execution. If binary and varchar are
>>> >>> primitive types as there, there is no weird special casing of avoiding
>>> >>> the nullability bitmap in the case of variable width items (for the
>>> >>> offsets). But that is an implementation detail of the Java library.
>>> >>>
>>> >>> So in general, I like the scheme at [1] for the concepts that we all are
>>> >>> talking about (as opposed to eliminating lines 67 & 68)
>>> >>>
>>> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>> >>>
>>> >>
>>> >> Well, the issue is the mapping of metadata onto memory layout for IPC
>>> >> purposes, at least. You can use the List code path for arbitrary List
>>> >> types as well as strings and binary. It sounds like either way on the
>>> >> Java side you're going to collapse UTF8 / BINARY into a primitive so
>>> >> that you don't have to manage a separate never-used bitmap for the
>>> >> string/binary data. It seems useful enough to me to have a primitive
>>> >> variable-length binary/UTF8 type but I do not feel strongly about it.
>>> >>
>>> >>>
>>> >>>
>>> >>>> > For case 2, it would be nice to utilize the type system of the host
>>> >>>> > programming language to express the semantics of a function call
>>> >>>> > (e.g. ParseString(StringArray strings) vs ParseString(ListArray
>>> >>>> > strings)), but I think this can be implemented without requiring a
>>> >>>> > new primitive type in the spec.
>>> >>>> >
>>> >>>> > The more interesting thing to me is if we should have a new primitive
>>> >>>> > type for fixed length lists (e.g. the logical type CHAR).   The
>>> >>>> > offsets array isn't necessary in this case for random access.
>>> >>>> >
>>> >>>> > Also, the way the VARCHAR types are currently described as a null
>>> >>>> > terminated UTF8 (based on a comment in the C++ code:
>>> >>>> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
>>> >>>> > is problematic.  I believe null bytes are valid UTF8 characters.
>>> >>>>
>>> >>>>
>>> >>>> >
>>> >>>>
>>> >>>> Good point, sorry about that. We probably would need to length-prefix
>>> >>>> the values, then.
>>> >>>>
>>> >>>
>>> >>>
>>> >>> Is this an input/output interface? Arrow structures should all be 4 byte
>>> >>> offset based and be neither length prefixed nor null terminated.
>>> >>
>>> >> This was a question around the VARCHAR(k) type (which in many
>>> >> databases is distinct from a TEXT type in which any value can be
>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
>>> >> value exceeds 50 characters. In Arrow I suppose this is just metadata
>>> >> because you have the offsets encoding length (pardon the jet lag).
>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
>>> >> leftovers from my earliest draft implementation.
>>> >>
>>> >> - Wes
>>>

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Posted by Micah Kornfield <em...@gmail.com>.
"hello world" makes sense as a good place to start for general IPC integration.

I thought there was still some disconnect on how strings were going to
be represented.  That was the basis for my suggestion above.  But the
integer use-case bypasses these concerns for now.
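
For anyone skimming the thread, here is a rough sketch of the string
representation under discussion: one shared values buffer of UTF-8 bytes plus
4-byte offsets, with no length prefix and no null terminator. It is plain
Python for illustration only and is not the Arrow C++ or Java API. The helper
names (encode_utf8_column, get_string) are hypothetical, and validity is kept
as a list of booleans rather than a real bitmap to keep the example short.

```python
import struct


def encode_utf8_column(strings):
    """Pack strings into (offsets, values, validity) buffers.

    Hypothetical helper for illustration; not part of any Arrow library.
    """
    offsets = [0]
    values = bytearray()
    validity = []  # simplification: a list of bools instead of a bitmap
    for s in strings:
        if s is not None:
            values += s.encode("utf-8")
        validity.append(s is not None)
        # A null slot repeats the previous offset, so its length is zero.
        offsets.append(len(values))
    # 4-byte little-endian signed offsets: one more entry than there are slots.
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return offsets_buf, bytes(values), validity


def get_string(offsets_buf, values, validity, i):
    """Random access to slot i using only the two surrounding offsets."""
    if not validity[i]:
        return None
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return values[start:end].decode("utf-8")


# The third value contains an embedded NUL byte, which is valid UTF-8 and is
# why a null-terminated representation would be ambiguous.
offsets_buf, values, validity = encode_utf8_column(["ab", None, "c\x00d", ""])
```

The values buffer itself never carries nulls here; nullness lives only at the
level of the string slots, which matches the clarification proposed for the
"Values array" of Utf8 and Binary types.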

On Wed, May 25, 2016 at 2:09 PM, Jacques Nadeau <ja...@apache.org> wrote:
> By usecase, I really meant "hello world"
>
> On Wed, May 25, 2016 at 2:09 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>
>> Let's start by creating a simple usecase. For example, I would start with
>> a nullable 4 byte integer, maybe, and use the example of java > (col1) >
>> python (or c++) > (newcol) > java, which is what I'd call a single batch
>> algorithm (e.g. one batch of values in, one out).
>>
>> A simple way to sidestep the memory management/reference counting issues
>> initially is for java to preallocate the output location for newcol for the
>> python (or c++) code.
>>
>> On Wed, May 25, 2016 at 1:25 PM, Micah Kornfield <em...@gmail.com>
>> wrote:
>>>
>>> Just to follow-up on this.  I got distracted on a few other items on
>>> the C++ implementation side, but my next task is to get String types
>>> working for the C++ IPC unit test.   Once I send a PR for that, it
>>> might help clarify the concerns on both sides and we can hammer out
>>> the details from there.
>>>
>>> Sound reasonable?
>>>
>>> -Micah
>>>
>>> On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>> > Nudging this issue. We need to sketch out a plan to get IPC
>>> > integration tests working between the Java and C++ implementations --
>>> > what's the most expedient way we can work toward making that happen?
>>> >
>>> > On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <em...@gmail.com>
>>> > wrote:
>>> >> s/spark/slack/g
>>> >>
>>> >> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield
>>> >> <em...@gmail.com> wrote:
>>> >>> I'm not exactly sure of my availability if I am available on spark, I
>>> >>> can likely make the hangout.
>>
>>
>

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Posted by Jacques Nadeau <ja...@apache.org>.
By usecase, I really meant "hello world"

On Wed, May 25, 2016 at 2:09 PM, Jacques Nadeau <ja...@apache.org> wrote:

> Let's start by creating a simple usecase. For example, I would start with
> a nullable 4 byte integer, maybe, and use the example of java > (col1) >
> python (or c++) > (newcol) > java, which is what I'd call a single batch
> algorithm (e.g. one batch of values in, one out).
>
> A simple way to sidestep the memory management/reference counting issues
> initially is for java to preallocate the output location for newcol for the
> python (or c++) code.
>
> On Wed, May 25, 2016 at 1:25 PM, Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Just to follow-up on this.  I got distracted on a few other items on
>> the C++ implementation side, but my next task is to get String types
>> working for the C++ IPC unit test.   Once I send a PR for that, it
>> might help clarify the concerns on both sides and we can hammer out
>> the details from there.
>>
>> Sound reasonable?
>>
>> -Micah
>>
>> On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <we...@gmail.com>
>> wrote:
>> > Nudging this issue. We need to sketch out a plan to get IPC
>> > integration tests working between the Java and C++ implementations --
>> > what's the most expedient way we can work toward making that happen?
>> >
>> > On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <em...@gmail.com>
>> wrote:
>> >> s/spark/slack/g
>> >>
>> >> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <
>> emkornfield@gmail.com> wrote:
>> >>> I'm not exactly sure of my availability if I am available on spark, I
>> >>> can likely make the hangout.
>>
>
>

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Posted by Jacques Nadeau <ja...@apache.org>.
Let's start by creating a simple usecase. For example, I would start with
a nullable 4 byte integer, maybe, and use the example of java > (col1) >
python (or c++) > (newcol) > java, which is what I'd call a single batch
algorithm (e.g. one batch of values in, one out).

A simple way to sidestep the memory management/reference counting issues
initially is for java to preallocate the output location for newcol for the
python (or c++) code.
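
To make the shape of that "hello world" concrete, here is a minimal sketch in
plain Python, assuming a nullable integer column stored as a values buffer plus
a validity bitmap (least-significant bit first). It does not use any Arrow
library. The function name add_one_batch and the "add one" computation are
made up for illustration, and the caller plays the role of the Java side by
allocating the output buffers up front.

```python
from array import array


def add_one_batch(in_values, in_validity, out_values, out_validity):
    """Single-batch algorithm: one batch in, one batch out.

    Hypothetical consumer-side function. It writes only into buffers that the
    caller has already allocated, so it never owns or frees any memory.
    """
    n = len(in_values)
    if len(out_values) != n or len(out_validity) != len(in_validity):
        raise ValueError("output buffers must be preallocated to the input size")
    for i in range(n):
        byte, bit = divmod(i, 8)
        if (in_validity[byte] >> bit) & 1:
            out_values[i] = in_values[i] + 1
            out_validity[byte] |= 1 << bit
        # Null slots keep their output validity bit cleared; the value slot
        # is left untouched and should be ignored by readers.


# Caller side (standing in for Java): build col1, preallocate newcol.
# array("i") is a C int, which is 4 bytes on common platforms (an assumption).
col1 = array("i", [1, 0, 3])
col1_validity = bytearray([0b101])  # slots 0 and 2 valid, slot 1 null
newcol = array("i", [0] * len(col1))
newcol_validity = bytearray(len(col1_validity))

add_one_batch(col1, col1_validity, newcol, newcol_validity)
```

Because the output location is allocated by the caller, the callee needs no
reference counting at all, which is the sidestep described above.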

On Wed, May 25, 2016 at 1:25 PM, Micah Kornfield <em...@gmail.com>
wrote:

> Just to follow-up on this.  I got distracted on a few other items on
> the C++ implementation side, but my next task is to get String types
> working for the C++ IPC unit test.   Once I send a PR for that, it
> might help clarify the concerns on both sides and we can hammer out
> the details from there.
>
> Sound reasonable?
>
> -Micah
>
> On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <we...@gmail.com>
> wrote:
> > Nudging this issue. We need to sketch out a plan to get IPC
> > integration tests working between the Java and C++ implementations --
> > what's the most expedient way we can work toward making that happen?
> >
> > On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <em...@gmail.com>
> wrote:
> >> s/spark/slack/g
> >>
> >> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <em...@gmail.com>
> wrote:
> >>> I'm not exactly sure of my availability if I am available on spark, I
> >>> can likely make the hangout.
> >>>
> >>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <we...@cloudera.com>
> wrote:
> >>>> I was traveling today but I can do a hangout about this next week.
> >>>>
> >>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
> >>>>> Let's do a quick hangout on this. I'd like to better understand as
> I'm not
> >>>>> sure we're all talking about the same thing.
> >>>>>
> >>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <
> emkornfield@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> I'm -1 on making a new primitive type in the memory layout spec [1].
> >>>>>>
> >>>>>> +1 on clarifying [2], to indicate it is expected that the "Values
> >>>>>> array" for Utf8 and Binary types should never contain null elements.
> >>>>>>
> >>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> >>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
> >>>>>>
> >>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <we...@cloudera.com>
> wrote:
> >>>>>> > Bumping this conversation.
> >>>>>> >
> >>>>>> > I'm +0 on making VARBINARY and String (identical VARBINARY but
> with a
> >>>>>> > UTF8 guarantee) primitive types in the spec. Let me know what
> others
> >>>>>> > think.
> >>>>>> >
> >>>>>> > Thanks
> >>>>>> >
> >>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <we...@cloudera.com>
> wrote:
> >>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <
> jacques@apache.org>
> >>>>>> wrote:
> >>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <wes@cloudera.com
> >
> >>>>>> wrote:
> >>>>>> >>>
> >>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <
> >>>>>> emkornfield@gmail.com>
> >>>>>> >>>> wrote:
> >>>>>> >>>> > I like the current scheme of making String (UTF8) a
> primitive type
> >>>>>> in
> >>>>>> >>>> > regards to RPC but not modeling it as a special Array type.
> I think
> >>>>>> >>>> > the key is formally describing how logical types map to
> physical
> >>>>>> types
> >>>>>> >>>> > either is the Flatbuffer schema or in a separate document.
> >>>>>> >>>> >
> >>>>>> >>>> > I think there are two use-cases here:
> >>>>>> >>>> > 1.  Reconstructing Array's off the wire.
> >>>>>> >>>> > 2.  Writing algorithms/builders to deal with specific
> logical types
> >>>>>> >>>> > built on Arrays.
> >>>>>> >>>> >
> >>>>>> >>>> > For case 1, I think it is simpler to not special case string
> types
> >>>>>> as
> >>>>>> >>>> > primitives.  Understanding that a logical String type maps
> to a
> >>>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
> >>>>>> >>>> > serialization code for ListArrays for these types.
> >>>>>> >>>> >
> >>>>>> >>>>
> >>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques
> comment
> >>>>>> >>>> but one downside of having strings as a nested type is that
> there are
> >>>>>> >>>> certain code paths (for example: Parquet-related) which deal
> with the
> >>>>>> >>>> flat table case. To make a Parquet analogy, there is the
> special
> >>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically
> represent
> >>>>>> >>>> variable-length binary data using a repeated field and using
> >>>>>> >>>> repetition/definition levels (but the encoding/decoding
> overhead for
> >>>>>> >>>> this in Parquet is much more significant than Arrow). There
> may be
> >>>>>> >>>> other reasons.
> >>>>>> >>>>
> >>>>>> >>>
> >>>>>> >>> I'm a bit confused about what everyone means. I didn't actually
> realize
> >>>>>> >>> that this [1] had been merged yet but I'm generally on board
> with how
> >>>>>> it is
> >>>>>> >>> constructed.
> >>>>>> >>>
> >>>>>> >>> With regards to the c++ implementation of the items at [1], abstracting
> >>>>>> >>> shared physical representations out seems fine to me but I don't think
> >>>>>> >>> we should necessitate effective 3NF for [1].
> >>>>>> >>>
> >>>>>> >>> One of the key points that I'm focused on in the Java space is that I'd
> >>>>>> >>> like to move to an always nullable pattern. This is vastly simplifying
> >>>>>> >>> from a code generation, casting and complexity perspective and is a
> >>>>>> >>> nominal cost when using column execution. If binary and varchar are
> >>>>>> >>> primitive types, then there is no weird special casing of avoiding the
> >>>>>> >>> nullability bitmap in the case of variable width items (for the
> >>>>>> >>> offsets). But that is an implementation detail of the Java library.
> >>>>>> >>>
> >>>>>> >>> So in general, I like the scheme at [1] for the concepts that we all
> >>>>>> >>> are talking about (as opposed to eliminating lines 67 & 68)
> >>>>>> >>>
> >>>>>> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
> >>>>>> >>>
> >>>>>> >>
> >>>>>> >> Well, the issue is that mapping of metadata onto memory layout for IPC
> >>>>>> >> purposes, at least. You can use the List code path for arbitrary List
> >>>>>> >> types as well as strings and binary. It sounds like either way on the
> >>>>>> >> Java side you're going to collapse UTF8 / BINARY into a primitive so
> >>>>>> >> that you don't have to manage a separate never-used bitmap for the
> >>>>>> >> string/binary data. It seems useful enough to me to have a primitive
> >>>>>> >> variable-length binary/UTF8 type but I do not feel strongly about it.
> >>>>>> >>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>> > For case 2, it would be nice to utilize the type system of the host
> >>>>>> >>>> > programming language to express the semantics of a function call
> >>>>>> >>>> > (e.g. ParseString(StringArray strings) vs ParseString(ListArray
> >>>>>> >>>> > strings)), but I think this can be implemented without requiring a
> >>>>>> >>>> > new primitive type in the spec.
> >>>>>> >>>> >
> >>>>>> >>>> > The more interesting thing to me is if we should have a new
> >>>>>> >>>> > primitive type for fixed length lists (e.g. the logical type CHAR).
> >>>>>> >>>> > The offsets array isn't necessary in this case for random access.
> >>>>>> >>>> >
> >>>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the C++
> >>>>>> >>>> > (https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63))
> >>>>>> >>>> > are currently described as a null terminated UTF8 is problematic.  I
> >>>>>> >>>> > believe null bytes are valid UTF8 characters.
> >>>>>> >>>>
> >>>>>> >>>>
> >>>>>> >>>> >
> >>>>>> >>>>
> >>>>>> >>>> Good point, sorry about that. We probably would need to length-prefix
> >>>>>> >>>> the values, then.
> >>>>>> >>>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>> Is this an input/output interface? Arrow structures should all be 4
> >>>>>> >>> byte offset based and be neither length prefixed nor null terminated.
> >>>>>> >>
> >>>>>> >> This was a question around the VARCHAR(k) type (which in many
> >>>>>> >> databases is distinct from a TEXT type in which any value can be
> >>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
> >>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just metadata
> >>>>>> >> because you have the offsets encoding length (pardon the jet lag).
> >>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
> >>>>>> >> leftovers from my earliest draft implementation.
> >>>>>> >>
> >>>>>> >> - Wes
> >>>>>>
>
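
To make the offsets-based layout discussed in the quoted thread concrete,
here is a minimal Python sketch. It is not taken from the Arrow code base:
the function names are hypothetical, and the choice of little-endian signed
4-byte offsets is an assumption made only for illustration. It shows why a
value containing a NUL byte needs neither a terminator nor a length prefix
when each slot is delimited by a pair of offsets.

```python
import struct


def build_utf8_column(strings):
    """Pack Python strings into an (offsets buffer, values buffer) pair."""
    values = bytearray()
    offsets = [0]
    for s in strings:
        values += s.encode("utf-8")
        offsets.append(len(values))  # end of this slot == start of the next
    # Assumption for this sketch: offsets are 4-byte little-endian signed ints.
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return offsets_buf, bytes(values)


def get_value(offsets_buf, values, i):
    """Random access to slot i using only the two surrounding offsets."""
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return values[start:end].decode("utf-8")


# The third value embeds a NUL byte; the second is the empty string.
offsets_buf, values = build_utf8_column(["ab", "", "c\x00d"])
```

With this layout, get_value(offsets_buf, values, 2) recovers the string with
its embedded NUL intact, because the slot boundaries come from the offsets
rather than from the bytes themselves.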

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Posted by Micah Kornfield <em...@gmail.com>.
Just to follow up on this.  I got distracted by a few other items on
the C++ implementation side, but my next task is to get String types
working for the C++ IPC unit test.   Once I send a PR for that, it
might help clarify the concerns on both sides and we can hammer out
the details from there.

Sound reasonable?

-Micah

On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <we...@gmail.com> wrote:
> Nudging this issue. We need to sketch out a plan to get IPC
> integration tests working between the Java and C++ implementations --
> what's the most expedient way we can work toward making that happen?
>
> On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <em...@gmail.com> wrote:
>> s/spark/slack/g
>>
>> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <em...@gmail.com> wrote:
>>> I'm not exactly sure of my availability if I am available on spark, I
>>> can likely make the hangout.
>>>
>>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <we...@cloudera.com> wrote:
>>>> I was traveling today but I can do a hangout about this next week.
>>>>
>>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>>>> Let's do a quick hangout on this. I'd like to better understand as I'm not
>>>>> sure we're all talking about the same thing.
>>>>>
>>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <em...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I'm -1 on making a new primitive type in the memory layout spec [1].
>>>>>>
>>>>>> +1 on clarifying [2], to indicate it is expected that the "Values
>>>>>> array" for Utf8 and Binary types should never contain null elements.
>>>>>>
>>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>>>>
>>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <we...@cloudera.com> wrote:
>>>>>> > Bumping this conversation.
>>>>>> >
>>>>>> > I'm +0 on making VARBINARY and String (identical to VARBINARY but with a
>>>>>> > UTF8 guarantee) primitive types in the spec. Let me know what others
>>>>>> > think.
>>>>>> >
>>>>>> > Thanks
>>>>>> >
>>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <we...@cloudera.com> wrote:
>>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <ja...@apache.org>
>>>>>> wrote:
>>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <we...@cloudera.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <
>>>>>> emkornfield@gmail.com>
>>>>>> >>>> wrote:
>>>>>> >>>> > I like the current scheme of making String (UTF8) a primitive type
>>>>>> in
>>>>>> >>>> > regards to RPC but not modeling it as a special Array type.  I think
>>>>>> >>>> > the key is formally describing how logical types map to physical
>>>>>> types
>>>>>> >>>> > either in the Flatbuffer schema or in a separate document.
>>>>>> >>>> >
>>>>>> >>>> > I think there are two use-cases here:
>>>>>> >>>> > 1.  Reconstructing Arrays off the wire.
>>>>>> >>>> > 2.  Writing algorithms/builders to deal with specific logical types
>>>>>> >>>> > built on Arrays.
>>>>>> >>>> >
>>>>>> >>>> > For case 1, I think it is simpler to not special case string types
>>>>>> as
>>>>>> >>>> > primitives.  Understanding that a logical String type maps to a
>>>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
>>>>>> >>>> > serialization code for ListArrays for these types.
>>>>>> >>>> >
>>>>>> >>>>
>>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques comment
>>>>>> >>>> but one downside of having strings as a nested type is that there are
>>>>>> >>>> certain code paths (for example: Parquet-related) which deal with the
>>>>>> >>>> flat table case. To make a Parquet analogy, there is the special
>>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically represent
>>>>>> >>>> variable-length binary data using a repeated field and using
>>>>>> >>>> repetition/definition levels (but the encoding/decoding overhead for
>>>>>> >>>> this in Parquet is much more significant than Arrow). There may be
>>>>>> >>>> other reasons.
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>> I'm a bit confused about what everyone means. I didn't actually realize
>>>>>> >>> that this [1] had been merged yet but I'm generally on board with how
>>>>>> it is
>>>>>> >>> constructed.
>>>>>> >>>
>>>>>> >>> With regards to the c++ implementation of the items at [1], abstracting
>>>>>> >>> shared physical representations out seems fine to me but I don't think
>>>>>> we
>>>>>> >>> should necessitate effective 3NF for [1].
>>>>>> >>>
>>>>>> >>> One of the key points that I'm focused on in the Java space is that I'd
>>>>>> >>> like to move to an always nullable pattern. This is vastly simplifying
>>>>>> from
>>>>>> >>> a code generation, casting and complexity perspective and is a nominal
>>>>>> cost
>>>>>> >>> when using column execution. If binary and varchar are primitive types
>>>>>> >>> then there is no weird special casing of avoiding the nullability
>>>>>> bitmap
>>>>>> >>> in the case of variable width items (for the offsets). But that is an
>>>>>> >>> implementation detail of the Java library.
>>>>>> >>>
>>>>>> >>> So in general, I like the scheme at [1] for the concepts that we all
>>>>>> are
>>>>>> >>> talking about (as opposed to eliminating lines 67 & 68)
>>>>>> >>>
>>>>>> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>>>> >>>
>>>>>> >>
>>>>>> >> Well, the issue is that mapping of metadata onto memory layout for IPC
>>>>>> >> purposes, at least. You can use the List code path for arbitrary List
>>>>>> >> types as well as strings and binary. It sounds like either way on the
>>>>>> >> Java side you're going to collapse UTF8 / BINARY into a primitive so
>>>>>> >> that you don't have to manage a separate never-used bitmap for the
>>>>>> >> string/binary data. It seems useful enough to me to have a primitive
>>>>>> >> variable-length binary/UTF8 type but I do not feel strongly about it.
>>>>>> >>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>> > For case 2, it would be nice to utilize the type system of the host
>>>>>> >>>> > programming language to express the semantics of a function call
>>>>>> (e.g.
>>>>>> >>>> > ParseString(StringArray strings) vs ParseString(ListArray strings)),
>>>>>> >>>> > but I think this can be implemented without requiring a new
>>>>>> primitive
>>>>>> >>>> > type in the spec.
>>>>>> >>>> >
>>>>>> >>>> > The more interesting thing to me is if we should have a new
>>>>>> primitive
>>>>>> >>>> > type for fixed length lists (e.g. the logical type CHAR).   The
>>>>>> >>>> > offsets array isn't necessary in this case for random access.
>>>>>> >>>> >
>>>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the C++
>>>>>> >>>> > (https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63))
>>>>>> >>>> > are currently described as a null terminated UTF8 is problematic.  I
>>>>>> >>>> > believe null bytes are valid UTF8 characters.
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> >
>>>>>> >>>>
>>>>>> >>>> Good point, sorry about that. We probably would need to length-prefix
>>>>>> >>>> the values, then.
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Is this an input/output interface? Arrow structures should all be 4
>>>>>> byte
>>>>>> >>> offset based and be neither length prefixed nor null terminated.
>>>>>> >>
>>>>>> >> This was a question around the VARCHAR(k) type (which in many
>>>>>> >> databases is distinct from a TEXT type in which any value can be
>>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
>>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just metadata
>>>>>> >> because you have the offsets encoding length (pardon the jet lag).
>>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
>>>>>> >> leftovers from my earliest draft implementation.
>>>>>> >>
>>>>>> >> - Wes
>>>>>>
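
The two length-related points in the quoted thread (fixed length lists need
no offsets array, and a VARCHAR(k) bound is just metadata once offsets
encode the length) can be sketched in a few lines of Python. The helper
names below are hypothetical and the plain-list representation of offsets is
a simplification for illustration, not an Arrow API.

```python
def fixed_width_get(values, width, i):
    """CHAR(n)-style access: slot i starts at i * width, so no offsets array."""
    return values[i * width:(i + 1) * width]


def within_varchar_bound(offsets, values, k):
    """Check a VARCHAR(k)-style bound using offsets over a UTF-8 values buffer.

    The byte length of a slot is an upper bound on its character count, so a
    slot is only decoded when its byte length exceeds k.
    """
    for start, end in zip(offsets, offsets[1:]):
        if end - start <= k:
            continue
        if len(values[start:end].decode("utf-8")) > k:
            return False
    return True


# "héllo" is 5 characters but 6 bytes in UTF-8; "hi" is 2 bytes.
utf8_values = "héllo".encode("utf-8") + "hi".encode("utf-8")
utf8_offsets = [0, 6, 8]
```

Note that the bound in the thread is stated in characters, while offsets
measure bytes, which is why the check above falls back to decoding for
multi-byte values.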

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Posted by Wes McKinney <we...@gmail.com>.
Nudging this issue. We need to sketch out a plan to get IPC
integration tests working between the Java and C++ implementations --
what's the most expedient way we can work toward making that happen?

On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <em...@gmail.com> wrote:
> s/spark/slack/g
>
> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <em...@gmail.com> wrote:
>> I'm not exactly sure of my availability if I am available on spark, I
>> can likely make the hangout.
>>
>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <we...@cloudera.com> wrote:
>>> I was traveling today but I can do a hangout about this next week.
>>>
>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>>> Let's do a quick hangout on this. I'd like to better understand as I'm not
>>>> sure we're all talking about the same thing.
>>>>
>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <em...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm -1 on making a new primitive type in the memory layout spec [1].
>>>>>
>>>>> +1 on clarifying [2], to indicate it is expected that the "Values
>>>>> array" for Utf8 and Binary types should never contain null elements.
>>>>>
>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>>>
>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <we...@cloudera.com> wrote:
>>>>> > Bumping this conversation.
>>>>> >
>>>>> > I'm +0 on making VARBINARY and String (identical to VARBINARY but with a
>>>>> > UTF8 guarantee) primitive types in the spec. Let me know what others
>>>>> > think.
>>>>> >
>>>>> > Thanks
>>>>> >
>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <we...@cloudera.com> wrote:
>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <ja...@apache.org>
>>>>> wrote:
>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <we...@cloudera.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <
>>>>> emkornfield@gmail.com>
>>>>> >>>> wrote:
>>>>> >>>> > I like the current scheme of making String (UTF8) a primitive type
>>>>> in
>>>>> >>>> > regards to RPC but not modeling it as a special Array type.  I think
>>>>> >>>> > the key is formally describing how logical types map to physical
>>>>> types
>>>>> >>>> > either in the Flatbuffer schema or in a separate document.
>>>>> >>>> >
>>>>> >>>> > I think there are two use-cases here:
>>>>> >>>> > 1.  Reconstructing Arrays off the wire.
>>>>> >>>> > 2.  Writing algorithms/builders to deal with specific logical types
>>>>> >>>> > built on Arrays.
>>>>> >>>> >
>>>>> >>>> > For case 1, I think it is simpler to not special case string types
>>>>> as
>>>>> >>>> > primitives.  Understanding that a logical String type maps to a
>>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
>>>>> >>>> > serialization code for ListArrays for these types.
>>>>> >>>> >
>>>>> >>>>
>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques comment
>>>>> >>>> but one downside of having strings as a nested type is that there are
>>>>> >>>> certain code paths (for example: Parquet-related) which deal with the
>>>>> >>>> flat table case. To make a Parquet analogy, there is the special
>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically represent
>>>>> >>>> variable-length binary data using a repeated field and using
>>>>> >>>> repetition/definition levels (but the encoding/decoding overhead for
>>>>> >>>> this in Parquet is much more significant than Arrow). There may be
>>>>> >>>> other reasons.
>>>>> >>>>
>>>>> >>>
>>>>> >>> I'm a bit confused about what everyone means. I didn't actually realize
>>>>> >>> that this [1] had been merged yet but I'm generally on board with how
>>>>> it is
>>>>> >>> constructed.
>>>>> >>>
>>>>> >>> With regards to the c++ implementation of the items at [1], abstracting
>>>>> >>> shared physical representations out seems fine to me but I don't think
>>>>> we
>>>>> >>> should necessitate effective 3NF for [1].
>>>>> >>>
>>>>> >>> One of the key points that I'm focused on in the Java space is that I'd
>>>>> >>> like to move to an always nullable pattern. This is vastly simplifying
>>>>> from
>>>>> >>> a code generation, casting and complexity perspective and is a nominal
>>>>> cost
>>>>> >>> when using column execution. If binary and varchar are primitive types
>>>>> >>> then there is no weird special casing of avoiding the nullability
>>>>> bitmap
>>>>> >>> in the case of variable width items (for the offsets). But that is an
>>>>> >>> implementation detail of the Java library.
>>>>> >>>
>>>>> >>> So in general, I like the scheme at [1] for the concepts that we all
>>>>> are
>>>>> >>> talking about (as opposed to eliminating lines 67 & 68)
>>>>> >>>
>>>>> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>>> >>>
>>>>> >>
>>>>> >> Well, the issue is that mapping of metadata onto memory layout for IPC
>>>>> >> purposes, at least. You can use the List code path for arbitrary List
>>>>> >> types as well as strings and binary. It sounds like either way on the
>>>>> >> Java side you're going to collapse UTF8 / BINARY into a primitive so
>>>>> >> that you don't have to manage a separate never-used bitmap for the
>>>>> >> string/binary data. It seems useful enough to me to have a primitive
>>>>> >> variable-length binary/UTF8 type but I do not feel strongly about it.
>>>>> >>
>>>>> >>>
>>>>> >>>
>>>>> >>>> > For case 2, it would be nice to utilize the type system of the host
>>>>> >>>> > programming language to express the semantics of a function call
>>>>> (e.g.
>>>>> >>>> > ParseString(StringArray strings) vs ParseString(ListArray strings)),
>>>>> >>>> > but I think this can be implemented without requiring a new
>>>>> primitive
>>>>> >>>> > type in the spec.
>>>>> >>>> >
>>>>> >>>> > The more interesting thing to me is if we should have a new
>>>>> primitive
>>>>> >>>> > type for fixed length lists (e.g. the logical type CHAR).   The
>>>>> >>>> > offsets array isn't necessary in this case for random access.
>>>>> >>>> >
>>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the C++
>>>>> >>>> > (https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63))
>>>>> >>>> > are currently described as a null terminated UTF8 is problematic.  I
>>>>> >>>> > believe null bytes are valid UTF8 characters.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> >
>>>>> >>>>
>>>>> >>>> Good point, sorry about that. We probably would need to length-prefix
>>>>> >>>> the values, then.
>>>>> >>>>
>>>>> >>>
>>>>> >>>
>>>>> >>> Is this an input/output interface? Arrow structures should all be 4
>>>>> byte
>>>>> >>> offset based and be neither length prefixed nor null terminated.
>>>>> >>
>>>>> >> This was a question around the VARCHAR(k) type (which in many
>>>>> >> databases is distinct from a TEXT type in which any value can be
>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just metadata
>>>>> >> because you have the offsets encoding length (pardon the jet lag).
>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
>>>>> >> leftovers from my earliest draft implementation.
>>>>> >>
>>>>> >> - Wes
>>>>>
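
As an illustration of "case 2" in the quoted thread (using the host
language's type system to distinguish string arrays from generic list
arrays without adding a primitive type to the spec), here is a rough Python
sketch. The classes only borrow the names ListArray and StringArray from the
discussion; their fields and methods are assumptions for this example and do
not mirror the C++ or Java implementations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ListArray:
    """Physical layout: offsets into a flat values buffer."""
    offsets: List[int]
    values: bytes

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def slot(self, i: int) -> bytes:
        return self.values[self.offsets[i]:self.offsets[i + 1]]


@dataclass
class StringArray:
    """Logical UTF-8 view; storage is shared with the generic list layout."""
    storage: ListArray

    def __len__(self) -> int:
        return len(self.storage)

    def __getitem__(self, i: int) -> str:
        return self.storage.slot(i).decode("utf-8")


def parse_ints(strings: StringArray) -> List[int]:
    """The signature documents that slots are text, not arbitrary lists."""
    return [int(strings[i]) for i in range(len(strings))]


column = StringArray(ListArray(offsets=[0, 2, 3, 6], values=b"127100"))
```

Here the serialization code would only ever need to understand ListArray,
while functions such as parse_ints state their logical requirement through
the wrapper type.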

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Posted by Micah Kornfield <em...@gmail.com>.
s/spark/slack/g

On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <em...@gmail.com> wrote:
> I'm not exactly sure of my availability if I am available on spark, I
> can likely make the hangout.
>
> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <we...@cloudera.com> wrote:
>> I was traveling today but I can do a hangout about this next week.
>>
>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>> Let's do a quick hangout on this. I'd like to better understand as I'm not
>>> sure we're all talking about the same thing.
>>>
>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>>> I'm -1 on making a new primitive type in the memory layout spec [1].
>>>>
>>>> +1 on clarifying [2], to indicate it is expected that the "Values
>>>> array" for Utf8 and Binary types should never contain null elements.
>>>>
>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>>
>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <we...@cloudera.com> wrote:
>>>> > Bumping this conversation.
>>>> >
>>>> > I'm +0 on making VARBINARY and String (identical to VARBINARY but with a
>>>> > UTF8 guarantee) primitive types in the spec. Let me know what others
>>>> > think.
>>>> >
>>>> > Thanks
>>>> >
>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <we...@cloudera.com> wrote:
>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <ja...@apache.org>
>>>> wrote:
>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <we...@cloudera.com>
>>>> wrote:
>>>> >>>
>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <
>>>> emkornfield@gmail.com>
>>>> >>>> wrote:
>>>> >>>> > I like the current scheme of making String (UTF8) a primitive type
>>>> in
>>>> >>>> > regards to RPC but not modeling it as a special Array type.  I think
>>>> >>>> > the key is formally describing how logical types map to physical
>>>> types
>>>> >>>> > either in the Flatbuffer schema or in a separate document.
>>>> >>>> >
>>>> >>>> > I think there are two use-cases here:
>>>> >>>> > 1.  Reconstructing Arrays off the wire.
>>>> >>>> > 2.  Writing algorithms/builders to deal with specific logical types
>>>> >>>> > built on Arrays.
>>>> >>>> >
>>>> >>>> > For case 1, I think it is simpler to not special case string types
>>>> as
>>>> >>>> > primitives.  Understanding that a logical String type maps to a
>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
>>>> >>>> > serialization code for ListArrays for these types.
>>>> >>>> >
>>>> >>>>
>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques comment
>>>> >>>> but one downside of having strings as a nested type is that there are
>>>> >>>> certain code paths (for example: Parquet-related) which deal with the
>>>> >>>> flat table case. To make a Parquet analogy, there is the special
>>>> >>>> BYTE_ARRAY primitive type, even though you could technically represent
>>>> >>>> variable-length binary data using a repeated field and using
>>>> >>>> repetition/definition levels (but the encoding/decoding overhead for
>>>> >>>> this in Parquet is much more significant than Arrow). There may be
>>>> >>>> other reasons.
>>>> >>>>
>>>> >>>
>>>> >>> I'm a bit confused about what everyone means. I didn't actually realize
>>>> >>> that this [1] had been merged yet but I'm generally on board with how
>>>> it is
>>>> >>> constructed.
>>>> >>>
>>>> >>> With regards to the c++ implementation of the items at [1], abstracting
>>>> >>> shared physical representations out seems fine to me but I don't think
>>>> we
>>>> >>> should necessitate effective 3NF for [1].
>>>> >>>
>>>> >>> One of the key points that I'm focused on in the Java space is that I'd
>>>> >>> like to move to an always nullable pattern. This is vastly simplifying
>>>> from
>>>> >>> a code generation, casting and complexity perspective and is a nominal
>>>> cost
>>>> >>> when using column execution. If binary and varchar are primitive types
>>>> >>> then there is no weird special casing of avoiding the nullability
>>>> bitmap
>>>> >>> in the case of variable width items (for the offsets). But that is an
>>>> >>> implementation detail of the Java library.
>>>> >>>
>>>> >>> So in general, I like the scheme at [1] for the concepts that we all
>>>> are
>>>> >>> talking about (as opposed to eliminating lines 67 & 68)
>>>> >>>
>>>> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>> >>>
>>>> >>
>>>> >> Well, the issue is that mapping of metadata onto memory layout for IPC
>>>> >> purposes, at least. You can use the List code path for arbitrary List
>>>> >> types as well as strings and binary. It sounds like either way on the
>>>> >> Java side you're going to collapse UTF8 / BINARY into a primitive so
>>>> >> that you don't have to manage a separate never-used bitmap for the
>>>> >> string/binary data. It seems useful enough to me to have a primitive
>>>> >> variable-length binary/UTF8 type but I do not feel strongly about it.
>>>> >>
>>>> >>>
>>>> >>>
>>>> >>>> > For case 2, it would be nice to utilize the type system of the host
>>>> >>>> > programming language to express the semantics of a function call
>>>> (e.g.
>>>> >>>> > ParseString(StringArray strings) vs ParseString(ListArray strings)),
>>>> >>>> > but I think this can be implemented without requiring a new
>>>> primitive
>>>> >>>> > type in the spec.
>>>> >>>> >
>>>> >>>> > The more interesting thing to me is if we should have a new
>>>> primitive
>>>> >>>> > type for fixed length lists (e.g. the logical type CHAR).   The
>>>> >>>> > offsets array isn't necessary in this case for random access.
>>>> >>>> >
>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the C++
>>>> >>>> > (https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63))
>>>> >>>> > are currently described as a null terminated UTF8 is problematic.  I
>>>> >>>> > believe null bytes are valid UTF8 characters.
>>>> >>>>
>>>> >>>>
>>>> >>>> >
>>>> >>>>
>>>> >>>> Good point, sorry about that. We probably would need to length-prefix
>>>> >>>> the values, then.
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>> Is this an input/output interface? Arrow structures should all be 4
>>>> byte
>>>> >>> offset based and be neither length prefixed nor null terminated.
>>>> >>
>>>> >> This was a question around the VARCHAR(k) type (which in many
>>>> >> databases is distinct from a TEXT type in which any value can be
>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
>>>> >> value exceeds 50 characters. In Arrow I suppose this is just metadata
>>>> >> because you have the offsets encoding length (pardon the jet lag).
>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
>>>> >> leftovers from my earliest draft implementation.
>>>> >>
>>>> >> - Wes
>>>>