You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Edward Loper <ed...@google.com.INVALID> on 2019/07/29 14:01:06 UTC

[DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

The FixedSizeList type, which was added to Arrow a few months ago, is an
array where each slot contains a fixed-size sequence of values.  It is
specified as FixedSizeList<T>[N], where T is a child type and N is a signed
int32 that specifies the length of each list.

This is useful for encoding fixed-size tensors.  E.g., if I have a 100x8x10
tensor, then I can encode it as
FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].

But I'm also interested in encoding tensors where some dimension sizes are
not known in advance.  It seems to me that FixedSizeList could be extended
to support this fairly easily, by simply defining that N=-1 means "each
array slot has the same length, but that length is not known in advance."
 So e.g. we could encode a 100x?x10 tensor as
FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].

Since these N=-1 row-lengths are not encoded in the type, we need some way
to determine what they are.  Luckily, every Field in the schema has a
corresponding FieldNode in the message; and those FieldNodes can be used to
deduce the row lengths.  In particular, the row length must be equal to the
length of the child node divided by the length of the FixedSizeList.  E.g.,
if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3, 4],
[5, 6]] then the message representation is:

* Length: 3, Null count: 0
* Null bitmap buffer: Not required
* Values array (byte array):
    * Length: 6,  Null count: 0
    * Null bitmap buffer: Not required
    * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]

So we can deduce that the row length is 6/3=2.

It looks to me like it would be fairly easy to add support for this.  E.g.,
in the FixedSizeListArray constructor in c++, if list_type()->list_size()
is -1, then set list_size_ to values.length()/length.  There would be no
changes to the schema.fbs/message.fbs files -- we would just be assigning a
meaning to something that's currently meaningless (having
FixedSizeList.listSize=-1).

If there's support for adding this to Arrow, then I could put together a PR.

Thanks,
-Edward

P.S. Apologies if this gets posted twice -- I sent it out a couple days ago
right before subscribing to the mailing list; but I don't see it on the
archives, presumably because I wasn't subscribed yet when I sent it out.

Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

Posted by Edward Loper <ed...@google.com.INVALID>.
Brian: yes, you're correct.  Sorry, I've been playing around with a couple
different ways to extend things, and was conflating them when I wrote my
response.  For this proposal, the dimension must have the same size for all
items in a given record batch.

As suggested by Francois and Wes, I will look into using the ExtensionType
to implement this proposal.

-Edward

On Wed, Jul 31, 2019 at 1:00 PM Brian Hulette <bh...@apache.org> wrote:

> I'm a little confused about the proposal now. If the unknown dimension
> doesn't have to be the same within a record batch, how would you be able to
> deduce it with the approach you described (dividing the logical length of
> the values array by the length of the record batch)?
>
> On Wed, Jul 31, 2019 at 8:24 AM Wes McKinney <we...@gmail.com> wrote:
>
> > I agree this sounds like a good application for ExtensionType. At
> > minimum, ExtensionType can be used to develop a working version of
> > what you need to help guide further discussions.
> >
> > On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques
> > <fs...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > if each record has a different size, then I suggest to just use a
> > > Struct<Dim, List<T>> where Dim is a struct (or expand in the outer
> > > struct). You can probably add your own logic with the recently
> > > introduced ExtensionType [1].
> > >
> > > François
> > > [1]
> >
> https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h
> > >
> > > On Mon, Jul 29, 2019 at 3:15 PM Edward Loper
> <ed...@google.com.invalid>
> > wrote:
> > > >
> > > > The intention is that each individual record could have a different
> > size.
> > > > This could be consistent within a given batch, but wouldn't need to
> be.
> > > > For example, if I wanted to send a 3-channel image, but the image
> size
> > may
> > > > vary for each record, then I could use
> > > > FixedSizeList<FixedSizeList<FixedSizeList<Int8>[3]>[-1]>[-1].
> > > >
> > > > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette <bh...@apache.org>
> > wrote:
> > > >
> > > > > This isn't really relevant but I feel compelled to point it out -
> the
> > > > > FixedSizeList type has actually been in the Arrow spec for a while,
> > but it
> > > > > was only implemented in JS and Java initially. It was implemented
> in
> > C++
> > > > > just a few months ago.
> > > > >
> > > >
> > > > Thanks for the clarification -- I was going based on the blame
> history
> > for
> > > > Layout.rst, but I guess it just didn't get officially documented
> there
> > > > until the c++ implementation was added.
> > > >
> > > > -Edward
> > > >
> > > >
> > > > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper
> > <ed...@google.com.invalid>
> > > > > wrote:
> > > > >
> > > > > > The FixedSizeList type, which was added to Arrow a few months
> ago,
> > is an
> > > > > > array where each slot contains a fixed-size sequence of values.
> > It is
> > > > > > specified as FixedSizeList<T>[N], where T is a child type and N
> is
> > a
> > > > > signed
> > > > > > int32 that specifies the length of each list.
> > > > > >
> > > > > > This is useful for encoding fixed-size tensors.  E.g., if I have
> a
> > > > > 100x8x10
> > > > > > tensor, then I can encode it as
> > > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].
> > > > > >
> > > > > > But I'm also interested in encoding tensors where some dimension
> > sizes
> > > > > are
> > > > > > not known in advance.  It seems to me that FixedSizeList could be
> > > > > extended
> > > > > > to support this fairly easily, by simply defining that N=-1 means
> > "each
> > > > > > array slot has the same length, but that length is not known in
> > advance."
> > > > > >  So e.g. we could encode a 100x?x10 tensor as
> > > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
> > > > > >
> > > > > > Since these N=-1 row-lengths are not encoded in the type, we need
> > some
> > > > > way
> > > > > > to determine what they are.  Luckily, every Field in the schema
> > has a
> > > > > > corresponding FieldNode in the message; and those FieldNodes can
> > be used
> > > > > to
> > > > > > deduce the row lengths.  In particular, the row length must be
> > equal to
> > > > > the
> > > > > > length of the child node divided by the length of the
> > FixedSizeList.
> > > > > E.g.,
> > > > > > if we have a FixedSizeList<byte>[-1] array with the values [[1,
> > 2], [3,
> > > > > 4],
> > > > > > [5, 6]] then the message representation is:
> > > > > >
> > > > > > * Length: 3, Null count: 0
> > > > > > * Null bitmap buffer: Not required
> > > > > > * Values array (byte array):
> > > > > >     * Length: 6,  Null count: 0
> > > > > >     * Null bitmap buffer: Not required
> > > > > >     * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding
> bytes>]
> > > > > >
> > > > > > So we can deduce that the row length is 6/3=2.
> > > > > >
> > > > > > It looks to me like it would be fairly easy to add support for
> > this.
> > > > > E.g.,
> > > > > > in the FixedSizeListArray constructor in c++, if
> > list_type()->list_size()
> > > > > > is -1, then set list_size_ to values.length()/length.  There
> would
> > be no
> > > > > > changes to the schema.fbs/message.fbs files -- we would just be
> > > > > assigning a
> > > > > > meaning to something that's currently meaningless (having
> > > > > > FixedSizeList.listSize=-1).
> > > > > >
> > > > > > If there's support for adding this to Arrow, then I could put
> > together a
> > > > > > PR.
> > > > > >
> > > > > > Thanks,
> > > > > > -Edward
> > > > > >
> > > > > > P.S. Apologies if this gets posted twice -- I sent it out a
> couple
> > days
> > > > > ago
> > > > > > right before subscribing to the mailing list; but I don't see it
> > on the
> > > > > > archives, presumably because I wasn't subscribed yet when I sent
> > it out.
> > > > > >
> > > > >
> >
>

Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

Posted by Brian Hulette <bh...@apache.org>.
I'm a little confused about the proposal now. If the unknown dimension
doesn't have to be the same within a record batch, how would you be able to
deduce it with the approach you described (dividing the logical length of
the values array by the length of the record batch)?

On Wed, Jul 31, 2019 at 8:24 AM Wes McKinney <we...@gmail.com> wrote:

> I agree this sounds like a good application for ExtensionType. At
> minimum, ExtensionType can be used to develop a working version of
> what you need to help guide further discussions.
>
> On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques
> <fs...@gmail.com> wrote:
> >
> > Hello,
> >
> > if each record has a different size, then I suggest to just use a
> > Struct<Dim, List<T>> where Dim is a struct (or expand in the outer
> > struct). You can probably add your own logic with the recently
> > introduced ExtensionType [1].
> >
> > François
> > [1]
> https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h
> >
> > On Mon, Jul 29, 2019 at 3:15 PM Edward Loper <ed...@google.com.invalid>
> wrote:
> > >
> > > The intention is that each individual record could have a different
> size.
> > > This could be consistent within a given batch, but wouldn't need to be.
> > > For example, if I wanted to send a 3-channel image, but the image size
> may
> > > vary for each record, then I could use
> > > FixedSizeList<FixedSizeList<FixedSizeList<Int8>[3]>[-1]>[-1].
> > >
> > > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette <bh...@apache.org>
> wrote:
> > >
> > > > This isn't really relevant but I feel compelled to point it out - the
> > > > FixedSizeList type has actually been in the Arrow spec for a while,
> but it
> > > > was only implemented in JS and Java initially. It was implemented in
> C++
> > > > just a few months ago.
> > > >
> > >
> > > Thanks for the clarification -- I was going based on the blame history
> for
> > > Layout.rst, but I guess it just didn't get officially documented there
> > > until the c++ implementation was added.
> > >
> > > -Edward
> > >
> > >
> > > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper
> <ed...@google.com.invalid>
> > > > wrote:
> > > >
> > > > > The FixedSizeList type, which was added to Arrow a few months ago,
> is an
> > > > > array where each slot contains a fixed-size sequence of values.
> It is
> > > > > specified as FixedSizeList<T>[N], where T is a child type and N is
> a
> > > > signed
> > > > > int32 that specifies the length of each list.
> > > > >
> > > > > This is useful for encoding fixed-size tensors.  E.g., if I have a
> > > > 100x8x10
> > > > > tensor, then I can encode it as
> > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].
> > > > >
> > > > > But I'm also interested in encoding tensors where some dimension
> sizes
> > > > are
> > > > > not known in advance.  It seems to me that FixedSizeList could be
> > > > extended
> > > > > to support this fairly easily, by simply defining that N=-1 means
> "each
> > > > > array slot has the same length, but that length is not known in
> advance."
> > > > >  So e.g. we could encode a 100x?x10 tensor as
> > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
> > > > >
> > > > > Since these N=-1 row-lengths are not encoded in the type, we need
> some
> > > > way
> > > > > to determine what they are.  Luckily, every Field in the schema
> has a
> > > > > corresponding FieldNode in the message; and those FieldNodes can
> be used
> > > > to
> > > > > deduce the row lengths.  In particular, the row length must be
> equal to
> > > > the
> > > > > length of the child node divided by the length of the
> FixedSizeList.
> > > > E.g.,
> > > > > if we have a FixedSizeList<byte>[-1] array with the values [[1,
> 2], [3,
> > > > 4],
> > > > > [5, 6]] then the message representation is:
> > > > >
> > > > > * Length: 3, Null count: 0
> > > > > * Null bitmap buffer: Not required
> > > > > * Values array (byte array):
> > > > >     * Length: 6,  Null count: 0
> > > > >     * Null bitmap buffer: Not required
> > > > >     * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]
> > > > >
> > > > > So we can deduce that the row length is 6/3=2.
> > > > >
> > > > > It looks to me like it would be fairly easy to add support for
> this.
> > > > E.g.,
> > > > > in the FixedSizeListArray constructor in c++, if
> list_type()->list_size()
> > > > > is -1, then set list_size_ to values.length()/length.  There would
> be no
> > > > > changes to the schema.fbs/message.fbs files -- we would just be
> > > > assigning a
> > > > > meaning to something that's currently meaningless (having
> > > > > FixedSizeList.listSize=-1).
> > > > >
> > > > > If there's support for adding this to Arrow, then I could put
> together a
> > > > > PR.
> > > > >
> > > > > Thanks,
> > > > > -Edward
> > > > >
> > > > > P.S. Apologies if this gets posted twice -- I sent it out a couple
> days
> > > > ago
> > > > > right before subscribing to the mailing list; but I don't see it
> on the
> > > > > archives, presumably because I wasn't subscribed yet when I sent
> it out.
> > > > >
> > > >
>

Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

Posted by Wes McKinney <we...@gmail.com>.
I agree this sounds like a good application for ExtensionType. At
minimum, ExtensionType can be used to develop a working version of
what you need to help guide further discussions.

On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques
<fs...@gmail.com> wrote:
>
> Hello,
>
> if each record has a different size, then I suggest to just use a
> Struct<Dim, List<T>> where Dim is a struct (or expand in the outer
> struct). You can probably add your own logic with the recently
> introduced ExtensionType [1].
>
> François
> [1] https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h
>
> On Mon, Jul 29, 2019 at 3:15 PM Edward Loper <ed...@google.com.invalid> wrote:
> >
> > The intention is that each individual record could have a different size.
> > This could be consistent within a given batch, but wouldn't need to be.
> > For example, if I wanted to send a 3-channel image, but the image size may
> > vary for each record, then I could use
> > FixedSizeList<FixedSizeList<FixedSizeList<Int8>[3]>[-1]>[-1].
> >
> > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette <bh...@apache.org> wrote:
> >
> > > This isn't really relevant but I feel compelled to point it out - the
> > > FixedSizeList type has actually been in the Arrow spec for a while, but it
> > > was only implemented in JS and Java initially. It was implemented in C++
> > > just a few months ago.
> > >
> >
> > Thanks for the clarification -- I was going based on the blame history for
> > Layout.rst, but I guess it just didn't get officially documented there
> > until the c++ implementation was added.
> >
> > -Edward
> >
> >
> > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper <ed...@google.com.invalid>
> > > wrote:
> > >
> > > > The FixedSizeList type, which was added to Arrow a few months ago, is an
> > > > array where each slot contains a fixed-size sequence of values.  It is
> > > > specified as FixedSizeList<T>[N], where T is a child type and N is a
> > > signed
> > > > int32 that specifies the length of each list.
> > > >
> > > > This is useful for encoding fixed-size tensors.  E.g., if I have a
> > > 100x8x10
> > > > tensor, then I can encode it as
> > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].
> > > >
> > > > But I'm also interested in encoding tensors where some dimension sizes
> > > are
> > > > not known in advance.  It seems to me that FixedSizeList could be
> > > extended
> > > > to support this fairly easily, by simply defining that N=-1 means "each
> > > > array slot has the same length, but that length is not known in advance."
> > > >  So e.g. we could encode a 100x?x10 tensor as
> > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
> > > >
> > > > Since these N=-1 row-lengths are not encoded in the type, we need some
> > > way
> > > > to determine what they are.  Luckily, every Field in the schema has a
> > > > corresponding FieldNode in the message; and those FieldNodes can be used
> > > to
> > > > deduce the row lengths.  In particular, the row length must be equal to
> > > the
> > > > length of the child node divided by the length of the FixedSizeList.
> > > E.g.,
> > > > if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3,
> > > 4],
> > > > [5, 6]] then the message representation is:
> > > >
> > > > * Length: 3, Null count: 0
> > > > * Null bitmap buffer: Not required
> > > > * Values array (byte array):
> > > >     * Length: 6,  Null count: 0
> > > >     * Null bitmap buffer: Not required
> > > >     * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]
> > > >
> > > > So we can deduce that the row length is 6/3=2.
> > > >
> > > > It looks to me like it would be fairly easy to add support for this.
> > > E.g.,
> > > > in the FixedSizeListArray constructor in c++, if list_type()->list_size()
> > > > is -1, then set list_size_ to values.length()/length.  There would be no
> > > > changes to the schema.fbs/message.fbs files -- we would just be
> > > assigning a
> > > > meaning to something that's currently meaningless (having
> > > > FixedSizeList.listSize=-1).
> > > >
> > > > If there's support for adding this to Arrow, then I could put together a
> > > > PR.
> > > >
> > > > Thanks,
> > > > -Edward
> > > >
> > > > P.S. Apologies if this gets posted twice -- I sent it out a couple days
> > > ago
> > > > right before subscribing to the mailing list; but I don't see it on the
> > > > archives, presumably because I wasn't subscribed yet when I sent it out.
> > > >
> > >

Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

Posted by Francois Saint-Jacques <fs...@gmail.com>.
Hello,

if each record has a different size, then I suggest to just use a
Struct<Dim, List<T>> where Dim is a struct (or expand in the outer
struct). You can probably add your own logic with the recently
introduced ExtensionType [1].

François
[1] https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h

On Mon, Jul 29, 2019 at 3:15 PM Edward Loper <ed...@google.com.invalid> wrote:
>
> The intention is that each individual record could have a different size.
> This could be consistent within a given batch, but wouldn't need to be.
> For example, if I wanted to send a 3-channel image, but the image size may
> vary for each record, then I could use
> FixedSizeList<FixedSizeList<FixedSizeList<Int8>[3]>[-1]>[-1].
>
> On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette <bh...@apache.org> wrote:
>
> > This isn't really relevant but I feel compelled to point it out - the
> > FixedSizeList type has actually been in the Arrow spec for a while, but it
> > was only implemented in JS and Java initially. It was implemented in C++
> > just a few months ago.
> >
>
> Thanks for the clarification -- I was going based on the blame history for
> Layout.rst, but I guess it just didn't get officially documented there
> until the c++ implementation was added.
>
> -Edward
>
>
> > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper <ed...@google.com.invalid>
> > wrote:
> >
> > > The FixedSizeList type, which was added to Arrow a few months ago, is an
> > > array where each slot contains a fixed-size sequence of values.  It is
> > > specified as FixedSizeList<T>[N], where T is a child type and N is a
> > signed
> > > int32 that specifies the length of each list.
> > >
> > > This is useful for encoding fixed-size tensors.  E.g., if I have a
> > 100x8x10
> > > tensor, then I can encode it as
> > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].
> > >
> > > But I'm also interested in encoding tensors where some dimension sizes
> > are
> > > not known in advance.  It seems to me that FixedSizeList could be
> > extended
> > > to support this fairly easily, by simply defining that N=-1 means "each
> > > array slot has the same length, but that length is not known in advance."
> > >  So e.g. we could encode a 100x?x10 tensor as
> > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
> > >
> > > Since these N=-1 row-lengths are not encoded in the type, we need some
> > way
> > > to determine what they are.  Luckily, every Field in the schema has a
> > > corresponding FieldNode in the message; and those FieldNodes can be used
> > to
> > > deduce the row lengths.  In particular, the row length must be equal to
> > the
> > > length of the child node divided by the length of the FixedSizeList.
> > E.g.,
> > > if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3,
> > 4],
> > > [5, 6]] then the message representation is:
> > >
> > > * Length: 3, Null count: 0
> > > * Null bitmap buffer: Not required
> > > * Values array (byte array):
> > >     * Length: 6,  Null count: 0
> > >     * Null bitmap buffer: Not required
> > >     * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]
> > >
> > > So we can deduce that the row length is 6/3=2.
> > >
> > > It looks to me like it would be fairly easy to add support for this.
> > E.g.,
> > > in the FixedSizeListArray constructor in c++, if list_type()->list_size()
> > > is -1, then set list_size_ to values.length()/length.  There would be no
> > > changes to the schema.fbs/message.fbs files -- we would just be
> > assigning a
> > > meaning to something that's currently meaningless (having
> > > FixedSizeList.listSize=-1).
> > >
> > > If there's support for adding this to Arrow, then I could put together a
> > > PR.
> > >
> > > Thanks,
> > > -Edward
> > >
> > > P.S. Apologies if this gets posted twice -- I sent it out a couple days
> > ago
> > > right before subscribing to the mailing list; but I don't see it on the
> > > archives, presumably because I wasn't subscribed yet when I sent it out.
> > >
> >

Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

Posted by Edward Loper <ed...@google.com.INVALID>.
The intention is that each individual record could have a different size.
This could be consistent within a given batch, but wouldn't need to be.
For example, if I wanted to send a 3-channel image, but the image size may
vary for each record, then I could use
FixedSizeList<FixedSizeList<FixedSizeList<Int8>[3]>[-1]>[-1].

On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette <bh...@apache.org> wrote:

> This isn't really relevant but I feel compelled to point it out - the
> FixedSizeList type has actually been in the Arrow spec for a while, but it
> was only implemented in JS and Java initially. It was implemented in C++
> just a few months ago.
>

Thanks for the clarification -- I was going based on the blame history for
Layout.rst, but I guess it just didn't get officially documented there
until the c++ implementation was added.

-Edward


> On Mon, Jul 29, 2019 at 7:01 AM Edward Loper <ed...@google.com.invalid>
> wrote:
>
> > The FixedSizeList type, which was added to Arrow a few months ago, is an
> > array where each slot contains a fixed-size sequence of values.  It is
> > specified as FixedSizeList<T>[N], where T is a child type and N is a
> signed
> > int32 that specifies the length of each list.
> >
> > This is useful for encoding fixed-size tensors.  E.g., if I have a
> 100x8x10
> > tensor, then I can encode it as
> > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].
> >
> > But I'm also interested in encoding tensors where some dimension sizes
> are
> > not known in advance.  It seems to me that FixedSizeList could be
> extended
> > to support this fairly easily, by simply defining that N=-1 means "each
> > array slot has the same length, but that length is not known in advance."
> >  So e.g. we could encode a 100x?x10 tensor as
> > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
> >
> > Since these N=-1 row-lengths are not encoded in the type, we need some
> way
> > to determine what they are.  Luckily, every Field in the schema has a
> > corresponding FieldNode in the message; and those FieldNodes can be used
> to
> > deduce the row lengths.  In particular, the row length must be equal to
> the
> > length of the child node divided by the length of the FixedSizeList.
> E.g.,
> > if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3,
> 4],
> > [5, 6]] then the message representation is:
> >
> > * Length: 3, Null count: 0
> > * Null bitmap buffer: Not required
> > * Values array (byte array):
> >     * Length: 6,  Null count: 0
> >     * Null bitmap buffer: Not required
> >     * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]
> >
> > So we can deduce that the row length is 6/3=2.
> >
> > It looks to me like it would be fairly easy to add support for this.
> E.g.,
> > in the FixedSizeListArray constructor in c++, if list_type()->list_size()
> > is -1, then set list_size_ to values.length()/length.  There would be no
> > changes to the schema.fbs/message.fbs files -- we would just be
> assigning a
> > meaning to something that's currently meaningless (having
> > FixedSizeList.listSize=-1).
> >
> > If there's support for adding this to Arrow, then I could put together a
> > PR.
> >
> > Thanks,
> > -Edward
> >
> > P.S. Apologies if this gets posted twice -- I sent it out a couple days
> ago
> > right before subscribing to the mailing list; but I don't see it on the
> > archives, presumably because I wasn't subscribed yet when I sent it out.
> >
>

Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

Posted by Brian Hulette <bh...@apache.org>.
I think it may be helpful to clarify what you mean by dimensions that are
not known in advance. I believe the intention here is that this unknown
dimension is consistent within a record batch, but it is allowed to vary
from batch to batch. Otherwise, I would say you could just delay creating
the schema until you do know the unknown dimension.

This isn't really relevant but I feel compelled to point it out - the
FixedSizeList type has actually been in the Arrow spec for a while, but it
was only implemented in JS and Java initially. It was implemented in C++
just a few months ago.

On Mon, Jul 29, 2019 at 7:01 AM Edward Loper <ed...@google.com.invalid>
wrote:

> The FixedSizeList type, which was added to Arrow a few months ago, is an
> array where each slot contains a fixed-size sequence of values.  It is
> specified as FixedSizeList<T>[N], where T is a child type and N is a signed
> int32 that specifies the length of each list.
>
> This is useful for encoding fixed-size tensors.  E.g., if I have a 100x8x10
> tensor, then I can encode it as
> FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].
>
> But I'm also interested in encoding tensors where some dimension sizes are
> not known in advance.  It seems to me that FixedSizeList could be extended
> to support this fairly easily, by simply defining that N=-1 means "each
> array slot has the same length, but that length is not known in advance."
>  So e.g. we could encode a 100x?x10 tensor as
> FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
>
> Since these N=-1 row-lengths are not encoded in the type, we need some way
> to determine what they are.  Luckily, every Field in the schema has a
> corresponding FieldNode in the message; and those FieldNodes can be used to
> deduce the row lengths.  In particular, the row length must be equal to the
> length of the child node divided by the length of the FixedSizeList.  E.g.,
> if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3, 4],
> [5, 6]] then the message representation is:
>
> * Length: 3, Null count: 0
> * Null bitmap buffer: Not required
> * Values array (byte array):
>     * Length: 6,  Null count: 0
>     * Null bitmap buffer: Not required
>     * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]
>
> So we can deduce that the row length is 6/3=2.
>
> It looks to me like it would be fairly easy to add support for this.  E.g.,
> in the FixedSizeListArray constructor in c++, if list_type()->list_size()
> is -1, then set list_size_ to values.length()/length.  There would be no
> changes to the schema.fbs/message.fbs files -- we would just be assigning a
> meaning to something that's currently meaningless (having
> FixedSizeList.listSize=-1).
>
> If there's support for adding this to Arrow, then I could put together a
> PR.
>
> Thanks,
> -Edward
>
> P.S. Apologies if this gets posted twice -- I sent it out a couple days ago
> right before subscribing to the mailing list; but I don't see it on the
> archives, presumably because I wasn't subscribed yet when I sent it out.
>