You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Edward Loper <ed...@google.com.INVALID> on 2019/08/01 12:55:41 UTC

Re: [DISCUSS][Format] FixedSizeList w/ row-length not specified as part of the type

Brian: yes, you're correct.  Sorry, I've been playing around with a couple
different ways to extend things, and was conflating them when I wrote my
response.  For this proposal, the dimension must have the same size for all
items in a given record batch.

As suggested by Francois and Wes, I will look into using the ExtensionType
to implement this proposal.

-Edward

On Wed, Jul 31, 2019 at 1:00 PM Brian Hulette <bh...@apache.org> wrote:

> I'm a little confused about the proposal now. If the unknown dimension
> doesn't have to be the same within a record batch, how would you be able to
> deduce it with the approach you described (dividing the logical length of
> the values array by the length of the record batch)?
>
> On Wed, Jul 31, 2019 at 8:24 AM Wes McKinney <we...@gmail.com> wrote:
>
> > I agree this sounds like a good application for ExtensionType. At
> > minimum, ExtensionType can be used to develop a working version of
> > what you need to help guide further discussions.
> >
> > On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques
> > <fs...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > if each record has a different size, then I suggest to just use a
> > > Struct<Dim, List<T>> where Dim is a struct (or expand in the outer
> > > struct). You can probably add your own logic with the recently
> > > introduced ExtensionType [1].
> > >
> > > François
> > > [1]
> >
> https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h
> > >
> > > On Mon, Jul 29, 2019 at 3:15 PM Edward Loper
> <ed...@google.com.invalid>
> > wrote:
> > > >
> > > > The intention is that each individual record could have a different
> > size.
> > > > This could be consistent within a given batch, but wouldn't need to
> be.
> > > > For example, if I wanted to send a 3-channel image, but the image
> size
> > may
> > > > vary for each record, then I could use
> > > > FixedSizeList<FixedSizeList<FixedSizeList<Int8>[3]>[-1]>[-1].
> > > >
> > > > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette <bh...@apache.org>
> > wrote:
> > > >
> > > > > This isn't really relevant but I feel compelled to point it out -
> the
> > > > > FixedSizeList type has actually been in the Arrow spec for a while,
> > but it
> > > > > was only implemented in JS and Java initially. It was implemented
> in
> > C++
> > > > > just a few months ago.
> > > > >
> > > >
> > > > Thanks for the clarification -- I was going based on the blame
> history
> > for
> > > > Layout.rst, but I guess it just didn't get officially documented
> there
> > > > until the c++ implementation was added.
> > > >
> > > > -Edward
> > > >
> > > >
> > > > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper
> > <ed...@google.com.invalid>
> > > > > wrote:
> > > > >
> > > > > > The FixedSizeList type, which was added to Arrow a few months
> ago,
> > is an
> > > > > > array where each slot contains a fixed-size sequence of values.
> > It is
> > > > > > specified as FixedSizeList<T>[N], where T is a child type and N
> is
> > a
> > > > > signed
> > > > > > int32 that specifies the length of each list.
> > > > > >
> > > > > > This is useful for encoding fixed-size tensors.  E.g., if I have
> a
> > > > > 100x8x10
> > > > > > tensor, then I can encode it as
> > > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].
> > > > > >
> > > > > > But I'm also interested in encoding tensors where some dimension
> > sizes
> > > > > are
> > > > > > not known in advance.  It seems to me that FixedSizeList could be
> > > > > extended
> > > > > > to support this fairly easily, by simply defining that N=-1 means
> > "each
> > > > > > array slot has the same length, but that length is not known in
> > advance."
> > > > > >  So e.g. we could encode a 100x?x10 tensor as
> > > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
> > > > > >
> > > > > > Since these N=-1 row-lengths are not encoded in the type, we need
> > some
> > > > > way
> > > > > > to determine what they are.  Luckily, every Field in the schema
> > has a
> > > > > > corresponding FieldNode in the message; and those FieldNodes can
> > be used
> > > > > to
> > > > > > deduce the row lengths.  In particular, the row length must be
> > equal to
> > > > > the
> > > > > > length of the child node divided by the length of the
> > FixedSizeList.
> > > > > E.g.,
> > > > > > if we have a FixedSizeList<byte>[-1] array with the values [[1,
> > 2], [3,
> > > > > 4],
> > > > > > [5, 6]] then the message representation is:
> > > > > >
> > > > > > * Length: 3, Null count: 0
> > > > > > * Null bitmap buffer: Not required
> > > > > > * Values array (byte array):
> > > > > >     * Length: 6,  Null count: 0
> > > > > >     * Null bitmap buffer: Not required
> > > > > >     * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding
> bytes>]
> > > > > >
> > > > > > So we can deduce that the row length is 6/3=2.
> > > > > >
> > > > > > It looks to me like it would be fairly easy to add support for
> > this.
> > > > > E.g.,
> > > > > > in the FixedSizeListArray constructor in c++, if
> > list_type()->list_size()
> > > > > > is -1, then set list_size_ to values.length()/length.  There
> would
> > be no
> > > > > > changes to the schema.fbs/message.fbs files -- we would just be
> > > > > assigning a
> > > > > > meaning to something that's currently meaningless (having
> > > > > > FixedSizeList.listSize=-1).
> > > > > >
> > > > > > If there's support for adding this to Arrow, then I could put
> > together a
> > > > > > PR.
> > > > > >
> > > > > > Thanks,
> > > > > > -Edward
> > > > > >
> > > > > > P.S. Apologies if this gets posted twice -- I sent it out a
> couple
> > days
> > > > > ago
> > > > > > right before subscribing to the mailing list; but I don't see it
> > on the
> > > > > > archives, presumably because I wasn't subscribed yet when I sent
> > it out.
> > > > > >
> > > > >
> >
>