Posted to dev@arrow.apache.org by Yue Ni <ni...@gmail.com> on 2023/01/01 14:01:30 UTC

modeling column group

Hi there,

Happy new year.

I store some data in Arrow IPC files. Two of the fields are always accessed
together and in a row-oriented manner, while the other fields are accessed
in a columnar manner. One of the two is a string field and the other is an
int32 field. I would like to know if there is a canonical approach for
modeling this kind of usage in Arrow.

The IPC files are memory mapped and randomly accessed. Because of the
columnar storage, reading the two fields of the same row requires 2 random
accesses. Since I know these two fields are always read together, in theory
this could be reduced to 1 random access. Initially I read the doc about
the struct layout (
https://arrow.apache.org/docs/format/Columnar.html#struct-layout), but it
seems a struct still stores and accesses its data in a columnar manner, so
it doesn't help. I could probably use some proprietary encoding to pack
these two fields into a single field, but that is not elegant and somewhat
less portable. Is there any canonical approach in Arrow for modeling such
usage? Thanks.
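
For concreteness, the "single field" workaround mentioned above could look
roughly like the sketch below. It is only an illustration: pack_pair and
unpack_pair are hypothetical helpers, not part of any Arrow API, and the
byte format (a 4-byte little-endian int32 followed by the UTF-8 bytes of
the string) is an assumption made for this example. Each packed value
would be stored as one entry of a binary column.

```python
import struct
from typing import Tuple

# Assumed format, for illustration only: a 4-byte little-endian int32
# followed by the UTF-8 bytes of the string.
_INT32 = struct.Struct("<i")


def pack_pair(number: int, text: str) -> bytes:
    """Encode an (int32, string) pair into one binary value."""
    return _INT32.pack(number) + text.encode("utf-8")


def unpack_pair(value: bytes) -> Tuple[int, str]:
    """Decode a value produced by pack_pair."""
    (number,) = _INT32.unpack_from(value, 0)
    # Everything after the fixed 4-byte prefix is the string payload.
    return number, value[_INT32.size:].decode("utf-8")


packed = pack_pair(42, "arrow")
number, text = unpack_pair(packed)
```

The downside is the one already noted: any reader has to know this private
format in order to interpret the column.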

Regards,
Yue

Re: modeling column group

Posted by Yue Ni <ni...@gmail.com>.
Thanks so much Weston. Both [1][2] are informative, and I will check them
out. Thanks.

On Mon, Jan 2, 2023 at 5:05 AM Weston Pace <we...@gmail.com> wrote:

> There was a discussion a while back about representing complex numbers
> that seems similar[1].  If both fields were the same type you could
> use a fixed size list array.  However, since you want two different
> types you'd want some kind of "packed struct" which does not exist (to
> my knowledge) today.  Also, given that one of the fields is a string
> it would be a bit of a challenge.
>
> There is a layout kind of like this in the hash-table/group-by
> implementation.  We use a row-encoding scheme in the hash-table.  All
> fixed size types are encoded first and then the variable types come at
> the end.  I can't remember off the top of my head if the lengths of
> the variable sized fields are encoded as fixed size types or in a
> separate array.  However, this is internal, not thoroughly documented,
> and probably just useful for inspiration at the moment.
>
> [1] https://lists.apache.org/thread/m8jnrfzozq1dx66twzc80vbyr6r365yf
> [2]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/row/row_internal.h

Re: modeling column group

Posted by Weston Pace <we...@gmail.com>.
There was a discussion a while back about representing complex numbers
that seems similar[1].  If both fields were the same type you could
use a fixed size list array.  However, since you want two different
types you'd want some kind of "packed struct" which does not exist (to
my knowledge) today.  Also, given that one of the fields is a string
it would be a bit of a challenge.
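
As a rough illustration of the fixed size list idea for two fields of the
same type: all values sit in one contiguous buffer, and row i occupies
slots [i * list_size, (i + 1) * list_size), so both values of a row are
adjacent in memory. The sketch below uses only the Python standard library
rather than an Arrow implementation, and build_values and get_row are
hypothetical names chosen for this example.

```python
from array import array

LIST_SIZE = 2  # two same-typed values per row, e.g. (real, imag)


def build_values(rows):
    """Flatten rows of LIST_SIZE floats into one contiguous buffer."""
    values = array("d")
    for row in rows:
        if len(row) != LIST_SIZE:
            raise ValueError("every row must have exactly LIST_SIZE items")
        values.extend(row)
    return values


def get_row(values, i):
    """Row i occupies slots [i * LIST_SIZE, (i + 1) * LIST_SIZE)."""
    start = i * LIST_SIZE
    return tuple(values[start:start + LIST_SIZE])


values = build_values([(1.0, 2.0), (3.0, -4.0), (0.5, 0.25)])
second = get_row(values, 1)
```

This only works because every slot has the same width, which is why a
string field does not fit this layout directly.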

There is a layout kind of like this in the hash-table/group-by
implementation[2].  We use a row-encoding scheme in the hash-table.  All
fixed-size types are encoded first and then the variable-size types come
at the end.  I can't remember off the top of my head if the lengths of
the variable-size fields are encoded as fixed-size types or in a
separate array.  However, this is internal, not thoroughly documented,
and probably just useful for inspiration at the moment.
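
A minimal sketch of the general idea (fixed-size part first, variable-size
part at the end of each row) is below. It is not the actual internal
format: here the string length is assumed to be stored as a fixed-size
uint32 next to the int32, and row start offsets are kept in a separate
list. encode_rows and read_row are hypothetical names.

```python
import struct

# Assumed row format, for illustration only (not Arrow's internal one):
#   int32 value | uint32 string length | UTF-8 string bytes
_FIXED = struct.Struct("<iI")


def encode_rows(rows):
    """Return (buffer, offsets); offsets[i] is where row i starts."""
    buffer = bytearray()
    offsets = []
    for number, text in rows:
        data = text.encode("utf-8")
        offsets.append(len(buffer))
        buffer += _FIXED.pack(number, len(data))  # fixed-size part first
        buffer += data                            # variable-size part last
    return bytes(buffer), offsets


def read_row(buffer, offsets, i):
    """Look up the offset of row i, then decode both fields in place."""
    start = offsets[i]
    number, length = _FIXED.unpack_from(buffer, start)
    data_start = start + _FIXED.size
    return number, buffer[data_start:data_start + length].decode("utf-8")


buffer, offsets = encode_rows([(1, "a"), (20, "bcd"), (-3, "")])
row = read_row(buffer, offsets, 1)
```

Once the row offset is known, the int32 and the string bytes are read from
adjacent memory, which is the property asked about in the original
question.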

[1] https://lists.apache.org/thread/m8jnrfzozq1dx66twzc80vbyr6r365yf
[2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/row/row_internal.h
