Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2018/06/24 06:24:38 UTC

[DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

hi folks,

Some time ago I opened ARROW-1790 based on some discussions I'd had
with users on the mailing list or in person about how to deal with data
similar to a C array of struct types. Indeed, while we have Structs in
the Arrow columnar format, our structs are "fully shredded" columnar
structs.

Many systems such as Apache Impala (TupleRow, used in row batches),
Apache Kudu (used in client RPCs), Apache Spark (off-heap "unsafe row"
aka Tungsten), NumPy (structured dtypes), and others have in-memory
data structures supporting record-oriented data. As far as I know,
there is not an open standard for this type of data.

Developing this within Apache Arrow would serve a couple of purposes:

* To have an open standard for in-memory records under ASF community
governance. Achieving consensus in this setting would have a lot of
long-term value and accelerate adoption

* To provide a means to embed sequences of records in the Arrow columnar format

In light of efforts to create LLVM codegen infrastructure for Arrow
(Gandiva), it would stand to reason that we could develop LLVM IR for
manipulating columns of records in a coherent algebraic expression
framework. For example: efficient LLVM code generation for "shredding"
or "pivoting" records into fully-shredded columnar format.

If this sounds interesting to the community, I could help to kickstart
a design process which would likely take a significant amount of time.
The requirements could be complex (i.e. we might want to support
variable-size record fields while also providing random access
guarantees). We could use the ASF's Confluence wiki to house the
documents and facilitate discussion.

Thanks,
Wes

Re: [DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

Posted by Wes McKinney <we...@gmail.com>.
I added a line item for this on the project wiki. When I or someone
else has time, we can write a requirements / design sketch document to
help with further discussions.

On Wed, Jun 27, 2018 at 7:09 AM, Wes McKinney <we...@gmail.com> wrote:
>> I'm not sure this makes sense as an external stable api. I definitely think it is useful as an internal representation for use within a particular algorithm. I also think that can be informed by the particular algorithm that you're working on.
>
> I agree that this is definitely needed in certain algorithms, e.g.
> certain types of hashing. And the memory layout that's best for a
> given algorithm might change. Since we have a number of support
> structures already in place for columnar data, like dictionary
> encoding, it would be easier to implement these things for
> row-oriented data.
>
> I think the question is really about open standards. Our original
> focus when we started the project was to develop an open standard for
> columnar data. It seems valuable to have one for row-oriented data.
> Given how many systems have developed their own internal formats, it
> seems like an inevitability. This raises the questions: if it does not
> happen here, then where? And if not now, then when?
>
> That being said, it's hard to say how feasible the project would be
> until we gather more requirements and non-requirements.
>
> On Wed, Jun 27, 2018 at 3:20 AM, Siddharth Teotia <si...@dremio.com> wrote:
>> I am wondering if this can be considered an opportunity to implement
>> support in Arrow for building high-performance in-memory row stores for
>> low-latency, high-throughput key-based queries. In other words, we can
>> design the in-memory record format keeping efficient RDMA reads as one of
>> the goals too. Consider two data structures in memory -- a hash table and
>> a row store comprising records in an Arrow row format. The hash table
>> points to the row store, and information can be read from both data
>> structures without interrupting the CPU on the server. This client-server
>> code-path support can also be incorporated into Arrow Flight.
>>
>> On Tue, Jun 26, 2018 at 7:49 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>
>>> I'm not sure this makes sense as an external stable api. I definitely think
>>> it is useful as an internal representation for use within a particular
>>> algorithm. I also think that can be informed by the particular algorithm
>>> that you're working on.
>>>
>>> We definitely had this requirement in Dremio and came up with an internal
>>> representation that we are happy with for use in hash tables. I'll try
>>> to dig up the design docs we had around this but the actual
>>> pivoting/unpivoting code that we developed can be seen here: [1], [2].
>>>
>>> Our main model is two blocks: a fixed width block and a variable width
>>> block (with the fixed width block also carrying address & length of the
>>> variable data). Fixed width is randomly accessible and variable width is
>>> randomly accessible through fixed width.
>>>
>>> [1]
>>> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java
>>> [2]
>>> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java
>>>
>>> On Tue, Jun 26, 2018 at 10:20 AM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>>
>>> > hi Antoine,
>>> >
>>> > On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou <an...@python.org>
>>> > wrote:
>>> > >
>>> > > Hi Wes,
>>> > >
>>> > > On 24/06/2018 at 08:24, Wes McKinney wrote:
>>> > >>
>>> > >> If this sounds interesting to the community, I could help to kickstart
>>> > >> a design process which would likely take a significant amount of time.
>>> > >> The requirements could be complex (i.e. we might want to support
>>> > >> variable-size record fields while also providing random access
>>> > >> guarantees).
>>> > >
>>> > > What do you call "variable-sized" here? A scheme where the length of a
>>> > > record's field is determined by the value of another field in the same
>>> > > record?
>>> >
>>> > As an example, here is a fixed size record
>>> >
>>> > record foo {
>>> >   a: int32;
>>> >   b: float64;
>>> >   c: uint8;
>>> > }
>>> >
>>> > With padding suppose this is 16 bytes per record; so if we have a
>>> > column of these, then random accessing any value in any record is
>>> > simple.
>>> >
>>> > Here's a variable-length record:
>>> >
>>> > record bar {
>>> >   a: string;
>>> >   b: list<int32>;
>>> > }
>>> >
>>> > What I've seen done to represent this in memory is to have a fixed
>>> > size record followed by a sidecar containing the variable-length data,
>>> > so the fixed size portion might look something like
>>> >
>>> > a_offset: int32;
>>> > a_length: int32;
>>> > b_offset: int32;
>>> > b_length: int32;
>>> >
>>> > So from this, you can do random access into the record. If you wanted
>>> > to do random access on a _column_ of such records, it is similar to
>>> > our current variable-length Binary type. So it might be that the
>>> > underlying Arrow memory layout would be FixedSizeBinary for fixed-size
>>> > records and variable Binary for variable-size records.
>>> >
>>> > - Wes
>>> >
>>> > >
>>> > >
>>> > >
>>> > > Regards
>>> > >
>>> > > Antoine.
>>> >
>>>

Re: [DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

Posted by Wes McKinney <we...@gmail.com>.
> I'm not sure this makes sense as an external stable api. I definitely think it is useful as an internal representation for use within a particular algorithm. I also think that can be informed by the particular algorithm that you're working on.

I agree that this is definitely needed in certain algorithms, e.g.
certain types of hashing. And the memory layout that's best for a
given algorithm might change. Since we have a number of support
structures already in place for columnar data, like dictionary
encoding, it would be easier to implement these things for
row-oriented data.

I think the question is really about open standards. Our original
focus when we started the project was to develop an open standard for
columnar data. It seems valuable to have one for row-oriented data.
Given how many systems have developed their own internal formats, it
seems like an inevitability. This raises the questions: if it does not
happen here, then where? And if not now, then when?

That being said, it's hard to say how feasible the project would be
until we gather more requirements and non-requirements.

On Wed, Jun 27, 2018 at 3:20 AM, Siddharth Teotia <si...@dremio.com> wrote:
> I am wondering if this can be considered an opportunity to implement
> support in Arrow for building high-performance in-memory row stores for
> low-latency, high-throughput key-based queries. In other words, we can
> design the in-memory record format keeping efficient RDMA reads as one of
> the goals too. Consider two data structures in memory -- a hash table and
> a row store comprising records in an Arrow row format. The hash table
> points to the row store, and information can be read from both data
> structures without interrupting the CPU on the server. This client-server
> code-path support can also be incorporated into Arrow Flight.
>
> On Tue, Jun 26, 2018 at 7:49 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
>> I'm not sure this makes sense as an external stable api. I definitely think
>> it is useful as an internal representation for use within a particular
>> algorithm. I also think that can be informed by the particular algorithm
>> that you're working on.
>>
>> We definitely had this requirement in Dremio and came up with an internal
>> representation that we are happy with for use in hash tables. I'll try
>> to dig up the design docs we had around this but the actual
>> pivoting/unpivoting code that we developed can be seen here: [1], [2].
>>
>> Our main model is two blocks: a fixed width block and a variable width
>> block (with the fixed width block also carrying address & length of the
>> variable data). Fixed width is randomly accessible and variable width is
>> randomly accessible through fixed width.
>>
>> [1]
>> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java
>> [2]
>> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java
>>
>> On Tue, Jun 26, 2018 at 10:20 AM, Wes McKinney <we...@gmail.com>
>> wrote:
>>
>> > hi Antoine,
>> >
>> > On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou <an...@python.org>
>> > wrote:
>> > >
>> > > Hi Wes,
>> > >
>> > > On 24/06/2018 at 08:24, Wes McKinney wrote:
>> > >>
>> > >> If this sounds interesting to the community, I could help to kickstart
>> > >> a design process which would likely take a significant amount of time.
>> > >> The requirements could be complex (i.e. we might want to support
>> > >> variable-size record fields while also providing random access
>> > >> guarantees).
>> > >
>> > > What do you call "variable-sized" here? A scheme where the length of a
>> > > record's field is determined by the value of another field in the same
>> > > record?
>> >
>> > As an example, here is a fixed size record
>> >
>> > record foo {
>> >   a: int32;
>> >   b: float64;
>> >   c: uint8;
>> > }
>> >
>> > With padding suppose this is 16 bytes per record; so if we have a
>> > column of these, then random accessing any value in any record is
>> > simple.
>> >
>> > Here's a variable-length record:
>> >
>> > record bar {
>> >   a: string;
>> >   b: list<int32>;
>> > }
>> >
>> > What I've seen done to represent this in memory is to have a fixed
>> > size record followed by a sidecar containing the variable-length data,
>> > so the fixed size portion might look something like
>> >
>> > a_offset: int32;
>> > a_length: int32;
>> > b_offset: int32;
>> > b_length: int32;
>> >
>> > So from this, you can do random access into the record. If you wanted
>> > to do random access on a _column_ of such records, it is similar to
>> > our current variable-length Binary type. So it might be that the
>> > underlying Arrow memory layout would be FixedSizeBinary for fixed-size
>> > records and variable Binary for variable-size records.
>> >
>> > - Wes
>> >
>> > >
>> > >
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> >
>>

Re: [DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

Posted by Siddharth Teotia <si...@dremio.com>.
I am wondering if this can be considered an opportunity to implement
support in Arrow for building high-performance in-memory row stores for
low-latency, high-throughput key-based queries. In other words, we can
design the in-memory record format keeping efficient RDMA reads as one of
the goals too. Consider two data structures in memory -- a hash table and
a row store comprising records in an Arrow row format. The hash table
points to the row store, and information can be read from both data
structures without interrupting the CPU on the server. This client-server
code-path support can also be incorporated into Arrow Flight.

On Tue, Jun 26, 2018 at 7:49 PM, Jacques Nadeau <ja...@apache.org> wrote:

> I'm not sure this makes sense as an external stable api. I definitely think
> it is useful as an internal representation for use within a particular
> algorithm. I also think that can be informed by the particular algorithm
> that you're working on.
>
> We definitely had this requirement in Dremio and came up with an internal
> representation that we are happy with for use in hash tables. I'll try
> to dig up the design docs we had around this but the actual
> pivoting/unpivoting code that we developed can be seen here: [1], [2].
>
> Our main model is two blocks: a fixed width block and a variable width
> block (with the fixed width block also carrying address & length of the
> variable data). Fixed width is randomly accessible and variable width is
> randomly accessible through fixed width.
>
> [1]
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java
> [2]
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java
>
> On Tue, Jun 26, 2018 at 10:20 AM, Wes McKinney <we...@gmail.com>
> wrote:
>
> > hi Antoine,
> >
> > On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou <an...@python.org>
> > wrote:
> > >
> > > Hi Wes,
> > >
> > > On 24/06/2018 at 08:24, Wes McKinney wrote:
> > >>
> > >> If this sounds interesting to the community, I could help to kickstart
> > >> a design process which would likely take a significant amount of time.
> > >> The requirements could be complex (i.e. we might want to support
> > >> variable-size record fields while also providing random access
> > >> guarantees).
> > >
> > > What do you call "variable-sized" here? A scheme where the length of a
> > > record's field is determined by the value of another field in the same
> > > record?
> >
> > As an example, here is a fixed size record
> >
> > record foo {
> >   a: int32;
> >   b: float64;
> >   c: uint8;
> > }
> >
> > With padding suppose this is 16 bytes per record; so if we have a
> > column of these, then random accessing any value in any record is
> > simple.
> >
> > Here's a variable-length record:
> >
> > record bar {
> >   a: string;
> >   b: list<int32>;
> > }
> >
> > What I've seen done to represent this in memory is to have a fixed
> > size record followed by a sidecar containing the variable-length data,
> > so the fixed size portion might look something like
> >
> > a_offset: int32;
> > a_length: int32;
> > b_offset: int32;
> > b_length: int32;
> >
> > So from this, you can do random access into the record. If you wanted
> > to do random access on a _column_ of such records, it is similar to
> > our current variable-length Binary type. So it might be that the
> > underlying Arrow memory layout would be FixedSizeBinary for fixed-size
> > records and variable Binary for variable-size records.
> >
> > - Wes
> >
> > >
> > >
> > >
> > > Regards
> > >
> > > Antoine.
> >
>

Re: [DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

Posted by Jacques Nadeau <ja...@apache.org>.
I'm not sure this makes sense as an external stable api. I definitely think
it is useful as an internal representation for use within a particular
algorithm. I also think that can be informed by the particular algorithm
that you're working on.

We definitely had this requirement in Dremio and came up with an internal
representation that we are happy with for use in hash tables. I'll try
to dig up the design docs we had around this but the actual
pivoting/unpivoting code that we developed can be seen here: [1], [2].

Our main model is two blocks: a fixed width block and a variable width
block (with the fixed width block also carrying address & length of the
variable data). Fixed width is randomly accessible and variable width is
randomly accessible through fixed width.

[1]
https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java
[2]
https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java
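
As a rough illustration of that two-block model, here is a simplified
C++ sketch with invented names (not the actual layout in the code linked
above):

#include <cstdint>
#include <string>
#include <vector>

// Fixed-width block entry: one fixed-size slot per record. Variable-width
// data lives in a separate block, reached via the offset/length stored here.
struct FixedEntry {
  int64_t key_hash;    // example fixed-width field
  int32_t var_offset;  // where this record's variable data starts
  int32_t var_length;  // how many bytes of variable data it has
};

struct TwoBlockBatch {
  std::vector<FixedEntry> fixed;  // randomly accessible by record index
  std::vector<char> variable;     // reachable only through the fixed block

  void Append(int64_t key_hash, const std::string& var_field) {
    fixed.push_back({key_hash,
                     static_cast<int32_t>(variable.size()),
                     static_cast<int32_t>(var_field.size())});
    variable.insert(variable.end(), var_field.begin(), var_field.end());
  }

  std::string GetVarField(size_t i) const {
    const FixedEntry& e = fixed[i];
    return std::string(variable.data() + e.var_offset, e.var_length);
  }
};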

On Tue, Jun 26, 2018 at 10:20 AM, Wes McKinney <we...@gmail.com> wrote:

> hi Antoine,
>
> On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou <an...@python.org>
> wrote:
> >
> > Hi Wes,
> >
> > On 24/06/2018 at 08:24, Wes McKinney wrote:
> >>
> >> If this sounds interesting to the community, I could help to kickstart
> >> a design process which would likely take a significant amount of time.
> >> The requirements could be complex (i.e. we might want to support
> >> variable-size record fields while also providing random access
> >> guarantees).
> >
> > What do you call "variable-sized" here? A scheme where the length of a
> > record's field is determined by the value of another field in the same
> > record?
>
> As an example, here is a fixed size record
>
> record foo {
>   a: int32;
>   b: float64;
>   c: uint8;
> }
>
> With padding suppose this is 16 bytes per record; so if we have a
> column of these, then random accessing any value in any record is
> simple.
>
> Here's a variable-length record:
>
> record bar {
>   a: string;
>   b: list<int32>;
> }
>
> What I've seen done to represent this in memory is to have a fixed
> size record followed by a sidecar containing the variable-length data,
> so the fixed size portion might look something like
>
> a_offset: int32;
> a_length: int32;
> b_offset: int32;
> b_length: int32;
>
> So from this, you can do random access into the record. If you wanted
> to do random access on a _column_ of such records, it is similar to
> our current variable-length Binary type. So it might be that the
> underlying Arrow memory layout would be FixedSizeBinary for fixed-size
> records and variable Binary for variable-size records.
>
> - Wes
>
> >
> >
> >
> > Regards
> >
> > Antoine.
>

Re: [DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

Posted by Wes McKinney <we...@gmail.com>.
hi Antoine,

On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou <an...@python.org> wrote:
>
> Hi Wes,
>
> On 24/06/2018 at 08:24, Wes McKinney wrote:
>>
>> If this sounds interesting to the community, I could help to kickstart
>> a design process which would likely take a significant amount of time.
>> The requirements could be complex (i.e. we might want to support
>> variable-size record fields while also providing random access
>> guarantees).
>
> What do you call "variable-sized" here? A scheme where the length of a
> record's field is determined by the value of another field in the same
> record?

As an example, here is a fixed-size record:

record foo {
  a: int32;
  b: float64;
  c: uint8;
}

With padding, suppose this is 16 bytes per record; so if we have a
column of these, then randomly accessing any value in any record is
simple.
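
In code, that random access is just pointer arithmetic. A sketch, assuming
one particular padded layout (b at byte 0, a at byte 8, c at byte 12,
padded out to 16 bytes):

#include <cstdint>
#include <cstring>

// Read field b of record i from a packed buffer of 16-byte foo records.
double ReadFooB(const uint8_t* base, int64_t i) {
  constexpr int64_t kRecordSize = 16;  // assumed padded record size
  constexpr int64_t kOffsetOfB = 0;    // assumed byte offset of b
  double b;
  std::memcpy(&b, base + i * kRecordSize + kOffsetOfB, sizeof(b));
  return b;
}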

Here's a variable-length record:

record bar {
  a: string;
  b: list<int32>;
}

What I've seen done to represent this in memory is to have a fixed-size
record followed by a sidecar containing the variable-length data, so the
fixed-size portion might look something like:

a_offset: int32;
a_length: int32;
b_offset: int32;
b_length: int32;

So from this, you can do random access into the record. If you wanted
to do random access on a _column_ of such records, it is similar to
our current variable-length Binary type. So it might be that the
underlying Arrow memory layout would be FixedSizeBinary for fixed-size
records and variable Binary for variable-size records.
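
As a sketch of what reading field a might then look like -- assuming, as
one possible convention, that the offsets are relative to the start of the
record and that the fixed-size portion is immediately followed by the
variable-length sidecar:

#include <cstdint>
#include <cstring>
#include <string_view>

// Fixed-size portion of a single "bar" record; the variable-length bytes
// for a and b follow in a sidecar within the same record buffer.
struct BarFixed {
  int32_t a_offset;
  int32_t a_length;
  int32_t b_offset;
  int32_t b_length;
};

// Random access to field a of one record.
std::string_view ReadBarA(const uint8_t* record) {
  BarFixed fixed;
  std::memcpy(&fixed, record, sizeof(fixed));
  return std::string_view(
      reinterpret_cast<const char*>(record) + fixed.a_offset, fixed.a_length);
}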

- Wes

>
>
>
> Regards
>
> Antoine.

Re: [DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

Posted by Antoine Pitrou <an...@python.org>.
Hi Wes,

On 24/06/2018 at 08:24, Wes McKinney wrote:
> 
> If this sounds interesting to the community, I could help to kickstart
> a design process which would likely take a significant amount of time.
> The requirements could be complex (i.e. we might want to support
> variable-size record fields while also providing random access
> guarantees).

What do you call "variable-sized" here? A scheme where the length of a
record's field is determined by the value of another field in the same
record?



Regards

Antoine.