Posted to dev@arrow.apache.org by Wes McKinney <we...@cloudera.com> on 2016/03/19 07:06:05 UTC

Shared memory "IPC" of Arrow row batches in C++

I’ve been collaborating with Steven Phillips (who’s been working on
the Java Arrow implementation recently) on a proof of concept that
ping-pongs Arrow data back and forth between the Java and C++
implementations. We aren’t 100% there yet, but I got a C++-to-C++
round trip through a memory map working today (for primitive types,
e.g. integers):

https://github.com/apache/arrow/pull/28

We created a small metadata specification using Flatbuffers IDL, and
feedback on it would be very welcome here:

https://github.com/apache/arrow/pull/28/files#diff-520b20e87eb508faa3cc7aa9855030d7

This includes:

- Logical schemas
- Data headers: compact descriptions of the row batches associated
with a particular schema

The idea is that two systems agree up front on “what is the schema” so
that only the data header (containing memory offsets and sizes and
some other important data-dependent metadata) needs to be exchanged
for each row batch. After working through this in some real code, I’m
feeling fairly good that it meets the needs of Arrow for the time
being, but there may be some unknown requirements that it would be
good to learn about sooner rather than later. After some design review
and iteration we’ll want to document the metadata specification as
part of the format in more gory detail.

(Note: we are using Flatbuffers for convenience, performance, and
development simplicity. One especially nice feature is its union
support, but the same could be done with other serialization tools.)
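
To make the schema / data header split more concrete, here is a rough
C++ stand-in for the kind of information each piece carries. This is
not the Flatbuffers IDL from the PR; the type names and fields below
are invented purely for illustration.

    // Hypothetical sketch only; the real definitions live in the
    // Flatbuffers IDL linked above, and these names are invented.
    #include <cstdint>
    #include <string>
    #include <vector>

    // Agreed on once, up front, by both ends of the channel.
    struct Field {
      std::string name;
      int32_t type_id;   // placeholder for a logical type enum/union
      bool nullable;
    };

    struct Schema {
      std::vector<Field> fields;
    };

    // Written once per row batch: no type information, only where each
    // memory region lives and how large it is.
    struct BufferDescriptor {
      int64_t offset;    // offset into the shared memory region
      int64_t length;    // size in bytes
    };

    struct DataHeader {
      int32_t num_rows;
      std::vector<BufferDescriptor> buffers;  // e.g. null bitmap + values
    };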

It'd be great to get some benchmark code written so that we are also
able to make technical decisions on the basis of measurable
performance implications. For example, while the read path of the
above code does not copy any data, it would be useful to know how
fast reassembling the row batch data structure is and how that scales
with the number of columns.
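
As a sketch of the kind of measurement meant here, something along the
following lines would show how the cost of rebuilding the column views
scales with the number of columns. The Reassemble function and the
fixed per-column layout are toy stand-ins, not the PR's read path.

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Toy stand-in: a "reassembled" column is just a typed view (pointers
    // plus a length) into memory that already exists; nothing is copied.
    struct ColumnView {
      const uint8_t* nulls;
      const uint8_t* values;
      int32_t length;
    };

    // Simulate rebuilding the row batch structure from a data header whose
    // buffer offsets point into one contiguous region. The 1 KiB per-column
    // layout is invented purely for this benchmark.
    std::vector<ColumnView> Reassemble(const uint8_t* region, int num_columns,
                                       int32_t num_rows) {
      std::vector<ColumnView> columns;
      columns.reserve(num_columns);
      for (int i = 0; i < num_columns; ++i) {
        const uint8_t* base = region + static_cast<int64_t>(i) * 1024;
        columns.push_back(ColumnView{base, base + 128, num_rows});
      }
      return columns;
    }

    int main() {
      const int kIterations = 1000;
      const int32_t kRows = 64;
      std::vector<uint8_t> region(1024 * 1024, 0);  // pretend shared memory
      for (int num_columns : {1, 10, 100, 1000}) {
        std::size_t sink = 0;  // keeps the work from being optimized away
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < kIterations; ++i) {
          sink += Reassemble(region.data(), num_columns, kRows).size();
        }
        auto elapsed = std::chrono::steady_clock::now() - start;
        double micros =
            std::chrono::duration<double, std::micro>(elapsed).count() / kIterations;
        std::printf("%4d columns: %.2f us per reassembly (%zu views)\n",
                    num_columns, micros, sink);
      }
      return 0;
    }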

best regards,
Wes

RE: Shared memory "IPC" of Arrow row batches in C++

Posted by "Zheng, Kai" <ka...@intel.com>.
Thanks Wes again for explaining all of this. It looks good.

Regards,
Kai


Re: Shared memory "IPC" of Arrow row batches in C++

Posted by Wes McKinney <we...@cloudera.com>.
hi Kai

On Mon, Mar 21, 2016 at 8:40 AM, Zheng, Kai <ka...@intel.com> wrote:
> Thanks Wes. This sounds like a good start in the IPC direction.
>
>>> It'd be great to get some benchmark code written so that we are also able to make technical decisions on the basis of measurable performance implications.
> Is there any bootstrap setup for the benchmarks, here, in Parquet, or elsewhere, that we can borrow? Does it mean we'll compare two or more approaches, or just measure the performance of a code path like the read path you mentioned? For the C++ part, would benchmark code in C++ or Python be preferred?
>

Micah is working on ARROW-28 (https://github.com/apache/arrow/pull/29),
which will give us an organized way to create benchmarks.

>>> For example, while the read path of the above code does not copy any data, it would be useful to know how fast reassembling the row batch data structure is and how that scales with the number of columns.
> I guess that means the data is columnar and the read path will reassemble it into row batches without any data copying (via pointers), right?
>

Yes, it's only reassembling C++ objects with memory addresses, no data
copying. But it would be nice to know how fast this reassembly process
is -- I don't know yet whether it's in the low microseconds (< 50) or
something more than that.
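
To spell out what "reassembling C++ objects with memory addresses"
looks like on the read side, here is a minimal POSIX sketch. The file
path, buffer offset, and row count are invented stand-ins for what the
memory-mapped file and its data header would actually supply:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #include <cstdint>
    #include <cstdio>

    int main() {
      // Map a file written by the producer; in the real code the path and
      // the offsets below would come from the batch's data header.
      int fd = open("/tmp/arrow-batch.dat", O_RDONLY);
      if (fd < 0) { std::perror("open"); return 1; }

      struct stat st;
      if (fstat(fd, &st) != 0) { std::perror("fstat"); close(fd); return 1; }

      void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      if (base == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

      // "Reassembly": no bytes move, we only turn offsets into addresses.
      const int64_t values_offset = 64;  // hypothetical, from the data header
      const int32_t num_rows = 100;      // hypothetical, from the data header
      const auto* values = reinterpret_cast<const int32_t*>(
          static_cast<const uint8_t*>(base) + values_offset);

      int64_t sum = 0;
      for (int32_t i = 0; i < num_rows; ++i) sum += values[i];
      std::printf("sum of %d values: %lld\n", num_rows,
                  static_cast<long long>(sum));

      munmap(base, st.st_size);
      close(fd);
      return 0;
    }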

- Wes


RE: Shared memory "IPC" of Arrow row batches in C++

Posted by "Zheng, Kai" <ka...@intel.com>.
Thanks Wes. This sounds like a good start in the IPC direction.

>> It'd be great to get some benchmark code written so that we are also able to make technical decisions on the basis of measurable performance implications.
Is there any bootstrap setup for the benchmarks, here, in Parquet, or elsewhere, that we can borrow? Does it mean we'll compare two or more approaches, or just measure the performance of a code path like the read path you mentioned? For the C++ part, would benchmark code in C++ or Python be preferred?

>> For example, while the read path of the above code does not copy any data, it would be useful to know how fast reassembling the row batch data structure is and how that scales with the number of columns.
I guess that means the data is columnar and the read path will reassemble it into row batches without any data copying (via pointers), right?

Bear with me if I'm saying something stupid. Thanks!

Regards,
Kai
