Posted to user@arrow.apache.org by Miki Tebeka <mi...@353solutions.com> on 2019/05/20 12:23:56 UTC

[C++] Storing/retrieving a Table in plasma

Hi,

I'm looking for an example of how to store/retrieve an arrow::Table in
plasma. The examples I see on the documentation site are for basic types.

My end goal is to create data (Table) in C++, store it in plasma and read
it from Python.

From reading around, I need to allocate a buffer in plasma, but how can I
find the size of the Table in order to allocate the buffer? And how can I
serialize it into the created Buffer?

Thanks,
Miki

Re: [C++] Storing/retrieving a Table in plasma

Posted by Miki Tebeka <mi...@353solutions.com>.
Thanks Wes!

Re: [C++] Storing/retrieving a Table in plasma

Posted by Wes McKinney <we...@gmail.com>.
hi Miki,

In

https://github.com/353solutions/carrow/blob/plasma/_misc/plasma.cc#L47

GetRecordBatchSize does not represent the entire size of the stream
including schema. If you are serializing Schema separately from
RecordBatch then you need to use the lower level
arrow::ipc::ReadRecordBatch/WriteRecordBatch functions. Have a look at
the unit tests.

If you are going to use RecordBatchStreamWriter then you need to
compute the size using MockOutputStream, per my original e-mail.

- Wes
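
To illustrate the point above, here is a minimal sketch of why a buffer
sized from the record batch alone overflows once a schema message is
written ahead of the batch. It does not use the Arrow API: BoundedWriter
and WriteStream are hypothetical stand-ins, and the schema size of 120 is
a made-up number. Only the 224-byte batch size comes from the output
reported in this thread. A real Arrow stream may carry additional framing
on top of this.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a fixed-size buffer writer: it tracks a write
// position against a fixed capacity and refuses to write past the end.
class BoundedWriter {
 public:
  explicit BoundedWriter(std::size_t capacity) : capacity_(capacity) {}

  // Advances the position by nbytes; returns false if that would overflow.
  bool Write(std::size_t nbytes) {
    assert(position_ <= capacity_);
    if (nbytes > capacity_ - position_) return false;
    position_ += nbytes;
    return true;
  }

  std::size_t position() const { return position_; }

 private:
  std::size_t capacity_;
  std::size_t position_ = 0;
};

// Writes a toy "stream": one schema message followed by the batch messages.
// Returns an empty string on success, or an error message on overflow.
std::string WriteStream(BoundedWriter& out, std::size_t schema_size,
                        const std::vector<std::size_t>& batch_sizes) {
  if (!out.Write(schema_size)) return "Write out of bounds";
  for (std::size_t n : batch_sizes) {
    if (!out.Write(n)) return "Write out of bounds";
  }
  return "";
}
```

With a capacity of 224 (the batch size only), WriteStream reports
"Write out of bounds" as soon as the batch follows the schema. With a
capacity of 224 + 120 it succeeds, which is why the size has to be taken
from the whole stream rather than from the batch.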

Re: [C++] Storing/retrieving a Table in plasma

Posted by Miki Tebeka <mi...@353solutions.com>.
>
> That link didn't work for me.

Doh! I moved it to
https://github.com/353solutions/carrow/blob/plasma/_misc/plasma.cc


> Would it not be better to do this work in Apache Arrow rather than an
> external project? I would guess the
> community would be interested in this.
>
I do plan to suggest this as a patch to arrow once the code is usable;
currently it's just noise.

The idea behind carrow is to use the underlying C++ in both Python & Go so
that in the same process we can simply share pointers (and maybe later use
a shared memory allocator to do it between processes). I don't see a clear
path to do it with the current Go implementation since it uses the Go
runtime to allocate memory, and carrow has a complicated build process that
currently won't work with a simple "go get".

To get an initial usable Go<->Python IPC quickly, I'm trying to utilize
plasma for now. However, in the long run I'd like to just share pointers
with no serialization at all.

I'd love to discuss how we can make this project usable and get the
community's help in solving some "ease of build" issues later on. Would love
to have it in the main arrow eventually.

Re: [C++] Storing/retrieving a Table in plasma

Posted by Wes McKinney <we...@gmail.com>.
hi Miki,

That link didn't work for me. Would it not be better to do this work
in Apache Arrow rather than an external project? I would guess the
community would be interested in this.

- Wes

Re: [C++] Storing/retrieving a Table in plasma

Posted by Miki Tebeka <mi...@353solutions.com>.
OK, almost working. I get "Write out of bounds" when running the code at
https://github.com/353solutions/carrow/blob/plasma/plasma.cc

Any ideas?

Full output:
batch size = 224
buf size = 224
error: write: Write out of bounds

Re: [C++] Storing/retrieving a Table in plasma

Posted by Miki Tebeka <mi...@353solutions.com>.
Thanks Wes

Re: [C++] Storing/retrieving a Table in plasma

Posted by Wes McKinney <we...@gmail.com>.
See https://issues.apache.org/jira/browse/ARROW-5377

Re: [C++] Storing/retrieving a Table in plasma

Posted by Wes McKinney <we...@gmail.com>.
hi Miki,

Steps

* Convert the Table to a sequence of RecordBatch objects. You can use
arrow::TableBatchReader to do this [1]
* Write a stream using MockOutputStream [2]
* Use the reported size of the total stream to allocate memory in Plasma
* Write a real stream using arrow::io::FixedSizeBufferWriter

At some point I'm interested in reducing the amount of boilerplate
associated with this process, and also in avoiding multiple metadata
serialization and record batch disassembly steps. I'll open a JIRA
issue.

We'd be delighted if you would contribute to the C++ documentation at
https://github.com/apache/arrow/tree/master/docs/source/cpp

- Wes

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/table.h#L340
[2]: https://github.com/apache/arrow/blob/7a5562174cffb21b16f990f64d114c1a94a30556/cpp/src/arrow/io/memory.h#L89
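
The measure-then-write pattern in the steps above can be sketched without
Arrow as follows. Everything here is a hypothetical stand-in, not the
Arrow API: CountingSink plays the role of MockOutputStream,
FixedBufferSink plays the role of arrow::io::FixedSizeBufferWriter, a
std::vector stands in for the buffer Plasma would allocate, and the
length-prefixed format written by WriteStream is a toy, not the Arrow IPC
stream format. In real code the same writer logic (for example a
RecordBatchStreamWriter fed by TableBatchReader) would run once against
each sink.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sink interface; plays the part of an output stream.
struct Sink {
  virtual ~Sink() = default;
  virtual void Write(const void* data, std::size_t nbytes) = 0;
};

// Pass 1: counts bytes and discards them.
struct CountingSink : Sink {
  std::size_t extent = 0;
  void Write(const void*, std::size_t nbytes) override { extent += nbytes; }
};

// Pass 2: copies into memory it does not own and never grows.
struct FixedBufferSink : Sink {
  std::uint8_t* data;
  std::size_t capacity;
  std::size_t position = 0;

  FixedBufferSink(std::uint8_t* d, std::size_t cap) : data(d), capacity(cap) {}

  void Write(const void* src, std::size_t nbytes) override {
    if (nbytes > capacity - position) {
      throw std::out_of_range("Write out of bounds");
    }
    if (nbytes == 0) return;
    std::memcpy(data + position, src, nbytes);
    position += nbytes;
  }
};

// Toy stream: a length-prefixed schema, then length-prefixed batches.
void WriteStream(Sink& sink, const std::string& schema,
                 const std::vector<std::string>& batches) {
  auto write_message = [&sink](const std::string& payload) {
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    sink.Write(&len, sizeof(len));
    sink.Write(payload.data(), payload.size());
  };
  write_message(schema);
  for (const std::string& batch : batches) write_message(batch);
}

// Measures the whole stream, allocates exactly that much, then writes it.
std::vector<std::uint8_t> SerializeTwoPass(
    const std::string& schema, const std::vector<std::string>& batches) {
  CountingSink counter;
  WriteStream(counter, schema, batches);  // pass 1: measure only

  std::vector<std::uint8_t> buffer(counter.extent);  // stand-in for Plasma
  FixedBufferSink writer(buffer.data(), buffer.size());
  WriteStream(writer, schema, batches);  // pass 2: the real write

  assert(writer.position == counter.extent);
  return buffer;
}
```

Because both passes run the same WriteStream, the measured size includes
the schema and every batch, so the second pass fills the buffer exactly
and never hits the out-of-bounds check.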
