Posted to user@arrow.apache.org by Dawson D'Almeida <da...@snowflake.com> on 2021/02/12 20:58:25 UTC

[c++] Help with serializing and IPC with dictionary arrays

I am trying to create a record batch containing any number of dictionary
and/or normal Arrow arrays, serialize the record batch into bytes (a normal
std::string), and send it via gRPC to another server process. On the
receiving end, we deserialize the record batch from those bytes and the
schema.

Is there a standard way to serialize/deserialize these dictionary arrays?
It seems like all of the info is packaged correctly into the record batch.

I've looked through a lot of the C++ Apache Arrow source and test code but
I can't find how to approach our use case.

The current failure is:
Field with memory address 140283497044320 not found
in the status returned by arrow::ipc::ReadRecordBatch
<https://github.com/apache/arrow/blob/319b46c0011918939557dc943a0d5d7179860b81/cpp/src/arrow/ipc/reader.h#L406>


Thanks,
-- 
Dawson d'Almeida
Software Engineer

MOBILE  +1 360 499 1852
EMAIL  dawson.dalmeida@snowflake.com <na...@snowflake.com>


Snowflake Inc.
227 Bellevue Way NE
Bellevue, WA, 98004

Re: [c++] Help with serializing and IPC with dictionary arrays

Posted by Wes McKinney <we...@gmail.com>.
I believe you have to extend the ipc::MessageReader interface. Have you
looked at the details in the Flight client?

https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/client.cc#L425

(There is analogous code handling the Put side in server.cc.) The idea is
that you feed in the stream of IPC messages, and the dictionary
accounting and record batch reconstruction are handled internally.
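
To illustrate the "dictionary accounting" idea, here is a toy model of a
reader that rebuilds its dictionary state purely from the messages it
receives. It is only a sketch: ToyMessage and ToyStreamDecoder are
hypothetical names, not Arrow types, and the real stream also begins with a
schema message, which is omitted here. The point is that dictionaries arrive
as their own messages ahead of the record batches that reference them, so the
receiver never needs the sender's in-memory state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-ins for IPC messages. These are NOT Arrow types.
enum class MessageKind { kDictionary, kRecordBatch };

struct ToyMessage {
  MessageKind kind;
  int64_t dictionary_id;            // dictionary defined or referenced
  std::vector<std::string> values;  // payload of a dictionary message
  std::vector<int32_t> indices;     // payload of a record batch message
};

// Reader-side state, populated only from the messages consumed so far.
class ToyStreamDecoder {
 public:
  // Consumes one message. Returns the decoded column for a record batch
  // message, or std::nullopt for a dictionary message.
  std::optional<std::vector<std::string>> Consume(const ToyMessage& msg) {
    if (msg.kind == MessageKind::kDictionary) {
      dictionaries_[msg.dictionary_id] = msg.values;
      return std::nullopt;
    }
    auto it = dictionaries_.find(msg.dictionary_id);
    if (it == dictionaries_.end()) {
      // A batch that arrives without its dictionary cannot be decoded.
      throw std::runtime_error("dictionary " +
                               std::to_string(msg.dictionary_id) +
                               " not found");
    }
    std::vector<std::string> decoded;
    for (int32_t index : msg.indices) {
      assert(index >= 0 && "indices must be non-negative");
      decoded.push_back(it->second.at(static_cast<std::size_t>(index)));
    }
    return decoded;
  }

 private:
  std::map<int64_t, std::vector<std::string>> dictionaries_;
};
```

Feeding the decoder a dictionary message followed by a record batch message
yields the decoded strings. Feeding it only the record batch message throws,
which is the toy analogue of sending a lone record batch message to a
receiver that has never seen the dictionaries.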

On Thu, Feb 18, 2021 at 12:14 PM Dawson D'Almeida <
dawson.dalmeida@snowflake.com> wrote:

> [...]

Re: [c++] Help with serializing and IPC with dictionary arrays

Posted by Dawson D'Almeida <da...@snowflake.com>.
Hi Wes,

We have our own implementation of something like Flight for flexibility of
use.

The main thing that I am trying to figure out is how to get the dictionary
record batches properly deserialized on the server side. On the client
side, I can deserialize them properly using the DictionaryMemo directly
from the record batch we create, but on the other side I do not have access
to the same DictionaryMemo. How is this passed in Flight? I have been
trying to find this in the source code but haven't found it yet.

Thanks,
Dawson

On Fri, Feb 12, 2021 at 3:34 PM Wes McKinney <we...@gmail.com> wrote:

> [...]



Re: [c++] Help with serializing and IPC with dictionary arrays

Posted by Wes McKinney <we...@gmail.com>.
Hi Dawson, you need to follow the IPC stream protocol, e.g. what
RecordBatchStreamWriter and RecordBatchStreamReader are doing
internally. Is there a reason you cannot use these interfaces
(particularly their internal bits, which are also used to implement
Flight, where messages are split across different elements of a gRPC
stream)?

I'm not sure that I would advise you to deal with dictionary
disassembly and reconstruction on your own unless it's your only
option. That said, if you look in the unit test suite you should be
able to find examples where DictionaryBatch IPC messages are
reconstructed manually and then used to reconstitute a RecordBatch
from its IPC message using the arrow::ipc::ReadRecordBatch API. We can
try to help you look in the right place; let us know.

Thanks,
Wes
