Posted to dev@arrow.apache.org by Alex Baden <al...@omnisci.com> on 2020/04/16 18:44:09 UTC

Dictionary Memo serialization for CUDA IPC

Hi all,

OmniSci (formerly MapD) has been a long-time user of Arrow for IPC
serialization and memory sharing of query results, primarily through our
Python connector. We recently upgraded from Arrow 0.13 to Arrow 0.16,
which required us to change our Arrow conversion routines to handle the
new DictionaryMemo for serializing dictionaries. For CPU, this was
fairly easy, as I was able to just write the record batch stream using
`arrow::ipc::WriteRecordBatchStream` (and read it using
`RecordBatchStreamReader` on the client). For GPU/CUDA, however, I did
not see a way to serialize the dictionaries alongside the CUDA data and
wrap everything in a single "object" (the semantics of which I will
break down in a moment). So I came up with my own approach:
https://github.com/omnisci/omniscidb/blob/4ab6622bd0ee15e478bff4263f083ab761fc965c/QueryEngine/ArrowResultSetConverter.cpp#L219

Essentially, I assemble a RecordBatch with the dictionaries I want to
serialize and call WriteRecordBatchStream to serialize it into a CPU IPC
stream, which I copy to CPU shared memory. I then serialize the GPU
record batch into a CudaBuffer using SerializeRecordBatch. The
CudaBuffer is exported for IPC sharing, and I send both memory handles
(CPU and GPU) over to the client. The client then has to read the
RecordBatch containing the dictionaries, place the dictionaries
into a DictionaryMemo, and use that memo to read the record batches from
GPU memory. The process of building the DictionaryMemo on the client is here:
https://github.com/omnisci/omniscidb/blob/master/Tests/ArrowIpcIntegrationTest.cpp#L380
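The GPU half of this two-handle scheme looks approximately like the following (a sketch assuming the arrow::cuda C++ API from roughly this era; exact names and signatures vary by release, and it of course requires a CUDA-capable device):

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/gpu/cuda_api.h>

namespace cuda = arrow::cuda;

// Server side: serialize a record batch into device memory and export
// an IPC handle for it. The dictionaries travel separately, over the
// CPU shared-memory stream described above.
arrow::Result<std::shared_ptr<cuda::CudaIpcMemHandle>> ExportBatch(
    const std::shared_ptr<arrow::RecordBatch>& batch) {
  ARROW_ASSIGN_OR_RAISE(auto* manager, cuda::CudaDeviceManager::Instance());
  ARROW_ASSIGN_OR_RAISE(auto context, manager->GetContext(/*device_number=*/0));
  ARROW_ASSIGN_OR_RAISE(auto device_buffer,
                        cuda::SerializeRecordBatch(*batch, context.get()));
  return device_buffer->ExportForIpc();
}

// Client side: open the GPU handle, then read the batch using the
// schema and the DictionaryMemo reconstructed from the CPU stream.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> ImportBatch(
    const std::shared_ptr<cuda::CudaIpcMemHandle>& handle,
    const std::shared_ptr<arrow::Schema>& schema,
    const arrow::ipc::DictionaryMemo* memo) {
  ARROW_ASSIGN_OR_RAISE(auto* manager, cuda::CudaDeviceManager::Instance());
  ARROW_ASSIGN_OR_RAISE(auto context, manager->GetContext(/*device_number=*/0));
  ARROW_ASSIGN_OR_RAISE(auto device_buffer, context->OpenIpcBuffer(*handle));
  return cuda::ReadRecordBatch(schema, memo, device_buffer,
                               arrow::default_memory_pool());
}
```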

This seems to work OK, at least for C++, but I am interested in making
it more compact and possibly contributing some or all of it to mainline
Arrow. Therefore, I have two questions:
1) Does this look like a reasonable way to go about handling a
serialized RecordBatch in CUDA (that is, separate the dictionaries and
return two objects, or a single object holding two handles)?
2) Is this something that the Arrow community would be interested in
seeing contributed in whatever form we agree upon for (1)?

Thanks,
Alex

Re: Dictionary Memo serialization for CUDA IPC

Posted by Alex Baden <al...@omnisci.com>.
Hi Wes,

Thanks for the reply. I scanned through JIRA, but it didn't look like
this had been filed or that anyone was working on it, so I filed
https://issues.apache.org/jira/browse/ARROW-8927. I have a branch and
things look pretty good; I was able to duplicate
TestCudaArrowIpc_BasicWriteRead but with a record batch containing
dictionaries.


Alex


Re: Dictionary Memo serialization for CUDA IPC

Posted by Wes McKinney <we...@gmail.com>.
Hi Alex,

I haven't looked at the details of your code, but having APIs that
"collapse" the process of writing a single record batch along with its
dictionaries into a sequence of end-to-end IPC messages (and then having
a function to reverse that process to reconstruct the record batch),
and making that work for writing to GPU memory (using the new device
API), seems reasonable to me. There's a bit of refactoring that would
need to take place to be able to reuse certain code paths relating to
dictionary batch handling. Note also that we're due to implement delta
dictionaries and dictionary replacements, so we might want to take all
of these needs into account to reduce the amount of code churn that
takes place.

- Wes
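To make the shape of such a "collapsed" API concrete, it might look something like the following declarations — hypothetical names, sketched purely for discussion; nothing with these signatures exists in Arrow today:

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/io/interfaces.h>

// Hypothetical: write a record batch plus all of its dictionary
// batches as one end-to-end sequence of IPC messages into a
// caller-supplied output stream (which could target device memory
// via the new device API).
arrow::Status WriteBatchWithDictionaries(const arrow::RecordBatch& batch,
                                         arrow::io::OutputStream* out);

// Hypothetical inverse: consume the schema, dictionary batches, and
// record batch messages in order and reconstruct the record batch.
arrow::Result<std::shared_ptr<arrow::RecordBatch>>
ReadBatchWithDictionaries(arrow::io::InputStream* in);
```

Delta dictionaries and dictionary replacements would presumably extend the same message sequence rather than change its overall shape.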
