You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Paul Dix <pa...@influxdata.com> on 2020/04/07 15:36:33 UTC

[Rust] Dictionary encoding and Flight

Hello,
I'm trying to build a Rust based Flight server and I'd like to use
Dictionary encoding for a number of string columns in my data. I've seen
that StringDictionary was recently added to Rust here:
https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab.

However, that doesn't seem to reach down into Flight. When I attempt to
send a schema through flight that has a Dictionary<UInt8, Utf8> it throws
an error when attempting to convert from the Rust type to the Flatbuffer
field type. I figured I'd take a swing at adding that to convert.rs here:
https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319

However, when I look at the definitions in Schema.fbs and the related
generated Rust file, Dictionary isn't a type there. Should I be sending
this down as some other composed type? And if so, how does this look at the
client side of things? In my test I'm connecting to the Flight server via
PyArrow and working with it in Pandas so I'm hoping that it will be able to
consume Dictionary fields.

Separately, the Rust field type doesn't have a spot for the dictionary ID,
which I assume I'll need to send down so it can be consumed on the client.
Would appreciate any thoughts on that. A little push in the right direction
and I'll be happy to submit a PR to help push the Rust Flight
implementation farther along.

Thanks,
Paul

Re: [Rust] Dictionary encoding and Flight

Posted by Wes McKinney <we...@gmail.com>.
Well, luckily we have some newly spruced up documentation about how
integration testing works (thanks Neal!)

https://github.com/apache/arrow/blob/master/docs/source/format/Integration.rst

The main task is writing a parser for the JSON format used for
integration testing. The JSON is used to communicate the "truth" about
the data between two implementations. There are implementations of
this in C++, Go, JS, and Java, so any of those implementations could
be used as your guide.

One small reminder for others reading: the current JSON integration
testing format doesn't yet capture dictionary deltas/replacements. See
ARROW-5338

On Thu, Apr 9, 2020 at 5:55 PM Paul Dix <pa...@influxdata.com> wrote:
>
> I'd be happy to pitch in on getting the integration tests developed. It
> would certainly beat my current method of building and running my test
> project and switching over to a Jupyter notebook to manually check it.
>
> Is there any prior work in the Rust project that I could basically copy
> from? Or perhaps the C++ implementation?
>
> On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Paul,
> >
> > Dictionary-encoded is not a nested type, so there shouldn't be any
> > children -- the IPC layout of a dictionary encoded field is that same
> > as the type of the indices (probably want to change the terminology in
> > the Rust library from "keys" to "indices" which is what's used in the
> > specification). And the dictionaries are sent separately in record
> > batches
> >
> > > The Rust implementation of DictionaryArray keeps the values as a child
> > to the keys array.
> >
> > What we do in C++ is that the dictionary is a separate member of the
> > ArrayData data structure. I don't know enough about the Rust library
> > to know whether that's the right design choice or not.
> >
> > Implementing Rust support for integration tests would make it much
> > easier to validate that things are implemented correctly. Does anyone
> > know how far away the Rust library might be from having them? It's not
> > the most fun thing to implement, but it's very important.
> >
> > - Wes
> >
> > On Thu, Apr 9, 2020 at 5:08 PM Paul Dix <pa...@influxdata.com> wrote:
> > >
> > > I managed to get something up and running. I ended up creating a
> > > dictionary_batch.rs and adding that to convert.rs to translate
> > dictionary
> > > fields in a schema over to the correct fb thing. I also added a method to
> > > writer.rs to convert that to bytes so it can be sent via ipc. However,
> > when
> > > writing out the subsequent record batches that contain a dictionary field
> > > that references the dictionary sent in the dictionary batch, it runs
> > into a
> > > problem. The Rust implementation of DictionaryArray keeps the values as a
> > > child to the keys array.
> > >
> > > You can see the chain of calls when calling finish on the
> > > StringDictionaryBuilder
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> > > which calls out to finish the dictionary
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> > > which then calls from to create the DictionaryArray
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> > > And that last part validates that the child is there.
> > >
> > > And here in the writer is where it blindly writes out any children:
> > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
> > >
> > > I found that when I sent this out, the Python reader on the other side
> > > would choke on it when I tried to read_pandas. However, if I skip
> > > serializing the children for DictionaryArray only, then everything works
> > > exactly as expected.
> > >
> > > So I'm not sure what the right thing here is because I don't know if
> > > DictionaryArrays should not include the children (i.e. values) when being
> > > written out or if there's some other thing that should be happening.
> > Should
> > > the raw arrow data have the values in that dictionary? Or does it get
> > > reconstituted manually on the other side from the DictionaryBatch?
> > >
> > > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > As another item for consideration -- in C++ at least, the dictionary
> > > > id is dealt with as an internal detail of the IPC message production
> > > > process. When serializing the Schema, id's are assigned to each
> > > > dictionary-encoded field in the DictionaryMemo object, see
> > > >
> > > >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> > > >
> > > > When record batches are reconstructed, the dictionary corresponding to
> > > > an id at the time of reconstruction is set in the Array's internal
> > > > data -- that's the "dictionary" member of the ArrayData object
> > > > (
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> > > >
> > > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > > > >
> > > > > hey Paul,
> > > > >
> > > > > Take a look at how dictionaries work in the IPC protocol
> > > > >
> > > > >
> > > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > > > >
> > > > > Dictionaries are sent as separate messages. When a field is tagged as
> > > > > dictionary encoded in the schema, the IPC reader must keep track of
> > > > > the dictionaries it's seen come across the protocol and then set them
> > > > > in the reconstructed record batches when a record batch comes
> > through.
> > > > >
> > > > > Note that the protocol now supports dictionary deltas (dictionaries
> > > > > can be appended to by subsequent messages for the same dictionary id)
> > > > > and replacements (new dictionary for an id).
> > > > >
> > > > > I don't know what the status of handling dictionaries in the Rust
> > IPC,
> > > > > but it would be a good idea to take time to take into account the
> > > > > above details.
> > > > >
> > > > > Finally, note that Rust is not participating in either the regular
> > IPC
> > > > > nor Flight integration tests. This is an important milestone to being
> > > > > able to depend on the Rust library in production.
> > > > >
> > > > > Thanks
> > > > > Wes
> > > > >
> > > > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <pa...@influxdata.com>
> > wrote:
> > > > > >
> > > > > > Hello,
> > > > > > I'm trying to build a Rust based Flight server and I'd like to use
> > > > > > Dictionary encoding for a number of string columns in my data. I've
> > > > seen
> > > > > > that StringDictionary was recently added to Rust here:
> > > > > >
> > > >
> > https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> > > > .
> > > > > >
> > > > > > However, that doesn't seem to reach down into Flight. When I
> > attempt to
> > > > > > send a schema through flight that has a Dictionary<UInt8, Utf8> it
> > > > throws
> > > > > > an error when attempting to convert from the Rust type to the
> > > > Flatbuffer
> > > > > > field type. I figured I'd take a swing at adding that to
> > convert.rs
> > > > here:
> > > > > >
> > > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > > > > >
> > > > > > However, when I look at the definitions in Schema.fbs and the
> > related
> > > > > > generated Rust file, Dictionary isn't a type there. Should I be
> > sending
> > > > > > this down as some other composed type? And if so, how does this
> > look
> > > > at the
> > > > > > client side of things? In my test I'm connecting to the Flight
> > server
> > > > via
> > > > > > PyArrow and working with it in Pandas so I'm hoping that it will be
> > > > able to
> > > > > > consume Dictionary fields.
> > > > > >
> > > > > > Separately, the Rust field type doesn't have a spot for the
> > dictionary
> > > > ID,
> > > > > > which I assume I'll need to send down so it can be consumed on the
> > > > client.
> > > > > > Would appreciate any thoughts on that. A little push in the right
> > > > direction
> > > > > > and I'll be happy to submit a PR to help push the Rust Flight
> > > > > > implementation farther along.
> > > > > >
> > > > > > Thanks,
> > > > > > Paul
> > > >
> > >
> > >
> > > --
> > >
> > > Paul Dix
> > >
> > > Founder & CTO • InfluxData
> > >
> > > m 646.283.6377     t @pauldix     w influxdata.com
> > > <https://www.influxdata.com/>
> >
>
>
> --
>
> Paul Dix
>
> Founder & CTO • InfluxData
>
> m 646.283.6377     t @pauldix     w influxdata.com
> <https://www.influxdata.com/>

Re: [Rust] Dictionary encoding and Flight

Posted by Paul Dix <pa...@influxdata.com>.
I'd be happy to pitch in on getting the integration tests developed. It
would certainly beat my current method of building and running my test
project and switching over to a Jupyter notebook to manually check it.

Is there any prior work in the Rust project that I could basically copy
from? Or perhaps the C++ implementation?

On Thu, Apr 9, 2020 at 6:31 PM Wes McKinney <we...@gmail.com> wrote:

> hi Paul,
>
> Dictionary-encoded is not a nested type, so there shouldn't be any
> children -- the IPC layout of a dictionary encoded field is that same
> as the type of the indices (probably want to change the terminology in
> the Rust library from "keys" to "indices" which is what's used in the
> specification). And the dictionaries are sent separately in record
> batches
>
> > The Rust implementation of DictionaryArray keeps the values as a child
> to the keys array.
>
> What we do in C++ is that the dictionary is a separate member of the
> ArrayData data structure. I don't know enough about the Rust library
> to know whether that's the right design choice or not.
>
> Implementing Rust support for integration tests would make it much
> easier to validate that things are implemented correctly. Does anyone
> know how far away the Rust library might be from having them? It's not
> the most fun thing to implement, but it's very important.
>
> - Wes
>
> On Thu, Apr 9, 2020 at 5:08 PM Paul Dix <pa...@influxdata.com> wrote:
> >
> > I managed to get something up and running. I ended up creating a
> > dictionary_batch.rs and adding that to convert.rs to translate
> dictionary
> > fields in a schema over to the correct fb thing. I also added a method to
> > writer.rs to convert that to bytes so it can be sent via ipc. However,
> when
> > writing out the subsequent record batches that contain a dictionary field
> > that references the dictionary sent in the dictionary batch, it runs
> into a
> > problem. The Rust implementation of DictionaryArray keeps the values as a
> > child to the keys array.
> >
> > You can see the chain of calls when calling finish on the
> > StringDictionaryBuilder
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> > which calls out to finish the dictionary
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> > which then calls from to create the DictionaryArray
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> > And that last part validates that the child is there.
> >
> > And here in the writer is where it blindly writes out any children:
> >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
> >
> > I found that when I sent this out, the Python reader on the other side
> > would choke on it when I tried to read_pandas. However, if I skip
> > serializing the children for DictionaryArray only, then everything works
> > exactly as expected.
> >
> > So I'm not sure what the right thing here is because I don't know if
> > DictionaryArrays should not include the children (i.e. values) when being
> > written out or if there's some other thing that should be happening.
> Should
> > the raw arrow data have the values in that dictionary? Or does it get
> > reconstituted manually on the other side from the DictionaryBatch?
> >
> > On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > As another item for consideration -- in C++ at least, the dictionary
> > > id is dealt with as an internal detail of the IPC message production
> > > process. When serializing the Schema, id's are assigned to each
> > > dictionary-encoded field in the DictionaryMemo object, see
> > >
> > >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> > >
> > > When record batches are reconstructed, the dictionary corresponding to
> > > an id at the time of reconstruction is set in the Array's internal
> > > data -- that's the "dictionary" member of the ArrayData object
> > > (
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> > >
> > > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <we...@gmail.com>
> wrote:
> > > >
> > > > hey Paul,
> > > >
> > > > Take a look at how dictionaries work in the IPC protocol
> > > >
> > > >
> > >
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > > >
> > > > Dictionaries are sent as separate messages. When a field is tagged as
> > > > dictionary encoded in the schema, the IPC reader must keep track of
> > > > the dictionaries it's seen come across the protocol and then set them
> > > > in the reconstructed record batches when a record batch comes
> through.
> > > >
> > > > Note that the protocol now supports dictionary deltas (dictionaries
> > > > can be appended to by subsequent messages for the same dictionary id)
> > > > and replacements (new dictionary for an id).
> > > >
> > > > I don't know what the status of handling dictionaries in the Rust
> IPC,
> > > > but it would be a good idea to take time to take into account the
> > > > above details.
> > > >
> > > > Finally, note that Rust is not participating in either the regular
> IPC
> > > > nor Flight integration tests. This is an important milestone to being
> > > > able to depend on the Rust library in production.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <pa...@influxdata.com>
> wrote:
> > > > >
> > > > > Hello,
> > > > > I'm trying to build a Rust based Flight server and I'd like to use
> > > > > Dictionary encoding for a number of string columns in my data. I've
> > > seen
> > > > > that StringDictionary was recently added to Rust here:
> > > > >
> > >
> https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> > > .
> > > > >
> > > > > However, that doesn't seem to reach down into Flight. When I
> attempt to
> > > > > send a schema through flight that has a Dictionary<UInt8, Utf8> it
> > > throws
> > > > > an error when attempting to convert from the Rust type to the
> > > Flatbuffer
> > > > > field type. I figured I'd take a swing at adding that to
> convert.rs
> > > here:
> > > > >
> > >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > > > >
> > > > > However, when I look at the definitions in Schema.fbs and the
> related
> > > > > generated Rust file, Dictionary isn't a type there. Should I be
> sending
> > > > > this down as some other composed type? And if so, how does this
> look
> > > at the
> > > > > client side of things? In my test I'm connecting to the Flight
> server
> > > via
> > > > > PyArrow and working with it in Pandas so I'm hoping that it will be
> > > able to
> > > > > consume Dictionary fields.
> > > > >
> > > > > Separately, the Rust field type doesn't have a spot for the
> dictionary
> > > ID,
> > > > > which I assume I'll need to send down so it can be consumed on the
> > > client.
> > > > > Would appreciate any thoughts on that. A little push in the right
> > > direction
> > > > > and I'll be happy to submit a PR to help push the Rust Flight
> > > > > implementation farther along.
> > > > >
> > > > > Thanks,
> > > > > Paul
> > >
> >
> >
> > --
> >
> > Paul Dix
> >
> > Founder & CTO • InfluxData
> >
> > m 646.283.6377     t @pauldix     w influxdata.com
> > <https://www.influxdata.com/>
>


-- 

Paul Dix

Founder & CTO • InfluxData

m 646.283.6377     t @pauldix     w influxdata.com
<https://www.influxdata.com/>

Re: [Rust] Dictionary encoding and Flight

Posted by Wes McKinney <we...@gmail.com>.
hi Paul,

Dictionary-encoded is not a nested type, so there shouldn't be any
children -- the IPC layout of a dictionary encoded field is that same
as the type of the indices (probably want to change the terminology in
the Rust library from "keys" to "indices" which is what's used in the
specification). And the dictionaries are sent separately in record
batches

> The Rust implementation of DictionaryArray keeps the values as a child to the keys array.

What we do in C++ is that the dictionary is a separate member of the
ArrayData data structure. I don't know enough about the Rust library
to know whether that's the right design choice or not.

Implementing Rust support for integration tests would make it much
easier to validate that things are implemented correctly. Does anyone
know how far away the Rust library might be from having them? It's not
the most fun thing to implement, but it's very important.

- Wes

On Thu, Apr 9, 2020 at 5:08 PM Paul Dix <pa...@influxdata.com> wrote:
>
> I managed to get something up and running. I ended up creating a
> dictionary_batch.rs and adding that to convert.rs to translate dictionary
> fields in a schema over to the correct fb thing. I also added a method to
> writer.rs to convert that to bytes so it can be sent via ipc. However, when
> writing out the subsequent record batches that contain a dictionary field
> that references the dictionary sent in the dictionary batch, it runs into a
> problem. The Rust implementation of DictionaryArray keeps the values as a
> child to the keys array.
>
> You can see the chain of calls when calling finish on the
> StringDictionaryBuilder
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
> which calls out to finish the dictionary
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
> which then calls from to create the DictionaryArray
> https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
> And that last part validates that the child is there.
>
> And here in the writer is where it blindly writes out any children:
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378
>
> I found that when I sent this out, the Python reader on the other side
> would choke on it when I tried to read_pandas. However, if I skip
> serializing the children for DictionaryArray only, then everything works
> exactly as expected.
>
> So I'm not sure what the right thing here is because I don't know if
> DictionaryArrays should not include the children (i.e. values) when being
> written out or if there's some other thing that should be happening. Should
> the raw arrow data have the values in that dictionary? Or does it get
> reconstituted manually on the other side from the DictionaryBatch?
>
> On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <we...@gmail.com> wrote:
>
> > As another item for consideration -- in C++ at least, the dictionary
> > id is dealt with as an internal detail of the IPC message production
> > process. When serializing the Schema, id's are assigned to each
> > dictionary-encoded field in the DictionaryMemo object, see
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
> >
> > When record batches are reconstructed, the dictionary corresponding to
> > an id at the time of reconstruction is set in the Array's internal
> > data -- that's the "dictionary" member of the ArrayData object
> > (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
> >
> > On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > hey Paul,
> > >
> > > Take a look at how dictionaries work in the IPC protocol
> > >
> > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> > >
> > > Dictionaries are sent as separate messages. When a field is tagged as
> > > dictionary encoded in the schema, the IPC reader must keep track of
> > > the dictionaries it's seen come across the protocol and then set them
> > > in the reconstructed record batches when a record batch comes through.
> > >
> > > Note that the protocol now supports dictionary deltas (dictionaries
> > > can be appended to by subsequent messages for the same dictionary id)
> > > and replacements (new dictionary for an id).
> > >
> > > I don't know what the status of handling dictionaries in the Rust IPC,
> > > but it would be a good idea to take time to take into account the
> > > above details.
> > >
> > > Finally, note that Rust is not participating in either the regular IPC
> > > nor Flight integration tests. This is an important milestone to being
> > > able to depend on the Rust library in production.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <pa...@influxdata.com> wrote:
> > > >
> > > > Hello,
> > > > I'm trying to build a Rust based Flight server and I'd like to use
> > > > Dictionary encoding for a number of string columns in my data. I've
> > seen
> > > > that StringDictionary was recently added to Rust here:
> > > >
> > https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> > .
> > > >
> > > > However, that doesn't seem to reach down into Flight. When I attempt to
> > > > send a schema through flight that has a Dictionary<UInt8, Utf8> it
> > throws
> > > > an error when attempting to convert from the Rust type to the
> > Flatbuffer
> > > > field type. I figured I'd take a swing at adding that to convert.rs
> > here:
> > > >
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > > >
> > > > However, when I look at the definitions in Schema.fbs and the related
> > > > generated Rust file, Dictionary isn't a type there. Should I be sending
> > > > this down as some other composed type? And if so, how does this look
> > at the
> > > > client side of things? In my test I'm connecting to the Flight server
> > via
> > > > PyArrow and working with it in Pandas so I'm hoping that it will be
> > able to
> > > > consume Dictionary fields.
> > > >
> > > > Separately, the Rust field type doesn't have a spot for the dictionary
> > ID,
> > > > which I assume I'll need to send down so it can be consumed on the
> > client.
> > > > Would appreciate any thoughts on that. A little push in the right
> > direction
> > > > and I'll be happy to submit a PR to help push the Rust Flight
> > > > implementation farther along.
> > > >
> > > > Thanks,
> > > > Paul
> >
>
>
> --
>
> Paul Dix
>
> Founder & CTO • InfluxData
>
> m 646.283.6377     t @pauldix     w influxdata.com
> <https://www.influxdata.com/>

Re: [Rust] Dictionary encoding and Flight

Posted by Paul Dix <pa...@influxdata.com>.
I managed to get something up and running. I ended up creating a
dictionary_batch.rs and adding that to convert.rs to translate dictionary
fields in a schema over to the correct fb thing. I also added a method to
writer.rs to convert that to bytes so it can be sent via ipc. However, when
writing out the subsequent record batches that contain a dictionary field
that references the dictionary sent in the dictionary batch, it runs into a
problem. The Rust implementation of DictionaryArray keeps the values as a
child to the keys array.

You can see the chain of calls when calling finish on the
StringDictionaryBuilder
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L1312-L1316
which calls out to finish the dictionary
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L464-L482
which then calls from to create the DictionaryArray
https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L1900-L1927
And that last part validates that the child is there.

And here in the writer is where it blindly writes out any children:
https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/writer.rs#L366-L378

I found that when I sent this out, the Python reader on the other side
would choke on it when I tried to read_pandas. However, if I skip
serializing the children for DictionaryArray only, then everything works
exactly as expected.

So I'm not sure what the right thing here is because I don't know if
DictionaryArrays should not include the children (i.e. values) when being
written out or if there's some other thing that should be happening. Should
the raw arrow data have the values in that dictionary? Or does it get
reconstituted manually on the other side from the DictionaryBatch?

On Wed, Apr 8, 2020 at 12:05 AM Wes McKinney <we...@gmail.com> wrote:

> As another item for consideration -- in C++ at least, the dictionary
> id is dealt with as an internal detail of the IPC message production
> process. When serializing the Schema, id's are assigned to each
> dictionary-encoded field in the DictionaryMemo object, see
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
>
> When record batches are reconstructed, the dictionary corresponding to
> an id at the time of reconstruction is set in the Array's internal
> data -- that's the "dictionary" member of the ArrayData object
> (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
>
> On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > hey Paul,
> >
> > Take a look at how dictionaries work in the IPC protocol
> >
> >
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
> >
> > Dictionaries are sent as separate messages. When a field is tagged as
> > dictionary encoded in the schema, the IPC reader must keep track of
> > the dictionaries it's seen come across the protocol and then set them
> > in the reconstructed record batches when a record batch comes through.
> >
> > Note that the protocol now supports dictionary deltas (dictionaries
> > can be appended to by subsequent messages for the same dictionary id)
> > and replacements (new dictionary for an id).
> >
> > I don't know what the status of handling dictionaries in the Rust IPC,
> > but it would be a good idea to take time to take into account the
> > above details.
> >
> > Finally, note that Rust is not participating in either the regular IPC
> > nor Flight integration tests. This is an important milestone to being
> > able to depend on the Rust library in production.
> >
> > Thanks
> > Wes
> >
> > On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <pa...@influxdata.com> wrote:
> > >
> > > Hello,
> > > I'm trying to build a Rust based Flight server and I'd like to use
> > > Dictionary encoding for a number of string columns in my data. I've
> seen
> > > that StringDictionary was recently added to Rust here:
> > >
> https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab
> .
> > >
> > > However, that doesn't seem to reach down into Flight. When I attempt to
> > > send a schema through flight that has a Dictionary<UInt8, Utf8> it
> throws
> > > an error when attempting to convert from the Rust type to the
> Flatbuffer
> > > field type. I figured I'd take a swing at adding that to convert.rs
> here:
> > >
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> > >
> > > However, when I look at the definitions in Schema.fbs and the related
> > > generated Rust file, Dictionary isn't a type there. Should I be sending
> > > this down as some other composed type? And if so, how does this look
> at the
> > > client side of things? In my test I'm connecting to the Flight server
> via
> > > PyArrow and working with it in Pandas so I'm hoping that it will be
> able to
> > > consume Dictionary fields.
> > >
> > > Separately, the Rust field type doesn't have a spot for the dictionary
> ID,
> > > which I assume I'll need to send down so it can be consumed on the
> client.
> > > Would appreciate any thoughts on that. A little push in the right
> direction
> > > and I'll be happy to submit a PR to help push the Rust Flight
> > > implementation farther along.
> > >
> > > Thanks,
> > > Paul
>


-- 

Paul Dix

Founder & CTO • InfluxData

m 646.283.6377     t @pauldix     w influxdata.com
<https://www.influxdata.com/>

Re: [Rust] Dictionary encoding and Flight

Posted by Wes McKinney <we...@gmail.com>.
As another item for consideration -- in C++ at least, the dictionary
id is dealt with as an internal detail of the IPC message production
process. When serializing the Schema, id's are assigned to each
dictionary-encoded field in the DictionaryMemo object, see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h

When record batches are reconstructed, the dictionary corresponding to
an id at the time of reconstruction is set in the Array's internal
data -- that's the "dictionary" member of the ArrayData object
(https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).

On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <we...@gmail.com> wrote:
>
> hey Paul,
>
> Take a look at how dictionaries work in the IPC protocol
>
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
>
> Dictionaries are sent as separate messages. When a field is tagged as
> dictionary encoded in the schema, the IPC reader must keep track of
> the dictionaries it's seen come across the protocol and then set them
> in the reconstructed record batches when a record batch comes through.
>
> Note that the protocol now supports dictionary deltas (dictionaries
> can be appended to by subsequent messages for the same dictionary id)
> and replacements (new dictionary for an id).
>
> I don't know what the status of handling dictionaries in the Rust IPC,
> but it would be a good idea to take time to take into account the
> above details.
>
> Finally, note that Rust is not participating in either the regular IPC
> nor Flight integration tests. This is an important milestone to being
> able to depend on the Rust library in production.
>
> Thanks
> Wes
>
> On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <pa...@influxdata.com> wrote:
> >
> > Hello,
> > I'm trying to build a Rust based Flight server and I'd like to use
> > Dictionary encoding for a number of string columns in my data. I've seen
> > that StringDictionary was recently added to Rust here:
> > https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab.
> >
> > However, that doesn't seem to reach down into Flight. When I attempt to
> > send a schema through flight that has a Dictionary<UInt8, Utf8> it throws
> > an error when attempting to convert from the Rust type to the Flatbuffer
> > field type. I figured I'd take a swing at adding that to convert.rs here:
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> >
> > However, when I look at the definitions in Schema.fbs and the related
> > generated Rust file, Dictionary isn't a type there. Should I be sending
> > this down as some other composed type? And if so, how does this look at the
> > client side of things? In my test I'm connecting to the Flight server via
> > PyArrow and working with it in Pandas so I'm hoping that it will be able to
> > consume Dictionary fields.
> >
> > Separately, the Rust field type doesn't have a spot for the dictionary ID,
> > which I assume I'll need to send down so it can be consumed on the client.
> > Would appreciate any thoughts on that. A little push in the right direction
> > and I'll be happy to submit a PR to help push the Rust Flight
> > implementation farther along.
> >
> > Thanks,
> > Paul

Re: [Rust] Dictionary encoding and Flight

Posted by Wes McKinney <we...@gmail.com>.
hey Paul,

Take a look at how dictionaries work in the IPC protocol

https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc

Dictionaries are sent as separate messages. When a field is tagged as
dictionary encoded in the schema, the IPC reader must keep track of
the dictionaries it's seen come across the protocol and then set them
in the reconstructed record batches when a record batch comes through.

Note that the protocol now supports dictionary deltas (dictionaries
can be appended to by subsequent messages for the same dictionary id)
and replacements (new dictionary for an id).

I don't know what the status of handling dictionaries in the Rust IPC,
but it would be a good idea to take time to take into account the
above details.

Finally, note that Rust is not participating in either the regular IPC
nor Flight integration tests. This is an important milestone to being
able to depend on the Rust library in production.

Thanks
Wes

On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <pa...@influxdata.com> wrote:
>
> Hello,
> I'm trying to build a Rust based Flight server and I'd like to use
> Dictionary encoding for a number of string columns in my data. I've seen
> that StringDictionary was recently added to Rust here:
> https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab.
>
> However, that doesn't seem to reach down into Flight. When I attempt to
> send a schema through flight that has a Dictionary<UInt8, Utf8> it throws
> an error when attempting to convert from the Rust type to the Flatbuffer
> field type. I figured I'd take a swing at adding that to convert.rs here:
> https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
>
> However, when I look at the definitions in Schema.fbs and the related
> generated Rust file, Dictionary isn't a type there. Should I be sending
> this down as some other composed type? And if so, how does this look at the
> client side of things? In my test I'm connecting to the Flight server via
> PyArrow and working with it in Pandas so I'm hoping that it will be able to
> consume Dictionary fields.
>
> Separately, the Rust field type doesn't have a spot for the dictionary ID,
> which I assume I'll need to send down so it can be consumed on the client.
> Would appreciate any thoughts on that. A little push in the right direction
> and I'll be happy to submit a PR to help push the Rust Flight
> implementation farther along.
>
> Thanks,
> Paul