You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2020/05/21 17:25:08 UTC

[IPC] Stream representation of nested dictionaries

Hello,

I'm working on getting nested dictionaries to work in the C++ IPC
implementation, together with integration tests.

My current implementation introduces a restriction.  Let's say we have
the following schema field:

- type: List
- dictionaryEncoding:
  - id: 123
  - indexType: Int32
- children[0]:
  - type: String
  - dictionaryEncoding:
    - id: 456
    - indexType: Int32

then my C++ patch requires that the dictionary batch for the inner
dictionary (id 456) appears before the dictionary batch for the outer
dictionary (id 123).  It seems like a reasonable restriction, but I'd
like to check if that's ok.  Also, should we add it to the spec?

Regards

Antoine.

Re: [IPC] Stream representation of nested dictionaries

Posted by Antoine Pitrou <an...@python.org>.
This is simply for completeness in integration tests.

Regards

Antoine.


Le 26/05/2020 à 05:21, Micah Kornfield a écrit :
> I don't have a strong opinion since, I'm not sure what use-case requires
> the nested dictionary types, but it seems like we should keep the
> specification but it seems more natural to writers to write the outer
> dictionary first.  Keeping this as a limitation of the C++ implementation
> seems reasonable for the time being until other languages have had the
> opportunity to support this.
> 
> Antoine are you actively working on something that will require nested
> dictionaries or is this simply for completeness?
> 
> Thanks,
> Micah
> 
> On Thu, May 21, 2020 at 6:08 PM Wes McKinney <we...@gmail.com> wrote:
> 
>> It is true that refactoring the IPC reader code to defer dictionary
>> reassembly given out-of-order dictionaries would be some work. The
>> worst case scenario in the short term is that this part of the C++
>> implementation is not implemented, and demonstrating that it works
>> when they appear in depth-first/bottom-up order may be good enough for
>> the 1.0 release.
>>
>> On Thu, May 21, 2020 at 1:12 PM Antoine Pitrou <an...@python.org> wrote:
>>>
>>>
>>> Le 21/05/2020 à 19:46, Micah Kornfield a écrit :
>>>> Hi Antoine,
>>>> Can you expand on why that restriction  is necessary/makes things
>> easier?
>>>> It seems a little strange since each dictionary batch has the ID
>> attached,
>>>> I wouldn't think it would be hard for the reader to track their
>> arrival in
>>>> any order.
>>>
>>> The problem is, you can't really instantiate the outer dictionary (as a
>>> C++ Array) without having the inner dictionary already.  We could work
>>> around that by keeping the DictionaryBatch around and decoding it
>>> lazily, but I don't know the cost of that refactor (nor the runtime
>>> consequences, e.g. react later to an error in the stream).
>>>
>>> Regards
>>>
>>> Antoine.
>>
> 

Re: [IPC] Stream representation of nested dictionaries

Posted by Micah Kornfield <em...@gmail.com>.
I don't have a strong opinion since, I'm not sure what use-case requires
the nested dictionary types, but it seems like we should keep the
specification but it seems more natural to writers to write the outer
dictionary first.  Keeping this as a limitation of the C++ implementation
seems reasonable for the time being until other languages have had the
opportunity to support this.

Antoine are you actively working on something that will require nested
dictionaries or is this simply for completeness?

Thanks,
Micah

On Thu, May 21, 2020 at 6:08 PM Wes McKinney <we...@gmail.com> wrote:

> It is true that refactoring the IPC reader code to defer dictionary
> reassembly given out-of-order dictionaries would be some work. The
> worst case scenario in the short term is that this part of the C++
> implementation is not implemented, and demonstrating that it works
> when they appear in depth-first/bottom-up order may be good enough for
> the 1.0 release.
>
> On Thu, May 21, 2020 at 1:12 PM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 21/05/2020 à 19:46, Micah Kornfield a écrit :
> > > Hi Antoine,
> > > Can you expand on why that restriction  is necessary/makes things
> easier?
> > > It seems a little strange since each dictionary batch has the ID
> attached,
> > > I wouldn't think it would be hard for the reader to track their
> arrival in
> > > any order.
> >
> > The problem is, you can't really instantiate the outer dictionary (as a
> > C++ Array) without having the inner dictionary already.  We could work
> > around that by keeping the DictionaryBatch around and decoding it
> > lazily, but I don't know the cost of that refactor (nor the runtime
> > consequences, e.g. react later to an error in the stream).
> >
> > Regards
> >
> > Antoine.
>

Re: [IPC] Stream representation of nested dictionaries

Posted by Wes McKinney <we...@gmail.com>.
It is true that refactoring the IPC reader code to defer dictionary
reassembly given out-of-order dictionaries would be some work. The
worst case scenario in the short term is that this part of the C++
implementation is not implemented, and demonstrating that it works
when they appear in depth-first/bottom-up order may be good enough for
the 1.0 release.

On Thu, May 21, 2020 at 1:12 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 21/05/2020 à 19:46, Micah Kornfield a écrit :
> > Hi Antoine,
> > Can you expand on why that restriction  is necessary/makes things easier?
> > It seems a little strange since each dictionary batch has the ID attached,
> > I wouldn't think it would be hard for the reader to track their arrival in
> > any order.
>
> The problem is, you can't really instantiate the outer dictionary (as a
> C++ Array) without having the inner dictionary already.  We could work
> around that by keeping the DictionaryBatch around and decoding it
> lazily, but I don't know the cost of that refactor (nor the runtime
> consequences, e.g. react later to an error in the stream).
>
> Regards
>
> Antoine.

Re: [IPC] Stream representation of nested dictionaries

Posted by Antoine Pitrou <an...@python.org>.
Le 21/05/2020 à 19:46, Micah Kornfield a écrit :
> Hi Antoine,
> Can you expand on why that restriction  is necessary/makes things easier?
> It seems a little strange since each dictionary batch has the ID attached,
> I wouldn't think it would be hard for the reader to track their arrival in
> any order.

The problem is, you can't really instantiate the outer dictionary (as a
C++ Array) without having the inner dictionary already.  We could work
around that by keeping the DictionaryBatch around and decoding it
lazily, but I don't know the cost of that refactor (nor the runtime
consequences, e.g. react later to an error in the stream).

Regards

Antoine.

Re: [IPC] Stream representation of nested dictionaries

Posted by Micah Kornfield <em...@gmail.com>.
Hi Antoine,
Can you expand on why that restriction  is necessary/makes things easier?
It seems a little strange since each dictionary batch has the ID attached,
I wouldn't think it would be hard for the reader to track their arrival in
any order.

Thanks,
Micah





On Thu, May 21, 2020 at 10:25 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hello,
>
> I'm working on getting nested dictionaries to work in the C++ IPC
> implementation, together with integration tests.
>
> My current implementation introduces a restriction.  Let's say we have
> the following schema field:
>
> - type: List
> - dictionaryEncoding:
>   - id: 123
>   - indexType: Int32
> - children[0]:
>   - type: String
>   - dictionaryEncoding:
>     - id: 456
>     - indexType: Int32
>
> then my C++ patch requires that the dictionary batch for the inner
> dictionary (id 456) appears before the dictionary batch for the outer
> dictionary (id 123).  It seems like a reasonable restriction, but I'd
> like to check if that's ok.  Also, should we add it to the spec?
>
> Regards
>
> Antoine.
>