
Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/10/03 04:23:26 UTC

Re: Clarifying interpretation of Buffer "length" field in Arrow protocol

Hi Wes,
It seems fine to be flexible here.  However:


> This could have implications for hashing or
> comparisons, for example, so I think that having the flexibility to do
> either is a good idea.

This statement of use-cases makes me a little nervous.  It seems like it
could lead to bugs if a consumer is reading from two producers that use
different alternatives?

Thanks,
Micah
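
A minimal sketch of the concern above, in Python. The helper `hash_declared` is hypothetical and not part of any Arrow library; it hashes exactly the bytes that the Buffer metadata declares. It assumes two producers write the same 1-byte validity bitmap followed by 7 zero bytes of padding, but declare different lengths.

```python
import hashlib


def hash_declared(body: bytes, offset: int, length: int) -> str:
    """Hash only the bytes that the (hypothetical) Buffer metadata declares."""
    return hashlib.sha256(body[offset:offset + length]).hexdigest()


# Same bitmap byte and same 7 bytes of padding on the wire (zero padding assumed).
body = bytes([0b00000101]) + bytes(7)

digest_unpadded = hash_declared(body, 0, 1)  # producer A declares length 1
digest_padded = hash_declared(body, 0, 8)    # producer B declares length 8

# The logical content is identical, yet the digests differ.
same_digest = digest_unpadded == digest_padded
```

A consumer that wants stable hashes across both producers would have to normalize first, for example by hashing only the bytes the array's logical length requires.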

On Mon, Sep 30, 2019 at 5:24 PM Wes McKinney <we...@gmail.com> wrote:

> I just updated my pull request from May adding language to clarify
> what protocol writers are expected to set when producing the Arrow
> binary protocol
>
> https://github.com/apache/arrow/pull/4370
>
> Implementations may allocate small buffers, or use memory which does
> not meet the 8-byte minimal padding requirements of the Arrow
> protocol. It becomes a question, then, whether to set the in-memory
> buffer size or the padded size when producing the protocol.
>
> This PR states that either is acceptable. As an example, a 1-byte
> validity buffer could have Buffer metadata stating that the size
> either is 1 byte or 8 bytes. Either way, 7 bytes of padding must be
> written to conform to the protocol. The metadata, therefore, reflects
> the "intent" of the protocol writer for the protocol reader. If the
> writer says the length is 1, then the protocol reader understands that
> the writer does not expect the reader to concern itself with the 7
> bytes of padding. This could have implications for hashing or
> comparisons, for example, so I think that having the flexibility to do
> either is a good idea.
>
> For an application that wants to guarantee that AVX512 instructions
> can be used on all buffers on the receiver side, it would be
> appropriate to include 512-bit padding in the accounting.
>
> Let me know if others think differently so we can have this properly
> documented for the 1.0.0 Format release.
>
> Thanks,
> Wes
>
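
The arithmetic in the quoted message can be sketched as follows. The function names `padded_length` and `write_buffer` are illustrative assumptions, not an Arrow API, and zero-filled padding is also an assumption. The bytes written are the same regardless of which length the writer declares; only the declared length differs.

```python
def padded_length(nbytes: int, alignment: int = 8) -> int:
    """Round nbytes up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (nbytes + alignment - 1) // alignment * alignment


def write_buffer(data: bytes, alignment: int = 8, declare_padded: bool = False):
    """Return (bytes_on_wire, declared_length) for one buffer."""
    total = padded_length(len(data), alignment)
    on_wire = data + bytes(total - len(data))  # padding is written either way
    declared = total if declare_padded else len(data)
    return on_wire, declared


# 1-byte validity buffer: declared length may be 1 or 8, with 7 padding bytes.
wire_a, len_a = write_buffer(b"\x05")
wire_b, len_b = write_buffer(b"\x05", declare_padded=True)

# 512-bit (64-byte) padding included in the accounting, as in the AVX512 case.
wire_c, len_c = write_buffer(b"\x05", alignment=64, declare_padded=True)
```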

Re: Clarifying interpretation of Buffer "length" field in Arrow protocol

Posted by Wes McKinney <we...@gmail.com>.

On Thu, Oct 3, 2019 at 7:33 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> On 03/10/2019 at 14:22, Wes McKinney wrote:
> > On Thu, Oct 3, 2019 at 4:26 AM Antoine Pitrou <an...@python.org> wrote:
> >>
> >>
> >> Yeah, I think the spec should be strict.  And for convenience, I'd say
> >> it should probably be the padded length (though I don't have a strong
> >> opinion).
> >
> > The reason I'm against this is that it makes it impossible for a
> > producer to preserve the exact state of its buffers for a consumer.
> >
> > For example, if you have a 1-byte validity bitmap, and you do not have
> > the flexibility to indicate in the metadata that the length is either
> > 1 (unpadded) or 8 (padded), then the consumer will only ever see 8
> > bytes.
>
> I see.  Then we should mandate the non-padded length, IMHO.

I think all that needs to be said is that an unpadded size is not
invalid. If a consumer is passed a buffer that is larger than it needs
to be, there is no harm done. I can tweak the language to reduce the
uncertainty, perhaps.

> Regards
>
> Antoine.
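
A sketch of why a consumer can tolerate either declared length, assuming a validity bitmap with Arrow's least-significant-bit ordering. `read_validity` and `required_bitmap_bytes` are hypothetical helpers written for illustration: the reader uses only the bytes it needs for `num_values` and ignores any extra declared bytes.

```python
from typing import List


def required_bitmap_bytes(num_values: int) -> int:
    """Bytes needed to hold one validity bit per value."""
    return (num_values + 7) // 8


def read_validity(buffer: bytes, declared_length: int, num_values: int) -> List[bool]:
    """Decode validity bits, accepting a declared length that is padded or not."""
    needed = required_bitmap_bytes(num_values)
    if declared_length < needed:
        raise ValueError("declared length is too small for the array length")
    if declared_length > len(buffer):
        raise ValueError("declared length exceeds the bytes available")
    bits = buffer[:needed]  # any bytes beyond `needed` are ignored
    return [bool((bits[i // 8] >> (i % 8)) & 1) for i in range(num_values)]


on_wire = b"\x05" + bytes(7)  # 1 bitmap byte plus 7 padding bytes
from_unpadded = read_validity(on_wire, declared_length=1, num_values=3)
from_padded = read_validity(on_wire, declared_length=8, num_values=3)
```

Both calls decode the same three bits, so for this reader the larger declared length is harmless.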

Re: Clarifying interpretation of Buffer "length" field in Arrow protocol

Posted by Antoine Pitrou <an...@python.org>.

On 03/10/2019 at 14:22, Wes McKinney wrote:
> On Thu, Oct 3, 2019 at 4:26 AM Antoine Pitrou <an...@python.org> wrote:
>>
>>
>> Yeah, I think the spec should be strict.  And for convenience, I'd say
>> it should probably be the padded length (though I don't have a strong
>> opinion).
> 
> The reason I'm against this is that it makes it impossible for a
> producer to preserve the exact state of its buffers for a consumer.
> 
> For example, if you have a 1-byte validity bitmap, and you do not have
> the flexibility to indicate in the metadata that the length is either
> 1 (unpadded) or 8 (padded), then the consumer will only ever see 8
> bytes.

I see.  Then we should mandate the non-padded length, IMHO.

Regards

Antoine.

Re: Clarifying interpretation of Buffer "length" field in Arrow protocol

Posted by Wes McKinney <we...@gmail.com>.

On Thu, Oct 3, 2019 at 4:26 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Yeah, I think the spec should be strict.  And for convenience, I'd say
> it should probably be the padded length (though I don't have a strong
> opinion).

The reason I'm against this is that it makes it impossible for a
producer to preserve the exact state of its buffers for a consumer.

For example, if you have a 1-byte validity bitmap, and you do not have
the flexibility to indicate in the metadata that the length is either
1 (unpadded) or 8 (padded), then the consumer will only ever see 8
bytes.

Note that padding is only performed in the context of the encapsulated
IPC format. If the metadata is used to communicate in-memory pointers,
then it is not appropriate to pad lengths that are not already padded.

> Regards
>
> Antoine.
>
>
> On 03/10/2019 at 06:23, Micah Kornfield wrote:
> > Hi Wes,
> > It seems fine to be flexible here.  However:
> >
> >
> >> This could have implications for hashing or
> >> comparisons, for example, so I think that having the flexibility to do
> >> either is a good idea.
> >
> > This statement of use-cases makes me a little nervous.  It seems like it
> > could lead to bugs if a consumer is reading from two producers that use
> > different alternatives?
> >
> > Thanks,
> > Micah
> >
> > On Mon, Sep 30, 2019 at 5:24 PM Wes McKinney <we...@gmail.com> wrote:
> >
> >> I just updated my pull request from May adding language to clarify
> >> what protocol writers are expected to set when producing the Arrow
> >> binary protocol
> >>
> >> https://github.com/apache/arrow/pull/4370
> >>
> >> Implementations may allocate small buffers, or use memory which does
> >> not meet the 8-byte minimal padding requirements of the Arrow
> >> protocol. It becomes a question, then, whether to set the in-memory
> >> buffer size or the padded size when producing the protocol.
> >>
> >> This PR states that either is acceptable. As an example, a 1-byte
> >> validity buffer could have Buffer metadata stating that the size
> >> either is 1 byte or 8 bytes. Either way, 7 bytes of padding must be
> >> written to conform to the protocol. The metadata, therefore, reflects
> >> the "intent" of the protocol writer for the protocol reader. If the
> >> writer says the length is 1, then the protocol reader understands that
> >> the writer does not expect the reader to concern itself with the 7
> >> bytes of padding. This could have implications for hashing or
> >> comparisons, for example, so I think that having the flexibility to do
> >> either is a good idea.
> >>
> >> For an application that wants to guarantee that AVX512 instructions
> >> can be used on all buffers on the receiver side, it would be
> >> appropriate to include 512-bit padding in the accounting.
> >>
> >> Let me know if others think differently so we can have this properly
> >> documented for the 1.0.0 Format release.
> >>
> >> Thanks,
> >> Wes
> >>
> >

Re: Clarifying interpretation of Buffer "length" field in Arrow protocol

Posted by Antoine Pitrou <an...@python.org>.

Yeah, I think the spec should be strict.  And for convenience, I'd say
it should probably be the padded length (though I don't have a strong
opinion).

Regards

Antoine.


On 03/10/2019 at 06:23, Micah Kornfield wrote:
> Hi Wes,
> It seems fine to be flexible here.  However:
> 
> 
>> This could have implications for hashing or
>> comparisons, for example, so I think that having the flexibility to do
>> either is a good idea.
> 
> This statement of use-cases makes me a little nervous.  It seems like it
> could lead to bugs if a consumer is reading from two producers that use
> different alternatives?
> 
> Thanks,
> Micah
> 
> On Mon, Sep 30, 2019 at 5:24 PM Wes McKinney <we...@gmail.com> wrote:
> 
>> I just updated my pull request from May adding language to clarify
>> what protocol writers are expected to set when producing the Arrow
>> binary protocol
>>
>> https://github.com/apache/arrow/pull/4370
>>
>> Implementations may allocate small buffers, or use memory which does
>> not meet the 8-byte minimal padding requirements of the Arrow
>> protocol. It becomes a question, then, whether to set the in-memory
>> buffer size or the padded size when producing the protocol.
>>
>> This PR states that either is acceptable. As an example, a 1-byte
>> validity buffer could have Buffer metadata stating that the size
>> either is 1 byte or 8 bytes. Either way, 7 bytes of padding must be
>> written to conform to the protocol. The metadata, therefore, reflects
>> the "intent" of the protocol writer for the protocol reader. If the
>> writer says the length is 1, then the protocol reader understands that
>> the writer does not expect the reader to concern itself with the 7
>> bytes of padding. This could have implications for hashing or
>> comparisons, for example, so I think that having the flexibility to do
>> either is a good idea.
>>
>> For an application that wants to guarantee that AVX512 instructions
>> can be used on all buffers on the receiver side, it would be
>> appropriate to include 512-bit padding in the accounting.
>>
>> Let me know if others think differently so we can have this properly
>> documented for the 1.0.0 Format release.
>>
>> Thanks,
>> Wes
>>
>