Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2019/07/22 17:35:35 UTC

[Discuss] Do a 0.15.0 release before 1.0.0?

Hello,

Recently we've discussed breaking the IPC format to fix a long-standing
alignment issue.  See this discussion:
https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E

Should we first do a 0.15.0 release in order to get those format fixes right?
Once that is settled, we can move on to the 1.0.0 release.

Regards

Antoine.

Re: [Discuss] Do a 0.15.0 release before 1.0.0?

Posted by Bryan Cutler <cu...@gmail.com>.

+1 on a 0.15.0 release. At a minimum, if we could detect the stream and
provide a clear error message for Python and Java, I think that would help
the transition. If we are also able to implement readers/writers that can
fall back to the 4-byte prefix, then that would be nice to have.
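
To make the detection idea concrete, here is a minimal sketch of a reader
that accepts both framings. It assumes the length is a little-endian signed
int32 and that the new framing is the <0xffffffff><buffer length> pattern
proposed on the other thread; read_metadata_length is a hypothetical helper
name, not an existing Arrow API.

```python
import io
import struct


def read_metadata_length(stream):
    """Return (metadata_length, is_new_format) for the next message.

    Assumption: lengths are little-endian int32, and the new framing
    starts with 0xFFFFFFFF, which decodes as -1 when read as an int32.
    """
    head = stream.read(4)
    if len(head) < 4:
        raise EOFError("stream ended before a length prefix was read")
    (first,) = struct.unpack("<i", head)
    if first >= 0:
        # Legacy framing: the first 4 bytes are the length itself.
        return first, False
    if first != -1:
        raise ValueError("unrecognized length prefix: %d" % first)
    # New framing: the real length follows the 0xFFFFFFFF marker.
    body = stream.read(4)
    if len(body) < 4:
        raise EOFError("stream ended after the 0xFFFFFFFF marker")
    (length,) = struct.unpack("<i", body)
    if length < 0:
        raise ValueError("invalid metadata length: %d" % length)
    return length, True
```

A reader that cannot handle the new framing could use the same check to
raise a clear "stream written by a newer Arrow version" error instead of
failing on a length of -1.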

On Wed, Jul 24, 2019 at 1:27 PM Jacques Nadeau <ja...@apache.org> wrote:

> I'm ok with the change and 0.15 release to better manage it.
>
>
> > I've always understood the metadata to be a few dozen/hundred KB, a
> > small percentage of the total message size. I could be underestimating
> > the ratios though -- is it common to have tables w/ 1000+ columns? I've
> > seen a few reports like that in cuDF, but I'm curious to hear
> > Jacques'/Dremio's experience too.
> >
>
> Metadata size has been an issue at different points for us. We do
> definitely see datasets with 1000+ columns. It is also compounded by the
> fact that as we add more columns, we typically decrease row count so that
> the individual batches are still easily pipelined--which further increases
> the relative ratio between data and metadata.
>

Re: [Discuss] Do a 0.15.0 release before 1.0.0?

Posted by Jacques Nadeau <ja...@apache.org>.

I'm ok with the change and 0.15 release to better manage it.


> I've always understood the metadata to be a few dozen/hundred KB, a
> small percentage of the total message size. I could be underestimating
> the ratios though -- is it common to have tables w/ 1000+ columns? I've
> seen a few reports like that in cuDF, but I'm curious to hear
> Jacques'/Dremio's experience too.
>

Metadata size has been an issue at different points for us. We do
definitely see datasets with 1000+ columns. It is also compounded by the
fact that as we add more columns, we typically decrease row count so that
the individual batches are still easily pipelined--which further increases
the relative ratio between data and metadata.

Re: [Discuss] Do a 0.15.0 release before 1.0.0?

Posted by Paul Taylor <pt...@gmail.com>.

> I'm not sure I understand this suggestion:
> 1.  Wouldn't this cause old readers to miss the last 4 bytes of the buffer
> (and provide meaningless bytes at the beginning).
> 2.  The current proposal on the other thread is to have the pattern be
> <0xffffffff><buffer length><buffer data>

Sorry, I didn't mean to say an int64_t length, just that now we'd be
reserving 8 bytes in the "metadata length" position where today we
reserve 4.

I'm not sure about every language, but at least in Python/JS an external 
forwards-compatible solution would involve slicing the message buffer up 
front like this:

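import pyarrow as pa

def adjust_message_buffer(message_bytes):
    # Drop a leading 0xFFFFFFFF marker so that an older reader sees the
    # 4-byte metadata length first.
    buf = pa.py_buffer(message_bytes)
    if bytes(message_bytes[:4]) == b"\xff\xff\xff\xff":
        return buf.slice(4)
    return buf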



On 7/23/19 7:31 PM, Micah Kornfield wrote:
>> Could we detect the 4-byte length, incur a penalty copying the memory to
>> an aligned buffer, then continue consuming the stream?
> I think that is the plan (or at least would be my plan) if we go ahead with
> the change
>
>
>
>> (It's probably
>> fine if we only write the 8-byte length, since consumers on older
>> versions of Arrow could slice from the 4th byte before passing a buffer
>> to the reader).
> I'm not sure I understand this suggestion:
> 1.  Wouldn't this cause old readers to miss the last 4 bytes of the buffer
> (and provide meaningless bytes at the beginning).
> 2.  The current proposal on the other thread is to have the pattern be
> <0xffffffff><buffer length><buffer data>
>
> Thanks,
> Micah
>
> On Tue, Jul 23, 2019 at 11:43 AM Paul Taylor <pt...@gmail.com>
> wrote:
>
>> +1 for a 0.15.0 before 1.0 if we go ahead with this.
>>
>> I'm curious to hear other's thoughts about compatibility. I think we
>> should avoid breaking backwards compatibility if possible. It's common
>> for apps/libs to be pinned on specific Arrow versions, and I worry it'd
>> cause a lot of work for downstream devs to audit their tool suite for
>> full Arrow binary compatibility (and/or require their customers to do
>> the same).
>>
>> Could we detect the 4-byte length, incur a penalty copying the memory to
>> an aligned buffer, then continue consuming the stream? (It's probably
>> fine if we only write the 8-byte length, since consumers on older
>> versions of Arrow could slice from the 4th byte before passing a buffer
>> to the reader).
>>
>> I've always understood the metadata to be a few dozen/hundred KB, a
>> small percentage of the total message size. I could be underestimating
>> the ratios though -- is it common to have tables w/ 1000+ columns? I've
>> seen a few reports like that in cuDF, but I'm curious to hear
>> Jacques'/Dremio's experience too.
>>
>> If copying is feasible, it doesn't seem so bad a trade-off to maintain
>> backwards-compatibility. As libraries and consumers upgrade their Arrow
>> dependencies, the 4-byte length will be less and less common, and
>> they'll be less likely to pay the cost.
>>
>>
>>
>> On 7/23/19 2:22 AM, Uwe L. Korn wrote:
>>> It is also a good way to test the change in public. We don't want to
>>> adjust something like this anymore in a 1.0.0 release. Already doing this
>>> in 0.15.0 and then maybe doing adjustments due to issues that appear "in
>>> the wild" is psychologically the easier way. There is a lot of thinking of
>>> users bound with the magic 1.0, thus I would plan to minimize what is
>>> changed between 1.0 and pre-1.0. This also should save us maintainers some
>>> time as I would expect different behaviour in bug reports between 1.0 and
>>> pre-1.0 issues.
>>> Uwe
>>>
>>> On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
>>>> I think the main reason to do a release before 1.0.0 is if we want to make
>>>> the change that would give a good error message for forward incompatibility
>>>> (I think this could be done as 0.14.2 since it would just be clarifying an
>>>> error message).  Otherwise, I think including it in 1.0.0 would be fine
>>>> (its still not clear to me if there is consensus to fix the issue).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>>
>>>> On Monday, July 22, 2019, Wes McKinney <we...@gmail.com> wrote:
>>>>
>>>>> I'd be satisfied with fixing the Flatbuffer alignment issue either in
>>>>> a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
>>>>> 0.15.0 with this change sooner rather than later might be prudent.
>>>>>
>>>>> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou <an...@python.org>
>>>>> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Recently we've discussed breaking the IPC format to fix a long-standing
>>>>>> alignment issue.  See this discussion:
>>>>>> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
>>>>>>
>>>>>> Should we first do a 0.15.0 in order to get those format fixes right?
>>>>>> Once that is fine and settled we can move to the 1.0.0 release?
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
>>
>>

Re: [Discuss] Do a 0.15.0 release before 1.0.0?

Posted by Micah Kornfield <em...@gmail.com>.

>
> Could we detect the 4-byte length, incur a penalty copying the memory to
> an aligned buffer, then continue consuming the stream?

I think that is the plan (or at least would be my plan) if we go ahead with
the change.



> (It's probably
> fine if we only write the 8-byte length, since consumers on older
> versions of Arrow could slice from the 4th byte before passing a buffer
> to the reader).

I'm not sure I understand this suggestion:
1.  Wouldn't this cause old readers to miss the last 4 bytes of the buffer
(and provide meaningless bytes at the beginning)?
2.  The current proposal on the other thread is to have the pattern be
<0xffffffff><buffer length><buffer data>
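
For reference, a minimal sketch of that pattern and of what an old reader
would decode from it. It assumes a little-endian int32 length and leaves out
any padding; frame_message and legacy_length are hypothetical names used only
for illustration.

```python
import struct


def frame_message(metadata: bytes) -> bytes:
    """Frame metadata as <0xffffffff><int32 length><metadata>."""
    return b"\xff\xff\xff\xff" + struct.pack("<i", len(metadata)) + metadata


def legacy_length(framed: bytes) -> int:
    """Decode the length the way an old reader would: int32 at offset 0."""
    (length,) = struct.unpack_from("<i", framed, 0)
    return length


framed = frame_message(b"schema-bytes")
print(legacy_length(framed))      # -1: the marker, not a usable length
print(legacy_length(framed[4:]))  # 12: skipping the marker restores the old layout
```

Under this pattern, slicing off the first 4 bytes leaves the old
<length><data> layout intact, so no trailing bytes are lost.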

Thanks,
Micah

On Tue, Jul 23, 2019 at 11:43 AM Paul Taylor <pt...@gmail.com>
wrote:

> +1 for a 0.15.0 before 1.0 if we go ahead with this.
>
> I'm curious to hear other's thoughts about compatibility. I think we
> should avoid breaking backwards compatibility if possible. It's common
> for apps/libs to be pinned on specific Arrow versions, and I worry it'd
> cause a lot of work for downstream devs to audit their tool suite for
> full Arrow binary compatibility (and/or require their customers to do
> the same).
>
> Could we detect the 4-byte length, incur a penalty copying the memory to
> an aligned buffer, then continue consuming the stream? (It's probably
> fine if we only write the 8-byte length, since consumers on older
> versions of Arrow could slice from the 4th byte before passing a buffer
> to the reader).
>
> I've always understood the metadata to be a few dozen/hundred KB, a
> small percentage of the total message size. I could be underestimating
> the ratios though -- is it common to have tables w/ 1000+ columns? I've
> seen a few reports like that in cuDF, but I'm curious to hear
> Jacques'/Dremio's experience too.
>
> If copying is feasible, it doesn't seem so bad a trade-off to maintain
> backwards-compatibility. As libraries and consumers upgrade their Arrow
> dependencies, the 4-byte length will be less and less common, and
> they'll be less likely to pay the cost.
>
>
>
> On 7/23/19 2:22 AM, Uwe L. Korn wrote:
> > It is also a good way to test the change in public. We don't want to
> > adjust something like this anymore in a 1.0.0 release. Already doing this
> > in 0.15.0 and then maybe doing adjustments due to issues that appear "in
> > the wild" is psychologically the easier way. There is a lot of thinking of
> > users bound with the magic 1.0, thus I would plan to minimize what is
> > changed between 1.0 and pre-1.0. This also should save us maintainers some
> > time as I would expect different behaviour in bug reports between 1.0 and
> > pre-1.0 issues.
> >
> > Uwe
> >
> > On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
> >> I think the main reason to do a release before 1.0.0 is if we want to make
> >> the change that would give a good error message for forward incompatibility
> >> (I think this could be done as 0.14.2 since it would just be clarifying an
> >> error message).  Otherwise, I think including it in 1.0.0 would be fine
> >> (its still not clear to me if there is consensus to fix the issue).
> >>
> >> Thanks,
> >> Micah
> >>
> >>
> >> On Monday, July 22, 2019, Wes McKinney <we...@gmail.com> wrote:
> >>
> >>> I'd be satisfied with fixing the Flatbuffer alignment issue either in
> >>> a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> >>> 0.15.0 with this change sooner rather than later might be prudent.
> >>>
> >>> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou <an...@python.org>
> >>> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Recently we've discussed breaking the IPC format to fix a long-standing
> >>>> alignment issue.  See this discussion:
> >>>> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> >>>>
> >>>> Should we first do a 0.15.0 in order to get those format fixes right?
> >>>> Once that is fine and settled we can move to the 1.0.0 release?
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
>
>
>

Re: [Discuss] Do a 0.15.0 release before 1.0.0?

Posted by Paul Taylor <pt...@gmail.com>.

+1 for a 0.15.0 before 1.0 if we go ahead with this.

I'm curious to hear others' thoughts about compatibility. I think we
should avoid breaking backwards compatibility if possible. It's common 
for apps/libs to be pinned on specific Arrow versions, and I worry it'd 
cause a lot of work for downstream devs to audit their tool suite for 
full Arrow binary compatibility (and/or require their customers to do 
the same).

Could we detect the 4-byte length, incur a penalty copying the memory to 
an aligned buffer, then continue consuming the stream? (It's probably 
fine if we only write the 8-byte length, since consumers on older  
versions of Arrow could slice from the 4th byte before passing a buffer 
to the reader).

I've always understood the metadata to be a few dozen/hundred KB, a 
small percentage of the total message size. I could be underestimating 
the ratios though -- is it common to have tables w/ 1000+ columns? I've 
seen a few reports like that in cuDF, but I'm curious to hear 
Jacques'/Dremio's experience too.

If copying is feasible, it doesn't seem so bad a trade-off to maintain 
backwards-compatibility. As libraries and consumers upgrade their Arrow 
dependencies, the 4-byte length will be less and less common, and 
they'll be less likely to pay the cost.
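
As a rough illustration of the copy trade-off, here is a minimal sketch that
only copies when the prefix leaves the metadata misaligned. It assumes the
metadata wants 8-byte alignment and that the message itself starts on an
8-byte boundary; metadata_payload is a hypothetical helper, and the plain
bytes() copy stands in for a properly aligned allocation.

```python
ALIGNMENT = 8  # assumed alignment requirement for the metadata


def metadata_payload(message: bytes, prefix_size: int):
    """Return (payload, copied) for a message with the given prefix size.

    Assumes the message starts on an 8-byte boundary, so the prefix size
    alone decides whether the payload that follows is aligned.
    """
    view = memoryview(message)[prefix_size:]
    if prefix_size % ALIGNMENT == 0:
        # 8-byte prefix: the payload is already aligned, no copy needed.
        return view, False
    # 4-byte prefix: pay for one copy into a fresh buffer. A real reader
    # would allocate that buffer with explicit 8-byte alignment.
    return memoryview(bytes(view)), True
```

Only messages with the old 4-byte prefix take the copying branch, which is
why the cost should fade as writers upgrade.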

On 7/23/19 2:22 AM, Uwe L. Korn wrote:
> It is also a good way to test the change in public. We don't want to adjust something like this anymore in a 1.0.0 release. Already doing this in 0.15.0 and then maybe doing adjustments due to issues that appear "in the wild" is psychologically the easier way. There is a lot of thinking of users bound with the magic 1.0, thus I would plan to minimize what is changed between 1.0 and pre-1.0. This also should save us maintainers some time as I would expect different behaviour in bug reports between 1.0 and pre-1.0 issues.
>
> Uwe
>
> On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
>> I think the main reason to do a release before 1.0.0 is if we want to make
>> the change that would give a good error message for forward incompatibility
>> (I think this could be done as 0.14.2 since it would just be clarifying an
>> error message).  Otherwise, I think including it in 1.0.0 would be fine
>> (its still not clear to me if there is consensus to fix the issue).
>>
>> Thanks,
>> Micah
>>
>>
>> On Monday, July 22, 2019, Wes McKinney <we...@gmail.com> wrote:
>>
>>> I'd be satisfied with fixing the Flatbuffer alignment issue either in
>>> a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
>>> 0.15.0 with this change sooner rather than later might be prudent.
>>>
>>> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou <an...@python.org>
>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Recently we've discussed breaking the IPC format to fix a long-standing
>>>> alignment issue.  See this discussion:
>>>> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
>>>>
>>>> Should we first do a 0.15.0 in order to get those format fixes right?
>>>> Once that is fine and settled we can move to the 1.0.0 release?
>>>>
>>>> Regards
>>>>
>>>> Antoine.

Re: [Discuss] Do a 0.15.0 release before 1.0.0?

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

It is also a good way to test the change in public. We don't want to still
be adjusting something like this in a 1.0.0 release. Doing it in 0.15.0 and
then possibly making adjustments for issues that appear "in the wild" is
psychologically the easier way. Users attach a lot of significance to the
magic 1.0, so I would plan to minimize what changes between pre-1.0 and 1.0.
This should also save us maintainers some time, as I would expect bug
reports for 1.0 to differ from those for pre-1.0 releases.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
> I think the main reason to do a release before 1.0.0 is if we want to make
> the change that would give a good error message for forward incompatibility
> (I think this could be done as 0.14.2 since it would just be clarifying an
> error message).  Otherwise, I think including it in 1.0.0 would be fine
> (its still not clear to me if there is consensus to fix the issue).
> 
> Thanks,
> Micah
> 
> 
> On Monday, July 22, 2019, Wes McKinney <we...@gmail.com> wrote:
> 
> > I'd be satisfied with fixing the Flatbuffer alignment issue either in
> > a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> > 0.15.0 with this change sooner rather than later might be prudent.
> >
> > On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou <an...@python.org>
> > wrote:
> > >
> > >
> > > Hello,
> > >
> > > Recently we've discussed breaking the IPC format to fix a long-standing
> > > alignment issue.  See this discussion:
> > >
> > > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> > >
> > > Should we first do a 0.15.0 in order to get those format fixes right?
> > > Once that is fine and settled we can move to the 1.0.0 release?
> > >
> > > Regards
> > >
> > > Antoine.
> >
>

Re: [Discuss] Do a 0.15.0 release before 1.0.0?

Posted by Micah Kornfield <em...@gmail.com>.

I think the main reason to do a release before 1.0.0 is if we want to make
the change that would give a good error message for forward incompatibility
(I think this could be done as 0.14.2 since it would just be clarifying an
error message).  Otherwise, I think including it in 1.0.0 would be fine
(it's still not clear to me if there is consensus to fix the issue).

Thanks,
Micah

On Monday, July 22, 2019, Wes McKinney <we...@gmail.com> wrote:

> I'd be satisfied with fixing the Flatbuffer alignment issue either in
> a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> 0.15.0 with this change sooner rather than later might be prudent.
>
> On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou <an...@python.org>
> wrote:
> >
> >
> > Hello,
> >
> > Recently we've discussed breaking the IPC format to fix a long-standing
> > alignment issue.  See this discussion:
> >
> > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> >
> > Should we first do a 0.15.0 in order to get those format fixes right?
> > Once that is fine and settled we can move to the 1.0.0 release?
> >
> > Regards
> >
> > Antoine.
>

Re: [Discuss] Do a 0.15.0 release before 1.0.0?

Posted by Wes McKinney <we...@gmail.com>.

I'd be satisfied with fixing the Flatbuffer alignment issue either in
a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
0.15.0 with this change sooner rather than later might be prudent.

On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hello,
>
> Recently we've discussed breaking the IPC format to fix a long-standing
> alignment issue.  See this discussion:
> https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
>
> Should we first do a 0.15.0 in order to get those format fixes right?
> Once that is fine and settled we can move to the 1.0.0 release?
>
> Regards
>
> Antoine.