Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2020/05/01 06:11:17 UTC

Re: parquet 2 incompatibility between 0.16 and 0.17?

Sorry I didn't get to this, will try again tomorrow.

On Thu, Apr 30, 2020 at 11:09 AM Wes McKinney <we...@gmail.com> wrote:

> I'd be fine with a patch release addressing this so long as it's
> binary-only (to save us all time).
>
> On Thu, Apr 30, 2020, 12:30 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> This sounds like something we might want to do and issue a patch release.
>> It seems bad to default to a non-production version?
>>
>> I can try to take a look tonight at a patch if no one gets to it before.
>>
>> Thanks,
>> Micah
>>
>> On Wednesday, April 29, 2020, Wes McKinney <we...@gmail.com> wrote:
>>
>> > On Wed, Apr 29, 2020 at 6:15 PM Pierre Belzile <pierre.belzile@gmail.com> wrote:
>> > >
>> > > Wes,
>> > >
>> > > You used the words "forward compatible". Does this mean that 0.17 is able
>> > > to decode 0.16 datapagev2?
>> >
>> > 0.16 doesn't write DataPageV2 at all; the version flag only determines
>> > the type casting and metadata behavior I indicated in my email. The
>> > changes in
>> >
>> > https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162a9da588516
>> >
>> > enabled the use of DataPageV2 and I/we didn't think about the forward
>> > compatibility issue (version=2.0 files written in 0.17.0 being
>> > unreadable in 0.16.0). We might actually want to revert this (just the
>> > toggle between DataPageV1/V2, not the whole patch).
>> >
>> >
>> >
>> > > Crossing my fingers...
>> > >
>> > > Pierre
>> > >
>> > > On Wed, Apr 29, 2020 at 7:05 PM Wes McKinney <we...@gmail.com> wrote:
>> > >
>> > > > Ah, so we have a slight mess on our hands because the patch for
>> > > > PARQUET-458 enabled the use of DataPageV2, which is not forward
>> > > > compatible with older versions because the implementation was fixed
>> > > > (see the JIRA for more details)
>> > > >
>> > > >
>> > > >
>> > > > https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162a9da588516
>> > > >
>> > > > Unfortunately, in Python the version='1.0' / version='2.0' flag is
>> > > > being used for two different purposes:
>> > > >
>> > > > * Expanded ConvertedType / LogicalType metadata, like unsigned types
>> > > > and nanosecond timestamps
>> > > > * DataPageV1 vs. DataPageV2 data pages
>> > > >
>> > > > I think we should separate these concepts and instead have a
>> > > > "compatibility mode" option regarding the ConvertedType/LogicalType
>> > > > annotations and the behavior around conversions when writing unsigned
>> > > > integers, nanosecond timestamps, and other types to Parquet V1 (which
>> > > > is the only "production" Parquet format).
>> > > >
>> > > > On Wed, Apr 29, 2020 at 5:56 PM Pierre Belzile <pierre.belzile@gmail.com> wrote:
>> > > > >
>> > > > > Hi,
>> > > > >
>> > > > > We've been using the parquet 2 format (mostly because of nanosecond
>> > > > > resolution). I'm getting crashes in the C++ parquet decoder, arrow 0.16,
>> > > > > when decoding a parquet 2 file created with pyarrow 0.17.0. Is this
>> > > > > expected? Would 0.17 be able to decode a file written by 0.16?
>> > > > >
>> > > > > If that's not expected, I can put the debugger on it and see what is
>> > > > > happening. I suspect it's with string fields (regular, not large string).
>> > > > >
>> > > > > Cheers, Pierre
>> > > >
>> >
>>
>

Re: parquet 2 incompatibility between 0.16 and 0.17?

Posted by Micah Kornfield <em...@gmail.com>.
I put up an initial PR to split the flags.
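
To make the idea of splitting the flags concrete, here is a rough sketch in plain Python. It is only a model of the behavior discussed in this thread, not the pyarrow API and not the contents of the PR: the names WriterOptions, coupled_options, split_options, expanded_types and data_page_version are all hypothetical. The first function mirrors the 0.17.0 situation, where one version flag drives both the type/metadata behavior and the data page format; the second keeps them independent and leaves data pages at V1 unless explicitly asked otherwise.

```python
from dataclasses import dataclass

# Hypothetical model for illustration only; none of these names are pyarrow API.
_VALID = ("1.0", "2.0")


@dataclass(frozen=True)
class WriterOptions:
    expanded_types: bool    # keep unsigned ints / nanosecond timestamps as-is
    data_page_version: str  # "1.0" -> DataPageV1, "2.0" -> DataPageV2


def coupled_options(version: str) -> WriterOptions:
    """One flag controls both behaviors (the situation described for 0.17.0)."""
    if version not in _VALID:
        raise ValueError(f"unsupported version: {version!r}")
    is_v2 = version == "2.0"
    return WriterOptions(
        expanded_types=is_v2,
        data_page_version="2.0" if is_v2 else "1.0",
    )


def split_options(version: str, data_page_version: str = "1.0") -> WriterOptions:
    """Two independent flags; data pages stay V1 unless requested explicitly."""
    if version not in _VALID or data_page_version not in _VALID:
        raise ValueError("version and data_page_version must be '1.0' or '2.0'")
    return WriterOptions(
        expanded_types=(version == "2.0"),
        data_page_version=data_page_version,
    )
```

With the split, asking for version "2.0" would still give nanosecond timestamps and unsigned types, but the file would keep DataPageV1 pages, so an older reader such as 0.16 would not hit the DataPageV2 problem reported above.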
