Posted to dev@arrow.apache.org by Pierre Belzile <pi...@gmail.com> on 2020/04/29 22:56:10 UTC

parquet 2 incompatibility between 0.16 and 0.17?

Hi,

We've been using the parquet 2 format (mostly because of nanosecond
resolution). I'm getting crashes in the C++ parquet decoder, arrow 0.16,
when decoding a parquet 2 file created with pyarrow 0.17.0. Is this
expected? Would 0.17 be able to decode a file written by 0.16?

If that's not expected, I can put the debugger on it and see what is
happening. I suspect it's with string fields (regular, not large string).

Cheers, Pierre

Re: parquet 2 incompatibility between 0.16 and 0.17?

Posted by Micah Kornfield <em...@gmail.com>.
I put up an initial PR to split the flags.


Re: parquet 2 incompatibility between 0.16 and 0.17?

Posted by Micah Kornfield <em...@gmail.com>.
Sorry I didn't get to this, will try again tomorrow.


Re: parquet 2 incompatibility between 0.16 and 0.17?

Posted by Wes McKinney <we...@gmail.com>.
I'd be fine with a patch release addressing this so long as it's
binary-only (to save us all time).


Re: parquet 2 incompatibility between 0.16 and 0.17?

Posted by Micah Kornfield <em...@gmail.com>.
This sounds like something we might want to do and issue a patch release.
It seems bad to default to a non-production version?

I can try to take a look tonight at a patch if no one gets to it before.

Thanks,
Micah


Re: parquet 2 incompatibility between 0.16 and 0.17?

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Apr 29, 2020 at 6:15 PM Pierre Belzile <pi...@gmail.com> wrote:
>
> Wes,
>
> You used the words "forward compatible". Does this mean that 0.17 is able
> to decode 0.16 datapagev2?

0.16 doesn't write DataPageV2 at all; the version flag only determines
the type casting and metadata behavior I indicated in my email. The
changes in

https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162a9da588516

enabled the use of DataPageV2 and I/we didn't think about the forward
compatibility issue (version=2.0 files written in 0.17.0 being
unreadable in 0.16.0). We might actually want to revert this (just the
toggle between DataPageV1/V2, not the whole patch).




Re: parquet 2 incompatibility between 0.16 and 0.17?

Posted by Pierre Belzile <pi...@gmail.com>.
Wes,

You used the words "forward compatible". Does this mean that 0.17 is able
to decode 0.16 datapagev2?

Crossing my fingers...

Pierre


Re: parquet 2 incompatibility between 0.16 and 0.17?

Posted by Wes McKinney <we...@gmail.com>.
Ah, so we have a slight mess on our hands because the patch for
PARQUET-458 enabled the use of DataPageV2, which is not forward
compatible with older versions because the implementation was fixed
(see the JIRA for more details).

https://github.com/apache/arrow/commit/809d40ab9518bd254705f35af01162a9da588516

Unfortunately, in Python the version='1.0' / version='2.0' flag is
being used for two different purposes:

* Expanded ConvertedType / LogicalType metadata, like unsigned types
and nanosecond timestamps
* DataPageV1 vs. DataPageV2 data pages

I think we should separate these concepts and instead have a
"compatibility mode" option regarding the ConvertedType/LogicalType
annotations and the behavior around conversions when writing unsigned
integers, nanosecond timestamps, and other types to Parquet V1 (which
is the only "production" Parquet format).
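
To make the proposal above concrete, here is a minimal sketch of what
separating the two concerns could look like. Everything in it is
hypothetical and for illustration only: WriterOptions,
from_legacy_version and readable_by_older_reader are made-up names, not
pyarrow's API, and the "older reader" rule simply encodes the statement
in this thread that DataPageV2 files written by 0.17.0 are unreadable
in 0.16.0.

```python
from dataclasses import dataclass


# Hypothetical model; these names are NOT part of pyarrow.
@dataclass(frozen=True)
class WriterOptions:
    # Concern 1: expanded ConvertedType / LogicalType metadata
    # (unsigned types, nanosecond timestamps).
    extended_types: bool = False
    # Concern 2: data page layout, "1.0" (DataPageV1) or "2.0" (DataPageV2).
    data_page_version: str = "1.0"


def from_legacy_version(version: str) -> WriterOptions:
    """Model the coupled flag: version='2.0' switches both concerns at once."""
    if version not in ("1.0", "2.0"):
        raise ValueError(f"unsupported version: {version!r}")
    is_v2 = version == "2.0"
    return WriterOptions(
        extended_types=is_v2,
        data_page_version="2.0" if is_v2 else "1.0",
    )


def readable_by_older_reader(options: WriterOptions) -> bool:
    """Assumption from the thread: an older reader cannot decode DataPageV2."""
    return options.data_page_version == "1.0"


# Coupled flag: asking for nanosecond timestamps also selects DataPageV2.
coupled = from_legacy_version("2.0")

# Split options: extended types, but still DataPageV1 pages.
split = WriterOptions(extended_types=True, data_page_version="1.0")

print(coupled, readable_by_older_reader(coupled))
print(split, readable_by_older_reader(split))
```

With the options split, a user who only wants nanosecond resolution
never has to opt in to DataPageV2, which is the case reported at the
start of this thread.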
