You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by William Butler <wa...@google.com.INVALID> on 2022/05/10 21:21:23 UTC

Forward & Backwards Compatibility

We recently upgraded our version of Arrow and came across an interesting
issue. The version that we had been using was pre-logical types and the new
version is post-logical types. It turns out some customers have files with
invalid logical types. The old version, being oblivious to logical types,
reads them fine whereas the new version, which can see logical types,
rejects them. Parquet-mr seems a lot more forgiving of these situations. It
prints a warning and uses the converted type. What is the correct behavior
here? Should we make Parquet C++ more like the Java version?

Similarly, while I haven't tried this, it looks like Parquet C++ will not
accept unrecognized logical types from Thrift. It would seem to me that in
the case of a new, unimplemented, logical type, reverting back to the
converted or unannotated physical type might be a good idea in some cases.

Re: Forward & Backwards Compatibility

Posted by "Miller, Tim" <th...@amazon.com.INVALID>.
You might also consider looking for fallback options. For instance, in https://github.com/apache/parquet-mr/pull/957, I figured out a good spot to catch the exception and then fall-back to a converted schema.

On 5/29/22, 1:53 PM, "Micah Kornfield" <em...@gmail.com> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    I'd be in favor of maybe adding a flag that allows this type of schema
    which is by default false.  Another option is if we can identify the writer
    of these files, we can make an exception specifically for those versions?

    On Wed, May 18, 2022 at 4:02 PM William Butler <wa...@google.com.invalid>
    wrote:

    > >
    > > Well, why is UNKNOWN used here? This seems like a bug in the producer:
    >
    > It is a bug in the producer, see the JIRA, but now there are a lot of these
    > files out there in the wild. Adding an explicit workaround might be
    > reasonable given the pervasiveness.
    >
    > On Mon, May 16, 2022 at 4:57 AM Antoine Pitrou <an...@python.org> wrote:
    >
    > > On Thu, 12 May 2022 09:46:57 -0700
    > > William Butler <wa...@google.com.INVALID>
    > > wrote:
    > > >
    > > > From the JIRA, the converted type looks something like
    > > >
    > > >   required group FeatureAmounts (MAP) {
    > > >     repeated group map (MAP_KEY_VALUE) {
    > > >       required binary key (STRING);
    > > >       required binary key (STRING);
    > > >     }
    > > >   }
    > > >
    > > >
    > > > but the logical type looks like
    > > >
    > > >   required group FeatureAmounts (MAP) {
    > > >     repeated group map (UNKNOWN) {
    > > >       required binary key (STRING);
    > > >       required binary key (STRING);
    > > >     }
    > > >   }
    > > >
    > > > Parquet C++ does not like that the UNKNOWN/NullLogicalType is being
    > used
    > > in
    > > > the groups and rejects the schema with an exception.
    > >
    > > Well, why is UNKNOWN used here? This seems like a bug in the producer:
    > > if MAP_KEY_VALUE does not have an equivalent logical type, then no
    > > logical type annotation should be produced, instead of the "UNKNOWN"
    > > logical type annotation which means that all values are null and the
    > > "real" type of the data is therefore lost.
    > >
    > > (I understand that this is probably due to confusion from the misnaming
    > > of the "UNKNOWN" logical type, which would have been more appropriately
    > > named "ALWAYS_NULL" or similar)
    > >
    > > > The second example involves an INT64 column with a TIMESTAMP_MILLIS
    > > > converted type but a String logical type. Parquet-mr in this example
    > > > fallbacks to the timestamp converted type whereas Parquet C++ throws an
    > > > exception.
    > >
    > > Well, I don't know why a String logical type should be accepted for
    > > integer columns with a timestamp converted type.  The fact that
    > > parquet-mr accepts it sounds like a bug in parquet-mr, IMO.
    > >
    > > Regards
    > >
    > > Antoine.
    > >
    > >
    > >
    >


Re: Forward & Backwards Compatibility

Posted by Micah Kornfield <em...@gmail.com>.
I'd be in favor of maybe adding a flag that allows this type of schema
which is by default false.  Another option is if we can identify the writer
of these files, we can make an exception specifically for those versions?

On Wed, May 18, 2022 at 4:02 PM William Butler <wa...@google.com.invalid>
wrote:

> >
> > Well, why is UNKNOWN used here? This seems like a bug in the producer:
>
> It is a bug in the producer, see the JIRA, but now there are a lot of these
> files out there in the wild. Adding an explicit workaround might be
> reasonable given the pervasiveness.
>
> On Mon, May 16, 2022 at 4:57 AM Antoine Pitrou <an...@python.org> wrote:
>
> > On Thu, 12 May 2022 09:46:57 -0700
> > William Butler <wa...@google.com.INVALID>
> > wrote:
> > >
> > > From the JIRA, the converted type looks something like
> > >
> > >   required group FeatureAmounts (MAP) {
> > >     repeated group map (MAP_KEY_VALUE) {
> > >       required binary key (STRING);
> > >       required binary key (STRING);
> > >     }
> > >   }
> > >
> > >
> > > but the logical type looks like
> > >
> > >   required group FeatureAmounts (MAP) {
> > >     repeated group map (UNKNOWN) {
> > >       required binary key (STRING);
> > >       required binary key (STRING);
> > >     }
> > >   }
> > >
> > > Parquet C++ does not like that the UNKNOWN/NullLogicalType is being
> used
> > in
> > > the groups and rejects the schema with an exception.
> >
> > Well, why is UNKNOWN used here? This seems like a bug in the producer:
> > if MAP_KEY_VALUE does not have an equivalent logical type, then no
> > logical type annotation should be produced, instead of the "UNKNOWN"
> > logical type annotation which means that all values are null and the
> > "real" type of the data is therefore lost.
> >
> > (I understand that this is probably due to confusion from the misnaming
> > of the "UNKNOWN" logical type, which would have been more appropriately
> > named "ALWAYS_NULL" or similar)
> >
> > > The second example involves an INT64 column with a TIMESTAMP_MILLIS
> > > converted type but a String logical type. Parquet-mr in this example
> > > fallbacks to the timestamp converted type whereas Parquet C++ throws an
> > > exception.
> >
> > Well, I don't know why a String logical type should be accepted for
> > integer columns with a timestamp converted type.  The fact that
> > parquet-mr accepts it sounds like a bug in parquet-mr, IMO.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>

Re: Forward & Backwards Compatibility

Posted by William Butler <wa...@google.com.INVALID>.
>
> Well, why is UNKNOWN used here? This seems like a bug in the producer:

It is a bug in the producer, see the JIRA, but now there are a lot of these
files out there in the wild. Adding an explicit workaround might be
reasonable given the pervasiveness.

On Mon, May 16, 2022 at 4:57 AM Antoine Pitrou <an...@python.org> wrote:

> On Thu, 12 May 2022 09:46:57 -0700
> William Butler <wa...@google.com.INVALID>
> wrote:
> >
> > From the JIRA, the converted type looks something like
> >
> >   required group FeatureAmounts (MAP) {
> >     repeated group map (MAP_KEY_VALUE) {
> >       required binary key (STRING);
> >       required binary key (STRING);
> >     }
> >   }
> >
> >
> > but the logical type looks like
> >
> >   required group FeatureAmounts (MAP) {
> >     repeated group map (UNKNOWN) {
> >       required binary key (STRING);
> >       required binary key (STRING);
> >     }
> >   }
> >
> > Parquet C++ does not like that the UNKNOWN/NullLogicalType is being used
> in
> > the groups and rejects the schema with an exception.
>
> Well, why is UNKNOWN used here? This seems like a bug in the producer:
> if MAP_KEY_VALUE does not have an equivalent logical type, then no
> logical type annotation should be produced, instead of the "UNKNOWN"
> logical type annotation which means that all values are null and the
> "real" type of the data is therefore lost.
>
> (I understand that this is probably due to confusion from the misnaming
> of the "UNKNOWN" logical type, which would have been more appropriately
> named "ALWAYS_NULL" or similar)
>
> > The second example involves an INT64 column with a TIMESTAMP_MILLIS
> > converted type but a String logical type. Parquet-mr in this example
> > fallbacks to the timestamp converted type whereas Parquet C++ throws an
> > exception.
>
> Well, I don't know why a String logical type should be accepted for
> integer columns with a timestamp converted type.  The fact that
> parquet-mr accepts it sounds like a bug in parquet-mr, IMO.
>
> Regards
>
> Antoine.
>
>
>

Re: Forward & Backwards Compatibility

Posted by Antoine Pitrou <an...@python.org>.
On Thu, 12 May 2022 09:46:57 -0700
William Butler <wa...@google.com.INVALID>
wrote:
> 
> From the JIRA, the converted type looks something like
> 
>   required group FeatureAmounts (MAP) {
>     repeated group map (MAP_KEY_VALUE) {
>       required binary key (STRING);
>       required binary key (STRING);
>     }
>   }
> 
> 
> but the logical type looks like
> 
>   required group FeatureAmounts (MAP) {
>     repeated group map (UNKNOWN) {
>       required binary key (STRING);
>       required binary key (STRING);
>     }
>   }
> 
> Parquet C++ does not like that the UNKNOWN/NullLogicalType is being used in
> the groups and rejects the schema with an exception.

Well, why is UNKNOWN used here? This seems like a bug in the producer:
if MAP_KEY_VALUE does not have an equivalent logical type, then no
logical type annotation should be produced, instead of the "UNKNOWN"
logical type annotation which means that all values are null and the
"real" type of the data is therefore lost.

(I understand that this is probably due to confusion from the misnaming
of the "UNKNOWN" logical type, which would have been more appropriately
named "ALWAYS_NULL" or similar)

> The second example involves an INT64 column with a TIMESTAMP_MILLIS
> converted type but a String logical type. Parquet-mr in this example
> fallbacks to the timestamp converted type whereas Parquet C++ throws an
> exception.

Well, I don't know why a String logical type should be accepted for
integer columns with a timestamp converted type.  The fact that
parquet-mr accepts it sounds like a bug in parquet-mr, IMO.

Regards

Antoine.



Re: Forward & Backwards Compatibility

Posted by William Butler <wa...@google.com.INVALID>.
On Thu, May 12, 2022 at 6:39 AM Antoine Pitrou <an...@python.org> wrote:

Can you give an example of what an invalid logical type look like?

We are seeing two examples.
-- https://issues.apache.org/jira/browse/PARQUET-1879

From the JIRA, the converted type looks something like

  required group FeatureAmounts (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (STRING);
      required binary key (STRING);
    }
  }


but the logical type looks like

  required group FeatureAmounts (MAP) {
    repeated group map (UNKNOWN) {
      required binary key (STRING);
      required binary key (STRING);
    }
  }


Parquet C++ does not like that the UNKNOWN/NullLogicalType is being used in
the groups and rejects the schema with an exception.

The second example involves an INT64 column with a TIMESTAMP_MILLIS
converted type but a String logical type. Parquet-mr in this example
fallbacks to the timestamp converted type whereas Parquet C++ throws an
exception.

 Can you point to the relevant piece of code?


https://github.com/apache/arrow/blob/90aac16761b7dbf5fe931bc8837cad5116939270/cpp/src/parquet/types.cc#L428

Re: Forward & Backwards Compatibility

Posted by Antoine Pitrou <an...@python.org>.
On Tue, 10 May 2022 14:21:23 -0700
William Butler <wa...@google.com.INVALID>
wrote:
> We recently upgraded our version of Arrow and came across an interesting
> issue. The version that we had been using was pre-logical types and the new
> version is post-logical types. It turns out some customers have files with
> invalid logical types. The old version, being oblivious to logical types,
> reads them fine whereas the new version, which can see logical types,
> rejects them. Parquet-mr seems a lot more forgiving of these situations. It
> prints a warning and uses the converted type. What is the correct behavior
> here? Should we make Parquet C++ more like the Java version?

Can you give an example of what an invalid logical type look like?

Generally, I don't think errors should pass silently, especially when
several implementations claim to support the same standard. The days
of lax HTML in the late 90s/early 2000s, and the resulting
compatibility hell, come to mind.

> Similarly, while I haven't tried this, it looks like Parquet C++ will not
> accept unrecognized logical types from Thrift.

Can you point to the relevant piece of code?

Regards

Antoine.