Posted to dev@arrow.apache.org by Joris Van den Bossche <jo...@gmail.com> on 2020/12/15 16:24:05 UTC

Should we default to write parquet format version 2.0? (not data page version 2.0)

Hi all,

(This is somewhat related to the discussion on the parquet mailing list
about the compatibility of different features in the format and which
format version they belong to, which triggered
https://github.com/apache/parquet-format/pull/164. But it is mainly
related in the sense that it is rather unclear to me which features are
enabled when, also for the column types.)

Currently, when writing parquet files with Arrow (parquet-cpp), we default
to parquet format "1.0".
This means we don't use ConvertedTypes or LogicalTypes introduced in format
"2.0"+.
For example, this means we can (by default) only write int32 and int64
integer types, and not any of the other signed and unsigned integer types.
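
To make this concrete, a minimal sketch with pyarrow (the file name is just
illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# With the default version="1.0", a uint32 column gets no type annotation
# and is widened to the int64 physical type instead.
table = pa.table({"uint32": pa.array([1, 2, 3], type="uint32")})
pq.write_table(table, "ints_v1.parquet", version="1.0")
print(pq.read_metadata("ints_v1.parquet").schema)
# -> optional int64 field_id=... uint32;   (no annotation)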

But, most of the additional ConvertedTypes that were not present in
parquet-format 1.0 (eg the different signed/unsigned integer types,
timestamp, ..) were introduced in parquet-format 2.2 (
https://github.com/apache/parquet-format/pull/3,
https://issues.apache.org/jira/browse/PARQUET-12) almost 7 years ago.
The LogicalTypes are certainly more recent, but with the current options
of "1.0" or "2.0" when writing, you can either opt in to all new features
of the various 2.x releases (both ConvertedTypes and LogicalTypes), or
none of them (e.g. we don't have the option to say `version="2.2"`).

When we say "version 2.0 is not yet recommended for production use"
(because many other readers are not yet compatible with it), isn't this
mostly about the *data page* version 2 (which, AFAIU, is separate from the
format version 2.0)?
If so, could we start defaulting to version 2.0 (but still with data page
version 1.0), or do other parquet readers actually not yet support the
ConvertedTypes introduced 7 years ago?

Best,
Joris

Re: Should we default to write parquet format version 2.0? (not data page version 2.0)

Posted by Joris Van den Bossche <jo...@gmail.com>.
Coming back to this topic, there is one additional aspect that might
warrant some consideration: forward compatibility for nanosecond timestamps.

If we were to switch to a default of `version="2.0"`, that would mean we
start using the LogicalType with nanosecond time unit. Since this has no
equivalent legacy ConvertedType, we would no longer add such an annotation
(by default, we currently convert nanoseconds to microseconds, which get a
TIMESTAMP_MICROS legacy converted type).
As a consequence, any reader that does not yet support the nanosecond
LogicalType will read the data as int64 instead of timestamps.
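
To make the difference concrete, a minimal sketch (file names are just
illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts": pa.array([1, 2, 3], type=pa.timestamp("ns"))})

# version="1.0": nanoseconds are coerced to microseconds, which carry the
# legacy TIMESTAMP_MICROS converted type that older readers understand.
pq.write_table(table, "ts_v1.parquet", version="1.0")
print(pq.read_metadata("ts_v1.parquet").schema)

# version="2.0": nanoseconds are kept and annotated only with the
# Timestamp(NANOS) logical type; a reader without support for it sees a
# plain int64 column.
pq.write_table(table, "ts_v2.parquet", version="2.0")
print(pq.read_metadata("ts_v2.parquet").schema)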

- The nanosecond time unit was introduced in format version 2.6.0 (Sep 27,
2018).
- Given that a large part of our user base writes Parquet from pandas, and
pandas only has nanosecond resolution, those users will all start to write
nanoseconds using the new logical type, without a legacy converted type
equivalent.

I don't really know how widely supported this specific logical type already
is.

I am wondering if our `version="1.0"` / `"2.0"` switch is too crude, and
whether we need finer control like a `version="2.4"` (use logical types,
including integer type annotations other than int32/int64) or a
`version="2.6"` (additionally use the nanosecond time unit). But that might
also be overkill.
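
A hypothetical sketch of what that finer control could look like (note:
`version="2.4"` and `version="2.6"` are NOT accepted values today; this is
just to illustrate the idea):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts": pa.array([1, 2, 3], type=pa.timestamp("ns"))})

# Hypothetical: logical types incl. integer annotations, but nanoseconds
# still coerced to microseconds for forward compatibility.
pq.write_table(table, "test_v24.parquet", version="2.4")

# Hypothetical: additionally keep the nanosecond time unit.
pq.write_table(table, "test_v26.parquet", version="2.6")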

While checking this, I made an overview of which types were introduced in
which parquet format version, in case someone wants to see the details ->
https://nbviewer.jupyter.org/gist/jorisvandenbossche/3cc9942eaffb53564df65395e5656702

Another question: do we write different encodings based on version="1.0"
vs "2.0"?

Joris

On Tue, 15 Dec 2020 at 17:59, Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> On Tue, 15 Dec 2020 at 17:24, Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
>
>>
>> Currently, when writing parquet files with Arrow (parquet-cpp), we
>> default to parquet format "1.0".
>> This means we don't use ConvertedTypes or LogicalTypes introduced in
>> format "2.0"+.
>> For example, this means we can (by default) only write int32 and int64
>> integer types, and not any of the other signed and unsigned integer types.
>>
> I have to correct myself here. Apparently, we DO write version 2.0
> Converted/LogicalTypes by default with `version="1.0"` ...
> But just not for integer types, so I got the wrong impression by looking
> only at ints when writing the previous email.
>
> A small example using Python is at the bottom. From that, it seems that
> we actually do write Converted/LogicalType annotations, even with
> `version="1.0"`, for types like Date and Decimal, which were only added in
> format version 2.x.
> I assume the reason we do that for those types, and not for the integer
> types, is that Date (as int32) or Decimal (as fixed-size binary) can still
> be read "correctly" as the physical type, albeit with information loss.
> The UINT_32 converted type, on the other hand, is written as the INT32
> physical type, so not interpreting the type annotation would lead to
> wrong numbers.
>
> import pyarrow as pa
> import pyarrow.parquet as pq
> import decimal
>
> table = pa.table({
>     "int64": [1, 2, 3],
>     "int32": pa.array([1, 2, 3], type="int32"),
>     "uint32": pa.array([1, 2, 3], type="uint32"),
>     "date": pa.array([1, 2, 3], type=pa.date32()),
>     "timestamp": pa.array([1, 2, 3], type=pa.timestamp("ms", tz="UTC")),
>     "string": pa.array(["a", "b", "c"]),
>     "list": pa.array([[1, 2], [3, 4], [5, 6]]),
>     "decimal": pa.array([decimal.Decimal("1.0"), decimal.Decimal("2.0"),
> decimal.Decimal("3.0")]),
> })
>
> pq.write_table(table, "test_v1.parquet", version="1.0")
> pq.write_table(table, "test_v2.parquet", version="2.0")
>
> In [55]: pq.read_metadata("test_v1.parquet").schema
> Out[55]:
> <pyarrow._parquet.ParquetSchema object at 0x7f839b7602c0>
> required group field_id=0 schema {
>   optional int64 field_id=1 int64;
>   optional int32 field_id=2 int32;
>   optional int64 field_id=3 uint32;
>   optional int32 field_id=4 date (Date);
>   optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true,
> timeUnit=milliseconds, is_from_converted_type=false,
> force_set_converted_type=false));
>   optional binary field_id=6 string (String);
>   optional fixed_len_byte_array(1) field_id=7 decimal
> (Decimal(precision=2, scale=1));
> }
>
> In [56]: pq.read_metadata("test_v2.parquet").schema
> Out[56]:
> <pyarrow._parquet.ParquetSchema object at 0x7f839bc43f80>
> required group field_id=0 schema {
>   optional int64 field_id=1 int64;
>   optional int32 field_id=2 int32;
>   optional int32 field_id=3 uint32 (Int(bitWidth=32, isSigned=false));
>   optional int32 field_id=4 date (Date);
>   optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true,
> timeUnit=milliseconds, is_from_converted_type=false,
> force_set_converted_type=false));
>   optional binary field_id=6 string (String);
>   optional fixed_len_byte_array(1) field_id=7 decimal
> (Decimal(precision=2, scale=1));
> }
>

Re: Should we default to write parquet format version 2.0? (not data page version 2.0)

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Tue, 15 Dec 2020 at 17:24, Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

>
> Currently, when writing parquet files with Arrow (parquet-cpp), we default
> to parquet format "1.0".
> This means we don't use ConvertedTypes or LogicalTypes introduced in
> format "2.0"+.
> For example, this means we can (by default) only write int32 and int64
> integer types, and not any of the other signed and unsigned integer types.
>
I have to correct myself here. Apparently, we DO write version 2.0
Converted/LogicalTypes by default with `version="1.0"` ...
But just not for integer types, so I got the wrong impression by looking
only at ints when writing the previous email.

A small example using Python is at the bottom. From that, it seems that
we actually do write Converted/LogicalType annotations, even with
`version="1.0"`, for types like Date and Decimal, which were only added in
format version 2.x.
I assume the reason we do that for those types, and not for the integer
types, is that Date (as int32) or Decimal (as fixed-size binary) can still
be read "correctly" as the physical type, albeit with information loss.
The UINT_32 converted type, on the other hand, is written as the INT32
physical type, so not interpreting the type annotation would lead to
wrong numbers.

import pyarrow as pa
import pyarrow.parquet as pq
import decimal

table = pa.table({
    "int64": [1, 2, 3],
    "int32": pa.array([1, 2, 3], type="int32"),
    "uint32": pa.array([1, 2, 3], type="uint32"),
    "date": pa.array([1, 2, 3], type=pa.date32()),
    "timestamp": pa.array([1, 2, 3], type=pa.timestamp("ms", tz="UTC")),
    "string": pa.array(["a", "b", "c"]),
    "list": pa.array([[1, 2], [3, 4], [5, 6]]),
    "decimal": pa.array([decimal.Decimal("1.0"), decimal.Decimal("2.0"),
decimal.Decimal("3.0")]),
})

pq.write_table(table, "test_v1.parquet", version="1.0")
pq.write_table(table, "test_v2.parquet", version="2.0")

In [55]: pq.read_metadata("test_v1.parquet").schema
Out[55]:
<pyarrow._parquet.ParquetSchema object at 0x7f839b7602c0>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int64 field_id=3 uint32;
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true,
timeUnit=milliseconds, is_from_converted_type=false,
force_set_converted_type=false));
  optional binary field_id=6 string (String);
  optional fixed_len_byte_array(1) field_id=7 decimal (Decimal(precision=2,
scale=1));
}

In [56]: pq.read_metadata("test_v2.parquet").schema
Out[56]:
<pyarrow._parquet.ParquetSchema object at 0x7f839bc43f80>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int32 field_id=3 uint32 (Int(bitWidth=32, isSigned=false));
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true,
timeUnit=milliseconds, is_from_converted_type=false,
force_set_converted_type=false));
  optional binary field_id=6 string (String);
  optional fixed_len_byte_array(1) field_id=7 decimal (Decimal(precision=2,
scale=1));
}

Re: Should we default to write parquet format version 2.0? (not data page version 2.0)

Posted by Wes McKinney <we...@gmail.com>.
I'm in favor of the confusingly-named version='2.0' default. I note
that such decisions are hampered by our lack of integration /
compatibility testing with other Parquet consumers to know whether
they will understand all of the data that we write.

On Tue, Dec 15, 2020 at 10:50 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> On 15/12/2020 at 17:46, Joris Van den Bossche wrote:
> >
> > No, I actually mean 2.2. (I don't think a 1.2 version exists, at least
> > according to the git tags)
> > You can compare the thrift file for version 2.1 (
> > https://github.com/apache/parquet-format/blob/parquet-format-2.1.0/src/thrift/parquet.thrift)
> > vs 2.2 (
> > https://github.com/apache/parquet-format/blob/apache-parquet-format-2.2.0/src/thrift/parquet.thrift),
> > in which a set of ConvertedTypes were added.
>
> I don't understand.  In which version did logical types appear then?
>
> (frankly, Parquet features are a mess to navigate)
>
> Regards
>
> Antoine.

Re: Should we default to write parquet format version 2.0? (not data page version 2.0)

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Tue, 15 Dec 2020 at 17:50, Antoine Pitrou <an...@python.org> wrote:

>
> On 15/12/2020 at 17:46, Joris Van den Bossche wrote:
> >
> > No, I actually mean 2.2. (I don't think a 1.2 version exists, at least
> > according to the git tags)
> > You can compare the thrift file for version 2.1 (
> >
> https://github.com/apache/parquet-format/blob/parquet-format-2.1.0/src/thrift/parquet.thrift
> )
> > vs 2.2 (
> >
> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.2.0/src/thrift/parquet.thrift
> ),
> > in which a set of ConvertedTypes were added.
>
> I don't understand.  In which version did logical types appear then?
>

LogicalTypes were added in format version 2.4 (
https://github.com/apache/parquet-format/pull/51), around 3 years ago.


>
> (frankly, Parquet features are a mess to navigate)
>
> Regards
>
> Antoine.
>

Re: Should we default to write parquet format version 2.0? (not data page version 2.0)

Posted by Antoine Pitrou <an...@python.org>.
On 15/12/2020 at 17:46, Joris Van den Bossche wrote:
> 
> No, I actually mean 2.2. (I don't think a 1.2 version exists, at least
> according to the git tags)
> You can compare the thrift file for version 2.1 (
> https://github.com/apache/parquet-format/blob/parquet-format-2.1.0/src/thrift/parquet.thrift)
> vs 2.2 (
> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.2.0/src/thrift/parquet.thrift),
> in which a set of ConvertedTypes were added.

I don't understand.  In which version did logical types appear then?

(frankly, Parquet features are a mess to navigate)

Regards

Antoine.

Re: Should we default to write parquet format version 2.0? (not data page version 2.0)

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Tue, 15 Dec 2020 at 17:39, Antoine Pitrou <an...@python.org> wrote:

>
> On 15/12/2020 at 17:24, Joris Van den Bossche wrote:
> >
> > But, most of the additional ConvertedTypes that were not present in
> > parquet-format 1.0 (eg the different signed/unsigned integer types,
> > timestamp, ..) were introduced in parquet-format 2.2 (
> > https://github.com/apache/parquet-format/pull/3,
> > https://issues.apache.org/jira/browse/PARQUET-12) almost 7 years ago.
>
> Surely you mean "1.2"?
>

No, I actually mean 2.2. (I don't think a 1.2 version exists, at least
according to the git tags)
You can compare the thrift file for version 2.1 (
https://github.com/apache/parquet-format/blob/parquet-format-2.1.0/src/thrift/parquet.thrift)
vs 2.2 (
https://github.com/apache/parquet-format/blob/apache-parquet-format-2.2.0/src/thrift/parquet.thrift),
in which a set of ConvertedTypes were added.


>
> > If so, could we start defaulting to version 2.0 (but still with data page
> > version 1.0), or do other parquet readers actually not yet support the
> > ConvertedTypes introduced 7 years ago?
>
> +1 for defaulting to 2.0.
>
> Regards
>
> Antoine.
>

Re: Should we default to write parquet format version 2.0? (not data page version 2.0)

Posted by Antoine Pitrou <an...@python.org>.
On 15/12/2020 at 17:24, Joris Van den Bossche wrote:
> 
> But, most of the additional ConvertedTypes that were not present in
> parquet-format 1.0 (eg the different signed/unsigned integer types,
> timestamp, ..) were introduced in parquet-format 2.2 (
> https://github.com/apache/parquet-format/pull/3,
> https://issues.apache.org/jira/browse/PARQUET-12) almost 7 years ago.

Surely you mean "1.2"?

> If so, could we start defaulting to version 2.0 (but still with data page
> version 1.0), or do other parquet readers actually not yet support the
> ConvertedTypes introduced 7 years ago?

+1 for defaulting to 2.0.

Regards

Antoine.