Posted to dev@arrow.apache.org by Joris Van den Bossche <jo...@gmail.com> on 2021/01/27 14:24:22 UTC

Re: Should we default to write parquet format version 2.0? (not data page version 2.0)

Coming back to this topic, there is one additional aspect that might
warrant some consideration: forward compatibility for nanosecond timestamps.

If we switch to a default of `version="2.0"`, that would mean we start
using the LogicalType with the nanosecond time unit. Since this has no
equivalent legacy ConvertedType, we would no longer add such an annotation
(by default, we currently convert nanoseconds to microseconds, which have a
TIMESTAMP_MICROS legacy converted type).
As a consequence, any reader that does not yet support the nanosecond
LogicalType will read the data as plain int64 instead of as timestamps.
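
To make this concrete, a quick sketch of what I mean (not actually run here,
file names are just for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# values are whole microseconds (in ns), so the version="1.0" coercion below
# does not have to truncate anything
table = pa.table({"ts": pa.array([1000, 2000, 3000], type=pa.timestamp("ns"))})

# version="1.0": nanoseconds are coerced to microseconds and annotated with
# the legacy TIMESTAMP_MICROS converted type, so older readers still see a
# timestamp column
pq.write_table(table, "ts_v1.parquet", version="1.0")

# version="2.0": the nanosecond unit is kept and only the new Timestamp
# LogicalType is written; a reader without LogicalType support sees int64
pq.write_table(table, "ts_v2.parquet", version="2.0")

print(pq.read_metadata("ts_v1.parquet").schema)
print(pq.read_metadata("ts_v2.parquet").schema)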

- The nanosecond time unit was introduced in format version 2.6.0 (Sep 27,
2018).
- Given that a large part of our user base writes Parquet from pandas, and
pandas only has nanosecond resolution, they will all start to write
nanoseconds using the new logical type without a legacy converted type
equivalent (small illustration below).
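
A tiny illustration of that pandas path (again only a sketch, file name
arbitrary):

import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2021-01-01", "2021-01-02"])})
print(df.dtypes)  # ts is datetime64[ns]; pandas has no coarser resolution
df.to_parquet("pandas_ns.parquet", engine="pyarrow")  # goes through pyarrow's writer defaults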

I don't really know how widely supported this specific logical type already
is.

I am wondering if our `version="1.0"` / `"2.0"` switch is too crude, and
whether we would need finer control like a `version="2.4"` (use logical
types, including integer type annotations other than int32/int64) or
`version="2.6"` (additionally use the nanosecond time unit). But that might
also be overkill.
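
In code the idea would be something like this (hypothetical values for the
`version` keyword, not an existing API today; reusing `pq` and `table` from
the sketch above):

# hypothetical finer-grained version selection
pq.write_table(table, "out.parquet", version="2.4")  # logical types incl. unsigned ints
pq.write_table(table, "out.parquet", version="2.6")  # additionally the nanosecond time unit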

While checking this, I made an overview of which types were introduced in
which parquet format version, in case someone wants to see the details ->
https://nbviewer.jupyter.org/gist/jorisvandenbossche/3cc9942eaffb53564df65395e5656702

Another question: do we write different encodings based on `version="1.0"`
vs `"2.0"`?

Joris

On Tue, 15 Dec 2020 at 17:59, Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> On Tue, 15 Dec 2020 at 17:24, Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
>
>>
>> Currently, when writing parquet files with Arrow (parquet-cpp), we
>> default to parquet format "1.0".
>> This means we don't use ConvertedTypes or LogicalTypes introduced in
>> format "2.0"+.
>> For example, this means we can (by default) only write int32 and int64
>> integer types, and not any of the other signed and unsigned integer types.
>>
> I have to correct myself here. Apparently, we DO write version 2.0
> Converted/LogicalTypes by default with `version="1.0"` ...
> But just not for integer types, so I got a wrong impression here by only
> looking at ints when writing the previous email.
>
> A small example using python at the bottom. From that, it seems that we
> actually do write Converted/LogicalType annotations, even with
> `version="1.0"` for types like Date and Decimal, which were only added in
> format version 2.x.
> I assume the reason we do that for those types, and not for the integer
> types, is that Date (as int32) or Decimal (variable size binary) can be
> read "correctly" as the physical type as well (only with information loss).
> The UINT_32 converted type, on the other hand, is written as the INT32
> physical type, so not interpreting the type annotation would lead to wrong
> numbers.
>
> import pyarrow as pa
> import pyarrow.parquet as pq
> import decimal
>
> table = pa.table({
>     "int64": [1, 2, 3],
>     "int32": pa.array([1, 2, 3], type="int32"),
>     "uint32": pa.array([1, 2, 3], type="uint32"),
>     "date": pa.array([1, 2, 3], type=pa.date32()),
>     "timestamp": pa.array([1, 2, 3], type=pa.timestamp("ms", tz="UTC")),
>     "string": pa.array(["a", "b", "c"]),
>     "list": pa.array([[1, 2], [3, 4], [5, 6]]),
>     "decimal": pa.array([decimal.Decimal("1.0"), decimal.Decimal("2.0"),
> decimal.Decimal("3.0")]),
> })
>
> pq.write_table(table, "test_v1.parquet", version="1.0")
> pq.write_table(table, "test_v2.parquet", version="2.0")
>
> In [55]: pq.read_metadata("test_v1.parquet").schema
> Out[55]:
> <pyarrow._parquet.ParquetSchema object at 0x7f839b7602c0>
> required group field_id=0 schema {
>   optional int64 field_id=1 int64;
>   optional int32 field_id=2 int32;
>   optional int64 field_id=3 uint32;
>   optional int32 field_id=4 date (Date);
>   optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true,
> timeUnit=milliseconds, is_from_converted_type=false,
> force_set_converted_type=false));
>   optional binary field_id=6 string (String);
>   optional fixed_len_byte_array(1) field_id=7 decimal
> (Decimal(precision=2, scale=1));
> }
>
> In [56]: pq.read_metadata("test_v2.parquet").schema
> Out[56]:
> <pyarrow._parquet.ParquetSchema object at 0x7f839bc43f80>
> required group field_id=0 schema {
>   optional int64 field_id=1 int64;
>   optional int32 field_id=2 int32;
>   optional int32 field_id=3 uint32 (Int(bitWidth=32, isSigned=false));
>   optional int32 field_id=4 date (Date);
>   optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true,
> timeUnit=milliseconds, is_from_converted_type=false,
> force_set_converted_type=false));
>   optional binary field_id=6 string (String);
>   optional fixed_len_byte_array(1) field_id=7 decimal
> (Decimal(precision=2, scale=1));
> }
>