Posted to dev@parquet.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/02/03 10:09:00 UTC

[jira] [Created] (PARQUET-1972) [C++] Switch to format version 2 as default for writing Parquet

Joris Van den Bossche created PARQUET-1972:
----------------------------------------------

             Summary: [C++] Switch to format version 2 as default for writing Parquet
                 Key: PARQUET-1972
                 URL: https://issues.apache.org/jira/browse/PARQUET-1972
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
            Reporter: Joris Van den Bossche


Related to the thread on the arrow dev mailing list: https://lists.apache.org/thread.html/rf1a377c66990ae5ac0693119d416c93a7e19228d3eaaea8bd90acb17%40%3Cdev.arrow.apache.org%3E

Currently, when writing Parquet files with Arrow (parquet-cpp), we default to Parquet format version "1.0". In practice, this means that we don't use certain LogicalTypes (e.g. we don't write integers other than int32/int64, and we don't write nanosecond timestamps).

I think it would be nice to enable nanosecond timestamps by default, but I also have no idea how widely this is already supported by other readers.

To be clear, this is *not* about enabling _data page_ version 2 by default; in Arrow, that is governed by a separate option.

While checking this, I made an overview of which types were introduced in
which Parquet format version, in case someone wants to see the details:
https://nbviewer.jupyter.org/gist/jorisvandenbossche/3cc9942eaffb53564df65395e5656702



--
This message was sent by Atlassian Jira
(v8.3.4#803005)