Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/01/27 13:27:00 UTC

[jira] [Created] (ARROW-11399) [C++][Parquet] Timestamp ColumnDescriptor (from logical type) incorrectly showing ConvertedType as NONE

Joris Van den Bossche created ARROW-11399:
---------------------------------------------

             Summary: [C++][Parquet] Timestamp ColumnDescriptor (from logical type) incorrectly showing ConvertedType as NONE
                 Key: ARROW-11399
                 URL: https://issues.apache.org/jira/browse/ARROW-11399
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche


I ran into this, and find it rather confusing:

{code}
In [1]: import pyarrow.parquet as pq

In [2]: import pyarrow as pa

In [3]: table = pa.table({'a': pa.array([1, 2], pa.timestamp("ms")), 'b': pa.array([1, 2], pa.timestamp("ms", tz="UTC"))})

In [4]: pq.write_table(table, "test_parquet_schema.parquet")

In [5]: pq.read_metadata("test_parquet_schema.parquet").schema.column(0)
Out[5]: 
<ParquetColumnSchema>
  name: a
  path: a
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE

In [6]: pq.read_metadata("test_parquet_schema.parquet").schema.column(1)
Out[6]: 
<ParquetColumnSchema>
  name: b
  path: b
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): TIMESTAMP_MILLIS
{code}

So it "seems" that the Parquet file has the legacy ConvertedType set only for the second column, and not for the first.

However, I am quite sure it is set for both: that was the outcome of the discussion about this at the time of pyarrow 0.14 (ARROW-5878, https://github.com/apache/arrow/pull/4825), and it can also be shown by reading the Parquet schema with an older version of pyarrow that doesn't support logical types:

{code}
In [1]: import pyarrow as pa; import pyarrow.parquet as pq

In [2]: pa.__version__
Out[2]: '0.13.0'

In [4]: pq.read_metadata("test_parquet_schema.parquet").schema
Out[4]: 
<pyarrow._parquet.ParquetSchema object at 0x7f67d407fe50>
a: INT64 TIMESTAMP_MILLIS
b: INT64 TIMESTAMP_MILLIS
{code}

I understand that when _reading_ the schema in a recent version of pyarrow, we no longer need the ConvertedType information to read the data correctly. But reporting the ConvertedType as NONE, which suggests it is not present in the Parquet schema, is quite confusing (certainly when checking files for forward/backward compatibility behaviour).

cc [~tpboudreau]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)