Posted to issues@arrow.apache.org by "Stephen Simmons (Jira)" <ji...@apache.org> on 2020/10/31 02:24:00 UTC
[jira] [Created] (ARROW-10444) [Python] Timestamp metadata min/max stored as INT96 cannot be read in
Stephen Simmons created ARROW-10444:
---------------------------------------
Summary: [Python] Timestamp metadata min/max stored as INT96 cannot be read in
Key: ARROW-10444
URL: https://issues.apache.org/jira/browse/ARROW-10444
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Reporter: Stephen Simmons
I am working with Parquet files produced by AWS Redshift's UNLOAD command. The schema has several timestamp columns stored as INT96, and I have noticed that their min/max values are omitted from the metadata PyArrow exposes.
For example, for this column in my table schema, {{dv_startdateutc: timestamp[ns]}}, the statistics section of the column metadata is None, i.e. not filled in with the min/max values that are present for the other, non-timestamp columns:
{code:python}
<pyarrow._parquet.ColumnChunkMetaData object at 0x7ff5000d1a10>
file_offset: 1342723
file_path:
physical_type: INT96
num_values: 150144
path_in_schema: dv_startdateutc
is_stats_set: False
statistics: None
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 1342659
data_page_offset: 1342687
total_compressed_size: 64
total_uncompressed_size: 60
{code}
This means PyArrow cannot use metadata to filter dataset reads by date/time.
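This can be reproduced without Redshift: PyArrow itself can write INT96 timestamps via the {{use_deprecated_int96_timestamps}} flag of {{pyarrow.parquet.write_table}}, and inspecting the resulting column-chunk metadata shows the same behaviour (the file name below is arbitrary):

{code:python}
import datetime
import pyarrow as pa
import pyarrow.parquet as pq

# A small table with one nanosecond-precision timestamp column.
table = pa.table({
    "dv_startdateutc": pa.array(
        [datetime.datetime(2020, 10, 31, 2, 24)], type=pa.timestamp("ns")
    )
})

# Force the legacy INT96 physical type, as Redshift UNLOAD does.
pq.write_table(table, "int96_example.parquet",
               use_deprecated_int96_timestamps=True)

col = pq.ParquetFile("int96_example.parquet").metadata.row_group(0).column(0)
print(col.physical_type)   # INT96
print(col.is_stats_set)
print(col.statistics)
{code}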
I suspect this bug arises in {{_cast_statistic_raw_min()}} and {{_cast_statistic_raw_max()}} in {{/python/pyarrow/_parquet.pyx}} at L180. The code extracts below show there are casts for {{ParquetType_INT32}} and {{ParquetType_INT64}}, but not for {{ParquetType_INT96}}.
Can a case be added for {{ParquetType_INT96}} in both of these?
The raw {{ParquetType_INT96}} values would then be converted to the appropriate timestamp type by {{_box_logical_type_value(raw, statistics.descr())}}.
Thanks
Stephen
{code:python}
cdef _cast_statistic_raw_min(CStatistics* statistics):
    cdef ParquetType physical_type = statistics.physical_type()
    cdef uint32_t type_length = statistics.descr().type_length()
    if physical_type == ParquetType_BOOLEAN:
        return (<CBoolStatistics*> statistics).min()
    elif physical_type == ParquetType_INT32:
        return (<CInt32Statistics*> statistics).min()
    elif physical_type == ParquetType_INT64:
        return (<CInt64Statistics*> statistics).min()
    # ADD ParquetType_INT96 here!!!
    elif physical_type == ParquetType_FLOAT:
        return (<CFloatStatistics*> statistics).min()
    elif physical_type == ParquetType_DOUBLE:
        return (<CDoubleStatistics*> statistics).min()
    elif physical_type == ParquetType_BYTE_ARRAY:
        return _box_byte_array((<CByteArrayStatistics*> statistics).min())
    elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
        return _box_flba((<CFLBAStatistics*> statistics).min(), type_length)


cdef _cast_statistic_raw_max(CStatistics* statistics):
    cdef ParquetType physical_type = statistics.physical_type()
    cdef uint32_t type_length = statistics.descr().type_length()
    if physical_type == ParquetType_BOOLEAN:
        return (<CBoolStatistics*> statistics).max()
    elif physical_type == ParquetType_INT32:
        return (<CInt32Statistics*> statistics).max()
    elif physical_type == ParquetType_INT64:
        return (<CInt64Statistics*> statistics).max()
    # ADD ParquetType_INT96 here!!!
    elif physical_type == ParquetType_FLOAT:
        return (<CFloatStatistics*> statistics).max()
    elif physical_type == ParquetType_DOUBLE:
        return (<CDoubleStatistics*> statistics).max()
    elif physical_type == ParquetType_BYTE_ARRAY:
        return _box_byte_array((<CByteArrayStatistics*> statistics).max())
    elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
        return _box_flba((<CFLBAStatistics*> statistics).max(), type_length)
{code}
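For anyone implementing the cast: a Parquet INT96 timestamp is 12 bytes, packing an int64 count of nanoseconds within the day followed by an int32 Julian day number, both little-endian. A pure-Python sketch of the decoding the new case would need (the helper name is mine, not PyArrow's):

{code:python}
import struct

JULIAN_EPOCH_DAY = 2440588       # Julian day number of 1970-01-01
NANOS_PER_DAY = 86400 * 10**9

def int96_to_epoch_nanos(raw: bytes) -> int:
    """Decode a 12-byte Parquet INT96 timestamp to nanoseconds since the Unix epoch."""
    # Little-endian: 8 bytes of nanos-within-day, then 4 bytes of Julian day.
    nanos_in_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_EPOCH_DAY) * NANOS_PER_DAY + nanos_in_day

# 1970-01-01T00:00:00 encodes as (0 nanoseconds, Julian day 2440588).
raw = struct.pack("<qi", 0, JULIAN_EPOCH_DAY)
print(int96_to_epoch_nanos(raw))   # 0
{code}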
--
This message was sent by Atlassian Jira
(v8.3.4#803005)