Posted to jira@arrow.apache.org by "Stephen Simmons (Jira)" <ji...@apache.org> on 2020/10/31 13:23:00 UTC
[jira] [Comment Edited] (ARROW-10444) [Python] Timestamp metadata
min/max stored as INT96 cannot be read in
[ https://issues.apache.org/jira/browse/ARROW-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224073#comment-17224073 ]
Stephen Simmons edited comment on ARROW-10444 at 10/31/20, 1:22 PM:
--------------------------------------------------------------------
Deprecated is fine for writing, but when you're reading, you have to work with whatever Parquet files you're given!
Looks like I'll also need to define a CInt96Statistics or maybe a CTimestampStatistics class.
For reference, the conversion from Int96 Impala timestamp to int64[ns] in {{/parquet-cpp/parquet/arrow/reader.h}} has:
{code:java}
constexpr int64_t kJulianToUnixEpochDays = 2440588LL;
constexpr int64_t kMillisecondsInADay = 86400000LL;
constexpr int64_t kNanosecondsInADay = kMillisecondsInADay * 1000LL * 1000LL;

static inline int64_t impala_timestamp_to_nanoseconds(const Int96& impala_timestamp) {
  // value[2] is the Julian day; the first 8 bytes are nanoseconds within the day.
  int64_t days_since_epoch = impala_timestamp.value[2] - kJulianToUnixEpochDays;
  int64_t nanoseconds = *(reinterpret_cast<const int64_t*>(&(impala_timestamp.value)));
  return days_since_epoch * kNanosecondsInADay + nanoseconds;
}
{code}
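For a quick sanity check outside C++, the same decoding can be sketched in plain Python with {{struct}}. The 12-byte layout assumed here is the standard Impala one: an 8-byte little-endian int64 of nanoseconds within the day, followed by a 4-byte little-endian Julian day number (the function name is mine, not from the codebase):

```python
import struct

# Constants mirroring the C++ snippet above.
JULIAN_TO_UNIX_EPOCH_DAYS = 2440588
NANOSECONDS_IN_A_DAY = 86_400_000 * 1_000 * 1_000


def int96_to_nanoseconds(raw: bytes) -> int:
    """Decode a 12-byte INT96 Impala timestamp to nanoseconds since the
    Unix epoch, mirroring impala_timestamp_to_nanoseconds above."""
    if len(raw) != 12:
        raise ValueError("INT96 timestamp must be exactly 12 bytes")
    # "<qI" = little-endian int64 (ns within day) + uint32 (Julian day).
    nanoseconds, julian_day = struct.unpack("<qI", raw)
    return (julian_day - JULIAN_TO_UNIX_EPOCH_DAYS) * NANOSECONDS_IN_A_DAY + nanoseconds
```

So the Unix epoch itself (Julian day 2440588, zero nanoseconds into the day) decodes to 0.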
> [Python] Timestamp metadata min/max stored as INT96 cannot be read in
> ---------------------------------------------------------------------
>
> Key: ARROW-10444
> URL: https://issues.apache.org/jira/browse/ARROW-10444
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Stephen Simmons
> Priority: Major
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> I am working with Parquet files produced by AWS Redshift's UNLOAD command. The schema has several timestamp columns stored as INT96. I have noticed that their min/max values are omitted from PyArrow's metadata.
> For example, for this column in my table schema, {{dv_startdateutc: timestamp[ns]}}, the statistics section of the column metadata is None, i.e. not filled in with the min/max values that are present for the other, non-timestamp columns:
> {code:python}
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7ff5000d1a10>
>   file_offset: 1342723
>   file_path:
>   physical_type: INT96
>   num_values: 150144
>   path_in_schema: dv_startdateutc
>   is_stats_set: False
>   statistics: None
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 1342659
>   data_page_offset: 1342687
>   total_compressed_size: 64
>   total_uncompressed_size: 60
> {code}
> This means PyArrow cannot use metadata to filter dataset reads by date/time.
>
> I suspect this bug arises in {{_cast_statistic_raw_min()}} and {{_cast_statistic_raw_max()}} in {{/python/pyarrow/_parquet.pyx}} at L180. The excerpts below show there are casts for {{ParquetType_INT32}} and {{ParquetType_INT64}}, but none for {{ParquetType_INT96}}.
> Could a case be added for {{ParquetType_INT96}} to both of these?
> Note that those raw {{ParquetType_INT96}} values will then be converted to the appropriate timestamp type by {{_box_logical_type_value(raw, statistics.descr())}}.
>
> Thanks
> Stephen
> {code:python}
> cdef _cast_statistic_raw_min(CStatistics* statistics):
>     cdef ParquetType physical_type = statistics.physical_type()
>     cdef uint32_t type_length = statistics.descr().type_length()
>     if physical_type == ParquetType_BOOLEAN:
>         return (<CBoolStatistics*> statistics).min()
>     elif physical_type == ParquetType_INT32:
>         return (<CInt32Statistics*> statistics).min()
>     elif physical_type == ParquetType_INT64:
>         return (<CInt64Statistics*> statistics).min()
>     # ADD ParquetType_INT96 here!!!
>     elif physical_type == ParquetType_FLOAT:
>         return (<CFloatStatistics*> statistics).min()
>     elif physical_type == ParquetType_DOUBLE:
>         return (<CDoubleStatistics*> statistics).min()
>     elif physical_type == ParquetType_BYTE_ARRAY:
>         return _box_byte_array((<CByteArrayStatistics*> statistics).min())
>     elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
>         return _box_flba((<CFLBAStatistics*> statistics).min(), type_length)
>
>
> cdef _cast_statistic_raw_max(CStatistics* statistics):
>     cdef ParquetType physical_type = statistics.physical_type()
>     cdef uint32_t type_length = statistics.descr().type_length()
>     if physical_type == ParquetType_BOOLEAN:
>         return (<CBoolStatistics*> statistics).max()
>     elif physical_type == ParquetType_INT32:
>         return (<CInt32Statistics*> statistics).max()
>     elif physical_type == ParquetType_INT64:
>         return (<CInt64Statistics*> statistics).max()
>     # ADD ParquetType_INT96 here!!!
>     elif physical_type == ParquetType_FLOAT:
>         return (<CFloatStatistics*> statistics).max()
>     elif physical_type == ParquetType_DOUBLE:
>         return (<CDoubleStatistics*> statistics).max()
>     elif physical_type == ParquetType_BYTE_ARRAY:
>         return _box_byte_array((<CByteArrayStatistics*> statistics).max())
>     elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
>         return _box_flba((<CFLBAStatistics*> statistics).max(), type_length)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)