Posted to jira@arrow.apache.org by "Stephen Simmons (Jira)" <ji...@apache.org> on 2020/10/31 13:23:00 UTC

[jira] [Comment Edited] (ARROW-10444) [Python] Timestamp metadata min/max stored as INT96 cannot be read in

    [ https://issues.apache.org/jira/browse/ARROW-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224073#comment-17224073 ] 

Stephen Simmons edited comment on ARROW-10444 at 10/31/20, 1:22 PM:
--------------------------------------------------------------------

Deprecation is fine for writing, but when reading, you have to work with the Parquet files you're given!

It looks like I'll also need to define a {{CInt96Statistics}} or maybe a {{CTimestampStatistics}} class.

For reference, {{/parquet-cpp/parquet/arrow/reader.h}} converts an Int96 Impala timestamp to int64[ns] like this:
{code:cpp}
constexpr int64_t kJulianToUnixEpochDays = 2440588LL;
constexpr int64_t kMillisecondsInADay = 86400000LL;
constexpr int64_t kNanosecondsInADay = kMillisecondsInADay * 1000LL * 1000LL;

static inline int64_t impala_timestamp_to_nanoseconds(const Int96& impala_timestamp) {
  int64_t days_since_epoch = impala_timestamp.value[2] - kJulianToUnixEpochDays;
  int64_t nanoseconds = *(reinterpret_cast<const int64_t*>(&(impala_timestamp.value)));
  return days_since_epoch * kNanosecondsInADay + nanoseconds;
} {code}
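
For a quick sanity check from Python, the same arithmetic can be written roughly as below (assuming the 12-byte little-endian Int96 layout used above: 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number; the function name is just for illustration):
{code:python}
import struct

JULIAN_TO_UNIX_EPOCH_DAYS = 2440588
NANOSECONDS_IN_A_DAY = 86400000 * 1000 * 1000

def int96_to_nanoseconds(raw12: bytes) -> int:
    # 8 bytes of nanoseconds within the day, then a 4-byte Julian day number
    nanoseconds, julian_day = struct.unpack('<qI', raw12)
    days_since_epoch = julian_day - JULIAN_TO_UNIX_EPOCH_DAYS
    return days_since_epoch * NANOSECONDS_IN_A_DAY + nanoseconds

# The Unix epoch itself (Julian day 2440588, 0 ns into the day) maps to 0
assert int96_to_nanoseconds(struct.pack('<qI', 0, 2440588)) == 0
{code}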


was (Author: stevesimmons):
Ah, I'll also need to define a CInt96Statistics or maybe CTimestampStatistics class.

For reference, the conversion from Int96 Impala timestamp to int64[ns] in {{/parquet-cpp/parquet/arrow/reader.h}} has:

 
{code:java}
constexpr int64_t kJulianToUnixEpochDays = 2440588LL;
constexpr int64_t kMillisecondsInADay = 86400000LL;
constexpr int64_t kNanosecondsInADay = kMillisecondsInADay * 1000LL * 1000LL;

static inline int64_t impala_timestamp_to_nanoseconds(const Int96& impala_timestamp) {
  int64_t days_since_epoch = impala_timestamp.value[2] - kJulianToUnixEpochDays;
  int64_t nanoseconds = *(reinterpret_cast<const int64_t*>(&(impala_timestamp.value)));
  return days_since_epoch * kNanosecondsInADay + nanoseconds;
} {code}

> [Python] Timestamp metadata min/max stored as INT96 cannot be read in
> ---------------------------------------------------------------------
>
>                 Key: ARROW-10444
>                 URL: https://issues.apache.org/jira/browse/ARROW-10444
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Stephen Simmons
>            Priority: Major
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I am working with Parquet files produced by AWS Redshift's UNLOAD command. The schema has several timestamp columns stored as INT96, and I have noticed that their min/max values are omitted from PyArrow's metadata.
> For example, for this column in my table schema, {{dv_startdateutc: timestamp[ns]}}, the statistics section of the column metadata is None, i.e. it is not filled in with the min/max values that are present for the other, non-timestamp columns:
> {code:python}
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7ff5000d1a10>
>  file_offset: 1342723
>  file_path: 
>  physical_type: INT96
>  num_values: 150144
>  path_in_schema: dv_startdateutc
>  is_stats_set: False
>  statistics: None
>  compression: SNAPPY
>  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>  has_dictionary_page: True
>  dictionary_page_offset: 1342659
>  data_page_offset: 1342687
>  total_compressed_size: 64
>  total_uncompressed_size: 60
> {code}
> This means PyArrow cannot use metadata to filter dataset reads by date/time.
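> For completeness, the ColumnChunkMetaData shown above can be pulled out from Python along these lines (the file name and column index are illustrative):
> {code:python}
> import pyarrow.parquet as pq
> 
> pf = pq.ParquetFile('redshift_unload_part0.parquet')  # illustrative file name
> col = pf.metadata.row_group(0).column(6)              # illustrative column index
> print(col.physical_type)   # INT96
> print(col.is_stats_set)    # False for the INT96 timestamp columns
> print(col.statistics)      # None
> {code}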
>   
>  I suspect this bug arises in {{_cast_statistic_raw_min()}} and {{_cast_statistic_raw_max()}} in {{/python/pyarrow/_parquet.pyx}} at L180. The code extracts below show there are casts for {{ParquetType_INT32}} and {{ParquetType_INT64}}, but not for {{ParquetType_INT96}}.
> Can a case be added for {{ParquetType_INT96}} in both of these?
> Note that those raw {{ParquetType_INT96}} values would then be converted to the appropriate timestamp type by {{_box_logical_type_value(raw, statistics.descr())}}. A rough sketch of the extra branch follows the extract below.
>  
> Thanks
>  Stephen
> {code:python}
>  cdef _cast_statistic_raw_min(CStatistics* statistics):
>      cdef ParquetType physical_type = statistics.physical_type()
>      cdef uint32_t type_length = statistics.descr().type_length()
>      if physical_type == ParquetType_BOOLEAN:
>          return (<CBoolStatistics*> statistics).min()
>      elif physical_type == ParquetType_INT32:
>          return (<CInt32Statistics*> statistics).min()
>      elif physical_type == ParquetType_INT64:
>          return (<CInt64Statistics*> statistics).min()
>      # ADD ParquetType_INT96 here!!!     
>      elif physical_type == ParquetType_FLOAT:
>          return (<CFloatStatistics*> statistics).min()
>      elif physical_type == ParquetType_DOUBLE:
>          return (<CDoubleStatistics*> statistics).min()
>      elif physical_type == ParquetType_BYTE_ARRAY:
>          return _box_byte_array((<CByteArrayStatistics*> statistics).min())
>      elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
>          return _box_flba((<CFLBAStatistics*> statistics).min(), type_length)
>
>  cdef _cast_statistic_raw_max(CStatistics* statistics):
>      cdef ParquetType physical_type = statistics.physical_type()
>      cdef uint32_t type_length = statistics.descr().type_length()
>      if physical_type == ParquetType_BOOLEAN:
>          return (<CBoolStatistics*> statistics).max()
>      elif physical_type == ParquetType_INT32:
>          return (<CInt32Statistics*> statistics).max()
>      elif physical_type == ParquetType_INT64:
>          return (<CInt64Statistics*> statistics).max()
>      # ADD ParquetType_INT96 here!!!
>      elif physical_type == ParquetType_FLOAT:
>          return (<CFloatStatistics*> statistics).max()
>      elif physical_type == ParquetType_DOUBLE:
>          return (<CDoubleStatistics*> statistics).max()
>      elif physical_type == ParquetType_BYTE_ARRAY:
>          return _box_byte_array((<CByteArrayStatistics*> statistics).max())
>      elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
>          return _box_flba((<CFLBAStatistics*> statistics).max(), type_length)
> {code}
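> A rough sketch of what the extra branch might look like in {{_cast_statistic_raw_min()}} ({{_cast_statistic_raw_max()}} would mirror it). The {{CInt96Statistics}} cast and the {{_box_int96_timestamp()}} helper do not exist in pyarrow today and are named here purely for illustration:
> {code:python}
>      elif physical_type == ParquetType_INT96:
>          # Hypothetical: a CInt96Statistics declaration would need to be
>          # added to the Cython definitions first; the raw Int96 min could
>          # then be converted to nanoseconds as in the reader.h snippet above.
>          return _box_int96_timestamp((<CInt96Statistics*> statistics).min())
> {code}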



--
This message was sent by Atlassian Jira
(v8.3.4#803005)