You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/07 07:56:12 UTC

[GitHub] [arrow] isichei opened a new pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

isichei opened a new pull request #10461:
URL: https://github.com/apache/arrow/pull/10461


   Have added functionality in C++ code to allow users to define the arrow timestamp unit when reading parquet INT96 types. This avoids the overflow bug when trying to convert INT96 values which have dates which are out of bounds for Arrow NS Timestamp. 
   
   See added test: `TestArrowReadWrite.DownsampleDeprecatedInt96` which demonstrates use and expected results.
   
   Main discussion of changes in [JIRA Issue ARROW-12096](https://issues.apache.org/jira/browse/ARROW-12096).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] isichei commented on a change in pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
isichei commented on a change in pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#discussion_r646852666



##########
File path: cpp/src/parquet/arrow/reader_internal.cc
##########
@@ -742,20 +752,20 @@ Status TransferColumnData(RecordReader* reader, std::shared_ptr<DataType> value_
     case ::arrow::Type::TIMESTAMP: {
       const ::arrow::TimestampType& timestamp_type =
           checked_cast<::arrow::TimestampType&>(*value_type);
-      switch (timestamp_type.unit()) {
-        case ::arrow::TimeUnit::MILLI:
-        case ::arrow::TimeUnit::MICRO: {
-          result = TransferZeroCopy(reader, value_type);
-        } break;
-        case ::arrow::TimeUnit::NANO: {
-          if (descr->physical_type() == ::parquet::Type::INT96) {
-            RETURN_NOT_OK(TransferInt96(reader, pool, value_type, &result));
-          } else {
+      if (descr->physical_type() == ::parquet::Type::INT96) {
+            RETURN_NOT_OK(TransferInt96(reader, pool, value_type, &result, timestamp_type.unit()));
+        }
+      else {
+        switch (timestamp_type.unit()) {
+          case ::arrow::TimeUnit::SECOND:
+          case ::arrow::TimeUnit::MILLI:
+          case ::arrow::TimeUnit::MICRO:
+          case ::arrow::TimeUnit::NANO: {
             result = TransferZeroCopy(reader, value_type);
-          }
-        } break;
-        default:
-          return Status::NotImplemented("TimeUnit not supported");
+          } break;
+          default:

Review comment:
       I don't think so. I was just copying how others had written switch expressions in the existing codebase. Will remove 👍  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#issuecomment-856103500


   > @wesm @emkornfield What do you think about the functionality that is added here? Is it a reasonable burden for us to take on?
   
   It seems like a small enough change so I'm okay with it.  In general, though since Int96 is deprecated we should be looking very carefully at adding new functionality for it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10461: ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#issuecomment-861557116


   It will probably be useful to expose this in Python (or R) as well? (the original JIRA report is also using a pyarrow example)  
   
   (this can be done in a follow-up to be clear, just to be sure we then create a JIRA for a follow-up task)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#discussion_r646928428



##########
File path: cpp/src/parquet/types.h
##########
@@ -602,6 +602,49 @@ static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
   return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + nanoseconds);
 }
 
+// ARROW-12096
+static inline int64_t Int96GetMicroSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t microseconds = nanoseconds/static_cast<uint64_t>(1000);
+
+  return static_cast<int64_t>(days_since_epoch * kMicrosecondsPerDay + microseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetMilliSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t milliseconds = nanoseconds/static_cast<uint64_t>(1000000);
+
+  return static_cast<int64_t>(days_since_epoch * kMillisecondsPerDay + milliseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t seconds = nanoseconds/(static_cast<uint64_t>(1000000000));
+
+  return static_cast<int64_t>(days_since_epoch * kSecondsPerDay + seconds);

Review comment:
       As you prefer really. As long as the chosen solution avoids repeating the same decoding code, it should be ok.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] isichei commented on a change in pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
isichei commented on a change in pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#discussion_r646870346



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -1671,6 +1671,91 @@ TEST(TestArrowReadWrite, UseDeprecatedInt96) {
   ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result, *result));
 }
 
+// Test for added functionality in ARROW-12096
+TEST(TestArrowReadWrite, DownsampleDeprecatedInt96) {
+  using ::arrow::ArrayFromVector;
+  using ::arrow::field;
+  using ::arrow::schema;
+
+  std::vector<bool> is_valid = {true, true, true, true};
+
+  auto t_s = ::arrow::timestamp(TimeUnit::SECOND);
+  auto t_ms = ::arrow::timestamp(TimeUnit::MILLI);
+  auto t_us = ::arrow::timestamp(TimeUnit::MICRO);
+  auto t_ns = ::arrow::timestamp(TimeUnit::NANO);
+
+  // Values demonstrate loss of resolution when "down sampling" INT96 to units that are not NS
+  std::vector<int64_t> s_values = {1489269, 1489269, 1489269, 1489269};
+  std::vector<int64_t> ms_values = {1489269000, 1489269000,
+                                    1489269000, 1489269001};
+  std::vector<int64_t> us_values = {1489269000000, 1489269000000,
+                                    1489269000001, 1489269001000};
+  std::vector<int64_t> ns_values = {1489269000000000LL, 1489269000000001LL,
+                                    1489269000001000LL, 1489269001000000LL};
+
+  std::shared_ptr<Array> a_s, a_ms, a_us, a_ns;
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_s, is_valid, s_values, &a_s);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ms, is_valid, ms_values, &a_ms);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns);
+
+  // Create single input table of NS to be written to parquet with INT96
+  auto input_schema = schema({field("f", t_ns)});
+  auto input = Table::Make(input_schema, {a_ns});
+
+  // Create an expected schema for each resulting table (one for each "down sampled" ts)
+  auto ex_schema_s = schema({field("f", t_s)});
+  auto ex_schema_ms = schema({field("f", t_ms)});
+  auto ex_schema_us = schema({field("f", t_us)});
+  
+  // Create tables
+  auto ex_result_s = Table::Make(ex_schema_s, {a_s});
+  auto ex_result_ms = Table::Make(ex_schema_ms, {a_ms});
+  auto ex_result_us = Table::Make(ex_schema_us, {a_us});
+
+  std::shared_ptr<Table> result_s;
+  std::shared_ptr<Table> result_ms;
+  std::shared_ptr<Table> result_us;
+
+  ArrowReaderProperties arrow_reader_prop_s, arrow_reader_prop_ms, arrow_reader_prop_us;
+  arrow_reader_prop_s.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::SECOND);
+  arrow_reader_prop_ms.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::MILLI);
+  arrow_reader_prop_us.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::MICRO);
+
+// SECOND
+  ASSERT_NO_FATAL_FAILURE(DoRoundtrip(
+    input, input->num_rows(), &result_s, default_writer_properties(),
+    ArrowWriterProperties::Builder().enable_deprecated_int96_timestamps()->build(),
+    arrow_reader_prop_s));
+
+  ASSERT_NO_FATAL_FAILURE(::arrow::AssertSchemaEqual(*ex_result_s->schema(),
+                                                     *result_s->schema(),
+                                                     /*check_metadata=*/false));
+  ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result_s, *result_s));

Review comment:
       Will give it a go! I'm afraid it has been a long time since I wrote any C++ code so the languange is basically new to me at this point - hence the basic repitition in places.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#discussion_r646709078



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -1671,6 +1671,91 @@ TEST(TestArrowReadWrite, UseDeprecatedInt96) {
   ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result, *result));
 }
 
+// Test for added functionality in ARROW-12096
+TEST(TestArrowReadWrite, DownsampleDeprecatedInt96) {
+  using ::arrow::ArrayFromVector;
+  using ::arrow::field;
+  using ::arrow::schema;
+
+  std::vector<bool> is_valid = {true, true, true, true};
+
+  auto t_s = ::arrow::timestamp(TimeUnit::SECOND);
+  auto t_ms = ::arrow::timestamp(TimeUnit::MILLI);
+  auto t_us = ::arrow::timestamp(TimeUnit::MICRO);
+  auto t_ns = ::arrow::timestamp(TimeUnit::NANO);
+
+  // Values demonstrate loss of resolution when "down sampling" INT96 to units that are not NS
+  std::vector<int64_t> s_values = {1489269, 1489269, 1489269, 1489269};
+  std::vector<int64_t> ms_values = {1489269000, 1489269000,
+                                    1489269000, 1489269001};
+  std::vector<int64_t> us_values = {1489269000000, 1489269000000,
+                                    1489269000001, 1489269001000};
+  std::vector<int64_t> ns_values = {1489269000000000LL, 1489269000000001LL,
+                                    1489269000001000LL, 1489269001000000LL};
+
+  std::shared_ptr<Array> a_s, a_ms, a_us, a_ns;
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_s, is_valid, s_values, &a_s);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ms, is_valid, ms_values, &a_ms);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns);

Review comment:
       I know you're essentially copying this from the test above, but nowadays we have `ArrowFromJSON` which allows to express test data much more easily and tersely (you can grep through the source tree to find examples).
   
   You may also change the test above to use it, at the same time.

##########
File path: cpp/src/parquet/arrow/reader_internal.cc
##########
@@ -742,20 +752,20 @@ Status TransferColumnData(RecordReader* reader, std::shared_ptr<DataType> value_
     case ::arrow::Type::TIMESTAMP: {
       const ::arrow::TimestampType& timestamp_type =
           checked_cast<::arrow::TimestampType&>(*value_type);
-      switch (timestamp_type.unit()) {
-        case ::arrow::TimeUnit::MILLI:
-        case ::arrow::TimeUnit::MICRO: {
-          result = TransferZeroCopy(reader, value_type);
-        } break;
-        case ::arrow::TimeUnit::NANO: {
-          if (descr->physical_type() == ::parquet::Type::INT96) {
-            RETURN_NOT_OK(TransferInt96(reader, pool, value_type, &result));
-          } else {
+      if (descr->physical_type() == ::parquet::Type::INT96) {
+            RETURN_NOT_OK(TransferInt96(reader, pool, value_type, &result, timestamp_type.unit()));
+        }
+      else {
+        switch (timestamp_type.unit()) {
+          case ::arrow::TimeUnit::SECOND:
+          case ::arrow::TimeUnit::MILLI:
+          case ::arrow::TimeUnit::MICRO:
+          case ::arrow::TimeUnit::NANO: {
             result = TransferZeroCopy(reader, value_type);
-          }
-        } break;
-        default:
-          return Status::NotImplemented("TimeUnit not supported");
+          } break;
+          default:

Review comment:
       This `default` case doesn't seem useful, unless the compiler requires it?

##########
File path: cpp/src/parquet/arrow/schema_internal.cc
##########
@@ -211,14 +212,22 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
   }
 }
 
+// ARROW-12096 -- Overloading functions with new input (setting default as NANO)
 Result<std::shared_ptr<ArrowType>> GetArrowType(const schema::PrimitiveNode& primitive) {
   return GetArrowType(primitive.physical_type(), *primitive.logical_type(),
-                      primitive.type_length());
+                      primitive.type_length(), ::arrow::TimeUnit::NANO);
 }
 
 Result<std::shared_ptr<ArrowType>> GetArrowType(const ColumnDescriptor& descriptor) {
   return GetArrowType(descriptor.physical_type(), *descriptor.logical_type(),
-                      descriptor.type_length());
+                      descriptor.type_length(), ::arrow::TimeUnit::NANO);
+}
+
+// ARROW-12096 -- Exposing INT96 arrow type definition fromm parquet reader

Review comment:
       Same here.

##########
File path: cpp/src/parquet/arrow/reader_internal.cc
##########
@@ -353,7 +353,8 @@ Status TransferBool(RecordReader* reader, MemoryPool* pool, Datum* out) {
 }
 
 Status TransferInt96(RecordReader* reader, MemoryPool* pool,
-                     const std::shared_ptr<DataType>& type, Datum* out) {
+                     const std::shared_ptr<DataType>& type, Datum* out,
+                     const ::arrow::TimeUnit::type& int96_arrow_time_unit) {

Review comment:
       You do not need to pass `TimeUnit::type` as a reference, since it's a cheap trivial type. Just pass it by value.

##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -1671,6 +1671,91 @@ TEST(TestArrowReadWrite, UseDeprecatedInt96) {
   ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result, *result));
 }
 
+// Test for added functionality in ARROW-12096
+TEST(TestArrowReadWrite, DownsampleDeprecatedInt96) {
+  using ::arrow::ArrayFromVector;
+  using ::arrow::field;
+  using ::arrow::schema;
+
+  std::vector<bool> is_valid = {true, true, true, true};
+
+  auto t_s = ::arrow::timestamp(TimeUnit::SECOND);
+  auto t_ms = ::arrow::timestamp(TimeUnit::MILLI);
+  auto t_us = ::arrow::timestamp(TimeUnit::MICRO);
+  auto t_ns = ::arrow::timestamp(TimeUnit::NANO);
+
+  // Values demonstrate loss of resolution when "down sampling" INT96 to units that are not NS
+  std::vector<int64_t> s_values = {1489269, 1489269, 1489269, 1489269};
+  std::vector<int64_t> ms_values = {1489269000, 1489269000,
+                                    1489269000, 1489269001};
+  std::vector<int64_t> us_values = {1489269000000, 1489269000000,
+                                    1489269000001, 1489269001000};
+  std::vector<int64_t> ns_values = {1489269000000000LL, 1489269000000001LL,
+                                    1489269000001000LL, 1489269001000000LL};
+
+  std::shared_ptr<Array> a_s, a_ms, a_us, a_ns;
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_s, is_valid, s_values, &a_s);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ms, is_valid, ms_values, &a_ms);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns);
+
+  // Create single input table of NS to be written to parquet with INT96
+  auto input_schema = schema({field("f", t_ns)});
+  auto input = Table::Make(input_schema, {a_ns});
+
+  // Create an expected schema for each resulting table (one for each "down sampled" ts)
+  auto ex_schema_s = schema({field("f", t_s)});
+  auto ex_schema_ms = schema({field("f", t_ms)});
+  auto ex_schema_us = schema({field("f", t_us)});
+  
+  // Create tables
+  auto ex_result_s = Table::Make(ex_schema_s, {a_s});
+  auto ex_result_ms = Table::Make(ex_schema_ms, {a_ms});
+  auto ex_result_us = Table::Make(ex_schema_us, {a_us});
+
+  std::shared_ptr<Table> result_s;
+  std::shared_ptr<Table> result_ms;
+  std::shared_ptr<Table> result_us;
+
+  ArrowReaderProperties arrow_reader_prop_s, arrow_reader_prop_ms, arrow_reader_prop_us;
+  arrow_reader_prop_s.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::SECOND);
+  arrow_reader_prop_ms.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::MILLI);
+  arrow_reader_prop_us.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::MICRO);
+
+// SECOND
+  ASSERT_NO_FATAL_FAILURE(DoRoundtrip(
+    input, input->num_rows(), &result_s, default_writer_properties(),
+    ArrowWriterProperties::Builder().enable_deprecated_int96_timestamps()->build(),
+    arrow_reader_prop_s));
+
+  ASSERT_NO_FATAL_FAILURE(::arrow::AssertSchemaEqual(*ex_result_s->schema(),
+                                                     *result_s->schema(),
+                                                     /*check_metadata=*/false));
+  ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result_s, *result_s));

Review comment:
       You can probably create a smaller helper function, method, or even a lambda, to avoid repeating those three lines below.

##########
File path: cpp/src/parquet/arrow/schema_internal.cc
##########
@@ -181,7 +181,8 @@ Result<std::shared_ptr<ArrowType>> FromInt64(const LogicalType& logical_type) {
 
 Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
                                                 const LogicalType& logical_type,
-                                                int type_length) {
+                                                int type_length,
+                                                const ::arrow::TimeUnit::type& int96_arrow_time_unit) {

Review comment:
       Same comment here, with respect to passing by value vs. reference.

##########
File path: cpp/src/parquet/arrow/schema_internal.h
##########
@@ -39,8 +39,20 @@ Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type
                                                         const LogicalType& logical_type,
                                                         int type_length);
 
+// ARROW-12096 Exposing int96 arrow timestamp unit definition
+Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type,
+                                                        const LogicalType& logical_type,
+                                                        int type_length,
+                                                        const ::arrow::TimeUnit::type& int96_arrow_time_unit);

Review comment:
       I don't think this is the right place to pass int96-specific information. Perhaps this should be done at a higher level (for example in `schema.cc`?).

##########
File path: cpp/src/parquet/arrow/schema_internal.h
##########
@@ -39,8 +39,20 @@ Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type
                                                         const LogicalType& logical_type,
                                                         int type_length);
 
+// ARROW-12096 Exposing int96 arrow timestamp unit definition
+Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type,
+                                                        const LogicalType& logical_type,
+                                                        int type_length,
+                                                        const ::arrow::TimeUnit::type& int96_arrow_time_unit);
+
 Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
     const schema::PrimitiveNode& primitive);
+
+// ARROW-12096 Exposing int96 arrow timestamp unit definition
+Result<std::shared_ptr<::arrow::DataType>> GetArrowType(

Review comment:
       Same here.

##########
File path: cpp/src/parquet/types.h
##########
@@ -602,6 +602,49 @@ static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
   return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + nanoseconds);
 }
 
+// ARROW-12096
+static inline int64_t Int96GetMicroSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t microseconds = nanoseconds/static_cast<uint64_t>(1000);
+
+  return static_cast<int64_t>(days_since_epoch * kMicrosecondsPerDay + microseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetMilliSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t milliseconds = nanoseconds/static_cast<uint64_t>(1000000);
+
+  return static_cast<int64_t>(days_since_epoch * kMillisecondsPerDay + milliseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t seconds = nanoseconds/(static_cast<uint64_t>(1000000000));
+
+  return static_cast<int64_t>(days_since_epoch * kSecondsPerDay + seconds);

Review comment:
       For example:
   ```c++
   struct DecodedInt96 {
     uint64_t days_since_epoch;
     uint64_t nanoseconds;
   };
   
   static inline int64_t DecodeInt96Timestamp(const parquet::Int96& i96) {
     // We do the computations in the unsigned domain to avoid unsigned behaviour
     // on overflow.
     DecodedInt96 result;
     result.days_since_epoch =
         i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
     result.nanoseconds = 0;
     memcpy(&result.nanoseconds, &i96.value, sizeof(uint64_t));
     return result;
   }
   
   static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
     const auto decoded = DecodeInt96Timestamp(i96);
     return static_cast<int64_t>(decoded.days_since_epoch * kNanosecondsPerDay + decoded.nanoseconds);
   }
   ```
   

##########
File path: cpp/src/parquet/arrow/schema_internal.cc
##########
@@ -211,14 +212,22 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
   }
 }
 
+// ARROW-12096 -- Overloading functions with new input (setting default as NANO)

Review comment:
       This comment doesn't seem terribly informative. Can you remove it?

##########
File path: cpp/src/parquet/types.h
##########
@@ -602,6 +602,49 @@ static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
   return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + nanoseconds);
 }
 
+// ARROW-12096
+static inline int64_t Int96GetMicroSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t microseconds = nanoseconds/static_cast<uint64_t>(1000);
+
+  return static_cast<int64_t>(days_since_epoch * kMicrosecondsPerDay + microseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetMilliSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t milliseconds = nanoseconds/static_cast<uint64_t>(1000000);
+
+  return static_cast<int64_t>(days_since_epoch * kMillisecondsPerDay + milliseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t seconds = nanoseconds/(static_cast<uint64_t>(1000000000));
+
+  return static_cast<int64_t>(days_since_epoch * kSecondsPerDay + seconds);

Review comment:
       There is some amount of repetition in those four functions that would be nice to avoid, IMHO.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#issuecomment-856054729


   @wesm @emkornfield What do you think about the functionality that is added here? Is it a reasonable burden for us to take on?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10461: ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10461:
URL: https://github.com/apache/arrow/pull/10461


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] isichei commented on a change in pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
isichei commented on a change in pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#discussion_r646865384



##########
File path: cpp/src/parquet/types.h
##########
@@ -602,6 +602,49 @@ static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
   return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + nanoseconds);
 }
 
+// ARROW-12096
+static inline int64_t Int96GetMicroSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t microseconds = nanoseconds/static_cast<uint64_t>(1000);
+
+  return static_cast<int64_t>(days_since_epoch * kMicrosecondsPerDay + microseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetMilliSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t milliseconds = nanoseconds/static_cast<uint64_t>(1000000);
+
+  return static_cast<int64_t>(days_since_epoch * kMillisecondsPerDay + milliseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t seconds = nanoseconds/(static_cast<uint64_t>(1000000000));
+
+  return static_cast<int64_t>(days_since_epoch * kSecondsPerDay + seconds);

Review comment:
       Yeah I agree.
   
   I was concerned about changing the original function `Int96GetNanoSeconds` incase I introduced some unexpected change. Perhaps a halfway house is to replace the 3 (`us`, `ms` and `s`) INT96 functions to something like:
   
   ```cpp
   static inline int64_t Int96GetDownsampledTimestamp(const parquet::Int96& i96, const ::arrow::TimeUnit::type unit) {
     // We do the computations in the unsigned domain to avoid unsigned behaviour
     // on overflow.
     uint64_t days_since_epoch =
         i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
     uint64_t nanoseconds = 0;
   
     memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
     uint64_t seconds;
   
     switch (unit){
       case ::arrow::TimeUnit::SECOND:
         seconds = nanoseconds/static_cast<uint64_t>(1000000000);
       case ::arrow::TimeUnit::MILLI:
         seconds = nanoseconds/static_cast<uint64_t>(1000000);
       case ::arrow::TimeUnit::MICRO:
         seconds = nanoseconds/static_cast<uint64_t>(1000);
     }
     return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + seconds);
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] isichei commented on a change in pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
isichei commented on a change in pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#discussion_r646865384



##########
File path: cpp/src/parquet/types.h
##########
@@ -602,6 +602,49 @@ static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
   return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + nanoseconds);
 }
 
+// ARROW-12096
+static inline int64_t Int96GetMicroSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t microseconds = nanoseconds/static_cast<uint64_t>(1000);
+
+  return static_cast<int64_t>(days_since_epoch * kMicrosecondsPerDay + microseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetMilliSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t milliseconds = nanoseconds/static_cast<uint64_t>(1000000);
+
+  return static_cast<int64_t>(days_since_epoch * kMillisecondsPerDay + milliseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t seconds = nanoseconds/(static_cast<uint64_t>(1000000000));
+
+  return static_cast<int64_t>(days_since_epoch * kSecondsPerDay + seconds);

Review comment:
       Yeah I agree.
   
   I was concerned about changing the original function `Int96GetNanoSeconds` incase I introduced some unexpected change. Perhaps a halfway house is to replace the 3 (`us`, `ms` and `s`) INT96 functions to something like:
   
   ```cpp
   static inline int64_t Int96GetDownsampledTimestamp(const parquet::Int96& i96, const ::arrow::TimeUnit::type unit) {
     // We do the computations in the unsigned domain to avoid unsigned behaviour
     // on overflow.
     uint64_t days_since_epoch =
         i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
     uint64_t nanoseconds = 0;
   
     memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
     uint64_t seconds;
   
     switch (unit){
       case ::arrow::TimeUnit::SECOND:
         seconds = nanoseconds/static_cast<uint64_t>(1000000000);
       case ::arrow::TimeUnit::MILLI:
         seconds = nanoseconds/static_cast<uint64_t>(1000000);
       case ::arrow::TimeUnit::MICRO:
         seconds = nanoseconds/static_cast<uint64_t>(1000);
     }
     return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + seconds);
   }
   ```
   
   Then there is an if/else in the `TransferInt96` function where default `NANO` unit calls the unchanged `Int96GetNanoSeconds` otherwise it calls the downcast version of the function?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] isichei commented on a change in pull request #10461: ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
isichei commented on a change in pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#discussion_r650521753



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -1671,6 +1671,91 @@ TEST(TestArrowReadWrite, UseDeprecatedInt96) {
   ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result, *result));
 }
 
+// Test for added functionality in ARROW-12096
+TEST(TestArrowReadWrite, DownsampleDeprecatedInt96) {
+  using ::arrow::ArrayFromVector;
+  using ::arrow::field;
+  using ::arrow::schema;
+
+  std::vector<bool> is_valid = {true, true, true, true};
+
+  auto t_s = ::arrow::timestamp(TimeUnit::SECOND);
+  auto t_ms = ::arrow::timestamp(TimeUnit::MILLI);
+  auto t_us = ::arrow::timestamp(TimeUnit::MICRO);
+  auto t_ns = ::arrow::timestamp(TimeUnit::NANO);
+
+  // Values demonstrate loss of resolution when "down sampling" INT96 to units that are not NS
+  std::vector<int64_t> s_values = {1489269, 1489269, 1489269, 1489269};
+  std::vector<int64_t> ms_values = {1489269000, 1489269000,
+                                    1489269000, 1489269001};
+  std::vector<int64_t> us_values = {1489269000000, 1489269000000,
+                                    1489269000001, 1489269001000};
+  std::vector<int64_t> ns_values = {1489269000000000LL, 1489269000000001LL,
+                                    1489269000001000LL, 1489269001000000LL};
+
+  std::shared_ptr<Array> a_s, a_ms, a_us, a_ns;
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_s, is_valid, s_values, &a_s);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ms, is_valid, ms_values, &a_ms);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns);

Review comment:
       Have rewritten tests to use a helper function. Hopefully cleaner.

##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -1671,6 +1671,91 @@ TEST(TestArrowReadWrite, UseDeprecatedInt96) {
   ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result, *result));
 }
 
+// Test for added functionality in ARROW-12096
+TEST(TestArrowReadWrite, DownsampleDeprecatedInt96) {
+  using ::arrow::ArrayFromVector;
+  using ::arrow::field;
+  using ::arrow::schema;
+
+  std::vector<bool> is_valid = {true, true, true, true};
+
+  auto t_s = ::arrow::timestamp(TimeUnit::SECOND);
+  auto t_ms = ::arrow::timestamp(TimeUnit::MILLI);
+  auto t_us = ::arrow::timestamp(TimeUnit::MICRO);
+  auto t_ns = ::arrow::timestamp(TimeUnit::NANO);
+
+  // Values demonstrate loss of resolution when "down sampling" INT96 to units that are not NS
+  std::vector<int64_t> s_values = {1489269, 1489269, 1489269, 1489269};
+  std::vector<int64_t> ms_values = {1489269000, 1489269000,
+                                    1489269000, 1489269001};
+  std::vector<int64_t> us_values = {1489269000000, 1489269000000,
+                                    1489269000001, 1489269001000};
+  std::vector<int64_t> ns_values = {1489269000000000LL, 1489269000000001LL,
+                                    1489269000001000LL, 1489269001000000LL};
+
+  std::shared_ptr<Array> a_s, a_ms, a_us, a_ns;
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_s, is_valid, s_values, &a_s);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ms, is_valid, ms_values, &a_ms);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns);
+
+  // Create single input table of NS to be written to parquet with INT96
+  auto input_schema = schema({field("f", t_ns)});
+  auto input = Table::Make(input_schema, {a_ns});
+
+  // Create an expected schema for each resulting table (one for each "down sampled" ts)
+  auto ex_schema_s = schema({field("f", t_s)});
+  auto ex_schema_ms = schema({field("f", t_ms)});
+  auto ex_schema_us = schema({field("f", t_us)});
+  
+  // Create tables
+  auto ex_result_s = Table::Make(ex_schema_s, {a_s});
+  auto ex_result_ms = Table::Make(ex_schema_ms, {a_ms});
+  auto ex_result_us = Table::Make(ex_schema_us, {a_us});
+
+  std::shared_ptr<Table> result_s;
+  std::shared_ptr<Table> result_ms;
+  std::shared_ptr<Table> result_us;
+
+  ArrowReaderProperties arrow_reader_prop_s, arrow_reader_prop_ms, arrow_reader_prop_us;
+  arrow_reader_prop_s.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::SECOND);
+  arrow_reader_prop_ms.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::MILLI);
+  arrow_reader_prop_us.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::MICRO);
+
+// SECOND
+  ASSERT_NO_FATAL_FAILURE(DoRoundtrip(
+    input, input->num_rows(), &result_s, default_writer_properties(),
+    ArrowWriterProperties::Builder().enable_deprecated_int96_timestamps()->build(),
+    arrow_reader_prop_s));
+
+  ASSERT_NO_FATAL_FAILURE(::arrow::AssertSchemaEqual(*ex_result_s->schema(),
+                                                     *result_s->schema(),
+                                                     /*check_metadata=*/false));
+  ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result_s, *result_s));

Review comment:
       See previous comment thread.

##########
File path: cpp/src/parquet/arrow/reader_internal.cc
##########
@@ -353,7 +353,8 @@ Status TransferBool(RecordReader* reader, MemoryPool* pool, Datum* out) {
 }
 
 Status TransferInt96(RecordReader* reader, MemoryPool* pool,
-                     const std::shared_ptr<DataType>& type, Datum* out) {
+                     const std::shared_ptr<DataType>& type, Datum* out,
+                     const ::arrow::TimeUnit::type& int96_arrow_time_unit) {

Review comment:
       Addressed.

##########
File path: cpp/src/parquet/arrow/reader_internal.cc
##########
@@ -353,7 +353,8 @@ Status TransferBool(RecordReader* reader, MemoryPool* pool, Datum* out) {
 }
 
 Status TransferInt96(RecordReader* reader, MemoryPool* pool,
-                     const std::shared_ptr<DataType>& type, Datum* out) {
+                     const std::shared_ptr<DataType>& type, Datum* out,
+                     const ::arrow::TimeUnit::type& int96_arrow_time_unit) {

Review comment:
       Done.

##########
File path: cpp/src/parquet/arrow/reader_internal.cc
##########
@@ -742,20 +752,20 @@ Status TransferColumnData(RecordReader* reader, std::shared_ptr<DataType> value_
     case ::arrow::Type::TIMESTAMP: {
       const ::arrow::TimestampType& timestamp_type =
           checked_cast<::arrow::TimestampType&>(*value_type);
-      switch (timestamp_type.unit()) {
-        case ::arrow::TimeUnit::MILLI:
-        case ::arrow::TimeUnit::MICRO: {
-          result = TransferZeroCopy(reader, value_type);
-        } break;
-        case ::arrow::TimeUnit::NANO: {
-          if (descr->physical_type() == ::parquet::Type::INT96) {
-            RETURN_NOT_OK(TransferInt96(reader, pool, value_type, &result));
-          } else {
+      if (descr->physical_type() == ::parquet::Type::INT96) {
+            RETURN_NOT_OK(TransferInt96(reader, pool, value_type, &result, timestamp_type.unit()));
+        }
+      else {
+        switch (timestamp_type.unit()) {
+          case ::arrow::TimeUnit::SECOND:
+          case ::arrow::TimeUnit::MILLI:
+          case ::arrow::TimeUnit::MICRO:
+          case ::arrow::TimeUnit::NANO: {
             result = TransferZeroCopy(reader, value_type);
-          }
-        } break;
-        default:
-          return Status::NotImplemented("TimeUnit not supported");
+          } break;
+          default:

Review comment:
       Done.

##########
File path: cpp/src/parquet/arrow/schema_internal.cc
##########
@@ -181,7 +181,8 @@ Result<std::shared_ptr<ArrowType>> FromInt64(const LogicalType& logical_type) {
 
 Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
                                                 const LogicalType& logical_type,
-                                                int type_length) {
+                                                int type_length,
+                                                const ::arrow::TimeUnit::type& int96_arrow_time_unit) {

Review comment:
       Done.

##########
File path: cpp/src/parquet/arrow/schema_internal.cc
##########
@@ -211,14 +212,22 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
   }
 }
 
+// ARROW-12096 -- Overloading functions with new input (setting default as NANO)

Review comment:
       Done.

##########
File path: cpp/src/parquet/arrow/schema_internal.cc
##########
@@ -211,14 +212,22 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(Type::type physical_type,
   }
 }
 
+// ARROW-12096 -- Overloading functions with new input (setting default as NANO)
 Result<std::shared_ptr<ArrowType>> GetArrowType(const schema::PrimitiveNode& primitive) {
   return GetArrowType(primitive.physical_type(), *primitive.logical_type(),
-                      primitive.type_length());
+                      primitive.type_length(), ::arrow::TimeUnit::NANO);
 }
 
 Result<std::shared_ptr<ArrowType>> GetArrowType(const ColumnDescriptor& descriptor) {
   return GetArrowType(descriptor.physical_type(), *descriptor.logical_type(),
-                      descriptor.type_length());
+                      descriptor.type_length(), ::arrow::TimeUnit::NANO);
+}
+
+// ARROW-12096 -- Exposing INT96 arrow type definition fromm parquet reader

Review comment:
       Done.

##########
File path: cpp/src/parquet/arrow/schema_internal.h
##########
@@ -39,8 +39,20 @@ Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type
                                                         const LogicalType& logical_type,
                                                         int type_length);
 
+// ARROW-12096 Exposing int96 arrow timestamp unit definition
+Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type,
+                                                        const LogicalType& logical_type,
+                                                        int type_length,
+                                                        const ::arrow::TimeUnit::type& int96_arrow_time_unit);

Review comment:
       Went back to to review this and not sure how to address.
   
   Only thing I can imagine would be to add to `GetTypeForNode` (from `arrow/schema.cc`) and overwrite the standard `storage_type` if the parquet physical_type is INT96 and reader properties are not set to NANO? Let me know if I have misunderstood.

##########
File path: cpp/src/parquet/arrow/schema_internal.h
##########
@@ -39,8 +39,20 @@ Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type
                                                         const LogicalType& logical_type,
                                                         int type_length);
 
+// ARROW-12096 Exposing int96 arrow timestamp unit definition
+Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type,
+                                                        const LogicalType& logical_type,
+                                                        int type_length,
+                                                        const ::arrow::TimeUnit::type& int96_arrow_time_unit);
+
 Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
     const schema::PrimitiveNode& primitive);
+
+// ARROW-12096 Exposing int96 arrow timestamp unit definition
+Result<std::shared_ptr<::arrow::DataType>> GetArrowType(

Review comment:
       Done.

##########
File path: cpp/src/parquet/types.h
##########
@@ -602,6 +602,49 @@ static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
   return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + nanoseconds);
 }
 
+// ARROW-12096
+static inline int64_t Int96GetMicroSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t microseconds = nanoseconds/static_cast<uint64_t>(1000);
+
+  return static_cast<int64_t>(days_since_epoch * kMicrosecondsPerDay + microseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetMilliSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t milliseconds = nanoseconds/static_cast<uint64_t>(1000000);
+
+  return static_cast<int64_t>(days_since_epoch * kMillisecondsPerDay + milliseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t seconds = nanoseconds/(static_cast<uint64_t>(1000000000));
+
+  return static_cast<int64_t>(days_since_epoch * kSecondsPerDay + seconds);

Review comment:
       Went with your example in the end. As it made far more sense IMO.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#discussion_r646777158



##########
File path: cpp/src/parquet/types.h
##########
@@ -602,6 +602,49 @@ static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
   return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + nanoseconds);
 }
 
+// ARROW-12096
+static inline int64_t Int96GetMicroSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t microseconds = nanoseconds/static_cast<uint64_t>(1000);
+
+  return static_cast<int64_t>(days_since_epoch * kMicrosecondsPerDay + microseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetMilliSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour
+  // on overflow.
+  uint64_t days_since_epoch =
+      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
+  uint64_t nanoseconds = 0;
+  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
+
+  uint64_t milliseconds = nanoseconds/static_cast<uint64_t>(1000000);
+
+  return static_cast<int64_t>(days_since_epoch * kMillisecondsPerDay + milliseconds);
+}
+
+// ARROW-12096
+static inline int64_t Int96GetSeconds(const parquet::Int96& i96) {
+  // We do the computations in the unsigned domain to avoid unsigned behaviour

Review comment:
       ```suggestion
     // We do the computations in the unsigned domain to avoid undefined behaviour
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#issuecomment-855692947


   https://issues.apache.org/jira/browse/ARROW-12096


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10461: ARROW-12096: [C++]: Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#discussion_r646927762



##########
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##########
@@ -1671,6 +1671,91 @@ TEST(TestArrowReadWrite, UseDeprecatedInt96) {
   ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result, *result));
 }
 
+// Test for added functionality in ARROW-12096
+TEST(TestArrowReadWrite, DownsampleDeprecatedInt96) {
+  using ::arrow::ArrayFromVector;
+  using ::arrow::field;
+  using ::arrow::schema;
+
+  std::vector<bool> is_valid = {true, true, true, true};
+
+  auto t_s = ::arrow::timestamp(TimeUnit::SECOND);
+  auto t_ms = ::arrow::timestamp(TimeUnit::MILLI);
+  auto t_us = ::arrow::timestamp(TimeUnit::MICRO);
+  auto t_ns = ::arrow::timestamp(TimeUnit::NANO);
+
+  // Values demonstrate loss of resolution when "down sampling" INT96 to units that are not NS
+  std::vector<int64_t> s_values = {1489269, 1489269, 1489269, 1489269};
+  std::vector<int64_t> ms_values = {1489269000, 1489269000,
+                                    1489269000, 1489269001};
+  std::vector<int64_t> us_values = {1489269000000, 1489269000000,
+                                    1489269000001, 1489269001000};
+  std::vector<int64_t> ns_values = {1489269000000000LL, 1489269000000001LL,
+                                    1489269000001000LL, 1489269001000000LL};
+
+  std::shared_ptr<Array> a_s, a_ms, a_us, a_ns;
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_s, is_valid, s_values, &a_s);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ms, is_valid, ms_values, &a_ms);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us);
+  ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns);
+
+  // Create single input table of NS to be written to parquet with INT96
+  auto input_schema = schema({field("f", t_ns)});
+  auto input = Table::Make(input_schema, {a_ns});
+
+  // Create an expected schema for each resulting table (one for each "down sampled" ts)
+  auto ex_schema_s = schema({field("f", t_s)});
+  auto ex_schema_ms = schema({field("f", t_ms)});
+  auto ex_schema_us = schema({field("f", t_us)});
+  
+  // Create tables
+  auto ex_result_s = Table::Make(ex_schema_s, {a_s});
+  auto ex_result_ms = Table::Make(ex_schema_ms, {a_ms});
+  auto ex_result_us = Table::Make(ex_schema_us, {a_us});
+
+  std::shared_ptr<Table> result_s;
+  std::shared_ptr<Table> result_ms;
+  std::shared_ptr<Table> result_us;
+
+  ArrowReaderProperties arrow_reader_prop_s, arrow_reader_prop_ms, arrow_reader_prop_us;
+  arrow_reader_prop_s.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::SECOND);
+  arrow_reader_prop_ms.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::MILLI);
+  arrow_reader_prop_us.set_coerce_int96_timestamp_unit(::arrow::TimeUnit::MICRO);
+
+// SECOND
+  ASSERT_NO_FATAL_FAILURE(DoRoundtrip(
+    input, input->num_rows(), &result_s, default_writer_properties(),
+    ArrowWriterProperties::Builder().enable_deprecated_int96_timestamps()->build(),
+    arrow_reader_prop_s));
+
+  ASSERT_NO_FATAL_FAILURE(::arrow::AssertSchemaEqual(*ex_result_s->schema(),
+                                                     *result_s->schema(),
+                                                     /*check_metadata=*/false));
+  ASSERT_NO_FATAL_FAILURE(::arrow::AssertTablesEqual(*ex_result_s, *result_s));

Review comment:
       C++11 is quite a bit better than what was available before, if your experience was with C++98 :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10461: ARROW-12096: [C++] Allows users to define arrow timestamp unit for Parquet INT96 timestamp

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10461:
URL: https://github.com/apache/arrow/pull/10461#issuecomment-861645603


   Thank you @isichei !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org