You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/28 19:31:40 UTC

[GitHub] [arrow] rok opened a new pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

rok opened a new pull request #10610:
URL: https://github.com/apache/arrow/pull/10610


   This is to resolve [ARROW-13033](https://issues.apache.org/jira/browse/ARROW-13033).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-901970388


   I was following the new terminology at 
   
   https://github.com/apache/arrow/blob/895f85fd32782a4447db74e4413e54a4f7128809/format/Schema.fbs#L221-L247


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-878416643






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703450693



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       I'll see if we can make it more like expected.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872784049


   What's the difference between both timezones? In my mind this is the same time zone? If you want to convert your localized timezone-aware timestamp to another timezone after localizing, you can use another kernel to do that? (a kernel we don't have yet, but one we should also add, although this is a trivial, metadata-only change)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-901932831


   As for naming the function, perhaps "localize_timestamp"? Most important is to have a good explanatory docstring.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-901965504


   "localize_naive_timestamp" perhaps?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-903757870


   Next name suggestion: `to_timestamp_with_time_zone`. 
   It doesn't use any specific terminology like "local" or "zoned", but only refers to the actual parameter of the timestamp type: whether it has the time zone set or not. 
   
   > I don't read that as new terminology, rather pointers to terminology used in some other pieces of software.
   (also I find that explanation actually a bit intimidating and confusing)
   
   The edits in https://github.com/apache/arrow/pull/10629 where meant to _clarify_ the text after the discussion on the mailing list (and not make it even more confusing :)), so if that's not the case, I think your feedback would be welcome.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703554466



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Well, that depends, do we want to copy Pandas here?
   Look at the results in the Pandas example: "2015-03-29 01:59:59.999999999+01:00" and "2015-03-29 03:00:00+02:00" are separated by a single _nanosecond_. Is this behaviour useful? @jorisvandenbossche 
   
   Also, note that `shift_forward` already corresponds to the original `date` library's behaviour. So we don't need an additional option value, and we can also keep the names "earliest" and "latest" :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703584418



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       > Bottom line, I think we can simply implement the Pandas semantics while keeping the name "earliest" and "latest".
   
   Sounds good, will do.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r702987433



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -571,6 +592,73 @@ struct Strftime {
 };
 #endif
 
+// ----------------------------------------------------------------------
+// Convert timestamps from local timestamp without a timezone to timestamps with a
+// timezone, interpreting the local timestamp as being in the specified timezone
+
+template <typename Duration>
+struct AssumeTimezone {
+  explicit AssumeTimezone(const AssumeTimezoneOptions* options, const time_zone* tz)
+      : options(*options), tz_(tz) {}
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const time_zone* tz) const {
+    return static_cast<T>(zoned_time<Duration>(tz, local_time<Duration>(Duration{arg}))
+                              .get_sys_time()
+                              .time_since_epoch()
+                              .count());
+  }
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const arrow_vendored::date::choose choose,
+                   const time_zone* tz) const {
+    return static_cast<T>(
+        zoned_time<Duration>(tz, local_time<Duration>(Duration{arg}), choose)
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T Call(KernelContext*, Arg0 arg, Status* st) const {
+    try {
+      return get_local_time<T, Arg0>(arg, tz_);
+    } catch (const arrow_vendored::date::nonexistent_local_time& e) {
+      switch (options.nonexistent) {
+        case AssumeTimezoneOptions::Nonexistent::NONEXISTENT_RAISE: {
+          *st = Status::Invalid("Nonexistent: ", e.what());
+          return arg;
+        }
+        case AssumeTimezoneOptions::Nonexistent::NONEXISTENT_EARLIEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest,
+                                         tz_);
+        }
+        case AssumeTimezoneOptions::Nonexistent::NONEXISTENT_LATEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest, tz_);
+        }
+      }
+    } catch (const arrow_vendored::date::ambiguous_local_time& e) {
+      switch (options.ambiguous) {
+        case AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE: {
+          *st = Status::Invalid("Ambiguous: ", e.what());
+          return arg;
+        }
+        case AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest,
+                                         tz_);
+        }
+        case AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest, tz_);
+        }
+      }
+    }
+    *st = Status::UnknownError("Unknown error.");

Review comment:
       I think it was just to satisfy the compiler.

##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -278,6 +279,31 @@ struct ARROW_EXPORT DayOfWeekOptions : public FunctionOptions {
   uint32_t week_start;
 };
 
+/// Used to control timestamp timezone conversion and handling ambiguous/nonexistent
+/// times.
+struct ARROW_EXPORT AssumeTimezoneOptions : public FunctionOptions {
+ public:
+  /// How to interpret ambiguous local times that can be interpreted as
+  /// multiple instants due to DST shifts.
+  enum Ambiguous { AMBIGUOUS_RAISE, AMBIGUOUS_EARLIEST, AMBIGUOUS_LATEST };
+
+  /// How to handle local times that do not exists due to DST shifts.
+  enum Nonexistent { NONEXISTENT_RAISE, NONEXISTENT_EARLIEST, NONEXISTENT_LATEST };
+
+  explicit AssumeTimezoneOptions(std::string timezone,
+                                 Ambiguous ambiguous = AMBIGUOUS_RAISE,
+                                 Nonexistent nonexistent = NONEXISTENT_RAISE);
+  AssumeTimezoneOptions();
+  constexpr static char const kTypeName[] = "AssumeTimezoneOptions";
+
+  /// Timezone and timezone name to convert timestamp from
+  std::string timezone;
+  const arrow_vendored::date::time_zone* tz;

Review comment:
       Oh. Probably a bad rebase. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872782386


   > @rok can you explain the rationale of adding a `source_timezone` and `destination_timezone` to the API? What's the reason for doing it like this, instead of having a single target timezone as argument?
   
   In case we start with "naive" a timestamp we don't know the source and target timezone so both need to be specified. Or am I missing something here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703317416



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       I guess that the `date` library chooses the existing (UTC) instant that logically encloses the non-existent instant "2015-03-29 02:30:00 Europe/Warsaw" (because that UTC instant corresponds to both "2015-03-29 02:00:00 Europe/Warsaw" and "2015-03-29 03:00:00 Europe/Warsaw"). But having two option values that produce the same results doesn't seem useful at all.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703576179



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       > Look at the results in the Pandas example: "2015-03-29 01:59:59.999999999+01:00" and "2015-03-29 03:00:00+02:00" are separated by a single _nanosecond_. Is this behaviour useful? @jorisvandenbossche
   
   I'm sure there are is someone out there who care about nanoseconds :)).
   
   My thought here is - the input timestamp is half past full hour locally. The local hour doesn't exist, so the half past is probably a measurement mistake and what's was measured is half past an hour later or earlier. The third option (infer) is to ignore the half past and go to the moment the DST shift occurred.
   
   ```
   earliest -> 2015-03-29 02:30:00 (Warsaw) -> 2015-03-29 00:30:00 (UTC)
   infer -> 2015-03-29 02:30:00 (Warsaw) -> 2015-03-29 01:00:00 (UTC)
   latest -> 2015-03-29 01:30:00 (Warsaw) -> 2015-03-29 01:30:00 (UTC)
   ```
   
   Visual aid for the discussion (not exactly the same zones):
   ![](https://files.gitter.im/HowardHinnant/date/fM7T/EuropeMoscow.jpeg)
   
   We can also just park this into [ARROW-13347](https://issues.apache.org/jira/browse/ARROW-13347).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703317416



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       I guess that the `date` library chooses the existing (UTC) instant that encloses the invalid instant "2015-03-29 02:30:00 Europe/Paris`. But having two option values that produce the same results doesn't seem useful at all.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-913875878


   CI fail seems unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-880645806


   > > > > and how to return NaN when the timestamp is an int64.
   > > > 
   > > > I think it should rather be a null instead of NaN (I suppose this is for the IGNORE case?), and nulls can be encoded in int64.
   > > 
   > > Yes I was thinking about the IGNORE case. This should now return nulls but zeros are returned. Do we need to switch away from `ScalarUnaryNotNullStateful` for this?
   > 
   > If it makes the implementation more complicated (as it would no longer simply propagate nulls), I think it is fine to also leave this option for a future enhancement
   
   I suppose it would. Shall we leave this PR as is then?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-903838972


   @jorisvandenbossche ok, this is ready for another review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-905851379


   What would be a good way to reach consensus here? A doodle poll to ML to see what people find most intuitive?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r694110745



##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -1029,5 +1055,19 @@ ARROW_EXPORT Result<Datum> Subsecond(const Datum& values, ExecContext* ctx = NUL
 ARROW_EXPORT Result<Datum> Strftime(const Datum& values, StrftimeOptions options,
                                     ExecContext* ctx = NULLPTR);
 
+/// \brief Converts timestamps from one timezone to another for each element of `values`

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-901966140


   The problem is we don't have any clear terminology here. "zoned" doesn't mean anything to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r669163511



##########
File path: r/configure.win
##########
@@ -44,7 +44,7 @@ else
   RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)"
 fi
 OPENSSL_LIBS="-lcrypto -lcrypt32"
-MIMALLOC_LIBS="-lbcrypt -lpsapi"
+MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32"

Review comment:
       No, I didn't see your earlier comment so I didn't understand why you were adding this.  I get it now, sorry.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872636186


   Thanks for the review @jorisvandenbossche :).
   I've gone through your feedback and done some of the boilerplate except for the correct output timestamp type. I'll try to do it over the weekend.
   
   > (also, given all the confusion also on the JIRA, we should probably try to come up with a more explicit/descriptive name ...)
   
   I think there is no consensus on this yet. `tz_convert`, `tz_localize` were mentioned. I'd add `tz_normalize`. Any other?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r670697102



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -321,6 +341,77 @@ struct Nanosecond {
   }
 };
 
+// ----------------------------------------------------------------------
+// Convert timestamps from arbitrary timezone to UTC
+
+template <typename Duration>
+struct TzLocalize {
+  explicit TzLocalize(const TemporalLocalizationOptions& options)
+      : options(std::move(options)) {}
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg) const {
+    return static_cast<T>(
+        arrow_vendored::date::zoned_time<Duration>(
+            options.tz, arrow_vendored::date::local_time<Duration>(Duration{arg}))
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const arrow_vendored::date::choose choose) const {
+    return static_cast<T>(
+        arrow_vendored::date::zoned_time<Duration>(
+            options.tz, arrow_vendored::date::local_time<Duration>(Duration{arg}), choose)
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T Call(KernelContext*, Arg0 arg, Status* st) const {
+    try {
+      return get_local_time<T, Arg0>(arg);
+    } catch (const arrow_vendored::date::nonexistent_local_time& e) {
+      switch (options.nonexistent) {
+        case TemporalLocalizationOptions::Nonexistent::RAISE_NONEXISTENT: {
+          *st = Status::Invalid("Nonexistent: ", e.what());
+          return arg;
+        }
+        case TemporalLocalizationOptions::Nonexistent::SHIFT_BACKWARD: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest);
+        }
+        case TemporalLocalizationOptions::Nonexistent::SHIFT_FORWARD: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest);
+        }
+        case TemporalLocalizationOptions::Nonexistent::IGNORE_NONEXISTENT: {
+          return static_cast<T>(NULL);
+        }
+      }
+    } catch (const arrow_vendored::date::ambiguous_local_time& e) {
+      switch (options.ambiguous) {
+        case TemporalLocalizationOptions::Ambiguous::RAISE_AMBIGUOUS: {
+          *st = Status::Invalid("Ambiguous: ", e.what());
+          return arg;
+        }
+        case TemporalLocalizationOptions::Ambiguous::FIRST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest);
+        }
+        case TemporalLocalizationOptions::Ambiguous::LAST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest);

Review comment:
       Yeah, that makes sense. I've made names consistent with `date.h`. See the latest commit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-873001971


   > Since the Arrow spec is going to refer to "LocalDateTime", "Instant" and "ZonedDateTime", how about reusing terms from the spec? Something along the lines of `local_datetime_to_zoned_datetime()`?
   
   It's nice that it's idiomatic if a bit verbose :)
   
   Let's add it to the list. If we don't get to consensus here we can always call a vote on the ML.
   * force_tz
   * tz_localize
   * tz_normalize
   * local_datetime_to_zoned_datetime


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703554466



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Well, that depends, do we want to copy Pandas here?
   Look at the results in the Pandas example: "2015-03-29 01:59:59.999999999+01:00" and "2015-03-29 03:00:00+02:00" are separated by a single _nanosecond_. Is this behaviour useful? @jorisvandenbossche 
   
   Also, note that `shift_forward` already corresponds to the original `date` library's behaviour. So we don't need an additional option value :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10610:
URL: https://github.com/apache/arrow/pull/10610


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703497017



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Now the behavior is more like shift forward and shift backward.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872826233


   > Having another kernel to only change metadata seems redundant.
   
   I don't think that's redundant, as a user will regularly want to change the timezone of a timestamp column that is already localized (already has a timezone).
   
   I am of course coming from the point of view of pandas, where those two operations are defined as two distinct methods: tz_localize to convert from "timestamp without timezone" to "timestamp with timezone" (i.e. from naive clock time to tz-aware time) and tz_convert to convert between "timestamp with timezone" with different timezones (a metadata-only change). 
   From a user point of view (and also implementation-wise), those two are distinct concepts, IMO, and so I would keep them as separate kernels. (also eg R's lubridate has distinct `force_tz` and `with_tz`, respectively)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-911780013


   @pitrou I've changed this to `assume_timezone` (`AssumeTimezone`, `AssumeTimezoneOptions`).
   Discussion on Zulip seems to be in favor of `assume_timestamp`.
   
   Would still be good to hear @jorisvandenbossche @adamhooper @westonpace 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703554466



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Well, that depends, do we want to copy Pandas here?
   Look at the results in the Pandas example: "2015-03-29 01:59:59.999999999+01:00" and "2015-03-29 03:00:00+02:00" are separated by a single _nanosecond_. Is this behaviour useful? @jorisvandenbossche 
   
   Also, note that `shift_forward` already corresponds to the original `date` library's behaviour. So we don't need an additional option values :-)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-874903409


   > @rok I was checking out this PR, and in the process wrote python bindings for the option class to be able to test it in Python, so thought to directly push that here as well. (and at the same time also fixed (I think) the initialization of the timezone, which was causing the C++ tests to fail here as well)
   
   Thanks @jorisvandenbossche! The new option initialization was indeed problematic.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-901843106


   We still have the naming issue (or bikeshedding) to resolve. As a reference, some names used by other systems that I am aware of:
   * Pandas: [tz_localize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.tz_localize.html)
   * R's lubridate: [force_tz](https://lubridate.tidyverse.org/reference/force_tz.html)
   * Java: [atZone](https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html#atZone-java.time.ZoneId-)
   * Java Joda: [toDateTime](https://www.joda.org/joda-time/apidocs/org/joda/time/LocalDateTime.html#toDateTime-org.joda.time.DateTimeZone-)
   * Postgresql: [AT TIME ZONE](https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-ZONECONVERT)
   
   The proposals that have been floated around above:
   
   * `tz_localize`
   * `force_tz`
   * `local_datetime_to_zoned_datetime`
   
   I left out `tz_normalize` (like Weston mentioned, "normalize" already has too many meanings, it also doesn't give any indication to me whether it's adding or removing a timezone ..). 
   I personally like Adam's suggestion of being more explicit (although resulting in a longer name) and re-using the terminology that was now added to the format spec to clarify the meaning. I would maybe only use "timestamp" instead of "datetime", since the actual types in arrow that this kernel accepts are called that way. 
   With that, I would do a proposal of calling it `timestamp_local_to_zoned` ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-869967788


   https://issues.apache.org/jira/browse/ARROW-13033


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-880920360


   Thanks for the review @jorisvandenbossche! :)
   
   > > Shall we leave this PR as is then?
   > 
   > I would then remove this option (if we keep it in the PR, it should do the correct thing IMO, while not it is still returning 0 (i.e. 1970-01-01) instead I think?)
   
   Removed the `IGNORE` option and made [ARROW-13347](https://issues.apache.org/jira/browse/ARROW-13347) to discuss/remember.
   
   > Can you also handle the case of an unknown / wrong timezone?
   
   Added but not at `TemporalLocalizationOptions` initialization but in [the TzLocalize exec](https://github.com/apache/arrow/pull/10610/files#diff-53850537a2f3f7977473b30241fdd3ddb35aa97269284c06471d9b0c599e2c59R86) since state can be returned from there.
   
   I think this is ready for another review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r694108545



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -840,6 +967,13 @@ const FunctionDoc strftime_doc{
     {"timestamps"},
     "StrftimeOptions"};
 
+const FunctionDoc tz_localize_doc{
+    "TzLocalize converts timestamps to UTC keeping the original timezone information",
+    ("TzLocalize converts timestamps from arbitrary timezone to UTC.\n"

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r660346805



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -177,5 +177,17 @@ TEST(ScalarTemporalTest, TestZonedTemporalComponentExtraction) {
     ASSERT_RAISES(NotImplemented, Subsecond(timestamps));
   }
 }
+
+TEST(ScalarTemporalTest, TestLocalizeTimezoneNaiveTimestamp) {
+  std::string timezone = "Asia/Kolkata";
+  const char* times_naive = R"(["1970-01-01T00:00:59", null])";
+  const char* times_utc = R"(["1970-01-01T00:00:59", null])";
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit_naive = timestamp(u);
+    auto unit_utc = timestamp(u, timezone);
+    CheckScalarUnary("localize", unit_naive, times_naive, unit_utc, times_utc);

Review comment:
       Can you add a test with some other timezones as well?

##########
File path: docs/source/cpp/compute.rst
##########
@@ -906,48 +906,51 @@ Temporal component extraction
 These functions extract datetime components (year, month, day, etc) from timestamp type.
 Note: this is currently not supported for timestamps with timezone information.
 
-+--------------------+------------+-------------------+---------------+--------+
-| Function name      | Arity      | Input types       | Output type   | Notes  |
-+====================+============+===================+===============+========+
-| year               | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| month              | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| day                | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| day_of_week        | Unary      | Temporal          | Int64         | \(1)   |
-+--------------------+------------+-------------------+---------------+--------+
-| day_of_year        | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| iso_year           | Unary      | Temporal          | Int64         | \(2)   |
-+--------------------+------------+-------------------+---------------+--------+
-| iso_week           | Unary      | Temporal          | Int64         | \(2)   |
-+--------------------+------------+-------------------+---------------+--------+
-| iso_calendar       | Unary      | Temporal          | Struct        | \(3)   |
-+--------------------+------------+-------------------+---------------+--------+
-| quarter            | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| hour               | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| minute             | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| second             | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| millisecond        | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| microsecond        | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| nanosecond         | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| subsecond          | Unary      | Temporal          | Double        |        |
-+--------------------+------------+-------------------+---------------+--------+
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| Function name      | Arity      | Input types       | Output type   | Notes  | Options class                                |
++====================+============+===================+===============+========+==============================================+
+| year               | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| month              | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| day                | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| day_of_week        | Unary      | Temporal          | Int64         | \(1)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| day_of_year        | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| iso_year           | Unary      | Temporal          | Int64         | \(2)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| iso_week           | Unary      | Temporal          | Int64         | \(2)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| iso_calendar       | Unary      | Temporal          | Struct        | \(3)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| quarter            | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| hour               | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| minute             | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| second             | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| millisecond        | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| microsecond        | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| nanosecond         | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| subsecond          | Unary      | Temporal          | Double        |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| localize           | Unary      | Temporal          | Temporal      | \(4)   | :struct:`TemporalLocalizationOptions`        |

Review comment:
       Localize is not a "component extraction", so I would put this in a separate table. Maybe under a new subsection below this one as "Timezone handling" ?

##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -735,5 +749,18 @@ Result<Datum> Nanosecond(const Datum& values, ExecContext* ctx = NULLPTR);
 /// \note API not yet finalized
 ARROW_EXPORT Result<Datum> Subsecond(const Datum& values, ExecContext* ctx = NULLPTR);
 
+/// \brief Localizes timezone naive timestamps to UTC for each element of `values`

Review comment:
       This is not only to localize to UTC, but rather to any timezone (that the underlying values are converted from that timezone to UTC is more an implementation detail to the user)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r692034160



##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -1029,5 +1055,19 @@ ARROW_EXPORT Result<Datum> Subsecond(const Datum& values, ExecContext* ctx = NUL
 ARROW_EXPORT Result<Datum> Strftime(const Datum& values, StrftimeOptions options,
                                     ExecContext* ctx = NULLPTR);
 
+/// \brief Converts timestamps from one timezone to another for each element of `values`

Review comment:
       Not from one timezone to another, but from local timestamp without timezone to timestamp with timezone, interpreting the local timestamp as being in the specified timezone.

##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -570,6 +586,71 @@ struct Strftime {
 };
 #endif
 
+// ----------------------------------------------------------------------
+// Convert timestamps from arbitrary timezone to UTC
+
+template <typename Duration, typename Localizer>
+struct TzLocalize {
+  explicit TzLocalize(const TemporalLocalizationOptions* options, Localizer&& localizer)
+      : options(*options) {}
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg) const {
+    return static_cast<T>(
+        arrow_vendored::date::zoned_time<Duration>(
+            options.tz, arrow_vendored::date::local_time<Duration>(Duration{arg}))
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const arrow_vendored::date::choose choose) const {
+    return static_cast<T>(
+        arrow_vendored::date::zoned_time<Duration>(
+            options.tz, arrow_vendored::date::local_time<Duration>(Duration{arg}), choose)
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T Call(KernelContext*, Arg0 arg, Status* st) const {
+    try {
+      return get_local_time<T, Arg0>(arg);
+    } catch (const arrow_vendored::date::nonexistent_local_time& e) {
+      switch (options.nonexistent) {
+        case TemporalLocalizationOptions::Nonexistent::NONEXISTENT_RAISE: {
+          *st = Status::Invalid("Nonexistent: ", e.what());
+          return arg;
+        }
+        case TemporalLocalizationOptions::Nonexistent::NONEXISTENT_EARLIEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest);
+        }
+        case TemporalLocalizationOptions::Nonexistent::NONEXISTENT_LATEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest);
+        }
+      }
+    } catch (const arrow_vendored::date::ambiguous_local_time& e) {
+      switch (options.ambiguous) {
+        case TemporalLocalizationOptions::Ambiguous::AMBIGUOUS_RAISE: {
+          *st = Status::Invalid("Ambiguous: ", e.what());

Review comment:
       ```suggestion
             *st = Status::Invalid("Ambiguous timestamp: ", e.what());
   ```

##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -570,6 +586,71 @@ struct Strftime {
 };
 #endif
 
+// ----------------------------------------------------------------------
+// Convert timestamps from arbitrary timezone to UTC

Review comment:
       This comment is also a bit confusing (it's technically correct, since this is the conversion that is happening under the hood, but it's not how the kernel is documented)

##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -840,6 +967,13 @@ const FunctionDoc strftime_doc{
     {"timestamps"},
     "StrftimeOptions"};
 
+const FunctionDoc tz_localize_doc{
+    "TzLocalize converts timestamps to UTC keeping the original timezone information",
+    ("TzLocalize converts timestamps from arbitrary timezone to UTC.\n"

Review comment:
       Can you update this description?

##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -570,6 +586,71 @@ struct Strftime {
 };
 #endif
 
+// ----------------------------------------------------------------------
+// Convert timestamps from arbitrary timezone to UTC
+
+template <typename Duration, typename Localizer>
+struct TzLocalize {
+  explicit TzLocalize(const TemporalLocalizationOptions* options, Localizer&& localizer)
+      : options(*options) {}
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg) const {
+    return static_cast<T>(
+        arrow_vendored::date::zoned_time<Duration>(
+            options.tz, arrow_vendored::date::local_time<Duration>(Duration{arg}))
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const arrow_vendored::date::choose choose) const {
+    return static_cast<T>(
+        arrow_vendored::date::zoned_time<Duration>(
+            options.tz, arrow_vendored::date::local_time<Duration>(Duration{arg}), choose)
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T Call(KernelContext*, Arg0 arg, Status* st) const {
+    try {
+      return get_local_time<T, Arg0>(arg);
+    } catch (const arrow_vendored::date::nonexistent_local_time& e) {
+      switch (options.nonexistent) {
+        case TemporalLocalizationOptions::Nonexistent::NONEXISTENT_RAISE: {
+          *st = Status::Invalid("Nonexistent: ", e.what());

Review comment:
       ```suggestion
             *st = Status::Invalid("Nonexistent timestamp: ", e.what());
   ```

##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -161,6 +162,22 @@ struct TemporalComponentExtract
   }
 };
 
+template <template <typename...> class Op, typename Duration, typename OutType>
+struct TemporalLocalization
+    : public TemporalComponentExtractBase<Op, Duration, OutType> {
+  using Base = TemporalComponentExtractBase<Op, Duration, OutType>;
+
+  static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    const TemporalLocalizationOptions& options = TemporalLocalizationState::Get(ctx);
+    const auto& timezone = GetInputTimezone(batch.values[0]);
+    if (!timezone.empty() && options.timezone != timezone) {
+      return Status::Invalid("Input and output timezones do not match: ", timezone,

Review comment:
       Does this need to work with timezoned input? It could also always raise an error (so without checking if the timezones are the same)
   
   Also, if we keep it, I would expect it to do nothing? While currently it does actually do some conversion:
   
   ```
   In [21]: ts = pd.to_datetime(["2018-03-10 09:00"]).tz_localize("US/Eastern")
   
   In [22]: tsa = pa.array(ts)
   
   In [23]: tsa.to_pandas()
   Out[23]: 
   0   2018-03-10 09:00:00-05:00
   dtype: datetime64[ns, US/Eastern]
   
   In [24]: pc.tz_localize(tsa, timezone="US/Eastern").to_pandas()
   Out[24]: 
   0   2018-03-10 14:00:00-05:00
   dtype: datetime64[ns, US/Eastern]
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r702989021



##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -278,6 +279,31 @@ struct ARROW_EXPORT DayOfWeekOptions : public FunctionOptions {
   uint32_t week_start;
 };
 
+/// Used to control timestamp timezone conversion and handling ambiguous/nonexistent
+/// times.
+struct ARROW_EXPORT AssumeTimezoneOptions : public FunctionOptions {
+ public:
+  /// How to interpret ambiguous local times that can be interpreted as
+  /// multiple instants due to DST shifts.
+  enum Ambiguous { AMBIGUOUS_RAISE, AMBIGUOUS_EARLIEST, AMBIGUOUS_LATEST };
+
+  /// How to handle local times that do not exists due to DST shifts.
+  enum Nonexistent { NONEXISTENT_RAISE, NONEXISTENT_EARLIEST, NONEXISTENT_LATEST };
+
+  explicit AssumeTimezoneOptions(std::string timezone,
+                                 Ambiguous ambiguous = AMBIGUOUS_RAISE,
+                                 Nonexistent nonexistent = NONEXISTENT_RAISE);
+  AssumeTimezoneOptions();
+  constexpr static char const kTypeName[] = "AssumeTimezoneOptions";
+
+  /// Timezone and timezone name to convert timestamp from
+  std::string timezone;
+  const arrow_vendored::date::time_zone* tz;

Review comment:
       Removed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] adamhooper commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
adamhooper commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-911990435


   @rok Thanks for asking! I prefer `assume_timezone`, too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872932527


   > > Having another kernel to only change metadata seems redundant.
   > 
   > I don't think that's redundant, as a user will regularly want to change the timezone of a timestamp column that is already localized (already has a timezone).
   > 
   > I am of course coming from the point of view of pandas, where those two operations are defined as two distinct methods: tz_localize to convert from "timestamp without timezone" to "timestamp with timezone" (i.e. from naive clock time to tz-aware time) and tz_convert to convert between "timestamp with timezone" with different timezones (a metadata-only change).
   > From a user point of view (and also implementation-wise), those two are distinct concepts, IMO, and so I would keep them as separate kernels. (also eg R's lubridate has distinct `force_tz` and `with_tz`, respectively)
   
   I suppose these are two different operations indeed. I've changed to make this only "localization" now.
   
   I like `force_tz` as it indicates the operation is a bit "dangerous". Candidates are then:
   * force_tz
   * tz_localize
   * tz_normalize


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-901974066


   I don't read that as new terminology, rather pointers to terminology used in some other pieces of software.
   (also I find that explanation actually a bit intimidating and confusing)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703584105



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       > This corresponds neither to the Pandas behaviour nor to the `date` library behaviour. Is there any system that offers such an option? Saying "the half past is probably a measurement mistake and what was measured is half past an hour later or earlier" is quite a wild guess...
   
   Well if the user wants to do it it's not guessing :)
   Pandas is close with the `nonexistent=pd.Timedelta('1H')` but yeah, it is probably to niche to have here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872813488


   > If you want to convert your localized timezone-aware timestamp to another timezone after localizing, you can use another kernel to do that? (a kernel we don't have yet, but one we should also add, although this is a trivial, metadata-only change)
   
   I was thinking only of the metadata change not "materialization" of a target timezone. Having another kernel to only change metadata seems redundant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703502578



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Uh, sorry, that's what not I meant :-) I think the behaviour was ok, it just that there shouldn't be two different option values that produce exactly the same result.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-906803011


   > Well, I don't think anything is intuitive here, because the semantics of timestamps in Arrow are slightly weird and surprising :-)
   
   You mean timestamps and timezones in general are weird (because the underlying problem is counter intuitive)? If not we should really make this as simple as the underlying complexity is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703578096



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       > My thought here is - the input timestamp is half past full hour locally. The local hour doesn't exist, so the half past is probably a measurement mistake and what was measured is half past an hour later or earlier. 
   
   This corresponds neither to the Pandas behaviour nor to the `date` library behaviour. Is there any system that offers such an option? Saying "the half past is probably a measurement mistake and what was measured is half past an hour later or earlier" is quite a wild guess...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703548514



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Well maybe we can keep the original behavior as `infer` and have the two new ones as `shift_forward` and `shift_forward`. It needs a new name for sure.
   I'm borrowing from [pandas tz_localize](https://pandas.pydata.org/docs/reference/api/pandas.Series.tz_localize.html).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-916173722


   Thanks @jorisvandenbossche @pitrou @adamhooper @westonpace !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r669166367



##########
File path: r/configure.win
##########
@@ -44,7 +44,7 @@ else
   RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)"
 fi
 OPENSSL_LIBS="-lcrypto -lcrypt32"
-MIMALLOC_LIBS="-lbcrypt -lpsapi"
+MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32"

Review comment:
       No worries, thanks for responding :)
   The TZ functionality of the vendored library (date.h) needs to download a timezone DB on windows to work. There's [a jira to set that up](https://issues.apache.org/jira/browse/ARROW-13168) so this will eventually need to compile on windows.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-901964122


   Although it's named like this in pandas, I am a bit hesitant to use "localize". I seem to recall that eg @bkietz interpreted that as the reverse operation (from zoned to naive/local), and in general it seemed to cause confusion in the discussion around this. 
   
   And while typing that, I realize that it would indeed be nice to have a kernel for the reverse as well (zoned to local), so the question is how to name that if we call this "localize_timestamp" (while for "timestamp(_local)_to_zoned" there is a obvious "timestamp(_zoned)_to_local" equivalent naming option)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703581274



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Bottom line, I think we can simply implement the Pandas semantics while keeping the name "earliest" and "latest".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r660346805



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -177,5 +177,17 @@ TEST(ScalarTemporalTest, TestZonedTemporalComponentExtraction) {
     ASSERT_RAISES(NotImplemented, Subsecond(timestamps));
   }
 }
+
+TEST(ScalarTemporalTest, TestLocalizeTimezoneNaiveTimestamp) {
+  std::string timezone = "Asia/Kolkata";
+  const char* times_naive = R"(["1970-01-01T00:00:59", null])";
+  const char* times_utc = R"(["1970-01-01T00:00:59", null])";
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit_naive = timestamp(u);
+    auto unit_utc = timestamp(u, timezone);
+    CheckScalarUnary("localize", unit_naive, times_naive, unit_utc, times_utc);

Review comment:
       Can you add a test with some other timezones as well?

##########
File path: docs/source/cpp/compute.rst
##########
@@ -906,48 +906,51 @@ Temporal component extraction
 These functions extract datetime components (year, month, day, etc) from timestamp type.
 Note: this is currently not supported for timestamps with timezone information.
 
-+--------------------+------------+-------------------+---------------+--------+
-| Function name      | Arity      | Input types       | Output type   | Notes  |
-+====================+============+===================+===============+========+
-| year               | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| month              | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| day                | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| day_of_week        | Unary      | Temporal          | Int64         | \(1)   |
-+--------------------+------------+-------------------+---------------+--------+
-| day_of_year        | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| iso_year           | Unary      | Temporal          | Int64         | \(2)   |
-+--------------------+------------+-------------------+---------------+--------+
-| iso_week           | Unary      | Temporal          | Int64         | \(2)   |
-+--------------------+------------+-------------------+---------------+--------+
-| iso_calendar       | Unary      | Temporal          | Struct        | \(3)   |
-+--------------------+------------+-------------------+---------------+--------+
-| quarter            | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| hour               | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| minute             | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| second             | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| millisecond        | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| microsecond        | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| nanosecond         | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| subsecond          | Unary      | Temporal          | Double        |        |
-+--------------------+------------+-------------------+---------------+--------+
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| Function name      | Arity      | Input types       | Output type   | Notes  | Options class                                |
++====================+============+===================+===============+========+==============================================+
+| year               | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| month              | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| day                | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| day_of_week        | Unary      | Temporal          | Int64         | \(1)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| day_of_year        | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| iso_year           | Unary      | Temporal          | Int64         | \(2)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| iso_week           | Unary      | Temporal          | Int64         | \(2)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| iso_calendar       | Unary      | Temporal          | Struct        | \(3)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| quarter            | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| hour               | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| minute             | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| second             | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| millisecond        | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| microsecond        | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| nanosecond         | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| subsecond          | Unary      | Temporal          | Double        |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| localize           | Unary      | Temporal          | Temporal      | \(4)   | :struct:`TemporalLocalizationOptions`        |

Review comment:
       Localize is not a "component extraction", so I would put this in a separate table. Maybe under a new subsection below this one as "Timezone handling" ?

##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -735,5 +749,18 @@ Result<Datum> Nanosecond(const Datum& values, ExecContext* ctx = NULLPTR);
 /// \note API not yet finalized
 ARROW_EXPORT Result<Datum> Subsecond(const Datum& values, ExecContext* ctx = NULLPTR);
 
+/// \brief Localizes timezone naive timestamps to UTC for each element of `values`

Review comment:
       This is not only to localize to UTC, but rather to any timezone (that the underlying values are converted from that timezone to UTC is more an implementation detail to the user)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872832117


   (if we keep both source and destination timezone in the "localize" kernel, I think at least the destination timezone, if not specified, should default to be the same as the source timezone, and not "UTC", so that a user doesn't have to specify the timezone they want twice, like `localize(timestamp, source_timezone="Europe/Brussels", destination_timezone="Europe/Brussels")` to end up with a `timestamp(unit, tz="Europe/Brussels)` type column)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703584105



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       > This corresponds neither to the Pandas behaviour nor to the `date` library behaviour. Is there any system that offers such an option? Saying "the half past is probably a measurement mistake and what was measured is half past an hour later or earlier" is quite a wild guess...
   
   Well if the user wants to do it it's not guessing :)
   Pandas is close with the `nonexistent=pd.Timedelta('1H')` but yeah, it is probably too niche to have here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-874871566


   @rok I was checking out this PR, and in the process wrote python bindings for the option class to be able to test it in Python, so thought to directly push that here as well. (and at the same time also fixed (I think) the initialization of the timezone, which was causing the C++ tests to fail here as well)
   
   And from testing, you will still need to handle the errors raised by `date.h`:
   
   ```
   In [21]: arr = pa.array([pd.Timestamp("2021-03-28 02:30:00")], type=pa.timestamp("ns"))
   
   In [22]: pc.localize(arr, options=pc.TemporalLocalizationOptions("Europe/Brussels"))
   terminate called after throwing an instance of 'arrow_vendored::date::nonexistent_local_time'
     what():  2021-03-28 02:30:00.000000000 is in a gap between
   2021-03-28 02:00:00 CET and
   2021-03-28 03:00:00 CEST which are both equivalent to
   2021-03-28 01:00:00 UTC
   Aborted (core dumped)
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-905428946


   I'm going to suggest `assume_timezone`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r692129176



##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -269,6 +270,31 @@ struct ARROW_EXPORT DayOfWeekOptions : public FunctionOptions {
   uint32_t week_start;
 };
 
+/// Used to control timestamp timezone conversion and handling ambiguous/nonexistent
+/// times.
+struct ARROW_EXPORT TemporalLocalizationOptions : public FunctionOptions {
+ public:
+  /// How to interpret ambiguous local times that can be interpreted as
+  /// multiple instants due to DST shifts.
+  enum Ambiguous { AMBIGUOUS_RAISE, AMBIGUOUS_EARLIEST, AMBIGUOUS_LATEST };
+
+  /// How to handle local times that do not exists due to DST shifts.
+  enum Nonexistent { NONEXISTENT_RAISE, NONEXISTENT_EARLIEST, NONEXISTENT_LATEST };

Review comment:
       @bkietz @jorisvandenbossche  Any thoughts about naming? Should these be called `AmbiguousHandling` and `NonExistentHandling`? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703602319



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Pushed a pandas-like change. It's a bit annoying that output depends on the unit but ok ..




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-880655054


   > Shall we leave this PR as is then?
   
   I would then remove this option (if we keep it in the PR, it should do the correct thing IMO, while not it is still returning 0 (i.e. 1970-01-01) instead I think?) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r662661947



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -177,5 +177,17 @@ TEST(ScalarTemporalTest, TestZonedTemporalComponentExtraction) {
     ASSERT_RAISES(NotImplemented, Subsecond(timestamps));
   }
 }
+
+TEST(ScalarTemporalTest, TestLocalizeTimezoneNaiveTimestamp) {
+  std::string timezone = "Asia/Kolkata";
+  const char* times_naive = R"(["1970-01-01T00:00:59", null])";
+  const char* times_utc = R"(["1970-01-01T00:00:59", null])";
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit_naive = timestamp(u);
+    auto unit_utc = timestamp(u, timezone);
+    CheckScalarUnary("localize", unit_naive, times_naive, unit_utc, times_utc);

Review comment:
       Added a couple.

##########
File path: docs/source/cpp/compute.rst
##########
@@ -906,48 +906,51 @@ Temporal component extraction
 These functions extract datetime components (year, month, day, etc) from timestamp type.
 Note: this is currently not supported for timestamps with timezone information.
 
-+--------------------+------------+-------------------+---------------+--------+
-| Function name      | Arity      | Input types       | Output type   | Notes  |
-+====================+============+===================+===============+========+
-| year               | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| month              | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| day                | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| day_of_week        | Unary      | Temporal          | Int64         | \(1)   |
-+--------------------+------------+-------------------+---------------+--------+
-| day_of_year        | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| iso_year           | Unary      | Temporal          | Int64         | \(2)   |
-+--------------------+------------+-------------------+---------------+--------+
-| iso_week           | Unary      | Temporal          | Int64         | \(2)   |
-+--------------------+------------+-------------------+---------------+--------+
-| iso_calendar       | Unary      | Temporal          | Struct        | \(3)   |
-+--------------------+------------+-------------------+---------------+--------+
-| quarter            | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| hour               | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| minute             | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| second             | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| millisecond        | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| microsecond        | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| nanosecond         | Unary      | Temporal          | Int64         |        |
-+--------------------+------------+-------------------+---------------+--------+
-| subsecond          | Unary      | Temporal          | Double        |        |
-+--------------------+------------+-------------------+---------------+--------+
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| Function name      | Arity      | Input types       | Output type   | Notes  | Options class                                |
++====================+============+===================+===============+========+==============================================+
+| year               | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| month              | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| day                | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| day_of_week        | Unary      | Temporal          | Int64         | \(1)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| day_of_year        | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| iso_year           | Unary      | Temporal          | Int64         | \(2)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| iso_week           | Unary      | Temporal          | Int64         | \(2)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| iso_calendar       | Unary      | Temporal          | Struct        | \(3)   |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| quarter            | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| hour               | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| minute             | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| second             | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| millisecond        | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| microsecond        | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| nanosecond         | Unary      | Temporal          | Int64         |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| subsecond          | Unary      | Temporal          | Double        |        |                                              |
++--------------------+------------+-------------------+---------------+--------+----------------------------------------------+
+| localize           | Unary      | Temporal          | Temporal      | \(4)   | :struct:`TemporalLocalizationOptions`        |

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-880661119


   Can you also handle the case of an unknown / wrong timezone? (I thought I already commented about that before, but don't directly see it ;)) Currently when using such a timezone, you get a segfault:
   
   ```
   In [7]: pc.tz_localize(arr, timezone="Europe/Brussels", nonexistent="shift_forward").type
   Out[7]: TimestampType(timestamp[ns, tz=Europe/Brussels])
   
   In [8]: pc.tz_localize(arr, timezone="Europe/Brusselsss", nonexistent="shift_forward")
   terminate called after throwing an instance of 'std::runtime_error'
     what():  Europe/Brusselsss not found in timezone database
   Aborted (core dumped)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] adamhooper commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
adamhooper commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872986365


   Since the Arrow spec is going to refer to "LocalDateTime", "Instant" and "ZonedDateTime", how about reusing terms from the spec? Something along the lines of `local_datetime_to_zoned_datetime()`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872766851


   @rok can you explain the rationale of adding a ``source_timezone`` and ``destination_timezone`` to the API? What's the reason for doing it like this, instead of having a single target timezone as argument?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-906352319


   Well, I don't think anything is intuitive here, because the semantics of timestamps in Arrow are slightly weird and surprising :-)
   
   You may still ask for opinions on the ML, but I would personally go with `assume_timezone`. Users will have to read the documentation in any case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r669127595



##########
File path: r/configure.win
##########
@@ -44,7 +44,7 @@ else
   RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)"
 fi
 OPENSSL_LIBS="-lcrypto -lcrypt32"
-MIMALLOC_LIBS="-lbcrypt -lpsapi"
+MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32"

Review comment:
       Should this be somewhere else?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703310562



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Am I misreading, or is `times_latest` equal to `times_earliest`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703576179



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       > Look at the results in the Pandas example: "2015-03-29 01:59:59.999999999+01:00" and "2015-03-29 03:00:00+02:00" are separated by a single _nanosecond_. Is this behaviour useful? @jorisvandenbossche
   
   I'm sure there are is someone out there who cares about nanoseconds :)).
   
   My thought here is - the input timestamp is half past full hour locally. The local hour doesn't exist, so the half past is probably a measurement mistake and what was measured is half past an hour later or earlier. The third option (infer) is to ignore the half past and go to the moment the DST shift occurred.
   
   ```
   earliest -> 2015-03-29 02:30:00 (Warsaw) -> 2015-03-29 00:30:00 (UTC)
   infer -> 2015-03-29 02:30:00 (Warsaw) -> 2015-03-29 01:00:00 (UTC)
   latest -> 2015-03-29 01:30:00 (Warsaw) -> 2015-03-29 01:30:00 (UTC)
   ```
   
   Visual aid for the discussion (not exactly the same zones):
   ![](https://files.gitter.im/HowardHinnant/date/fM7T/EuropeMoscow.jpeg)
   
   We can also just park this into [ARROW-13347](https://issues.apache.org/jira/browse/ARROW-13347).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-916060718


   @rok I'm on it already.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r702982685



##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -278,6 +279,31 @@ struct ARROW_EXPORT DayOfWeekOptions : public FunctionOptions {
   uint32_t week_start;
 };
 
+/// Used to control timestamp timezone conversion and handling ambiguous/nonexistent
+/// times.
+struct ARROW_EXPORT AssumeTimezoneOptions : public FunctionOptions {
+ public:
+  /// How to interpret ambiguous local times that can be interpreted as
+  /// multiple instants due to DST shifts.
+  enum Ambiguous { AMBIGUOUS_RAISE, AMBIGUOUS_EARLIEST, AMBIGUOUS_LATEST };
+
+  /// How to handle local times that do not exists due to DST shifts.
+  enum Nonexistent { NONEXISTENT_RAISE, NONEXISTENT_EARLIEST, NONEXISTENT_LATEST };
+
+  explicit AssumeTimezoneOptions(std::string timezone,
+                                 Ambiguous ambiguous = AMBIGUOUS_RAISE,
+                                 Nonexistent nonexistent = NONEXISTENT_RAISE);
+  AssumeTimezoneOptions();
+  constexpr static char const kTypeName[] = "AssumeTimezoneOptions";
+
+  /// Timezone and timezone name to convert timestamp from
+  std::string timezone;
+  const arrow_vendored::date::time_zone* tz;

Review comment:
       I don't think this should be here. Is it deliberate or just a leftover?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-873184390


   I like all except `tz_normalize`.  "normalize" just has too many meanings for me already.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-903924787


   Name proposals so far:
   * `tz_localize`
   * `force_tz`
   * `local_datetime_to_zoned_datetime`
   * `timestamp_to_zoned`
   * `timestamp_to_local`
   * `timestamp_local_to_zoned`
   * `timestamp_local_at_zone`
   * `localize_timestamp`
   * `localize_naive_timestamp`
   * `to_timestamp_with_time_zone`
   
   I kinda like '..._at_zone' format because it makes me think the timestamp will be interpreted as in a timezone rather then converted to.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-877088851


   > > and how to return NaN when the timestamp is an int64.
   > 
   > I think it should rather be a null instead of NaN (I suppose this is for the IGNORE case?), and nulls can be encoded in int64.
   
   Yes I was thinking about the IGNORE case. This should now return nulls but zeros are returned. Do we need to switch away from `ScalarUnaryNotNullStateful` for this?
   
   > I also think that the output type of the kernel should be timestamp _with_ timezone.
   
   Done. I was not sure how to get the timezone into the out type. I've made a runtime resolver now.
   
   I'll move this out of draft to invite more discussion :).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r670426594



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -321,6 +341,77 @@ struct Nanosecond {
   }
 };
 
+// ----------------------------------------------------------------------
+// Convert timestamps from arbitrary timezone to UTC
+
+template <typename Duration>
+struct TzLocalize {
+  explicit TzLocalize(const TemporalLocalizationOptions& options)
+      : options(std::move(options)) {}
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg) const {
+    return static_cast<T>(
+        arrow_vendored::date::zoned_time<Duration>(
+            options.tz, arrow_vendored::date::local_time<Duration>(Duration{arg}))
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const arrow_vendored::date::choose choose) const {
+    return static_cast<T>(
+        arrow_vendored::date::zoned_time<Duration>(
+            options.tz, arrow_vendored::date::local_time<Duration>(Duration{arg}), choose)
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T Call(KernelContext*, Arg0 arg, Status* st) const {
+    try {
+      return get_local_time<T, Arg0>(arg);
+    } catch (const arrow_vendored::date::nonexistent_local_time& e) {
+      switch (options.nonexistent) {
+        case TemporalLocalizationOptions::Nonexistent::RAISE_NONEXISTENT: {
+          *st = Status::Invalid("Nonexistent: ", e.what());
+          return arg;
+        }
+        case TemporalLocalizationOptions::Nonexistent::SHIFT_BACKWARD: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest);
+        }
+        case TemporalLocalizationOptions::Nonexistent::SHIFT_FORWARD: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest);
+        }
+        case TemporalLocalizationOptions::Nonexistent::IGNORE_NONEXISTENT: {
+          return static_cast<T>(NULL);
+        }
+      }
+    } catch (const arrow_vendored::date::ambiguous_local_time& e) {
+      switch (options.ambiguous) {
+        case TemporalLocalizationOptions::Ambiguous::RAISE_AMBIGUOUS: {
+          *st = Status::Invalid("Ambiguous: ", e.what());
+          return arg;
+        }
+        case TemporalLocalizationOptions::Ambiguous::FIRST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest);
+        }
+        case TemporalLocalizationOptions::Ambiguous::LAST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest);

Review comment:
       For `nonexistent`, we use "shift_backward" and "shift_forward" for earliest/latest, and here we use "first" and "last" for the `ambiguous` keyword. Shall we make this consistent?  (and could then also use the terminology of date.h)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r669164458



##########
File path: r/configure.win
##########
@@ -44,7 +44,7 @@ else
   RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)"
 fi
 OPENSSL_LIBS="-lcrypto -lcrypt32"
-MIMALLOC_LIBS="-lbcrypt -lpsapi"
+MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32"

Review comment:
       Ah, sorry, thought this was on the other PR where I was questioning this.  I will let someone with more knowledge of the build process chime in on whether this is correct or not.  I thought the TZ stuff didn't support Windows anyways?  Or is that only strftime?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-869967788


   https://issues.apache.org/jira/browse/ARROW-13033


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703392527



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Yeah, indeed https://howardhinnant.github.io/date/tz.html#nonexistent_local_time




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703576179



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       > Look at the results in the Pandas example: "2015-03-29 01:59:59.999999999+01:00" and "2015-03-29 03:00:00+02:00" are separated by a single _nanosecond_. Is this behaviour useful? @jorisvandenbossche
   
   I'm sure there are is someone out there who cares about nanoseconds :)).
   
   My thought here is - the input timestamp is half past full hour locally. The local hour doesn't exist, so the half past is probably a measurement mistake and what's was measured is half past an hour later or earlier. The third option (infer) is to ignore the half past and go to the moment the DST shift occurred.
   
   ```
   earliest -> 2015-03-29 02:30:00 (Warsaw) -> 2015-03-29 00:30:00 (UTC)
   infer -> 2015-03-29 02:30:00 (Warsaw) -> 2015-03-29 01:00:00 (UTC)
   latest -> 2015-03-29 01:30:00 (Warsaw) -> 2015-03-29 01:30:00 (UTC)
   ```
   
   Visual aid for the discussion (not exactly the same zones):
   ![](https://files.gitter.im/HowardHinnant/date/fM7T/EuropeMoscow.jpeg)
   
   We can also just park this into [ARROW-13347](https://issues.apache.org/jira/browse/ARROW-13347).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r694108721



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -570,6 +586,71 @@ struct Strftime {
 };
 #endif
 
+// ----------------------------------------------------------------------
+// Convert timestamps from arbitrary timezone to UTC

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r694110590



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -161,6 +162,22 @@ struct TemporalComponentExtract
   }
 };
 
+template <template <typename...> class Op, typename Duration, typename OutType>
+struct TemporalLocalization
+    : public TemporalComponentExtractBase<Op, Duration, OutType> {
+  using Base = TemporalComponentExtractBase<Op, Duration, OutType>;
+
+  static Status Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+    const TemporalLocalizationOptions& options = TemporalLocalizationState::Get(ctx);
+    const auto& timezone = GetInputTimezone(batch.values[0]);
+    if (!timezone.empty() && options.timezone != timezone) {
+      return Status::Invalid("Input and output timezones do not match: ", timezone,

Review comment:
       Yeah, that's a good point! I made this so that input timestamp array has to be timezone-less.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-877018432


   > What I'm not sure about now is if try/catch is the way to go here
   
   I think the try/catch is fine, it's the way to convert an exception (which we don't use ourselves as return values) into a Status. 
   
   > and how to return NaN when the timestamp is an int64.
   
   I think it should rather be a null instead of NaN (I suppose this is for the IGNORE case?), and nulls can be encoded in int64. 
   
   I also think that the output type of the kernel should be timestamp _with_ timezone.
   
   > Another thing: `infer` requires understanding if the array is monotonous or not (if I understand correctly) and I'd push it out of scope of this PR for now.
   
   Yes, that's certainly something to be left/discussed for later, as it makes the kernel no longer a simple scalar kernel (where we can operate on each scalar independently element-wise), since the next scalar needs to have knowledge about the previous ones.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703631280



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -383,7 +384,96 @@ TEST_F(ScalarTemporalTest, DayOfWeek) {
                                                        /*week_start=*/8)));
 }
 
+// TODO: We should test on windows once ARROW-13168 is resolved.
 #ifndef _WIN32
+TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
+  std::string timezone_utc = "UTC";
+  std::string timezone_kolkata = "Asia/Kolkata";
+  std::string timezone_us_central = "US/Central";
+  const char* times_utc = R"(["1970-01-01T00:00:00", null])";
+  const char* times_kolkata = R"(["1970-01-01T05:30:00", null])";
+  const char* times_us_central = R"(["1969-12-31T18:00:00", null])";
+  auto options_utc = AssumeTimezoneOptions(timezone_utc);
+  auto options_kolkata = AssumeTimezoneOptions(timezone_kolkata);
+  auto options_us_central = AssumeTimezoneOptions(timezone_us_central);
+  auto options_invalid = AssumeTimezoneOptions("Europe/Brusselsss");
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_utc = timestamp(u, timezone_utc);
+    auto unit_kolkata = timestamp(u, timezone_kolkata);
+    auto unit_us_central = timestamp(u, timezone_us_central);
+
+    CheckScalarUnary("assume_timezone", unit, times_utc, unit_utc, times_utc,
+                     &options_utc);
+    CheckScalarUnary("assume_timezone", unit, times_kolkata, unit_kolkata, times_utc,
+                     &options_kolkata);
+    CheckScalarUnary("assume_timezone", unit, times_us_central, unit_us_central,
+                     times_utc, &options_us_central);
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit_kolkata, times_utc), options_utc));
+    ASSERT_RAISES(Invalid,
+                  AssumeTimezone(ArrayFromJSON(unit, times_utc), options_invalid));
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneAmbiguous) {
+  std::string timezone = "CET";
+  const char* times = R"(["2018-10-28 01:20:00",
+                          "2018-10-28 02:36:00",
+                          "2018-10-28 03:46:00"])";
+  const char* times_earliest = R"(["2018-10-27 23:20:00",
+                                   "2018-10-28 00:36:00",
+                                   "2018-10-28 02:46:00"])";
+  const char* times_latest = R"(["2018-10-27 23:20:00",
+                                 "2018-10-28 01:36:00",
+                                 "2018-10-28 02:46:00"])";
+
+  auto options_earliest = AssumeTimezoneOptions(
+      timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST);
+  auto options_latest =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST);
+  auto options_raise =
+      AssumeTimezoneOptions(timezone, AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE);
+
+  for (auto u : internal::AllTimeUnits()) {
+    auto unit = timestamp(u);
+    auto unit_local = timestamp(u, timezone);
+    ASSERT_RAISES(Invalid, AssumeTimezone(ArrayFromJSON(unit, times), options_raise));
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_earliest,
+                     &options_earliest);
+    CheckScalarUnary("assume_timezone", unit, times, unit_local, times_latest,
+                     &options_latest);
+  }
+}
+
+TEST_F(ScalarTemporalTest, TestAssumeTimezoneNonexistent) {
+  std::string timezone = "Europe/Warsaw";
+  const char* times = R"(["2015-03-29 02:30:00", "2015-03-29 03:30:00"])";
+  const char* times_earliest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";
+  const char* times_latest = R"(["2015-03-29 01:00:00", "2015-03-29 01:30:00"])";

Review comment:
       Thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok edited a comment on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok edited a comment on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872932527


   > > Having another kernel to only change metadata seems redundant.
   > 
   > I don't think that's redundant, as a user will regularly want to change the timezone of a timestamp column that is already localized (already has a timezone).
   > 
   > I am of course coming from the point of view of pandas, where those two operations are defined as two distinct methods: tz_localize to convert from "timestamp without timezone" to "timestamp with timezone" (i.e. from naive clock time to tz-aware time) and tz_convert to convert between "timestamp with timezone" with different timezones (a metadata-only change).
   > From a user point of view (and also implementation-wise), those two are distinct concepts, IMO, and so I would keep them as separate kernels. (also eg R's lubridate has distinct `force_tz` and `with_tz`, respectively)
   
   I suppose these are two different operations indeed. I've changed this to only do "localization" now.
   
   I like `force_tz` as it indicates the operation is a bit "dangerous". Candidates are then:
   * force_tz
   * tz_localize
   * tz_normalize


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r669164458



##########
File path: r/configure.win
##########
@@ -44,7 +44,7 @@ else
   RWINLIB="../windows/$(ls windows/ | grep ^arrow- | tail -n 1)"
 fi
 OPENSSL_LIBS="-lcrypto -lcrypt32"
-MIMALLOC_LIBS="-lbcrypt -lpsapi"
+MIMALLOC_LIBS="-lbcrypt -lpsapi -lole32"

Review comment:
       Ah, sorry, thought this was on the other PR where I was questioning this.  I will let someone with more knowledge of the build process chime in on whether this is correct or not.  I thought the TZ stuff didn't support Windows anyways?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-876688529


   @jorisvandenbossche I've added nonexistent and ambiguous time handling. I'll try to template calls instead of having a single general one.
   
   What I'm not sure about now is if try/catch is the way to go here and how to return `NaN` when the timestamp is an `int64`.
   
   Another thing: `infer` requires understanding if the array is monotonous or not (if I understand correctly) and I'd push it out of scope of this PR for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-872839038


   I opened https://issues.apache.org/jira/browse/ARROW-13247 for the discussion whether we want a separate "change timezone" kernel


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r702986242



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -571,6 +592,73 @@ struct Strftime {
 };
 #endif
 
+// ----------------------------------------------------------------------
+// Convert timestamps from local timestamp without a timezone to timestamps with a
+// timezone, interpreting the local timestamp as being in the specified timezone
+
+template <typename Duration>
+struct AssumeTimezone {
+  explicit AssumeTimezone(const AssumeTimezoneOptions* options, const time_zone* tz)
+      : options(*options), tz_(tz) {}
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const time_zone* tz) const {
+    return static_cast<T>(zoned_time<Duration>(tz, local_time<Duration>(Duration{arg}))
+                              .get_sys_time()
+                              .time_since_epoch()
+                              .count());
+  }
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const arrow_vendored::date::choose choose,
+                   const time_zone* tz) const {
+    return static_cast<T>(
+        zoned_time<Duration>(tz, local_time<Duration>(Duration{arg}), choose)
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T Call(KernelContext*, Arg0 arg, Status* st) const {
+    try {
+      return get_local_time<T, Arg0>(arg, tz_);
+    } catch (const arrow_vendored::date::nonexistent_local_time& e) {
+      switch (options.nonexistent) {
+        case AssumeTimezoneOptions::Nonexistent::NONEXISTENT_RAISE: {
+          *st = Status::Invalid("Nonexistent: ", e.what());
+          return arg;
+        }
+        case AssumeTimezoneOptions::Nonexistent::NONEXISTENT_EARLIEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest,
+                                         tz_);
+        }
+        case AssumeTimezoneOptions::Nonexistent::NONEXISTENT_LATEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest, tz_);
+        }
+      }
+    } catch (const arrow_vendored::date::ambiguous_local_time& e) {
+      switch (options.ambiguous) {
+        case AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE: {
+          *st = Status::Invalid("Ambiguous: ", e.what());
+          return arg;
+        }
+        case AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest,
+                                         tz_);
+        }
+        case AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest, tz_);
+        }
+      }
+    }
+    *st = Status::UnknownError("Unknown error.");

Review comment:
       I don't really understand what this is for. Presumably this should be unreachable?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on a change in pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on a change in pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#discussion_r703017328



##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal.cc
##########
@@ -571,6 +592,73 @@ struct Strftime {
 };
 #endif
 
+// ----------------------------------------------------------------------
+// Convert timestamps from local timestamp without a timezone to timestamps with a
+// timezone, interpreting the local timestamp as being in the specified timezone
+
+template <typename Duration>
+struct AssumeTimezone {
+  explicit AssumeTimezone(const AssumeTimezoneOptions* options, const time_zone* tz)
+      : options(*options), tz_(tz) {}
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const time_zone* tz) const {
+    return static_cast<T>(zoned_time<Duration>(tz, local_time<Duration>(Duration{arg}))
+                              .get_sys_time()
+                              .time_since_epoch()
+                              .count());
+  }
+
+  template <typename T, typename Arg0>
+  T get_local_time(Arg0 arg, const arrow_vendored::date::choose choose,
+                   const time_zone* tz) const {
+    return static_cast<T>(
+        zoned_time<Duration>(tz, local_time<Duration>(Duration{arg}), choose)
+            .get_sys_time()
+            .time_since_epoch()
+            .count());
+  }
+
+  template <typename T, typename Arg0>
+  T Call(KernelContext*, Arg0 arg, Status* st) const {
+    try {
+      return get_local_time<T, Arg0>(arg, tz_);
+    } catch (const arrow_vendored::date::nonexistent_local_time& e) {
+      switch (options.nonexistent) {
+        case AssumeTimezoneOptions::Nonexistent::NONEXISTENT_RAISE: {
+          *st = Status::Invalid("Nonexistent: ", e.what());
+          return arg;
+        }
+        case AssumeTimezoneOptions::Nonexistent::NONEXISTENT_EARLIEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest,
+                                         tz_);
+        }
+        case AssumeTimezoneOptions::Nonexistent::NONEXISTENT_LATEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest, tz_);
+        }
+      }
+    } catch (const arrow_vendored::date::ambiguous_local_time& e) {
+      switch (options.ambiguous) {
+        case AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_RAISE: {
+          *st = Status::Invalid("Ambiguous: ", e.what());
+          return arg;
+        }
+        case AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_EARLIEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::earliest,
+                                         tz_);
+        }
+        case AssumeTimezoneOptions::Ambiguous::AMBIGUOUS_LATEST: {
+          return get_local_time<T, Arg0>(arg, arrow_vendored::date::choose::latest, tz_);
+        }
+      }
+    }
+    *st = Status::UnknownError("Unknown error.");

Review comment:
       Removed the status but kept the `return 0` to keep he compiler happy. Please check.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-916055216


   Let me rebase that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-916020676


   Ok, I've pushed some minor changes and will merge once CI is green. Thanks a lot for this @rok !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-901843106


   We still have the naming issue (or bikeshedding) to resolve. As a reference, some names used by other systems that I am aware of:
   * Pandas: [tz_localize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.tz_localize.html)
   * R's lubridate: [force_tz](https://lubridate.tidyverse.org/reference/force_tz.html)
   * Java: [atZone](https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html#atZone-java.time.ZoneId-)
   * Java Joda: [toDateTime](https://www.joda.org/joda-time/apidocs/org/joda/time/LocalDateTime.html#toDateTime-org.joda.time.DateTimeZone-)
   * Postgresql: [AT TIME ZONE](https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-ZONECONVERT)
   
   The proposals that have been floated around above:
   
   * `tz_localize`
   * `force_tz`
   * `local_datetime_to_zoned_datetime`
   
   I left out `tz_normalize` (like Weston mentioned, "normalize" already has too many meanings, it also doesn't give any indication to me whether it's adding or removing a timezone ..). 
   I personally like Adam's suggestion of being more explicit (although resulting in a longer name) and re-using the terminology that was now added to the format spec to clarify the meaning. I would maybe only use "timestamp" instead of "datetime", since the actual types in arrow that this kernel accepts are called that way. 
   With that, I would do a proposal of calling it `timestamp_local_to_zoned` ? (or `timestamp_local_at_zone`)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-880634922


   > > > and how to return NaN when the timestamp is an int64.
   > > 
   > > 
   > > I think it should rather be a null instead of NaN (I suppose this is for the IGNORE case?), and nulls can be encoded in int64.
   > 
   > Yes I was thinking about the IGNORE case. This should now return nulls but zeros are returned. Do we need to switch away from `ScalarUnaryNotNullStateful` for this?
   
   If it makes the implementation more complicated (as it would no longer simply propagate nulls), I think it is fine to also leave this option for a future enhancement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-901324621


   @pitrou - I've rebased this and it's ready for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rok commented on pull request #10610: ARROW-13033: [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time)

Posted by GitBox <gi...@apache.org>.
rok commented on pull request #10610:
URL: https://github.com/apache/arrow/pull/10610#issuecomment-903765579


   @jorisvandenbossche I think most of the names we're discussing are great. I would slightly prefer an 'old' name as it confuses slightly less users :).
   I'm just going through the feedback now and will push a commit today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org