You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/23 13:33:55 UTC

[GitHub] [arrow] pitrou commented on a change in pull request #12536: ARROW-13168: [C++][R] Enable runtime timezone database for Windows

pitrou commented on a change in pull request #12536:
URL: https://github.com/apache/arrow/pull/12536#discussion_r833247348



##########
File path: cpp/src/arrow/config.h
##########
@@ -77,4 +86,13 @@ const BuildInfo& GetBuildInfo();
 ARROW_EXPORT
 RuntimeInfo GetRuntimeInfo();
 
+struct ArrowGlobalOptions {
+  /// Path to text timezone database. This is only configurable on Windows,
+  /// which does not have a compatible OS timezone database.
+  util::optional<std::string> tz_db_path;

Review comment:
       Can we use a consistent naming here and above?
   

##########
File path: cpp/src/arrow/config.cc
##########
@@ -73,7 +76,30 @@ RuntimeInfo GetRuntimeInfo() {
       MakeSimdLevelString([&](int64_t flags) { return cpu_info->IsSupported(flags); });
   info.detected_simd_level =
       MakeSimdLevelString([&](int64_t flags) { return cpu_info->IsDetected(flags); });
+  info.using_os_timezone_db = USE_OS_TZDB;
+#if !USE_OS_TZDB
+  info.timezone_db_path = timezone_db_path;

Review comment:
       You shouldn't leave it uninitialized otherwise.

##########
File path: cpp/src/arrow/public_api_test.cc
##########
@@ -103,4 +104,40 @@ TEST(Misc, BuildInfo) {
   ASSERT_THAT(info.full_so_version, ::testing::HasSubstr(info.so_version));
 }
 
+TEST(Misc, SetTimzoneConfig) {
+#ifndef _WIN32
+  GTEST_SKIP() << "Can only set the Timezone database on Windows";
+#else
+#ifndef ARROW_FILESYSTEM
+  GTEST_SKIP() << "Need filesystem support to test timezone config.";
+#else
+  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
+  auto home_raw = std::getenv("USERPROFILE");
+  std::string home = home_raw == nullptr ? "~" : std::string(home_raw);
+  ASSERT_OK_AND_ASSIGN(std::string tzdata_dir,
+                       fs->NormalizePath(home + "\\Downloads\\tzdata"));

Review comment:
       This makes the test depend on specifics of where the timezone db was downloaded.
   
   Do we instead want to expose a `ARROW_TIMEZONE_DB_PATH` environment variable that would automatically configure the right tz db path? Or is that overkill?

##########
File path: cpp/src/arrow/public_api_test.cc
##########
@@ -103,4 +104,40 @@ TEST(Misc, BuildInfo) {
   ASSERT_THAT(info.full_so_version, ::testing::HasSubstr(info.so_version));
 }
 
+TEST(Misc, SetTimzoneConfig) {
+#ifndef _WIN32
+  GTEST_SKIP() << "Can only set the Timezone database on Windows";
+#else
+#ifndef ARROW_FILESYSTEM

Review comment:
       ```suggestion
   #elif !defined(ARROW_FILESYSTEM)
   ```

##########
File path: cpp/src/arrow/public_api_test.cc
##########
@@ -103,4 +104,40 @@ TEST(Misc, BuildInfo) {
   ASSERT_THAT(info.full_so_version, ::testing::HasSubstr(info.so_version));
 }
 
+TEST(Misc, SetTimzoneConfig) {

Review comment:
       Typo here.

##########
File path: docs/source/cpp/build_system.rst
##########
@@ -163,3 +163,24 @@ can control the source of each dependency and whether it is statically or
 dynamically linked. See :doc:`/developers/cpp/building` for instructions. Or
 alternatively, use Arrow from a package manager such as Conda or vcpkg which
 will manage consistent versions of Arrow and its dependencies.
+
+
+Runtime Dependencies
+====================
+
+While Arrow uses the OS-provided timezone database on Linux and Mac OS, it
+requires a user-provided database on Windows. You must download and extract the
+text version of the IANA timezone database and add the Windows timezone mapping
+XML. To download, you can use the following batch script:
+
+.. literalinclude:: ../../../ci/appveyor-cpp-setup.bat
+   :language: cmd
+   :start-after: @rem (Doc section: Download timezone database)
+   :end-before: @rem (Doc section: Download timezone database)

Review comment:
       I'd rather if you copied the relevant snippet here, we shouldn't ideally rely on the contents of CI scripts (which may contain specific quirks that irrelevant to normal user setups) for the public docs.

##########
File path: cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
##########
@@ -1875,6 +1870,9 @@ TEST_F(ScalarTemporalTest, StrftimeCLocale) {
 }
 
 TEST_F(ScalarTemporalTest, StrftimeOtherLocale) {
+#ifdef _WIN32
+  GTEST_SKIP() << "There is a known bug in strftime for locales on Windows (ARROW-15922)";

Review comment:
       Is this on Windows or specifically MinGW? i.e., would the non-MinGW Windows CI pass if you remove this skip? I'm wondering if we can narrow the condition.

##########
File path: docs/source/cpp/build_system.rst
##########
@@ -163,3 +163,24 @@ can control the source of each dependency and whether it is statically or
 dynamically linked. See :doc:`/developers/cpp/building` for instructions. Or
 alternatively, use Arrow from a package manager such as Conda or vcpkg which
 will manage consistent versions of Arrow and its dependencies.
+
+
+Runtime Dependencies
+====================
+
+While Arrow uses the OS-provided timezone database on Linux and Mac OS, it

Review comment:
       ```suggestion
   While Arrow uses the OS-provided timezone database on Linux and macOS, it
   ```

##########
File path: docs/source/developers/cpp/windows.rst
##########
@@ -362,6 +362,21 @@ suppress dllimport/dllexport marking of symbols. Projects that statically link
 against Arrow on Windows additionally need this definition. The Unix builds do
 not use the macro.
 
+Downloading the Timezone Database
+=================================
+
+To run some of the compute unit tests on Windows, the IANA timezone database
+and the Windows timezone mapping need to be downloaded first. To download
+them to the default search directory (in the user Downloads directory), use 

Review comment:
       Can you just cross-reference to the relevant section in the building docs above?

##########
File path: docs/source/cpp/build_system.rst
##########
@@ -163,3 +163,24 @@ can control the source of each dependency and whether it is statically or
 dynamically linked. See :doc:`/developers/cpp/building` for instructions. Or
 alternatively, use Arrow from a package manager such as Conda or vcpkg which
 will manage consistent versions of Arrow and its dependencies.
+
+
+Runtime Dependencies
+====================
+
+While Arrow uses the OS-provided timezone database on Linux and Mac OS, it
+requires a user-provided database on Windows. You must download and extract the
+text version of the IANA timezone database and add the Windows timezone mapping
+XML. To download, you can use the following batch script:
+
+.. literalinclude:: ../../../ci/appveyor-cpp-setup.bat
+   :language: cmd
+   :start-after: @rem (Doc section: Download timezone database)
+   :end-before: @rem (Doc section: Download timezone database)
+
+By default, the timezone database will be detected at ``%USERPROFILE%\Downloads\tzdata``,

Review comment:
       Is this a reliable location? Is there a risk that the downloads folder gets cleared from time to time?

##########
File path: cpp/src/arrow/config.cc
##########
@@ -73,7 +76,30 @@ RuntimeInfo GetRuntimeInfo() {
       MakeSimdLevelString([&](int64_t flags) { return cpu_info->IsSupported(flags); });
   info.detected_simd_level =
       MakeSimdLevelString([&](int64_t flags) { return cpu_info->IsDetected(flags); });
+  info.using_os_timezone_db = USE_OS_TZDB;
+#if !USE_OS_TZDB
+  info.timezone_db_path = timezone_db_path;
+#endif
   return info;
 }
 
+Status Initialize(const ArrowGlobalOptions& options) noexcept {
+  if (options.tz_db_path.has_value()) {
+#if !USE_OS_TZDB
+    try {
+      arrow_vendored::date::set_install(options.tz_db_path.value());
+      arrow_vendored::date::reload_tzdb();
+    } catch (const std::runtime_error& e) {
+      return Status::ExecutionError(e.what());
+    }
+    timezone_db_path = options.tz_db_path.value();
+#else
+    return Status::Invalid(
+        "Arrow was set to use OS timezone database at compile time, so it cannot be set "
+        "at runtime.");

Review comment:
       The error message is weird, it reads like you can set to use OS timezone database at runtime.

##########
File path: cpp/src/arrow/public_api_test.cc
##########
@@ -103,4 +104,40 @@ TEST(Misc, BuildInfo) {
   ASSERT_THAT(info.full_so_version, ::testing::HasSubstr(info.so_version));
 }
 
+TEST(Misc, SetTimzoneConfig) {
+#ifndef _WIN32
+  GTEST_SKIP() << "Can only set the Timezone database on Windows";
+#else
+#ifndef ARROW_FILESYSTEM
+  GTEST_SKIP() << "Need filesystem support to test timezone config.";
+#else
+  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
+  auto home_raw = std::getenv("USERPROFILE");
+  std::string home = home_raw == nullptr ? "~" : std::string(home_raw);
+  ASSERT_OK_AND_ASSIGN(std::string tzdata_dir,
+                       fs->NormalizePath(home + "\\Downloads\\tzdata"));
+  ASSERT_OK_AND_ASSIGN(auto tzdata_path,
+                       arrow::internal::PlatformFilename::FromString(tzdata_dir));
+  if (!arrow::internal::FileExists(tzdata_path).ValueOr(false)) {
+    GTEST_SKIP() << "Couldn't find timezone database in expected dir: " << tzdata_dir;
+  }
+  // Create a tmp directory
+  ASSERT_OK_AND_ASSIGN(auto tempdir, arrow::internal::TemporaryDir::Make("tzdata"));
+
+  // Validate that setting tzdb to that dir fails
+  arrow::ArrowGlobalOptions options = {util::make_optional(tempdir->path().ToString())};
+  ASSERT_NOT_OK(arrow::Initialize(options));
+
+  // Copy tzdb data from ~/Downloads
+  auto selector = arrow::fs::FileSelector();
+  selector.base_dir = tzdata_dir;
+  selector.recursive = true;
+  ASSERT_OK(arrow::fs::CopyFiles(fs, selector, fs, tempdir->path().ToString()));
+
+  // Validate that tzdb is working
+  ASSERT_OK(arrow::Initialize(options));

Review comment:
       Hmm, the problem is that later tests may then pick up your temporary directory for the timezone db, which will then not exist anymore?

##########
File path: cpp/src/arrow/config.cc
##########
@@ -73,7 +76,30 @@ RuntimeInfo GetRuntimeInfo() {
       MakeSimdLevelString([&](int64_t flags) { return cpu_info->IsSupported(flags); });
   info.detected_simd_level =
       MakeSimdLevelString([&](int64_t flags) { return cpu_info->IsDetected(flags); });
+  info.using_os_timezone_db = USE_OS_TZDB;
+#if !USE_OS_TZDB
+  info.timezone_db_path = timezone_db_path;
+#endif
   return info;
 }
 
+Status Initialize(const ArrowGlobalOptions& options) noexcept {
+  if (options.tz_db_path.has_value()) {
+#if !USE_OS_TZDB
+    try {
+      arrow_vendored::date::set_install(options.tz_db_path.value());
+      arrow_vendored::date::reload_tzdb();
+    } catch (const std::runtime_error& e) {
+      return Status::ExecutionError(e.what());

Review comment:
       That status code is really Gandiva specific.
   ```suggestion
         return Status::IOError(e.what());
   ```

##########
File path: cpp/src/arrow/config.h
##########
@@ -77,4 +86,13 @@ const BuildInfo& GetBuildInfo();
 ARROW_EXPORT
 RuntimeInfo GetRuntimeInfo();
 
+struct ArrowGlobalOptions {

Review comment:
       Note sure it's worth putting "Arrow" here since this is in the `arrow` namespace.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org