You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/06 12:05:22 UTC

[GitHub] [arrow] assignUser commented on a diff in pull request #13464: ARROW-16752: [R] Rework Linux binary installation

assignUser commented on code in PR #13464:
URL: https://github.com/apache/arrow/pull/13464#discussion_r914748797


##########
dev/tasks/r/github.packages.yml:
##########
@@ -212,47 +212,62 @@ jobs:
 
   test-linux-binary:
     needs: [source, linux-cpp]
-    name: Test binary {{ '${{ matrix.image }}' }}
+    name: Test binary {{ '${{ matrix.config.image }}' }}
     runs-on: ubuntu-latest
-    container: {{ '${{ matrix.image }}' }}
+    container: {{ '${{ matrix.config.image }}' }}
     strategy:
       fail-fast: false
       matrix:
-        image:
-          - "rhub/ubuntu-gcc-release" # ubuntu-20.04 (focal)
-          - "rstudio/r-base:4.1-bionic"
-          - "rstudio/r-base:4.2-centos7"
-          - "rocker/r-ver:3.6.3" # for debian:buster (10)
-          - "rocker/r-ver" # ubuntu-20.04
-          - "rhub/fedora-clang-devel" # tests distro-map.csv, mapped to ubuntu-18.04
-          - "rocker/r-ubuntu:22.04" # tests openssl3 compatibility
+        config:
+          # If libarrow_binary is unset, we're testing that we're automatically
+          # choosing a binary on this OS. If libarrow_binary is TRUE, we're on
+          # an OS that is not in the allowlist, so we have to opt-in to use the
+          # binary. Other env vars used in r_docker_configure.sh can be added
+          # here (like devtoolset) and wired up in the later steps.
+          - {image: "rhub/debian-clang-devel", libarrow_binary: "TRUE"}
+          # fedora-clang-devel cannot use binaries bc of libc++ (uncomment to see the error)

Review Comment:
   This was working previously, is this a regression or was it a false positive previously?



##########
r/tools/nixlibs.R:
##########
@@ -60,53 +60,173 @@ download_ok <- !env_is("TEST_OFFLINE_BUILD", "true") && try_download("https://gi
 thirdparty_dependency_dir <- Sys.getenv("ARROW_THIRDPARTY_DEPENDENCY_DIR", "tools/thirdparty_dependencies")
 
 
-download_binary <- function(os = identify_os()) {
+download_binary <- function(lib) {
   libfile <- tempfile()
-  if (!is.null(os)) {
-    # See if we can map this os-version to one we have binaries for
-    os <- find_available_binary(os)
-    binary_url <- paste0(arrow_repo, "bin/", os, "/arrow-", VERSION, ".zip")
-    if (try_download(binary_url, libfile)) {
-      cat(sprintf("*** Successfully retrieved C++ binaries for %s\n", os))
-      if (!identical(os, "centos-7")) {
-        # centos-7 uses gcc 4.8 so the binary doesn't have ARROW_S3=ON
-        # or ARROW_GCS=ON but the others do
-        # TODO: actually check for system requirements?
-        cat("**** Binary package requires libcurl and openssl\n")
-        cat("**** If installation fails, retry after installing those system requirements\n")
-      }
-    } else {
-      cat(sprintf("*** No libarrow binary found for version %s on %s\n", VERSION, os))
-      libfile <- NULL
+  binary_url <- paste0(arrow_repo, "bin/", lib, "/arrow-", VERSION, ".zip")
+  if (try_download(binary_url, libfile)) {
+    if (!quietly) {
+      cat(sprintf("*** Successfully retrieved C++ binaries (%s)\n", lib))
     }
   } else {
+    if (!quietly) {
+      cat(sprintf("*** No libarrow binary found for version %s (%s)\n", VERSION, lib))
+    }
     libfile <- NULL
   }
   libfile
 }
 
 # Function to figure out which flavor of binary we should download, if at all.
-# By default (unset or "FALSE"), it will not download a precompiled library,
-# but you can override this by setting the env var LIBARROW_BINARY to:
-# * `TRUE` (not case-sensitive), to try to discover your current OS, or
-# * some other string, presumably a related "distro-version" that has binaries
-#   built that work for your OS
-identify_os <- function(os = Sys.getenv("LIBARROW_BINARY")) {
-  if (tolower(os) %in% c("", "false")) {
-    # Env var says not to download a binary
-    return(NULL)
-  } else if (!identical(tolower(os), "true")) {
-    # Env var provided an os-version to use--maybe you're on Ubuntu 18.10 but
-    # we only build for 18.04 and that's fine--so use what the user set
-    return(os)
+# LIBARROW_BINARY controls the behavior. If unset, it will determine a course
+# of action based on the current system. Other values you can set it to:
+# * "FALSE" (not case-sensitive), to skip this option altogether
+# * "TRUE" (not case-sensitive), to try to discover your current OS, or
+# * Some other string: a "distro-version" that corresponds to a binary that is
+#   available, to override what this function may discover by default.
+#   Possible values are:
+#    * "centos-7" (gcc 4.8, no AWS/GCS support)
+#    * "ubuntu-18.04" (gcc 8, openssl 1)
+#    * "ubuntu-22.04" (openssl 3)
+#   These string values, along with `NULL`, are the potential return values of #   this function.

Review Comment:
   ```suggestion
   #   These string values, along with `NULL`, are the potential return values of
   #   this function.
   ```



##########
r/tools/nixlibs.R:
##########
@@ -60,53 +60,173 @@ download_ok <- !env_is("TEST_OFFLINE_BUILD", "true") && try_download("https://gi
 thirdparty_dependency_dir <- Sys.getenv("ARROW_THIRDPARTY_DEPENDENCY_DIR", "tools/thirdparty_dependencies")
 
 
-download_binary <- function(os = identify_os()) {
+download_binary <- function(lib) {
   libfile <- tempfile()
-  if (!is.null(os)) {
-    # See if we can map this os-version to one we have binaries for
-    os <- find_available_binary(os)
-    binary_url <- paste0(arrow_repo, "bin/", os, "/arrow-", VERSION, ".zip")
-    if (try_download(binary_url, libfile)) {
-      cat(sprintf("*** Successfully retrieved C++ binaries for %s\n", os))
-      if (!identical(os, "centos-7")) {
-        # centos-7 uses gcc 4.8 so the binary doesn't have ARROW_S3=ON
-        # or ARROW_GCS=ON but the others do
-        # TODO: actually check for system requirements?
-        cat("**** Binary package requires libcurl and openssl\n")
-        cat("**** If installation fails, retry after installing those system requirements\n")
-      }
-    } else {
-      cat(sprintf("*** No libarrow binary found for version %s on %s\n", VERSION, os))
-      libfile <- NULL
+  binary_url <- paste0(arrow_repo, "bin/", lib, "/arrow-", VERSION, ".zip")
+  if (try_download(binary_url, libfile)) {
+    if (!quietly) {
+      cat(sprintf("*** Successfully retrieved C++ binaries (%s)\n", lib))
     }
   } else {
+    if (!quietly) {
+      cat(sprintf("*** No libarrow binary found for version %s (%s)\n", VERSION, lib))
+    }
     libfile <- NULL
   }
   libfile
 }
 
 # Function to figure out which flavor of binary we should download, if at all.
-# By default (unset or "FALSE"), it will not download a precompiled library,
-# but you can override this by setting the env var LIBARROW_BINARY to:
-# * `TRUE` (not case-sensitive), to try to discover your current OS, or
-# * some other string, presumably a related "distro-version" that has binaries
-#   built that work for your OS
-identify_os <- function(os = Sys.getenv("LIBARROW_BINARY")) {
-  if (tolower(os) %in% c("", "false")) {
-    # Env var says not to download a binary
-    return(NULL)
-  } else if (!identical(tolower(os), "true")) {
-    # Env var provided an os-version to use--maybe you're on Ubuntu 18.10 but
-    # we only build for 18.04 and that's fine--so use what the user set
-    return(os)
+# LIBARROW_BINARY controls the behavior. If unset, it will determine a course
+# of action based on the current system. Other values you can set it to:
+# * "FALSE" (not case-sensitive), to skip this option altogether
+# * "TRUE" (not case-sensitive), to try to discover your current OS, or
+# * Some other string: a "distro-version" that corresponds to a binary that is
+#   available, to override what this function may discover by default.
+#   Possible values are:
+#    * "centos-7" (gcc 4.8, no AWS/GCS support)
+#    * "ubuntu-18.04" (gcc 8, openssl 1)
+#    * "ubuntu-22.04" (openssl 3)
+#   These string values, along with `NULL`, are the potential return values of #   this function.
+identify_binary <- function(lib = Sys.getenv("LIBARROW_BINARY"), info = distro()) {
+  lib <- tolower(lib)
+  if (identical(lib, "")) {
+    # Not specified. Check the allowlist.
+    lib <- ifelse(check_allowlist(info$id), "true", "false")
+  }
+
+  if (identical(lib, "false")) {
+    # Do not download a binary
+    NULL
+  } else if (!identical(lib, "true")) {
+    # Env var provided an os-version to use, to override our logic.
+    # We don't validate that this exists. If it doesn't, the download will fail
+    # and the build will fall back to building from source
+    lib
+  } else {
+    # See if we can find a suitable binary
+    select_binary()
   }
+}
+
+check_allowlist <- function(os, allowed = "https://raw.githubusercontent.com/apache/arrow/master/r/tools/nixlibs-allowlist.txt") {
+  allowlist <- tryCatch(
+    # Try a remote allowlist so that we can add/remove without a release
+    suppressWarnings(readLines(allowed)),
+    # Fallback to default: allowed only on Ubuntu and CentOS/RHEL
+    error = function(e) c("ubuntu", "centos", "redhat", "rhel")
+  )
+  # allowlist should contain valid regular expressions (plain strings ok too)
+  any(grepl(paste(allowlist, collapse = "|"), os))
+}
+
+select_binary <- function(os = tolower(Sys.info()[["sysname"]]),
+                          arch = tolower(Sys.info()[["machine"]]),
+                          compiler_version = compiler_version_string(),
+                          test_program = test_for_curl_and_openssl) {
+  if (identical(os, "linux") && identical(arch, "x86_64")) {
+    # We only host x86 linux binaries today
+    is_gcc4 <- any(grepl("^g\\+\\+.*[^\\d.]4(\\.\\d){2}", compiler_version))
+    if (is_gcc4) {
+      cat("*** Some features are not available with gcc 4\n")
+      return("centos-7")
+    } else {
+      tryCatch(
+        # Somehow the test program system2 call errors on the sanitizer builds
+        # so globally handle the possibility that this could fail
+        {
+          errs <- compile_test_program(test_program)
+          determine_binary_from_stderr(errs)
+        },
+        error = function(e) {
+          cat("*** Unable to find libcurl and openssl\n")
+          NULL
+        }
+      )
+    }
+  } else {
+    # No binary available for arch
+    cat(sprintf("*** Building on %s %s\n", os, arch))
+    NULL
+  }
+}
+
+# This tests that curl and openssl are present (bc we can include their headers)
+# and it checks for other versions/features and raises errors that we grep for
+test_for_curl_and_openssl <- "
+#include <ciso646>
+#ifdef _LIBCPP_VERSION
+#error Using libc++
+#endif
+
+#if !( __GLIBC__ >= 2 && __GLIBC_MINOR__ >= 27)
+#error glibc version too old
+#endif
+
+#include <curl/curl.h>
+#include <openssl/opensslv.h>
+#if OPENSSL_VERSION_NUMBER < 0x10002000L
+#error OpenSSL version too old
+#endif
+#if OPENSSL_VERSION_NUMBER >= 0x30000000L
+#error Using OpenSSL version 3
+#endif
+"
+
+compile_test_program <- function(code) {
+  # Note: if we wanted to check for openssl on macOS, we'd have to set the brew
+  # path as a -I directory. But since we (currently) only run this code to
+  # determine whether we can download a Linux binary, it's not relevant.
+  runner <- "`R CMD config CXX11` `R CMD config CPPFLAGS` `R CMD config CXX11FLAGS` `R CMD config CXX11STD` -E -xc++"
+  suppressWarnings(system2("echo", sprintf('"%s" | %s -', code, runner), stdout = FALSE, stderr = TRUE))
+}
 
-  linux <- distro()
-  if (is.null(linux)) {
-    cat("*** Unable to identify current OS/version\n")
+# TODO(later): build "ubuntu-18.04" on centos7 with newer devtoolset (but glibc is 2.17) for broader compatibility (like manylinux2014)?
+determine_binary_from_stderr <- function(errs) {
+  if (is.null(attr(errs, "status"))) {
+    # There was no error in compiling: so we found libcurl and openssl > 1.0.2,
+    # openssl is < 3.0, glibc is >= 2.27, and we're not using a strict libc++
+    cat("*** Found libcurl and openssl >= 1.0.2\n")
+    return("ubuntu-18.04")
+  } else if (any(grepl("Using libc++", errs, fixed = TRUE))) {
+    # Our binaries are all built with GNU stdlib so they fail with libc++
+    cat("*** Found libc++\n")
     return(NULL)
+  } else if (any(grepl("glibc version too old", errs))) {
+    # ubuntu-18.04 has glibc 2.27, so even if you install newer compilers
+    # (e.g. devtoolset on centos) and have curl/openssl, you run into problems
+    # TODO(ARROW-16976): build binaries with older glibc
+    cat("*** Checking glibc version\n")
+    # If we're here, we're on an older OS but with a newer compiler than gcc 4.8
+    # (we already checked), so it is possible to build with more features on.
+    # We just can't use our binaries because they were built with newer glibc.
+    return(NULL)
+  } else if (header_not_found("curl/curl", errs)) {
+    # TODO: should these next 3 NULL cases return centos-7? A source build

Review Comment:
   I think that would make sense + a message explaining this to the user?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org