You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by pa...@apache.org on 2022/12/15 13:44:50 UTC

[arrow] branch master updated: ARROW-17332: [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow (#14930)

This is an automated email from the ASF dual-hosted git repository.

paleolimbot pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 5ce8d79d7a ARROW-17332: [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow (#14930)
5ce8d79d7a is described below

commit 5ce8d79d7ae4b3864226cc3c5480fa8eba2e571d
Author: Dewey Dunnington <de...@voltrondata.com>
AuthorDate: Thu Dec 15 09:44:41 2022 -0400

    ARROW-17332: [R] error parsing folder path with accent ('c:/Público') in read_csv_arrow (#14930)
    
    This PR fixes a bug that prevented some filenames with non-ASCII characters from being openable. The probable culprit is `normalizePath()`, which does some handling of special characters but does not mark the encoding of its output in a way that cpp11's conversion to `std::string` understands.
    
    Because most of our test environments have UTF-8 as the session encoding, this usually works by accident, and it may work by accident in a latin-1 locale too (judging mostly by the fact that our issue thread is not overflowing with complaints of unopenable files, which may or may not be a good metric).
    
    I've added a test for converting the out the output of `normalizePath()` and making sure it's marked as UTF-8. I'll try to replicate this using Docker, too and see if there's any additional test we could add.
    
    The change here brings arrow in line with what the vroom package does for this operation: https://github.com/tidyverse/vroom/blob/main/R/path.R#L322-L324
    
    Authored-by: Dewey Dunnington <de...@voltrondata.com>
    Signed-off-by: Dewey Dunnington <de...@voltrondata.com>
---
 r/R/filesystem.R                   | 2 +-
 r/src/arrow_types.h                | 9 +++++++--
 r/tests/testthat/test-filesystem.R | 1 -
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/r/R/filesystem.R b/r/R/filesystem.R
index 4ad6aa83e3..a218c88769 100644
--- a/r/R/filesystem.R
+++ b/r/R/filesystem.R
@@ -618,7 +618,7 @@ copy_files <- function(from, to, chunk_size = 1024L * 1024L) {
 
 clean_path_abs <- function(path) {
   # Make sure we have a valid, absolute, forward-slashed path for passing to Arrow
-  normalizePath(path, winslash = "/", mustWork = FALSE)
+  enc2utf8(normalizePath(path, winslash = "/", mustWork = FALSE))
 }
 
 clean_path_rel <- function(path) {
diff --git a/r/src/arrow_types.h b/r/src/arrow_types.h
index aff50ffbee..49283bd224 100644
--- a/r/src/arrow_types.h
+++ b/r/src/arrow_types.h
@@ -109,9 +109,14 @@ static inline void StopIfNotOk(const Status& status) {
     if (unwind_detail) {
       throw cpp11::unwind_exception(unwind_detail->token);
     } else {
-      // ARROW-13039: be careful not to interpret our error message as a %-format string
+      // We need to translate this to "native" encoding for the error to be
+      // displayed properly using cpp11::stop()
       std::string s = status.ToString();
-      cpp11::stop("%s", s.c_str());
+      cpp11::strings s_utf8 = cpp11::as_sexp(s);
+      const char* s_native = cpp11::safe[Rf_translateChar](s_utf8[0]);
+
+      // ARROW-13039: be careful not to interpret our error message as a %-format string
+      cpp11::stop("%s", s_native);
     }
   }
 }
diff --git a/r/tests/testthat/test-filesystem.R b/r/tests/testthat/test-filesystem.R
index 7957743a2a..34095acc25 100644
--- a/r/tests/testthat/test-filesystem.R
+++ b/r/tests/testthat/test-filesystem.R
@@ -15,7 +15,6 @@
 # specific language governing permissions and limitations
 # under the License.
 
-
 test_that("LocalFilesystem", {
   fs <- LocalFileSystem$create()
   expect_identical(fs$type_name, "local")