You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/08 23:15:48 UTC

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554241301



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       If I read the code correctly, we're not distinguishing between arguments that are valid CsvReadOptions/CsvConvertOptions and just not supported in datasets yet (what ARROW-8631 is about) and plain garbage `asdfasdrfwea` options. Is that distinction worth making? I.e. is erroring with `Unsupported option: na` clear enough that we're not going to get another bug report about this?

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       Why does this need to be asserted? Because the validating code assumes it?

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       You don't have to use it here, but for reference we have an `oxford_paste` utility function for making human-readable, appropriately comma'd error message strings: https://github.com/apache/arrow/blob/master/r/R/util.R#L18

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       Could you add a comment here explaining how this catches ambiguous args? I don't use `pmatch` enough for this to be immediately apparent. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org