You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/08 21:14:26 UTC

[GitHub] [arrow] ianmcook opened a new pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

ianmcook opened a new pull request #9143:
URL: https://github.com/apache/arrow/pull/9143


   Improves messaging for currently unsupported readr parse options and improves handling in related cases such as ambiguous partial argument names and invalid combinations of Arrow and readr options


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757043963


   > @nealrichardson let me know if you'd also like to me to add text in these two places in the `dataset_factory` and `FileFormat` docs explaining that some readr-style options are not supported in `...`
   
   You could, but IMO a better (though not mutually exclusive) solution would be to use the error message to tell you that more clearly https://github.com/apache/arrow/pull/9143#discussion_r554241301
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757001158


   https://issues.apache.org/jira/browse/ARROW-10463


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554441947



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Right. If you think it's worth it, I'll add some additional code here to get the full list of rlang options that `read_delim_arrow()` supports and compute a set difference with the names of the rlang options currently supported here. Then in the case where the user specifies one of those options, I'll give a more specific error message.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554280947



##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       Got it. So this works because we've already filtered out the case where there's no match in each list separately, so if there's no unambiguous match when together, it must mean that there were multiple matches.
   
   Can you add an inline comment to that effect? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook edited a comment on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook edited a comment on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757016760


   @nealrichardson let me know if you'd also like to me to explain in the `dataset_factory` and `FileFormat` docs that some readr-style options are not supported in `...`
   - https://github.com/apache/arrow/blob/master/r/R/dataset-factory.R#L107-L110
   - https://github.com/apache/arrow/blob/master/r/R/dataset-format.R#L43-L45


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554441947



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Right. If you think it's worth it, I'll add some additional code here to get the full list of rlang options that `read_delim_arrow()`supports and compute a set difference with the names of the rlang options currently supported here. Then in the case where the user specifies one of those options, I'll give a more specific error message.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554476336



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Yeah, I think we need to distinguish in the error message between "your fault" and "our fault/not implemented"--what you as a user do with that information is very different.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-758143905


   @nealrichardson I believe 205d1e7 resolves the issues you raised here


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r555297933



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,68 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr parse options must be mutually exclusive, or else the code
+  # in `csv_file_format_parse_options()` will error or behave incorrectly. A
+  # failure of this test indicates that these two sets of option names are not
+  # mutually exclusive.
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With not yet supported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Works for me




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554280501



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       Ok, that's what I figured. Could you please add a comment here explaining that?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554271178



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       The code in `csv_file_format_parse_options()` in `dataset-format.R` will error or behave incorrectly if any of the Arrow file parsing options (`delimiter`, `quoting`, etc.) happens to have exactly the same name as any of the supported readr-style options (`delim`, `quote`, etc.). Some future change might accidentally cause these two lists of option names to overlap, so I added this test which will fail if that happens.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson edited a comment on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson edited a comment on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757043963


   > @nealrichardson let me know if you'd also like to me to add text in these two places in the `dataset_factory` and `FileFormat` docs explaining that some readr-style options are not supported in `...`
   
   You could, but IMO a better (though not mutually exclusive) solution would be to use the error message to tell you that more clearly https://github.com/apache/arrow/pull/9143#discussion_r554241301. I don't assume people read help pages, at least not that closely.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson closed pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson closed pull request #9143:
URL: https://github.com/apache/arrow/pull/9143


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757001158


   https://issues.apache.org/jira/browse/ARROW-10463


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r555288000



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,68 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr parse options must be mutually exclusive, or else the code
+  # in `csv_file_format_parse_options()` will error or behave incorrectly. A
+  # failure of this test indicates that these two sets of option names are not
+  # mutually exclusive.
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With not yet supported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {
+    expect_error(
+      open_dataset(tsv_dir, partitioning = "part", delim = "\t", na = "\\N"),
+      "supported"
+    )
+  }

Review comment:
       Provided that you agree, here's a suggestion version you can just accept
   
   ```suggestion
     # With not yet supported readr parse options (ARROW-8631)
     expect_error(
       open_dataset(tsv_dir, partitioning = "part", delim = "\t", na = "\\N"),
       "supported"
     )
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson edited a comment on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson edited a comment on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757043963


   > @nealrichardson let me know if you'd also like to me to add text in these two places in the `dataset_factory` and `FileFormat` docs explaining that some readr-style options are not supported in `...`
   
   You could, but IMO a better (though not mutually exclusive) solution would be to use the error message to tell you that more clearly https://github.com/apache/arrow/pull/9143#discussion_r554241301. I don't assume people read help pages, at least not that closely.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554241301



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       If I read the code correctly, we're not distinguishing between arguments that are valid CsvReadOptions/CsvConvertOptions and just not supported in datasets yet (what ARROW-8631 is about) and plain garbage `asdfasdrfwea` options. Is that distinction worth making? I.e. is erroring with `Unsupported option: na` clear enough that we're not going to get another bug report about this?

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       Why does this need to be asserted? Because the validating code assumes it?

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       You don't have to use it here, but for reference we have an `oxford_paste` utility function for making human-readable, appropriately comma'd error message strings: https://github.com/apache/arrow/blob/master/r/R/util.R#L18

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       Could you add a comment here explaining how this catches ambiguous args? I don't use `pmatch` enough for this to be immediately apparent. 

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       Ok, that's what I figured. Could you please add a comment here explaining that?

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       I don't think we do, though I've definitely written various `pluralize` util functions in the past, and we could add one here (if a dependency doesn't already provide one; we should check `asserthat` and `rlang`)

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       Got it. So this works because we've already filtered out the case where there's no match in each list separately, so if there's no unambiguous match when together, it must mean that there were multiple matches.
   
   Can you add an inline comment to that effect? 

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Yeah, I think we need to distinguish in the error message between "your fault" and "our fault/not implemented"--what you as a user do with that information is very different.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757043963


   > @nealrichardson let me know if you'd also like to me to add text in these two places in the `dataset_factory` and `FileFormat` docs explaining that some readr-style options are not supported in `...`
   
   You could, but IMO a better (though not mutually exclusive) solution would be to use the error message to tell you that more clearly https://github.com/apache/arrow/pull/9143#discussion_r554241301
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r555112693



##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       For pluralization, I'll use `ngettext()` from base R




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757016760


   @nealrichardson let me know if you'd also like to me to add a note here in the `dataset_factory()` docs explaining that some readr-style options are not supported https://github.com/apache/arrow/blob/master/r/R/dataset-factory.R#L107-L110


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554280774



##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       I don't think we do, though I've definitely written various `pluralize` util functions in the past, and we could add one here (if a dependency doesn't already provide one; we should check `asserthat` and `rlang`)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook edited a comment on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook edited a comment on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757016760






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554241301



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       If I read the code correctly, we're not distinguishing between arguments that are valid CsvReadOptions/CsvConvertOptions and just not supported in datasets yet (what ARROW-8631 is about) and plain garbage `asdfasdrfwea` options. Is that distinction worth making? I.e. is erroring with `Unsupported option: na` clear enough that we're not going to get another bug report about this?

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       Why does this need to be asserted? Because the validating code assumes it?

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       You don't have to use it here, but for reference we have an `oxford_paste` utility function for making human-readable, appropriately comma'd error message strings: https://github.com/apache/arrow/blob/master/r/R/util.R#L18

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       Could you add a comment here explaining how this catches ambiguous args? I don't use `pmatch` enough for this to be immediately apparent. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757016760


   @nealrichardson let me know if you'd also like to me to add a note here in the `dataset_factory()` docs explaining that some readr-style options are not supported https://github.com/apache/arrow/blob/master/r/R/dataset-factory.R#L107-L110


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r555280727



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,68 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr parse options must be mutually exclusive, or else the code
+  # in `csv_file_format_parse_options()` will error or behave incorrectly. A
+  # failure of this test indicates that these two sets of option names are not
+  # mutually exclusive.
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With not yet supported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       I'd remove this `if`. Let the test fail if/when `na` is supported because we'll need to update it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554271178



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(

Review comment:
       The code in `csv_file_format_parse_options()` in `dataset-format.R` will error or behave incorrectly if any of the Arrow file parsing options (`delimiter`, `quoting`, etc.) happens to have exactly the same name as any of the supported readr-style options (`delim`, `quote`, etc.). Some future change might accidentally cause these two lists of option names to overlap, so I added this test which will fail if that happens.

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       `arrow_opts` is a vector of the names of the allowed arguments to `CsvParseOptions$create()` (Arrow-style arguments). `readr_opts` is a vector of the names of the allowed arguments to `readr_to_csv_parse_options()` (readr-style arguments).
   ```r
   arrow_opts
   ## [1] "delimiter"          "quoting"            "quote_char"         "double_quote"       "escaping"          
   ## [6] "escape_char"        "newlines_in_values" "ignore_empty_lines"
   
   readr_opts
   ## [1] "delim"            "quote"            "escape_double"    "escape_backslash" "skip_empty_rows"
   ```
   The function we're inside here (`csv_file_format_parse_options()`) allows _either_ of these sets of arguments. These two sets of argument names are mutually exclusive, but R's partial matching of argument names throws a wrench in that. For example, if someone shortens either `delimiter` or `delim` to just `del`, that would work fine in a function that accepts _only_ Arrow-style arguments or _only_ readr-style options, but here it creates ambiguity—we can't tell if the user is intending to specify Arrow-style arguments or readr-style arguments.
   
   ```r
   open_dataset("/path/to/csv/", format = "csv", del = ";")
   ## 💀
   ```
   So `pmatch()` to the rescue. `pmatch()`, when called like it is here, uses the same algorithm for partial matching that R uses to identify named arguments in function calls. `pmatch(x, y)` returns a vector of the same length as `x`, and in each position, the value will be `NA` if and only if the character string in that position in `x` _cannot_ be unambiguously matched to exactly one character string in `y`. So if there are any `NA` values in the vector returned by `pmatch()`, that means at least one of the argument names is ambiguous.

##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       Oh nice. Do we have any go-to functions for handling singular/plural (so that the "s" in "options" could be conditionally excluded/included)?

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Right. If you think it's worth it, I'll add some additional code here to get the full list of rlang options that `read_delim_arrow()`supports and compute a set difference with the names of the rlang options currently supported here. Then in the case where the user specifies one of those options, I'll give a more specific error message.

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Right. If you think it's worth it, I'll add some additional code here to get the full list of rlang options that `read_delim_arrow()` supports and compute a set difference with the names of the rlang options currently supported here. Then in the case where the user specifies one of those options, I'll give a more specific error message.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554277501



##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),
+         call. = FALSE)
+  }
+  is_ambig_opt <- is.na(pmatch(opt_names, c(arrow_opts, readr_opts)))

Review comment:
       `arrow_opts` is a vector of the names of the allowed arguments to `CsvParseOptions$create()` (Arrow-style arguments). `readr_opts` is a vector of the names of the allowed arguments to `readr_to_csv_parse_options()` (readr-style arguments).
   ```r
   arrow_opts
   ## [1] "delimiter"          "quoting"            "quote_char"         "double_quote"       "escaping"          
   ## [6] "escape_char"        "newlines_in_values" "ignore_empty_lines"
   
   readr_opts
   ## [1] "delim"            "quote"            "escape_double"    "escape_backslash" "skip_empty_rows"
   ```
   The function we're inside here (`csv_file_format_parse_options()`) allows _either_ of these sets of arguments. These two sets of argument names are mutually exclusive, but R's partial matching of argument names throws a wrench in that. For example, if someone shortens either `delimiter` or `delim` to just `del`, that would work fine in a function that accepts _only_ Arrow-style arguments or _only_ readr-style options, but here it creates ambiguity—we can't tell if the user is intending to specify Arrow-style arguments or readr-style arguments.
   
   ```r
   open_dataset("/path/to/csv/", format = "csv", del = ";")
   ## 💀
   ```
   So `pmatch()` to the rescue. `pmatch()`, when called like it is here, uses the same algorithm for partial matching that R uses to identify named arguments in function calls. `pmatch(x, y)` returns a vector of the same length as `x`, and in each position, the value will be `NA` if and only if the character string in that position in `x` _cannot_ be unambiguously matched to exactly one character string in `y`. So if there are any `NA` values in the vector returned by `pmatch()`, that means at least one of the argument names is ambiguous.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554441947



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Right. If you think it's worth it, I'll add some additional code here to get the full list of rlang options that `read_delim_arrow()` supports and compute a set difference with the names of the readr options currently supported here. Then in the case where the user specifies one of those options, I'll give a more specific error message.

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -303,11 +303,54 @@ test_that("Other text delimited dataset", {
       filter(integer > 6) %>%
       summarize(mean = mean(integer))
   )
+})
+
+test_that("readr parse options", {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+
+  # Arrow and readr options are mutually exclusive
+  expect_equal(
+    intersect(arrow_opts, readr_opts),
+    character(0)
+  )
 
-  # Now with readr option spelling (and omitting format = "text")
-  ds3 <- open_dataset(tsv_dir, partitioning = "part", delim = "\t")
+  # With unsupported readr parse options
+  # (remove this after ARROW-8631)
+  if (!"na" %in% readr_opts) {

Review comment:
       Right. If you think it's worth it, I'll add some additional code here to get the full list of readr options that `read_delim_arrow()` supports and compute a set difference with the names of the readr options currently supported here. Then in the case where the user specifies one of those options, I'll give a more specific error message.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook edited a comment on pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook edited a comment on pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#issuecomment-757016760


   @nealrichardson let me know if you'd also like to me to add text in these two places in the `dataset_factory` and `FileFormat` docs explaining that some readr-style options are not supported in `...`
   - https://github.com/apache/arrow/blob/master/r/R/dataset-factory.R#L107-L110
   - https://github.com/apache/arrow/blob/master/r/R/dataset-format.R#L43-L45


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #9143: ARROW-10463: [R] Better messaging for currently unsupported CSV options in open_dataset

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #9143:
URL: https://github.com/apache/arrow/pull/9143#discussion_r554278223



##########
File path: r/R/dataset-format.R
##########
@@ -104,9 +104,31 @@ CsvFileFormat$create <- function(..., opts = csv_file_format_parse_options(...))
 }
 
 csv_file_format_parse_options <- function(...) {
+  opt_names <- names(list(...))
   # Support both the readr spelling of options and the arrow spelling
-  readr_opts <- c("delim", "quote", "escape_double", "escape_backslash", "skip_empty_rows")
-  if (any(readr_opts %in% names(list(...)))) {
+  arrow_opts <- names(formals(CsvParseOptions$create))
+  readr_opts <- names(formals(readr_to_csv_parse_options))
+  is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
+  is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
+  bad_opts <- opt_names[!is_arrow_opt & !is_readr_opt]
+  if (length(bad_opts)) {
+    stop("Unsupported options: ",
+         paste(bad_opts, collapse = ", "),

Review comment:
       Oh nice. Do we have any go-to functions for handling singular/plural (so that the "s" in "options" could be conditionally excluded/included)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org