You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/14 18:32:29 UTC

[GitHub] [arrow] thisisnic opened a new pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

thisisnic opened a new pull request #10326:
URL: https://github.com/apache/arrow/pull/10326


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r640730474



##########
File path: r/R/util.R
##########
@@ -110,3 +110,12 @@ handle_embedded_nul_error <- function(e) {
   }
   stop(e)
 }
+
+handle_parquet_io_error <- function(e, format) {
+  msg <- conditionMessage(e)
+  if (grepl("Parquet magic bytes not found in footer", msg) && is.null(format)) {
+      e$message <- paste0(msg, "\nDid you mean to specify a 'format' other than the default (parquet)?")
+  }
+  stop(e)

Review comment:
       Makes sense; done now, just waiting for CI




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634591911



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       Oh, I think that's totally fine to have a message like this after that. The helpfulness might be a bit of a distraction, but I think most people who get this error message will either be in need of the helpfulness or be knowledgeable enough about their underlying data to not get too distracted by it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634387573



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)
+      } else {
+        stop(e)

Review comment:
       Does this `stop()` return the call trace now?

##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       I can see arguments on both sides of this: would it be helpful to include the original error message as well as our new nicer one? 
   
   On the one hand: it's more content and content that some (most?) won't have the context to know what to do with, but on the other it makes clear what the message is form the c++ if someone is familiar with that or does have some reason to suspect that the parquet files are parquet, but were mangled it would tell them that a bit more clearly. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634482956



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)
+      } else {
+        stop(e)

Review comment:
       I think so? When I click "Show Traceback" in RStudio, I get this;
   
   ```
   > ds <- open_dataset(csv_dir, partitioning = "part")
    Error: Looks like these are not parquet files, did you mean to specify a 'format'? 
   6. stop("Looks like these are not parquet files, did you mean to specify a 'format'?", 
       call. = FALSE) at dataset.R#103
   5. value[[3L]](cond) 
   4. tryCatchOne(expr, names, parentenv, handlers[[1L]]) 
   3. tryCatchList(expr, classes, parentenv, handlers) 
   2. tryCatch(factory$Finish(schema, isTRUE(unify_schemas)), error = function(e) {
       msg <- conditionMessage(e)
       if (grep("Parquet magic bytes not found in footer", msg)) {
           stop("Looks like these are not parquet files, did you mean to specify a 'format'?",  ... at dataset.R#97
   1. open_dataset(csv_dir, partitioning = "part") 
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634594017



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       So as an exploratory question, "If this message is universally better than the C++ message should the C++ message be changed?  Why or why not?"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634597442



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       The error isn't wrong, it's just not informative at this higher layer. "Invalid parquet magic bytes" is meaningful if you yourself called read_parquet(file), but in this case, you weren't intending to read a parquet file because they aren't parquet files. 

##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       The error isn't wrong, it's just not informative at this particular higher layer. "Invalid parquet magic bytes" is meaningful if you yourself called read_parquet(file), but in this case, you weren't intending to read a parquet file because they aren't parquet files. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r638438195



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,11 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = handle_parquet_io_error

Review comment:
       One last thought: since we're in `open_dataset()` here, we know whether the user has specified a format or not. "Did you mean to specify a 'format'?" may be a good message if `format` is not specified, but maybe we should give a different message if you specified `format = "parquet"` (i.e., there is a non-parquet file in what you think is all parquet files).
   
   Also maybe the default message should indicate that parquet is the default since that may not be obvious, something like "Did you mean to specify a 'format' other than the default (parquet)?" perhaps?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #10326:
URL: https://github.com/apache/arrow/pull/10326


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r640963069



##########
File path: r/R/util.R
##########
@@ -110,3 +110,15 @@ handle_embedded_nul_error <- function(e) {
   }
   stop(e)
 }
+
+handle_parquet_io_error <- function(e, format) {

Review comment:
       You aren't using `format` in here anymore--shouldn't you? Assuming you delete the format handling I suggested above, in here you will know that `format` is default if `is.character(format) && length(format) > 1` (i.e. `c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text")`)

##########
File path: r/R/dataset.R
##########
@@ -92,9 +105,19 @@ open_dataset <- function(sources,
     })
     return(dataset___UnionDataset__create(sources, schema))
   }
-  factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  if (is.character(format)) {
+    format <- FileFormat$create(match.arg(format), ...)
+  } else {
+    assert_is(format, "FileFormat")
+  }

Review comment:
       Is this necessary? Won't DatasetFactory$create() handle this for you?
   
   ```suggestion
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634604005



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       @nealrichardson is 💯 on that, it's also a slightly different audience than that C++ error is targeted at




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#issuecomment-841447425


   https://issues.apache.org/jira/browse/ARROW-12791


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634743196



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       I can understand where you're coming from @westonpace - I definitely don't think we should necessarily have error translation for every single C++ error in the package or make that into a pattern we use.
   
   Just to provide a little more context, the reason for opening the original ticket was that when there is a call to `open_dataset()`, if the format is anything other than "parquet", the user needs to pass that format through as an optional argument, but it's not immediately clear from looking at the docs for `open_dataset()` that this is the case.  I'm going to add some examples to the documentation that show how to open files of different formats with `open_dataset()`, but the info in the error is intended as an extra hint in case the behaviour is unintuitive.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r638786673



##########
File path: r/R/util.R
##########
@@ -110,3 +110,12 @@ handle_embedded_nul_error <- function(e) {
   }
   stop(e)
 }
+
+handle_parquet_io_error <- function(e, format) {
+  msg <- conditionMessage(e)
+  if (grepl("Parquet magic bytes not found in footer", msg) && is.null(format)) {
+      e$message <- paste0(msg, "\nDid you mean to specify a 'format' other than the default (parquet)?")
+  }
+  stop(e)

Review comment:
       That sounds like a good idea, but is it complicated by the fact that `open_dataset` can also take `Dataset` objects for the parameter `sources` and not just files?  If we were to add `format` as a parameter, if we keep the default value of `parquet`, does that feel weird if we're working with `Dataset` objects and not files? (Or could we just do it and mention in the docs?) And if we change `format` to be `NULL` by default, then that's quite a big breaking change, I guess.
   
   I'm not sure what would be a good solution here or what the convention is.  What are your thoughts, @romainfrancois ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r638745701



##########
File path: r/R/util.R
##########
@@ -110,3 +110,12 @@ handle_embedded_nul_error <- function(e) {
   }
   stop(e)
 }
+
+handle_parquet_io_error <- function(e, format) {
+  msg <- conditionMessage(e)
+  if (grepl("Parquet magic bytes not found in footer", msg) && is.null(format)) {
+      e$message <- paste0(msg, "\nDid you mean to specify a 'format' other than the default (parquet)?")
+  }
+  stop(e)

Review comment:
       I think it's also a good place for: 
   
   ```
   abort(c(
     msg, 
     i = "Did you mean to specify a 'format' other than the default (parquet)?"
   ))
   ```
   
   Since this refers to `format`, would it be useful to add it as the argument list of `open_dataset()`, then I suppose we wouldn't need 
   
   ```r
   args <- list2(...)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#issuecomment-843578991


   > [ace2bfc](https://github.com/apache/arrow/commit/ace2bfc160e049b4edc7ffb55081cfb7210d6e43) and [aa27b4c](https://github.com/apache/arrow/commit/aa27b4c71c61e7264974c5eb076d0241ed1ddfe0) can be a model for your solution here
   
   Thanks, @nealrichardson, have now updated the PR to follow this pattern!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r645674370



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1502,3 +1502,12 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12791
+test_that("Error if no format specified and files are not parquet", {
+  skip_if_not_available("parquet")
+  expect_error(
+    open_dataset(csv_dir, partitioning = "part"),
+    regexp = "Did you mean to specify a 'format'? other than the default (parquet)?"

Review comment:
       ```suggestion
       "Did you mean to specify a 'format' other than the default (parquet)?",
       fixed = TRUE
   ```

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -1502,3 +1502,12 @@ test_that("Collecting zero columns from a dataset doesn't return entire dataset"
     c(32, 0)
   )
 })
+
+# see https://issues.apache.org/jira/browse/ARROW-12791
+test_that("Error if no format specified and files are not parquet", {
+  skip_if_not_available("parquet")
+  expect_error(
+    open_dataset(csv_dir, partitioning = "part"),
+    regexp = "Did you mean to specify a 'format'? other than the default (parquet)?"
+  )
+})

Review comment:
       Technically, for completeness, we should test that the extra error message *isn't* shown if you explicitly provided `format = "parquet"`. Maybe this would work?
   
   ```suggestion
     expect_failure(
       expect_error(
         open_dataset(csv_dir, partitioning = "part", format = "parquet"),
         "Did you mean to specify a 'format'"
       )
     )
   })
   ```

##########
File path: r/R/util.R
##########
@@ -110,3 +110,15 @@ handle_embedded_nul_error <- function(e) {
   }
   stop(e)
 }
+
+handle_parquet_io_error <- function(e, format) {
+  msg <- conditionMessage(e)
+  if (grepl("Parquet magic bytes not found in footer", msg) && length(format) > 1 && is_character(format)) {
+    abort(c(

Review comment:
       ```suggestion
       # If length(format) > 1, that means it is (almost certainly) the default/not specified value
       # so let the user know that they should specify the actual (not parquet) format
       abort(c(
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634492852



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)
+      } else {
+        stop(e)

Review comment:
       Ah, I was thinking about the line 105 stop. I'm not certain if passing `e` there will have the same behavior as it would now (for errors that aren't this parquet issue, like providing a non-existent directory). Currently (on the master branch) that looks like:
   
   ```
   > ds <- open_dataset("~/does/not/exist") 
   Error: IOError: Cannot list directory '/Users/jkeane/does/not/exist'. Detail: [errno 2] No such file or directory
   ```
   
   But with this change it might be something like the following without also adding `call. = FALSE` here too (though I'm not certain of that!)
   
   ```
   Error in open_dataset() : Error: IOError: Cannot list directory '/Users/jkeane/does/not/exist'. Detail: [errno 2] No such file or directory
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r635436898



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)
+      } else {
+        stop(e)

Review comment:
       Now I've updated the error to match the patterns mentioned by @nealrichardson below, so I think that resolves this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r643448875



##########
File path: r/R/util.R
##########
@@ -110,3 +110,15 @@ handle_embedded_nul_error <- function(e) {
   }
   stop(e)
 }
+
+handle_parquet_io_error <- function(e, format) {

Review comment:
       Good point!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r639742821



##########
File path: r/R/util.R
##########
@@ -110,3 +110,12 @@ handle_embedded_nul_error <- function(e) {
   }
   stop(e)
 }
+
+handle_parquet_io_error <- function(e, format) {
+  msg <- conditionMessage(e)
+  if (grepl("Parquet magic bytes not found in footer", msg) && is.null(format)) {
+      e$message <- paste0(msg, "\nDid you mean to specify a 'format' other than the default (parquet)?")
+  }
+  stop(e)

Review comment:
       Agree that `abort(..., i = helpful_message)` is a good idea.
   
   It probably is better to add `format` as an arg to `open_dataset()`, not because we need it there--it is just passed through--but it would help make it more explicit in the docs that it exists, what the default is, and what the supported values are (via match.arg). I don't think it's a problem that some kinds of inputs to open_dataset() wouldn't use the format argument: you can document that it is ignored unless `sources` is a string/character vector. And you wouldn't be making a breaking change or changing the defaults, you'd just be duplicating the `match.arg()` handling from DatasetFactory$create(). (This is the only downside I see, the duplication of that handling, but it's probably worth it on balance.)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634528056



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       I actually think "just" the message you have here is better than the status quo (and better than docs). But I think you could `paste` together the two messages and be good if you wanted to surface both?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#issuecomment-843344849


   https://github.com/apache/arrow/commit/ace2bfc160e049b4edc7ffb55081cfb7210d6e43 and https://github.com/apache/arrow/commit/aa27b4c71c61e7264974c5eb076d0241ed1ddfe0 can be a model for your solution here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634581473



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       Oh, as you mentioned in your initial comment: if the files are Parquet files but mangled.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634570064



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       I don't like the C++ error messages being exposed to the end user, so I'm with you on that.  I'm now a bit concerned that the error can be triggered by multiple things and it's not easy to work out which - and I don't want the error message to mislead the user.  I'll have a think about if there's a rephrasing that fixes this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634572440



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       Hmmm, what are the other things that trigger it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634525212



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       Ergh, you make a good point.  I'm tempted to abandon this PR and just solve this via documentation as I don't think there's a good solution in the code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634748152



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       I've added ARROW-12827 to cleanup the C++ exceptions at some point but I agree that this PR could still be very viable on its own.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r638526031



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,11 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = handle_parquet_io_error

Review comment:
       I like this suggestion as it helps make the error message a lot more specific.  I've now updated the function so that it gives your updated error message (i.e. mentioning that parquet is the default) if the format is NULL (i.e. hasn't been specified by the user).  
   
   I've tested the code with a directory containing a mix of file formats and the C++ error looks like this: 
   `Error: IOError: Could not open parquet input source '/tmp/RtmpLi0d7E/filefa0e499bb459/file1.txt': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.`
   
   Given that the C++ error message above is pretty informative - it mentions both the name of the file which has caused the error and the likely source of the error - I haven't implemented anything that changes the error output in this case, as I think this is sufficient.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10326: ARROW-12791: [R] Better error handling for DatasetFactory$Finish() when no format specified

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10326:
URL: https://github.com/apache/arrow/pull/10326#discussion_r634619864



##########
File path: r/R/dataset.R
##########
@@ -93,8 +93,19 @@ open_dataset <- function(sources,
     return(dataset___UnionDataset__create(sources, schema))
   }
   factory <- DatasetFactory$create(sources, partitioning = partitioning, ...)
-  # Default is _not_ to inspect/unify schemas
-  factory$Finish(schema, isTRUE(unify_schemas))
+  
+  tryCatch(
+    # Default is _not_ to inspect/unify schemas
+    factory$Finish(schema, isTRUE(unify_schemas)),
+    error = function (e) {
+      msg <- conditionMessage(e)
+      if(grep("Parquet magic bytes not found in footer", msg)){
+        stop("Looks like these are not parquet files, did you mean to specify a 'format'?", call. = FALSE)

Review comment:
       It seems at least there is some room for improvement.  The transition from `read_parquet` to `open_dataset` is at least partially in C++ as well (you are calling `DatasetFactory::Finish` here).  Although the terms are slightly different `e.g. fragments vs sources`.  So I think we could update the C++ error to something like "Dataset creation failed.  The fragment <fragment-to-string> did not match the expected <format> format: <child_error>"
   
   Concretely, "Dataset creation failed.  The fragment '/2019/July/myfile.csv' did not match the expected 'parquet'' format: Parquet magic bytes not found in footer".
   
   I worry this kind of error translation is a slippery slope.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org