You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/08 15:52:53 UTC

[GitHub] [arrow] thisisnic opened a new pull request, #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

thisisnic opened a new pull request, #12839:
URL: https://github.com/apache/arrow/pull/12839

   As discussed on #12826 
   
   Not sure how (if) to write tests but tried running it locally using the CSV directory set up in `test-dataset-csv.R` with and without this change, and without it, we get, e.g.
   
   ```
   open_dataset(csv_dir)
   # Error in `handle_parquet_io_error()` at r/R/dataset.R:221:6:
   # ! Invalid: Error creating dataset. Could not read schema from '/tmp/RtmpuTyOD8/file5049dcf581a5/5/file1.csv': Could not open Parquet input source '/tmp/RtmpuTyOD8/file5049dcf581a5/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
   # /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:323  GetReader(source, scan_options). Is this a 'parquet' file?
   # /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40  InspectSchemas(std::move(options))
   # /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  Inspect(options.inspect_options)
   # ℹ Did you mean to specify a 'format' other than the default (parquet)?
   ```
   
   and then with it:
   
   ```
   open_dataset(csv_dir)
   # Error in `open_dataset()`:
   # ! Invalid: Error creating dataset. Could not read schema from '/tmp/RtmpLbqZs6/file4e4ca14fb5795/5/file1.csv': Could not open Parquet input source '/tmp/RtmpLbqZs6/file4e4ca14fb5795/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
   # /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:323  GetReader(source, scan_options). Is this a 'parquet' file?
   # /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40  InspectSchemas(std::move(options))
   # /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  Inspect(options.inspect_options)
   # ℹ Did you mean to specify a 'format' other than the default (parquet)?
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic closed pull request #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

Posted by GitBox <gi...@apache.org>.
thisisnic closed pull request #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing
URL: https://github.com/apache/arrow/pull/12839


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on a diff in pull request #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

Posted by GitBox <gi...@apache.org>.
thisisnic commented on code in PR #12839:
URL: https://github.com/apache/arrow/pull/12839#discussion_r847057232


##########
r/R/csv.R:
##########
@@ -200,8 +200,8 @@ read_delim_arrow <- function(file,
 
   tryCatch(
     tab <- reader$Read(),
-    error = function(e) {
-      handle_csv_read_error(e, schema)
+    error = function(e, call = caller_env(n = 4)) {

Review Comment:
   > Is it always n = 4?
   
   It's always `n = 4` here, though I deliberately chose to pass the `call` parameter into `handle_csv_read_error()` so the function could be used elsewhere in the code where we may want to pass in a different environment.  
   
   > Is there a more certain way to capture this? (Like, if you define call_env outside of tryCatch, is it just this env?)
   
   I could call `rlang::current_env()` above the `tryCatch` block - I went for calling `caller_env()` here as it felt "cleaner" to keep that code within this block here.
   
   I suppose that if the `tryCatch` block was changed to have more functions wrapped round it, then the number would be wrong; however, if we call `current_env()` outside of the block, we're unnecessarily calling it every time we call the function, even if there's no error.
   
   Not sure what's better - what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on a diff in pull request #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

Posted by GitBox <gi...@apache.org>.
thisisnic commented on code in PR #12839:
URL: https://github.com/apache/arrow/pull/12839#discussion_r847057232


##########
r/R/csv.R:
##########
@@ -200,8 +200,8 @@ read_delim_arrow <- function(file,
 
   tryCatch(
     tab <- reader$Read(),
-    error = function(e) {
-      handle_csv_read_error(e, schema)
+    error = function(e, call = caller_env(n = 4)) {

Review Comment:
   > Is it always n = 4?
   It's always `n = 4` here, though I deliberately chose to pass the `call` parameter into `handle_csv_read_error()` so the function could be used elsewhere in the code where we may want to pass in a different environment.  
   
   > Is there a more certain way to capture this? (Like, if you define call_env outside of tryCatch, is it just this env?)
   I could call `rlang::current_env()` above the `tryCatch` block - I went for calling `caller_env()` here as it felt "cleaner" to keep that code within this block here.
   
   I suppose that if the `tryCatch` block was changed to have more functions wrapped round it, then the number would be wrong; however, if we call `current_env()` outside of the block, we're unnecessarily calling it every time we call the function, even if there's no error.
   
   Not sure what's better - what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12839:
URL: https://github.com/apache/arrow/pull/12839#issuecomment-1099424302

   Benchmark runs are scheduled for baseline = 681ede6fcb81c102786261da1b193e0388387e77 and contender = 5d5ccebeed155f6d6ee10cefee3cb295fc300c85. 5d5ccebeed155f6d6ee10cefee3cb295fc300c85 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/8ee758fa1a4e4b5e8ac93080526695bc...70461192e0ff481e84eb9090015aa4e2/)
   [Finished :arrow_down:0.67% :arrow_up:0.08%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/5e3615ddf3ed415a96fb74f926165f30...b10dabec0ddc48d796c172d80802def9/)
   [Failed :arrow_down:0.36% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6a05c9ed57544f6b96c8a8c9c0e5d2a8...248e43e2373c4fd68445f91a826eb4dc/)
   [Finished :arrow_down:0.98% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c94e1afb3d1442a18c30681dc870ecea...c58bce7f47cb4858b252766f0c252b6f/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/504| `5d5ccebe` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/490| `5d5ccebe` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/488| `5d5ccebe` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/500| `5d5ccebe` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/503| `681ede6f` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/489| `681ede6f` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/490| `681ede6f` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/499| `681ede6f` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12839:
URL: https://github.com/apache/arrow/pull/12839#issuecomment-1093031258

   https://issues.apache.org/jira/browse/ARROW-16154


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on a diff in pull request #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #12839:
URL: https://github.com/apache/arrow/pull/12839#discussion_r846625210


##########
r/R/util.R:
##########
@@ -209,8 +209,7 @@ handle_csv_read_error <- function(e, schema) {
         "row, you should supply the argument `skip = 1` to prevent the",
         "header being read in as data."
       )
-    ))
+    ), call = call)
   }
-
-  abort(msg)
+  abort(conditionMessage(e), call = call)

Review Comment:
   (Comments go for both functions)
   
   We already have msg above so we can still use it.
   
   We could also edit the `if` above to do `msg <- c(msg, i = ...)` and just have one `abort()`, if you wanted to polish further. 
   
   ```suggestion
     abort(msg, call = call)
   ```



##########
r/R/csv.R:
##########
@@ -200,8 +200,8 @@ read_delim_arrow <- function(file,
 
   tryCatch(
     tab <- reader$Read(),
-    error = function(e) {
-      handle_csv_read_error(e, schema)
+    error = function(e, call = caller_env(n = 4)) {

Review Comment:
   Is it always `n = 4`? Is there a more certain way to capture this? (Like, if you define call_env outside of tryCatch, is it just *this* env?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12839:
URL: https://github.com/apache/arrow/pull/12839#issuecomment-1093031273

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on a diff in pull request #12839: ARROW-16154: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #12839:
URL: https://github.com/apache/arrow/pull/12839#discussion_r847673797


##########
r/R/csv.R:
##########
@@ -200,8 +200,8 @@ read_delim_arrow <- function(file,
 
   tryCatch(
     tab <- reader$Read(),
-    error = function(e) {
-      handle_csv_read_error(e, schema)
+    error = function(e, call = caller_env(n = 4)) {

Review Comment:
   Feels brittle but it's probably fine. I'd just leave in some comments explaining why `n = 4`, that you could have used `caller_env()` but this way is lazy/only does it if there's an error (aside: it's just calling parent.frame(), which on my machine takes in the hundreds of nanoseconds to run, so the cost of calling it every time is not something I'm concerned about).
   
   We can revisit later if/when we want to chain together multiple error handlers. Also looks like rlang is growing some experimental tooling around here (https://rlang.r-lib.org/reference/try_fetch.html) so maybe that will mature and be ready whenever we revisit this.
   
   In sum, seems like you've thought this through, so just leave a note explaining why this non-obvious thing is there and 👍 !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org