You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/08 17:40:54 UTC

[GitHub] [arrow] thisisnic opened a new pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

thisisnic opened a new pull request #9950:
URL: https://github.com/apache/arrow/pull/9950


   A couple of things I wanted to check are expected behaviour:
   
   1. If I specify in the schema that a numeric column should be a string column, I get the error `Error: Invalid: JSON parse error: Column(/third_col) changed from string to number in row 0`
    (e.g. if I run the following)
   ```
   tf <- tempfile()
   writeLines('
       { "hello": 3.5, "world": 2, "third_col": 99}
       { "hello": 3.25, "world": 5, "third_col": 98}
       { "hello": 3.125, "world": 8, "third_col": 97 }
       { "hello": 0.0, "world": 10, "third_col": 96}
   ', tf)
   read_json_arrow(tf, schema = schema(third_col = float64(), world = float64()))
   ```
   2. As can be seen in the tests output (will delete the `print` statements before this is merged), table columns are returned in the order specified in the schema and then the columns not mentioned in the schema.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #9950:
URL: https://github.com/apache/arrow/pull/9950


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#issuecomment-818893574


   > I've added a new test that triggers it but is captured with `expect_error()`.
   > 
   > What do you think, does this make sense? 
   
   If this is expected/documented behavior of the C++ library, I don't think we need a test for it at all in R. 
   
   > Or even it does, do I still need to raise a ticket around the formatting of the error with the "/" next to the column name?
   
   I think that's probably intended, though you could confirm. My guess is that's how the reader will handle named references of nested structs (suppose (or test for yourself) that "third_col" had a value like `{"subcol1": 4, "subcol2": "a"}`, and you mis-typed subcol2, it would probably say "/third_col/subcol2" like a file path. If that's *not* the case, then it seems like a bug. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r610809769



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       My bad, I thought I saw that you copied it from the other function (because I noticed the `TODO(fsaintjacques)` duplicated in the diff) but maybe it was just next to what you were doing




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r610738117



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       I gave this a try, but when I call `JsonParseOptions$create`, it results in the following error:
   ` Error: Invalid R object for std::shared_ptr<arrow::Schema>, must be an ArrowObject` 
   
   I am guessing it's because the code below passes in `NULL` as the schema argument, and even with `explicit_schema` defaulting to `nullptr`, that default value is never used?
   ```
   JsonParseOptions$create <- function(newlines_in_values = FALSE, schema = NULL) {
       json___ParseOptions__initialize(newlines_in_values, schema)
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r610768156



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       Cool, thanks. I'm confused about why this seems to work in a couple of the `dataset___FileSystemDatasetFactory__Make*` functions in `dataset.cpp` but it doesn't work here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r610780626



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       R's `NULL` is not the same as `nullptr`. It works in dataset___FileSystemDatasetFactory__Make2 because Make1 calls it with nullptr. (The nullptr check you just added in Make3 will never be nullptr (nor `NULL`) because it only is called from R if the input is a `Partitioning` object.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r611809776



##########
File path: r/tests/testthat/test-json.R
##########
@@ -86,6 +86,89 @@ test_that("read_json_arrow() supports col_select=", {
   expect_equal(names(tab2), c("hello", "world"))
 })
 
+test_that("read_json_arrow(schema=) with empty schema", {
+  tf <- tempfile()
+  writeLines('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ', tf)
+  
+  tab1 <- read_json_arrow(tf, schema = schema())
+  
+  expect_identical(
+    tab1, 
+    tibble::tibble(
+      hello = c(3.5, 3.25, 3.125, 0),
+      world = c(2L, 5L, 8L, 10L),
+      third_col = c(99L,98L,97L,96L)
+    )               
+  )
+})
+
+test_that("read_json_arrow(schema=) with partial schema", {
+  tf <- tempfile()
+  writeLines('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ', tf)
+  
+  tab1 <- read_json_arrow(tf, schema = schema(third_col = float64(), world = float64()))
+  
+  print("input:")
+  print("schema:")
+  print(schema(third_col = float64(), world = float64()))
+  print('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ')
+  print("output:")
+  print(tab1)
+  

Review comment:
       Yep, @jonkeane suggested I pop that in there to make it clear to whoever was reviewing what was happening as there was what looked like unusual behaviour when I was writing tests.  @nealrichardson  - to double check, is the column reordering mentioned in the comment at the very top expected behaviour?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r611809776



##########
File path: r/tests/testthat/test-json.R
##########
@@ -86,6 +86,89 @@ test_that("read_json_arrow() supports col_select=", {
   expect_equal(names(tab2), c("hello", "world"))
 })
 
+test_that("read_json_arrow(schema=) with empty schema", {
+  tf <- tempfile()
+  writeLines('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ', tf)
+  
+  tab1 <- read_json_arrow(tf, schema = schema())
+  
+  expect_identical(
+    tab1, 
+    tibble::tibble(
+      hello = c(3.5, 3.25, 3.125, 0),
+      world = c(2L, 5L, 8L, 10L),
+      third_col = c(99L,98L,97L,96L)
+    )               
+  )
+})
+
+test_that("read_json_arrow(schema=) with partial schema", {
+  tf <- tempfile()
+  writeLines('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ', tf)
+  
+  tab1 <- read_json_arrow(tf, schema = schema(third_col = float64(), world = float64()))
+  
+  print("input:")
+  print("schema:")
+  print(schema(third_col = float64(), world = float64()))
+  print('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ')
+  print("output:")
+  print(tab1)
+  

Review comment:
       Yep, will delete, @jonkeane suggested I pop that in there to make it print out in the CI to make it clear to whoever was reviewing what was happening as there was what looked like unusual behaviour when I was writing tests.  @nealrichardson  - to double check, is the column reordering mentioned in the comment at the very top expected behaviour?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r611805539



##########
File path: r/R/json.R
##########
@@ -20,6 +20,8 @@
 #' Using [JsonTableReader]
 #'
 #' @inheritParams read_delim_arrow
+#' @param schema [Schema] that describes the table.  If provided, it will be
+#' used to satisfy both `col_names` and `col_types`.

Review comment:
       If this were the right docs to include here, you wouldn't need to add them--`@inheritParams read_delim_arrow` would pull it in for you. But since there are no `col_names` and `col_types` arguments to this function, we should specify something different (which will override what inheritParams brings in).
   
   ```suggestion
   #' @param schema [Schema] that describes the column names and data types in the table.
   ```

##########
File path: r/tests/testthat/test-json.R
##########
@@ -86,6 +86,89 @@ test_that("read_json_arrow() supports col_select=", {
   expect_equal(names(tab2), c("hello", "world"))
 })
 
+test_that("read_json_arrow(schema=) with empty schema", {
+  tf <- tempfile()
+  writeLines('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ', tf)
+  
+  tab1 <- read_json_arrow(tf, schema = schema())
+  
+  expect_identical(
+    tab1, 
+    tibble::tibble(
+      hello = c(3.5, 3.25, 3.125, 0),
+      world = c(2L, 5L, 8L, 10L),
+      third_col = c(99L,98L,97L,96L)
+    )               
+  )
+})
+
+test_that("read_json_arrow(schema=) with partial schema", {
+  tf <- tempfile()
+  writeLines('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ', tf)
+  
+  tab1 <- read_json_arrow(tf, schema = schema(third_col = float64(), world = float64()))
+  
+  print("input:")
+  print("schema:")
+  print(schema(third_col = float64(), world = float64()))
+  print('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ')
+  print("output:")
+  print(tab1)
+  

Review comment:
       This was just for debugging right?
   
   ```suggestion
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
thisisnic commented on pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#issuecomment-819317935


   @nealrichardson I gave your suggestion a try and it printed as you said, so it's not a bug.  I have also removed the redundant test now.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson edited a comment on pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson edited a comment on pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#issuecomment-818893574


   > I've added a new test that triggers it but is captured with `expect_error()`.
   > 
   > What do you think, does this make sense? 
   
   If this is expected/documented behavior of the C++ library, I don't think we need a test for it at all in R. 
   
   > Or even it does, do I still need to raise a ticket around the formatting of the error with the "/" next to the column name?
   
   I think that's probably intended, though you could confirm. My guess is that's how the reader will handle named references of nested structs: suppose (or test for yourself) that "third_col" had a value like `{"subcol1": 4, "subcol2": "a"}`, and you mis-typed subcol2, it would probably say "/third_col/subcol2" like a file path. If that's *not* the case, then it seems like a bug. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#issuecomment-818032304


   > Am I remembering right that we also ran into an issue where attempting to read a float/integer as a string failed with an odd error message?
   
   I see that now in the PR description. I recommend writing a C++ JIRA for that, adding a test here that would trigger it, and then `skip("ARROW-12XXX: Error: Invalid: JSON parse error: Column(/third_col) changed from string to number in row 0")`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r611408611



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       Useful stuff to me to learn, thanks for flagging this up and having this discussion!  Just to check where this is, off the back of the discussion above, it is fine to leave it as it is currently, or shall I make some more changes?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r611804918



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       It's good.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r611857473



##########
File path: r/tests/testthat/test-json.R
##########
@@ -86,6 +86,89 @@ test_that("read_json_arrow() supports col_select=", {
   expect_equal(names(tab2), c("hello", "world"))
 })
 
+test_that("read_json_arrow(schema=) with empty schema", {
+  tf <- tempfile()
+  writeLines('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ', tf)
+  
+  tab1 <- read_json_arrow(tf, schema = schema())
+  
+  expect_identical(
+    tab1, 
+    tibble::tibble(
+      hello = c(3.5, 3.25, 3.125, 0),
+      world = c(2L, 5L, 8L, 10L),
+      third_col = c(99L,98L,97L,96L)
+    )               
+  )
+})
+
+test_that("read_json_arrow(schema=) with partial schema", {
+  tf <- tempfile()
+  writeLines('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ', tf)
+  
+  tab1 <- read_json_arrow(tf, schema = schema(third_col = float64(), world = float64()))
+  
+  print("input:")
+  print("schema:")
+  print(schema(third_col = float64(), world = float64()))
+  print('
+    { "hello": 3.5, "world": 2, "third_col": 99}
+    { "hello": 3.25, "world": 5, "third_col": 98}
+    { "hello": 3.125, "world": 8, "third_col": 97 }
+    { "hello": 0.0, "world": 10, "third_col": 96}
+  ')
+  print("output:")
+  print(tab1)
+  

Review comment:
       It wasn't the first thing I would expect but (1) IIUC in JSON the `{}` is not ordered, so it's not surprising that providing a schema would mean that the order comes from the schema; and (2) from the perspective of this PR, it doesn't matter much I don't think because you're just adding R bindings to a feature in the C++ library--if we think the C++ library is suboptimal we can open a new JIRA for that, but seems out of scope here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#issuecomment-816854949


   Sorry @thisisnic, I just merged something that means you need to rebase and re-generate `arrowExports.cpp` (which should happen when you install the package and test anyway). Due to the nature of the change I made, you will have a conflict on that file. Don't worry about it, it's autogenerated. Don't bother to resolve that conflict, just close out of whatever diff resolution tool that `git` opens, re-generate `arrowExports.cpp`, and then do `git add r/src/arrowExports.cpp && git rebase --continue` to proceed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#issuecomment-816188615


   https://issues.apache.org/jira/browse/ARROW-11468


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r610738117



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       I gave this a try and implemented it as a single function, but when I call `JsonParseOptions$create`, it results in the following error:
   ` Error: Invalid R object for std::shared_ptr<arrow::Schema>, must be an ArrowObject` 
   
   I am guessing it's because the code below passes in `NULL` as the schema argument, and even with `explicit_schema` defaulting to `nullptr`, that default value is never used?
   ```
   JsonParseOptions$create <- function(newlines_in_values = FALSE, schema = NULL) {
       json___ParseOptions__initialize(newlines_in_values, schema)
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
thisisnic commented on pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#issuecomment-817591944


   > Sorry @thisisnic, I just merged something that means you need to rebase and re-generate `arrowExports.cpp` (which should happen when you install the package and test anyway). Due to the nature of the change I made, you will have a conflict on that file. Don't worry about it, it's autogenerated. Don't bother to resolve that conflict, just close out of whatever diff resolution tool that `git` opens, re-generate `arrowExports.cpp`, and then do `git add r/src/arrowExports.cpp && git rebase --continue` to proceed.
   
   Thanks for the heads up @nealrichardson 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r610777524



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       It might be the way I've implemented it; I'll check there and see how they compare.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r610652455



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       Possibly silly question because I'm a C++ noob:
   Would it work to keep this as just one function `json___ParseOptions__initialize` with `explicit_schema` defaulting to `nullptr` and then do
   ```cpp
   if (explicit_schema != nullptr) {
     res->explicit_schema = explicit_schema;
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
jonkeane commented on pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#issuecomment-817991737


   Am I remembering right that we also ran into an issue where attempting to read a float/integer as a string failed with an odd error message?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#discussion_r610806160



##########
File path: r/src/json.cpp
##########
@@ -31,14 +31,24 @@ std::shared_ptr<arrow::json::ReadOptions> json___ReadOptions__initialize(bool us
 }
 
 // [[arrow::export]]
-std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize(
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize1(
     bool newlines_in_values) {
   auto res =
       std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
   res->newlines_in_values = newlines_in_values;
   return res;
 }
 
+// [[arrow::export]]
+std::shared_ptr<arrow::json::ParseOptions> json___ParseOptions__initialize2(
+    bool newlines_in_values, const std::shared_ptr<arrow::Schema>& explicit_schema) {
+  auto res =
+    std::make_shared<arrow::json::ParseOptions>(arrow::json::ParseOptions::Defaults());
+  res->newlines_in_values = newlines_in_values;
+  res->explicit_schema = explicit_schema;
+  return res;
+}
+

Review comment:
       Gotcha. FTR, I did not add that `nullptr` check in `Make3` 🙂




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] thisisnic commented on pull request #9950: ARROW-11468: [R] Allow user to pass schema to read_json_arrow()

Posted by GitBox <gi...@apache.org>.
thisisnic commented on pull request #9950:
URL: https://github.com/apache/arrow/pull/9950#issuecomment-818621376


   > > Am I remembering right that we also ran into an issue where attempting to read a float/integer as a string failed with an odd error message?
   > 
   > I see that now in the PR description. I recommend writing a C++ JIRA for that, adding a test here that would trigger it, and then `skip("ARROW-12XXX: Error: Invalid: JSON parse error: Column(/third_col) changed from string to number in row 0")`
   
   I've had a bit more of a read of the JSON C++ docs and I'm now thinking this is expected behaviour.  The error is only triggered when the values are not quoted in the JSON, and the table on this page (https://arrow.apache.org/docs/cpp/json.html#id1) shows that fields of JSON value type "Number" only have allowed Arrow data types "All Integer types, Float32, Float64, Date32, Date64, Time32, Time64".  I've added a new test that triggers it but is captured with `expect_error()`.
   
   What do you think, does this make sense? Or even it does, do I still need to raise a ticket around the formatting of the error with the "/" next to the column name?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org