You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/14 22:30:25 UTC

[GitHub] [arrow] wjones127 opened a new pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

wjones127 opened a new pull request #12629:
URL: https://github.com/apache/arrow/pull/12629


   The C++ `Dataset->ReplaceSchema()` method returns a new dataset, but we were calling it as if it mutated the original object. I'm moving this to a new method and adding a test for this behavior.
   
   I'm also getting another failure in testing UnionDataset schema unification; something about metadata not matching...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#discussion_r830261151



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -544,6 +557,43 @@ test_that("Creating UnionDataset", {
   expect_error(c(ds1, 42), "character")
 })
 
+test_that("UnionDataset can merge schemas", {
+  sub_df1 <- Table$create(
+    x = Array$create(c(1, 2, 3)),
+    y = Array$create(c("a", "b", "c"))
+  )
+  sub_df2 <- Table$create(
+    x = Array$create(c(4, 5)),
+    z = Array$create(c("d", "e"))
+  )
+
+  path1 <- make_temp_dir()
+  path2 <- make_temp_dir()
+  write_dataset(sub_df1, path1, format = "parquet")
+  write_dataset(sub_df2, path2, format = "parquet")
+
+  ds1 <- open_dataset(path1, format = "parquet")
+  ds2 <- open_dataset(path2, format = "parquet")
+
+  ds <- c(ds1, ds2)
+  actual <- ds %>%
+    collect() %>%
+    arrange(x)
+  expect_equal(
+    actual,
+    union_all(as_tibble(sub_df1), as_tibble(sub_df2))
+  )

Review comment:
       I can see argument for either side of this: might we want to assert the actual column names? I know that `union_all` preserves all the columns from both, but it might be nice to have that super obvious that's what's happening in the test

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -544,6 +557,43 @@ test_that("Creating UnionDataset", {
   expect_error(c(ds1, 42), "character")
 })
 
+test_that("UnionDataset can merge schemas", {
+  sub_df1 <- Table$create(
+    x = Array$create(c(1, 2, 3)),
+    y = Array$create(c("a", "b", "c"))
+  )
+  sub_df2 <- Table$create(
+    x = Array$create(c(4, 5)),
+    z = Array$create(c("d", "e"))
+  )
+
+  path1 <- make_temp_dir()
+  path2 <- make_temp_dir()
+  write_dataset(sub_df1, path1, format = "parquet")
+  write_dataset(sub_df2, path2, format = "parquet")
+
+  ds1 <- open_dataset(path1, format = "parquet")
+  ds2 <- open_dataset(path2, format = "parquet")
+
+  ds <- c(ds1, ds2)
+  actual <- ds %>%
+    collect() %>%
+    arrange(x)
+  expect_equal(
+    actual,
+    union_all(as_tibble(sub_df1), as_tibble(sub_df2))
+  )
+
+  # without unifying schemas, takes the first schema

Review comment:
       ```suggestion
     # without unifying schemas, takes the first schema and discards any columns in the second which aren't in the first
   ```
   
   A bit more descriptive comment (it took a second to see what was going on here when reading it)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane closed pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

Posted by GitBox <gi...@apache.org>.
jonkeane closed pull request #12629:
URL: https://github.com/apache/arrow/pull/12629


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#issuecomment-1072831914


   Benchmark runs are scheduled for baseline = a7f91ec5565519a3071b90c247564cd97e34e140 and contender = ae93d12d40d41d26abb8034cbdfb11d233d29402. ae93d12d40d41d26abb8034cbdfb11d233d29402 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e81285e54ac4d539b29b9c56bae4263...619bbdce0209412e882d030bab8f2a22/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/bf45ae2fdaf34107abd6ca6ecfbe9f8b...b9c6d4e726d04447b70b418f4cc3d86a/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/560ae6ae38da48aabab7f2f817248b33...9deb95f35b0e476ea0d4ad4e046233d0/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f54e8b9a923546caadba2660bf308448...908163fdb0ed45d38b6f885f83d18990/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#discussion_r830328015



##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -544,6 +557,43 @@ test_that("Creating UnionDataset", {
   expect_error(c(ds1, 42), "character")
 })
 
+test_that("UnionDataset can merge schemas", {
+  sub_df1 <- Table$create(
+    x = Array$create(c(1, 2, 3)),
+    y = Array$create(c("a", "b", "c"))
+  )
+  sub_df2 <- Table$create(
+    x = Array$create(c(4, 5)),
+    z = Array$create(c("d", "e"))
+  )
+
+  path1 <- make_temp_dir()
+  path2 <- make_temp_dir()
+  write_dataset(sub_df1, path1, format = "parquet")
+  write_dataset(sub_df2, path2, format = "parquet")
+
+  ds1 <- open_dataset(path1, format = "parquet")
+  ds2 <- open_dataset(path2, format = "parquet")
+
+  ds <- c(ds1, ds2)
+  actual <- ds %>%
+    collect() %>%
+    arrange(x)
+  expect_equal(
+    actual,
+    union_all(as_tibble(sub_df1), as_tibble(sub_df2))
+  )

Review comment:
       I'm all for making the tests more obvious! 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#issuecomment-1072599443


   cc @jonkeane 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#issuecomment-1072831914


   Benchmark runs are scheduled for baseline = a7f91ec5565519a3071b90c247564cd97e34e140 and contender = ae93d12d40d41d26abb8034cbdfb11d233d29402. ae93d12d40d41d26abb8034cbdfb11d233d29402 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e81285e54ac4d539b29b9c56bae4263...619bbdce0209412e882d030bab8f2a22/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/bf45ae2fdaf34107abd6ca6ecfbe9f8b...b9c6d4e726d04447b70b418f4cc3d86a/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/560ae6ae38da48aabab7f2f817248b33...9deb95f35b0e476ea0d4ad4e046233d0/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f54e8b9a923546caadba2660bf308448...908163fdb0ed45d38b6f885f83d18990/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#issuecomment-1072831914


   Benchmark runs are scheduled for baseline = a7f91ec5565519a3071b90c247564cd97e34e140 and contender = ae93d12d40d41d26abb8034cbdfb11d233d29402. ae93d12d40d41d26abb8034cbdfb11d233d29402 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e81285e54ac4d539b29b9c56bae4263...619bbdce0209412e882d030bab8f2a22/)
   [Finished :arrow_down:0.34% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/bf45ae2fdaf34107abd6ca6ecfbe9f8b...b9c6d4e726d04447b70b418f4cc3d86a/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/560ae6ae38da48aabab7f2f817248b33...9deb95f35b0e476ea0d4ad4e046233d0/)
   [Finished :arrow_down:0.72% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f54e8b9a923546caadba2660bf308448...908163fdb0ed45d38b6f885f83d18990/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#issuecomment-1072831914


   Benchmark runs are scheduled for baseline = a7f91ec5565519a3071b90c247564cd97e34e140 and contender = ae93d12d40d41d26abb8034cbdfb11d233d29402. ae93d12d40d41d26abb8034cbdfb11d233d29402 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7e81285e54ac4d539b29b9c56bae4263...619bbdce0209412e882d030bab8f2a22/)
   [Finished :arrow_down:0.34% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/bf45ae2fdaf34107abd6ca6ecfbe9f8b...b9c6d4e726d04447b70b418f4cc3d86a/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/560ae6ae38da48aabab7f2f817248b33...9deb95f35b0e476ea0d4ad4e046233d0/)
   [Finished :arrow_down:0.72% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f54e8b9a923546caadba2660bf308448...908163fdb0ed45d38b6f885f83d18990/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12629: ARROW-15627: [R] Fix union dataset unify schema

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12629:
URL: https://github.com/apache/arrow/pull/12629#issuecomment-1067366114


   https://issues.apache.org/jira/browse/ARROW-15627


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org