Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/21 19:22:05 UTC

[GitHub] [arrow] jonkeane opened a new pull request, #12950: ARROW-15312: [R][C++] filtering a Parquet dataset with is.na() misses some rows

jonkeane opened a new pull request, #12950:
URL: https://github.com/apache/arrow/pull/12950

   The real fix was in https://github.com/apache/arrow/pull/12891 ([ARROW-12659](https://issues.apache.org/jira/browse/ARROW-12659)), but this PR adds the integration tests from the ticket to confirm that the fix works in R and that we don't run into this again in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12950: ARROW-15312: [R][C++] filtering a Parquet dataset with is.na() misses some rows

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12950:
URL: https://github.com/apache/arrow/pull/12950#issuecomment-1105751337

   https://issues.apache.org/jira/browse/ARROW-15312


[GitHub] [arrow] nealrichardson closed pull request #12950: ARROW-15312: [R][C++] filtering a Parquet dataset with is.na() misses some rows

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #12950: ARROW-15312: [R][C++] filtering a Parquet dataset with is.na() misses some rows
URL: https://github.com/apache/arrow/pull/12950


[GitHub] [arrow] ursabot commented on pull request #12950: ARROW-15312: [R][C++] filtering a Parquet dataset with is.na() misses some rows

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12950:
URL: https://github.com/apache/arrow/pull/12950#issuecomment-1108490042

   Benchmark runs are scheduled for baseline = a6296cb53d2a2a05d3dac49152a2db7aee8953ba and contender = 285667cc88e22433c72842f1a37f1f95cccff656. 285667cc88e22433c72842f1a37f1f95cccff656 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6078fffdad204ac3a9a4c692b9bd299c...d560ea2e7d954668bca6dcedba1a5eba/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/a3363e99659b423fa87128e97fdf7c1f...ce11ba31a80b4f099d8a07a7eebea61a/)
   [Failed :arrow_down:4.14% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/31b1f0171e9e479286b9ee282b691eee...0a478d0e592d4caf99c000e84a2b8242/)
   [Finished :arrow_down:0.84% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/5fa54bd5bf124681a161ce9b6b28d190...9630b1f38dbb45c0a55b03806fcd5065/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/584| `285667cc` ec2-t3-xlarge-us-east-2>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/572| `285667cc` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/570| `285667cc` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/582| `285667cc` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/583| `a6296cb5` ec2-t3-xlarge-us-east-2>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/571| `a6296cb5` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/569| `a6296cb5` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/581| `a6296cb5` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


[GitHub] [arrow] nealrichardson commented on a diff in pull request #12950: ARROW-15312: [R][C++] filtering a Parquet dataset with is.na() misses some rows

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #12950:
URL: https://github.com/apache/arrow/pull/12950#discussion_r856153177


##########
r/tests/testthat/test-dataset.R:
##########
@@ -966,3 +966,29 @@ test_that("dataset to C-interface to arrow_dplyr_query with proj/filter", {
   # must clean up the pointer or we leak
   delete_arrow_array_stream(stream_ptr)
 })
+
+
+test_that("Filter parquet dataset with is.na ARROW-15312", {
+  ds_path <- make_temp_dir()
+
+  df <- tibble(x = 1:3, y = c(0L, 0L, NA_integer_), z = c(0L, 1L, NA_integer_))
+  write_dataset(df, ds_path)
+
+  # OK: Collect then filter: returns row 3, as expected
+  expect_identical(
+    open_dataset(ds_path) %>% collect() %>% filter(is.na(y)),
+    df %>% collect() %>% filter(is.na(y))
+  )
+
+  # ERROR: Filter then collect (on y) returns a tibble with no row

Review Comment:
   ```suggestion
     # Before the fix: Filter then collect on y returned a 0-row tibble
   ```



[GitHub] [arrow] nealrichardson commented on a diff in pull request #12950: ARROW-15312: [R][C++] filtering a Parquet dataset with is.na() misses some rows

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #12950:
URL: https://github.com/apache/arrow/pull/12950#discussion_r856153177


##########
r/tests/testthat/test-dataset.R:
##########
@@ -966,3 +966,29 @@ test_that("dataset to C-interface to arrow_dplyr_query with proj/filter", {
   # must clean up the pointer or we leak
   delete_arrow_array_stream(stream_ptr)
 })
+
+
+test_that("Filter parquet dataset with is.na ARROW-15312", {
+  ds_path <- make_temp_dir()
+
+  df <- tibble(x = 1:3, y = c(0L, 0L, NA_integer_), z = c(0L, 1L, NA_integer_))
+  write_dataset(df, ds_path)
+
+  # OK: Collect then filter: returns row 3, as expected
+  expect_identical(
+    open_dataset(ds_path) %>% collect() %>% filter(is.na(y)),
+    df %>% collect() %>% filter(is.na(y))
+  )
+
+  # ERROR: Filter then collect (on y) returns a tibble with no row

Review Comment:
   ```suggestion
     # Before the fix: Filter then collect (on y) returned a tibble with no row
   ```


