Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/02 12:55:24 UTC
[GitHub] [arrow] paleolimbot commented on pull request #11730: ARROW-14745: [R] Enable true duckdb streaming
paleolimbot commented on pull request #11730:
URL: https://github.com/apache/arrow/pull/11730#issuecomment-984602422
Nothing new yet, just listing the various ways this can fail.
First, intermittent success!
<details>
``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
example_data <- tibble::tibble(
  int = c(1:3, NA_integer_, 5:10),
  dbl = c(1:8, NA, 10) + .1,
  dbl2 = rep(5, 10),
  lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
  false = logical(10),
  chr = letters[c(1:5, NA, 7:10)],
  fct = factor(letters[c(1:4, NA, NA, 7:10)])
)
tf <- tempfile()
new_ds <- rbind(
  cbind(example_data, part = 1),
  cbind(example_data, part = 2),
  cbind(example_data, part = 3),
  cbind(example_data, part = 4)
) %>%
  mutate(row_order = 1:n())
write_dataset(new_ds, tf, partitioning = "part")
ds <- open_dataset(tf)
waldo::compare(
  ds %>%
    to_duckdb() %>%
    # factors don't roundtrip https://github.com/duckdb/duckdb/issues/1879
    select(-fct) %>%
    to_arrow() %>%
    filter(int > 5 & part > 1) %>%
    collect() %>%
    arrange(row_order) %>%
    tibble::as_tibble(),
  ds %>%
    select(-fct) %>%
    filter(int > 5 & part > 1) %>%
    collect() %>%
    arrange(row_order) %>%
    tibble::as_tibble()
)
#> ✓ No differences
```
<sup>Created on 2021-12-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
</details>
Second, filter mismatch:
<details>
``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
example_data <- tibble::tibble(
  int = c(1:3, NA_integer_, 5:10),
  dbl = c(1:8, NA, 10) + .1,
  dbl2 = rep(5, 10),
  lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
  false = logical(10),
  chr = letters[c(1:5, NA, 7:10)],
  fct = factor(letters[c(1:4, NA, NA, 7:10)])
)
tf <- tempfile()
new_ds <- rbind(
  cbind(example_data, part = 1),
  cbind(example_data, part = 2),
  cbind(example_data, part = 3),
  cbind(example_data, part = 4)
) %>%
  mutate(row_order = 1:n())
write_dataset(new_ds, tf, partitioning = "part")
ds <- open_dataset(tf)
waldo::compare(
  ds %>%
    to_duckdb() %>%
    # factors don't roundtrip https://github.com/duckdb/duckdb/issues/1879
    select(-fct) %>%
    to_arrow() %>%
    filter(int > 5 & part > 1) %>%
    collect() %>%
    arrange(row_order) %>%
    tibble::as_tibble(),
  ds %>%
    select(-fct) %>%
    filter(int > 5 & part > 1) %>%
    collect() %>%
    arrange(row_order) %>%
    tibble::as_tibble()
)
#> old vs new
#>             int  dbl   lgl  chr row_order part
#> - old[1, ]    7  7.1  TRUE    g         0    3
#> + new[1, ]    6  6.1  TRUE <NA>        16    2
#> - old[2, ]    8  8.1    NA    h         0    3
#> + new[2, ]    7  7.1  TRUE    g        17    2
#> - old[3, ]    9   NA    NA    i         0    3
#> + new[3, ]    8  8.1    NA    h        18    2
#> - old[4, ]   10 10.1 FALSE    j         0    3
#> + new[4, ]    9   NA    NA    i        19    2
#> - old[5, ]    6  6.1  TRUE <NA>         4    3
#> + new[5, ]   10 10.1 FALSE    j        20    2
#> - old[6, ]    6  6.1  TRUE <NA>        16    2
#> + new[6, ]    6  6.1  TRUE <NA>        26    3
#> - old[7, ]    7  7.1  TRUE    g        17    2
#> + new[7, ]    7  7.1  TRUE    g        27    3
#> - old[8, ]    8  8.1    NA    h        18    2
#> + new[8, ]    8  8.1    NA    h        28    3
#> - old[9, ]    9   NA    NA    i        19    2
#> + new[9, ]    9   NA    NA    i        29    3
#> - old[10, ]  10 10.1 FALSE    j        20    2
#> + new[10, ]  10 10.1 FALSE    j        30    3
#>   old[11, ]   6  6.1  TRUE <NA>        36    4
#>   old[12, ]   7  7.1  TRUE    g        37    4
#>   old[13, ]   8  8.1    NA    h        38    4
#>
#> `old$int[1:8]`: 7 8 9 10 6 6 7 8
#> `new$int[1:8]`: 6 7 8 9 10 6 7 8
#>
#> `old$dbl[1:8]`: 7 8 NA 10 6 6 7 8
#> `new$dbl[1:8]`: 6 7 8 NA 10 6 7 8
#>
#> `old$lgl[1:8]`: TRUE <NA> <NA> FALSE TRUE TRUE TRUE <NA>
#> `new$lgl[1:8]`: TRUE TRUE <NA> <NA> FALSE TRUE TRUE <NA>
#>
#> `old$chr[1:8]`: "g" "h" "i" "j" NA NA "g" "h"
#> `new$chr[1:8]`: NA "g" "h" "i" "j" NA "g" "h"
#>
#> `old$row_order[1:13]`: 0 0 0 0 4 16 17 18 19 20 and 3 more...
#> `new$row_order[1:13]`: 16 17 18 19 20 26 27 28 29 30 ...
#>
#> `old$part[1:13]`: 3 3 3 3 3 2 2 2 2 2 and 3 more...
#> `new$part[1:13]`: 2 2 2 2 2 3 3 3 3 3 ...
```
<sup>Created on 2021-12-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
</details>
Third, `Query Stream is closed`:
<details>
```
Error: IOError: Query Stream is closed
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/c/bridge.cc:1759 StatusFromCError(stream_.get_next(&stream_, &c_array))
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.h:222 ReadNext(&batch)
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/util/iterator.h:428 it_.Next()
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:417 iterator_.Next()
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:326 ReadNext(&batch)
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337 ReadAll(&batches)
```
</details>
Fourth, a segfault (this has happened only once).
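
Since these outcomes are intermittent, one way to get a rough sense of how often each occurs is to run the comparison repeatedly and tabulate the results. The following is a minimal base R sketch, not something from this PR: `tally_outcomes()` and `fake_run()` are hypothetical names, and `fake_run()` is a stand-in that cycles through the three catchable outcomes so the helper runs without arrow or duckdb. For the real check, the assumption is that the reprex would be wrapped in a function that returns `TRUE` when `length(waldo::compare(x, y)) == 0`.

``` r
# Hypothetical helper (not part of arrow, duckdb, or waldo): call `f` `n`
# times and tabulate whether each run matched, mismatched, or errored.
tally_outcomes <- function(f, n = 20) {
  outcomes <- vapply(seq_len(n), function(i) {
    tryCatch(
      if (isTRUE(f())) "match" else "mismatch",
      error = function(e) paste0("error: ", conditionMessage(e))
    )
  }, character(1))
  table(outcomes)
}

# Stand-in for the real comparison. It deterministically cycles through
# match, mismatch, and an error so the helper can be exercised on its own.
counter <- 0
fake_run <- function() {
  counter <<- counter + 1
  if (counter %% 3 == 0) stop("Query Stream is closed")
  counter %% 3 == 1
}

result <- tally_outcomes(fake_run, n = 9)
print(result)
```

A segfault terminates the R process, so `tryCatch()` cannot record it; catching that case would need each run to happen in a separate R process.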