Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/02 12:55:24 UTC
[GitHub] [arrow] paleolimbot commented on pull request #11730: ARROW-14745: [R] Enable true duckdb streaming

paleolimbot commented on pull request #11730:
URL: https://github.com/apache/arrow/pull/11730#issuecomment-984602422


   Nothing new yet, just listing the various ways this can fail.
   
   First, intermittent success!
   
   <details>
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   example_data <- tibble::tibble(
     int = c(1:3, NA_integer_, 5:10),
     dbl = c(1:8, NA, 10) + .1,
     dbl2 = rep(5, 10),
     lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
     false = logical(10),
     chr = letters[c(1:5, NA, 7:10)],
     fct = factor(letters[c(1:4, NA, NA, 7:10)])
   )
   
   tf <- tempfile()
   new_ds <- rbind(
     cbind(example_data, part = 1),
     cbind(example_data, part = 2),
     cbind(example_data, part = 3),
     cbind(example_data, part = 4)
   ) %>%
     mutate(row_order = 1:n())
   
   write_dataset(new_ds, tf, partitioning = "part")
   
   ds <- open_dataset(tf)
   
   waldo::compare(
     ds %>%
       to_duckdb() %>%
       # factors don't roundtrip https://github.com/duckdb/duckdb/issues/1879
       select(-fct) %>%
       to_arrow() %>%
       filter(int > 5 & part > 1) %>%
       collect() %>%
       arrange(row_order) %>%
       tibble::as_tibble(),
     ds %>%
       select(-fct) %>%
       filter(int > 5 & part > 1) %>%
       collect() %>%
       arrange(row_order) %>%
       tibble::as_tibble()
   )
   #> ✓ No differences
   ```
   
   <sup>Created on 2021-12-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   </details>
   
   Second, filter mismatch:
   
   <details>
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   example_data <- tibble::tibble(
     int = c(1:3, NA_integer_, 5:10),
     dbl = c(1:8, NA, 10) + .1,
     dbl2 = rep(5, 10),
     lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
     false = logical(10),
     chr = letters[c(1:5, NA, 7:10)],
     fct = factor(letters[c(1:4, NA, NA, 7:10)])
   )
   
   tf <- tempfile()
   new_ds <- rbind(
     cbind(example_data, part = 1),
     cbind(example_data, part = 2),
     cbind(example_data, part = 3),
     cbind(example_data, part = 4)
   ) %>%
     mutate(row_order = 1:n())
   
   write_dataset(new_ds, tf, partitioning = "part")
   
   ds <- open_dataset(tf)
   
   waldo::compare(
     ds %>%
       to_duckdb() %>%
       # factors don't roundtrip https://github.com/duckdb/duckdb/issues/1879
       select(-fct) %>%
       to_arrow() %>%
       filter(int > 5 & part > 1) %>%
       collect() %>%
       arrange(row_order) %>%
       tibble::as_tibble(),
     ds %>%
       select(-fct) %>%
       filter(int > 5 & part > 1) %>%
       collect() %>%
       arrange(row_order) %>%
       tibble::as_tibble()
   )
   #> old vs new
   #>             int  dbl   lgl  chr row_order part
   #> - old[1, ]    7  7.1  TRUE    g         0    3
   #> + new[1, ]    6  6.1  TRUE <NA>        16    2
   #> - old[2, ]    8  8.1    NA    h         0    3
   #> + new[2, ]    7  7.1  TRUE    g        17    2
   #> - old[3, ]    9   NA    NA    i         0    3
   #> + new[3, ]    8  8.1    NA    h        18    2
   #> - old[4, ]   10 10.1 FALSE    j         0    3
   #> + new[4, ]    9   NA    NA    i        19    2
   #> - old[5, ]    6  6.1  TRUE <NA>         4    3
   #> + new[5, ]   10 10.1 FALSE    j        20    2
   #> - old[6, ]    6  6.1  TRUE <NA>        16    2
   #> + new[6, ]    6  6.1  TRUE <NA>        26    3
   #> - old[7, ]    7  7.1  TRUE    g        17    2
   #> + new[7, ]    7  7.1  TRUE    g        27    3
   #> - old[8, ]    8  8.1    NA    h        18    2
   #> + new[8, ]    8  8.1    NA    h        28    3
   #> - old[9, ]    9   NA    NA    i        19    2
   #> + new[9, ]    9   NA    NA    i        29    3
   #> - old[10, ]  10 10.1 FALSE    j        20    2
   #> + new[10, ]  10 10.1 FALSE    j        30    3
   #>   old[11, ]   6  6.1  TRUE <NA>        36    4
   #>   old[12, ]   7  7.1  TRUE    g        37    4
   #>   old[13, ]   8  8.1    NA    h        38    4
   #> 
   #> `old$int[1:8]`: 7 8 9 10  6 6 7 8
   #> `new$int[1:8]`: 6 7 8  9 10 6 7 8
   #> 
   #> `old$dbl[1:8]`: 7 8 NA 10  6 6 7 8
   #> `new$dbl[1:8]`: 6 7  8 NA 10 6 7 8
   #> 
   #> `old$lgl[1:8]`: TRUE <NA> <NA> FALSE TRUE  TRUE TRUE <NA>
   #> `new$lgl[1:8]`: TRUE TRUE <NA> <NA>  FALSE TRUE TRUE <NA>
   #> 
   #> `old$chr[1:8]`: "g" "h" "i" "j" NA  NA "g" "h"
   #> `new$chr[1:8]`: NA  "g" "h" "i" "j" NA "g" "h"
   #> 
   #> `old$row_order[1:13]`:  0  0  0  0  4 16 17 18 19 20 and 3 more...
   #> `new$row_order[1:13]`: 16 17 18 19 20 26 27 28 29 30           ...
   #> 
   #> `old$part[1:13]`: 3 3 3 3 3 2 2 2 2 2 and 3 more...
   #> `new$part[1:13]`: 2 2 2 2 2 3 3 3 3 3           ...
   ```
   
   <sup>Created on 2021-12-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   </details>
   
   Third, `Query Stream is closed`:
   
   <details>
   
   ```
   Error: IOError: Query Stream is closed
   /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/c/bridge.cc:1759  StatusFromCError(stream_.get_next(&stream_, &c_array))
   /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.h:222  ReadNext(&batch)
   /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/util/iterator.h:428  it_.Next()
   /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:417  iterator_.Next()
   /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:326  ReadNext(&batch)
   /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadAll(&batches) 
   ```
   
   </details>
   
   Fourth, a segfault (this has only happened once).
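   
   Because these outcomes are intermittent, one way to see how often each occurs would be to repeat the comparison and tally the results. The following is a minimal sketch, not something from this PR: `tally_outcomes()` and its `run_once` argument are hypothetical names, and `run_once` is assumed to wrap the `waldo::compare()` call from the reprex above and return `TRUE` when there are no differences.
   
   ``` r
   # Hypothetical helper (not part of arrow or this PR): call the zero-argument
   # function `run_once` `n` times and tally what happened on each run.
   # `run_once` should return TRUE when the two results match and FALSE when
   # they differ; any R error is caught and recorded by its message.
   tally_outcomes <- function(run_once, n = 20) {
     outcomes <- vapply(
       seq_len(n),
       function(i) {
         tryCatch(
           if (isTRUE(run_once())) "match" else "mismatch",
           error = function(e) paste("error:", conditionMessage(e))
         )
       },
       character(1)
     )
     table(outcomes)
   }
   
   # Assumed usage, where via_duckdb() and via_arrow() are hypothetical wrappers
   # around the two pipelines passed to waldo::compare() in the reprex:
   # tally_outcomes(function() {
   #   length(waldo::compare(via_duckdb(), via_arrow())) == 0
   # })
   ```
   
   This only separates matches, mismatches, and R-level errors such as `Query Stream is closed`. A segfault terminates the R process, so `tryCatch()` cannot record it; catching that case would need each run to happen in a separate process.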

