Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/22 20:36:49 UTC

[GitHub] [arrow] boshek commented on a diff in pull request #13641: ARROW-12693: [R] add unique() methods for ArrowTabular, datasets

boshek commented on code in PR #13641:
URL: https://github.com/apache/arrow/pull/13641#discussion_r927973046


##########
r/R/dplyr.R:
##########
@@ -184,6 +184,29 @@ dim.arrow_dplyr_query <- function(x) {
   c(rows, cols)
 }
 
+#' @export
+unique.arrow_dplyr_query <- function(x, incomparables = FALSE, fromLast = FALSE, ...) {
+
+  if (incomparables == TRUE) {
+    arrow_not_supported("`unique()` with `incomparables = TRUE`")
+  }
+
+  if (fromLast == TRUE) {
+    arrow_not_supported("`unique()` with `fromLast = TRUE`")
+  }
+
+  x <- dplyr::distinct(x)
+  dplyr::collect(x)

Review Comment:
   It comes from [this comment](https://issues.apache.org/jira/browse/ARROW-12693?focusedCommentId=17568169&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17568169). I think it is a bit of a grey area. My thinking was that base functions don't fall into the lazy dbplyr paradigm, but then I remembered `head`, which stays lazy. dbplyr does not have a `unique` method, so I don't think there is a precedent:
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   arrow_iris <- arrow_table(iris)
   duckdb_iris <- to_duckdb(arrow_iris)
   
   ## head
   head(arrow_iris)
   #> Table
   #> 6 rows x 5 columns
   #> $Sepal.Length <double>
   #> $Sepal.Width <double>
   #> $Petal.Length <double>
   #> $Petal.Width <double>
   #> $Species <dictionary<values=string, indices=int8>>
   #> 
   #> See $metadata for additional Schema metadata
   head(duckdb_iris)
   #> # Source:   SQL [6 x 5]
   #> # Database: DuckDB 0.3.5-dev1410 [root@Darwin 21.6.0:R 4.2.1/:memory:]
   #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   #>          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
   #> 1          5.1         3.5          1.4         0.2 setosa 
   #> 2          4.9         3            1.4         0.2 setosa 
   #> 3          4.7         3.2          1.3         0.2 setosa 
   #> 4          4.6         3.1          1.5         0.2 setosa 
   #> 5          5           3.6          1.4         0.2 setosa 
   #> 6          5.4         3.9          1.7         0.4 setosa
   
   ## distinct
   distinct(arrow_iris)
   #> Table (query)
   #> Sepal.Length: double
   #> Sepal.Width: double
   #> Petal.Length: double
   #> Petal.Width: double
   #> Species: dictionary<values=string, indices=int8>
   #> 
   #> See $.data for the source Arrow object
   distinct(duckdb_iris)
   #> # Source:   SQL [?? x 5]
   #> # Database: DuckDB 0.3.5-dev1410 [root@Darwin 21.6.0:R 4.2.1/:memory:]
   #>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   #>           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
   #>  1          5.1         3.5          1.4         0.2 setosa 
   #>  2          4.9         3            1.4         0.2 setosa 
   #>  3          4.7         3.2          1.3         0.2 setosa 
   #>  4          4.6         3.1          1.5         0.2 setosa 
   #>  5          5           3.6          1.4         0.2 setosa 
   #>  6          5.4         3.9          1.7         0.4 setosa 
   #>  7          4.6         3.4          1.4         0.3 setosa 
   #>  8          5           3.4          1.5         0.2 setosa 
   #>  9          4.4         2.9          1.4         0.2 setosa 
   #> 10          4.9         3.1          1.5         0.1 setosa 
   #> # … with more rows
   #> # ℹ Use `print(n = ...)` to see more rows
   
   ## unique (dbplyr has no method, so this falls through to the base default)
   unique(duckdb_iris)
   #> [[1]]
   #> src:  DuckDB 0.3.5-dev1410 [root@Darwin 21.6.0:R 4.2.1/:memory:]
   #> tbls:
   #> 
   #> [[2]]
   #> From: arrow_001
   #> <Table: arrow_001>
   ```
   
   I don't think users will have an expectation here, so we are probably free to decide whether `unique()` collects or stays lazy.
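
   For comparison, a minimal sketch of what the lazy alternative could look like. This is not the code in the PR: `unique_lazy` is a hypothetical name used only for illustration, and plain `stop()` stands in for the internal `arrow_not_supported()` helper. The only behavioural difference from the diff is that the query is returned without calling `collect()`:

   ```r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)

   # Hypothetical lazy variant of the method in the diff (illustrative name).
   # It keeps the same argument checks but returns the unevaluated query,
   # mirroring how head() and distinct() behave on Arrow objects.
   unique_lazy <- function(x, incomparables = FALSE, fromLast = FALSE, ...) {
     if (isTRUE(incomparables)) {
       # stop() is a stand-in for arrow's internal arrow_not_supported()
       stop("`unique()` with `incomparables = TRUE` not supported in Arrow", call. = FALSE)
     }
     if (isTRUE(fromLast)) {
       stop("`unique()` with `fromLast = TRUE` not supported in Arrow", call. = FALSE)
     }
     # No collect() here: the caller decides when to pull data into R.
     dplyr::distinct(x)
   }

   arrow_iris <- arrow_table(iris)

   lazy <- unique_lazy(arrow_iris)  # still an Arrow query, nothing evaluated yet
   eager <- dplyr::collect(lazy)    # what the version in the diff would return
   ```

   With the lazy version, getting a data frame costs the user one extra `collect()`. With the eager version in the diff, there is no way to keep the result in Arrow.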


