Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/06/07 13:49:00 UTC
[jira] [Updated] (ARROW-16776) [R] dplyr::glimpse method for arrow table/datasets on disk
[ https://issues.apache.org/jira/browse/ARROW-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-16776:
------------------------------------
Summary: [R] dplyr::glimpse method for arrow table/datasets on disk (was: dplyr::glimpse method for arrow table/datasets on disk)
> [R] dplyr::glimpse method for arrow table/datasets on disk
> ----------------------------------------------------------
>
> Key: ARROW-16776
> URL: https://issues.apache.org/jira/browse/ARROW-16776
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Thomas Mock
> Priority: Minor
> Fix For: 9.0.0
>
>
> When working with Arrow datasets/tables, I often find myself wanting to interactively print or "see" the results of a query or the first few rows of the data without having to fully collect into memory.
> I can perform exploratory data analysis (EDA) on large out-of-memory datasets via Arrow + dplyr, but to print the returned values I have to either collect() them into memory or send them to DuckDB with to_duckdb().
> * compute() - returns number of rows/columns, but no data
> * collect() - returns data fully into memory, can be combined with head()
> * to_duckdb() - keeps data out of memory, always returns top 10 rows and all columns, optionally increase/decrease number of printed rows
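The three options listed above can be sketched roughly as follows. This is a minimal illustration, not part of the original report: the temporary file path, the `query` object, and the variable names are assumptions, and the to_duckdb() step only runs if the optional duckdb and dbplyr packages happen to be installed.

```r
library(dplyr)
library(arrow)

# Assumption: a throwaway Parquet file in tempdir(), standing in for a large dataset.
path <- file.path(tempdir(), "mtcars.parquet")
write_parquet(mtcars, path)
car_ds <- open_dataset(path)

# A lazy query; nothing is read into R yet.
query <- car_ds %>% filter(cyl == 6)

# compute(): evaluates the query and returns an Arrow Table, not an R data frame.
computed <- query %>% compute()

# head() + collect(): only the first rows are pulled into an R data frame.
first_rows <- query %>% head(5) %>% collect()

# to_duckdb(): hands the dataset to DuckDB; printing the result shows a short preview.
# Guarded because duckdb and dbplyr are optional dependencies.
if (requireNamespace("duckdb", quietly = TRUE) &&
    requireNamespace("dbplyr", quietly = TRUE)) {
  duck_query <- car_ds %>% to_duckdb() %>% filter(cyl == 6)
}
```

Of these, only head() followed by collect() yields an ordinary data frame that glimpse() already knows how to print.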
> While to_duckdb() gives me the ability to do true EDA, it seems counterintuitive to need to send the arrow table over to a duckdb database just to see the glimpse()/head() equivalent.
> My feature request is for a dplyr::glimpse() method that lazily prints the first few values of a table/dataset. The expected output would be something like the following.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>%
>   glimpse()
> Rows: ??
> Columns: 11
> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, …
> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 36…
> $ hp <dbl> 110, 110, 93, 110, 175, 105, 2…
> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, …
> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.…
> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17…
> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …
> ```
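Until such a method exists, output close to the example above can be approximated by collecting only a small preview and glimpsing that. The helper below is a hypothetical workaround sketch: `glimpse_preview()` and its `n` argument are invented for illustration and are not part of arrow or dplyr. Note that the "Rows:" line will report the size of the preview rather than "??".

```r
library(dplyr)
library(arrow)

# Hypothetical helper (not an arrow or dplyr function): preview the first `n`
# rows of a Dataset, Table, or arrow dplyr query without collecting everything.
glimpse_preview <- function(x, n = 10, ...) {
  preview <- x %>% head(n) %>% collect()  # only `n` rows reach R memory
  glimpse(preview, ...)                   # regular data-frame glimpse
  invisible(preview)
}

# Usage, assuming a throwaway Parquet file in tempdir().
path <- file.path(tempdir(), "mtcars_preview.parquet")
write_parquet(mtcars, path)
car_ds <- open_dataset(path)

preview <- glimpse_preview(car_ds, n = 5)
```

Because the helper only relies on head() and collect(), it should work the same way after filter() or select() steps on the dataset.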
> Currently, glimpse() returns a list-style dump of the object's internal structure, most of which is unrelated to the actual data/values.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>%
>   glimpse()
> #> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' <FileSystemDataset>
> #> Inherits from: <Dataset>
> #> Public:
> #> .:xp:.: externalptr
> #> .class_title: function ()
> #> clone: function (deep = FALSE)
> #> files: active binding
> #> filesystem: active binding
> #> format: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> metadata: active binding
> #> NewScan: function ()
> #> num_cols: active binding
> #> num_rows: active binding
> #> pointer: function ()
> #> print: function (...)
> #> schema: active binding
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: active binding
> car_ds %>%
>   filter(cyl == 6) %>%
>   glimpse()
> #> List of 7
> #> $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' <FileSystemDataset>
> #> Inherits from: <Dataset>
> #> Public:
> #> .:xp:.: externalptr
> #> .class_title: function ()
> #> clone: function (deep = FALSE)
> #> files: active binding
> #> filesystem: active binding
> #> format: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> metadata: active binding
> #> NewScan: function ()
> #> num_cols: active binding
> #> num_rows: active binding
> #> pointer: function ()
> #> print: function (...)
> #> schema: active binding
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: active binding
> #> $ cyl :List of 11
> #> ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ cyl :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ disp:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ hp :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ drat:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ wt :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ qsec:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ vs :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ am :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ gear:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ carb:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> $ disp:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> $ hp : chr(0)
> #> $ drat: NULL
> #> $ wt : list()
> #> $ qsec: logi(0)
> #> - attr(*, "class")= chr "arrow_dplyr_query"
> ```
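The shape of the output above is consistent with how glimpse() dispatches: it is an S3 generic with a method for data frames and a default method that calls str(), so any object without its own method is printed as its internal structure. The sketch below reproduces that fallback with a made-up class; `fake_query` and its fields are purely illustrative and do not mirror the real layout of an arrow query object.

```r
library(dplyr)

# A stand-in object with a custom S3 class and no glimpse() method of its own.
fake_query <- structure(
  list(.data = "placeholder", selected_columns = c("mpg", "cyl")),
  class = "fake_query"
)

# With no glimpse.fake_query method, dispatch reaches the default method,
# which prints a str()-style listing of the list elements and attributes.
out <- capture.output(glimpse(fake_query))
cat(out, sep = "\n")
```

Registering a dedicated glimpse() method for the Arrow classes is what would replace this fallback with the data-oriented view requested in the issue.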
> <sup>Created on 2022-06-07 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)