Posted to dev@arrow.apache.org by "Sam Albers (Jira)" <ji...@apache.org> on 2020/03/13 22:21:00 UTC

[jira] [Created] (ARROW-8118) dim method for FileSystemDataset

Sam Albers created ARROW-8118:
---------------------------------

             Summary: dim method for FileSystemDataset
                 Key: ARROW-8118
                 URL: https://issues.apache.org/jira/browse/ARROW-8118
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Sam Albers


I've been using this function often enough that I wonder whether a) it would be useful in the package and b) it is something you think is worth working on. The basic problem is that if you have a hierarchical file structure that works with open_dataset, it is useful to know how much data you are dealing with. My idea is that 'FileSystemDataset' would have dim, nrow and ncol methods. Here is how I've been using it:
{code:java}
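library(arrow)

# Open a directory of Parquet files partitioned by province and month
x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))

dim_arrow <- function(x) {
  # Sum the row counts of every file in the dataset.
  # Note: ReadTable() reads each file in full just to count its rows.
  rows <- sum(purrr::map_dbl(x$files, ~ParquetFileReader$create(.x)$ReadTable()$num_rows))
  # The column count comes from the dataset schema
  cols <- x$schema$num_fields

  c(rows, cols)
}

dim_arrow(x)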
#> [1] 426929 7
{code}
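
To make the proposal concrete, here is a rough sketch of how the methods could look from the user's side. It assumes that "FileSystemDataset" appears in class(x) so that S3 dispatch applies, and it simply reuses dim_arrow() from above; dim.FileSystemDataset is a hypothetical method, not something the package currently provides.
{code}
# Hypothetical S3 method (not part of arrow): delegate to dim_arrow() above.
# Assumes "FileSystemDataset" is in class(x).
dim.FileSystemDataset <- function(x) dim_arrow(x)

# base::nrow() and base::ncol() are defined as dim(x)[1L] and dim(x)[2L],
# so they should work without separate methods.
dim(x)
nrow(x)
ncol(x)
{code}
Because nrow and ncol fall back on dim, only the dim method would need to be written.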
 

Ideally this would also work on arrow_dplyr_query objects, but I haven't quite figured out how those filter on the partitioning variables.


