Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/05/23 01:19:24 UTC

[GitHub] [arrow] westonpace commented on issue #35715: open_dataset() on long vec of URIs uses much more RAM & is much slower than on partition root.

westonpace commented on issue #35715:
URL: https://github.com/apache/arrow/issues/35715#issuecomment-1558318146

   Looks like the problem might be in the R code getting ready to call the dataset factory:
   
   ```r
   DatasetFactory$create <- function(x,
                                     filesystem = NULL,
                                     format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text"),
                                     partitioning = NULL,
                                     hive_style = NA,
                                     factory_options = list(),
                                     ...) {
     if (is_list_of(x, "DatasetFactory")) {
       return(dataset___UnionDatasetFactory__Make(x))
     }
   
     if (is.character(format)) {
       format <- FileFormat$create(match.arg(format), ...)
     } else {
       assert_is(format, "FileFormat")
     }
   
     path_and_fs <- get_paths_and_filesystem(x, filesystem)
     info <- path_and_fs$fs$GetFileInfo(path_and_fs$path)
   
     if (length(info) > 1 || info[[1]]$type == FileType$File) {
       # x looks like a vector of one or more file paths (not a directory path)
       return(FileSystemDatasetFactory$create(
         path_and_fs$fs,
         NULL,
         path_and_fs$path,
         format,
         factory_options = factory_options
       ))
     }
   
     partitioning <- handle_partitioning(partitioning, path_and_fs, hive_style)
     selector <- FileSelector$create(
       path_and_fs$path,
       allow_not_found = FALSE,
       recursive = TRUE
     )
   
     FileSystemDatasetFactory$create(path_and_fs$fs, selector, NULL, format, partitioning, factory_options)
   }
   ```
   
   If I understand correctly (which I very well might not), `info <- path_and_fs$fs$GetFileInfo(path_and_fs$path)` calls `GetFileInfo` on every single path, which triggers an individual S3 list call per path.  We should probably just assume that, if the length of `x` is greater than 1, we are being given a list of files.
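   Concretely, that could be a short-circuit just before the `GetFileInfo` call. This is only a hypothetical sketch reusing the names from the function above, not a tested patch:
   
   ```r
   # Hypothetical: if x holds more than one path, treat it as a vector of
   # file paths and build the factory directly, skipping the per-path
   # GetFileInfo (one S3 request per path).
   if (length(path_and_fs$path) > 1) {
     return(FileSystemDatasetFactory$create(
       path_and_fs$fs,
       NULL,
       path_and_fs$path,
       format,
       factory_options = factory_options
     ))
   }
   # Single path: one GetFileInfo call is cheap, so keep the existing
   # file-vs-directory check.
   info <- path_and_fs$fs$GetFileInfo(path_and_fs$path)
   ```
   
   The trade-off is that a length-1 character vector naming a directory would still take the directory branch, while multi-path input would never be stat-ed individually.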
   
   On the bright side, if I use my fix in #35440 then this call (`ds <- open_dataset(s3)`) finishes in about 4.5 seconds.

