Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/07 20:37:55 UTC

[GitHub] [arrow] bkietz commented on a change in pull request #8351: ARROW-9870: [R] Friendly interface for filesystems (S3)

bkietz commented on a change in pull request #8351:
URL: https://github.com/apache/arrow/pull/8351#discussion_r501234407



##########
File path: r/vignettes/fs.Rmd
##########
@@ -44,41 +44,46 @@ the cost of reading the data over the network should be much lower.
 
 ## Creating a FileSystem object
 
-Another way to connect to S3 is to create an `S3FileSystem` object once and pass
-that to the read/write functions. This may be a convenience when dealing with
+Another way to connect to S3 is to create an `FileSystem` object once and pass

Review comment:
       ```suggestion
   Another way to connect to S3 is to create a `FileSystem` object once and pass
   ```

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -26,9 +26,27 @@ The total file size is around 37 gigabytes, even in the efficient Parquet file f
 That's bigger than memory on most people's computers,
 so we can't just read it all in and stack it into a single data frame.
 
-In a future release, you'll be able to point your R session at S3 and query
-the dataset from there. For now, datasets need to be on your local file system.
-To download the files,
+If you've installed a binary macOS or Windows `arrow` package, you should have
+support for S3 built in. If you do, you can query datasets directly on S3 without
+first copying the files locally. To check, run
+
+```{r}
+arrow::arrow_with_s3()
+```
+
+Even with S3 support enabled, unless your machine is located in the same AWS
+region as the data, network speed will be a bottleneck. So, for this vignette,
+we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi"
+directory.

Review comment:
       ```suggestion
   Even with S3 support enabled network speed will be a bottleneck unless your
   machine is located in the same AWS region as the data. So, for this vignette,
   we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi"
   directory. Following are two possible ways to download all of it but neither is
   evaluated; if you want to run with live data, you'll have to do it yourself
   separately. Given the size, if you're running this locally and don't have a fast
   connection, you may want to grab just a year or two of data.
   ```

##########
File path: r/vignettes/fs.Rmd
##########
@@ -44,41 +44,46 @@ the cost of reading the data over the network should be much lower.
 
 ## Creating a FileSystem object
 
-Another way to connect to S3 is to create an `S3FileSystem` object once and pass
-that to the read/write functions. This may be a convenience when dealing with
+Another way to connect to S3 is to create an `FileSystem` object once and pass
+that to the read/write functions.
+`s3_bucket()` is a convenience function to create an `S3FileSystem` object,
+automatically detecting the bucket's AWS region and holding onto the bucket's
+relative path.
+This may be a convenience when dealing with
 long URIs, and it's necessary for some options and authentication methods

Review comment:
       ```suggestion
   This may be convenient when dealing with
   long URIs, and it's necessary for some options and authentication methods
   ```

##########
File path: r/R/io.R
##########
@@ -230,6 +230,10 @@ mmap_open <- function(path, mode = c("read", "write", "readwrite")) {
 #' @return An `InputStream` or a subclass of one.
 #' @keywords internal
 make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem = NULL) {
+  if (inherits(file, "SubTreeFileSystem")) {
+    filesystem <- file$base_fs
+    file <- file$base_path
+  }

Review comment:
       ```suggestion
     if (inherits(file, "FileLocator")) {
       filesystem <- file$filesystem
       file <- file$path
     }
   ```

##########
File path: r/R/arrow-package.R
##########
@@ -53,18 +53,28 @@
 
 #' Is the C++ Arrow library available?
 #'
-#' You won't generally need to call this function, but it's here in case it
-#' helps for development purposes.
+#' You won't generally need to call these function, but they're here
+#' for diagnostic purposes.
 #' @return `TRUE` or `FALSE` depending on whether the package was installed
-#' with the Arrow C++ library. If `FALSE`, you'll need to install the C++
-#' library and then reinstall the R package. See [install_arrow()] for help.
+#' with the Arrow C++ library `arrow_available()` or with S3 support enabled
+#' `arrow_with_s3()`.

Review comment:
       ```suggestion
   #' with the Arrow C++ library (check with `arrow_available()`) or with S3
   #' support enabled (check with `arrow_with_s3()`).
   ```

##########
File path: r/R/arrow-package.R
##########
@@ -53,18 +53,28 @@
 
 #' Is the C++ Arrow library available?
 #'
-#' You won't generally need to call this function, but it's here in case it
-#' helps for development purposes.
+#' You won't generally need to call these function, but they're here

Review comment:
       ```suggestion
  #' You won't generally need to call these functions, but they're made available
   ```

##########
File path: r/R/filesystem.R
##########
@@ -263,7 +271,10 @@ FileSystem <- R6Class("FileSystem", inherit = ArrowObject,
     },
     OpenAppendStream = function(path) {
       shared_ptr(OutputStream, fs___FileSystem__OpenAppendStream(self, clean_path_rel(path)))
-    }
+    },
+
+    # Friendlier R user interface
+    path = function(x) SubTreeFileSystem$create(x, self)

Review comment:
       I think you called this method `cd` in the docs. My two cents: name it `chroot`.
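       For illustration, a quick sketch of how the renamed method might read in use (the `chroot` name is only my proposal above, not part of the PR):

       ```r
       # Hypothetical usage if the method were named `chroot`:
       fs <- S3FileSystem$create(region = "us-east-2")
       bucket <- fs$chroot("ursa-labs-taxi-data")  # SubTreeFileSystem rooted at the bucket
       df <- read_parquet(bucket$path("2019/06/data.parquet"))
       ```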

##########
File path: r/vignettes/fs.Rmd
##########
@@ -44,41 +44,46 @@ the cost of reading the data over the network should be much lower.
 
 ## Creating a FileSystem object
 
-Another way to connect to S3 is to create an `S3FileSystem` object once and pass
-that to the read/write functions. This may be a convenience when dealing with
+Another way to connect to S3 is to create an `FileSystem` object once and pass
+that to the read/write functions.
+`s3_bucket()` is a convenience function to create an `S3FileSystem` object,
+automatically detecting the bucket's AWS region and holding onto the bucket's
+relative path.

Review comment:
       ```suggestion
   `S3FileSystem` objects can be created with the `s3_bucket()` function, which
   automatically detects the bucket's AWS region. Additionally, the resulting
   `FileSystem` will consider paths relative to the bucket's path (so for example
   you don't need to prefix the bucket path when listing a directory).
   ```

##########
File path: r/R/feather.R
##########
@@ -24,9 +24,8 @@
 #' and the version 2 specification, which is the Apache Arrow IPC file format.
 #'
 #' @param x `data.frame`, [RecordBatch], or [Table]
-#' @param sink A string file path, URI, or [OutputStream]
-#' @param filesystem A [FileSystem] where `sink` should be written if it is a
-#' string file path; default is the local file system
+#' @param sink A string file path, URI, or [OutputStream], or path in a file
+#' system (`SubTreeFileSystem`)

Review comment:
       ```suggestion
   #' @param sink A string file path, URI, or [OutputStream]
   ```
   I think it's highly unintuitive to use a SubTreeFileSystem's `base_path` to point to a file rather than a directory. If we want to support an (fs, path) pair with a distinct class, I think we should name it consistently with C++ (`FileLocator`). This could be constructed with `fs$loc(path)`.
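   For concreteness, caller-side usage under that proposal might look like this (`FileLocator` and `fs$loc()` are the suggestion above, not an existing API):

   ```r
   # Hypothetical: pass an explicit (filesystem, path) pair to a reader
   fs <- S3FileSystem$create(region = "us-east-2")
   df <- read_parquet(fs$loc("ursa-labs-taxi-data/2019/06/data.parquet"))
   ```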

##########
File path: r/R/filesystem.R
##########
@@ -185,6 +185,14 @@ FileSelector$create <- function(base_dir, allow_not_found = FALSE, recursive = F
 #' - `$OpenAppendStream(path)`: Open an [output stream][OutputStream] for
 #'    appending.
 #'
+#' @section Active bindings:
+#'
+#' - `$type_name`: string filesystem type name, such as "local", "s3", etc.
+#' - `$region`: string AWS region, for `S3FileSystem` and `SubTreeFileSystem`
+#'    containing a `S3FileSystem`
+#' - `$base_fs`: for `SubTreeFileSystem`, the `FileSystem` it contains
+#' - `$base_path`: for `SubTreeFileSystem`, the string file path it contains

Review comment:
       ```suggestion
   #' - `$base_path`: for `SubTreeFileSystem`, the path in `$base_fs` which is considered
   #'    root in this `SubTreeFileSystem`.
   ```

##########
File path: r/vignettes/fs.Rmd
##########
@@ -44,41 +44,46 @@ the cost of reading the data over the network should be much lower.
 
 ## Creating a FileSystem object
 
-Another way to connect to S3 is to create an `S3FileSystem` object once and pass
-that to the read/write functions. This may be a convenience when dealing with
+Another way to connect to S3 is to create an `FileSystem` object once and pass
+that to the read/write functions.
+`s3_bucket()` is a convenience function to create an `S3FileSystem` object,
+automatically detecting the bucket's AWS region and holding onto the bucket's
+relative path.
+This may be a convenience when dealing with
 long URIs, and it's necessary for some options and authentication methods
 that aren't supported in the URI format.
 
+With a `FileSystem` object, we can point to specific files in it with the `$path()` method.
 In the previous example, this would look like:
 
 ```r
-fs <- S3FileSystem$create(region = "us-east-2")
-df <- read_parquet("ursa-labs-taxi-data/2019/06/data.parquet", filesystem = fs)
+bucket <- s3_bucket("ursa-labs-taxi-data")
+df <- read_parquet(bucket$path("2019/06/data.parquet"))
 ```
 
-See the help for `FileSystem` for a list of options that `S3FileSystem$create()`
+See the help for `FileSystem` for a list of options that `s3_bucket()` and `S3FileSystem$create()`
 can take. `region`, `scheme`, and `endpoint_override` can be encoded as query
-parameters in the URI (though `region` will be auto-detected the bucket URI if omitted),
-and `access_key` and `secret_key` can also be included,
+parameters in the URI (though `region` will be auto-detected in `s3_bucket()` or from the URI if omitted).
+`access_key` and `secret_key` can also be included,
 but other options are not supported in the URI.
 
-Using the `SubTreeFileSystem` class, you can represent an S3 bucket or
-subdirectory inside of one.
+The object that `s3_bucket()` returns is technically a `SubTreeFileSystem`, which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be useful for holding a reference to a subdirectory somewhere, on S3 or elsewhere.
+
+One way to get a subtree is to call the `$cd()` method on a `FileSystem`

Review comment:
       ```suggestion
   A `SubTreeFileSystem` wraps a directory in another `FileSystem`, using that directory
   as its own root. One way to get a subtree is to call the `$chroot()` method on a
   `FileSystem`. `SubTreeFileSystem`s can be useful for holding a reference to a subdirectory
   somewhere, on S3 or elsewhere. The object that `s3_bucket()` returns is technically a
   `SubTreeFileSystem`, scoped to the given bucket.
   
   
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org