Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/21 22:26:55 UTC

[GitHub] [arrow] nealrichardson commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

nealrichardson commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r927142884


##########
r/R/filesystem.R:
##########
@@ -448,6 +448,28 @@ s3_bucket <- function(bucket, ...) {
   SubTreeFileSystem$create(fs_and_path$path, fs)
 }
 
+#' Connect to an Google Storage Service (GCS) bucket

Review Comment:
   Google Cloud Storage? "Google Storage Service (GCS)" looks weird; the acronym doesn't match.



##########
r/.gitignore:
##########
@@ -18,6 +18,7 @@ vignettes/nyc-taxi/
 arrow_*.tar.gz
 arrow_*.tgz
 extra-tests/files
+.deps

Review Comment:
   No objection; what creates this?



##########
r/vignettes/dataset.Rmd:
##########
@@ -312,19 +326,20 @@ percentage of rows from each batch:
 sampled_data <- ds %>%
   filter(year == 2015) %>%
   select(tip_amount, total_amount, passenger_count) %>%
-  map_batches(~ sample_frac(as.data.frame(.), 1e-4)) %>%
-  mutate(tip_pct = tip_amount / total_amount)
+  map_batches(~ as_record_batch(sample_frac(as.data.frame(.), 1e-4))) %>%
+  mutate(tip_pct = tip_amount / total_amount) %>%
+  collect()
 
 str(sampled_data)
 ```
 
 ```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
 cat("
-'data.frame':	15603 obs. of  4 variables:
- $ tip_amount     : num  0 0 1.55 1.45 5.2 ...
- $ total_amount   : num  5.8 16.3 7.85 8.75 26 ...
- $ passenger_count: int  1 1 1 1 1 6 5 1 2 1 ...
- $ tip_pct        : num  0 0 0.197 0.166 0.2 ...
+tibble [10,918 × 4] (S3: tbl_df/tbl/data.frame)
+ $ tip_amount     : num [1:10918] 3 0 4 1 1 6 0 1.35 0 5.9 ...
+ $ total_amount   : num [1:10918] 18.8 13.3 20.3 15.8 13.3 ...
+ $ passenger_count: int [1:10918] 3 2 1 1 1 1 1 1 1 3 ...
+ $ tip_pct        : num [1:10918] 0.1596 0 0.197 0.0633 0.0752 ...

Review Comment:
   Not in scope here (and probably not relevant for this exact line; it just made me think of it): we should show an example of `glimpse()` in this vignette somewhere.
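
   For instance, something along these lines (just a sketch reusing the `ds` object the vignette already builds; I haven't verified that `glimpse()` dispatches on a dataset query):

   ```r
   library(arrow)
   library(dplyr)

   # Sketch only: assumes glimpse() has a method for the dataset/query object;
   # if not, collecting a handful of rows first would give a similar preview.
   ds %>%
     filter(year == 2015) %>%
     select(tip_amount, total_amount, passenger_count) %>%
     glimpse()
   ```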



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 

Review Comment:
   Maybe this helps finesse it? The full answer is in the vignette that is linked on the next line. 
   
   ```suggestion
   installing from source, S3 and GCS support is not always enabled by default, and it has 
   ```



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+tells the `S3FileSystem` that it should allow the creation of new buckets and to 
+talk to Google Storage instead of S3. The latter works because GCS implements an 

Review Comment:
   ! Why didn't we do this before then?!
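
   (For the vignette, the S3-compatible route could be illustrated with something like the following -- a sketch only, with the argument names as I remember them:)

   ```r
   library(arrow)

   # Sketch: point S3FileSystem at GCS's S3-compatible endpoint.
   # anonymous and endpoint_override are assumed to be accepted here.
   fs <- S3FileSystem$create(
     anonymous = TRUE,
     endpoint_override = "https://storage.googleapis.com"
   )
   ```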



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+tells the `S3FileSystem` that it should allow the creation of new buckets and to 
+talk to Google Storage instead of S3. The latter works because GCS implements an 
+S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3) 
+below--but for better support for GCS use the GCSFileSystem with `gs://`.
+
+In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds
+a request may spend retrying before returning an error. The current default is 
+15 minutes, so in many interactive contexts it's nice to set a lower value:
+
+```
+gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10
+```
+

Review Comment:
   It might be worth showing how this is equivalent to `GcsFileSystem$create(anonymous = TRUE, retry_limit_seconds = 10)$path("voltrondata-labs-datasets/nyc-taxi/")` or whatever it is. As in, URI query params are useful but not the only way to provide these options.
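
   Something like this, side by side (untested; the exact argument names may differ):

   ```r
   library(arrow)

   # URI form: options passed as query parameters
   ds <- open_dataset(
     "gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10"
   )

   # Filesystem-object form: the same options as arguments
   gcs <- GcsFileSystem$create(anonymous = TRUE, retry_limit_seconds = 10)
   ds <- open_dataset(gcs$path("voltrondata-labs-datasets/nyc-taxi/"))
   ```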



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.

Review Comment:
   Oof, the answer is "it depends": it's not on by default in the pure source build, but if your system meets the requirements (which are too fussy to expand on here), you may get a prebuilt libarrow binary, which (may) have support for S3 and GCS.
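
   (One way to make "it depends" concrete in the vignette would be to point readers at the capability flags -- a sketch, assuming `arrow_with_gcs()` is exported alongside `arrow_with_s3()`:)

   ```r
   library(arrow)

   # Check whether the installed libarrow was built with S3/GCS support.
   # arrow_with_s3() exists today; arrow_with_gcs() is assumed here.
   arrow_with_s3()
   arrow_with_gcs()
   ```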



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+tells the `S3FileSystem` that it should allow the creation of new buckets and to 
+talk to Google Storage instead of S3. The latter works because GCS implements an 
+S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3) 
+below--but for better support for GCS use the GCSFileSystem with `gs://`.
+
+In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds
+a request may spend retrying before returning an error. The current default is 
+15 minutes, so in many interactive contexts it's nice to set a lower value:
+
+```
+gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10
+```
+

Review Comment:
   Oh, since the FS objects are introduced below, maybe this note about equivalence belongs down there instead.


