Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/19 09:22:04 UTC

[GitHub] [arrow] pitrou commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

pitrou commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r924259938


##########
docs/source/python/dataset.rst:
##########
@@ -355,7 +355,7 @@ specifying a S3 path:
 
 .. code-block:: python
 
-    dataset = ds.dataset("s3://ursa-labs-taxi-data/", partitioning=["year", "month"])
+    dataset = ds.dataset("s3://voltrondata-labs-datasets/nyc-taxi/")

Review Comment:
   Is there also a reason for removing the `partitioning` option?



##########
r/tests/testthat/test-filesystem.R:
##########
@@ -190,3 +190,18 @@ test_that("s3_bucket", {
   skip_on_os("windows") # FIXME
   expect_identical(bucket$base_path, "ursa-labs-r-test/")
 })
+
+test_that("gs_bucket", {
+  skip_on_cran()
+  skip_if_not_available("gcs")
+  skip_if_offline()
+  bucket <- gs_bucket("voltrondata-labs-datasets")
+  expect_r6_class(bucket, "SubTreeFileSystem")
+  expect_r6_class(bucket$base_fs, "GcsFileSystem")
+  expect_identical(
+    capture.output(print(bucket)),
+    "SubTreeFileSystem: gs://voltrondata-labs-datasets/"
+  )
+  skip_on_os("windows") # FIXME

Review Comment:
   What happens on Windows exactly? These bare FIXMEs are not explanatory.



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support is included. On Linux, when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/nyc-taxi/?endpoint_override=https://storage.googleapis.com&allow_bucket_creation=true

Review Comment:
   This is syntactically incorrect.
   You're supposed to escape any special characters in parameter values.
   ```suggestion
   s3://voltrondata-labs-datasets/nyc-taxi/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
   ```
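The escaping the review suggests can be produced mechanically with standard URL-encoding; a sketch using Python's standard `urllib` (the bucket path is taken from the hunk above):

```python
from urllib.parse import quote, urlencode

# Percent-encode reserved characters (':' and '/') inside query-parameter
# values before embedding them in a filesystem URI, as the review suggests.
params = {
    "endpoint_override": "https://storage.googleapis.com",
    "allow_bucket_creation": "true",
}
query = urlencode(params, quote_via=quote, safe="")
uri = "s3://voltrondata-labs-datasets/nyc-taxi/?" + query
print(uri)
# s3://voltrondata-labs-datasets/nyc-taxi/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
```

With `safe=""`, `quote` encodes `:` as `%3A` and `/` as `%2F`, yielding exactly the form of the suggested replacement line.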



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org