Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/13 22:31:09 UTC

[GitHub] [arrow] wjones127 opened a new pull request, #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

wjones127 opened a new pull request, #13601:
URL: https://github.com/apache/arrow/pull/13601

   I also replaced all references to the Ursa Labs bucket with the new `voltrondata-labs-datasets` bucket. That seems to be the last remaining mention of Ursa within the repo.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13601:
URL: https://github.com/apache/arrow/pull/13601#issuecomment-1194932832

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/594ebd5ee1f84cdfb48ee81f610943ab...b3d753a085414e199bb265a3b2bd9ac0/)
   




[GitHub] [arrow] nealrichardson commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r927142884


##########
r/R/filesystem.R:
##########
@@ -448,6 +448,28 @@ s3_bucket <- function(bucket, ...) {
   SubTreeFileSystem$create(fs_and_path$path, fs)
 }
 
+#' Connect to an Google Storage Service (GCS) bucket

Review Comment:
   Google Cloud Storage? "Google Storage Service (GCS)" looks weird, the acronym doesn't match.



##########
r/.gitignore:
##########
@@ -18,6 +18,7 @@ vignettes/nyc-taxi/
 arrow_*.tar.gz
 arrow_*.tgz
 extra-tests/files
+.deps

Review Comment:
   No objection; what creates this?



##########
r/vignettes/dataset.Rmd:
##########
@@ -312,19 +326,20 @@ percentage of rows from each batch:
 sampled_data <- ds %>%
   filter(year == 2015) %>%
   select(tip_amount, total_amount, passenger_count) %>%
-  map_batches(~ sample_frac(as.data.frame(.), 1e-4)) %>%
-  mutate(tip_pct = tip_amount / total_amount)
+  map_batches(~ as_record_batch(sample_frac(as.data.frame(.), 1e-4))) %>%
+  mutate(tip_pct = tip_amount / total_amount) %>%
+  collect()
 
 str(sampled_data)
 ```
 
 ```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
 cat("
-'data.frame':	15603 obs. of  4 variables:
- $ tip_amount     : num  0 0 1.55 1.45 5.2 ...
- $ total_amount   : num  5.8 16.3 7.85 8.75 26 ...
- $ passenger_count: int  1 1 1 1 1 6 5 1 2 1 ...
- $ tip_pct        : num  0 0 0.197 0.166 0.2 ...
+tibble [10,918 × 4] (S3: tbl_df/tbl/data.frame)
+ $ tip_amount     : num [1:10918] 3 0 4 1 1 6 0 1.35 0 5.9 ...
+ $ total_amount   : num [1:10918] 18.8 13.3 20.3 15.8 13.3 ...
+ $ passenger_count: int [1:10918] 3 2 1 1 1 1 1 1 1 3 ...
+ $ tip_pct        : num [1:10918] 0.1596 0 0.197 0.0633 0.0752 ...

Review Comment:
   Not in scope here (and probably not relevant for this exact line, this just made me think of it): we should show an example of `glimpse()` in this vignette somewhere.



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 

Review Comment:
   Maybe this helps finesse it? The full answer is in the vignette that is linked on the next line. 
   
   ```suggestion
   installing from source, S3 and GCS support is not always enabled by default, and it has 
   ```



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+tells the `S3FileSystem` that it should allow the creation of new buckets and to 
+talk to Google Storage instead of S3. The latter works because GCS implements an 

Review Comment:
   ! Why didn't we do this before then?!
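[Editor's note: the query-string options quoted in the hunk above are percent-encoded and separated by `&`, and any standard URI parser recovers them; a minimal stdlib sketch (bucket name and option values taken from the quoted diff, not from the Arrow API):

```python
from urllib.parse import urlsplit, parse_qs

# URI from the quoted vignette diff: two options in the query string,
# with the endpoint URL percent-encoded.
uri = ("s3://voltrondata-labs-datasets/"
       "?endpoint_override=https%3A%2F%2Fstorage.googleapis.com"
       "&allow_bucket_creation=true")

parts = urlsplit(uri)
# parse_qs splits on "&" and percent-decodes each value.
options = {key: values[0] for key, values in parse_qs(parts.query).items()}

# options["endpoint_override"] decodes back to a plain URL,
# which is how the S3FileSystem is pointed at Google's S3-compatible API.
```
]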



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+tells the `S3FileSystem` that it should allow the creation of new buckets and to 
+talk to Google Storage instead of S3. The latter works because GCS implements an 
+S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3) 
+below--but for better support for GCS use the GCSFileSystem with `gs://`.
+
+In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds
+a request may spend retrying before returning an error. The current default is 
+15 minutes, so in many interactive contexts it's nice to set a lower value:
+
+```
+gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10
+```
+

Review Comment:
   It might be worth showing how this is equivalent to `GcsFileSystem$create(anonymous = TRUE, retry_limit_seconds = 10)$path("voltrondata-labs-datasets/nyc-taxi/")` or whatever it is. As in, URI query params are useful but not the only way to provide these options.
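[Editor's note: the equivalence the reviewer sketches — URI query parameters versus passing the same options when constructing the filesystem object — can be illustrated without Arrow at all. The `uri_to_kwargs` helper below is hypothetical and only shows the mapping; the real `GcsFileSystem$create()` argument names may differ:

```python
from urllib.parse import urlsplit, parse_qs

def uri_to_kwargs(uri):
    """Split a gs:// URI into (bucket_path, constructor_kwargs).

    Hypothetical illustration of how URI pieces map onto filesystem
    constructor options; not the actual Arrow API.
    """
    parts = urlsplit(uri)
    kwargs = {key: values[0] for key, values in parse_qs(parts.query).items()}
    if parts.username == "anonymous":
        # "anonymous@" in the authority section requests anonymous access.
        kwargs["anonymous"] = True
        bucket = parts.hostname
    else:
        bucket = parts.netloc
    return bucket + parts.path, kwargs

path, kwargs = uri_to_kwargs(
    "gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10"
)
```
]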



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.

Review Comment:
   Oof, the answer is "it depends": it's not on by default in the pure source build, but if your system meets the requirements (which are too fussy to expand on here), you may get a prebuilt libarrow binary, which may have support for S3 and GCS. 



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+tells the `S3FileSystem` that it should allow the creation of new buckets and to 
+talk to Google Storage instead of S3. The latter works because GCS implements an 
+S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3) 
+below--but for better support for GCS use the GCSFileSystem with `gs://`.
+
+In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds
+a request may spend retrying before returning an error. The current default is 
+15 minutes, so in many interactive contexts it's nice to set a lower value:
+
+```
+gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10
+```
+

Review Comment:
   Oh, since the FS objects are introduced below, maybe this note about equivalence should be introduced down there. 





[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r924582579


##########
docs/source/python/dataset.rst:
##########
@@ -355,7 +355,7 @@ specifying a S3 path:
 
 .. code-block:: python
 
-    dataset = ds.dataset("s3://ursa-labs-taxi-data/", partitioning=["year", "month"])
+    dataset = ds.dataset("s3://voltrondata-labs-datasets/nyc-taxi/")

Review Comment:
   This is a newer version of the dataset that is Hive partitioned, so it doesn't require explicitly passing the partitioning. 
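[Editor's note: "Hive partitioned" means the partition column values are encoded directly in the directory names (`year=2019/month=6`), so a reader can recover them from the path alone — which is why the explicit `partitioning=` argument is no longer needed. A hypothetical sketch of that recovery:

```python
def hive_partition_values(path):
    """Extract key=value partition segments from a Hive-style path.

    Illustrative only; Arrow's dataset readers do this internally.
    """
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values

# For the nyc-taxi file path quoted in this PR, this recovers
# the year and month partition columns from the directories.
parts = hive_partition_values("nyc-taxi/year=2019/month=6/data.parquet")
```
]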





[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r925170866


##########
r/tests/testthat/test-filesystem.R:
##########
@@ -190,3 +190,18 @@ test_that("s3_bucket", {
   skip_on_os("windows") # FIXME
   expect_identical(bucket$base_path, "ursa-labs-r-test/")
 })
+
+test_that("gs_bucket", {
+  skip_on_cran()
+  skip_if_not_available("gcs")
+  skip_if_offline()
+  bucket <- gs_bucket("voltrondata-labs-datasets")
+  expect_r6_class(bucket, "SubTreeFileSystem")
+  expect_r6_class(bucket$base_fs, "GcsFileSystem")
+  expect_identical(
+    capture.output(print(bucket)),
+    "SubTreeFileSystem: gs://voltrondata-labs-datasets/"
+  )
+  skip_on_os("windows") # FIXME

Review Comment:
   I deleted the skips and it seems to run fine locally and in CI.





[GitHub] [arrow] nealrichardson commented on pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on PR #13601:
URL: https://github.com/apache/arrow/pull/13601#issuecomment-1194288432

   > @jonkeane @nealrichardson can we go ahead and merge it?
   
   I already approved and don't feel the need to re-review, I'm sure @wjones127 took care of everything, please merge.




[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r924659991


##########
r/.gitignore:
##########
@@ -18,6 +18,7 @@ vignettes/nyc-taxi/
 arrow_*.tar.gz
 arrow_*.tgz
 extra-tests/files
+.deps

Review Comment:
   I added this because it got a lot of crap in it while building the site; lmk if there's a good reason to not add it.





[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r923864904


##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.

Review Comment:
   IIRC we recently changed how the Linux installation works. Do these lines still apply?





[GitHub] [arrow] github-actions[bot] commented on pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13601:
URL: https://github.com/apache/arrow/pull/13601#issuecomment-1183746510

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.




[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r924786305


##########
r/man/array.Rd:
##########
@@ -41,14 +41,12 @@ but not limited to strings only)
 }
 
 \section{Usage}{
-
-
-\if{html}{\out{<div class="sourceCode">}}\preformatted{a <- Array$create(x)
+\preformatted{a <- Array$create(x)

Review Comment:
   Upgrading to roxygen2 7.2 fixed this.






[GitHub] [arrow] ianmcook commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
ianmcook commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r926784120


##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.

Review Comment:
   @nealrichardson do you know the answer to this?





[GitHub] [arrow] ursabot commented on pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13601:
URL: https://github.com/apache/arrow/pull/13601#issuecomment-1194932722

   Benchmark runs are scheduled for baseline = ab8c92cf0761fa0399860a7110038f9db8a76231 and contender = 42647dcd00d2ac4f92593c9ce54b05fe8322c91a. 42647dcd00d2ac4f92593c9ce54b05fe8322c91a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d3e65be4456144479bcecaeaf287968e...eb624a925c174340ba1d7215649b61bd/)
   [Finished :arrow_down:0.07% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/a25f13f764954f0695c2239d1015a12c...22e7db9c085a46199941602b5806f7e0/)
   [Finished :arrow_down:0.54% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/594ebd5ee1f84cdfb48ee81f610943ab...b3d753a085414e199bb265a3b2bd9ac0/)
   [Finished :arrow_down:0.25% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/95a8388c6db4409c9d8eaa9f4f4a96dc...17345921fa2e410e83b12f1bfba89c87/)
   Buildkite builds:
   [Failed] [`42647dcd` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1195)
   [Finished] [`42647dcd` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1207)
   [Finished] [`42647dcd` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1189)
   [Finished] [`42647dcd` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1209)
   [Failed] [`ab8c92cf` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1194)
   [Finished] [`ab8c92cf` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1206)
   [Finished] [`ab8c92cf` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1188)
   [Finished] [`ab8c92cf` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1208)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r924664469


##########
r/man/array.Rd:
##########
@@ -41,14 +41,12 @@ but not limited to strings only)
 }
 
 \section{Usage}{
-
-
-\if{html}{\out{<div class="sourceCode">}}\preformatted{a <- Array$create(x)
+\preformatted{a <- Array$create(x)

Review Comment:
   This isn't intentional. I just ran `make doc` locally. Is something about my R out of date, or does that command need an update?





[GitHub] [arrow] thisisnic commented on pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
thisisnic commented on PR #13601:
URL: https://github.com/apache/arrow/pull/13601#issuecomment-1192758053

   > cc @thisisnic would you be willing to review?
   
   With combined vacation + rstudio::conf I won't have time to look at this til 2nd August, so probably best not to wait for me!




[GitHub] [arrow] kszucs commented on pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
kszucs commented on PR #13601:
URL: https://github.com/apache/arrow/pull/13601#issuecomment-1194221717

   @jonkeane @nealrichardson can we go ahead and merge it? 




[GitHub] [arrow] wjones127 commented on pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on PR #13601:
URL: https://github.com/apache/arrow/pull/13601#issuecomment-1191603095

   cc @thisisnic would you be willing to review?




[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r924659991


##########
r/.gitignore:
##########
@@ -18,6 +18,7 @@ vignettes/nyc-taxi/
 arrow_*.tar.gz
 arrow_*.tgz
 extra-tests/files
+.deps

Review Comment:
   I added this because it got a lot of files in it while building the R doc site; lmk if there's a good reason to not add it.





[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r927878231


##########
r/.gitignore:
##########
@@ -18,6 +18,7 @@ vignettes/nyc-taxi/
 arrow_*.tar.gz
 arrow_*.tgz
 extra-tests/files
+.deps

Review Comment:
   When I run `pkgdown::build_site()` within RStudio, during the "Installing package into temporary library" it creates this directory.





[GitHub] [arrow] kszucs merged pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
kszucs merged PR #13601:
URL: https://github.com/apache/arrow/pull/13601




[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r924584479


##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/nyc-taxi/?endpoint_override=https://storage.googleapis.com&allow_bucket_creation=true

Review Comment:
   Oddly enough, it does work in practice 🙃  But I can switch since it probably is preferable in general to percent encode them.



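As a side note on the escaping discussed above, the behavior can be sketched with Python's standard `urllib` (a generic illustration of URI encoding and parsing, not Arrow's own URI handling; the bucket and endpoint values are simply the ones from the thread):

```python
from urllib.parse import quote, urlsplit, parse_qs

# The endpoint_override value contains reserved characters (":" and "/"),
# so it should be percent-encoded before being placed in a query string.
endpoint = "https://storage.googleapis.com"
uri = (
    "s3://voltrondata-labs-datasets/nyc-taxi/"
    "?endpoint_override=" + quote(endpoint, safe="")
    + "&allow_bucket_creation=true"
)

# Parsing the URI decodes the percent-encoding and recovers the values.
parts = urlsplit(uri)
options = {k: v[0] for k, v in parse_qs(parts.query).items()}
print(options["endpoint_override"])      # https://storage.googleapis.com
print(options["allow_bucket_creation"])  # true
```

As noted in the thread, lenient parsers may accept the unescaped form, but the percent-encoded form is the syntactically correct one.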


[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r927910312


##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+tells the `S3FileSystem` that it should allow the creation of new buckets and to 
+talk to Google Storage instead of S3. The latter works because GCS implements an 
+S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3) 
+below--but for better support for GCS use the GCSFileSystem with `gs://`.
+
+In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds
+a request may spend retrying before returning an error. The current default is 
+15 minutes, so in many interactive contexts it's nice to set a lower value:
+
+```
+gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10
+```
+

Review Comment:
   I've reordered them so we introduce the filesystems and that made it more natural to make the statement you suggested.



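For readers following along, the way the `anonymous` credential and options such as `retry_limit_seconds` ride inside such a URI can be sketched with Python's standard `urllib` (a generic illustration of URI structure, not Arrow's actual parser):

```python
from urllib.parse import urlsplit, parse_qs

# Example GCS URI from the vignette: userinfo carries the credential,
# the host is the bucket, and the query carries filesystem options.
uri = "gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10"

parts = urlsplit(uri)
print(parts.scheme)    # gs
print(parts.username)  # anonymous -> unauthenticated access
print(parts.hostname)  # voltrondata-labs-datasets (the bucket)
print(parts.path)      # /nyc-taxi/

options = parse_qs(parts.query)
print(options["retry_limit_seconds"])  # ['10']
```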


[GitHub] [arrow] github-actions[bot] commented on pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13601:
URL: https://github.com/apache/arrow/pull/13601#issuecomment-1183746486

   https://issues.apache.org/jira/browse/ARROW-16887




[GitHub] [arrow] pitrou commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r924259938


##########
docs/source/python/dataset.rst:
##########
@@ -355,7 +355,7 @@ specifying a S3 path:
 
 .. code-block:: python
 
-    dataset = ds.dataset("s3://ursa-labs-taxi-data/", partitioning=["year", "month"])
+    dataset = ds.dataset("s3://voltrondata-labs-datasets/nyc-taxi/")

Review Comment:
   Is there also a reason for removing the `partitioning` option?



##########
r/tests/testthat/test-filesystem.R:
##########
@@ -190,3 +190,18 @@ test_that("s3_bucket", {
   skip_on_os("windows") # FIXME
   expect_identical(bucket$base_path, "ursa-labs-r-test/")
 })
+
+test_that("gs_bucket", {
+  skip_on_cran()
+  skip_if_not_available("gcs")
+  skip_if_offline()
+  bucket <- gs_bucket("voltrondata-labs-datasets")
+  expect_r6_class(bucket, "SubTreeFileSystem")
+  expect_r6_class(bucket$base_fs, "GcsFileSystem")
+  expect_identical(
+    capture.output(print(bucket)),
+    "SubTreeFileSystem: gs://voltrondata-labs-datasets/"
+  )
+  skip_on_os("windows") # FIXME

Review Comment:
   What happens on Windows exactly? These bare FIXMEs are not explanatory.



##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/nyc-taxi/?endpoint_override=https://storage.googleapis.com&allow_bucket_creation=true

Review Comment:
   This is syntactically incorrect.
   You're supposed to escape any special characters in parameter values.
   ```suggestion
   s3://voltrondata-labs-datasets/nyc-taxi/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
   ```



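On the `partitioning` question above: the new bucket layout uses hive-style `key=value` path segments (`year=2019/month=6`), so the partition keys are self-describing, which is presumably why the explicit `partitioning=` argument became unnecessary. A minimal sketch of how such segments decompose (a hypothetical helper for illustration, not Arrow code):

```python
def hive_partitions(path):
    """Extract key=value partition segments from a hive-style path."""
    parts = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(hive_partitions("nyc-taxi/year=2019/month=6/data.parquet"))
# {'year': '2019', 'month': '6'}
```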


[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r924583352


##########
r/tests/testthat/test-filesystem.R:
##########
@@ -190,3 +190,18 @@ test_that("s3_bucket", {
   skip_on_os("windows") # FIXME
   expect_identical(bucket$base_path, "ursa-labs-r-test/")
 })
+
+test_that("gs_bucket", {
+  skip_on_cran()
+  skip_if_not_available("gcs")
+  skip_if_offline()
+  bucket <- gs_bucket("voltrondata-labs-datasets")
+  expect_r6_class(bucket, "SubTreeFileSystem")
+  expect_r6_class(bucket$base_fs, "GcsFileSystem")
+  expect_identical(
+    capture.output(print(bucket)),
+    "SubTreeFileSystem: gs://voltrondata-labs-datasets/"
+  )
+  skip_on_os("windows") # FIXME

Review Comment:
   This was copied over from the S3 tests. I'm not sure what happens, but I can try and see.



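For context on the `SubTreeFileSystem` output asserted in the test above: a subtree filesystem resolves every caller-supplied path against a fixed base path on a wrapped filesystem. A rough, purely illustrative sketch of that idea (string manipulation only, not Arrow's implementation):

```python
from posixpath import join, normpath

class SubTree:
    """Toy model: prefix every path with a fixed base path."""

    def __init__(self, base_path):
        self.base_path = base_path.rstrip("/") + "/"

    def resolve(self, path):
        # Resolve the relative path under the base path.
        return normpath(join(self.base_path, path.lstrip("/")))

bucket = SubTree("voltrondata-labs-datasets")
print(bucket.resolve("nyc-taxi/year=2019/month=6/data.parquet"))
# voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
```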


[GitHub] [arrow] wjones127 commented on a diff in pull request #13601: ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS

Posted by GitBox <gi...@apache.org>.
wjones127 commented on code in PR #13601:
URL: https://github.com/apache/arrow/pull/13601#discussion_r927152972


##########
r/vignettes/fs.Rmd:
##########
@@ -1,56 +1,90 @@
 ---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+  %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
 The Arrow C++ library includes a generic filesystem interface and specific
 implementations for some cloud storage systems. This setup allows various
 parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with 
+S3 and GCS data using Arrow.
 
-> In Windows and macOS binary packages, S3 support is included. On Linux when 
-installing from source, S3 support is not enabled by default, and it has 
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when 
+installing from source, S3 and GCS support is not enabled by default, and it has 
 additional system requirements. See `vignette("install", package = "arrow")` 
 for details.
 
 ## URIs
 
 File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
 An S3 URI looks like:
 
 ```
 s3://[access_key:secret_key@]bucket/path[?region=]
 ```
 
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path[?region=]
+gs://anonymous@bucket/path[?region=]
+```
+
 For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
 
 ```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+# gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
 ```
 
 Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
 
 ```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
 ```
 
 Note that this will be slower to read than if the file were local,
 though if you're running on a machine in the same AWS region as the file in S3,
 the cost of reading the data over the network should be much lower.
 
+### URI options
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+tells the `S3FileSystem` that it should allow the creation of new buckets and to 
+talk to Google Storage instead of S3. The latter works because GCS implements an 
+S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3) 
+below--but for better support for GCS use the GCSFileSystem with `gs://`.
+
+In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds
+a request may spend retrying before returning an error. The current default is 
+15 minutes, so in many interactive contexts it's nice to set a lower value:
+
+```
+gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10
+```
+

Review Comment:
   Yeah I wasn't sure I wanted to do this refactor, but it did feel like it would almost be easier to introduce `GcsFileSystem$create()` and *then* explain options in URIs.


