Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/27 14:40:05 UTC

[GitHub] [arrow] jonkeane commented on a change in pull request #10546: ARROW-12845: [R] [C++] S3 connections for different providers

jonkeane commented on a change in pull request #10546:
URL: https://github.com/apache/arrow/pull/10546#discussion_r677509801



##########
File path: r/vignettes/fs.Rmd
##########
@@ -128,3 +128,74 @@ s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
 
 Among other applications, this can be useful for testing out code locally before
 running on a remote S3 bucket.
+
+## Non-AWS S3 cloud alternatives (DigitalOcean, IBM, Alibaba, and others)
+
+*This section adapts some elements from [Analyzing Room Temperature Data](https://www.jaredlander.com/2021/03/analyzing-room-temperature-data/#getting-the-data) by Jared Lander.*
+
+If you are using any Amazon S3-compatible storage provider, such as AWS, Alibaba, 
+Ceph, DigitalOcean, DreamHost, IBM COS, MinIO, or others, you can connect to it 
+with `arrow` using the `S3FileSystem` function, just as in the local MinIO 
+example above. Note that DigitalOcean is used here only as an example: any 
+other S3-compatible service works the same way.
+
+At the beginning of this vignette we used:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
+```
+
+This connects to AWS; the same approach can be adapted to other providers. For 
+instructional purposes, we provide [nyc-taxi.sfo3.digitaloceanspaces.com](https://nyc-taxi.sfo3.digitaloceanspaces.com), 
+a public space containing the NYC taxi data used in
+[Working with Arrow Datasets and dplyr](dataset.html).
+
+To connect to this space, you only need to adapt the code from the previous
+section:
+
+```r
+space <- arrow::S3FileSystem$create(
+  anonymous = TRUE,
+  scheme = "https",
+  endpoint_override = "sfo3.digitaloceanspaces.com"
+)
+```
+
+The space that we are using allows anonymous access, but to connect to a 
+private space (i.e. one containing sensitive data), you would need to provide 
+credentials, for example:
+
+```r
+space <- arrow::S3FileSystem$create(
+  access_key = Sys.getenv('DO_ARROW_TAXI_TOKEN'),
+  secret_key = Sys.getenv('DO_ARROW_TAXI_SECRET'),
+  scheme = "https",
+  endpoint_override = "sfo3.digitaloceanspaces.com"
+)
+```

Review comment:
       Instead of using digital ocean as our "here is an example of an alternative" let's lean into the minio example that is already there. That has a few nice advantages: 
   * anyone can install minio without needing to put data in a service
   * we can test against minio to confirm that this works
   * we don't need to maintain a new data source
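   For reference, a minimal sketch of what that MinIO-based flow might look like, assuming MinIO's default `minioadmin`/`minioadmin` credentials and a local server on port 9000 (matching the URI shown earlier in the vignette; the server path is hypothetical):

   ```r
   # In a shell, start a local MinIO server first (the path is just an example):
   #   minio server /tmp/minio-data

   library(arrow)

   # Connect with the same S3FileSystem$create() interface used for AWS,
   # pointing at the local MinIO endpoint instead
   minio <- S3FileSystem$create(
     access_key = "minioadmin",
     secret_key = "minioadmin",
     scheme = "http",
     endpoint_override = "localhost:9000"
   )

   # Confirm the connection by listing what the server holds
   minio$ls("", recursive = TRUE)
   ```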

##########
File path: r/vignettes/fs.Rmd
##########
@@ -128,3 +128,74 @@ s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
 
 Among other applications, this can be useful for testing out code locally before
 running on a remote S3 bucket.
+
+## Non-AWS S3 cloud alternatives (DigitalOcean, IBM, Alibaba, and others)
+
+*This section adapts some elements from [Analyzing Room Temperature Data](https://www.jaredlander.com/2021/03/analyzing-room-temperature-data/#getting-the-data) by Jared Lander.*
+
+If you are using any Amazon S3-compatible storage provider, such as AWS, Alibaba, 
+Ceph, DigitalOcean, DreamHost, IBM COS, MinIO, or others, you can connect to it 
+with `arrow` using the `S3FileSystem` function, just as in the local MinIO 
+example above. Note that DigitalOcean is used here only as an example: any 
+other S3-compatible service works the same way.

Review comment:
       The list of providers is great, we should add this to the collapsed section in the proposed reorganization.

##########
File path: r/vignettes/fs.Rmd
##########
@@ -128,3 +128,74 @@ s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
 
 Among other applications, this can be useful for testing out code locally before
 running on a remote S3 bucket.
+
+## Non-AWS S3 cloud alternatives (DigitalOcean, IBM, Alibaba, and others)
+
+*This section adapts some elements from [Analyzing Room Temperature Data](https://www.jaredlander.com/2021/03/analyzing-room-temperature-data/#getting-the-data) by Jared Lander.*
+
+If you are using any Amazon S3-compatible storage provider, such as AWS, Alibaba, 
+Ceph, DigitalOcean, DreamHost, IBM COS, MinIO, or others, you can connect to it 
+with `arrow` using the `S3FileSystem` function, just as in the local MinIO 
+example above. Note that DigitalOcean is used here only as an example: any 
+other S3-compatible service works the same way.
+
+At the beginning of this vignette we used:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
+```
+
+This connects to AWS; the same approach can be adapted to other providers. For 
+instructional purposes, we provide [nyc-taxi.sfo3.digitaloceanspaces.com](https://nyc-taxi.sfo3.digitaloceanspaces.com), 
+a public space containing the NYC taxi data used in
+[Working with Arrow Datasets and dplyr](dataset.html).
+
+To connect to this space, you only need to adapt the code from the previous
+section:
+
+```r
+space <- arrow::S3FileSystem$create(
+  anonymous = TRUE,
+  scheme = "https",
+  endpoint_override = "sfo3.digitaloceanspaces.com"
+)
+```
+
+The space that we are using allows anonymous access, but to connect to a 
+private space (i.e. one containing sensitive data), you would need to provide 
+credentials, for example:
+
+```r
+space <- arrow::S3FileSystem$create(
+  access_key = Sys.getenv('DO_ARROW_TAXI_TOKEN'),
+  secret_key = Sys.getenv('DO_ARROW_TAXI_SECRET'),
+  scheme = "https",
+  endpoint_override = "sfo3.digitaloceanspaces.com"
+)
+```
+
+To list the files in the space, type:
+
+```r
+space$ls('nyc-taxi', recursive = TRUE)
+```
+
+As with AWS, one way to get a subtree is to call the `$path()` method on a 
+`FileSystem`:
+
+```r
+june2019 <- space$path("nyc-taxi/2019/06")
+df <- read_parquet(june2019$path("data.parquet"))
+```
+
+From here, the same example from the [Working with Arrow Datasets and dplyr](dataset.html) vignette can be completed with a single change:
+
+```r 
+copy_files(space$path("nyc-taxi/"), "nyc-taxi")
+```
+
+Instead of:
+
+```r
+copy_files("s3://ursa-labs-taxi-data", "nyc-taxi")
+```

Review comment:
       I don't think we necessarily need to re-hash all of the filesystem operations. How about instead of all of these, we add to the minio example above:
   
   1. the command needed to run minio (`minio server {path}`) so that someone could run that more easily without having to learn about how minio works
   1. copy a small subset of the taxi data (maybe a year or a few months from a single year)
   1. how to open a dataset (if the root of the minio path one is using is the root of the dataset, one might do `ds <- open_dataset(minio$path(""), partitioning = c("year", "month"))`, which is a bit counter-intuitive, so we should spell it out in the vignette)
   1. How to open a single parquet file?
   1. a more detailed description of the differences (and similarities) between `S3FileSystem$create` and the URI (in other words: show the correspondences between the elements of the URI and the arguments of `S3FileSystem$create`)
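   On that last point, a possible side-by-side sketch (reusing the local MinIO credentials from the URI earlier in the vignette) could map each piece of the URI onto the corresponding `S3FileSystem$create` argument:

   ```r
   library(arrow)

   # URI form used earlier in the vignette:
   #   s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
   #
   # The same connection expressed through S3FileSystem$create():
   minio <- S3FileSystem$create(
     access_key = "minioadmin",            # user before ":" in the URI authority
     secret_key = "minioadmin",            # password after ":" in the URI authority
     scheme = "http",                      # the scheme= query parameter
     endpoint_override = "localhost:9000"  # endpoint_override=, with %3A decoded to ":"
   )
   ```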

##########
File path: r/vignettes/fs.Rmd
##########
@@ -128,3 +128,74 @@ s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
 
 Among other applications, this can be useful for testing out code locally before
 running on a remote S3 bucket.
+
+## Non-AWS S3 cloud alternatives (DigitalOcean, IBM, Alibaba, and others)
+
+*This section adapts some elements from [Analyzing Room Temperature Data](https://www.jaredlander.com/2021/03/analyzing-room-temperature-data/#getting-the-data) by Jared Lander.*
+
+If you are using any Amazon S3-compatible storage provider, such as AWS, Alibaba, 
+Ceph, DigitalOcean, DreamHost, IBM COS, MinIO, or others, you can connect to it 
+with `arrow` using the `S3FileSystem` function, just as in the local MinIO 
+example above. Note that DigitalOcean is used here only as an example: any 
+other S3-compatible service works the same way.
+
+At the beginning of this vignette we used:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
+```
+
+This connects to AWS; the same approach can be adapted to other providers. For 
+instructional purposes, we provide [nyc-taxi.sfo3.digitaloceanspaces.com](https://nyc-taxi.sfo3.digitaloceanspaces.com), 
+a public space containing the NYC taxi data used in
+[Working with Arrow Datasets and dplyr](dataset.html).

Review comment:
       This digital ocean bucket is one you created, yeah? I'm not sure that we want to create a new storage location that we need to maintain on top of what we have in s3 already. See below / in the comment at the top about how we should reorganize this to avoid that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org