Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/18 01:27:11 UTC

[GitHub] [arrow] boshek opened a new pull request, #13183: Arrow 16144: [R] Write compressed data streams (particularly over S3)

boshek opened a new pull request, #13183:
URL: https://github.com/apache/arrow/pull/13183

   This PR enables reading/writing compressed data streams over S3 and locally, and adds tests exercising some of those round trips. For the filesystem path I had to apply a small regex to the path string for compression detection, but any feedback on alternative approaches is very welcome. Previously, supplying a file with a compression extension wrote out an uncompressed file:
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   ## local
   file <- tempfile(fileext = ".csv")
   comp_file <- tempfile(fileext = ".csv.gz")
   write_csv_arrow(mtcars, file = file)
   write_csv_arrow(mtcars, file = comp_file)
   file.size(file)
   [1] 1303
   file.size(comp_file)
   [1] 567
   
   ## or with s3
   dir <- tempfile()
   dir.create(dir)
   subdir <- file.path(dir, "bucket")
   dir.create(subdir)
   
   minio_server <- processx::process$new("minio", args = c("server", dir), supervise = TRUE)
   Sys.sleep(2)
   stopifnot(minio_server$is_alive())
   
   s3_uri <- "s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000"
   bucket <- s3_bucket(s3_uri)
   
   write_csv_arrow(mtcars, bucket$path("bucket/data.csv.gz"))
   write_csv_arrow(mtcars, bucket$path("bucket/data.csv"))
   
   file.size(file.path(subdir, "data.csv.gz"))
   [1] 567
   file.size(file.path(subdir, "data.csv"))
   [1] 1303
   ```
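   For completeness, here is the read side of the round trip as a short sketch (`gz_file` is a hypothetical temp path for illustration; reading already went through `make_readable_file()`'s extension-based detection before this PR):
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   
   # Hypothetical temp file for illustration
   gz_file <- tempfile(fileext = ".csv.gz")
   write_csv_arrow(mtcars, gz_file)
   
   # Compression is detected from the ".gz" extension and the file is
   # decompressed transparently on read
   nrow(read_csv_arrow(gz_file))
   ```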


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r875999848


##########
r/R/io.R:
##########
@@ -292,7 +292,7 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem
   file
 }
 
-make_output_stream <- function(x, filesystem = NULL) {
+make_output_stream <- function(x, filesystem = NULL, compression = NULL) {

Review Comment:
   One thing to watch out for here: sometimes people name their parquet files `something.parquet.snappy`, but you wouldn't use a CompressedOutputStream for that; you'd pass the compression option to the parquet writer itself. I would guess that the make_readable_file() path handles this already, so maybe that can be a model (or maybe it doesn't and needs to).
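   A short sketch of the distinction (assuming the arrow build has snappy support; this is my reading of the comment, not something verified in this PR):
   
   ```r
   library(arrow)
   
   tf <- tempfile(fileext = ".snappy.parquet")
   
   # For Parquet, compression is a property of the format itself: each column
   # chunk is compressed individually, so the option goes to the writer...
   write_parquet(data.frame(x = 1:5), tf, compression = "snappy")
   
   # ...rather than wrapping the sink in a CompressedOutputStream, which would
   # compress the whole file and break the seekable structure Parquet relies on.
   ```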



##########
r/R/io.R:
##########
@@ -301,6 +301,7 @@ make_output_stream <- function(x, filesystem = NULL) {
     return(MakeRConnectionOutputStream(x))
   }
 
+

Review Comment:
   ```suggestion
   ```





[GitHub] [arrow] github-actions[bot] commented on pull request #13183: Arrow 16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13183:
URL: https://github.com/apache/arrow/pull/13183#issuecomment-1129473487

   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes), could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   




[GitHub] [arrow] nealrichardson commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r876393144


##########
r/R/io.R:
##########
@@ -292,7 +292,7 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem
   file
 }
 
-make_output_stream <- function(x, filesystem = NULL) {
+make_output_stream <- function(x, filesystem = NULL, compression = NULL) {

Review Comment:
   Sure, this fails on master too so ok to make a separate JIRA (please link to it here when you make it).






[GitHub] [arrow] nealrichardson commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r876000318


##########
r/R/io.R:
##########
@@ -322,6 +333,10 @@ detect_compression <- function(path) {
     return("uncompressed")
   }
 
+  ## Remove any trailing slashes
+  path <- gsub("/$", "", path)
+

Review Comment:
   ```suggestion
   ```





[GitHub] [arrow] boshek commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r876183135


##########
r/R/io.R:
##########
@@ -292,7 +292,7 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem
   file
 }
 
-make_output_stream <- function(x, filesystem = NULL) {
+make_output_stream <- function(x, filesystem = NULL, compression = NULL) {

Review Comment:
   So for the `parquet.snappy` or even `snappy.parquet` I think it works because "snappy" isn't included here: 
   https://github.com/apache/arrow/blob/3df2e0568240d6b629c0a3163df21a1a2a160810/r/R/io.R#L325-L330
   
   But if someone tries something like this, we get an error that isn't super informative. I _think_ this is outside the scope of this PR, so could the resolution here be to open another ticket for this specifically?
   ``` r
   library(arrow, warn.conflicts = FALSE)
   tf <- tempfile(fileext = ".parquet.gz")
   write_parquet(data.frame(x = 1:5), tf, compression = "gzip", compression_level = 5)
   read_parquet(tf)
   #> Error: file must be a "RandomAccessFile"
   ```
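   If it helps to narrow down the eventual ticket: my understanding is that the file above is actually a valid parquet file (gzip is applied per column chunk, not to the file as a whole), and it's the extension-based detection wrapping it in a CompressedInputStream, which isn't a RandomAccessFile, that triggers the error. A sketch of a workaround, assuming `mmap_open()` bypasses that detection as I expect:
   
   ```r
   # Open the file explicitly as a (seekable) memory-mapped RandomAccessFile,
   # so no extension-based compression wrapping happens
   read_parquet(mmap_open(tf))
   ```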



##########
r/tests/testthat/test-s3-minio.R:
##########
@@ -54,6 +54,24 @@ if (arrow_with_s3() && process_is_running("minio server")) {
     )
   })
 
+  test_that("read/write compressed csv by filesystem", {
+    dat <- tibble(x = seq(1, 10, by = 0.2))

Review Comment:
   I will add it in to be defensive.





[GitHub] [arrow] nealrichardson closed pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
URL: https://github.com/apache/arrow/pull/13183




[GitHub] [arrow] ursabot commented on pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13183:
URL: https://github.com/apache/arrow/pull/13183#issuecomment-1131232486

   Benchmark runs are scheduled for baseline = ce4dcbdf5f1bcfb3f23b494598b9125c8e7ee52e and contender = d2cbe9e0e2ce206fba71d3d171babe36bada1a9d. d2cbe9e0e2ce206fba71d3d171babe36bada1a9d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/fd040c4d0b6a48758dfd898cbfba4b86...36a656f35d7a4d48b792a81906043fc1/)
   [Failed :arrow_down:1.29% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/c48cb85c068f48a083d80fc67e0995e9...b9f36d3481b94ae28f1e85e1c960a4ea/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/90ae49d1c11441618a2107fe3f9f16f1...9ff3b051ad0c43e39f60dbe5713897db/)
   [Finished :arrow_down:0.28% :arrow_up:0.08%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/17b705000e04465a90f8f99f5c4eeafc...01a4529c238f4b77ba0f1279a08cb291/)
   Buildkite builds:
   [Finished] [`d2cbe9e0` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/789)
   [Failed] [`d2cbe9e0` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/786)
   [Failed] [`d2cbe9e0` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/776)
   [Finished] [`d2cbe9e0` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/792)
   [Finished] [`ce4dcbdf` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/788)
   [Failed] [`ce4dcbdf` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/785)
   [Failed] [`ce4dcbdf` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/775)
   [Finished] [`ce4dcbdf` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/791)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] github-actions[bot] commented on pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13183:
URL: https://github.com/apache/arrow/pull/13183#issuecomment-1129474353

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.




[GitHub] [arrow] github-actions[bot] commented on pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13183:
URL: https://github.com/apache/arrow/pull/13183#issuecomment-1129474342

   https://issues.apache.org/jira/browse/ARROW-16144




[GitHub] [arrow] nealrichardson commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r875991641


##########
r/R/io.R:
##########
@@ -322,6 +333,10 @@ detect_compression <- function(path) {
     return("uncompressed")
   }
 
+  ## Remove any trailing slashes

Review Comment:
   ```suggestion
     # Remove any trailing slashes, which FileSystem$from_uri may add
   ```



##########
r/R/io.R:
##########
@@ -309,11 +310,21 @@ make_output_stream <- function(x, filesystem = NULL) {
     filesystem <- fs_and_path$fs
     x <- fs_and_path$path
   }
+
+  if (is.null(compression)) {
+    # Infer compression from sink
+    compression <- detect_compression(x)
+  }
+
   assert_that(is.string(x))
-  if (is.null(filesystem)) {
-    FileOutputStream$create(x)
+  if (is.null(filesystem) && is_compressed(compression)) {
+    CompressedOutputStream$create(x) ##compressed local
+  } else if (is.null(filesystem) && !is_compressed(compression)) {
+    FileOutputStream$create(x) ## uncompressed local
+  } else if (!is.null(filesystem) && is_compressed(compression)) {
+    CompressedOutputStream$create(filesystem$OpenOutputStream(x)) ## compressed s3
   } else {
-    filesystem$OpenOutputStream(x)
+    filesystem$OpenOutputStream(x) ## uncompressed s3

Review Comment:
   Not only S3, any remote file system (GCS, Azure are coming soon, for example)



##########
r/tests/testthat/test-csv.R:
##########
@@ -564,6 +564,29 @@ test_that("write_csv_arrow can write from RecordBatchReader objects", {
   expect_equal(nrow(tbl_in), 3)
 })
 
+test_that("write_csv_arrow() compresses by file extension", {
+  skip_if_not_available("gzip")
+  tfgz <- tempfile(fileext = ".csv.gz")
+  tf <- tempfile(fileext = ".csv")
+  on.exit(unlink(tf))
+  on.exit(unlink(tfgz))
+
+  write_csv_arrow(tbl, tf)
+  write_csv_arrow(tbl, tfgz)
+  expect_lt(file.size(tfgz), file.size(tf))
+})
+
+test_that("read/write compressed file", {

Review Comment:
   Why is this a separate test? You could read in the file(s) in the previous test



##########
r/tests/testthat/test-s3-minio.R:
##########
@@ -54,6 +54,24 @@ if (arrow_with_s3() && process_is_running("minio server")) {
     )
   })
 
+  test_that("read/write compressed csv by filesystem", {
+    dat <- tibble(x = seq(1, 10, by = 0.2))

Review Comment:
   Does this need to be skipped if gzip is not present or have we already skipped by this point?





[GitHub] [arrow] boshek commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)

Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r876183537


##########
r/tests/testthat/test-csv.R:
##########
@@ -564,6 +564,29 @@ test_that("write_csv_arrow can write from RecordBatchReader objects", {
   expect_equal(nrow(tbl_in), 3)
 })
 
+test_that("write_csv_arrow() compresses by file extension", {
+  skip_if_not_available("gzip")
+  tfgz <- tempfile(fileext = ".csv.gz")
+  tf <- tempfile(fileext = ".csv")
+  on.exit(unlink(tf))
+  on.exit(unlink(tfgz))
+
+  write_csv_arrow(tbl, tf)
+  write_csv_arrow(tbl, tfgz)
+  expect_lt(file.size(tfgz), file.size(tf))
+})
+
+test_that("read/write compressed file", {

Review Comment:
   Oh good point, no reason.


