Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/18 01:27:11 UTC
[GitHub] [arrow] boshek opened a new pull request, #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
boshek opened a new pull request, #13183:
URL: https://github.com/apache/arrow/pull/13183
This PR enables reading/writing compressed data streams over S3 and locally, and adds tests exercising some of those round trips. For the filesystem path I had to do a little regex on the string for compression detection, but any feedback on alternative approaches is very welcome. Previously, supplying a file with a compression extension wrote out an uncompressed file:
```r
library(arrow, warn.conflicts = FALSE)
## local
file <- tempfile(fileext = ".csv")
comp_file <- tempfile(fileext = ".csv.gz")
write_csv_arrow(mtcars, file = file)
write_csv_arrow(mtcars, file = comp_file)
file.size(file)
#> [1] 1303
file.size(comp_file)
#> [1] 567
## or with s3
dir <- tempfile()
dir.create(dir)
subdir <- file.path(dir, "bucket")
dir.create(subdir)
minio_server <- processx::process$new("minio", args = c("server", dir), supervise = TRUE)
Sys.sleep(2)
stopifnot(minio_server$is_alive())
s3_uri <- "s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000"
bucket <- s3_bucket(s3_uri)
write_csv_arrow(mtcars, bucket$path("bucket/data.csv.gz"))
write_csv_arrow(mtcars, bucket$path("bucket/data.csv"))
file.size(file.path(subdir, "data.csv.gz"))
#> [1] 567
file.size(file.path(subdir, "data.csv"))
#> [1] 1303
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nealrichardson commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r875999848
##########
r/R/io.R:
##########
@@ -292,7 +292,7 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem
file
}
-make_output_stream <- function(x, filesystem = NULL) {
+make_output_stream <- function(x, filesystem = NULL, compression = NULL) {
Review Comment:
One thing to watch out for here: sometimes people name their parquet files `something.parquet.snappy`, but you wouldn't use a CompressedOutputStream for that; you'd pass the compression option to the parquet writer itself. I would guess that the make_readable_file() path handles this already, so maybe that can be a model (or maybe it doesn't and needs to).
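For illustration, a hedged sketch of the extension handling this comment suggests — the helper name `detect_stream_compression` and the codec list here are hypothetical, not the actual io.R code:
```r
# Hypothetical helper, not arrow's actual detect_compression():
# codecs applied by the parquet writer itself (e.g. snappy) are
# deliberately absent from the switch, so "something.parquet.snappy"
# falls through to "uncompressed" and no CompressedOutputStream is
# created for it.
detect_stream_compression <- function(path) {
  switch(tolower(tools::file_ext(path)),
    gz = "gzip",
    bz2 = "bz2",
    zst = "zstd",
    lz4 = "lz4",
    "uncompressed"  # default, including snappy-suffixed parquet files
  )
}
```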
##########
r/R/io.R:
##########
@@ -301,6 +301,7 @@ make_output_stream <- function(x, filesystem = NULL) {
return(MakeRConnectionOutputStream(x))
}
+
Review Comment:
```suggestion
```
[GitHub] [arrow] github-actions[bot] commented on pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13183:
URL: https://github.com/apache/arrow/pull/13183#issuecomment-1129473487
Thanks for opening a pull request!
If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes), could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
See also:
* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] nealrichardson commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r876393144
##########
r/R/io.R:
##########
@@ -292,7 +292,7 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem
file
}
-make_output_stream <- function(x, filesystem = NULL) {
+make_output_stream <- function(x, filesystem = NULL, compression = NULL) {
Review Comment:
Sure, this fails on master too so ok to make a separate JIRA (please link to it here when you make it).
[GitHub] [arrow] nealrichardson commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r876000318
##########
r/R/io.R:
##########
@@ -322,6 +333,10 @@ detect_compression <- function(path) {
return("uncompressed")
}
+ ## Remove any trailing slashes
+ path <- gsub("/$", "", path)
+
Review Comment:
```suggestion
```
[GitHub] [arrow] boshek commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r876183135
##########
r/R/io.R:
##########
@@ -292,7 +292,7 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem
file
}
-make_output_stream <- function(x, filesystem = NULL) {
+make_output_stream <- function(x, filesystem = NULL, compression = NULL) {
Review Comment:
So for the `parquet.snappy` or even `snappy.parquet` I think it works because "snappy" isn't included here:
https://github.com/apache/arrow/blob/3df2e0568240d6b629c0a3163df21a1a2a160810/r/R/io.R#L325-L330
But if someone tries something like this, we do get an error that isn't very informative. I _think_ this is outside the scope of this PR, so could the resolution here be to open another ticket for this specifically?
``` r
library(arrow, warn.conflicts = FALSE)
tf <- tempfile(fileext = ".parquet.gz")
write_parquet(data.frame(x = 1:5), tf, compression = "gzip", compression_level = 5)
read_parquet(tf)
#> Error: file must be a "RandomAccessFile"
```
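A possible workaround for the reprex above, hedged since it sidesteps rather than fixes the extension sniffing: hand read_parquet() an explicitly opened file, so the `.gz` suffix is never consulted (this assumes `ReadableFile$create()` returns a RandomAccessFile, as in released arrow):
```r
library(arrow, warn.conflicts = FALSE)

tf <- tempfile(fileext = ".parquet.gz")
write_parquet(data.frame(x = 1:5), tf, compression = "gzip")

# Open the file as a RandomAccessFile ourselves, bypassing the
# extension-based compression detection that triggers the error above
read_parquet(ReadableFile$create(tf))
```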
##########
r/tests/testthat/test-s3-minio.R:
##########
@@ -54,6 +54,24 @@ if (arrow_with_s3() && process_is_running("minio server")) {
)
})
+ test_that("read/write compressed csv by filesystem", {
+ dat <- tibble(x = seq(1, 10, by = 0.2))
Review Comment:
I will add it in to be defensive.
[GitHub] [arrow] nealrichardson closed pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
URL: https://github.com/apache/arrow/pull/13183
[GitHub] [arrow] ursabot commented on pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13183:
URL: https://github.com/apache/arrow/pull/13183#issuecomment-1131232486
Benchmark runs are scheduled for baseline = ce4dcbdf5f1bcfb3f23b494598b9125c8e7ee52e and contender = d2cbe9e0e2ce206fba71d3d171babe36bada1a9d. d2cbe9e0e2ce206fba71d3d171babe36bada1a9d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/fd040c4d0b6a48758dfd898cbfba4b86...36a656f35d7a4d48b792a81906043fc1/)
[Failed :arrow_down:1.29% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/c48cb85c068f48a083d80fc67e0995e9...b9f36d3481b94ae28f1e85e1c960a4ea/)
[Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/90ae49d1c11441618a2107fe3f9f16f1...9ff3b051ad0c43e39f60dbe5713897db/)
[Finished :arrow_down:0.28% :arrow_up:0.08%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/17b705000e04465a90f8f99f5c4eeafc...01a4529c238f4b77ba0f1279a08cb291/)
Buildkite builds:
[Finished] [`d2cbe9e0` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/789)
[Failed] [`d2cbe9e0` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/786)
[Failed] [`d2cbe9e0` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/776)
[Finished] [`d2cbe9e0` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/792)
[Finished] [`ce4dcbdf` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/788)
[Failed] [`ce4dcbdf` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/785)
[Failed] [`ce4dcbdf` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/775)
[Finished] [`ce4dcbdf` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/791)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
[GitHub] [arrow] github-actions[bot] commented on pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13183:
URL: https://github.com/apache/arrow/pull/13183#issuecomment-1129474353
:warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.
[GitHub] [arrow] github-actions[bot] commented on pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13183:
URL: https://github.com/apache/arrow/pull/13183#issuecomment-1129474342
https://issues.apache.org/jira/browse/ARROW-16144
[GitHub] [arrow] nealrichardson commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r875991641
##########
r/R/io.R:
##########
@@ -322,6 +333,10 @@ detect_compression <- function(path) {
return("uncompressed")
}
+ ## Remove any trailing slashes
Review Comment:
```suggestion
# Remove any trailing slashes, which FileSystem$from_uri may add
```
##########
r/R/io.R:
##########
@@ -309,11 +310,21 @@ make_output_stream <- function(x, filesystem = NULL) {
filesystem <- fs_and_path$fs
x <- fs_and_path$path
}
+
+ if (is.null(compression)) {
+ # Infer compression from sink
+ compression <- detect_compression(x)
+ }
+
assert_that(is.string(x))
- if (is.null(filesystem)) {
- FileOutputStream$create(x)
+ if (is.null(filesystem) && is_compressed(compression)) {
+ CompressedOutputStream$create(x) ##compressed local
+ } else if (is.null(filesystem) && !is_compressed(compression)) {
+ FileOutputStream$create(x) ## uncompressed local
+ } else if (!is.null(filesystem) && is_compressed(compression)) {
+ CompressedOutputStream$create(filesystem$OpenOutputStream(x)) ## compressed s3
} else {
- filesystem$OpenOutputStream(x)
+ filesystem$OpenOutputStream(x) ## uncompressed s3
Review Comment:
Not only S3, any remote file system (GCS, Azure are coming soon, for example)
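The four-way branch in the diff above could be collapsed into two orthogonal decisions: open the sink, then optionally wrap it. A sketch only — it assumes `CompressedOutputStream$create()` accepts an already-open stream plus a codec argument, which the diff implies but does not show:
```r
# Illustrative restructuring, not the merged code.
make_output_stream <- function(x, filesystem = NULL, compression = NULL) {
  if (is.null(compression)) {
    compression <- detect_compression(x)  # infer from the sink's extension
  }
  # Decision 1: local file vs. any remote filesystem (S3, GCS, Azure, ...)
  sink <- if (is.null(filesystem)) {
    FileOutputStream$create(x)
  } else {
    filesystem$OpenOutputStream(x)
  }
  # Decision 2: wrap in a compressed stream only when needed
  if (is_compressed(compression)) {
    CompressedOutputStream$create(sink, codec = compression)
  } else {
    sink
  }
}
```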
##########
r/tests/testthat/test-csv.R:
##########
@@ -564,6 +564,29 @@ test_that("write_csv_arrow can write from RecordBatchReader objects", {
expect_equal(nrow(tbl_in), 3)
})
+test_that("write_csv_arrow() compresses by file extension", {
+ skip_if_not_available("gzip")
+ tfgz <- tempfile(fileext = ".csv.gz")
+ tf <- tempfile(fileext = ".csv")
+ on.exit(unlink(tf))
+ on.exit(unlink(tfgz))
+
+ write_csv_arrow(tbl, tf)
+ write_csv_arrow(tbl, tfgz)
+ expect_lt(file.size(tfgz), file.size(tf))
+})
+
+test_that("read/write compressed file", {
Review Comment:
Why is this a separate test? You could read in the file(s) in the previous test.
##########
r/tests/testthat/test-s3-minio.R:
##########
@@ -54,6 +54,24 @@ if (arrow_with_s3() && process_is_running("minio server")) {
)
})
+ test_that("read/write compressed csv by filesystem", {
+ dat <- tibble(x = seq(1, 10, by = 0.2))
Review Comment:
Does this need to be skipped if gzip is not present or have we already skipped by this point?
[GitHub] [arrow] boshek commented on a diff in pull request #13183: ARROW-16144: [R] Write compressed data streams (particularly over S3)
Posted by GitBox <gi...@apache.org>.
boshek commented on code in PR #13183:
URL: https://github.com/apache/arrow/pull/13183#discussion_r876183537
##########
r/tests/testthat/test-csv.R:
##########
@@ -564,6 +564,29 @@ test_that("write_csv_arrow can write from RecordBatchReader objects", {
expect_equal(nrow(tbl_in), 3)
})
+test_that("write_csv_arrow() compresses by file extension", {
+ skip_if_not_available("gzip")
+ tfgz <- tempfile(fileext = ".csv.gz")
+ tf <- tempfile(fileext = ".csv")
+ on.exit(unlink(tf))
+ on.exit(unlink(tfgz))
+
+ write_csv_arrow(tbl, tf)
+ write_csv_arrow(tbl, tfgz)
+ expect_lt(file.size(tfgz), file.size(tf))
+})
+
+test_that("read/write compressed file", {
Review Comment:
Oh, good point, no reason.