Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/04/12 13:06:00 UTC

[jira] [Comment Edited] (ARROW-16144) [R] Write compressed data streams (particularly over S3)

    [ https://issues.apache.org/jira/browse/ARROW-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521132#comment-17521132 ] 

Dewey Dunnington edited comment on ARROW-16144 at 4/12/22 1:05 PM:
-------------------------------------------------------------------

Thank you for catching my error here! I knew that we did some compression detection, but it turns out that it only happens on read: https://github.com/apache/arrow/blob/master/r/R/io.R#L240-L298

You can use {{OpenOutputStream}} and {{CompressedOutputStream}} with any filesystem (including S3), although we would need to implement filename-based compression detection on write for this to "just work" with the .gz suffix:

{code:R}
library(arrow, warn.conflicts = FALSE)

dir <- tempfile()
dir.create(dir)
subdir <- file.path(dir, "bucket")
dir.create(subdir)


minio_server <- processx::process$new("minio", args = c("server", dir), supervise = TRUE)
Sys.sleep(1)
stopifnot(minio_server$is_alive())

s3_uri <- "s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000"
bucket <- s3_bucket(s3_uri)

data <- data.frame(x = 1:1e4)

out_compressed <- CompressedOutputStream$create(bucket$OpenOutputStream("bucket/data.csv.gz"))
write_csv_arrow(data, out_compressed)
out_compressed$close()


out <- bucket$OpenOutputStream("bucket/data.csv")
write_csv_arrow(data, out)
out$close()

file.size(file.path(subdir, "data.csv.gz"))
#> [1] 22627
file.size(file.path(subdir, "data.csv"))
#> [1] 48898

minio_server$interrupt()
#> [1] TRUE
Sys.sleep(1)
stopifnot(!minio_server$is_alive())
{code}
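
For illustration only, here is a rough sketch of what filename-based detection on write might look like. The helpers {{codec_from_path()}} and {{wrap_output_stream()}} are hypothetical names and not part of arrow; the extension-to-codec mapping is an assumption, and which codecs are available depends on how arrow was built. The arrow part only runs if the package is installed:

{code:R}
# Hypothetical helper (not part of arrow): guess a codec name from the
# file extension, falling back to "uncompressed".
codec_from_path <- function(path) {
  switch(
    tolower(tools::file_ext(path)),
    gz = "gzip",
    bz2 = "bz2",
    zst = "zstd",
    "uncompressed"
  )
}

# Hypothetical helper (not part of arrow): wrap an already-open output
# stream in a CompressedOutputStream when the file name implies compression.
wrap_output_stream <- function(stream, path) {
  codec <- codec_from_path(path)
  if (identical(codec, "uncompressed")) {
    return(stream)
  }
  arrow::CompressedOutputStream$create(stream, codec = codec)
}

if (requireNamespace("arrow", quietly = TRUE)) {
  # Local file used here so the sketch runs without an S3 endpoint
  path <- tempfile(fileext = ".csv.gz")
  out <- wrap_output_stream(arrow::FileOutputStream$create(path), path)
  arrow::write_csv_arrow(data.frame(x = 1:1e4), out)
  out$close()

  # Reading back relies on the existing detection on read
  back <- arrow::read_csv_arrow(path)
  stopifnot(nrow(back) == 1e4)
}
{code}

With the S3 bucket from the example above, the same wrapper would presumably be called as {{wrap_output_stream(bucket$OpenOutputStream("bucket/data.csv.gz"), "bucket/data.csv.gz")}}.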



> [R] Write compressed data streams (particularly over S3)
> --------------------------------------------------------
>
>                 Key: ARROW-16144
>                 URL: https://issues.apache.org/jira/browse/ARROW-16144
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>
> The Python bindings have `CompressedOutputStream`, but I don't see how we can do this on the R side (e.g. with `write_csv_arrow()`). It would be wonderful if we could both read and write compressed streams, particularly for CSV and particularly for remote filesystems, where this can provide considerable performance improvements.
> (For comparison, readr will write a compressed stream automatically based on the extension of the given filename, e.g. `readr::write_csv(data, "file.csv.gz")` or `write_csv(data, "file.xz")`.)


