Posted to commits@arrow.apache.org by th...@apache.org on 2021/11/01 05:43:54 UTC

[arrow-cookbook] branch main updated: ARROW-13713: [Doc][Cookbook] Reading and Writing Compressed Data - R (#91)

This is an automated email from the ASF dual-hosted git repository.

thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git


The following commit(s) were added to refs/heads/main by this push:
     new 8ec34a8  ARROW-13713: [Doc][Cookbook] Reading and Writing Compressed Data - R (#91)
8ec34a8 is described below

commit 8ec34a8fb7f2a216d9378195832aa82b0bd30608
Author: Nic Crane <th...@gmail.com>
AuthorDate: Mon Nov 1 05:43:49 2021 +0000

    ARROW-13713: [Doc][Cookbook] Reading and Writing Compressed Data - R (#91)
    
    * Add initial recipes
    
    * Add to bookdown
    
    * Move "compressed data" content to the "read and write data" chapter
    
    * Add "Solution" headings
    
    * Write parquet not feather!
    
    * Add .gz ending note
    
    * Add note about defaults
    
    * Add note in see also section
    
    * Add comment about default compression
    
    * Add to comment
---
 r/content/reading_and_writing_data.Rmd             | 124 ++++++++++++++++++++-
 .../work_with_compressed_or_partitioned_data.Rmd   |   5 -
 2 files changed, 123 insertions(+), 6 deletions(-)

diff --git a/r/content/reading_and_writing_data.Rmd b/r/content/reading_and_writing_data.Rmd
index f18f5b0..542401a 100644
--- a/r/content/reading_and_writing_data.Rmd
+++ b/r/content/reading_and_writing_data.Rmd
@@ -292,7 +292,6 @@ test_that("read_json_arrow chunk works as expected", {
 unlink(tf)
 ```
 
-
 ## Write partitioned data
 
 You want to save data to disk in partitions based on columns in the data.
@@ -359,3 +358,126 @@ unlink("my_table.parquet")
 unlink("dist_time.parquet")
 unlink("airquality_partitioned", recursive = TRUE)
 ```
+
+## Write compressed data
+
+You want to save a file, compressed with a specified compression algorithm.
+
+### Solution
+
+```{r, parquet_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write data compressed with the gzip algorithm instead of the default
+write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
+```
+
+```{r, test_parquet_gzip, opts.label = "test"}
+test_that("parquet_gzip", {
+  file.exists(file.path(td, "iris.parquet"))
+})
+```
+
+### Discussion
+
+Note that `write_parquet()` already uses compression by default.  See 
+`default_parquet_compression()` to check which default is configured on your 
+machine.
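+
+If you want to check which codecs are available in your build of arrow before 
+choosing one, or to opt out of compression entirely, the sketch below may 
+help.  The codec names and the `"uncompressed"` override shown here are only 
+illustrative; availability depends on how arrow was built.
+
+```{r, check_codecs}
+# Check whether particular compression codecs are available in this build of arrow
+codec_is_available("gzip")
+codec_is_available("zstd")
+
+# Explicitly write an uncompressed file to override the default compression
+write_parquet(iris, file.path(td, "iris_uncompressed.parquet"), compression = "uncompressed")
+```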
+
+You can also supply the `compression` argument to `write_dataset()`, as long as 
+the compression algorithm is compatible with the chosen format.
+
+```{r, dataset_gzip}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write dataset to file
+write_dataset(iris, path = td, compression = "gzip")
+```
+
+```{r}
+# View files in the directory
+list.files(td, recursive = TRUE)
+```
+
+```{r, test_dataset_gzip, opts.label = "test"}
+test_that("dataset_gzip", {
+  file.exists(file.path(td, "part-0.parquet"))
+})
+```
+
+### See also
+
+Some formats write compressed data by default.  For more information on the 
+supported compression algorithms and default settings, see the help pages 
+below; a short sketch of choosing the codec explicitly for Feather files 
+follows the list:
+
+* `?write_parquet()`
+* `?write_feather()`
+* `?write_dataset()`
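+
+As a brief illustration of those settings, here is a sketch of writing a 
+Feather file with an explicitly chosen codec rather than its default; `"zstd"` 
+is used only as an example and, like any codec, must be available in your 
+build of arrow.
+
+```{r, feather_zstd}
+# Write a Feather file compressed with zstd instead of the default codec
+write_feather(iris, file.path(td, "iris.feather"), compression = "zstd")
+```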
+
+## Read compressed data
+
+You want to read in data which has been compressed.
+
+### Solution
+
+```{r, read_parquet_compressed}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write a compressed data file which is to be read back in
+write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
+
+# Read in data
+ds <- read_parquet(file.path(td, "iris.parquet")) %>%
+  collect()
+
+ds
+```
+
+```{r, test_read_parquet_compressed, opts.label = "test"}
+test_that("read_parquet_compressed", {
+  expect_s3_class(ds, "data.frame")
+  expect_named(
+    ds,
+    c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
+  )
+})
+```
+
+### Discussion
+
+Note that Arrow automatically detects the compression and you do not have to 
+supply it in the call to `open_dataset()` or the `read_*()` functions.
+
+Although the CSV format does not itself support compression, Arrow can read 
+CSV data that has been compressed if the file extension is `.gz`.
+
+```{r, read_compressed_csv}
+# Create a temporary directory
+td <- tempfile()
+dir.create(td)
+
+# Write a compressed CSV file which is to be read back in
+write.csv(iris, gzfile(file.path(td, "iris.csv.gz")), row.names = FALSE, quote = FALSE)
+
+# Read in data
+ds <- open_dataset(td, format = "csv") %>%
+  collect()
+ds
+```
+
+```{r, test_read_compressed_csv, opts.label = "test"}
+test_that("read_compressed_csv", {
+  expect_s3_class(ds, "data.frame")
+  expect_named(
+    ds,
+    c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
+  )
+})
+```
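+
+The single-file readers detect compression in the same way.  As a minimal 
+sketch (reusing the `iris.csv.gz` file written above), `read_csv_arrow()` can 
+read the gzipped file directly without any compression being specified.
+
+```{r, read_csv_gz_directly}
+# Read the gzipped CSV file directly; the .gz extension is detected automatically
+iris_from_gz <- read_csv_arrow(file.path(td, "iris.csv.gz"))
+head(iris_from_gz)
+```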
+
+
diff --git a/r/content/unpublished/work_with_compressed_or_partitioned_data.Rmd b/r/content/unpublished/work_with_compressed_or_partitioned_data.Rmd
deleted file mode 100644
index b94c9bb..0000000
--- a/r/content/unpublished/work_with_compressed_or_partitioned_data.Rmd
+++ /dev/null
@@ -1,5 +0,0 @@
-# Work with Compressed or Partitioned Data
-
-## Read and write compressed data
-
-## Read and write partitioned data