Posted to commits@arrow.apache.org by np...@apache.org on 2022/07/18 20:15:56 UTC

[arrow] branch master updated: ARROW-8324: [R] Add read/write_ipc_file separate from _feather (#13626)

This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new d81d8451a0 ARROW-8324: [R] Add read/write_ipc_file separate from _feather (#13626)
d81d8451a0 is described below

commit d81d8451a0ff1c5108bc04e727ae053365950551
Author: eitsupi <50...@users.noreply.github.com>
AuthorDate: Tue Jul 19 05:15:50 2022 +0900

    ARROW-8324: [R] Add read/write_ipc_file separate from _feather (#13626)
    
    Add `read_ipc_file()` and `write_ipc_file()` to read and write Arrow IPC files (Feather V2).
    These are much the same as `read_feather()`/`write_feather()` for now, but in the future *_feather functions may move to a different implementation to accommodate Feather V1 format.
    
    Authored-by: SHIMA Tatsuya <ts...@gmail.com>
    Signed-off-by: Neal Richardson <ne...@gmail.com>
---
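
The commit message above describes `read_ipc_file()` and `write_ipc_file()` as near-equivalents of the `*_feather()` functions that target only Arrow IPC files (Feather V2). The following is a minimal sketch of that behaviour, not part of the commit itself. It assumes the `arrow` R package at or after this commit is installed; the variable names (`tf`, `df_ipc`, `df_feather`, `res`) are illustrative only.

```r
library(arrow)

# Temporary path; the ".arrow" extension follows the recommendation
# added to the documentation in this commit.
tf <- tempfile(fileext = ".arrow")

# write_ipc_file() forwards to write_feather() with version = 2,
# so it always produces an Arrow IPC file (Feather V2).
write_ipc_file(mtcars, tf)

# read_ipc_file() is defined as an alias of read_feather(),
# so both calls read the same file in the same way.
df_ipc <- read_ipc_file(tf)
df_feather <- read_feather(tf)
stopifnot(identical(df_ipc, df_feather))

# write_ipc_file() has no `version` argument, so asking for V1 is an error
# (the tests in this commit expect "unused argument (version = 1)").
res <- tryCatch(
  write_ipc_file(mtcars, tf, version = 1),
  error = function(e) e
)
stopifnot(inherits(res, "error"))

# Clean up the temporary file.
unlink(tf)
```

To write a legacy Feather V1 file, `write_feather(x, sink, version = 1)` remains the only option after this change.
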
 r/NAMESPACE                     |  2 ++
 r/NEWS.md                       |  7 +++++-
 r/R/feather.R                   | 56 ++++++++++++++++++++++++++++++++---------
 r/man/read_feather.Rd           | 13 +++++++---
 r/man/write_feather.Rd          | 38 ++++++++++++++++++++++------
 r/tests/testthat/test-feather.R | 33 ++++++++++++++++++++++++
 r/vignettes/arrow.Rmd           |  3 ++-
 7 files changed, 126 insertions(+), 26 deletions(-)

diff --git a/r/NAMESPACE b/r/NAMESPACE
index c7d2657bae..750a815f9f 100644
--- a/r/NAMESPACE
+++ b/r/NAMESPACE
@@ -335,6 +335,7 @@ export(open_dataset)
 export(read_csv_arrow)
 export(read_delim_arrow)
 export(read_feather)
+export(read_ipc_file)
 export(read_ipc_stream)
 export(read_json_arrow)
 export(read_message)
@@ -370,6 +371,7 @@ export(vctrs_extension_type)
 export(write_csv_arrow)
 export(write_dataset)
 export(write_feather)
+export(write_ipc_file)
 export(write_ipc_stream)
 export(write_parquet)
 export(write_to_raw)
diff --git a/r/NEWS.md b/r/NEWS.md
index fca55b047e..59245b971d 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -24,7 +24,12 @@
 * `lubridate::parse_date_time()` datetime parser:
   * `orders` with year, month, day, hours, minutes, and seconds components are supported.
   * the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`).
-* `read_arrow()` and `write_arrow()`, deprecated since 1.0.0 (July 2020), have been removed. Use the `read/write_feather()` and `read/write_ipc_stream()` functions depending on whether you're working with the Arrow IPC file or stream format, respectively.
+* New functions `read_ipc_file()` and `write_ipc_file()` are added.
+  These functions are almost the same as `read_feather()` and `write_feather()`,
+  but differ in that they only target IPC files (Feather V2 files), not Feather V1 files.
+* `read_arrow()` and `write_arrow()`, deprecated since 1.0.0 (July 2020), have been removed.
+  Instead, use `read_ipc_file()` and `write_ipc_file()` for IPC files, or
+  `read_ipc_stream()` and `write_ipc_stream()` for IPC streams.
 * `write_parquet()` now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments `properties` and `arrow_properties` have been removed; if you need to deal with these lower-level properties objects directly, use `ParquetFileWriter`, which `write_parquet()` wraps.
 
 # arrow 8.0.0
diff --git a/r/R/feather.R b/r/R/feather.R
index 02871396fa..46863c98a1 100644
--- a/r/R/feather.R
+++ b/r/R/feather.R
@@ -15,19 +15,23 @@
 # specific language governing permissions and limitations
 # under the License.
 
-#' Write data in the Feather format
+#' Write a Feather file (an Arrow IPC file)
 #'
 #' Feather provides binary columnar serialization for data frames.
 #' It is designed to make reading and writing data frames efficient,
 #' and to make sharing data across data analysis languages easy.
-#' This function writes both the original, limited specification of the format
-#' and the version 2 specification, which is the Apache Arrow IPC file format.
+#' [write_feather()] can write both the Feather Version 1 (V1),
+#' a legacy version available starting in 2016, and the Version 2 (V2),
+#' which is the Apache Arrow IPC file format.
+#' The default version is V2.
+#' V1 files are distinct from Arrow IPC files and lack many features,
+#' such as the ability to store all Arrow data types, and compression support.
+#' [write_ipc_file()] can only write V2 files.
 #'
 #' @param x `data.frame`, [RecordBatch], or [Table]
 #' @param sink A string file path, URI, or [OutputStream], or path in a file
 #' system (`SubTreeFileSystem`)
-#' @param version integer Feather file version. Version 2 is the current.
-#' Version 1 is the more limited legacy format.
+#' @param version integer Feather file version, Version 1 or Version 2. Version 2 is the default.
 #' @param chunk_size For V2 files, the number of rows that each chunk of data
 #' should have in the file. Use a smaller `chunk_size` when you need faster
 #' random row access. Default is 64K. This option is not supported for V1.
@@ -46,9 +50,18 @@
 #' @seealso [RecordBatchWriter] for lower-level access to writing Arrow IPC data.
 #' @seealso [Schema] for information about schemas and metadata handling.
 #' @examples
-#' tf <- tempfile()
-#' on.exit(unlink(tf))
-#' write_feather(mtcars, tf)
+#' # We recommend the ".arrow" extension for Arrow IPC files (Feather V2).
+#' tf1 <- tempfile(fileext = ".feather")
+#' tf2 <- tempfile(fileext = ".arrow")
+#' tf3 <- tempfile(fileext = ".arrow")
+#' on.exit({
+#'   unlink(tf1)
+#'   unlink(tf2)
+#'   unlink(tf3)
+#' })
+#' write_feather(mtcars, tf1, version = 1)
+#' write_feather(mtcars, tf2)
+#' write_ipc_file(mtcars, tf3)
 #' @include arrow-object.R
 write_feather <- function(x,
                           sink,
@@ -110,13 +123,27 @@ write_feather <- function(x,
   invisible(x_out)
 }
 
-#' Read a Feather file
+#' @rdname write_feather
+#' @export
+write_ipc_file <- function(x,
+                           sink,
+                           chunk_size = 65536L,
+                           compression = c("default", "lz4", "uncompressed", "zstd"),
+                           compression_level = NULL) {
+  mc <- match.call()
+  mc$version <- 2
+  mc[[1]] <- get("write_feather", envir = asNamespace("arrow"))
+  eval.parent(mc)
+}
+
+#' Read a Feather file (an Arrow IPC file)
 #'
 #' Feather provides binary columnar serialization for data frames.
 #' It is designed to make reading and writing data frames efficient,
 #' and to make sharing data across data analysis languages easy.
-#' This function reads both the original, limited specification of the format
-#' and the version 2 specification, which is the Apache Arrow IPC file format.
+#' [read_feather()] can read both the Feather Version 1 (V1), a legacy version available starting in 2016,
+#' and the Version 2 (V2), which is the Apache Arrow IPC file format.
+#' [read_ipc_file()] is an alias of [read_feather()].
 #'
 #' @inheritParams read_ipc_stream
 #' @inheritParams read_delim_arrow
@@ -128,7 +155,8 @@ write_feather <- function(x,
 #' @export
 #' @seealso [FeatherReader] and [RecordBatchReader] for lower-level access to reading Arrow IPC data.
 #' @examples
-#' tf <- tempfile()
+#' # We recommend the ".arrow" extension for Arrow IPC files (Feather V2).
+#' tf <- tempfile(fileext = ".arrow")
 #' on.exit(unlink(tf))
 #' write_feather(mtcars, tf)
 #' df <- read_feather(tf)
@@ -158,6 +186,10 @@ read_feather <- function(file, col_select = NULL, as_data_frame = TRUE, ...) {
   out
 }
 
+#' @rdname read_feather
+#' @export
+read_ipc_file <- read_feather
+
 #' @title FeatherReader class
 #' @rdname FeatherReader
 #' @name FeatherReader
diff --git a/r/man/read_feather.Rd b/r/man/read_feather.Rd
index 903c726825..07d20b8e01 100644
--- a/r/man/read_feather.Rd
+++ b/r/man/read_feather.Rd
@@ -2,9 +2,12 @@
 % Please edit documentation in R/feather.R
 \name{read_feather}
 \alias{read_feather}
-\title{Read a Feather file}
+\alias{read_ipc_file}
+\title{Read a Feather file (an Arrow IPC file)}
 \usage{
 read_feather(file, col_select = NULL, as_data_frame = TRUE, ...)
+
+read_ipc_file(file, col_select = NULL, as_data_frame = TRUE, ...)
 }
 \arguments{
 \item{file}{A character file name or URI, \code{raw} vector, an Arrow input stream,
@@ -31,11 +34,13 @@ Arrow \link{Table} otherwise
 Feather provides binary columnar serialization for data frames.
 It is designed to make reading and writing data frames efficient,
 and to make sharing data across data analysis languages easy.
-This function reads both the original, limited specification of the format
-and the version 2 specification, which is the Apache Arrow IPC file format.
+\code{\link[=read_feather]{read_feather()}} can read both the Feather Version 1 (V1), a legacy version available starting in 2016,
+and the Version 2 (V2), which is the Apache Arrow IPC file format.
+\code{\link[=read_ipc_file]{read_ipc_file()}} is an alias of \code{\link[=read_feather]{read_feather()}}.
 }
 \examples{
-tf <- tempfile()
+# We recommend the ".arrow" extension for Arrow IPC files (Feather V2).
+tf <- tempfile(fileext = ".arrow")
 on.exit(unlink(tf))
 write_feather(mtcars, tf)
 df <- read_feather(tf)
diff --git a/r/man/write_feather.Rd b/r/man/write_feather.Rd
index 746ac29910..85c83ff04b 100644
--- a/r/man/write_feather.Rd
+++ b/r/man/write_feather.Rd
@@ -2,7 +2,8 @@
 % Please edit documentation in R/feather.R
 \name{write_feather}
 \alias{write_feather}
-\title{Write data in the Feather format}
+\alias{write_ipc_file}
+\title{Write a Feather file (an Arrow IPC file)}
 \usage{
 write_feather(
   x,
@@ -12,6 +13,14 @@ write_feather(
   compression = c("default", "lz4", "uncompressed", "zstd"),
   compression_level = NULL
 )
+
+write_ipc_file(
+  x,
+  sink,
+  chunk_size = 65536L,
+  compression = c("default", "lz4", "uncompressed", "zstd"),
+  compression_level = NULL
+)
 }
 \arguments{
 \item{x}{\code{data.frame}, \link{RecordBatch}, or \link{Table}}
@@ -19,8 +28,7 @@ write_feather(
 \item{sink}{A string file path, URI, or \link{OutputStream}, or path in a file
 system (\code{SubTreeFileSystem})}
 
-\item{version}{integer Feather file version. Version 2 is the current.
-Version 1 is the more limited legacy format.}
+\item{version}{integer Feather file version, Version 1 or Version 2. Version 2 is the default.}
 
 \item{chunk_size}{For V2 files, the number of rows that each chunk of data
 should have in the file. Use a smaller \code{chunk_size} when you need faster
@@ -44,13 +52,27 @@ the stream will be left open.
 Feather provides binary columnar serialization for data frames.
 It is designed to make reading and writing data frames efficient,
 and to make sharing data across data analysis languages easy.
-This function writes both the original, limited specification of the format
-and the version 2 specification, which is the Apache Arrow IPC file format.
+\code{\link[=write_feather]{write_feather()}} can write both the Feather Version 1 (V1),
+a legacy version available starting in 2016, and the Version 2 (V2),
+which is the Apache Arrow IPC file format.
+The default version is V2.
+V1 files are distinct from Arrow IPC files and lack many features,
+such as the ability to store all Arrow data types, and compression support.
+\code{\link[=write_ipc_file]{write_ipc_file()}} can only write V2 files.
 }
 \examples{
-tf <- tempfile()
-on.exit(unlink(tf))
-write_feather(mtcars, tf)
+# We recommend the ".arrow" extension for Arrow IPC files (Feather V2).
+tf1 <- tempfile(fileext = ".feather")
+tf2 <- tempfile(fileext = ".arrow")
+tf3 <- tempfile(fileext = ".arrow")
+on.exit({
+  unlink(tf1)
+  unlink(tf2)
+  unlink(tf3)
+})
+write_feather(mtcars, tf1, version = 1)
+write_feather(mtcars, tf2)
+write_ipc_file(mtcars, tf3)
 }
 \seealso{
 \link{RecordBatchWriter} for lower-level access to writing Arrow IPC data.
diff --git a/r/tests/testthat/test-feather.R b/r/tests/testthat/test-feather.R
index bed097762a..2120f6ac72 100644
--- a/r/tests/testthat/test-feather.R
+++ b/r/tests/testthat/test-feather.R
@@ -25,6 +25,13 @@ test_that("Write a feather file", {
   expect_identical(tib_out, tib)
 })
 
+test_that("write_ipc_file() returns its input", {
+  tib_out <- write_ipc_file(tib, feather_file)
+  expect_true(file.exists(feather_file))
+  # Input is returned unmodified
+  expect_identical(tib_out, tib)
+})
+
 expect_feather_roundtrip <- function(write_fun) {
   tf2 <- normalizePath(tempfile(), mustWork = FALSE)
   tf3 <- tempfile()
@@ -66,18 +73,25 @@ expect_feather_roundtrip <- function(write_fun) {
 test_that("feather read/write round trip", {
   expect_feather_roundtrip(function(x, f) write_feather(x, f, version = 1))
   expect_feather_roundtrip(function(x, f) write_feather(x, f, version = 2))
+  expect_feather_roundtrip(function(x, f) write_ipc_file(x, f))
   expect_feather_roundtrip(function(x, f) write_feather(x, f, chunk_size = 32))
+  expect_feather_roundtrip(function(x, f) write_ipc_file(x, f, chunk_size = 32))
   if (codec_is_available("lz4")) {
     expect_feather_roundtrip(function(x, f) write_feather(x, f, compression = "lz4"))
+    expect_feather_roundtrip(function(x, f) write_ipc_file(x, f, compression = "lz4"))
   }
   if (codec_is_available("zstd")) {
     expect_feather_roundtrip(function(x, f) write_feather(x, f, compression = "zstd"))
+    expect_feather_roundtrip(function(x, f) write_ipc_file(x, f, compression = "zstd"))
     expect_feather_roundtrip(function(x, f) write_feather(x, f, compression = "zstd", compression_level = 3))
+    expect_feather_roundtrip(function(x, f) write_ipc_file(x, f, compression = "zstd", compression_level = 3))
   }
 
   # Write from Arrow data structures
   expect_feather_roundtrip(function(x, f) write_feather(RecordBatch$create(x), f))
+  expect_feather_roundtrip(function(x, f) write_ipc_file(RecordBatch$create(x), f))
   expect_feather_roundtrip(function(x, f) write_feather(Table$create(x), f))
+  expect_feather_roundtrip(function(x, f) write_ipc_file(Table$create(x), f))
 })
 
 test_that("write_feather option error handling", {
@@ -103,6 +117,21 @@ test_that("write_feather option error handling", {
   expect_false(file.exists(tf))
 })
 
+test_that("write_ipc_file option error handling", {
+  tf <- tempfile()
+  expect_false(file.exists(tf))
+  expect_error(
+    write_ipc_file(tib, tf, version = 1),
+    "unused argument \\(version = 1\\)"
+  )
+  expect_error(
+    write_ipc_file(tib, tf, compression_level = 1024),
+    "Can only specify a 'compression_level' when 'compression' is 'zstd'"
+  )
+  expect_match_arg_error(write_ipc_file(tib, tf, compression = "bz2"))
+  expect_false(file.exists(tf))
+})
+
 test_that("write_feather with invalid input type", {
   bad_input <- Array$create(1:5)
   expect_snapshot_error(write_feather(bad_input, feather_file))
@@ -276,3 +305,7 @@ test_that("Error is created when feather reads a parquet file", {
     "Not a Feather V1 or Arrow IPC file"
   )
 })
+
+test_that("The read_ipc_file function is an alias of read_feather", {
+  expect_identical(read_ipc_file, read_feather)
+})
diff --git a/r/vignettes/arrow.Rmd b/r/vignettes/arrow.Rmd
index aafdc35ff5..bda717ecc4 100644
--- a/r/vignettes/arrow.Rmd
+++ b/r/vignettes/arrow.Rmd
@@ -47,7 +47,8 @@ The `arrow` package also includes a faster and more robust implementation of the
 on the same underlying C++ library as the Python version does,
 resulting in more reliable and consistent behavior across the two languages, as
 well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format,
+`arrow` also by default writes the Feather V2 format
+([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
 which supports a wider range of data types, as well as compression.
 
 For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.