You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/09 16:51:20 UTC

[GitHub] [arrow] nealrichardson commented on a change in pull request #8389: ARROW-8296: [C++][Dataset] Add IpcFileWriteOptions

nealrichardson commented on a change in pull request #8389:
URL: https://github.com/apache/arrow/pull/8389#discussion_r502545157



##########
File path: r/src/arrow_metadata.h
##########
@@ -0,0 +1,38 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "./arrow_types.h"
+
+#if defined(ARROW_R_WITH_ARROW)
+#include <arrow/util/key_value_metadata.h>
+
+inline std::shared_ptr<arrow::KeyValueMetadata> KeyValueMetadata__Make(

Review comment:
       Two thoughts:
   
   1. We already have KeyValueMetadata support in R, so if these helpers are necessary, I'd expect to see them used in those other places too.
   2. It looks like these functions are opposites, one takes a named character vector and turns it into arrow::KVM, and the other takes KVM and makes a named character vector. Can we make the function names more obvious about that?

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -943,10 +943,42 @@ test_that("Dataset writing: from RecordBatch", {
   )
 })
 
+test_that("Writing a dataset: Ipc format options & compression", {
+  skip_on_os("windows") # https://issues.apache.org/jira/browse/ARROW-9651
+  ds <- open_dataset(csv_dir, partitioning = "part", format = "csv")
+  dst_dir <- make_temp_dir()
+
+  metadata <- c(hello = "world", eh = "!")
+  codec <- NULL
+  if (codec_is_available("zstd")) {
+    codec <- Codec$create("zstd")
+  }
+
+  write_dataset(ds, dst_dir, format = "feather", metadata = metadata, codec = codec)
+  expect_true(dir.exists(dst_dir))
+
+  file <- ds$filesystem$OpenInputStream(paste(dst_dir, dir(dst_dir)[[1]], sep = "/"))
+  expect_equivalent(metadata, RecordBatchFileReader$create(file)$metadata)

Review comment:
       ```suggestion
     file <- dir(dst_dir, full.names = TRUE)[[1]]
     expect_equivalent(read_feather(file, as_data_frame = FALSE)$metadata, metadata)
   ```

##########
File path: r/configure
##########
@@ -201,7 +201,10 @@ else
 fi
 
 # Write to Makevars
-sed -e "s|@cflags@|$PKG_CFLAGS|" -e "s|@libs@|$PKG_LIBS|" src/Makevars.in > src/Makevars
+sed -e "s|@cflags@|$PKG_CFLAGS|" \

Review comment:
       Please revert this change and the one in Makevars.in. This will provoke the wrath of CRAN (and it should cause at least one nightly build failure).

##########
File path: r/src/recordbatchreader.cpp
##########
@@ -74,6 +76,12 @@ int ipc___RecordBatchFileReader__num_record_batches(
   return reader->num_record_batches();
 }
 
+// [[arrow::export]]
+cpp11::strings ipc___RecordBatchFileReader__metadata(

Review comment:
       It's cool that you added this but it's probably not necessary; see below for a simpler test.

##########
File path: r/src/arrow_cpp11.h
##########
@@ -157,19 +157,42 @@ struct ns {
   static SEXP arrow;
 };
 
+// Specialize this struct to define a default value to be used when NULL is given.

Review comment:
       Why is this change necessary?

##########
File path: r/R/dataset-format.R
##########
@@ -139,6 +139,13 @@ FileWriteOptions <- R6Class("FileWriteOptions", inherit = ArrowObject,
         dataset___ParquetFileWriteOptions__update(self,
             ParquetWriterProperties$create(...),
             ParquetArrowWriterProperties$create(...))
+      } else if (self$type == "ipc") {
+        args <- list(...)
+        dataset___IpcFileWriteOptions__update(self,
+            get_ipc_use_legacy_format(args$use_legacy_format),
+            args$codec,

Review comment:
       I think a better approach for `codec` would be to allow you to specify the parameters like `write_feather` (though fine if you want to provide your own Codec I guess). so something like this:
   
   ```suggestion
               args$codec %||% Codec$create(
                 args$compression %||% "uncompressed", 
                 args$compression_level %||% NA
               ),
   ```
   
   Perhaps this lets you get rid of your funky `nil_value` helper too? With this change, you're always providing a Codec to C++, never `NULL`. 

##########
File path: r/R/dataset-write.R
##########
@@ -44,7 +44,8 @@
 #' @param filesystem A [FileSystem] where the dataset should be written if it is a
 #' string file path; default is the local file system
 #' @param ... additional format-specific arguments. For available Parquet
-#' options, see [write_parquet()].
+#' options, see [write_parquet()]. For available Feather options, see
+#' [RecordBatchFileWriter$create()].

Review comment:
       I don't think that doc link will work, and if you were to do `make doc`, this will probably fail check due to a bad xref. The options also aren't well documented there, nor are they all exposed in write_feather, so it might be best to enumerate them here.

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -943,10 +943,42 @@ test_that("Dataset writing: from RecordBatch", {
   )
 })
 
+test_that("Writing a dataset: Ipc format options & compression", {
+  skip_on_os("windows") # https://issues.apache.org/jira/browse/ARROW-9651
+  ds <- open_dataset(csv_dir, partitioning = "part", format = "csv")
+  dst_dir <- make_temp_dir()
+
+  metadata <- c(hello = "world", eh = "!")
+  codec <- NULL
+  if (codec_is_available("zstd")) {
+    codec <- Codec$create("zstd")
+  }
+
+  write_dataset(ds, dst_dir, format = "feather", metadata = metadata, codec = codec)
+  expect_true(dir.exists(dst_dir))
+
+  file <- ds$filesystem$OpenInputStream(paste(dst_dir, dir(dst_dir)[[1]], sep = "/"))
+  expect_equivalent(metadata, RecordBatchFileReader$create(file)$metadata)
+
+  new_ds <- open_dataset(dst_dir, format = "feather")
+  expect_equivalent(
+    new_ds %>%
+      select(string = chr, integer = int) %>%
+      filter(integer > 6 & integer < 11) %>%
+      collect() %>%
+      summarize(mean = mean(integer)),
+    df1 %>%
+      select(string = chr, integer = int) %>%
+      filter(integer > 6) %>%
+      summarize(mean = mean(integer))
+  )
+})
+
 test_that("Writing a dataset: Parquet format options", {
   skip_on_os("windows") # https://issues.apache.org/jira/browse/ARROW-9651
   ds <- open_dataset(csv_dir, partitioning = "part", format = "csv")
   dst_dir <- make_temp_dir()
+  dst_dir_no_truncated_timestamps <- make_temp_dir()

Review comment:
       Why this change? I wouldn't expect this PR to affect parquet writing options, should it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org