Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/31 16:23:51 UTC
[GitHub] [arrow] vibhatha removed a comment on pull request #12185: ARROW-15020: [R] Add bindings for new dataset writing options

vibhatha removed a comment on pull request #12185:
URL: https://github.com/apache/arrow/pull/12185#issuecomment-1025961158


   @jonkeane I ran `devtools::document()`, and the resulting diff is below. I am including it because the command edited several files, and I want to confirm that I followed the steps correctly.
   
   ```git
   diff --git a/r/DESCRIPTION b/r/DESCRIPTION
   index ae4bbcb8c..c7daff152 100644
   --- a/r/DESCRIPTION
   +++ b/r/DESCRIPTION
   @@ -38,7 +38,7 @@ Imports:
        utils,
        vctrs
    Roxygen: list(markdown = TRUE, r6 = FALSE, load = "source")
   -RoxygenNote: 7.1.2
   +RoxygenNote: 7.1.2.9000
    Config/testthat/edition: 3
    VignetteBuilder: knitr
    Suggests:
   diff --git a/r/NAMESPACE b/r/NAMESPACE
   index d841bb290..6233b4fcd 100644
   --- a/r/NAMESPACE
   +++ b/r/NAMESPACE
   @@ -153,6 +153,8 @@ export(JsonTableReader)
    export(LargeListArray)
    export(ListArray)
    export(LocalFileSystem)
   +export(MapArray)
   +export(MapType)
    export(MemoryMappedFile)
    export(MessageReader)
    export(MessageType)
   @@ -251,6 +253,7 @@ export(list_flights)
    export(list_of)
    export(load_flight_server)
    export(map_batches)
   +export(map_of)
    export(match_arrow)
    export(matches)
    export(mmap_create)
   diff --git a/r/man/FileFormat.Rd b/r/man/FileFormat.Rd
   index cabacc937..3c6fd330b 100644
   --- a/r/man/FileFormat.Rd
   +++ b/r/man/FileFormat.Rd
   @@ -40,7 +40,7 @@ you encounter one that \code{arrow} should support. Also, the following options
    supported. From \link{CsvReadOptions}:
    \itemize{
    \item \code{skip_rows}
   -\item \code{column_names}
   +\item \code{column_names}. Note that if a \link{Schema} is specified, \code{column_names} must match those specified in the schema.
    \item \code{autogenerate_column_names}
    From \link{CsvFragmentScanOptions} (these values can be overridden at scan time):
    \item \code{convert_options}: a \link{CsvConvertOptions}
   diff --git a/r/man/array.Rd b/r/man/array.Rd
   index 81109a8f7..0b1371a56 100644
   --- a/r/man/array.Rd
   +++ b/r/man/array.Rd
   @@ -9,6 +9,7 @@
    \alias{ListArray}
    \alias{LargeListArray}
    \alias{FixedSizeListArray}
   +\alias{MapArray}
    \alias{StructScalar}
    \title{Arrow Arrays}
    \description{
   diff --git a/r/man/data-type.Rd b/r/man/data-type.Rd
   index 69475d8ad..dbf0436f7 100644
   --- a/r/man/data-type.Rd
   +++ b/r/man/data-type.Rd
   @@ -38,6 +38,8 @@
    \alias{large_list_of}
    \alias{FixedSizeListType}
    \alias{fixed_size_list_of}
   +\alias{MapType}
   +\alias{map_of}
    \title{Apache Arrow data types}
    \usage{
    int8()
   @@ -109,6 +111,8 @@ list_of(type)
    large_list_of(type)
    
    fixed_size_list_of(type, list_size)
   +
   +map_of(key_type, item_type, .keys_sorted = FALSE)
    }
    \arguments{
    \item{byte_width}{byte width for \code{FixedSizeBinary} type.}
   diff --git a/r/man/open_dataset.Rd b/r/man/open_dataset.Rd
   index 025ce5fd1..eef18b06e 100644
   --- a/r/man/open_dataset.Rd
   +++ b/r/man/open_dataset.Rd
   @@ -81,7 +81,7 @@ it is assumed to be "text".}
    \item{...}{additional arguments passed to \code{dataset_factory()} when \code{sources}
    is a directory path/URI or vector of file paths/URIs, otherwise ignored.
    These may include \code{format} to indicate the file format, or other
   -format-specific options.}
   +format-specific options (see \code{\link[=read_csv_arrow]{read_csv_arrow()}}, \code{\link[=read_parquet]{read_parquet()}} and \code{\link[=read_feather]{read_feather()}} on how to specify these).}
    }
    \value{
    A \link{Dataset} R6 object. Use \code{dplyr} methods on it to query the data,
   diff --git a/r/man/write_dataset.Rd b/r/man/write_dataset.Rd
   index 3c01396a1..5bcfef2b9 100644
   --- a/r/man/write_dataset.Rd
   +++ b/r/man/write_dataset.Rd
   @@ -13,6 +13,10 @@ write_dataset(
      hive_style = TRUE,
      existing_data_behavior = c("overwrite", "error", "delete_matching"),
      max_partitions = 1024L,
   +  max_open_files = 900L,
   +  max_rows_per_file = 0L,
   +  min_rows_per_group = 0L,
   +  max_rows_per_group = bitwShiftL(1, 20),
      ...
    )
    }
   @@ -56,6 +60,25 @@ partitions which data is not written to.
    \item{max_partitions}{maximum number of partitions any batch may be
    written into. Default is 1024L.}
    
   +\item{max_open_files}{maximum number of files that can be left opened
   +during a write operation. If greater than 0 then this will limit the
   +maximum number of files that can be left open. If an attempt is made to open
   +too many files then the least recently used file will be closed.
   +If this setting is set too low you may end up fragmenting your data
   +into many small files. The default is 900 which also allows some # of files to be
   +open by the scanner before hitting the default Linux limit of 1024.}
   +
   +\item{max_rows_per_file}{maximum number of rows to be placed in
   +any single file}
   +
   +\item{min_rows_per_group}{write the row groups to the disk when this number of
   +rows have accumulated.}
   +
   +\item{max_rows_per_group}{maximum rows allowed in a single
   +group and when this number of rows is exceeded, it is split and the next set
   +of rows is written to the next group. This value must be set such that it is
   +greater than \code{min_rows_per_group}.}
   +
    \item{...}{additional format-specific arguments. For available Parquet
    options, see \code{\link[=write_parquet]{write_parquet()}}. The available Feather options are:
    \itemize{
   ```
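
   As a rough illustration of the four arguments added to `write_dataset()` in the diff above, the sketch below writes a built-in data frame using small, purely illustrative limits. It assumes an `arrow` build that already includes these bindings and has dataset support enabled; the temporary output directory and the chosen row limits are hypothetical and are not taken from the PR.

   ```r
   library(arrow)

   # Hypothetical output location; any writable directory would do.
   out_dir <- tempfile("write_dataset_demo_")

   # mtcars ships with R (32 rows); partition it by the `cyl` column.
   write_dataset(
     mtcars,
     out_dir,
     format = "parquet",
     partitioning = "cyl",
     max_open_files = 900L,     # default shown in the diff
     max_rows_per_file = 10L,   # illustrative: cap each file at 10 rows
     min_rows_per_group = 0L,   # default shown in the diff
     max_rows_per_group = 5L    # illustrative: kept above min_rows_per_group
                                # and at or below max_rows_per_file
   )

   # Inspect what was written and read it back as a Dataset.
   files <- list.files(out_dir, recursive = TRUE)
   print(files)
   ds <- open_dataset(out_dir)
   print(ds$num_rows)
   ```

   Lowering `max_rows_per_file` should spread each partition across more files, which is a quick way to check that the new options reach the writer.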

