
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/26 22:46:41 UTC

[GitHub] [arrow] wjones127 opened a new pull request, #13005: ARROW-16272: Arrow 8.0 News

wjones127 opened a new pull request, #13005:
URL: https://github.com/apache/arrow/pull/13005

   Let me know if I've missed anything important in this release!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] paleolimbot commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

paleolimbot commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r861982016


##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 

Review Comment:
   The existing documentation is correct, although a little confusing: `as_arrow_array()` is about converting *to* an Arrow Array, but you need to use an `ExtensionType` subclass in order to customize converting *from* an Arrow Array. Another compelling use of `ExtensionType` is, as Neal mentioned, where the type is defined in a Python package as well.
   
   Perhaps a solution here is to group the S3 Generics heading and the ExtensionType heading, because they're both under the theme of extensibility? Maybe:
   
   ```
   ### Extensibility
   
   - Added S3 generic methods to create the core Arrow object types. In particular, packages can define the `as_arrow_array()` generic to ensure that a custom vector type is converted to an Arrow Array in a particular way (e.g., when converting a `data.frame` to an Arrow Table). Packages can also define an `as_arrow_table()` method to customize conversion of a table-like object (e.g., when an object is passed to `write_parquet()` or `write_feather()`).
   - Custom [ExtensionType](https://arrow.apache.org/docs/format/Columnar.html#extension-types)s can be created and registered, allowing other packages to define their own array types and/or conversions from Arrow Arrays to R vectors. Extension arrays wrap regular Arrow array types and provide customized behavior and/or storage. See documentation for `new_extension_type()` for details.
   - Implemented a generic extension type and `as_arrow_array()` methods for all objects where `vctrs::vec_is()` returns `TRUE` (i.e., any object that can be used as a column in a `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted to an Arrow Array.
   ```
   
   (feel free to mix/match/scramble/disregard this with what you've written already!)
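
As a rough illustration of the first bullet in the suggestion above, a package could define an `as_arrow_array()` method along these lines. The `my_label` class and its `new_label()` constructor are hypothetical names made up for this sketch; only the `as_arrow_array()` generic and `Array$create()` come from arrow.

```r
library(arrow)

# Hypothetical S3 class (not part of arrow): a list wrapping a character vector.
new_label <- function(x) structure(list(value = x), class = "my_label")

# S3 method for the as_arrow_array() generic added in arrow 8.0.
# It controls how a my_label object is converted *to* an Arrow Array.
as_arrow_array.my_label <- function(x, ..., type = NULL) {
  Array$create(x$value, type = type)
}

labels <- new_label(c("a", "b", "c"))
arr <- as_arrow_array(labels)

arr$type      # the Arrow type chosen for the character storage
arr$length()  # number of elements in the resulting Array
```

This only covers conversion to Arrow. Customizing the conversion back from an Arrow Array to an R vector is the part that needs an `ExtensionType` subclass, as noted in the comment.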




[GitHub] [arrow] ursabot commented on pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #13005:
URL: https://github.com/apache/arrow/pull/13005#issuecomment-1120194997

   Benchmark runs are scheduled for baseline = 6b32c300e1655b7e8eb2271b581948fb7864af12 and contender = 526fa070c82c0e1c6d26a4c1d06a591b37c05011. 526fa070c82c0e1c6d26a4c1d06a591b37c05011 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/fae4c263f423458aa547c726514cac35...7e0df42366f047e8820e2586585a2793/)
   [Finished :arrow_down:0.08% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/7942d1042f954611a745895556d81fe5...9fa8e56138c54bb7a3ed1d709c0e0cbd/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6389f93766234704ace4401d6b1a7e64...0be7f52122184c1092f19d0f934be5b7/)
   [Finished :arrow_down:0.28% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/7dcb95ac46bc409fabacbd952fa55f3d...d16a01e9b6664b31a369bda43028bc1b/)
   Buildkite builds:
   [Finished] [`526fa070` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/689)
   [Finished] [`526fa070` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/686)
   [Finished] [`526fa070` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/675)
   [Finished] [`526fa070` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/691)
   [Finished] [`6b32c300` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/688)
   [Finished] [`6b32c300` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/685)
   [Finished] [`6b32c300` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/674)
   [Finished] [`6b32c300` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/690)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r862093726


##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 

Review Comment:
   Yeah agreed that combining the sections makes sense.




[GitHub] [arrow] eitsupi commented on pull request #13005: ARROW-16276: Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

eitsupi commented on PR #13005:
URL: https://github.com/apache/arrow/pull/13005#issuecomment-1110835008

   It might be better to describe function and package names as `` `{dplyr}` `` or `` `write_dataset()` `` so that pkgdown can create the links automatically.
   
   https://github.com/r-lib/pkgdown/blob/feb91bf46c3ea78f8a03aead9f9a4934e3965ba4/vignettes/linking.Rmd?rgh-link-date=2022-02-09T10%3A52%3A57Z#L20-L36



[GitHub] [arrow] github-actions[bot] commented on pull request #13005: ARROW-16272: Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13005:
URL: https://github.com/apache/arrow/pull/13005#issuecomment-1110321392

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.



[GitHub] [arrow] github-actions[bot] commented on pull request #13005: ARROW-16272: Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13005:
URL: https://github.com/apache/arrow/pull/13005#issuecomment-1110321381

   https://issues.apache.org/jira/browse/ARROW-16272



[GitHub] [arrow] nealrichardson commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r862955886


##########
r/NEWS.md:
##########
@@ -19,19 +19,123 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset()` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow()` can write a `Dataset` or an Arrow dplyr query to a single file.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/package=tzdb) is also
+* Timezone operations are supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extensibility
+
+* Added S3 generic conversion functions such as `as_arrow_array()`
+  and `as_arrow_table()` for main Arrow objects. This includes, Arrow tables,
+  record batches, arrays, chunked arrays, record batch readers, schemas, and
+  data types. This allows other packages to define custom conversions from their
+  types to Arrow objects, including extension arrays.
+* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) 
+  can be created and registered, allowing other packages to
+  define their own array types. Extension arrays wrap regular Arrow array types and
+  provide customized behavior and/or storage. See description and an example with
+  `?new_extension_type`.
+* Implemented a generic extension type and as_arrow_array() methods for all objects where     
+  `vctrs::vec_is()` returns TRUE (i.e., any object that can be used as a column in a 
+  `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted 
+  to an Arrow Array.
+
+## Concatenation Support
+
+Arrow arrays and tables can be easily concatenated:
+
+ * Arrays can be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * ChunkedArrays can be concatenated with `c()`.
+ * RecordBatches and Tables support `cbind()`.
+ * Tables support `rbind()`. `concat_tables()` is also provided to 
+ * Chunked arrays can be concatenated with `c()`.
+ * Record batches and tables support `cbind()`.
+ * Arrow tables support `rbind()`. `concat_tables()` is also provided to 
+   concatenate tables while unifying schemas.
+
+## Other improvements and fixes
+
+* Dictionary arrays support using ALTREP when converting to R factors.
+* Math group generics are implemented for ArrowDatum. This means you can use
+  base functions like `sqrt()`, `log()`, and `exp()` with Arrow arrays and scalars.
+* `read_*` and `write_*` functions support R Connection objects for reading
+  and writing files.
+* Parquet improvements:
+  * Parquet writer supports Duration type columns.
+  * The dataset Parquet reader consumes less memory.
 * `median()` and `quantile()` will warn once about approximate calculations regardless of interactivity.
-* Removed Solaris workarounds, libarrow is now required.
+* `Array$cast()` can cast struct arrays into another struct type with the same field names
+  and structure (or a subset of fields) but different field types.
+* The CSV writer is now much faster when writing string columns.
+* Remove special handling for Solaris
+* The CSV writer is much faster when writing string columns.
+* Removed Solaris workarounds, libarrow is required.

Review Comment:
   ```suggestion
   ```
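
For readers skimming the hunk, the "Concatenation Support" entries quoted above can be tried with something like the following. This is a minimal sketch on toy data, assuming arrow 8.0 or later is installed; the variable names are illustrative only.

```r
library(arrow)

a <- Array$create(c(1L, 2L))
b <- Array$create(c(3L, 4L))

# Copies both inputs into one contiguous Array.
combined <- concat_arrays(a, b)

# Zero-copy alternative: keeps the inputs as two separate chunks.
chunked <- ChunkedArray$create(a, b)

t1 <- Table$create(x = 1:2, y = c("a", "b"))
t2 <- Table$create(x = 3:4, y = c("c", "d"))

# Row-wise concatenation of tables that share a schema.
stacked <- rbind(t1, t2)

# concat_tables() also stacks tables and is the variant described in the
# NEWS entry as unifying schemas.
stacked2 <- concat_tables(t1, t2)
```

The choice between `concat_arrays()` and `ChunkedArray$create()` follows the trade-off stated in the NEWS entry: the former produces a single array, the latter avoids copying at the cost of chunking.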




[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r861908400


##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 

Review Comment:
   Good point, and I think I agree with you. I borrowed this language from the extension types R docs:
   
   https://github.com/apache/arrow/blob/d6ca3e2e9e995ff42df2465484ebf86de853a136/r/R/extension.R#L262-L268
   
   @paleolimbot Any thoughts on that?
   
   I can simply cut that part out for now and link to the existing docs on extension types:
   
   > Custom [extension arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) can be created and registered, allowing other packages to define their own array types. Extension arrays wrap regular Arrow array types and provide customized behavior and/or storage. See further description and an example with `?new_extension_type`.
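
To make the proposed wording concrete, here is a minimal sketch of creating an extension type and wrapping a storage array with it, assuming the `new_extension_type()` and `new_extension_array()` helpers referenced above. The extension name `"example.tagged_int"` and the `"meters"` metadata are made-up values for illustration; a real package would usually also subclass `ExtensionType`, as shown in `?new_extension_type`.

```r
library(arrow)

# Hypothetical extension type: int32 storage tagged with some metadata.
tagged_type <- new_extension_type(
  storage_type = int32(),
  extension_name = "example.tagged_int",
  extension_metadata = charToRaw("meters")
)

# The extension array wraps a regular Arrow array of the storage type.
storage <- Array$create(c(1L, 2L, 3L))
tagged <- new_extension_array(storage, tagged_type)

tagged_type$extension_name()  # the name given to the type above
tagged$storage()              # the underlying int32 Array
```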




[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r864037507


##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),

Review Comment:
   I think I'd rather limit these parentheticals to just explaining abbreviations (tz, dst, epiyear), rather than having them try to function as docs. We link to the lubridate function docs directly for each bullet, so more detail is readily available to the reader.




[GitHub] [arrow] paleolimbot commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

paleolimbot commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r861728055


##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when

Review Comment:
   ```suggestion
   * `write_dataset()` now has more options for controlling row group and file sizes when
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 
+is to define a customized conversion between an an Arrow Array and an R object 
+when the default conversion is slow or looses metadata important to the interpretation
+of values in the array. For most types, the built-in vctrs extension type is probably 
+sufficient. See description and an example with `?new_extension_type`.
+
+## Concatenation Support
+
+Arrow arrays and tables can now be easily concatenated:
+
+ * Arrays can now be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * Chunked arrays can now be concatenated with `c()`.
+ * Record batches and tables now support `cbind()`.
+ * Arrow tables now support `rbind()`. `concat_tables()` is also provided to 
+   concatenate tables while unifying schemas.
+
+## S3 Conversion Generics
+
+Arrow now provides S3 generic conversion functions such as `as_arrow_array()`
+and `as_chunked_array()` for main Arrow objects. This includes, Arrow tables,

Review Comment:
   ```suggestion
   and `as_arrow_table()` for main Arrow objects. This includes, Arrow tables,
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also

Review Comment:
   ```suggestion
     [tzdb package](https://cran.r-project.org/package=tzdb) is also
   ```
   
   (I know that's a weird URL, but not using the 'canonical version' triggers, or used to trigger, a NOTE in `R CMD check`.)




[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r861884265


##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also

Review Comment:
   Good to know! Would `R CMD check` show this NOTE, or does it come from some other command?
   




[GitHub] [arrow] nealrichardson closed pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

nealrichardson closed pull request #13005: ARROW-16276: [R] Arrow 8.0 News
URL: https://github.com/apache/arrow/pull/13005



[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r864022568


##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extensibility
+
+* Added S3 generic conversion functions such as `as_arrow_array()`
+  and `as_arrow_table()` for main Arrow objects. This includes, Arrow tables,
+  record batches, arrays, chunked arrays, record batch readers, schemas, and
+  data types. This allows other packages to define custom conversions from their
+  types to Arrow objects, including extension arrays.
+* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) 
+  can be created and registered, allowing other packages to
+  define their own array types. Extension arrays wrap regular Arrow array types and
+  provide customized behavior and/or storage. See description and an example with
+  `?new_extension_type`.
+* Implemented a generic extension type and as_arrow_array() methods for all objects where     
+  `vctrs::vec_is()` returns TRUE (i.e., any object that can be used as a column in a 
+  `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted 
+  to an Arrow Array.
+
+## Concatenation Support
+
+Arrow arrays and tables can be easily concatenated:
+
+ * Arrays can be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * ChunkedArrays can be concatenated with `c()`.
+ * RecordBatches and Tables support `cbind()`.
+ * Tables support `rbind()`. `concat_tables()` is also provided to

Review Comment:
   The alternative is to make it a Table, but that's not really new IMO. https://github.com/apache/arrow/blob/master/r/R/record-batch.R#L195




[GitHub] [arrow] eitsupi commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

eitsupi commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r862313809


##########
r/NEWS.md:
##########
@@ -19,19 +19,123 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset()` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow()` can write a `Dataset` or an Arrow dplyr query to a single file.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/package=tzdb) is also
+* Timezone operations are supported on Windows if the 

Review Comment:
   These lines should be removed.




[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r864041948


##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 

Review Comment:
   Done.




[GitHub] [arrow] github-actions[bot] commented on pull request #13005: ARROW-16276: Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13005:
URL: https://github.com/apache/arrow/pull/13005#issuecomment-1110321990

   :warning: Ticket **has no components in JIRA**, make sure you assign one.



[GitHub] [arrow] eitsupi commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

eitsupi commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r862313809


##########
r/NEWS.md:
##########
@@ -19,19 +19,123 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset()` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow()` can write a `Dataset` or an Arrow dplyr query to a single file.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/package=tzdb) is also
+* Timezone operations are supported on Windows if the 

Review Comment:
   These duplicate lines should be removed.



##########
r/NEWS.md:
##########
@@ -19,19 +19,123 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset()` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow()` can write a `Dataset` or an Arrow dplyr query to a single file.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/package=tzdb) is also
+* Timezone operations are supported on Windows if the 

Review Comment:
   These duplicated lines should be removed.




[GitHub] [arrow] paleolimbot commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

paleolimbot commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r861982016


##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 

Review Comment:
   The existing documentation is correct, although a little confusing: `as_arrow_array()` is about converting *to* an Arrow Array, but you need to use an `ExtensionType` subclass in order to customize converting *from* an Arrow Array. Another compelling use of `ExtensionType` is, as Neal mentioned, where the type is defined in a Python package as well.
   
   Perhaps a solution here is to group the S3 Generics heading and the ExtensionType heading, because they're both under the theme of extensibility? Maybe:
   
   ### Extensibility
   
   - Added S3 generic methods to create the core Arrow object types. In particular, packages can define the `as_arrow_array()` generic to ensure that a custom vector type is converted to an Arrow Array in a particular way (e.g., when converting a `data.frame` to an Arrow Table). Packages can also define an `as_arrow_table()` method to customize conversion of a table-like object (e.g., when an object is passed to `write_parquet()` or `write_feather()`).
   - Custom [ExtensionType](https://arrow.apache.org/docs/format/Columnar.html#extension-types)s can be created and registered, allowing other packages to define their own array types and/or conversions from Arrow Arrays to R vectors. Extension arrays wrap regular Arrow array types and provide customized behavior and/or storage. See documentation for `new_extension_type()` for details.
   - Implemented a generic extension type and `as_arrow_array()` methods for all objects where `vctrs::vec_is()` returns `TRUE` (i.e., any object that can be used as a column in a `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted to an Arrow Array.
   
   (feel free to mix/match/scramble/disregard this with what you've written already!)




[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r862965834


##########
r/NEWS.md:
##########
@@ -19,19 +19,123 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset()` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow()` can write a `Dataset` or an Arrow dplyr query to a single file.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/package=tzdb) is also
+* Timezone operations are supported on Windows if the 

Review Comment:
   Thanks! I clearly need to do a better job of checking the diffs of git merges.




[GitHub] [arrow] nealrichardson commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r863983260


##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 

Review Comment:
   is this correct?
   
   ```suggestion
       * `lubridate::date()` (extract date from timestamp), 
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),

Review Comment:
   what is epiyear? drop the parenthetical if we don't have anything to clarify



##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),

Review Comment:
   ?
   ```suggestion
       * `lubridate::tz()` (string timezone),
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extensibility
+
+* Added S3 generic conversion functions such as `as_arrow_array()`
+  and `as_arrow_table()` for main Arrow objects. This includes, Arrow tables,
+  record batches, arrays, chunked arrays, record batch readers, schemas, and
+  data types. This allows other packages to define custom conversions from their
+  types to Arrow objects, including extension arrays.
+* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) 
+  can be created and registered, allowing other packages to
+  define their own array types. Extension arrays wrap regular Arrow array types and
+  provide customized behavior and/or storage. See description and an example with
+  `?new_extension_type`.
+* Implemented a generic extension type and as_arrow_array() methods for all objects where     
+  `vctrs::vec_is()` returns TRUE (i.e., any object that can be used as a column in a 
+  `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted 
+  to an Arrow Array.
+
+## Concatenation Support
+
+Arrow arrays and tables can be easily concatenated:
+
+ * Arrays can be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * ChunkedArrays can be concatenated with `c()`.
+ * RecordBatches and Tables support `cbind()`.
+ * Tables support `rbind()`. `concat_tables()` is also provided to
+   concatenate tables while unifying schemas.
+
+## Other improvements and fixes
+
+* Dictionary arrays support using ALTREP when converting to R factors.
+* Math group generics are implemented for ArrowDatum. This means you can use
+  base functions like `sqrt()`, `log()`, and `exp()` with Arrow arrays and scalars.
+* `read_*` and `write_*` functions support R Connection objects for reading
+  and writing files.
+* Parquet improvements:
+  * Parquet writer supports Duration type columns.
+  * The dataset Parquet reader consumes less memory.
 * `median()` and `quantile()` will warn once about approximate calculations regardless of interactivity.
-* Removed Solaris workarounds, libarrow is now required.
+* `Array$cast()` can cast struct arrays into another struct type with the same field names

Review Comment:
   ```suggestion
   * `Array$cast()` can cast StructArrays into another struct type with the same field names
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extensibility
+
+* Added S3 generic conversion functions such as `as_arrow_array()`
+  and `as_arrow_table()` for main Arrow objects. This includes, Arrow tables,
+  record batches, arrays, chunked arrays, record batch readers, schemas, and
+  data types. This allows other packages to define custom conversions from their
+  types to Arrow objects, including extension arrays.
+* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) 
+  can be created and registered, allowing other packages to
+  define their own array types. Extension arrays wrap regular Arrow array types and
+  provide customized behavior and/or storage. See description and an example with
+  `?new_extension_type`.
+* Implemented a generic extension type and as_arrow_array() methods for all objects where     
+  `vctrs::vec_is()` returns TRUE (i.e., any object that can be used as a column in a 
+  `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted 
+  to an Arrow Array.
+
+## Concatenation Support
+
+Arrow arrays and tables can be easily concatenated:
+
+ * Arrays can be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * ChunkedArrays can be concatenated with `c()`.
+ * RecordBatches and Tables support `cbind()`.
+ * Tables support `rbind()`. `concat_tables()` is also provided to

Review Comment:
   Is this correct, i.e. there is no `rbind()` for RecordBatch? Wasn't there some alternative for concatenating batches?



##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extensibility
+
+* Added S3 generic conversion functions such as `as_arrow_array()`
+  and `as_arrow_table()` for main Arrow objects. This includes, Arrow tables,
+  record batches, arrays, chunked arrays, record batch readers, schemas, and
+  data types. This allows other packages to define custom conversions from their
+  types to Arrow objects, including extension arrays.
+* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) 
+  can be created and registered, allowing other packages to
+  define their own array types. Extension arrays wrap regular Arrow array types and
+  provide customized behavior and/or storage. See description and an example with
+  `?new_extension_type`.
+* Implemented a generic extension type and as_arrow_array() methods for all objects where     
+  `vctrs::vec_is()` returns TRUE (i.e., any object that can be used as a column in a 
+  `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted 
+  to an Arrow Array.
+
+## Concatenation Support
+
+Arrow arrays and tables can be easily concatenated:
+
+ * Arrays can be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * ChunkedArrays can be concatenated with `c()`.
+ * RecordBatches and Tables support `cbind()`.
+ * Tables support `rbind()`. `concat_tables()` is also provided to
+   concatenate tables while unifying schemas.
+
+## Other improvements and fixes
+
+* Dictionary arrays support using ALTREP when converting to R factors.
+* Math group generics are implemented for ArrowDatum. This means you can use
+  base functions like `sqrt()`, `log()`, and `exp()` with Arrow arrays and scalars.
+* `read_*` and `write_*` functions support R Connection objects for reading
+  and writing files.
+* Parquet improvements:
+  * Parquet writer supports Duration type columns.
+  * The dataset Parquet reader consumes less memory.
 * `median()` and `quantile()` will warn once about approximate calculations regardless of interactivity.

Review Comment:
   ```suggestion
   * `median()` and `quantile()` will warn only once about approximate calculations regardless of interactivity.
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extensibility
+
+* Added S3 generic conversion functions such as `as_arrow_array()`
+  and `as_arrow_table()` for main Arrow objects. This includes, Arrow tables,
+  record batches, arrays, chunked arrays, record batch readers, schemas, and
+  data types. This allows other packages to define custom conversions from their
+  types to Arrow objects, including extension arrays.
+* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) 
+  can be created and registered, allowing other packages to
+  define their own array types. Extension arrays wrap regular Arrow array types and
+  provide customized behavior and/or storage. See description and an example with
+  `?new_extension_type`.
+* Implemented a generic extension type and as_arrow_array() methods for all objects where     
+  `vctrs::vec_is()` returns TRUE (i.e., any object that can be used as a column in a 
+  `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted 
+  to an Arrow Array.
+
+## Concatenation Support
+
+Arrow arrays and tables can be easily concatenated:
+
+ * Arrays can be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * ChunkedArrays can be concatenated with `c()`.
+ * RecordBatches and Tables support `cbind()`.
+ * Tables support `rbind()`. `concat_tables()` is also provided to
+   concatenate tables while unifying schemas.
+
+## Other improvements and fixes
+
+* Dictionary arrays support using ALTREP when converting to R factors.
+* Math group generics are implemented for ArrowDatum. This means you can use
+  base functions like `sqrt()`, `log()`, and `exp()` with Arrow arrays and scalars.
+* `read_*` and `write_*` functions support R Connection objects for reading
+  and writing files.
+* Parquet improvements:
+  * Parquet writer supports Duration type columns.
+  * The dataset Parquet reader consumes less memory.
 * `median()` and `quantile()` will warn once about approximate calculations regardless of interactivity.
-* Removed Solaris workarounds, libarrow is now required.
+* `Array$cast()` can cast struct arrays into another struct type with the same field names
+  and structure (or a subset of fields) but different field types.
+* Removed special handling for Solaris.
+* The CSV writer is much faster when writing string columns.
+* Fixed an issue where `set_io_thread_count()` would set the CPU count instead of
+  the IO thread count.
+* `RandomAccessFile` has a `$ReadMetadata()` method that provides useful
+  metadata provided by the filesystem.
+* `grepl` binding returns `FALSE` for `NA` inputs (previously it returned `NA`),
+  which matches the behavior of `base::grepl`.

Review Comment:
   ```suggestion
     to match the behavior of `base::grepl()`.
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),

Review Comment:
   this?
   ```suggestion
       * `lubridate::dst()` (daylight savings time indicator, logical/boolean),
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 

Review Comment:
   Drop "Added" from all of these; it seems inconsistent with the ones above.




[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r864042096


##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),

Review Comment:
   "year according to epidemiological week calendar".

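   To illustrate the suggested wording, here is a minimal sketch of how the epidemiological year can differ from the calendar year around New Year. It assumes `{lubridate}` and `{dplyr}` are installed and, for the Arrow part, arrow 8.0.0 or later; the column names are made up for illustration.

   ```r
   library(arrow)
   library(dplyr)
   library(lubridate)

   # Dates around the turn of the year. 2021-01-01 falls in an
   # epidemiological week that lies mostly in 2020.
   dates <- as.Date(c("2020-12-31", "2021-01-01", "2021-01-03"))

   # Plain lubridate: calendar year next to epidemiological year.
   in_r <- data.frame(
     d = dates,
     calendar_year = year(dates),
     epi_year = epiyear(dates)
   )
   print(in_r)

   # The same extraction inside an Arrow dplyr query
   # (assumes the epiyear() binding described in this NEWS entry).
   in_arrow <- Table$create(d = dates) %>%
     mutate(epi_year = epiyear(d)) %>%
     collect()
   print(in_arrow)
   ```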



[GitHub] [arrow] nealrichardson commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r862955688


##########
r/NEWS.md:
##########
@@ -19,19 +19,123 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset()` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow()` can write a `Dataset` or an Arrow dplyr query to a single file.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays support `base::format()`
+  * `strptime()` returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/package=tzdb) is also
+* Timezone operations are supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extensibility
+
+* Added S3 generic conversion functions such as `as_arrow_array()`
+  and `as_arrow_table()` for main Arrow objects. This includes, Arrow tables,
+  record batches, arrays, chunked arrays, record batch readers, schemas, and
+  data types. This allows other packages to define custom conversions from their
+  types to Arrow objects, including extension arrays.
+* Custom [extension types and arrays](https://arrow.apache.org/docs/format/Columnar.html#extension-types) 
+  can be created and registered, allowing other packages to
+  define their own array types. Extension arrays wrap regular Arrow array types and
+  provide customized behavior and/or storage. See description and an example with
+  `?new_extension_type`.
+* Implemented a generic extension type and as_arrow_array() methods for all objects where     
+  `vctrs::vec_is()` returns TRUE (i.e., any object that can be used as a column in a 
+  `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted 
+  to an Arrow Array.
+
+## Concatenation Support
+
+Arrow arrays and tables can be easily concatenated:
+
+ * Arrays can be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * ChunkedArrays can be concatenated with `c()`.
+ * RecordBatches and Tables support `cbind()`.
+ * Tables support `rbind()`. `concat_tables()` is also provided to 
+ * Chunked arrays can be concatenated with `c()`.
+ * Record batches and tables support `cbind()`.
+ * Arrow tables support `rbind()`. `concat_tables()` is also provided to 
+   concatenate tables while unifying schemas.
+
+## Other improvements and fixes
+
+* Dictionary arrays support using ALTREP when converting to R factors.
+* Math group generics are implemented for ArrowDatum. This means you can use
+  base functions like `sqrt()`, `log()`, and `exp()` with Arrow arrays and scalars.
+* `read_*` and `write_*` functions support R Connection objects for reading
+  and writing files.
+* Parquet improvements:
+  * Parquet writer supports Duration type columns.
+  * The dataset Parquet reader consumes less memory.
 * `median()` and `quantile()` will warn once about approximate calculations regardless of interactivity.
-* Removed Solaris workarounds, libarrow is now required.
+* `Array$cast()` can cast struct arrays into another struct type with the same field names
+  and structure (or a subset of fields) but different field types.
+* The CSV writer is now much faster when writing string columns.

Review Comment:
   More duplicated lines
   
   ```suggestion
   ```




[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r864024451


##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 

Review Comment:
   It was originally intended to be supported, but wasn't tested and there was a regression that broke it, at least for the past few versions. It was fixed in https://github.com/apache/arrow/pull/12629.
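
   For reference, the behavior under discussion can be sketched as follows. This is only a sketch, assuming an arrow build with dataset support; the temporary directories and column names are hypothetical.

   ```r
   library(arrow)

   # Two small datasets whose schemas only partly overlap.
   dir_a <- tempfile("ds_a")
   dir_b <- tempfile("ds_b")
   write_dataset(data.frame(id = 1:3, x = c(1.5, 2.5, 3.5)), dir_a)
   write_dataset(data.frame(id = 4:6, y = c("a", "b", "c")), dir_b)

   # Passing a list of Dataset objects asks open_dataset() to unify
   # the schemas and combine the inputs into one dataset.
   ds <- open_dataset(list(open_dataset(dir_a), open_dataset(dir_b)))

   class(ds)  # should include "UnionDataset"
   names(ds)  # fields of the unified schema
   ```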




[GitHub] [arrow] wjones127 commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

wjones127 commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r864030509


##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are supported on `RecordBatchReader`. This allows, for example, results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - supports `dplyr::rename_with()`.
+  - `dplyr::count()` returns an ungrouped dataframe.
+* `write_dataset` has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` correctly accepts `Dataset` objects.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 

Review Comment:
   Yes.




[GitHub] [arrow] nealrichardson commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r863982366


##########
r/NEWS.md:
##########
@@ -19,19 +19,110 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - can take a list of datasets with differing schemas and attempt to unify the 

Review Comment:
   Is this new? I thought this had been supported from the beginning.




[GitHub] [arrow] github-actions[bot] commented on pull request #13005: ARROW-16276: Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #13005:
URL: https://github.com/apache/arrow/pull/13005#issuecomment-1110321977

   https://issues.apache.org/jira/browse/ARROW-16276



[GitHub] [arrow] paleolimbot commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

paleolimbot commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r861735606


##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 
+is to define a customized conversion between an an Arrow Array and an R object 
+when the default conversion is slow or looses metadata important to the interpretation
+of values in the array. For most types, the built-in vctrs extension type is probably 
+sufficient. See description and an example with `?new_extension_type`.
+
+## Concatenation Support
+
+Arrow arrays and tables can now be easily concatenated:
+
+ * Arrays can now be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * Chunked arrays can now be concatenated with `c()`.
+ * Record batches and tables now support `cbind()`.
+ * Arrow tables now support `rbind()`. `concat_tables()` is also provided to 
+   concatenate tables while unifying schemas.
+
+## S3 Conversion Generics
+
+Arrow now provides S3 generic conversion functions such as `as_arrow_array()`
+and `as_chunked_array()` for main Arrow objects. This includes, Arrow tables,

Review Comment:
   (just because the array and table methods are the ones most likely to be used)




[GitHub] [arrow] nealrichardson commented on a diff in pull request #13005: ARROW-16276: [R] Arrow 8.0 News

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on code in PR #13005:
URL: https://github.com/apache/arrow/pull/13005#discussion_r861791774


##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 

Review Comment:
   Should we link to the format docs on extension types? https://arrow.apache.org/docs/format/Columnar.html#extension-types
   
   There are also some use cases described there. 
   
   Also, I'm not sure the use case here is correct. If it's just about custom serialization of R objects, isn't that what `as_arrow_array` is for? Extension types are about when you need to define a standard outside of just this implementation, like when you want to have Python and R both understand the semantics of the data. If you're just trying to round trip data with R, the regular R metadata mechanism works for you, and if you need to serialize/deserialize the data differently, define an S3 method. 
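
   As a rough illustration of the S3 route mentioned above, the sketch below defines an `as_arrow_array()` method for a made-up class. The `temperature` class and its `unit` attribute are hypothetical, and dropping the class and attributes on conversion is just one possible choice, not a recommendation from the package. It assumes arrow 8.0.0 or later, where `as_arrow_array()` is an S3 generic.

   ```r
   library(arrow)

   # Hypothetical class: a double vector tagged with a unit.
   new_temperature <- function(x, unit = "C") {
     structure(x, unit = unit, class = "temperature")
   }

   # S3 method controlling how "temperature" objects become Arrow arrays.
   # Here the class and attributes are dropped and the plain doubles are
   # handed to the default conversion.
   as_arrow_array.temperature <- function(x, ..., type = NULL) {
     as_arrow_array(as.numeric(unclass(x)), ..., type = type)
   }

   arr <- as_arrow_array(new_temperature(c(20.5, 21.0)))
   arr$type
   ```

   A method like this only affects conversion within R. If other Arrow implementations need to understand the semantics of the data, that is where an extension type (see `?new_extension_type`) would come in.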
   
   



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 
+is to define a customized conversion between an an Arrow Array and an R object 
+when the default conversion is slow or looses metadata important to the interpretation
+of values in the array. For most types, the built-in vctrs extension type is probably 
+sufficient. See description and an example with `?new_extension_type`.
+
+## Concatenation Support
+
+Arrow arrays and tables can now be easily concatenated:
+
+ * Arrays can now be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * Chunked arrays can now be concatenated with `c()`.
+ * Record batches and tables now support `cbind()`.
+ * Arrow tables now support `rbind()`. `concat_tables()` is also provided to 

Review Comment:
   ```suggestion
    * Tables support `rbind()`. `concat_tables()` is also provided to 
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB

Review Comment:
   This is just one example use case; there are (and will be) others. For example, you can pass a RecordBatchReader over the C interface, so you can get one from anywhere in pyarrow, including Flight, and run dplyr on it.



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 
+is to define a customized conversion between an an Arrow Array and an R object 
+when the default conversion is slow or looses metadata important to the interpretation
+of values in the array. For most types, the built-in vctrs extension type is probably 
+sufficient. See description and an example with `?new_extension_type`.
+
+## Concatenation Support
+
+Arrow arrays and tables can now be easily concatenated:
+
+ * Arrays can now be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * Chunked arrays can now be concatenated with `c()`.
+ * Record batches and tables now support `cbind()`.
+ * Arrow tables now support `rbind()`. `concat_tables()` is also provided to 
+   concatenate tables while unifying schemas.
+
+## S3 Conversion Generics
+
+Arrow now provides S3 generic conversion functions such as `as_arrow_array()`
+and `as_chunked_array()` for main Arrow objects. This includes, Arrow tables,
+record batches, arrays, chunked arrays, record batch readers, schemas, and
+data types. This allows other packages to define custom conversions from their
+types to Arrow objects, including extension arrays.
+
+## Other improvements and fixes
+
+* Dictionary arrays now support using ALTREP when converting to R factors.
+* Math group generics are now implemented for ArrowDatum. This means you can use
+  base functions like `sqrt()`, `log()`, and `exp()` with Arrow arrays and scalars.
+* `read_*` and `write_*` functions now support R Connection objects for reading
+  and writing files.
+* Parquet improvements:
+  * Parquet writer now supports Duration type columns.
+  * The dataset Parquet reader now consumes less memory.
 * `median()` and `quantile()` will warn once about approximate calculations regardless of interactivity.
+* `Array$cast()` can now cast struct arrays into another struct type with the same field names
+  and structure (or a subset of fields) but different field types.
+* The CSV writer is now much faster when writing string columns.
 * Removed Solaris workarounds, libarrow is now required.

Review Comment:
   ```suggestion
   * Remove special handling for Solaris
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 
+is to define a customized conversion between an an Arrow Array and an R object 
+when the default conversion is slow or looses metadata important to the interpretation
+of values in the array. For most types, the built-in vctrs extension type is probably 
+sufficient. See description and an example with `?new_extension_type`.
+
+## Concatenation Support
+
+Arrow arrays and tables can now be easily concatenated:
+
+ * Arrays can now be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * Chunked arrays can now be concatenated with `c()`.

Review Comment:
   ```suggestion
    * ChunkedArrays can be concatenated with `c()`.
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.

Review Comment:
   ```suggestion
   * `write_csv_arrow()` can write a `Dataset` or an Arrow dplyr query to a single file.
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 
+is to define a customized conversion between an an Arrow Array and an R object 
+when the default conversion is slow or looses metadata important to the interpretation

Review Comment:
   ```suggestion
   when the default conversion is slow or loses metadata important to the interpretation
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.

Review Comment:
   ```suggestion
   * `map_batches()` correctly accepts `Dataset` objects.
   ```



##########
r/NEWS.md:
##########
@@ -19,19 +19,111 @@
 
 # arrow 7.0.0.9000
 
-* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
-* `lubridate`:
-  * component extraction functions: `tz()` (timezone), `semester()` (semester), `dst()` (daylight savings time indicator), `date()` (extract date), `epiyear()` (epiyear), improvements to `month()`, which now works with integer inputs.
-  * Added `make_date()` & `make_datetime()` + `ISOdatetime()` & `ISOdate()` to create date-times from numeric representations. 
-  * Added `decimal_date()` and `date_decimal()`
-  * Added `make_difftime()` (duration constructor)
-  * Added duration helper functions: `dyears()`, `dmonths()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`, `dmilliseconds()`, `dmicroseconds()`, `dnanoseconds()`.
-* date-time functionality:
-  * Added `as_date()` and `as_datetime()`
-  * Added `difftime` and `as.difftime()` 
-  * Added `as.Date()` to convert to date
+## Enhancements to dplyr and datasets
+
+* `open_dataset()`:
+  - now correctly supports the `skip` argument for skipping header rows in CSV datasets.
+  - now can take a list of datasets with differing schemas and attempt to unify the 
+    schemas to produce a `UnionDataset`.
+* Arrow `{dplyr}` queries:
+  - are now supported on `RecordBatchReader`. This allows results from DuckDB
+  to be streamed back into Arrow rather than materialized before continuing the pipeline.
+  - no longer need to materialize the entire result table before writing to a dataset
+    if the query contains contains aggregations or joins.
+  - now supports `dplyr::rename_with()`.
+  - `dplyr::count()` now returns an ungrouped dataframe.
+* `write_dataset` now has more options for controlling row group and file sizes when
+  writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, 
+  `min_rows_per_group`, and `max_rows_per_group`.
+* `write_csv_arrow` now accepts a `Dataset` or an Arrow dplyr query.
+* Joining one or more datasets while `option(use_threads = FALSE)` no longer
+  crashes R. That option is set by default on Windows.
+* `dplyr` joins now support the `suffix` argument to handle overlap in column names.
+* Filtering a Parquet dataset with `is.na()` no longer misses any rows.
+* `map_batches()` no longer errors if passed a `Dataset` object.
+
+## Enhancements to date and time support
+
+* `read_csv_arrow()`'s readr-style type `T` is now mapped to `timestamp(unit = "ns")` 
+  instead of `timestamp(unit = "s")`.
+* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
+  * New component extraction functions: 
+    * `lubridate::tz()` (timezone),
+    * `lubridate::semester()` (semester), 
+    * `lubridate::dst()` (daylight savings time indicator),
+    * `lubridate::date()` (extract date), 
+    * `lubridate::epiyear()` (epiyear),
+  * `lubridate::month()` now works with integer inputs.
+  * Added `lubridate::make_date()` & `lubridate::make_datetime()` + 
+    `lubridate::ISOdatetime()` & `lubridate::ISOdate()` to 
+    create date-times from numeric representations. 
+  * Added `lubridate::decimal_date()` and `lubridate::date_decimal()`
+  * Added `lubridate::make_difftime()` (duration constructor)
+  * Added `?lubridate::duration` helper functions, such as `dyears()`, `dhours()`, `dseconds()`.
+  * Added `lubridate::leap_year()`
+  * Added `lubridate::as_date()` and `lubridate::as_datetime()`
+* Also for Arrow dplyr queries, added support for base date and time functions:
+  * Added `base::difftime` and `base::as.difftime()` 
+  * Added `base::as.Date()` to convert to date
+  * Arrow timestamp and date arrays now support `base::format()`
+  * `strptime()` now returns `NA` instead of erroring in case of format mismatch,
+    just like `base::strptime()`.
+* Timezone operations are now supported on Windows if the 
+  [tzdb package](https://cran.r-project.org/web/packages/tzdb/index.html) is also
+  installed.
+
+## Extension Array Support
+
+Custom extension arrays can be created and registered, allowing other packages to
+define their own array types. Extension arrays wrap regular Arrow array types and
+provide customized behavior and/or storage. A common use-case for extension types 
+is to define a customized conversion between an an Arrow Array and an R object 
+when the default conversion is slow or looses metadata important to the interpretation
+of values in the array. For most types, the built-in vctrs extension type is probably 
+sufficient. See description and an example with `?new_extension_type`.
+
+## Concatenation Support
+
+Arrow arrays and tables can now be easily concatenated:
+
+ * Arrays can now be concatenated with `concat_arrays()` or, if zero-copy is desired
+   and chunking is acceptable, using `ChunkedArray$create()`.
+ * Chunked arrays can now be concatenated with `c()`.
+ * Record batches and tables now support `cbind()`.

Review Comment:
   ```suggestion
    * RecordBatches and Tables support `cbind()`.
   ```


