You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/13 15:02:54 UTC

[GitHub] [arrow] ianmcook opened a new pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

ianmcook opened a new pull request #10014:
URL: https://github.com/apache/arrow/pull/10014


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson closed pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

nealrichardson closed pull request #10014:
URL: https://github.com/apache/arrow/pull/10014


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612698178



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.

Review comment:
       I think the "Why" section I'm working on adding at the top of the README should answer that 👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#issuecomment-819169195


   > Thanks for doing this. Some initial thoughts. You're covering a lot of the "whats" here, and I encourage you to think next of the "whys". The README should persuade potential users that there's value in the package for them and that they should try it out. I see here that I can read a parquet file, or create a RecordBatch, or use dplyr on that, but why would I want to do that?
   
   I added a new section "What can `arrow` do?" at the top. This effectively answers the question "Why should I use this?" with "Because it can..." and a list of the major capabilities, each in just a few words.
   
   I moved all the details about the metadata and data objects to the "Using the Arrow C++ Library in R" where I think they fit better and are valuable as a reference for more advanced users. Here in the README, I retained only two short bullets describing what `Table` and `Dataset` are for, in terms that are less technical and more intended to explain what they do that R data frames don't.
   
   I expanded the dplyr example into a broader "Usage" section in which the dplyr content is one subsection.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612688353



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for

Review comment:
       Elsewhere in our docs, S3 is not set in monospace text, but `R6` is. Was that deliberate?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612700647



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with

Review comment:
       I think I'll use "until the result is required" instead of "until it is requested" if that's cool with you




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613589255



##########
File path: r/vignettes/arrow.Rmd
##########
@@ -72,6 +72,45 @@ to other applications and services that use Arrow. One example is Spark: the
 move data to and from Spark, yielding [significant performance
 gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
 
+# Object hierarchy
+
+## Metadata objects

Review comment:
       Since this has been moved from the README to a less conspicuous place in the "Using the Arrow C++ Library in R" vignette, I don't think this is as important now. I also believe there are some good reasons to keep it in this order, chiefly the fact that it avoids forward references: reading from top to bottom, you never come across something that has not already been defined above. IMO we should keep this as is for now and plan to refine it for the next release.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612575473



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:

Review comment:
       The benefit of keeping it in this order is that there are no forward references; reading from top to bottom, you never come across something that has not already been defined above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

jonkeane commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612685610



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.

Review comment:
       ```suggestion
   functions also accept R dataframes.
   ```

##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with

Review comment:
       ```suggestion
   The `arrow` package uses lazy evaluation to delay computation until it is requested. `result` is an object with
   ```

##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.

Review comment:
       Would it be too much of a distraction to have a brief aside about why one might want to do this here? 

##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with
+class `arrow_dplyr_query` which represents the computations to be
+performed.
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and obtain the result, call `compute()` or
+`collect()`. `compute()` returns an Arrow `Table`, suitable for passing
+to other `arrow` or `dplyr` functions.
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
+`collect()` returns an R data frame or tibble, suitable for viewing or
+passing to other R functions for analysis or visualization:
 
-### Full package validation
-
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
-```
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
+```
+
+Arrow supports most `dplyr` verbs except those that compute aggregates
+(such as `summarise()`, and `mutate()` after `group_by()`). Inside
+`dplyr` verbs, Arrow offers limited support for functions and operators,
+with broader support expected in upcoming releases. For more information
+about available compute functions, see `help("list_compute_functions")`.
+
+For `dplyr` queries on `Table` and `RecordBatch` objects, if the `arrow`
+package detects an unsupported function within a `dplyr` verb, it
+automatically calls `collect()` to return the data as an R data frame
+before processing that `dplyr` verb.

Review comment:
       Should we add something here about what happens on unsupported calls with datasets? 

##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.

Review comment:
       I wonder if this would be better below the `dplyr` block below? It might be the GH diff UI, but I thought this was semi-distracting from the example we're working with so far here

##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with
+class `arrow_dplyr_query` which represents the computations to be
+performed.
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and obtain the result, call `compute()` or
+`collect()`. `compute()` returns an Arrow `Table`, suitable for passing
+to other `arrow` or `dplyr` functions.
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
+`collect()` returns an R data frame or tibble, suitable for viewing or
+passing to other R functions for analysis or visualization:
 
-### Full package validation
-
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
-```
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
+```
+
+Arrow supports most `dplyr` verbs except those that compute aggregates
+(such as `summarise()`, and `mutate()` after `group_by()`). Inside
+`dplyr` verbs, Arrow offers limited support for functions and operators,
+with broader support expected in upcoming releases. For more information
+about available compute functions, see `help("list_compute_functions")`.
+
+For `dplyr` queries on `Table` and `RecordBatch` objects, if the `arrow`
+package detects an unsupported function within a `dplyr` verb, it
+automatically calls `collect()` to return the data as an R data frame
+before processing that `dplyr` verb.
+
+## Getting help
+
+If you encounter a bug, please file an issue with a minimal reproducible
+example on [the Apache Jira issue
+tracker](https://issues.apache.org/jira/). Choose the project **Apache
+Arrow (ARROW)**, select the component **R**, and begin the issue summary
+with **\[R\]** followed by a space. For more information, see the
+**Report bugs and propose features** section of the [Contributing to
+Apache
+Arrow](https://arrow.apache.org/docs/developers/contributing.html) page
+in the Arrow developer documentation.
+
+We welcome questions and discussion about the `arrow` package. For

Review comment:
       ```suggestion
   We welcome questions and discussion about and contributions to the `arrow` package. For
   ```
   
   (or something similar)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613526136



##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files

Review comment:
       So what? Isn't it pretty basic and universal to be able to read and write files?
   
   I know you're trying to be concise, but maybe it would be good to note reasons you'd want to use `arrow` to read your files, such as (1) supports different file formats; (2) fast; (3) can read and write to cloud storage.
   
   

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```
 
-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or multiple files with Arrow
+`Dataset` objects, `arrow` provides the functions `open_dataset()` and
+`write_dataset()`. For examples of these, see

Review comment:
       Writing datasets is not just about the inverse of reading, it's about partitioning, which lets you speed up your queries significantly

##########
File path: r/vignettes/arrow.Rmd
##########
@@ -72,6 +72,45 @@ to other applications and services that use Arrow. One example is Spark: the
 move data to and from Spark, yielding [significant performance
 gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
 
+# Object hierarchy
+
+## Metadata objects

Review comment:
       As I suggested before, I'd swap the order of data and metadata subsections, lead with a mapping of R vector/data.frame to Arrow versions, and then in the metadata section, invert the order of the table rows.

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```
 
-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or multiple files with Arrow
+`Dataset` objects, `arrow` provides the functions `open_dataset()` and
+`write_dataset()`. For examples of these, see
+`vignette("dataset", package = "arrow")`.
 
-    ./lint.sh
+All these functions can read and write files in the local filesystem (by
+passing unqualified paths or `file://` URIs) or files in Amazon S3 (by
+passing S3 URIs beginning with `s3://`). For more details, see
+`vignette("fs", package = "arrow")`
 
-Fix any style issues before committing with
+### Using `dplyr` with `arrow`
 
-    ./lint.sh --fix
+The `arrow` package provides a `dplyr` backend enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To use it, first load both
+packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
+`Dataset` object. For example, read the Parquet file written in the
+previous example into an Arrow `Table` named `sw`:
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+``` r
+sw <- read_parquet(file_path, as_data_frame = FALSE)
+```
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation to delay computation until the
+result is required. `result` is an object with class `arrow_dplyr_query`
+which represents the computations to be performed:
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and materialize the result, call
+`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
+suitable for passing to other `arrow` or `dplyr` functions:
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
-
-### Full package validation
+`collect()` returns an R `data.frame`, suitable for viewing or passing
+to other R functions for analysis or visualization:
 
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
 ```
+
+The `arrow` package works with most `dplyr` verbs except those that

Review comment:
       This is the part where I think that a full dplyr vignette would help, so you could get into these details and provide more examples. I think the support we have is greater than this paragraph implies (not that the words are wrong, it just focuses on what "limited" support there is rather than enumerating all of the things that are supported).
   
   The NEWS.md now has some more discussion that you could include here. 

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```
 
-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or multiple files with Arrow
+`Dataset` objects, `arrow` provides the functions `open_dataset()` and
+`write_dataset()`. For examples of these, see
+`vignette("dataset", package = "arrow")`.
 
-    ./lint.sh
+All these functions can read and write files in the local filesystem (by
+passing unqualified paths or `file://` URIs) or files in Amazon S3 (by
+passing S3 URIs beginning with `s3://`). For more details, see
+`vignette("fs", package = "arrow")`
 
-Fix any style issues before committing with
+### Using `dplyr` with `arrow`
 
-    ./lint.sh --fix
+The `arrow` package provides a `dplyr` backend enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To use it, first load both
+packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
+`Dataset` object. For example, read the Parquet file written in the
+previous example into an Arrow `Table` named `sw`:
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+``` r
+sw <- read_parquet(file_path, as_data_frame = FALSE)
+```
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation to delay computation until the
+result is required. `result` is an object with class `arrow_dplyr_query`
+which represents the computations to be performed:
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and materialize the result, call
+`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
+suitable for passing to other `arrow` or `dplyr` functions:
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
-
-### Full package validation
+`collect()` returns an R `data.frame`, suitable for viewing or passing
+to other R functions for analysis or visualization:
 
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
 ```
+
+The `arrow` package works with most `dplyr` verbs except those that
+compute aggregates (such as `summarise()`, and `mutate()` after
+`group_by()`). Inside `dplyr` verbs, Arrow offers limited support for
+functions and operators, with broader support expected in upcoming
+releases. For more information about available compute functions, see
+`help("list_compute_functions")`.

Review comment:
       I don't think we want to link to this on the README. This should be deep in the (hypothetical) vignette. I would expect that most users won't call the arrow_namespaced compute functions but will just use the base/stringr/dplyr-mapped versions. Unfortunately there's not a good place to list those (though you could envision adding a function that could list them), which is why I was thinking a vignette would be necessary to document them.

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?

Review comment:
       ```suggestion
   What can the `arrow` package do?
   ```

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```

Review comment:
       It would also be good here to mention (and link to the "arrow" vignette) that R attributes are preserved round-trip, so you can save things like `sf` data frames to Parquet. That seems to be a use case of interest.

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```
 
-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or multiple files with Arrow
+`Dataset` objects, `arrow` provides the functions `open_dataset()` and
+`write_dataset()`. For examples of these, see
+`vignette("dataset", package = "arrow")`.
 
-    ./lint.sh
+All these functions can read and write files in the local filesystem (by
+passing unqualified paths or `file://` URIs) or files in Amazon S3 (by

Review comment:
       I wouldn't mention `file://` URIs, I think it's confusing

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the

Review comment:
       Why would I want to do this? Perhaps here is where you put the note/link about the Parquet format.

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```
 
-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or multiple files with Arrow
+`Dataset` objects, `arrow` provides the functions `open_dataset()` and

Review comment:
       I would mention here that this does not require reading all of the data into memory, so you can filter and select data for analysis efficiently

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```
 
-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or multiple files with Arrow
+`Dataset` objects, `arrow` provides the functions `open_dataset()` and
+`write_dataset()`. For examples of these, see
+`vignette("dataset", package = "arrow")`.
 
-    ./lint.sh
+All these functions can read and write files in the local filesystem (by
+passing unqualified paths or `file://` URIs) or files in Amazon S3 (by
+passing S3 URIs beginning with `s3://`). For more details, see
+`vignette("fs", package = "arrow")`
 
-Fix any style issues before committing with
+### Using `dplyr` with `arrow`
 
-    ./lint.sh --fix
+The `arrow` package provides a `dplyr` backend enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To use it, first load both
+packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
+`Dataset` object. For example, read the Parquet file written in the
+previous example into an Arrow `Table` named `sw`:
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+``` r
+sw <- read_parquet(file_path, as_data_frame = FALSE)
+```
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation to delay computation until the

Review comment:
       Why is delaying computation useful?

##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Analyze, process, and write **multi-file, larger-than-memory
+    datasets** (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Manipulate and analyze Arrow data with **`dplyr` verbs**
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **fine control over column types** for seamless
+    interoperability with databases and data warehouse systems
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Enables **zero-copy data sharing** between **R and Python**
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
+### Installing the latest release version
+
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
-## Installing a development version
+### Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Conda users can install `arrow` nightly builds with
 
-```r
-arrow::install_arrow(nightly = TRUE)
-```
-
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
-
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+## Usage
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
-
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs

Review comment:
       ```suggestion
   -   Manipulating bigger-than-memory data with `dplyr` verbs
   ```

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -277,6 +283,8 @@ avoid scanning because they have no rows where `total_amount > 100`.
 
 ## More dataset options
 
+<!-- ADD SOMETHING HERE ABOUT HOW YOU CAN PASS A SINGLE FILE PATH OR VECTOR OF FILE PATHS TO open_dataset() -->

Review comment:
       Yes, particularly as it relates to write_dataset(). The key point there is that you could have a CSV file that is too big to read into memory, and you could partition it into a bunch of parquet files that are more manageable in size without having to find some way to first read the file into R.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

jonkeane commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612734169



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with

Review comment:
       Yeah, that works — I had that at first but then went down a path of "but what does 'required' mean and how do we operationalize it?" but I think that's because I have a developer's perspective




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613541574



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -322,9 +330,9 @@ by calling `write_dataset()` on it:
 write_dataset(ds, "nyc-taxi/feather", format = "feather")
 ```
 
-Next, let's imagine that the "payment_type" column is something we often filter on,
+Next, let's imagine that the `payment_type` column is something we often filter on,
 so we want to partition the data by that variable. By doing so we ensure that a filter like
-`payment_type == 3` will touch only a subset of files where payment_type is always 3.
+`payment_type == 3` will touch only a subset of files where `payment_type `is always 3.

Review comment:
       I fixed this (sort of) in 88edc2d. It's an imperfect fix but I think it's about the best we can do without making major changes




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613599475



##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability

Review comment:
       I added some more detail on Parquet further down. I think for now it would be best not to add more links or detail in this section.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612587763



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+| Dim | Class          | Description                               | How to create an instance                          |
+|-----|----------------|-------------------------------------------|----------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                       |
+| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                       |
+| 1   | `ChunkedArray` | list of `Array`s with the same `DataType` | `chunked_array(..., type)`                         |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `record_batch(...)`                                |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)` or `arrow::read_*()` functions |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema`   | see `vignette("dataset", package = "arrow")`       |
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These classes exist in the `arrow` R package and correspond to classes
+of the same names in the Arrow C++ library. For convenience, the `arrow`
+package also defines several synthetic classes that do not exist in the
+C++ library, including:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+-   `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+-   `ArrowTabular`: inherited by `RecordBatch` and `Table`
+-   `ArrowObject`: inherited by all Arrow objects
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+These are all defined as `R6` classes. The `arrow` package provides a
+variety of `R6` and S3 methods for interacting with instances of these
+classes.
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+## Reading and writing data files with Arrow
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
-
-Other flags that may be useful:
-
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_feather()` and `write_parquet()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To start, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the creation
+or file loading functions listed above. For example, create a `Table`
+named `sw` with the Star Wars characters data frame that’s included in
+`dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+data_file <- tempfile()
+write_parquet(starwars, data_file)
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with
+class `arrow_dplyr_query` which represents the computations to be
+performed.
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and obtain the result, call `compute()` or
+`collect()`. `compute()` returns an Arrow `Table`, suitable for passing
+to other `arrow` or `dplyr` functions.
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
-
-### Full package validation
+`collect()` returns an R data frame or tibble, suitable for viewing or
+passing to other R functions for analysis or visualization:
 
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
-```
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
+```
+
+Arrow supports most dplyr verbs except those that compute aggregates
+(such as `summarise()` and `mutate()` after `group_by()`). Inside dplyr
+verbs, Arrow offers limited support for functions and operators, with
+broader support expected in upcoming releases. For more information
+about available compute functions, see `help("list_compute_functions")`.
+
+For dplyr queries on `Table` and `RecordBatch` objects, if the `arrow` R
+package detects an unsupported function within a dplyr verb, it
+automatically calls `collect()` to return the data as an R data frame
+before processing that dplyr verb.
+
+## Getting help
+
+If you encounter a bug, please file an issue with a minimal reproducible
+example on [the Apache Jira issue
+tracker](https://issues.apache.org/jira/). Choose the project **Apache

Review comment:
       I considered this and tried first to use one of these URLs:
   - https://issues.apache.org/jira/projects/ARROW/issues/
   - https://issues.apache.org/jira/projects/ARROW/summary
   but I tried opening these in incognito windows to see how users would see them if they're not already signed into Jira, and they were all quite confusing.
   
   I also tried constructing a direct link to the Create Issue page by using URL parameters after https://issues.apache.org/jira/secure/CreateIssue.jspa but I couldn't get that working either




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613450991



##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability

Review comment:
       Agreed—I cut the words here down to the bone to keep each bullet point on one line when rendered in GitHub




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612893805



##########
File path: r/README.md
##########
@@ -11,243 +11,273 @@ data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
+access to the Arrow C++ library API as well as higher-level access
+through a `dplyr` backend and many R-flavored functions.
+
+## What can `arrow` do?
+
+-   Read and write Parquet files (`read_parquet()`, `write_parquet()`)
+-   Read and write Feather files (`read_feather()`, `write_feather()`)
+-   Read or write large, multi-file datasets with a single function call
+    (`open_dataset()`, `write_dataset()`)
+-   Read very large CSV and JSON files with excellent speed and
+    efficiency (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in Amazon S3 buckets
+-   Exercise full control over the data types of columns when reading
+    and writing data files
+-   Use compression codecs including Snappy, gzip, Brotli, Zstandard,
+    LZ4, LZO, and bzip2 when reading and writing data
+-   Manipulate and analyze larger-than-memory datasets with `dplyr`
+    verbs
+-   Pass data between R and Python in the same process
+-   Connect to Arrow Flight RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through low-level bindings to
+    the C++ library
+-   Provide a toolkit for building connectors to other applications and
+    services that use Arrow
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
-
-```r
-arrow::install_arrow(nightly = TRUE)
-```
+Conda users can install `arrow` nightly builds with
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
-
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+## Usage
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to read larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with Arrow
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```
 
-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or multiple files with Arrow
+`Dataset` objects, `arrow` provides the functions `open_dataset()` and
+`write_dataset()`. For examples of these, see
+`vignette("dataset", package = "arrow")`.
 
-    ./lint.sh
+All these functions can read and write files in the local filesystem (by
+passing unqualified paths or `file://` URIs) or files in Amazon S3 (by
+passing S3 URIs beginning with `s3://`). For more details, see
+`vignette("fs", package = "arrow")`
 
-Fix any style issues before committing with
+### Using dplyr with Arrow

Review comment:
       We discussed the possibility of adding a separate new vignette about using arrow with dplyr. When I started to do that, I found that most of what needed to be said was not about the dplyr backend functionality but about the Arrow objects that you call them on and the Arrow objects they return. So the dplyr vignette was really turning out to be less than half about dplyr, and mostly about the data reading and writing functions and about `Table` and `Dataset` objects. Since that content belongs here in the README, I think it makes sense to slip in the dplyr content here too, at least for now, especially since the new dplyr capabilities are a marquee feature for the release. But if there are strong feelings that the README is not the right place for these 55 lines of dplyr content, I'm happy to discuss alternatives.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

jonkeane commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613498152



##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with
+    **`dplyr` verbs**
+-   Pass data between **R and Python** in the same process
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
-
-```r
-arrow::install_arrow(nightly = TRUE)
-```
+Conda users can install `arrow` nightly builds with
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
-
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+## Usage
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```
 
-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or multiple files with Arrow
+`Dataset` objects, `arrow` provides the functions `open_dataset()` and
+`write_dataset()`. For examples of these, see
+`vignette("dataset", package = "arrow")`.
 
-    ./lint.sh
+All these functions can read and write files in the local filesystem (by
+passing unqualified paths or `file://` URIs) or files in Amazon S3 (by
+passing S3 URIs beginning with `s3://`). For more details, see
+`vignette("fs", package = "arrow")`
 
-Fix any style issues before committing with
+### Using `dplyr` with `arrow`
 
-    ./lint.sh --fix
+The `arrow` package provides a `dplyr` backend enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To use it, first load both
+packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
+`Dataset` object. For example, read the Parquet file written in the
+previous example into an Arrow `Table` named `sw`:
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+``` r
+sw <- read_parquet(file_path, as_data_frame = FALSE)
+```
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation to delay computation until the
+result is required. `result` is an object with class `arrow_dplyr_query`
+which represents the computations to be performed:
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and materialize the result, call
+`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
+suitable for passing to other `arrow` or `dplyr` functions:
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
-
-### Full package validation
+`collect()` returns an R `data.frame`, suitable for viewing or passing
+to other R functions for analysis or visualization:
 
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
 ```
+
+The `arrow` package works with most `dplyr` verbs except those that
+compute aggregates (such as `summarise()`, and `mutate()` after
+`group_by()`). Inside `dplyr` verbs, Arrow offers limited support for
+functions and operators, with broader support expected in upcoming
+releases. For more information about available compute functions, see
+`help("list_compute_functions")`.
+
+For `dplyr` queries on `Table` objects, if the `arrow` package detects
+an unimplemented function within a `dplyr` verb, it automatically calls
+`collect()` to return the data as an R `data.frame` before processing
+that `dplyr` verb. For queries on `Dataset` objects (which can be larger
+than memory), it raises an error if the function is unimplemented.
+
+### Other uses
+
+Other uses of `arrow` are described in the following vignettes:
+
+-   `vignette("python", package = "arrow")`: use `arrow` and
+    `reticulate` to pass data between R and Python
+-   `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC
+    servers to send and receive data
+-   `vignette("arrow", package = "arrow")`: access and manipulate Arrow
+    objects through low-level bindings to the C++ library
+
+## Getting help
+
+If you encounter a bug, please file an issue with a minimal reproducible
+example on the [Apache Jira issue
+tracker](https://issues.apache.org/jira/projects/ARROW/issues). Create
+an account or log in, then click **Create** to file an issue. Select the
+project **Apache Arrow (ARROW)**, select the component **R**, and begin
+the issue summary with **\[R\]** followed by a space. For more

Review comment:
       Yeah, I tried the "obvious" `\\[` `\\\[` and nothing worked easily without monospacing it




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613420800



##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and

Review comment:
       we could link to our blog post on CSV reading

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function

Review comment:
       I'm not sure "open" is meaningful here, nor single function call. I think the point is that you can treat a directory of many files as a single dataset. 
   
   For writing, the point is that you can split your data into multiple files in ways that improve the speed with which you can query.

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability

Review comment:
       One could argue that Parquet is also fast and interoperable. Perhaps have a look at the Arrow website FAQ (and maybe link to it here with "When should I use Parquet vs. Feather?" or something?)

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with
+    **`dplyr` verbs**
+-   Pass data between **R and Python** in the same process

Review comment:
       Might want to say "zero-copy" here somewhere

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with

Review comment:
       I'd fold this into the earlier statement about datasets

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading

Review comment:
       This one isn't clear to me

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function

Review comment:
       This could also link to the dataset vignette.

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with
+    **`dplyr` verbs**
+-   Pass data between **R and Python** in the same process
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version

Review comment:
       Maybe delete this heading?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613479637



##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with
+    **`dplyr` verbs**
+-   Pass data between **R and Python** in the same process
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version

Review comment:
       I think it's worthwhile to set this apart from the release install info. I'll try changing this and the other to subheadings under an "Installation" heading




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612587763



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+| Dim | Class          | Description                               | How to create an instance                          |
+|-----|----------------|-------------------------------------------|----------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                       |
+| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                       |
+| 1   | `ChunkedArray` | list of `Array`s with the same `DataType` | `chunked_array(..., type)`                         |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `record_batch(...)`                                |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)` or `arrow::read_*()` functions |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema`   | see `vignette("dataset", package = "arrow")`       |
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These classes exist in the `arrow` R package and correspond to classes
+of the same names in the Arrow C++ library. For convenience, the `arrow`
+package also defines several synthetic classes that do not exist in the
+C++ library, including:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+-   `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+-   `ArrowTabular`: inherited by `RecordBatch` and `Table`
+-   `ArrowObject`: inherited by all Arrow objects
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+These are all defined as `R6` classes. The `arrow` package provides a
+variety of `R6` and S3 methods for interacting with instances of these
+classes.
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+## Reading and writing data files with Arrow
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
-
-Other flags that may be useful:
-
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_feather()` and `write_parquet()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To start, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the creation
+or file loading functions listed above. For example, create a `Table`
+named `sw` with the Star Wars characters data frame that’s included in
+`dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+data_file <- tempfile()
+write_parquet(starwars, data_file)
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with
+class `arrow_dplyr_query` which represents the computations to be
+performed.
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and obtain the result, call `compute()` or
+`collect()`. `compute()` returns an Arrow `Table`, suitable for passing
+to other `arrow` or `dplyr` functions.
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
-
-### Full package validation
+`collect()` returns an R data frame or tibble, suitable for viewing or
+passing to other R functions for analysis or visualization:
 
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
-```
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
+```
+
+Arrow supports most dplyr verbs except those that compute aggregates
+(such as `summarise()` and `mutate()` after `group_by()`). Inside dplyr
+verbs, Arrow offers limited support for functions and operators, with
+broader support expected in upcoming releases. For more information
+about available compute functions, see `help("list_compute_functions")`.
+
+For dplyr queries on `Table` and `RecordBatch` objects, if the `arrow` R
+package detects an unsupported function within a dplyr verb, it
+automatically calls `collect()` to return the data as an R data frame
+before processing that dplyr verb.
+
+## Getting help
+
+If you encounter a bug, please file an issue with a minimal reproducible
+example on [the Apache Jira issue
+tracker](https://issues.apache.org/jira/). Choose the project **Apache

Review comment:
       I considered this and tried first to use one of these URLs:
   - https://issues.apache.org/jira/projects/ARROW/issues/ (that's the one the contributing page links to, after a redirection)
   - https://issues.apache.org/jira/projects/ARROW/summary
   
   but I tried opening these in incognito windows to see how users would see them if they're not already signed into Jira, and they were all quite confusing.
   
   I also tried constructing a direct link to the Create Issue page by using URL parameters after https://issues.apache.org/jira/secure/CreateIssue.jspa but I couldn't get that working either




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

jonkeane commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613502676



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -322,9 +330,9 @@ by calling `write_dataset()` on it:
 write_dataset(ds, "nyc-taxi/feather", format = "feather")
 ```
 
-Next, let's imagine that the "payment_type" column is something we often filter on,
+Next, let's imagine that the `payment_type` column is something we often filter on,
 so we want to partition the data by that variable. By doing so we ensure that a filter like
-`payment_type == 3` will touch only a subset of files where payment_type is always 3.
+`payment_type == 3` will touch only a subset of files where `payment_type `is always 3.

Review comment:
       You didn't touch the example below that does 
   
   ```
   ds %>%
     filter(payment_type == 3) %>%
     write_dataset("nyc-taxi/feather", format = "feather")
   ```
   
   But that is no longer valid since we don't autocast numerics to strings anymore. The `group_by`/`write_dataset` will all work, but the query with a filter must have the quotes around it now




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#issuecomment-819842109


   > I’ve read through the readme + dataset vignettes and these are both great and improvements over what we had. I'm torn about this next suggestion so feel free to ignore it, but would it be helpful (or more distracting?) to have one-line examples in that first group of bullets in the readme to show how easy each (or most) of those actions are? They would be kind of floating without much context, but it might help get people in and excited to see that it really is just `read_parquet("big_file.parquet")` to get started with parquets.
   
   I think for this version, we should keep the bullets very high-level and easy for folks to quickly read through. Let's table this for later consideration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612601648



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with

Review comment:
       To clarify: do you mean I should move
   ``` shell
   conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
   ```
   to above
   ``` r
   arrow::install_arrow(nightly = TRUE)
   ```
   ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612709666



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+| Dim | Class          | Description                               | How to create an instance                          |
+|-----|----------------|-------------------------------------------|----------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                       |
+| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                       |
+| 1   | `ChunkedArray` | list of `Array`s with the same `DataType` | `chunked_array(..., type)`                         |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `record_batch(...)`                                |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)` or `arrow::read_*()` functions |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema`   | see `vignette("dataset", package = "arrow")`       |
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These classes exist in the `arrow` R package and correspond to classes
+of the same names in the Arrow C++ library. For convenience, the `arrow`
+package also defines several synthetic classes that do not exist in the
+C++ library, including:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+-   `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+-   `ArrowTabular`: inherited by `RecordBatch` and `Table`
+-   `ArrowObject`: inherited by all Arrow objects
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+These are all defined as `R6` classes. The `arrow` package provides a
+variety of `R6` and S3 methods for interacting with instances of these
+classes.
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+## Reading and writing data files with Arrow
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
-
-Other flags that may be useful:
-
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_feather()` and `write_parquet()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To start, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the creation
+or file loading functions listed above. For example, create a `Table`
+named `sw` with the Star Wars characters data frame that’s included in
+`dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+data_file <- tempfile()
+write_parquet(starwars, data_file)
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with
+class `arrow_dplyr_query` which represents the computations to be
+performed.
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and obtain the result, call `compute()` or
+`collect()`. `compute()` returns an Arrow `Table`, suitable for passing
+to other `arrow` or `dplyr` functions.
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
-
-### Full package validation
+`collect()` returns an R data frame or tibble, suitable for viewing or
+passing to other R functions for analysis or visualization:
 
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
-```
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
+```
+
+Arrow supports most dplyr verbs except those that compute aggregates
+(such as `summarise()` and `mutate()` after `group_by()`). Inside dplyr
+verbs, Arrow offers limited support for functions and operators, with
+broader support expected in upcoming releases. For more information
+about available compute functions, see `help("list_compute_functions")`.
+
+For dplyr queries on `Table` and `RecordBatch` objects, if the `arrow` R
+package detects an unsupported function within a dplyr verb, it
+automatically calls `collect()` to return the data as an R data frame
+before processing that dplyr verb.
+
+## Getting help
+
+If you encounter a bug, please file an issue with a minimal reproducible
+example on [the Apache Jira issue
+tracker](https://issues.apache.org/jira/). Choose the project **Apache

Review comment:
       I guess I'll just switch this to using https://issues.apache.org/jira/projects/ARROW/issues because that's used everywhere else, but oof it's a bad experience for folks who aren't familiar with Jira




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612697191



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.

Review comment:
       In R, it should be either "data frame" or "data.frame"—don't get me started on how it's totally different if we're talking about pandas or Spark...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613597360



##########
File path: r/vignettes/arrow.Rmd
##########
@@ -72,6 +72,45 @@ to other applications and services that use Arrow. One example is Spark: the
 move data to and from Spark, yielding [significant performance
 gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
 
+# Object hierarchy
+
+## Metadata objects

Review comment:
       Re: forward references, I disagree that this is a virtue, I think it's more readable to start with the thing you care about and recurse down into the details as needed. But agreed that we don't have to resolve this today.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612610097



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with

Review comment:
       correct. install_arrow is a convenience for both the CRAN and conda methods of installation




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

jonkeane commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612682874



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for

Review comment:
       Very minor
    
   ```suggestion
   `arrow` package provides a variety of `R6` and `S3` methods for
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613561393



##########
File path: r/README.md
##########
@@ -4,250 +4,284 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?

Review comment:
       I think it looks much better with this set as a heading




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612699400



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.

Review comment:
       I think it's worth having here (or possibly in both places) but I'll add an "Or, " at the beginning to make it clearer that it's just another data loading option like the other two listed immediately above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612587763



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+| Dim | Class          | Description                               | How to create an instance                          |
+|-----|----------------|-------------------------------------------|----------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                       |
+| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                       |
+| 1   | `ChunkedArray` | list of `Array`s with the same `DataType` | `chunked_array(..., type)`                         |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `record_batch(...)`                                |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)` or `arrow::read_*()` functions |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema`   | see `vignette("dataset", package = "arrow")`       |
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These classes exist in the `arrow` R package and correspond to classes
+of the same names in the Arrow C++ library. For convenience, the `arrow`
+package also defines several synthetic classes that do not exist in the
+C++ library, including:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+-   `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+-   `ArrowTabular`: inherited by `RecordBatch` and `Table`
+-   `ArrowObject`: inherited by all Arrow objects
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+These are all defined as `R6` classes. The `arrow` package provides a
+variety of `R6` and S3 methods for interacting with instances of these
+classes.
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+## Reading and writing data files with Arrow
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
-
-Other flags that may be useful:
-
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_feather()` and `write_parquet()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To start, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the creation
+or file loading functions listed above. For example, create a `Table`
+named `sw` with the Star Wars characters data frame that’s included in
+`dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+data_file <- tempfile()
+write_parquet(starwars, data_file)
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with
+class `arrow_dplyr_query` which represents the computations to be
+performed.
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and obtain the result, call `compute()` or
+`collect()`. `compute()` returns an Arrow `Table`, suitable for passing
+to other `arrow` or `dplyr` functions.
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
-
-### Full package validation
+`collect()` returns an R data frame or tibble, suitable for viewing or
+passing to other R functions for analysis or visualization:
 
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
-```
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
+```
+
+Arrow supports most dplyr verbs except those that compute aggregates
+(such as `summarise()` and `mutate()` after `group_by()`). Inside dplyr
+verbs, Arrow offers limited support for functions and operators, with
+broader support expected in upcoming releases. For more information
+about available compute functions, see `help("list_compute_functions")`.
+
+For dplyr queries on `Table` and `RecordBatch` objects, if the `arrow` R
+package detects an unsupported function within a dplyr verb, it
+automatically calls `collect()` to return the data as an R data frame
+before processing that dplyr verb.
+
+## Getting help
+
+If you encounter a bug, please file an issue with a minimal reproducible
+example on [the Apache Jira issue
+tracker](https://issues.apache.org/jira/). Choose the project **Apache

Review comment:
       I considered this and tried first to use one of these URLs:
   - https://issues.apache.org/jira/projects/ARROW/issues/
   - https://issues.apache.org/jira/projects/ARROW/summary
   
   but I tried opening these in incognito windows to see how users would see them if they're not already signed into Jira, and they were all quite confusing.
   
   I also tried constructing a direct link to the Create Issue page by using URL parameters after https://issues.apache.org/jira/secure/CreateIssue.jspa but I couldn't get that working either




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

jonkeane commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612733595



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.

Review comment:
       I would go for `data.frame` personally, "data frame" looks super weird and 1990s to me!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613496829



##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with
+    **`dplyr` verbs**
+-   Pass data between **R and Python** in the same process
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
-
-```r
-arrow::install_arrow(nightly = TRUE)
-```
+Conda users can install `arrow` nightly builds with
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
+``` shell
 conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+If you already have a version of `arrow` installed, you can switch to
+the latest nightly development version with
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+``` r
+arrow::install_arrow(nightly = TRUE)
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
-
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
-
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+## Usage
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Among the many uses of the `arrow` package, two of the most accessible
+uses are:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+-   Reading and writing data files
+-   Manipulating Arrow data with `dplyr` verbs
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+The sections below describe these two uses and illustrate them with
+basic examples. The sections below mention two Arrow data structures:
 
-Other flags that may be useful:
+-   `Table`: a tabular, column-oriented data structure capable of
+    storing and processing large amounts of data more efficiently than
+    R’s built-in `data.frame` and with SQL-like column data types that
+    afford better interoperability with databases and data warehouse
+    systems
+-   `Dataset`: a data structure functionally similar to `Table` but with
+    the capability to work on larger-than-memory data partitioned across
+    multiple files
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+### Reading and writing data files with `arrow`
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R `data.frame`. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-``` shell
-cd ../../r
+For writing data to single files, the `arrow` package provides the
+functions `write_parquet()` and `write_feather()`. These can be used
+with R `data.frame` and Arrow `Table` objects.
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+For example, let’s write the Star Wars characters data that’s included
+in `dplyr` to a Parquet file, then read it back in. First load the
+`arrow` and `dplyr` packages:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then write the `data.frame` named `starwars` to a Parquet file at
+`file_path`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+file_path <- tempfile()
+write_parquet(starwars, file_path)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Then read the Parquet file into an R `data.frame` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
+``` r
+sw <- read_parquet(file_path)
+```
 
-We use Google C++ style in our C++ code. Check for style errors with
+For reading and writing larger files or multiple files with Arrow
+`Dataset` objects, `arrow` provides the functions `open_dataset()` and
+`write_dataset()`. For examples of these, see
+`vignette("dataset", package = "arrow")`.
 
-    ./lint.sh
+All these functions can read and write files in the local filesystem (by
+passing unqualified paths or `file://` URIs) or files in Amazon S3 (by
+passing S3 URIs beginning with `s3://`). For more details, see
+`vignette("fs", package = "arrow")`
 
-Fix any style issues before committing with
+### Using `dplyr` with `arrow`
 
-    ./lint.sh --fix
+The `arrow` package provides a `dplyr` backend enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To use it, first load both
+packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
+`Dataset` object. For example, read the Parquet file written in the
+previous example into an Arrow `Table` named `sw`:
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+``` r
+sw <- read_parquet(file_path, as_data_frame = FALSE)
+```
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation to delay computation until the
+result is required. `result` is an object with class `arrow_dplyr_query`
+which represents the computations to be performed:
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and materialize the result, call
+`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
+suitable for passing to other `arrow` or `dplyr` functions:
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
-
-### Full package validation
+`collect()` returns an R `data.frame`, suitable for viewing or passing
+to other R functions for analysis or visualization:
 
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
 ```
+
+The `arrow` package works with most `dplyr` verbs except those that
+compute aggregates (such as `summarise()`, and `mutate()` after
+`group_by()`). Inside `dplyr` verbs, Arrow offers limited support for
+functions and operators, with broader support expected in upcoming
+releases. For more information about available compute functions, see
+`help("list_compute_functions")`.
+
+For `dplyr` queries on `Table` objects, if the `arrow` package detects
+an unimplemented function within a `dplyr` verb, it automatically calls
+`collect()` to return the data as an R `data.frame` before processing
+that `dplyr` verb. For queries on `Dataset` objects (which can be larger
+than memory), it raises an error if the function is unimplemented.
+
+### Other uses
+
+Other uses of `arrow` are described in the following vignettes:
+
+-   `vignette("python", package = "arrow")`: use `arrow` and
+    `reticulate` to pass data between R and Python
+-   `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC
+    servers to send and receive data
+-   `vignette("arrow", package = "arrow")`: access and manipulate Arrow
+    objects through low-level bindings to the C++ library
+
+## Getting help
+
+If you encounter a bug, please file an issue with a minimal reproducible
+example on the [Apache Jira issue
+tracker](https://issues.apache.org/jira/projects/ARROW/issues). Create
+an account or log in, then click **Create** to file an issue. Select the
+project **Apache Arrow (ARROW)**, select the component **R**, and begin
+the issue summary with **\[R\]** followed by a space. For more

Review comment:
       Huh, weird, thanks. Easy dodge: I'll just make it `monospace` so no escapes are needed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

jonkeane commented on pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#issuecomment-819752347


   I’ve read through the readme + dataset vignettes and these are both great and improvements over what we had. I'm torn about this next suggestion so feel free to ignore it, but would it be helpful (or more distracting?) to have one-line examples in that first group of bullets in the readme to show how easy each (or most) of those actions are? They would be kind of floating without much context, but it might help get people in and excited to see that it really is just `read_parquet("big_file.parquet")` to get started with parquets. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612885750



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:

Review comment:
       I moved the object hierarchy content to the "Using the Arrow C++ Library in R" vignette and added just two bullets here describing `Table` and `Dataset` which for many purposes will be the only two data structures users really need to know about.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#issuecomment-818812764


   You can see how this renders by looking at it in my fork at https://github.com/ianmcook/arrow/tree/ARROW-11477/r#readme


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612700647



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To begin, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the object
+creation or file loading functions listed above. For example, create a
+`Table` named `sw` with the Star Wars characters data frame that’s
+included in `dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
-
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+write_parquet(starwars, data_file <- tempfile()) # write file to demonstrate reading it
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with

Review comment:
       I think I'll use "until the result is required" instead of "until if that is requested" if that's cool with you




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613305793



##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and
+    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
+-   Read and write files in **Amazon S3** buckets with no additional
+    function calls
+-   Exercise **full control over data types** of columns when reading
+    and writing data files
+-   Use **compression codecs** including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
+-   Manipulate and analyze **larger-than-memory datasets** with
+    **`dplyr` verbs**
+-   Pass data between **R and Python** in the same process
+-   Connect to **Arrow Flight** RPC servers to send and receive large
+    datasets over networks
+-   Access and manipulate Arrow objects through **low-level bindings**
+    to the C++ library
+-   Provide a **toolkit for building connectors** to other applications
+    and services that use Arrow

Review comment:
       This **making words bold** thing is kinda cheesy, I know, but it stops this from just looking like a big wall of text.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612611031



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:

Review comment:
       Ok, gotcha. Based on this and your other comment above, I think it makes sense to add a "Why use this package?" section before the "Installation" section that will establish the framing to put the details here in more of a clearer context.
   
   I could also modify the content here to focus chiefly on the `Table` class and move these tables into a separate vignette, perhaps titled "Apache Arrow object hierarchy" (because it's valuable to have this content presented _somewhere_ in a documentation style as opposed to a narrative style, even if it's not here in the readme).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#issuecomment-819075694


   https://issues.apache.org/jira/browse/ARROW-11477


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612697191



##########
File path: r/README.md
##########
@@ -15,239 +15,227 @@ The `arrow` package exposes an interface to the Arrow C++ library to
 access many of its features in R. This includes support for analyzing
 large, multi-file datasets (`open_dataset()`), working with individual
 Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+(`read_feather()`, `write_feather()`) files, as well as a `dplyr`
+backend and lower-level access to Arrow memory and messages.
 
 ## Installation
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
+``` shell
 conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
-
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
-
-## Developing
-
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Conda users can install `arrow` nightly builds with
 
 ``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
+conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+## Apache Arrow metadata and data objects
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+Arrow defines the following classes for representing metadata:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | a character string name and a `DataType`         | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+Arrow defines the following classes for representing zero-dimensional
+(scalar), one-dimensional (array/vector-like), and two-dimensional
+(tabular/data frame-like) data:
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+| Dim | Class          | Description                             | How to create an instance                                     |
+|-----|----------------|-----------------------------------------|---------------------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`         | `Scalar$create(value, type)`                                  |
+| 1   | `Array`        | vector of values and its `DataType`     | `Array$create(vector, type)`                                  |
+| 1   | `ChunkedArray` | vectors of values and their `DataType`  | `ChunkedArray$create(..., type)`                              |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`        | `RecordBatch$create(...)`                                     |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`  | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema` | see `vignette("dataset", package = "arrow")`                  |
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
+Each of these is defined as an `R6` class in the `arrow` R package and
+corresponds to a class of the same name in the Arrow C++ library. The
+`arrow` package provides a variety of `R6` and S3 methods for
+interacting with instances of these classes.
 
-Other flags that may be useful:
+## Reading and writing data files with Arrow
 
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_parquet()` and `write_feather()`. These
+functions also accept R data frames.

Review comment:
       In R, it should be either "data frame" or `data.frame`—don't get me started on how it's totally different if we're talking about pandas or Spark...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ianmcook commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

ianmcook commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613596152



##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and

Review comment:
       I don't really think this is the right place for links; this is mean to be a fairly breezy list of key features, with elaboration provided below. I'd prefer to table this idea for later

##########
File path: r/README.md
##########
@@ -4,250 +4,283 @@
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data. It specifies a standardized
+**[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data.** It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-The `arrow` package exposes an interface to the Arrow C++ library to
-access many of its features in R. This includes support for analyzing
-large, multi-file datasets (`open_dataset()`), working with individual
-Parquet (`read_parquet()`, `write_parquet()`) and Feather
-(`read_feather()`, `write_feather()`) files, as well as lower-level
-access to Arrow memory and messages.
+**The `arrow` package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R.** It provides low-level
+access to the Arrow C++ library API and higher-level access through a
+`dplyr` backend and familiar R functions.
+
+## What can the `arrow` package do?
+
+-   Read and write **Parquet files** (`read_parquet()`,
+    `write_parquet()`), an efficient and widely used columnar format
+-   Read and write **Feather files** (`read_feather()`,
+    `write_feather()`), a format optimized for speed and
+    interoperability
+-   Open or write **large, multi-file datasets** with a single function
+    call (`open_dataset()`, `write_dataset()`)
+-   Read **large CSV and JSON files** with excellent **speed and

Review comment:
       I don't really think this is the right place for links; this is meant to be a fairly breezy list of key features, with elaboration provided below. I'd prefer to table this idea for later




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

jonkeane commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r613502676



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -322,9 +330,9 @@ by calling `write_dataset()` on it:
 write_dataset(ds, "nyc-taxi/feather", format = "feather")
 ```
 
-Next, let's imagine that the "payment_type" column is something we often filter on,
+Next, let's imagine that the `payment_type` column is something we often filter on,
 so we want to partition the data by that variable. By doing so we ensure that a filter like
-`payment_type == 3` will touch only a subset of files where payment_type is always 3.
+`payment_type == 3` will touch only a subset of files where `payment_type `is always 3.

Review comment:
       You didn't touch the example below that does 
   
   ```
   ds %>%
     filter(payment_type == 3) %>%
     write_dataset("nyc-taxi/feather", format = "feather")
   ```
   
   But that is no longer valid since we don't autocast numerics to strings anymore. The `group_by`/`write_dataset` will all work, but the query with a filter must have the quotes around it now. And we should probably update that in the text as well to avoid confusion, yeah?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #10014: ARROW-11477: [R][Doc] Reorganize and improve README and vignette content

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on a change in pull request #10014:
URL: https://github.com/apache/arrow/pull/10014#discussion_r612564307



##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+| Dim | Class          | Description                               | How to create an instance                          |
+|-----|----------------|-------------------------------------------|----------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                       |
+| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                       |
+| 1   | `ChunkedArray` | list of `Array`s with the same `DataType` | `chunked_array(..., type)`                         |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `record_batch(...)`                                |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)` or `arrow::read_*()` functions |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema`   | see `vignette("dataset", package = "arrow")`       |
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These classes exist in the `arrow` R package and correspond to classes
+of the same names in the Arrow C++ library. For convenience, the `arrow`
+package also defines several synthetic classes that do not exist in the
+C++ library, including:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+-   `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`

Review comment:
       I wouldn't include this in the README. This is an implementation detail. There are implementation details in vignette("arrow").

##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+| Dim | Class          | Description                               | How to create an instance                          |
+|-----|----------------|-------------------------------------------|----------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                       |
+| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                       |
+| 1   | `ChunkedArray` | list of `Array`s with the same `DataType` | `chunked_array(..., type)`                         |

Review comment:
       For standardization maybe this should use `ChunkedArray$create()`, just so the pattern is clear. Also IDK that "list of `Array`s with the same `DataType`", while accurate on some level, is the right thing to say here. In terms of interface, it behaves like Array, you don't treat it like a list type.

##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+| Dim | Class          | Description                               | How to create an instance                          |
+|-----|----------------|-------------------------------------------|----------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                       |
+| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                       |
+| 1   | `ChunkedArray` | list of `Array`s with the same `DataType` | `chunked_array(..., type)`                         |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `record_batch(...)`                                |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)` or `arrow::read_*()` functions |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema`   | see `vignette("dataset", package = "arrow")`       |
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These classes exist in the `arrow` R package and correspond to classes
+of the same names in the Arrow C++ library. For convenience, the `arrow`
+package also defines several synthetic classes that do not exist in the
+C++ library, including:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+-   `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+-   `ArrowTabular`: inherited by `RecordBatch` and `Table`
+-   `ArrowObject`: inherited by all Arrow objects
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+These are all defined as `R6` classes. The `arrow` package provides a
+variety of `R6` and S3 methods for interacting with instances of these
+classes.
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+## Reading and writing data files with Arrow
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
-
-Other flags that may be useful:
-
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
+-   `read_feather()`: read a file in Feather format (the Apache Arrow

Review comment:
       I'd lead with parquet and feather

##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+| Dim | Class          | Description                               | How to create an instance                          |
+|-----|----------------|-------------------------------------------|----------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                       |
+| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                       |
+| 1   | `ChunkedArray` | list of `Array`s with the same `DataType` | `chunked_array(..., type)`                         |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `record_batch(...)`                                |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)` or `arrow::read_*()` functions |

Review comment:
       perhaps?
   
   ```suggestion
   | 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)` or `arrow::read_*(as_data_frame = FALSE)` functions |
   ```

##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:

Review comment:
       ```suggestion
   If you have `arrow` installed and want to switch to the latest nightly development version, you can use the included `install_arrow()` utility function:
   ```

##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with

Review comment:
       Swap the order of conda and install_arrow() (previously, install_arrow did not handle conda/we didn't have conda nightlies)

##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:

Review comment:
       I might set this up differently: start with what R has since we can assume users are familiar with that. R has vectors, and data.frame that collects vectors of equal length. Arrow has two in-memory representations for the same data: one, RecordBatch containing Arrays, is closest to the R model. The other, Table containing ChunkedArrays, supports array data that may be sliced into multiple pieces in order to accommodate larger sizes or other optimizations. A Dataset is similar to a Table but does not hold all of the data in memory.
   
   Then get into differences. Arrow has a true Scalar type, though you as an R user won't generally have to deal with them. Arrow has a richer data type system and metadata (discuss Schema, DataType, and the range of data types there are.) 

##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:

Review comment:
       I would switch the order of these tables: lead with the data, the things that have parallels to R objects. And then for this table, I'd invert the order: Schema is list of Field, Field is pair of name and DataType.

##########
File path: r/README.md
##########
@@ -22,232 +22,225 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-```r
+``` r
 install.packages("arrow")
 ```
 
 Conda users can install `arrow` from conda-forge with
 
-```
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c conda-forge --strict-channel-priority r-arrow
 
 Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable `NOT_CRAN=true`.
-See `vignette("install", package = "arrow")` for details.
+more complete installation, set the environment variable
+`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for
+details.
 
 ## Installing a development version
 
-Development versions of the package (binary and source) are built daily and hosted at
-<https://arrow-r-nightly.s3.amazonaws.com>. To install from there:
+Development versions of the package (binary and source) are built
+nightly and hosted at <https://arrow-r-nightly.s3.amazonaws.com>. To
+install from there:
 
 ``` r
 install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
 ```
 
-Or
+Or to switch to the latest nightly development version:
 
-```r
+``` r
 arrow::install_arrow(nightly = TRUE)
 ```
 
-Conda users can install `arrow` nightlies from our nightlies channel using:
+Conda users can install `arrow` nightly builds with
 
-```
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
+    conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
 
-These daily package builds are not official Apache releases and are not
-recommended for production use. They may be useful for testing bug fixes
-and new features under active development.
+These nightly package builds are not official Apache releases and are
+not recommended for production use. They may be useful for testing bug
+fixes and new features under active development.
 
-## Developing
+## Apache Arrow metadata and data objects
 
-Windows and macOS users who wish to contribute to the R package and
-don’t need to alter the Arrow C++ library may be able to obtain a
-recent version of the library without building from source. On macOS,
-you may install the C++ library using [Homebrew](https://brew.sh/):
+Arrow defines the following classes for representing metadata:
 
-``` shell
-# For the released version:
-brew install apache-arrow
-# Or for a development version, you can try:
-brew install apache-arrow --HEAD
-```
+| Class      | Description                                      | How to create an instance        |
+|------------|--------------------------------------------------|----------------------------------|
+| `DataType` | attribute controlling how values are represented | functions in `help("data-type")` |
+| `Field`    | string name and a `DataType`                     | `field(name, type)`              |
+| `Schema`   | list of `Field`s                                 | `schema(...)`                    |
 
-On Windows, you can download a .zip file with the arrow dependencies from the
-[nightly repository](https://arrow-r-nightly.s3.amazonaws.com/libarrow/bin/windows/),
-and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. Version numbers in that
-repository correspond to dates, and you will likely want the most recent.
+Arrow defines the following classes for representing 0-dimensional
+(scalar), 1-dimensional (vector), and 2-dimensional (tabular/data
+frame-like) data:
 
-If you need to alter both the Arrow C++ library and the R package code,
-or if you can’t get a binary version of the latest C++ library
-elsewhere, you’ll need to build it from source too.
+| Dim | Class          | Description                               | How to create an instance                          |
+|-----|----------------|-------------------------------------------|----------------------------------------------------|
+| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                       |
+| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                       |
+| 1   | `ChunkedArray` | list of `Array`s with the same `DataType` | `chunked_array(..., type)`                         |
+| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `record_batch(...)`                                |
+| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)` or `arrow::read_*()` functions |
+| 2   | `Dataset`      | list of `Table`s with the same `Schema`   | see `vignette("dataset", package = "arrow")`       |
 
-First, install the C++ library. See the [developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) for details.
-It's recommended to make a `build` directory inside of the `cpp` directory of
-the Arrow git repository (it is git-ignored). Assuming you are inside `cpp/build`,
-you'll first call `cmake` to configure the build and then `make install`.
-For the R package, you'll need to enable several features in the C++ library
-using `-D` flags:
+These classes exist in the `arrow` R package and correspond to classes
+of the same names in the Arrow C++ library. For convenience, the `arrow`
+package also defines several synthetic classes that do not exist in the
+C++ library, including:
 
-```
-cmake \
-  -DARROW_COMPUTE=ON \
-  -DARROW_CSV=ON \
-  -DARROW_DATASET=ON \
-  -DARROW_FILESYSTEM=ON \
-  -DARROW_JEMALLOC=ON \
-  -DARROW_JSON=ON \
-  -DARROW_PARQUET=ON \
-  -DCMAKE_BUILD_TYPE=release \
-  -DARROW_INSTALL_NAME_RPATH=OFF \
-  ..
-```
+-   `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+-   `ArrowTabular`: inherited by `RecordBatch` and `Table`
+-   `ArrowObject`: inherited by all Arrow objects
 
-where `..` is the path to the `cpp/` directory when you're in `cpp/build`.
+These are all defined as `R6` classes. The `arrow` package provides a
+variety of `R6` and S3 methods for interacting with instances of these
+classes.
 
-To enable optional features including S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags:
+## Reading and writing data files with Arrow
 
-```
-  -DARROW_S3=ON \
-  -DARROW_MIMALLOC=ON \
-  -DARROW_WITH_BROTLI=ON \
-  -DARROW_WITH_BZ2=ON \
-  -DARROW_WITH_LZ4=ON \
-  -DARROW_WITH_SNAPPY=ON \
-  -DARROW_WITH_ZLIB=ON \
-  -DARROW_WITH_ZSTD=ON \
-```
-
-Other flags that may be useful:
-
-* `-DARROW_EXTRA_ERROR_CONTEXT=ON` makes errors coming from the C++ library point to files and line numbers
-* `-DBOOST_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
+The `arrow` package provides functions for reading data from several
+common file formats. By default, calling any of these functions returns
+an R data frame. To return an Arrow `Table`, set argument
+`as_data_frame = FALSE`.
 
-Note that after any change to the C++ library, you must reinstall it and
-run `make clean` or `git clean -fdx .` to remove any cached object code
-in the `r/src/` directory before reinstalling the R package. This is
-only necessary if you make changes to the C++ library source; you do not
-need to manually purge object files if you are only editing R or C++
-code inside `r/`.
+-   `read_delim_arrow()`: read a delimited text file (default delimiter
+    is comma)
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
+-   `read_feather()`: read a file in Feather format (the Apache Arrow
+    IPC format)
+-   `read_parquet()`: read a file in Parquet format (an efficient
+    columnar data format)
 
-Once you’ve built the C++ library, you can install the R package and its
-dependencies, along with additional dev dependencies, from the git
-checkout:
+For writing Arrow tabular data structures to files, the `arrow` package
+provides the functions `write_feather()` and `write_parquet()`. These
+functions also accept R data frames.
 
-``` shell
-cd ../../r
+## Using dplyr with Arrow
 
-Rscript -e '
-options(repos = "https://cloud.r-project.org/")
-if (!require("remotes")) install.packages("remotes")
-remotes::install_deps(dependencies = TRUE)
-'
+The `arrow` package provides a `dplyr` backend, enabling manipulation of
+Arrow tabular data with `dplyr` verbs. To start, load both `arrow` and
+`dplyr`:
 
-R CMD INSTALL .
+``` r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-If you need to set any compilation flags while building the C++
-extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
-example, if you are using `perf` to profile the R extensions, you may
-need to set
+Then create an Arrow `Table` or `RecordBatch` using one of the creation
+or file loading functions listed above. For example, create a `Table`
+named `sw` with the Star Wars characters data frame that’s included in
+`dplyr`:
 
-``` shell
-export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
+``` r
+sw <- Table$create(starwars)
 ```
 
-If the package fails to install/load with an error like this:
-
-    ** testing if installed package can be loaded from temporary location
-    Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
-    unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
-    dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
-
-ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
-macOS to prevent problems at link time and is a no-op on other platforms).
-Alternativelly, try setting the environment variable `R_LD_LIBRARY_PATH` to
-wherever Arrow C++ was put in `make install`, e.g. `export
-R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
+Or read the same data from a Parquet file, using `as_data_frame = FALSE`
+to create a `Table` named `sw`:
 
-When installing from source, if the R and C++ library versions do not
-match, installation may fail. If you’ve previously installed the
-libraries and want to upgrade the R package, you’ll need to update the
-Arrow C++ library first.
-
-For any other build/configuration challenges, see the [C++ developer
-guide](https://arrow.apache.org/docs/developers/cpp/building.html) and
-`vignette("install", package = "arrow")`.
-
-### Editing C++ code
-
-The `arrow` package uses some customized tools on top of `cpp11` to
-prepare its C++ code in `src/`. If you change C++ code in the R package,
-you will need to set the `ARROW_R_DEV` environment variable to `TRUE`
-(optionally, add it to your`~/.Renviron` file to persist across
-sessions) so that the `data-raw/codegen.R` file is used for code
-generation.
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-    ./lint.sh
-
-Fix any style issues before committing with
-
-    ./lint.sh --fix
+``` r
+data_file <- tempfile()
+write_parquet(starwars, data_file)
+sw <- read_parquet(data_file, as_data_frame = FALSE)
+```
 
-The lint script requires Python 3 and `clang-format-8`. If the command
-isn’t found, you can explicitly provide the path to it like
-`CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get
-this by installing LLVM via Homebrew and running the script as
-`CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh`
+For larger or multi-file datasets, load the data into a `Dataset` as
+described in `vignette("dataset", package = "arrow")`.
 
-### Running tests
+Next, pipe on `dplyr` verbs:
 
-Some tests are conditionally enabled based on the availability of certain
-features in the package build (S3 support, compression libraries, etc.).
-Others are generally skipped by default but can be enabled with environment
-variables or other settings:
+``` r
+result <- sw %>% 
+  filter(homeworld == "Tatooine") %>% 
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
 
-* All tests are skipped on Linux if the package builds without the C++ libarrow.
-  To make the build fail if libarrow is not available (as in, to test that
-  the C++ build was successful), set `TEST_R_WITH_ARROW=TRUE`
-* Some tests are disabled unless `ARROW_R_DEV=TRUE`
-* Tests that require allocating >2GB of memory to test Large types are disabled
-  unless `ARROW_LARGE_MEMORY_TESTS=TRUE`
-* Integration tests against a real S3 bucket are disabled unless credentials
-  are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available
-  on request
-* S3 tests using [MinIO](https://min.io/) locally are enabled if the
-  `minio server` process is found running. If you're running MinIO with custom
-  settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and
-  `MINIO_PORT` to override the defaults.
+The `arrow` package uses lazy evaluation. `result` is an object with
+class `arrow_dplyr_query` which represents the computations to be
+performed.
 
-### Useful functions
+``` r
+result
+#> Table (query)
+#> name: string
+#> height_in: expr
+#> mass_lbs: expr
+#> 
+#> * Filter: equal(homeworld, "Tatooine")
+#> * Sorted by birth_year [desc]
+#> See $.data for the source Arrow object
+```
 
-Within an R session, these can help with package development:
+To execute these computations and obtain the result, call `compute()` or
+`collect()`. `compute()` returns an Arrow `Table`, suitable for passing
+to other `arrow` or `dplyr` functions.
 
 ``` r
-devtools::load_all() # Load the dev package
-devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
-devtools::document() # Update roxygen documentation
-pkgdown::build_site() # To preview the documentation website
-devtools::check() # All package checks; see also below
-covr::package_coverage() # See test coverage statistics
+result %>% compute()
+#> Table
+#> 10 rows x 3 columns
+#> $name <string>
+#> $height_in <double>
+#> $mass_lbs <double>
 ```
 
-Any of those can be run from the command line by wrapping them in `R -e
-'$COMMAND'`. There’s also a `Makefile` to help with some common tasks
-from the command line (`make test`, `make doc`, `make clean`, etc.)
-
-### Full package validation
+`collect()` returns an R data frame or tibble, suitable for viewing or
+passing to other R functions for analysis or visualization:
 
-``` shell
-R CMD build .
-R CMD check arrow_*.tar.gz --as-cran
-```
+``` r
+result %>% collect()
+#> # A tibble: 10 x 3
+#>    name               height_in mass_lbs
+#>    <chr>                  <dbl>    <dbl>
+#>  1 C-3PO                   65.7    165. 
+#>  2 Cliegg Lars             72.0     NA  
+#>  3 Shmi Skywalker          64.2     NA  
+#>  4 Owen Lars               70.1    265. 
+#>  5 Beru Whitesun lars      65.0    165. 
+#>  6 Darth Vader             79.5    300. 
+#>  7 Anakin Skywalker        74.0    185. 
+#>  8 Biggs Darklighter       72.0    185. 
+#>  9 Luke Skywalker          67.7    170. 
+#> 10 R5-D4                   38.2     70.5
+```
+
+Arrow supports most dplyr verbs except those that compute aggregates
+(such as `summarise()` and `mutate()` after `group_by()`). Inside dplyr
+verbs, Arrow offers limited support for functions and operators, with
+broader support expected in upcoming releases. For more information
+about available compute functions, see `help("list_compute_functions")`.
+
+For dplyr queries on `Table` and `RecordBatch` objects, if the `arrow` R
+package detects an unsupported function within a dplyr verb, it
+automatically calls `collect()` to return the data as an R data frame
+before processing that dplyr verb.
+
+## Getting help
+
+If you encounter a bug, please file an issue with a minimal reproducible
+example on [the Apache Jira issue
+tracker](https://issues.apache.org/jira/). Choose the project **Apache

Review comment:
       I think there is a direct link to the arrow JIRA, should be on our website on the contributing page.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org