You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/26 04:46:44 UTC

[GitHub] [arrow] djnavarro opened a new pull request, #14514: ARROW-17887: [R] [Doc] [WIP] Improve readability of the Get Started and README pages

djnavarro opened a new pull request, #14514:
URL: https://github.com/apache/arrow/pull/14514

   This pull request proposes a number of changes to the pkgdown site for the R package:
   
   - Reduces the content on the README page to the essential points
   - Rewrites the "get started" page to focus on common tasks and novice users
   - Moves discussion of the Arrow data object hierarchy to a new "data objects" vignette
   - Moves discussion of Arrow data types and conversions to a new "data types" vignette
   - Moves discussion of schemas and storage of R attributes to a new "metadata" vignette
   - Moves discussion of package naming conventions to a new "package conventions" vignette
   - Moves discussion of read/write capabilities to a new "reading and writing data" vignette
   - Moves discussion of the dplyr back end to a new "data wrangling" vignette
   - Edits the "multi-file data sets" vignette to improve readability and to minimize risk of novice users unintentionally downloading the 70GB NYC taxi data by copy/paste errors
   - Minor edits to the "python" vignette to improve readability
   - Minor edits to the "cloud storage" vignette to improve readability
   - Minor edits to the "flight" vignette to improve readability
   - Inserts a new "data object layout" vignette (in the developer vignettes) to bridge between the R documentation and the Arrow specification page
   
   In addition there are some structural changes:
   
   - Some vignette filenames have been edited and links updated
   - The pkgdown template now uses bootstrap 5
   - The developer vignettes are now in the main vignettes folder
   - The articles menu organizes the vignettes into meaningful categories
   - The former "project docs" menu has been replaced with a sidebar on the main page
   
   Possible issues as yet unaddressed: 
   
   - I have not yet checked whether the bootstrap 5 template breaks the script inserting the documentation versions switcher
   - Changes to developer vignettes are extremely minimal in comparison to other vignettes. I'm uncertain whether to make further changes there or to defer that to a later PR
   - The "articles" menu currently hides all the developer vignettes under the "more articles" link: they should be made more prominent and easy to find
   - Some topics may not be described as well as we'd like? 
   
   It is still a work-in-progress but feedback is appreciated. To make it easy to preview the proposed changes, a preview copy of the built documentation is posted here: https://djnavarro.net/draft-arrow-docs/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1009048060


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
 
+To learn more about data types in Arrow and how they are mapped to R data types, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Feather that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 
 
+### Individual files
 
-### Arrow to R
+When the goal is to read a single data file, there are several functions you can use:
 
-| Arrow type        | R type                       |
-|-------------------|------------------------------|
-| boolean           | logical                      |
-| int8              | integer                      |
-| int16             | integer                      |
-| int32             | integer                      |
-| int64             | integer^1^                   |
-| uint8             | integer                      |
-| uint16            | integer                      |
-| uint32            | integer^1^                   |
-| uint64            | integer^1^                   |
-| float16           | -^2^                         |
-| float32           | double                       |
-| float64           | double                       |
-| utf8              | character                    |
-| large_utf8        | character                    |
-| binary            | arrow_binary ^3^             |
-| large_binary      | arrow_large_binary ^3^       |
-| fixed_size_binary | arrow_fixed_size_binary ^3^  |
-| date32            | Date                         |
-| date64            | POSIXct                      |
-| time32            | hms::hms                     |
-| time64            | hms::hms                     |
-| timestamp         | POSIXct                      |
-| duration          | difftime                     |
-| decimal           | double                       |
-| dictionary        | factor^4^                    |
-| list              | arrow_list ^5^               |
-| large_list        | arrow_large_list ^5^         |
-| fixed_size_list   | arrow_fixed_size_list ^5^    |
-| struct            | data.frame                   |
-| null              | vctrs::vctrs_unspecified     |
-| map               | arrow_list ^5^               |
-| union             | -^2^                         |
-
-^1^: These integer types may contain values that exceed the range of R's 
-`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are 
-converted to `double` ("numeric") and `int64` is converted to 
-`bit64::integer64`. This conversion can be disabled (so that `int64` always
-yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format
+-   `read_delim_arrow()`: read a delimited text file 
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-^2^: Some Arrow data types do not currently have an R equivalent and will raise an error
-if cast to or mapped to via a schema.
+In every case except JSON, there is a corresponding `write_*()` function 
+that allows you to write data files in the appropriate format. 
 
-^3^: `arrow*_binary` classes are implemented as lists of raw vectors. 
+By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`. 
 
-^4^: Due to the limitation of R factors, Arrow `dictionary` values are coerced
-to string when translated to R if they are not already strings.
+In the example below, we take the `starwars` data provided by the `dplyr` package and write it to a Parquet file using `write_parquet()`
 
-^5^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` 
-with a `ptype` attribute set to what an empty Array of the value type converts to. 
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+
+file_path <- tempfile(fileext = ".parquet")
+write_parquet(starwars, file_path)
+```
 
+We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`) but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):
+
+```{r}
+sw_frame <- read_parquet(file_path)
+sw_table <- read_parquet(file_path, as_data_frame = FALSE)
+sw_table
+```
+
+To learn more about reading and writing individual data files, see the [read/write article](./read_write.html).
+
+### Multi-file data sets
+
+When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The `arrow` package provides a convenient way to read, write, and analyze with data stored in this fashion using the Dataset interface. 
+
+To illustrate the concepts, we'll create a nonsense data set with 100000 rows that can be split into 10 subsets:
+
+```{r}
+set.seed(1234)
+nrows <- 100000
+random_data <- data.frame(
+  x = rnorm(nrows), 
+  y = rnorm(nrows),
+  subset = sample(10, nrows, replace = TRUE)
+)
+```
+
+What we might like to do is partition this data and then write it to 10 separate Parquet files, one corresponding to each value of the `subset` column. To do this we first specify the path to a folder into which we will write the data files:
+
+```{r}
+dataset_path <- file.path(tempdir(), "random_data")
+```
+
+We can then use `group_by()` function from `dplyr` to specify that the data will be partitioned using the `subset` column, and then pass the grouped data to `write_dataset()`:
+
+```{r}
+random_data %>%
+  group_by(subset) %>%
+  write_dataset(dataset_path)
+```
+
+This creates a set of 10 files, one for each subset. These files are named according to the "hive partitioning" format as shown below:
+
+```{r}
+list.files(dataset_path, recursive = TRUE)
+```
 
-### R object attributes
+Each of these Parquet files can be opened individually using `read_parquet()` but is often more convenient -- especially for very large data sets -- to scan the folder and "connect" to the data set without loading it into memory. We can do this using `open_dataset()`:
+
+```{r}
+dset <- open_dataset(dataset_path)
+dset
+```
+
+This `dset` object does not store the data in-memory, only some metadata. However, as discussed in the next section, it is possible to analyze the data referred to be `dset` as if it had been loaded.
+
+To learn more about Arrow Datasets, see the [dataset article](./dataset.html).
+
+## Analyzing Arrow data with dplyr
+
+Arrow Tables and Datasets can be analyzed using `dplyr` syntax. This is possible because the `arrow` R package supplies a backend that translates `dplyr` verbs into commands that are understood by the Arrow C++ library, and will similarly translate R expressions that appear within a call to a `dplyr` verb. For example, although the `dset` Dataset is not a data frame (and does not store the data values in memory), you can still pass it to a `dplyr` pipeline like the one shown below:
+
+```{r}
+dset %>%
+  group_by(subset) %>% 
+  summarize(mean_x = mean(x), min_y = min(y)) %>%
+  filter(mean_x > 0) %>%
+  arrange(subset) %>%
+  collect()
+```
+
+Notice that we call `collect()` at the end of the pipeline. No actual computations are performed until `collect()` (or the related `compute()` function) is called. This "lazy evaluation" makes it possible for the Arrow C++ compute engine to optimize how the computations are performed. 
+
+To learn more about analyzing Arrow data, see the [data wrangling article](./data_wrangling.html).
+

Review Comment:
   Yes, absolutely! I didn't even realise we had that page myself 😁 
   
   I'll link to it from the data wrangling article too 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1009053961


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 

Review Comment:
   Yeah that was one of my biggest points of confusion early on, exacerbated by the slightly-misleading "Arrow is about in-memory not on-disk format" framing that exists in older docs. So I've tried to orient new users to the Table/Dataset distinction early on so that they're expecting to learn more about the difference later. I think it helps!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1009075730


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
 
+To learn more about data types in Arrow and how they are mapped to R data types, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Feather that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 

Review Comment:
   Okay, what I've done (in order to smooth the transition) is subtly make "Arrow format" or "Arrow IPC format" the primary terminology everywhere, but make sure it always refers to "feather" in a secondary position, either by writing "Arrow/Feather format" or "Arrow (also known as Feather)". It's kind of a soft-deprecation strategy: for the time being I want the word "feather" to be present every time the format is referred to so that people are 100% certain that the Arrow format and the Feather format are the same thing. There's enough confusion on that point already, I think. But once it becomes standard to refer to Arrow format we can probably start deprecating further in the docs?
   
   See what you think!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1291507133

   https://issues.apache.org/jira/browse/ARROW-17887


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1293010442

   possible solution for the file renaming issue? pkgdown can generate client-side redirects like this: d022810a2b8e104ccf23b36d432bf78342d6256d. not the ideal way to do redirects, but it's an option I suppose


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1005378186


##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between R and Python within the same process. This vignette provides a brief overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of functionality that is better supported in Python than in R at the current state of development. For example, at one point in time the R `arrow` package didn't support `concat_arrays()` but PyArrow did, so this would have been a good use case at that time. At the time of current writing PyArrow has more comprehensive support for [Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- but see `vignette("flight", package = "arrow")` -- so that would be another instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass data objects between R and Python. With large data sets, it can be quite costly -- in terms of time and CPU cycles -- to perform the copy and covert operations required to translate a native data structure in R (e.g., a data frame) to an analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. Because Arrow data objects such as Tables have the same in-memory format in R and Python, it is possible to perform "zero-copy" data transfers, in which only the metadata needs to be passed between languages. As illustrated later, this drastically improves performance. 
 
-## Installing
+## Installing PyArrow
 
-To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
-To install it in a virtualenv,
+To use Arrow in Python, the `pyarrow` library needs to be installed. For example, you may wish to create a Python [virtual environment](https://docs.python.org/3/library/venv.html) with the `pyarrow` library. A virtual environment is a specific Python installation created for one project or purpose. It is a good practice to use specific environments in Python so that updating a package doesn't impact packages in other projects.

Review Comment:
   nit: would change "with the `pyarrow` library" to "containing the `pyarrow` library" or similar



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] eitsupi commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

eitsupi commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1293353764

   Is it OK to use base pipes (`|>`) that only works with R4.1 or later in vignettes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1029854271


##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,100 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [dplyr](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with dplyr and arrow: we'll start by ensuring both packages are loaded

Review Comment:
   ```suggestion
   Apache Arrow lets you work efficiently with single and multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [dplyr](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with dplyr and arrow: we'll start by ensuring both packages are loaded
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1029880019


##########
r/vignettes/metadata.Rmd:
##########
@@ -0,0 +1,82 @@
+---
+title: "Metadata"
+description: > 
+  Learn how Arrow uses Schemas to document structure of data objects, 
+  and how R metadata are supported in Arrow
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data and metadata object types supplied by arrow, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+## Arrow metadata classes
+
+The arrow package defines the following classes for representing metadata:
+
+- A `Schema` is a list of `Field` objects used to describe the structure of a tabular data object; where
+- A `Field` specifies a character string name and a `DataType`; and
+- A `DataType` is an attribute controlling how values are represented
+
+Consider this:
+
+```{r}
+df <- data.frame(x = 1:3, y = c("a", "b", "c"))
+tb <- arrow_table(df)
+tb$schema
+```
+
+The schema that has been automatically inferred could also be manually created:
+
+```{r}
+schema(
+  field(name = "x", type = int32()),
+  field(name = "y", type = utf8())
+)
+```
+
+The `schema()` function allows the following shorthand to define fields:
+
+```{r}
+schema(x = int32(), y = utf8())
+```
+
+Sometimes it is important to specify the schema manually, particularly if you want fine-grained control over the Arrow data types:
+
+```{r}
+arrow_table(df, schema = schema(x = int64(), y = utf8()))
+arrow_table(df, schema = schema(x = float64(), y = utf8()))
+```
+
+
+## R object attributes
+
+Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object Schema. Attributes added to objects in this fashion are stored under the `r` key, as shown below:
+
+```{r}
+# data frame with custom metadata
+df <- data.frame(x = 1:3, y = c("a", "b", "c"))
+attr(df, "df_meta") <- "custom data frame metadata"
+attr(df$y, "col_meta") <- "custom column metadata"
+
+# when converted to a Table, the metadata is preserved
+tb <- arrow_table(df)
+tb$metadata
+```
+
+It is also possible to assign additional string metadata under any other key you wish, using a command like this:
+
+```{r}
+tb$metadata$new_key <- "new value"
+```
+
+Metadata attached to a Schema is preserved when writing the Table to Feather or Parquet. When reading those files into R, or when calling `as.data.frame()` on a Table or RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow.

Review Comment:
   ```suggestion
   Metadata attached to a Schema is preserved when writing the Table to Arrow/Feather or Parquet formats. When reading those files into R, or when calling `as.data.frame()` on a Table or RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1015311921


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,223 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+For developers interested in learning more about the package structure, see the [developer guide](./developing.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
+
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
+
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
+
+## Converting Tables to data frames
+
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
+
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required. As an example, Arrow does not have an analog of the POSIXlt class: date/time data expressed as POSIXlt objects will preserve the internal list structure of the POSIXlt object, but it will arrive in Arrow as a list. As a consequence users need to decide if what needs to be preserved is the timestamp (in which case it is better to coerce to POSIXlt to POSIXct before translating to Arrow), or if a POSIXlt-style list is preferable. Similar issues exist in the other direction: Arrow dictionary objects are a little more flexible than R factors, for instance. 

Review Comment:
   > date/time data expressed as POSIXlt objects will preserve the internal list structure of the POSIXlt object, but it will arrive in Arrow as a list
   
   Please can you give me a code example for this? Asking as I opened [a ticket for a bug when converting between POSIXlt and Arrow](https://issues.apache.org/jira/browse/ARROW-18263), and there is some weird stuff going on there, and I'm trying to work out how to get this POSIXlt -> list behaviour (and work out "is this a feature or a bug?")



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028632597


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the [cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the [dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding CSV 
+file would have been. This is in part due to the use of file compression: by default, 
+Parquet files written with the `arrow` package use [Snappy compression](https://google.github.io/snappy/) but other options such as gzip 
+are also supported. See `help("write_parquet", package = "arrow")` for more
+information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality is supported in `arrow`:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+with `haven::labelled` columns, and data frame with other custom
+attributes. To learn more about how metadata are handled in `arrow`, the [metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading a only subset of columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, 
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 
+
+Although `read_csv_arrow()` currently has fewer parsing options for dealing 
+with every CSV format variation in the wild than other CSV readers available
+in R, for those files that it can read, it is often significantly faster than 
+other R CSV readers, such as `base::read.csv`, `readr::read_csv`, and
+`data.table::fread`.

Review Comment:
   Yeah, please let's ditch it!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1016078081


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,223 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+For developers interested in learning more about the package structure, see the [developer guide](./developing.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
+
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
+
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
+
+## Converting Tables to data frames
+
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
+
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required. As an example, Arrow does not have an analog of the POSIXlt class: date/time data expressed as POSIXlt objects will preserve the internal list structure of the POSIXlt object, but it will arrive in Arrow as a list. As a consequence users need to decide if what needs to be preserved is the timestamp (in which case it is better to coerce to POSIXlt to POSIXct before translating to Arrow), or if a POSIXlt-style list is preferable. Similar issues exist in the other direction: Arrow dictionary objects are a little more flexible than R factors, for instance. 

Review Comment:
   Try this, maybe? 
   
   ```r
   tm <- as.POSIXlt(c(Sys.time(), Sys.time()))
   arrow::Array$create(tm)
   ```
   
   It seems to arrive in Arrow in the expected list format. But I am not surprised that there are weird cases. For example, an R user might intuitively think you can import a single POSIXlt as a scalar type (after all, you *can* do that for a single POSIXct object), but this errors:
   
   ```r
   arrow::Scalar$create(as.POSIXlt(Sys.time()))
   ```
   
   I dunno. Maybe I should find a safer example than POSIXlt/POSIXct for this section? I feel like I'm skating on thin ice here because I'm not sure how much of this behaviour is desirable. Perhaps there's a better example involving the various integer types? 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1292572265

   > @djnavarro Are all of the "moves discussion of..." items copied verbatim?
   
   Unfortunately, no -- they're almost all new vignettes that expand the original content. Genuinely sorry for the huge PR btw -- I wanted to do all this in many small steps but found myself in this situation where everything felt connected to everything else. The reorganisation and rewrites ended up happening in parallel 😬 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1009309208


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
 
+To learn more about data types in Arrow and how they are mapped to R data types, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Feather that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 

Review Comment:
   Let's loop this into the wider discussion on [ARROW-18148](https://issues.apache.org/jira/browse/ARROW-18148) as I think this is something we need to agree on at a high-level (as it's also being discussed on the mailing list), but I fully agree that some sort of soft-deprecation strategy is needed so we don't confuse users.  I'm thinking "formerly known as" could also be an option, but let's discuss more widely first.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1012363444


##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | `chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | `record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | `arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | `open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of the same name in the underlying Arrow C++ library. It is also worth mentioning that the `arrow` package also defines classes that do not exist in the C++ library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects

Review Comment:
   Probably not! This is also another case where I didn't really want to keep them but felt obligated to do so because these classes are currently listed on the get started page: https://arrow.apache.org/docs/r/articles/arrow.html#data-objects. Exactly as happened with the R6/C++ classes, I've moved it into this vignette as a way of making it less prominent than it currently is. Again, maybe the solution is to delete entirely or move into a developer vignette. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on pull request #14514: ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1325724500

   +1 for sure for pushing that into a subsequent ticket/PR @djnavarro. Great, great work on this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1030313668


##########
r/vignettes/developers/data_object_layout.Rmd:
##########
@@ -0,0 +1,179 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+---
+
+This article describes the internal structure of Arrow data objects. Users of the arrow R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This article provides a deeper dive into some of the topics described in the [data objects article](../data_objects.html), and is intended mostly for developers. It is not necessary knowledge for using the arrow package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 

Review Comment:
   ```suggestion
   - Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1026074877


##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | `chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | `record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | `arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | `open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of the same name in the underlying Arrow C++ library. It is also worth mentioning that the `arrow` package also defines classes that do not exist in the C++ library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects
+
+In addition to these data objects, `arrow` defines the following classes for representing metadata:
+
+- A `Schema` is a list of `Field` objects used to describe the structure of a tabular data object; where
+- A `Field` specifies a character string name and a `DataType`; and
+- A `DataType` is an attribute controlling how values are represented
+
+To learn more about the metadata classes, see the [metadata article](./metadata.html).
+
+## Scalars
+
+A Scalar object is simply a single value that can be of any type. It might be an integer, a string, a timestamp, or any of the different `DataType` objects that Arrow supports. Most users of the `arrow` R package are unlikely to create Scalars directly, but should there be a need you can do this by calling the `Scalar$create()` method:
+
+```{r}
+Scalar$create("hello")
+```
+
+
+## Arrays
+
+Array objects are ordered sets of Scalar values. As with Scalars most users will not need to create Arrays directly, but if the need arises there is an `Array$create()` method that allows you to create new Arrays:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+```{r}
+string_array <- Array$create(c("hello", "amazing", "and", "cruel", "world"))
+string_array
+```
+
+An Array can be subset using square brackets as shown below:
+
+```{r}
+string_array[4:5]
+```
+
+Arrays are immutable objects: once an Array has been created it cannot be modified or extended. 
+
+## Chunked Arrays
+
+In practice, most users of the `arrow` R package are likely to use Chunked Arrays rather than simple Arrays. Under the hood, a Chunked Array is a collection of one or more Arrays that can be indexed _as if_ they were a single Array. The reasons that Arrow provides this functionality are described in the [data object layout article](./developers/data_object_layout.html) but for the present purposes it is sufficient to notice that Chunked Arrays behave like Arrays in regular data analysis.
+
+To illustrate, let's use the `chunked_array()` function:
+
+```{r}
+chunked_string_array <- chunked_array(
+  string_array,
+  c("I", "love", "you")
+)
+```
+
+The `chunked_array()` function is just a wrapper around the functionality that `ChunkedArray$create()` provides. Let's print the object:
+
+```{r}
+chunked_string_array
+```
+
+The double bracketing in this output is intended to highlight the fact that Chunked Arrays are wrappers around one or more Arrays. However, although comprised of multiple distinct Arrays, a Chunked Array can be indexed as if they were laid end-to-end in a single "vector-like" object. This is illustrated below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_indexing.png")
+```
+
+We can use `chunked_string_array` to illustrate this: 
+
+```{r}
+chunked_string_array[4:7]
+```
+
+An important thing to note is that "chunking" is not semantically meaningful. It is an implementation detail only: users should never treat the chunk as a meaningful unit. Writing the data to disk, for example, often results in the data being organized into different chunks. Similarly, two Chunked Arrays that contain the same values assigned to different chunks are deemed equivalent. To illustrate this we can create a Chunked Array that contains the same four same four values as `chunked_string_array[4:7]`, but organized into one chunk rather than split into two:
+
+```{r}
+cruel_world <- chunked_array(c("cruel", "world", "I", "love"))
+cruel_world
+```
+
+Testing for equality using `==` produces an element-wise comparison, and the result is a new Chunked Array of four (boolean type) `true` values:
+
+```{r}
+cruel_world == chunked_string_array[4:7]
+```
+
+In short, the intention is that users interact with Chunked Arrays as if they are ordinary one-dimensional data structures without ever having to think much about the underlying chunking arrangement. 
+
+Chunked Arrays are mutable, in a specific sense: Arrays can be added and removed from a Chunked Array.
+
+## Record Batches
+
+A Record Batch is tabular data structure comprised of named Arrays. Record Batches are a fundamental unit for data interchange in Arrow, but are not typically used for data analysis. Tables and Datasets are usually more convenient in analytic contexts.
+
+These Arrays can be of different types but must all be the same length. Each Array is referred to as one of the "fields" or "columns" of the Record Batch. You can create a Record Batch using the `record_batch()` function or by using the `RecordBatch$create()` method. These functions are flexible and can accept inputs in several formats: you can pass a data frame, one or more named vectors, an input stream, or even a raw vector containing appropriate binary data. For example:
+
+```{r}
+rb <- record_batch(
+  strs = string_array, 
+  ints = integer_array,
+  dbls = c(1.1, 3.2, 0.2, NA, 11)
+)
+rb
+```
+
+This is a Record Batch containing 5 rows and 3 columns, and its conceptual structure is shown below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./record_batch.png")
+```
+
+The `arrow` package supplies a `$` method for Record Batch objects, used to extract a single column by name:
+
+```{r}
+rb$strs
+```
+
+You can use double brackets `[[` to refer to columns by position. The `rb$ints` array is the second column in our Record Batch so we can extract it with this:
+
+```{r}
+rb[[2]]
+```
+
+There is also `[` method that allows you to extract subsets of a record batch in the same way you would for a data frame. The command `rb[1:3, 1:2]` extracts the first three rows and the first two columns:
+
+```{r}
+rb[1:3, 1:2]
+```
+
+Record Batches cannot be concatenated: because they are comprised of Arrays, and Arrays are immutable objects, new rows cannot be added to Record Batch once created.
+
+## Tables
+
+A Table is comprised of named Chunked Arrays, in the same way that a Record Batch is comprised of named Arrays. You can subset Tables with `$`, `[[`, and `[` the same way you can for Record Batches. Unlike Record Batches, Tables can be concatenated (because they are comprised of Chunked Arrays). Suppose a second Record Batch arrives:
+
+```{r}
+new_rb <- record_batch(
+  strs = c("I", "love", "you"), 
+  ints = c(5L, 0L, 0L),
+  dbls = c(7.1, -0.1, 2)
+)
+```
+
+It is not possible to create a Record Batch that appends the data from `new_rb` to the data in `rb`, not without creating entirely new objects in memory. With Tables, however, we can:
+
+```{r}
+df <- arrow_table(rb)
+new_df <- arrow_table(new_rb)
+```
+
+We now have the two fragments of the data set represented as Tables. The difference between the Table and the Record Batch is that the columns are all represented as Chunked Arrays. Each Array from the original Record Batch is one chunk in the corresponding Chunked Array in the Table:
+
+```{r}
+rb$strs
+df$strs
+```
+
+It's the same underlying data -- and indeed the same immutable Array is referenced by both -- just enclosed by a new, flexible Chunked Array wrapper. However, it is this wrapper that allows us to concatenate Tables:
+
+```{r}
+concat_tables(df, new_df)
+```
+
+The resulting object is shown schematically below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./table.png")
+```
+

Review Comment:
   Okay there is now a section on Datasets. It's longer than I was expecting, and I'd appreciate any comments or suggestions



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1023739616


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,223 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+For developers interested in learning more about the package structure, see the [developer guide](./developing.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
+
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
+
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
+
+## Converting Tables to data frames
+
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
+
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required. As an example, Arrow does not have an analog of the POSIXlt class: date/time data expressed as POSIXlt objects will preserve the internal list structure of the POSIXlt object, but it will arrive in Arrow as a list. As a consequence users need to decide if what needs to be preserved is the timestamp (in which case it is better to coerce to POSIXlt to POSIXct before translating to Arrow), or if a POSIXlt-style list is preferable. Similar issues exist in the other direction: Arrow dictionary objects are a little more flexible than R factors, for instance. 

Review Comment:
   Okay, I've opened a jira issue here: https://issues.apache.org/jira/browse/ARROW-18337
   
   In the meantime, I've simplified this passage of text. It now makes the more generic point that it's possible to customise how data types are converted, and then points the reader at the data types vignette
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1024566747


##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data types, and many data types that do not have a counterpart in R. This article describes the Arrow type system, compares it to R data types, and outlines the default mappings used when data are transferred from Arrow to R. At the end of the article there are two lookup tables: one describing the default "R to Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the differences between the output when obtain we use `dplyr::glimpse()` to inspect the `starwars` data in its original format -- as a data frame in R -- and the output we obtain when we convert it to an Arrow Table first by calling `arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 32-bit signed integers, so the underlying data types in Arrow and R are direct analogs of one another. In other cases the differences are purely about the implementation: Arrow and R have different ways to store a vector of strings, but at a high level of abstraction the R character type and the Arrow string type can be viewed as direct analogs. In some cases, however, there are no clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it does not have an analog of POSIXlt; converselt, while R can represent 32 bit signed integers, it does not have an equivalent of a 64 bit unsigned integer.
+
+When the `arrow` package converts between R data and Arrow data, it will first check to see if a Schema has been provided -- see `schema()` for more information -- and if none is available it will attempt to guess the appropriate type by following the default mappings. A complete listing of these mappings is provided at the end of the article, but the most common cases are depicted in the illustration below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./data_types.png")
+```
+
+In this image, black boxes refer to R data types and light blue boxes refer to Arrow data types. Directional arrows specify conversions (e.g., the bidirectional arrow between the logical R type and the boolean Arrow type means that R logicals convert to Arrow booleans and vice versa). Solid lines indicate that the this conversion rule is always the default; dashed lines mean that it only sometimes applies (the rules and special cases are described below). 

Review Comment:
   Latest push makes connecting lines and arrow heads bigger, and the dashed lines now stay as dashed lines the whole time. It doesn't feel like an ideal solution even still because the dashed lines kind of go all over the place, but I'm not 100% sure what the right answer is here because the reality is messy. One possibility might be to include a dashed line only for the "most typical case" (e.g., int64 in arrow -> integer in R) and direct the reader to the text for a detailed explanation?
   
   As an aside one change in the pkgdown configuration is that the vignettes aren't being bundled into the build anymore (I think): it's web-only now. So the png files shouldn't contribute to the size of the package on CRAN. That seems like the right thing to do if we want to simultaneously have pretty docs and not run afoul of the CRAN size restrictions



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1292798991

   Thanks Neal! 
   
   - Yeah I think you're right about file renaming. It's not something I'm keen on myself, but found myself being pushed in that direction because of the scale of the rewrite. In some cases the content has changed so much that I'm not sure there's any meaningful continuity to be preserved. That being said, `install.Rmd` is a really good example of one that we should leave as-is. I think `dataset.Rmd` (which hasn't changed) is another one like that
   
   - I wonder if we can convert the vignettes to articles (in the pkgdown sense) so they don't get bundled in the build. It probably means we have to rewrite the cross-linking markdown since `vignette("whatever")` won't work anymore, but at some point I think we might have to make that change if we want to include images to help make the documentation readable to novice users?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1016082278


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets
+-   Read JSON files
 
-Conda users can install `arrow` from conda-forge with
+It provides data analysis tools for both in-memory and larger-than-memory data sets
 
-``` shell
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+-   Analyze and process larger-than-memory datasets
+-   Manipulate and analyze Arrow data with `dplyr` verbs
 
-Installing a released version of the `arrow` package requires no
-additional system dependencies. For macOS and Windows, CRAN hosts binary
-packages that contain the Arrow C++ library. On Linux, source package
-installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable
-`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details.
+It provides access to remote filesystems and servers
 
-As of version 10.0.0, `arrow` requires C++17 to build. This means that:
+-   Read and write files in Amazon S3 and Google Cloud Storage buckets
+-   Connect to Arrow Flight servers to transport large datasets over networks  
+    
+Additional features include:
 
-* On Windows, you need `R >= 4.0`. Version 9.0.0 was the last version to support
-R 3.6.
-* On CentOS 7, you can build the latest version of `arrow`,
-but you first need to install a newer compiler than the default system compiler,
-gcc 4.8. See `vignette("install", package = "arrow")` for guidance.
-Note that you only need the newer compiler to build `arrow`:
-installing a binary package, as from RStudio Package Manager,
-or loading a package you've already installed works fine with the system defaults.
+-   Zero-copy data sharing between R and Python
+-   Fine control over column types to work seamlessly
+    with databases and data warehouses
+-   Support for compression codecs including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2
+-   Access and manipulate Arrow objects through low-level bindings
+    to the C++ library
+-   Toolkit for building connectors to other applications
+    and services that use Arrow
 
-### Installing a development version
+## Installation
 
-Development versions of the package (binary and source) are built
-nightly and hosted at <https://nightlies.apache.org/arrow/r/>. To
-install from there:
+Most R users will probably want to install the latest release of `arrow` 
+from CRAN:
 
 ``` r
-install.packages("arrow", repos = c(arrow = "https://nightlies.apache.org/arrow/r", getOption("repos")))
+install.packages("arrow")
 ```
 
-Conda users can install `arrow` nightly builds with
+Alternatively, if you are using conda you can install `arrow` from conda-forge:
 
 ``` shell
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-If you already have a version of `arrow` installed, you can switch to
-the latest nightly development version with
-
-``` r
-arrow::install_arrow(nightly = TRUE)
-```
-
-These nightly package builds are not official Apache releases and are
-not recommended for production use. They may be useful for testing bug
-fixes and new features under active development.
-
-## Usage
-
-Among the many applications of the `arrow` package, two of the most accessible are:
-
--   High-performance reading and writing of data files with multiple
-    file formats and compression codecs, including built-in support for
-    cloud storage
--   Analyzing and manipulating bigger-than-memory data with `dplyr`
-    verbs
-
-The sections below describe these two uses and illustrate them with
-basic examples. The sections below mention two Arrow data structures:
-
--   `Table`: a tabular, column-oriented data structure capable of
-    storing and processing large amounts of data more efficiently than
-    R’s built-in `data.frame` and with SQL-like column data types that
-    afford better interoperability with databases and data warehouse
-    systems
--   `Dataset`: a data structure functionally similar to `Table` but with
-    the capability to work on larger-than-memory data partitioned across
-    multiple files
-
-### Reading and writing data files with `arrow`
-
-The `arrow` package provides functions for reading single data files in
-several common formats. By default, calling any of these functions
-returns an R `data.frame`. To return an Arrow `Table`, set argument
-`as_data_frame = FALSE`.
-
--   `read_parquet()`: read a file in Parquet format
--   `read_feather()`: read a file in Feather format (the Apache Arrow
-    IPC format)
--   `read_delim_arrow()`: read a delimited text file (default delimiter
-    is comma)
--   `read_csv_arrow()`: read a comma-separated values (CSV) file
--   `read_tsv_arrow()`: read a tab-separated values (TSV) file
--   `read_json_arrow()`: read a JSON data file
-
-For writing data to single files, the `arrow` package provides the
-functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
-These can be used with R `data.frame` and Arrow `Table` objects.
-
-For example, let’s write the Star Wars characters data that’s included
-in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
-choice for storing analytic data; it is optimized for reduced file sizes
-and fast read performance, especially for column-based access patterns.
-Parquet is widely supported by many tools and platforms.
-
-First load the `arrow` and `dplyr` packages:
-
-``` r
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-Then write the `data.frame` named `starwars` to a Parquet file at
-`file_path`:
-
-``` r
-file_path <- tempfile()
-write_parquet(starwars, file_path)
-```
-
-Then read the Parquet file into an R `data.frame` named `sw`:
-
-``` r
-sw <- read_parquet(file_path)
-```
-
-R object attributes are preserved when writing data to Parquet or
-Feather files and when reading those files back into R. This enables
-round-trip writing and reading of `sf::sf` objects, R `data.frame`s with
-with `haven::labelled` columns, and `data.frame`s with other custom
-attributes.
-
-For reading and writing larger files or sets of multiple files, `arrow`
-defines `Dataset` objects and provides the functions `open_dataset()`
-and `write_dataset()`, which enable analysis and processing of
-bigger-than-memory data, including the ability to partition data into
-smaller chunks without loading the full data into memory. For examples
-of these functions, see `vignette("dataset", package = "arrow")`.
-
-All these functions can read and write files in the local filesystem or
-in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
-details, see `vignette("fs", package = "arrow")`
-
-### Using `dplyr` with `arrow`
-
-The `arrow` package provides a `dplyr` backend enabling manipulation of
-Arrow tabular data with `dplyr` verbs. To use it, first load both
-packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
-`Dataset` object. For example, read the Parquet file written in the
-previous example into an Arrow `Table` named `sw`:
-
-``` r
-sw <- read_parquet(file_path, as_data_frame = FALSE)
-```
-
-Next, pipe on `dplyr` verbs:
-
-``` r
-result <- sw %>%
-  filter(homeworld == "Tatooine") %>%
-  rename(height_cm = height, mass_kg = mass) %>%
-  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
-  arrange(desc(birth_year)) %>%
-  select(name, height_in, mass_lbs)
-```
-
-The `arrow` package uses lazy evaluation to delay computation until the
-result is required. This speeds up processing by enabling the Arrow C++
-library to perform multiple computations in one operation. `result` is
-an object with class `arrow_dplyr_query` which represents all the
-computations to be performed:
-
-``` r
-result
-#> Table (query)
-#> name: string
-#> height_in: expr
-#> mass_lbs: expr
-#>
-#> * Filter: equal(homeworld, "Tatooine")
-#> * Sorted by birth_year [desc]
-#> See $.data for the source Arrow object
-```
-
-To perform these computations and materialize the result, call
-`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
-suitable for passing to other `arrow` or `dplyr` functions:
-
-``` r
-result %>% compute()
-#> Table
-#> 10 rows x 3 columns
-#> $name <string>
-#> $height_in <double>
-#> $mass_lbs <double>
-```
-
-`collect()` returns an R `data.frame`, suitable for viewing or passing
-to other R functions for analysis or visualization:
-
-``` r
-result %>% collect()
-#> # A tibble: 10 x 3
-#>    name               height_in mass_lbs
-#>    <chr>                  <dbl>    <dbl>
-#>  1 C-3PO                   65.7    165.
-#>  2 Cliegg Lars             72.0     NA
-#>  3 Shmi Skywalker          64.2     NA
-#>  4 Owen Lars               70.1    265.
-#>  5 Beru Whitesun lars      65.0    165.
-#>  6 Darth Vader             79.5    300.
-#>  7 Anakin Skywalker        74.0    185.
-#>  8 Biggs Darklighter       72.0    185.
-#>  9 Luke Skywalker          67.7    170.
-#> 10 R5-D4                   38.2     70.5
+conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
-The `arrow` package works with most single-table `dplyr` verbs, including those
-that compute aggregates.
+In most cases installing the latest release should "just work" without 

Review Comment:
   I decided to simplify it and say "work" without quotes or the "just" 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1015311921


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,223 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+For developers interested in learning more about the package structure, see the [developer guide](./developing.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
+
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
+
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
+
+## Converting Tables to data frames
+
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
+
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required. As an example, Arrow does not have an analog of the POSIXlt class: date/time data expressed as POSIXlt objects will preserve the internal list structure of the POSIXlt object, but it will arrive in Arrow as a list. As a consequence users need to decide if what needs to be preserved is the timestamp (in which case it is better to coerce to POSIXlt to POSIXct before translating to Arrow), or if a POSIXlt-style list is preferable. Similar issues exist in the other direction: Arrow dictionary objects are a little more flexible than R factors, for instance. 

Review Comment:
   > date/time data expressed as POSIXlt objects will preserve the internal list structure of the POSIXlt object, but it will arrive in Arrow as a list
   
   Please can you give me a code example for this? Asking as I opened [a ticket for a bug when converting between POSIXlt and Arrow](https://issues.apache.org/jira/browse/ARROW-18263), and there is some weird stuff going on there, and I'm trying to work out how to get this POSIXlt -> list behaviour.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

jonkeane commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1013418658


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets
+-   Read JSON files
 
-Conda users can install `arrow` from conda-forge with
+It provides data analysis tools for both in-memory and larger-than-memory data sets
 
-``` shell
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+-   Analyze and process larger-than-memory datasets
+-   Manipulate and analyze Arrow data with `dplyr` verbs
 
-Installing a released version of the `arrow` package requires no
-additional system dependencies. For macOS and Windows, CRAN hosts binary
-packages that contain the Arrow C++ library. On Linux, source package
-installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable
-`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details.
+It provides access to remote filesystems and servers
 
-As of version 10.0.0, `arrow` requires C++17 to build. This means that:
+-   Read and write files in Amazon S3 and Google Cloud Storage buckets
+-   Connect to Arrow Flight servers to transport large datasets over networks  
+    
+Additional features include:
 
-* On Windows, you need `R >= 4.0`. Version 9.0.0 was the last version to support
-R 3.6.
-* On CentOS 7, you can build the latest version of `arrow`,
-but you first need to install a newer compiler than the default system compiler,
-gcc 4.8. See `vignette("install", package = "arrow")` for guidance.
-Note that you only need the newer compiler to build `arrow`:
-installing a binary package, as from RStudio Package Manager,
-or loading a package you've already installed works fine with the system defaults.
+-   Zero-copy data sharing between R and Python
+-   Fine control over column types to work seamlessly
+    with databases and data warehouses
+-   Support for compression codecs including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2
+-   Access and manipulate Arrow objects through low-level bindings
+    to the C++ library
+-   Toolkit for building connectors to other applications
+    and services that use Arrow
 
-### Installing a development version
+## Installation
 
-Development versions of the package (binary and source) are built
-nightly and hosted at <https://nightlies.apache.org/arrow/r/>. To
-install from there:
+Most R users will probably want to install the latest release of `arrow` 
+from CRAN:
 
 ``` r
-install.packages("arrow", repos = c(arrow = "https://nightlies.apache.org/arrow/r", getOption("repos")))
+install.packages("arrow")
 ```
 
-Conda users can install `arrow` nightly builds with
+Alternatively, if you are using conda you can install `arrow` from conda-forge:
 
 ``` shell
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-If you already have a version of `arrow` installed, you can switch to
-the latest nightly development version with
-
-``` r
-arrow::install_arrow(nightly = TRUE)
-```
-
-These nightly package builds are not official Apache releases and are
-not recommended for production use. They may be useful for testing bug
-fixes and new features under active development.
-
-## Usage
-
-Among the many applications of the `arrow` package, two of the most accessible are:
-
--   High-performance reading and writing of data files with multiple
-    file formats and compression codecs, including built-in support for
-    cloud storage
--   Analyzing and manipulating bigger-than-memory data with `dplyr`
-    verbs
-
-The sections below describe these two uses and illustrate them with
-basic examples. The sections below mention two Arrow data structures:
-
--   `Table`: a tabular, column-oriented data structure capable of
-    storing and processing large amounts of data more efficiently than
-    R’s built-in `data.frame` and with SQL-like column data types that
-    afford better interoperability with databases and data warehouse
-    systems
--   `Dataset`: a data structure functionally similar to `Table` but with
-    the capability to work on larger-than-memory data partitioned across
-    multiple files
-
-### Reading and writing data files with `arrow`
-
-The `arrow` package provides functions for reading single data files in
-several common formats. By default, calling any of these functions
-returns an R `data.frame`. To return an Arrow `Table`, set argument
-`as_data_frame = FALSE`.
-
--   `read_parquet()`: read a file in Parquet format
--   `read_feather()`: read a file in Feather format (the Apache Arrow
-    IPC format)
--   `read_delim_arrow()`: read a delimited text file (default delimiter
-    is comma)
--   `read_csv_arrow()`: read a comma-separated values (CSV) file
--   `read_tsv_arrow()`: read a tab-separated values (TSV) file
--   `read_json_arrow()`: read a JSON data file
-
-For writing data to single files, the `arrow` package provides the
-functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
-These can be used with R `data.frame` and Arrow `Table` objects.
-
-For example, let’s write the Star Wars characters data that’s included
-in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
-choice for storing analytic data; it is optimized for reduced file sizes
-and fast read performance, especially for column-based access patterns.
-Parquet is widely supported by many tools and platforms.
-
-First load the `arrow` and `dplyr` packages:
-
-``` r
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-Then write the `data.frame` named `starwars` to a Parquet file at
-`file_path`:
-
-``` r
-file_path <- tempfile()
-write_parquet(starwars, file_path)
-```
-
-Then read the Parquet file into an R `data.frame` named `sw`:
-
-``` r
-sw <- read_parquet(file_path)
-```
-
-R object attributes are preserved when writing data to Parquet or
-Feather files and when reading those files back into R. This enables
-round-trip writing and reading of `sf::sf` objects, R `data.frame`s with
-with `haven::labelled` columns, and `data.frame`s with other custom
-attributes.
-
-For reading and writing larger files or sets of multiple files, `arrow`
-defines `Dataset` objects and provides the functions `open_dataset()`
-and `write_dataset()`, which enable analysis and processing of
-bigger-than-memory data, including the ability to partition data into
-smaller chunks without loading the full data into memory. For examples
-of these functions, see `vignette("dataset", package = "arrow")`.
-
-All these functions can read and write files in the local filesystem or
-in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
-details, see `vignette("fs", package = "arrow")`
-
-### Using `dplyr` with `arrow`
-
-The `arrow` package provides a `dplyr` backend enabling manipulation of
-Arrow tabular data with `dplyr` verbs. To use it, first load both
-packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
-`Dataset` object. For example, read the Parquet file written in the
-previous example into an Arrow `Table` named `sw`:
-
-``` r
-sw <- read_parquet(file_path, as_data_frame = FALSE)
-```
-
-Next, pipe on `dplyr` verbs:
-
-``` r
-result <- sw %>%
-  filter(homeworld == "Tatooine") %>%
-  rename(height_cm = height, mass_kg = mass) %>%
-  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
-  arrange(desc(birth_year)) %>%
-  select(name, height_in, mass_lbs)
-```
-
-The `arrow` package uses lazy evaluation to delay computation until the
-result is required. This speeds up processing by enabling the Arrow C++
-library to perform multiple computations in one operation. `result` is
-an object with class `arrow_dplyr_query` which represents all the
-computations to be performed:
-
-``` r
-result
-#> Table (query)
-#> name: string
-#> height_in: expr
-#> mass_lbs: expr
-#>
-#> * Filter: equal(homeworld, "Tatooine")
-#> * Sorted by birth_year [desc]
-#> See $.data for the source Arrow object
-```
-
-To perform these computations and materialize the result, call
-`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
-suitable for passing to other `arrow` or `dplyr` functions:
-
-``` r
-result %>% compute()
-#> Table
-#> 10 rows x 3 columns
-#> $name <string>
-#> $height_in <double>
-#> $mass_lbs <double>
-```
-
-`collect()` returns an R `data.frame`, suitable for viewing or passing
-to other R functions for analysis or visualization:
-
-``` r
-result %>% collect()
-#> # A tibble: 10 x 3
-#>    name               height_in mass_lbs
-#>    <chr>                  <dbl>    <dbl>
-#>  1 C-3PO                   65.7    165.
-#>  2 Cliegg Lars             72.0     NA
-#>  3 Shmi Skywalker          64.2     NA
-#>  4 Owen Lars               70.1    265.
-#>  5 Beru Whitesun lars      65.0    165.
-#>  6 Darth Vader             79.5    300.
-#>  7 Anakin Skywalker        74.0    185.
-#>  8 Biggs Darklighter       72.0    185.
-#>  9 Luke Skywalker          67.7    170.
-#> 10 R5-D4                   38.2     70.5
+conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
-The `arrow` package works with most single-table `dplyr` verbs, including those
-that compute aggregates.
+In most cases installing the latest release should "just work" without 
+requiring any additional system dependencies, especially if you are using 
+Window or a Mac. For those users, CRAN hosts binary packages that contain 
+the Arrow C++ library upon which the `arrow` package relies, and no 
+additional steps should be required.
 
-```r
-sw %>%
-  group_by(species) %>%
-  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
-  collect()
-```
+There are some special cases to note:
 
-Additionally, equality joins (e.g. `left_join()`, `inner_join()`) are supported
-for joining multiple tables.
+- On Linux the installation process can sometimes be more involved because 
+CRAN does not host binaries for Linux. For more information please see the [installation guide](https://arrow.apache.org/docs/r/articles/install.html).
 
-```r
-jedi <- data.frame(
-  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
-  jedi = c(FALSE, TRUE, TRUE)
-)
-
-sw %>%
-  select(1:11) %>%
-  right_join(jedi) %>%
-  collect()
-```
+- If you are compiling `arrow` from source, please note that as of version 
+10.0.0, `arrow` requires C++17 to build. This has implications on Windows and
+CentOS 7. For Windows users it means you need to be running an R version of 
+4.0 or later. On CentOS 7, it means you need to install a newer compiler 
+than the default system compiler gcc 4.8. See the [installation details article](https://arrow.apache.org/docs/r/articles/developers/install_details.html) for guidance. Note that 
+this does not affect users who are installing a binary version of the package.
 
-Window functions (e.g. `ntile()`) are not yet
-supported. Inside `dplyr` verbs, Arrow offers support for many functions and
-operators, with common functions mapped to their base R and tidyverse
-equivalents. The [changelog](https://arrow.apache.org/docs/r/news/index.html)
-lists many of them. If there are additional functions you would like to see
-implemented, please file an issue as described in the [Getting
-help](#getting-help) section below.
+- Development versions of `arrow` are released nightly. Most users will not 
+need to install nightly builds, but if you do please see the article on [installing nightly builds]([installation guide](https://arrow.apache.org/docs/r/articles/install_nightly.html) for more information.
 
-For `dplyr` queries on `Table` objects, if the `arrow` package detects
-an unimplemented function within a `dplyr` verb, it automatically calls
-`collect()` to return the data as an R `data.frame` before processing
-that `dplyr` verb. For queries on `Dataset` objects (which can be larger
-than memory), it raises an error if the function is unimplemented;
-you need to explicitly tell it to `collect()`.
+## Arrow resources 
 
-### Additional features
-
-Other applications of `arrow` are described in the following vignettes:
-
--   `vignette("python", package = "arrow")`: use `arrow` and
-    `reticulate` to pass data between R and Python
--   `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC
-    servers to send and receive data
--   `vignette("arrow", package = "arrow")`: access and manipulate Arrow
-    objects through low-level bindings to the C++ library
-
-The Arrow for R [cheatsheet](https://github.com/apache/arrow/blob/-/r/cheatsheet/arrow-cheatsheet.pdf) and [Cookbook](https://arrow.apache.org/cookbook/r/index.html) are additional resources for getting started with `arrow`.
+In addition to the official [Arrow R package documentation](https://arrow.apache.org/docs/r/), the [Arrow for R cheatsheet](https://github.com/apache/arrow/blob/-/r/cheatsheet/arrow-cheatsheet.pdf), and the [Apache Arrow R Cookbook](https://arrow.apache.org/cookbook/r/index.html) are useful resources for getting started with `arrow`.
 
 ## Getting help
 
 If you encounter a bug, please file an issue with a minimal reproducible
 example on the [Apache Jira issue
 tracker](https://issues.apache.org/jira/projects/ARROW/issues). Create
-an account or log in, then click **Create** to file an issue. Select the
-project **Apache Arrow (ARROW)**, select the component **R**, and begin
-the issue summary with **`[R]`** followed by a space. For more
-information, see the **Report bugs and propose features** section of the
+an account or log in, then click "Create" to file an issue. Select the
+project "Apache Arrow (ARROW)", select the component "R", and begin
+the issue summary with "[R]" followed by a space. For more
+information, see the "Report bugs and propose features" section of the
 [Contributing to Apache

Review Comment:
   This'll soon be github, but we can change it when we get that up and running



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028640498


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized

Review Comment:
   Updated to read "in-memory and larger-than-memory data"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1029855499


##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,100 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [dplyr](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with dplyr and arrow: we'll start by ensuring both packages are loaded
 
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
 
 ## Example: NYC taxi data
 
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large datasets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available. 

Review Comment:
   ```suggestion
   The primary motivation for Arrow's Datasets object is to allow users to analyze extremely large datasets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available. 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028813713


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,163 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with arrow 
+output: rmarkdown::html_vignette
+---
+
+The arrow package provides functions for reading single data files into memory, in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (formerly called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the arrow package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in arrow, see the [cloud storage article](./fs.html).
+
+The arrow package also supports reading and writing multi-file datasets,

Review Comment:
   ```suggestion
   The arrow package also supports reading larger-than-memory single data files, and reading and writing multi-file data sets.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028641963


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,221 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+For developers interested in learning more about the package structure, see the [developer guide](./developing.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
 
+It is possible to exercise fine grained control over this conversion process. To learn more about the different types and how they are converted, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Arrow (also called Feather) that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 
 
+### Individual files
 
-### Arrow to R
+When the goal is to read a single data file, there are several functions you can use:
 
-| Arrow type        | R type                       |
-|-------------------|------------------------------|
-| boolean           | logical                      |
-| int8              | integer                      |
-| int16             | integer                      |
-| int32             | integer                      |
-| int64             | integer^1^                   |
-| uint8             | integer                      |
-| uint16            | integer                      |
-| uint32            | integer^1^                   |
-| uint64            | integer^1^                   |
-| float16           | -^2^                         |
-| float32           | double                       |
-| float64           | double                       |
-| utf8              | character                    |
-| large_utf8        | character                    |
-| binary            | arrow_binary ^3^             |
-| large_binary      | arrow_large_binary ^3^       |
-| fixed_size_binary | arrow_fixed_size_binary ^3^  |
-| date32            | Date                         |
-| date64            | POSIXct                      |
-| time32            | hms::hms                     |
-| time64            | hms::hms                     |
-| timestamp         | POSIXct                      |
-| duration          | difftime                     |
-| decimal           | double                       |
-| dictionary        | factor^4^                    |
-| list              | arrow_list ^5^               |
-| large_list        | arrow_large_list ^5^         |
-| fixed_size_list   | arrow_fixed_size_list ^5^    |
-| struct            | data.frame                   |
-| null              | vctrs::vctrs_unspecified     |
-| map               | arrow_list ^5^               |
-| union             | -^2^                         |
-
-^1^: These integer types may contain values that exceed the range of R's 
-`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are 
-converted to `double` ("numeric") and `int64` is converted to 
-`bit64::integer64`. This conversion can be disabled (so that `int64` always
-yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Arrow/Feather format
+-   `read_delim_arrow()`: read a delimited text file 
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-^2^: Some Arrow data types do not currently have an R equivalent and will raise an error
-if cast to or mapped to via a schema.
+In every case except JSON, there is a corresponding `write_*()` function 
+that allows you to write data files in the appropriate format. 
 
-^3^: `arrow*_binary` classes are implemented as lists of raw vectors. 
+By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`. 
 
-^4^: Due to the limitation of R factors, Arrow `dictionary` values are coerced
-to string when translated to R if they are not already strings.
+In the example below, we take the `starwars` data provided by the `dplyr` package and write it to a Parquet file using `write_parquet()`
 
-^5^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` 
-with a `ptype` attribute set to what an empty Array of the value type converts to. 
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+
+file_path <- tempfile(fileext = ".parquet")
+write_parquet(starwars, file_path)
+```
 
+We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`) but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):
+
+```{r}
+sw_frame <- read_parquet(file_path)
+sw_table <- read_parquet(file_path, as_data_frame = FALSE)
+sw_table
+```
+
+To learn more about reading and writing individual data files, see the [read/write article](./read_write.html).
+
+### Multi-file data sets
+
+When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The `arrow` package provides a convenient way to read, write, and analyze with data stored in this fashion using the Dataset interface. 

Review Comment:
   Yeah, that's a nice thought. I'll update to use your suggested phrasing below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028661795


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.

Review Comment:
   That's nice, I'll do that. @stephhazlitt had some suggestions about other places we can rephrase to emphasise the in-memory/on-disk distinction too



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

jonkeane commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1013418370


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets
+-   Read JSON files
 
-Conda users can install `arrow` from conda-forge with
+It provides data analysis tools for both in-memory and larger-than-memory data sets
 
-``` shell
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+-   Analyze and process larger-than-memory datasets
+-   Manipulate and analyze Arrow data with `dplyr` verbs
 
-Installing a released version of the `arrow` package requires no
-additional system dependencies. For macOS and Windows, CRAN hosts binary
-packages that contain the Arrow C++ library. On Linux, source package
-installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable
-`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details.
+It provides access to remote filesystems and servers
 
-As of version 10.0.0, `arrow` requires C++17 to build. This means that:
+-   Read and write files in Amazon S3 and Google Cloud Storage buckets
+-   Connect to Arrow Flight servers to transport large datasets over networks  
+    
+Additional features include:
 
-* On Windows, you need `R >= 4.0`. Version 9.0.0 was the last version to support
-R 3.6.
-* On CentOS 7, you can build the latest version of `arrow`,
-but you first need to install a newer compiler than the default system compiler,
-gcc 4.8. See `vignette("install", package = "arrow")` for guidance.
-Note that you only need the newer compiler to build `arrow`:
-installing a binary package, as from RStudio Package Manager,
-or loading a package you've already installed works fine with the system defaults.
+-   Zero-copy data sharing between R and Python
+-   Fine control over column types to work seamlessly
+    with databases and data warehouses
+-   Support for compression codecs including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2
+-   Access and manipulate Arrow objects through low-level bindings
+    to the C++ library
+-   Toolkit for building connectors to other applications
+    and services that use Arrow
 
-### Installing a development version
+## Installation
 
-Development versions of the package (binary and source) are built
-nightly and hosted at <https://nightlies.apache.org/arrow/r/>. To
-install from there:
+Most R users will probably want to install the latest release of `arrow` 
+from CRAN:
 
 ``` r
-install.packages("arrow", repos = c(arrow = "https://nightlies.apache.org/arrow/r", getOption("repos")))
+install.packages("arrow")
 ```
 
-Conda users can install `arrow` nightly builds with
+Alternatively, if you are using conda you can install `arrow` from conda-forge:
 
 ``` shell
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-If you already have a version of `arrow` installed, you can switch to
-the latest nightly development version with
-
-``` r
-arrow::install_arrow(nightly = TRUE)
-```
-
-These nightly package builds are not official Apache releases and are
-not recommended for production use. They may be useful for testing bug
-fixes and new features under active development.
-
-## Usage
-
-Among the many applications of the `arrow` package, two of the most accessible are:
-
--   High-performance reading and writing of data files with multiple
-    file formats and compression codecs, including built-in support for
-    cloud storage
--   Analyzing and manipulating bigger-than-memory data with `dplyr`
-    verbs
-
-The sections below describe these two uses and illustrate them with
-basic examples. The sections below mention two Arrow data structures:
-
--   `Table`: a tabular, column-oriented data structure capable of
-    storing and processing large amounts of data more efficiently than
-    R’s built-in `data.frame` and with SQL-like column data types that
-    afford better interoperability with databases and data warehouse
-    systems
--   `Dataset`: a data structure functionally similar to `Table` but with
-    the capability to work on larger-than-memory data partitioned across
-    multiple files
-
-### Reading and writing data files with `arrow`
-
-The `arrow` package provides functions for reading single data files in
-several common formats. By default, calling any of these functions
-returns an R `data.frame`. To return an Arrow `Table`, set argument
-`as_data_frame = FALSE`.
-
--   `read_parquet()`: read a file in Parquet format
--   `read_feather()`: read a file in Feather format (the Apache Arrow
-    IPC format)
--   `read_delim_arrow()`: read a delimited text file (default delimiter
-    is comma)
--   `read_csv_arrow()`: read a comma-separated values (CSV) file
--   `read_tsv_arrow()`: read a tab-separated values (TSV) file
--   `read_json_arrow()`: read a JSON data file
-
-For writing data to single files, the `arrow` package provides the
-functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
-These can be used with R `data.frame` and Arrow `Table` objects.
-
-For example, let’s write the Star Wars characters data that’s included
-in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
-choice for storing analytic data; it is optimized for reduced file sizes
-and fast read performance, especially for column-based access patterns.
-Parquet is widely supported by many tools and platforms.
-
-First load the `arrow` and `dplyr` packages:
-
-``` r
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-Then write the `data.frame` named `starwars` to a Parquet file at
-`file_path`:
-
-``` r
-file_path <- tempfile()
-write_parquet(starwars, file_path)
-```
-
-Then read the Parquet file into an R `data.frame` named `sw`:
-
-``` r
-sw <- read_parquet(file_path)
-```
-
-R object attributes are preserved when writing data to Parquet or
-Feather files and when reading those files back into R. This enables
-round-trip writing and reading of `sf::sf` objects, R `data.frame`s with
-with `haven::labelled` columns, and `data.frame`s with other custom
-attributes.
-
-For reading and writing larger files or sets of multiple files, `arrow`
-defines `Dataset` objects and provides the functions `open_dataset()`
-and `write_dataset()`, which enable analysis and processing of
-bigger-than-memory data, including the ability to partition data into
-smaller chunks without loading the full data into memory. For examples
-of these functions, see `vignette("dataset", package = "arrow")`.
-
-All these functions can read and write files in the local filesystem or
-in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
-details, see `vignette("fs", package = "arrow")`
-
-### Using `dplyr` with `arrow`
-
-The `arrow` package provides a `dplyr` backend enabling manipulation of
-Arrow tabular data with `dplyr` verbs. To use it, first load both
-packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
-`Dataset` object. For example, read the Parquet file written in the
-previous example into an Arrow `Table` named `sw`:
-
-``` r
-sw <- read_parquet(file_path, as_data_frame = FALSE)
-```
-
-Next, pipe on `dplyr` verbs:
-
-``` r
-result <- sw %>%
-  filter(homeworld == "Tatooine") %>%
-  rename(height_cm = height, mass_kg = mass) %>%
-  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
-  arrange(desc(birth_year)) %>%
-  select(name, height_in, mass_lbs)
-```
-
-The `arrow` package uses lazy evaluation to delay computation until the
-result is required. This speeds up processing by enabling the Arrow C++
-library to perform multiple computations in one operation. `result` is
-an object with class `arrow_dplyr_query` which represents all the
-computations to be performed:
-
-``` r
-result
-#> Table (query)
-#> name: string
-#> height_in: expr
-#> mass_lbs: expr
-#>
-#> * Filter: equal(homeworld, "Tatooine")
-#> * Sorted by birth_year [desc]
-#> See $.data for the source Arrow object
-```
-
-To perform these computations and materialize the result, call
-`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
-suitable for passing to other `arrow` or `dplyr` functions:
-
-``` r
-result %>% compute()
-#> Table
-#> 10 rows x 3 columns
-#> $name <string>
-#> $height_in <double>
-#> $mass_lbs <double>
-```
-
-`collect()` returns an R `data.frame`, suitable for viewing or passing
-to other R functions for analysis or visualization:
-
-``` r
-result %>% collect()
-#> # A tibble: 10 x 3
-#>    name               height_in mass_lbs
-#>    <chr>                  <dbl>    <dbl>
-#>  1 C-3PO                   65.7    165.
-#>  2 Cliegg Lars             72.0     NA
-#>  3 Shmi Skywalker          64.2     NA
-#>  4 Owen Lars               70.1    265.
-#>  5 Beru Whitesun lars      65.0    165.
-#>  6 Darth Vader             79.5    300.
-#>  7 Anakin Skywalker        74.0    185.
-#>  8 Biggs Darklighter       72.0    185.
-#>  9 Luke Skywalker          67.7    170.
-#> 10 R5-D4                   38.2     70.5
+conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
-The `arrow` package works with most single-table `dplyr` verbs, including those
-that compute aggregates.
+In most cases installing the latest release should "just work" without 
+requiring any additional system dependencies, especially if you are using 
+Window or a Mac. For those users, CRAN hosts binary packages that contain 
+the Arrow C++ library upon which the `arrow` package relies, and no 
+additional steps should be required.
 
-```r
-sw %>%
-  group_by(species) %>%
-  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
-  collect()
-```
+There are some special cases to note:
 
-Additionally, equality joins (e.g. `left_join()`, `inner_join()`) are supported
-for joining multiple tables.
+- On Linux the installation process can sometimes be more involved because 
+CRAN does not host binaries for Linux. For more information please see the [installation guide](https://arrow.apache.org/docs/r/articles/install.html).
 
-```r
-jedi <- data.frame(
-  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
-  jedi = c(FALSE, TRUE, TRUE)
-)
-
-sw %>%
-  select(1:11) %>%
-  right_join(jedi) %>%
-  collect()
-```
+- If you are compiling `arrow` from source, please note that as of version 
+10.0.0, `arrow` requires C++17 to build. This has implications on Windows and
+CentOS 7. For Windows users it means you need to be running an R version of 
+4.0 or later. On CentOS 7, it means you need to install a newer compiler 
+than the default system compiler gcc 4.8. See the [installation details article](https://arrow.apache.org/docs/r/articles/developers/install_details.html) for guidance. Note that 
+this does not affect users who are installing a binary version of the package.
 
-Window functions (e.g. `ntile()`) are not yet
-supported. Inside `dplyr` verbs, Arrow offers support for many functions and
-operators, with common functions mapped to their base R and tidyverse
-equivalents. The [changelog](https://arrow.apache.org/docs/r/news/index.html)
-lists many of them. If there are additional functions you would like to see
-implemented, please file an issue as described in the [Getting
-help](#getting-help) section below.
+- Development versions of `arrow` are released nightly. Most users will not 
+need to install nightly builds, but if you do please see the article on [installing nightly builds]([installation guide](https://arrow.apache.org/docs/r/articles/install_nightly.html) for more information.

Review Comment:
   I think this is a typo?
   
   ```suggestion
   need to install nightly builds, but if you do please see the article on [installing nightly builds](https://arrow.apache.org/docs/r/articles/install_nightly.html) for more information.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1005378186


##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between R and Python within the same process. This vignette provides a brief overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of functionality that is better supported in Python than in R at the current state of development. For example, at one point in time the R `arrow` package didn't support `concat_arrays()` but PyArrow did, so this would have been a good use case at that time. At the time of current writing PyArrow has more comprehensive support for [Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- but see `vignette("flight", package = "arrow")` -- so that would be another instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass data objects between R and Python. With large data sets, it can be quite costly -- in terms of time and CPU cycles -- to perform the copy and covert operations required to translate a native data structure in R (e.g., a data frame) to an analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. Because Arrow data objects such as Tables have the same in-memory format in R and Python, it is possible to perform "zero-copy" data transfers, in which only the metadata needs to be passed between languages. As illustrated later, this drastically improves performance. 
 
-## Installing
+## Installing PyArrow
 
-To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
-To install it in a virtualenv,
+To use Arrow in Python, the `pyarrow` library needs to be installed. For example, you may wish to create a Python [virtual environment](https://docs.python.org/3/library/venv.html) with the `pyarrow` library. A virtual environment is a specific Python installation created for one project or purpose. It is a good practice to use specific environments in Python so that updating a package doesn't impact packages in other projects.

Review Comment:
   nit: would change "with" to "containing" or similar



##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between R and Python within the same process. This vignette provides a brief overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of functionality that is better supported in Python than in R at the current state of development. For example, at one point in time the R `arrow` package didn't support `concat_arrays()` but PyArrow did, so this would have been a good use case at that time. At the time of current writing PyArrow has more comprehensive support for [Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- but see `vignette("flight", package = "arrow")` -- so that would be another instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass data objects between R and Python. With large data sets, it can be quite costly -- in terms of time and CPU cycles -- to perform the copy and covert operations required to translate a native data structure in R (e.g., a data frame) to an analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. Because Arrow data objects such as Tables have the same in-memory format in R and Python, it is possible to perform "zero-copy" data transfers, in which only the metadata needs to be passed between languages. As illustrated later, this drastically improves performance. 
 
-## Installing
+## Installing PyArrow
 
-To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
-To install it in a virtualenv,
+To use Arrow in Python, the `pyarrow` library needs to be installed. For example, you may wish to create a Python [virtual environment](https://docs.python.org/3/library/venv.html) with the `pyarrow` library. A virtual environment is a specific Python installation created for one project or purpose. It is a good practice to use specific environments in Python so that updating a package doesn't impact packages in other projects.
+
+You can perform the set up from within R. Let's suppose you want to call your virtual environment something like `my-pyarrow-env`. Your setup code would look like this: 
 
 ```r
-library(reticulate)
-virtualenv_create("arrow-env")
-install_pyarrow("arrow-env")
+virtualenv_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
 ```
 
-If you want to install a development version of `pyarrow`,
-add `nightly = TRUE`:
+If you want to install a development version of `pyarrow` to the virtual environment, add `nightly = TRUE` to the `install_pyarrow()` command:
 
 ```r
-install_pyarrow("arrow-env", nightly = TRUE)
+install_pyarrow("my-pyarrow-env", nightly = TRUE)
 ```
 
-A virtualenv or a virtual environment is a specific Python installation
-created for one project or purpose. It is a good practice to use
-specific environments in Python so that updating a package doesn't
-impact packages in other projects.
+Note that you don't have to use virtual environments. If you prefer [conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/environments.html), you can use this setup code:
 
-`install_pyarrow()` also works with `conda` environments
-(`conda_create()` instead of `virtualenv_create()`).
+```r
+conda_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
+```
 
-For more on installing and configuring Python,
-see the [reticulate docs](https://rstudio.github.io/reticulate/articles/python_packages.html).
+To learn more about installing and configuring Python from R,
+see the [reticulate documentation](https://rstudio.github.io/reticulate/articles/python_packages.html), which discusses the topic in more detail.
 
-## Using
+## Importing PyArrow
 
-To start, load `arrow` and `reticulate`, and then import `pyarrow`.
+Assuming that `arrow` and `reticulate` are both loaded in R, your first step is to make sure that the correct Python environment is being used. To do that, use a command like this:
+
+```r
+use_virtualenv("my-pyarrow-env") # virtualenv users
+use_condaenv("my-pyarrow-env")   # conda users
+```

Review Comment:
   Would it be worth splitting these out into separate code chunks so that it's clear for people who are skim-reading or blindly copying-and-pasting (I know I'm guilty of that a lot) that they only actually need to run one or the other based on how they've done their setup?



##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between R and Python within the same process. This vignette provides a brief overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of functionality that is better supported in Python than in R at the current state of development. For example, at one point in time the R `arrow` package didn't support `concat_arrays()` but PyArrow did, so this would have been a good use case at that time. At the time of current writing PyArrow has more comprehensive support for [Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- but see `vignette("flight", package = "arrow")` -- so that would be another instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass data objects between R and Python. With large data sets, it can be quite costly -- in terms of time and CPU cycles -- to perform the copy and covert operations required to translate a native data structure in R (e.g., a data frame) to an analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. Because Arrow data objects such as Tables have the same in-memory format in R and Python, it is possible to perform "zero-copy" data transfers, in which only the metadata needs to be passed between languages. As illustrated later, this drastically improves performance. 
 
-## Installing
+## Installing PyArrow
 
-To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
-To install it in a virtualenv,
+To use Arrow in Python, the `pyarrow` library needs to be installed. For example, you may wish to create a Python [virtual environment](https://docs.python.org/3/library/venv.html) with the `pyarrow` library. A virtual environment is a specific Python installation created for one project or purpose. It is a good practice to use specific environments in Python so that updating a package doesn't impact packages in other projects.
+
+You can perform the set up from within R. Let's suppose you want to call your virtual environment something like `my-pyarrow-env`. Your setup code would look like this: 
 
 ```r
-library(reticulate)
-virtualenv_create("arrow-env")
-install_pyarrow("arrow-env")
+virtualenv_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
 ```
 
-If you want to install a development version of `pyarrow`,
-add `nightly = TRUE`:
+If you want to install a development version of `pyarrow` to the virtual environment, add `nightly = TRUE` to the `install_pyarrow()` command:
 
 ```r
-install_pyarrow("arrow-env", nightly = TRUE)
+install_pyarrow("my-pyarrow-env", nightly = TRUE)
 ```
 
-A virtualenv or a virtual environment is a specific Python installation
-created for one project or purpose. It is a good practice to use
-specific environments in Python so that updating a package doesn't
-impact packages in other projects.
+Note that you don't have to use virtual environments. If you prefer [conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/environments.html), you can use this setup code:
 
-`install_pyarrow()` also works with `conda` environments
-(`conda_create()` instead of `virtualenv_create()`).
+```r
+conda_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
+```
 
-For more on installing and configuring Python,
-see the [reticulate docs](https://rstudio.github.io/reticulate/articles/python_packages.html).
+To learn more about installing and configuring Python from R,
+see the [reticulate documentation](https://rstudio.github.io/reticulate/articles/python_packages.html), which discusses the topic in more detail.
 
-## Using
+## Importing PyArrow
 
-To start, load `arrow` and `reticulate`, and then import `pyarrow`.
+Assuming that `arrow` and `reticulate` are both loaded in R, your first step is to make sure that the correct Python environment is being used. To do that, use a command like this:
+
+```r
+use_virtualenv("my-pyarrow-env") # virtualenv users
+use_condaenv("my-pyarrow-env")   # conda users
+```
+
+Once you have done this, the next step is to import `pyarrow` into the Python session as shown below:
 
 ```r
-library(arrow)
-library(reticulate)
-use_virtualenv("arrow-env")
 pa <- import("pyarrow")
 ```
 
-The arrow R package include support for sharing Arrow `Array` and `RecordBatch`
-objects in-process between R and Python. For example, let's create an `Array`
-in pyarrow.
+Executing this command in R is the equivalent of the following import in Python:
+
+```python
+import pyarrow as pa
+```
+
+It may be a good idea to check your `pyarrow` version too, as shown below:
+
+```r
+pa$`__version__`
+```
+
+```
+## [1] "8.0.0"
+```
+
+Support for passing data to and from R is included in `pyarrow` versions 0.17 and greater.
+
+## Using PyArrow
+
+You can use the `reticulate` function `r_to_py()` to pass objects from R to Python, and similarly you can use `py_to_r()` to pull objects from the Python session into R. To illustrate this, let's create two objects in R: `df_random` is an R data frame containing 100 million rows of random data, and `tb_random` is the same data stored as an Arrow Table: 
+
+```r
+set.seed(1234)
+nrows <- 10^8
+df_random <- data.frame(
+  x = rnorm(nrows), 
+  y = rnorm(nrows),
+  subset = sample(10, nrows, replace = TRUE)
+)
+tb_random <- arrow_table(df_random)
+```
+
+Transferring the data from R to Python without Arrow is a time-consuming process because the underlying object has to be copied and converted to a Python data structure:
+
+```r
+system.time({
+  df_py <- r_to_py(df_random)
+})
+```
+
+```
+##   user  system elapsed 
+##  0.307   5.172   5.529 
+```
+
+In contrast, sending the Arrow Table across happens almost instantaneously:
+
+```r
+system.time({
+  tb_py <- r_to_py(tb_random)
+})
+```
+
+```
+##   user  system elapsed 
+##  0.004   0.000   0.003 
+```
+

Review Comment:
   This is brilliant, I love how you really draw out the motivation here.



##########
r/vignettes/python.Rmd:
##########
@@ -113,40 +185,12 @@ a_and_b
 
 Now you have a single Array in R.
 
-## How this works
+## Futher reading

Review Comment:
   ```suggestion
   ## Further reading
   ```



##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between R and Python within the same process. This vignette provides a brief overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of functionality that is better supported in Python than in R at the current state of development. For example, at one point in time the R `arrow` package didn't support `concat_arrays()` but PyArrow did, so this would have been a good use case at that time. At the time of current writing PyArrow has more comprehensive support for [Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- but see `vignette("flight", package = "arrow")` -- so that would be another instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass data objects between R and Python. With large data sets, it can be quite costly -- in terms of time and CPU cycles -- to perform the copy and covert operations required to translate a native data structure in R (e.g., a data frame) to an analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. Because Arrow data objects such as Tables have the same in-memory format in R and Python, it is possible to perform "zero-copy" data transfers, in which only the metadata needs to be passed between languages. As illustrated later, this drastically improves performance. 

Review Comment:
   This section is a huge improvement to the vignette IMO



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1016380400


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,223 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+For developers interested in learning more about the package structure, see the [developer guide](./developing.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
+
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
+
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
+
+## Converting Tables to data frames
+
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
+
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required. As an example, Arrow does not have an analog of the POSIXlt class: date/time data expressed as POSIXlt objects will preserve the internal list structure of the POSIXlt object, but it will arrive in Arrow as a list. As a consequence users need to decide if what needs to be preserved is the timestamp (in which case it is better to coerce to POSIXlt to POSIXct before translating to Arrow), or if a POSIXlt-style list is preferable. Similar issues exist in the other direction: Arrow dictionary objects are a little more flexible than R factors, for instance. 

Review Comment:
   I don't know how much is desirable either (we should think about why it's doing it like this, and what (if anything) the alternative might look like).  Can we relegate the "weird" cases to other vignettes and keep things relatively straightforward here?  Maybe just allude to the fact that it's more complex for POSIXlt and link to where that's mentioned, or pick a simpler example for this vignette if there is one.
   
   Thank you for surfacing this though, it's great to see these rough edges exposed.  Mind opening up a JIRA asking if this is desired behaviour? I think this warrants more discussion. Feel free to tag me in it and/or link to this comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1011838318


##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |

Review Comment:
   We can use the convenience function `as_arrow_array()` to create Array objects from vectors.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1012363970


##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | `chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | `record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | `arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | `open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of the same name in the underlying Arrow C++ library. It is also worth mentioning that the `arrow` package also defines classes that do not exist in the C++ library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects
+
+In addition to these data objects, `arrow` defines the following classes for representing metadata:
+
+- A `Schema` is a list of `Field` objects used to describe the structure of a tabular data object; where
+- A `Field` specifies a character string name and a `DataType`; and
+- A `DataType` is an attribute controlling how values are represented
+
+To learn more about the metadata classes, see the [metadata article](./metadata.html).
+
+## Scalars
+
+A Scalar object is simply a single value that can be of any type. It might be an integer, a string, a timestamp, or any of the different `DataType` objects that Arrow supports. Most users of the `arrow` R package are unlikely to create Scalars directly, but should there be a need you can do this by calling the `Scalar$create()` method:
+
+```{r}
+Scalar$create("hello")
+```
+
+
+## Arrays
+
+Array objects are ordered sets of Scalar values. As with Scalars most users will not need to create Arrays directly, but if the need arises there is an `Array$create()` method that allows you to create new Arrays:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+```{r}
+string_array <- Array$create(c("hello", "amazing", "and", "cruel", "world"))
+string_array
+```
+
+An Array can be subset using square brackets as shown below:
+
+```{r}
+string_array[4:5]
+```
+
+Arrays are immutable objects: once an Array has been created it cannot be modified or extended. 
+
+## Chunked Arrays
+
+In practice, most users of the `arrow` R package are likely to use Chunked Arrays rather than simple Arrays. Under the hood, a Chunked Array is a collection of one or more Arrays that can be indexed _as if_ they were a single Array. The reasons that Arrow provides this functionality are described in the [data object layout article](./developers/data_object_layout.html) but for the present purposes it is sufficient to notice that Chunked Arrays behave like Arrays in regular data analysis.
+
+To illustrate, let's use the `chunked_array()` function:
+
+```{r}
+chunked_string_array <- chunked_array(
+  string_array,
+  c("I", "love", "you")
+)
+```
+
+The `chunked_array()` function is just a wrapper around the functionality that `ChunkedArray$create()` provides. Let's print the object:
+
+```{r}
+chunked_string_array
+```
+
+The double bracketing in this output is intended to highlight the fact that Chunked Arrays are wrappers around one or more Arrays. However, although comprised of multiple distinct Arrays, a Chunked Array can be indexed as if they were laid end-to-end in a single "vector-like" object. This is illustrated below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_indexing.png")
+```
+
+We can use `chunked_string_array` to illustrate this: 
+
+```{r}
+chunked_string_array[4:7]
+```
+
+An important thing to note is that "chunking" is not semantically meaningful. It is an implementation detail only: users should never treat the chunk as a meaningful unit. Writing the data to disk, for example, often results in the data being organized into different chunks. Similarly, two Chunked Arrays that contain the same values assigned to different chunks are deemed equivalent. To illustrate this we can create a Chunked Array that contains the same four same four values as `chunked_string_array[4:7]`, but organized into one chunk rather than split into two:
+
+```{r}
+cruel_world <- chunked_array(c("cruel", "world", "I", "love"))
+cruel_world
+```
+
+Testing for equality using `==` produces an element-wise comparison, and the result is a new Chunked Array of four (boolean type) `true` values:
+
+```{r}
+cruel_world == chunked_string_array[4:7]
+```
+
+In short, the intention is that users interact with Chunked Arrays as if they are ordinary one-dimensional data structures without ever having to think much about the underlying chunking arrangement. 
+
+Chunked Arrays are mutable, in a specific sense: Arrays can be added and removed from a Chunked Array.
+
+## Record Batches
+
+A Record Batch is tabular data structure comprised of named Arrays. Record Batches are a fundamental unit for data interchange in Arrow, but are not typically used for data analysis. Tables and Datasets are usually more convenient in analytic contexts.
+
+These Arrays can be of different types but must all be the same length. Each Array is referred to as one of the "fields" or "columns" of the Record Batch. You can create a Record Batch using the `record_batch()` function or by using the `RecordBatch$create()` method. These functions are flexible and can accept inputs in several formats: you can pass a data frame, one or more named vectors, an input stream, or even a raw vector containing appropriate binary data. For example:
+
+```{r}
+rb <- record_batch(
+  strs = string_array, 
+  ints = integer_array,
+  dbls = c(1.1, 3.2, 0.2, NA, 11)
+)
+rb
+```
+
+This is a Record Batch containing 5 rows and 3 columns, and its conceptual structure is shown below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./record_batch.png")
+```
+
+The `arrow` package supplies a `$` method for Record Batch objects, used to extract a single column by name:
+
+```{r}
+rb$strs
+```
+
+You can use double brackets `[[` to refer to columns by position. The `rb$ints` array is the second column in our Record Batch so we can extract it with this:
+
+```{r}
+rb[[2]]
+```
+
+There is also `[` method that allows you to extract subsets of a record batch in the same way you would for a data frame. The command `rb[1:3, 1:2]` extracts the first three rows and the first two columns:
+
+```{r}
+rb[1:3, 1:2]
+```
+
+Record Batches cannot be concatenated: because they are comprised of Arrays, and Arrays are immutable objects, new rows cannot be added to Record Batch once created.
+
+## Tables
+
+A Table is comprised of named Chunked Arrays, in the same way that a Record Batch is comprised of named Arrays. You can subset Tables with `$`, `[[`, and `[` the same way you can for Record Batches. Unlike Record Batches, Tables can be concatenated (because they are comprised of Chunked Arrays). Suppose a second Record Batch arrives:
+
+```{r}
+new_rb <- record_batch(
+  strs = c("I", "love", "you"), 
+  ints = c(5L, 0L, 0L),
+  dbls = c(7.1, -0.1, 2)
+)
+```
+
+It is not possible to create a Record Batch that appends the data from `new_rb` to the data in `rb`, not without creating entirely new objects in memory. With Tables, however, we can:
+
+```{r}
+df <- arrow_table(rb)
+new_df <- arrow_table(new_rb)
+```
+
+We now have the two fragments of the data set represented as Tables. The difference between the Table and the Record Batch is that the columns are all represented as Chunked Arrays. Each Array from the original Record Batch is one chunk in the corresponding Chunked Array in the Table:
+
+```{r}
+rb$strs
+df$strs
+```
+
+It's the same underlying data -- and indeed the same immutable Array is referenced by both -- just enclosed by a new, flexible Chunked Array wrapper. However, it is this wrapper that allows us to concatenate Tables:
+
+```{r}
+concat_tables(df, new_df)
+```
+
+The resulting object is shown schematically below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./table.png")
+```
+

Review Comment:
   Yes absolutely. The only reason its missing is that I still don't understand what Dataset objects actually are under the hood!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1012364256


##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |

Review Comment:
   Ah awesome I somehow did not know this. Will add it!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1012480175


##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | `chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | `record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | `arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | `open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of the same name in the underlying Arrow C++ library. It is also worth mentioning that the `arrow` package also defines classes that do not exist in the C++ library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects

Review Comment:
   Sounds good to me: I chatted with @jonkeane this morning about it too and they independently made the same suggestion which makes me think moving these details to dev vignettes is the right move



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1292688946

   
   > All that said though, you could make the argument that whilst navigating between docs for different versions of a function may be desirable, it might be less important for vignettes, and so we could just link to the articles index page in those cases.
   
   After thinking about it for a little I'm starting to think this might be the best approach to vignette linking across versions. The structure of the vignettes has changed quite a bit in this rewrite, and there's not really a 1:1 mapping between old and new. I was unhappy about it because I hate breaking backwards compatibility even in the docs, but it's not like we could fix the issue with a simple file rename -- the underlying content is different enough in some cases that there's nothing useful to be gained by pretending that the old "get started" vignette is meaningfully the same document as the new one? So yeah maybe we just map to the "articles" page, since that's guaranteed to be structurally the same thing in every release?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1023528743


##########
r/vignettes/data_wrangling.Rmd:
##########
@@ -0,0 +1,172 @@
+---
+title: "Data analysis with dplyr syntax"
+description: >
+  Learn how to use the `dplyr` backend supplied by `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides a `dplyr` back end that allows users to manipulate tabular Arrow data (`Table` and `Dataset` objects) using familiar `dplyr` syntax. To use this functionality, make sure that the `arrow` and `dplyr` packages are both loaded. In this article we will take the `starwars` data set included in `dplyr`, convert it to an Arrow Table, and then analyze this data. Note that, although these examples all use an in-memory `Table` object, the same functionality works for an on-disk `Dataset` object with only minor differences in behavior (documented later in the article).

Review Comment:
   Hm yeah. Okay, what I've done is refer only to "functionality". It's still ambiguous, but nobody expects that term to be precise. I figure that's better than trying to describe what the arrow/dplyr bindings actually are in the opening sentence!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1022687152


##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data types, and many data types that do not have a counterpart in R. This article describes the Arrow type system, compares it to R data types, and outlines the default mappings used when data are transferred from Arrow to R. At the end of the article there are two lookup tables: one describing the default "R to Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the differences between the output when obtain we use `dplyr::glimpse()` to inspect the `starwars` data in its original format -- as a data frame in R -- and the output we obtain when we convert it to an Arrow Table first by calling `arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 32-bit signed integers, so the underlying data types in Arrow and R are direct analogs of one another. In other cases the differences are purely about the implementation: Arrow and R have different ways to store a vector of strings, but at a high level of abstraction the R character type and the Arrow string type can be viewed as direct analogs. In some cases, however, there are no clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it does not have an analog of POSIXlt; converselt, while R can represent 32 bit signed integers, it does not have an equivalent of a 64 bit unsigned integer.

Review Comment:
   ```suggestion
   Some of these differences are purely cosmetic: integers in R are in fact 32-bit signed integers, so the underlying data types in Arrow and R are direct analogs of one another. In other cases the differences are purely about the implementation: Arrow and R have different ways to store a vector of strings, but at a high level of abstraction the R character type and the Arrow string type can be viewed as direct analogs. In some cases, however, there are no clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it does not have an analog of POSIXlt; conversely, while R can represent 32 bit signed integers, it does not have an equivalent of a 64 bit unsigned integer.
   ```



##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data types, and many data types that do not have a counterpart in R. This article describes the Arrow type system, compares it to R data types, and outlines the default mappings used when data are transferred from Arrow to R. At the end of the article there are two lookup tables: one describing the default "R to Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the differences between the output when obtain we use `dplyr::glimpse()` to inspect the `starwars` data in its original format -- as a data frame in R -- and the output we obtain when we convert it to an Arrow Table first by calling `arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 32-bit signed integers, so the underlying data types in Arrow and R are direct analogs of one another. In other cases the differences are purely about the implementation: Arrow and R have different ways to store a vector of strings, but at a high level of abstraction the R character type and the Arrow string type can be viewed as direct analogs. In some cases, however, there are no clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it does not have an analog of POSIXlt; converselt, while R can represent 32 bit signed integers, it does not have an equivalent of a 64 bit unsigned integer.
+
+When the `arrow` package converts between R data and Arrow data, it will first check to see if a Schema has been provided -- see `schema()` for more information -- and if none is available it will attempt to guess the appropriate type by following the default mappings. A complete listing of these mappings is provided at the end of the article, but the most common cases are depicted in the illustration below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./data_types.png")
+```
+
+In this image, black boxes refer to R data types and light blue boxes refer to Arrow data types. Directional arrows specify conversions (e.g., the bidirectional arrow between the logical R type and the boolean Arrow type means that R logicals convert to Arrow booleans and vice versa). Solid lines indicate that the this conversion rule is always the default; dashed lines mean that it only sometimes applies (the rules and special cases are described below). 
+
+## Logical/boolean types
+
+Arrow and R both use three-valued logic. In R, logical values can be `TRUE` or `FALSE`, with `NA` used to represent missing data. In Arrow, the corresponding boolean type can take values `true`, `false`, or `null`, as shown below:
+
+```{r}
+chunked_array(c(TRUE, FALSE, NA), type = boolean()) # default
+```
+
+It is not strictly necessary to set `type = boolean()` in this example because the default behavior in `arrow` is to translate R logical vectors to Arrow booleans and vice versa. However, for the sake of clarity we will specify the data types explicitly throughout this article. We will likewise use `chunked_array()` to create Arrow data from R objects and `as.vector()` to create R data from Arrow objects, but similar results are obtained if we use other methods. 
+
+## Integer types
+
+Base R natively supports only one type of integer, using 32 bits to represent signed numbers between -2147483648 and 2147483647, though R can also support 64 bit integers via the [`bit64`](https://cran.r-project.org/package=bit64) package. Arrow inherits signed and unsigned integer types from C++ in 8-bit, 16-bit, 32-bit, and 64-bit versions:
+
+| Description     | Data Type Function | Smallest Value       |        Largest Value |
+| --------------- | -----------------: | -------------------: | -------------------: |
+| 8 bit unsigned  | `uint8()`          | 0                    |                  255 |
+| 16 bit unsigned | `uint16()`         | 0                    |                65535 |
+| 32 bit unsigned | `uint32()`         | 0                    |           4294967295 |
+| 64 bit unsigned | `uint64()`         | 0                    | 18446744073709551615 |
+| 8 bit signed    | `int8()`           | -128                 |                  127 |
+| 16 bit signed   | `int16()`          | -32768               |                32767 |
+| 32 bit signed   | `int32()`          | -2147483648          |           2147483647 |
+| 64 bit signed   | `int64()`          | -9223372036854775808 |  9223372036854775807 |
+
+By default, `arrow` translates R integers to the int32 type in Arrow, but you can override this by explicitly specifying another integer type:
+
+```{r}
+chunked_array(c(10L, 3L, 200L), type = int32()) # default
+chunked_array(c(10L, 3L, 200L), type = int64())
+```
+
+If the value in R does not fall within the permissible range for the corresponding Arrow type, `arrow` throws an error:
+
+```{r, error=TRUE}
+chunked_array(c(10L, 3L, 200L), type = int8())
+```
+
+When translating from Arrow to R, integer types alway translate to R integers unless one of the following exceptions applies:
+
+- If the value of an Arrow uint32 or uint64 falls outside the range allowed for R integers, the result will be a numeric vector in R 
+- If the value of an Arrow int64 variable falls outside the range allowed for R integers, the result will be a `bit64::integer64` vector in R
+- If the user sets `options(arrow.int64_downcast = FALSE)`, the Arrow int64 type always yields a `bit64::integer64` vector in R regardless of the value
+
+## Floating point numeric types
+
+R has one double-precision (64-bit) numeric type, which translates to the Arrow 64-bit floating point type by default. Arrow supports both single-precision (32-bit) and double-precision (64-bit) floating point numbers, specified using the `float32()` and `float64()` data type functions. Both of these are translated to doubles in R. Examples are shown below:
+
+```{r}
+chunked_array(c(0.1, 0.2, 0.3), type = float64()) # default
+chunked_array(c(0.1, 0.2, 0.3), type = float32())
+
+arrow_double <- chunked_array(c(0.1, 0.2, 0.3), type = float64())
+as.vector(arrow_double)
+```
+
+Note that the Arrow specification also permits half-precision (16-bit) floating point numbers, but these have not yet been implemented. 
+
+## Fixed point decimal types
+
+Arrow also contains `decimal()` data types, in which numeric values are specified in decimal format rather than binary. Decimals in Arrow come in two varieties, a 128-bit version and a 256-bit version, but in most cases users should be able to use the more general `decimal()` data type function rather than the specific `decimal128()` and `decimal256()` functions. 
+
+The decimal types in Arrow are fixed-precision numbers (rather than floating-point), which means it is necessary to explicitly specify the `precision` and `scale` arguments:
+
+- `precision` specifies the number of significant digits to store.
+- `scale` specifies the number of digits that should be stored after the decimal point. If you set `scale = 2`, exactly two digits will be stored after the decimal point. If you set `scale = 0`, values will be rounded to the nearest whole number. Negative scales are also permitted (handy when dealing with extremely large numbers), so `scale = -2` stores the value to the nearest 100.
+
+Because R does not have any way to create decimal types natively, the example below is a little circuitous. First we create some floating point numbers as Chunked Arrays, and then explicitly cast these to decimal types within Arrow. This is possible because Chunked Array objects possess a `cast()` method:
+
+```{r}
+arrow_floating <- chunked_array(c(.01, .1, 1, 10, 100))
+arrow_decimals <- arrow_floating$cast(decimal(precision = 5, scale = 2))
+arrow_decimals
+```
+
+Though not natively used in R, decimal types can be useful in situations where it is especially important to avoid problems that arise in floating point arithmetic.
+
+## String/character types
+
+R uses a single character type to represent strings whereas Arrow has two types. In the Arrow C++ library these types are referred to as strings and large_strings, but to avoid ambiguity in the `arrow` R package they are defined using the `utf8()` and `large_utf8()` data type functions. The distinction between these two Arrow types is unlikely to be important for R users, though the difference is discussed in the article on [data object layout](./developers/data_object_layout.html). 
+
+The default behavior is to translate R character vectors to the utf8/string type, and to translate both Arrow types to R character vectors:
+
+```{r}
+strings <- chunked_array(c("oh", "well", "whatever"))
+strings
+as.vector(strings)
+```
+
+## Factor/dictionary types
+
+The analog of R factors in Arrow is the dictionary type. Factors translate to dictionaries and vice versa. To illustrate this, let's create a small factor object in R:
+
+```{r}
+fct <- factor(c("cat", "dog", "pig", "dog"))
+fct
+```
+
+When translated to Arrow, this is the dictionary that results:
+
+```{r}
+dict <- chunked_array(fct, type = dictionary())
+dict
+```
+
+When translated back to R, we recover the original factor:
+
+```{r}
+as.vector(dict)
+```
+
+Arrow dictionaries are slightly more flexible than R factors: values in a dictionary do not necessarily have to be strings, but labels in a factor do. As a consequence, non-string values in an Arrow dictionary are coerced to strings when translated to R.
+
+## Date types
+
+In R, dates are typically represented using the Date class. Internally a Date object is a numeric type whose value counts the number of days since the beginning of the unix epoch (1 January 1970). Arrow supplies two data types that can be used to represent dates: the date32 type and the date64 type. The date32 type is similar to the Date class in R: internally it stores a 32-bit integer that counts the number of days since 1 January 1970. The default in `arrow` is to translate R Date objects to Arrow date32 types:

Review Comment:
   ```suggestion
   In R, dates are typically represented using the Date class. Internally a Date object is a numeric type whose value counts the number of days since the beginning of the Unix epoch (1 January 1970). Arrow supplies two data types that can be used to represent dates: the date32 type and the date64 type. The date32 type is similar to the Date class in R: internally it stores a 32-bit integer that counts the number of days since 1 January 1970. The default in `arrow` is to translate R Date objects to Arrow date32 types:
   ```



##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data types, and many data types that do not have a counterpart in R. This article describes the Arrow type system, compares it to R data types, and outlines the default mappings used when data are transferred from Arrow to R. At the end of the article there are two lookup tables: one describing the default "R to Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the differences between the output when obtain we use `dplyr::glimpse()` to inspect the `starwars` data in its original format -- as a data frame in R -- and the output we obtain when we convert it to an Arrow Table first by calling `arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 32-bit signed integers, so the underlying data types in Arrow and R are direct analogs of one another. In other cases the differences are purely about the implementation: Arrow and R have different ways to store a vector of strings, but at a high level of abstraction the R character type and the Arrow string type can be viewed as direct analogs. In some cases, however, there are no clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it does not have an analog of POSIXlt; converselt, while R can represent 32 bit signed integers, it does not have an equivalent of a 64 bit unsigned integer.
+
+When the `arrow` package converts between R data and Arrow data, it will first check to see if a Schema has been provided -- see `schema()` for more information -- and if none is available it will attempt to guess the appropriate type by following the default mappings. A complete listing of these mappings is provided at the end of the article, but the most common cases are depicted in the illustration below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./data_types.png")
+```
+
+In this image, black boxes refer to R data types and light blue boxes refer to Arrow data types. Directional arrows specify conversions (e.g., the bidirectional arrow between the logical R type and the boolean Arrow type means that R logicals convert to Arrow booleans and vice versa). Solid lines indicate that the this conversion rule is always the default; dashed lines mean that it only sometimes applies (the rules and special cases are described below). 

Review Comment:
   I don't recall what the plan was regarding the png files here, but a few comments on them:
   
   - the unidirectional and bidirectional arrows are really effective for simplifying the explanation here
   - can we make the line width and arrow head sizes larger?
   - the dashed lines seem to merge together and are hard to interpret
   



##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data types, and many data types that do not have a counterpart in R. This article describes the Arrow type system, compares it to R data types, and outlines the default mappings used when data are transferred from Arrow to R. At the end of the article there are two lookup tables: one describing the default "R to Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the differences between the output when obtain we use `dplyr::glimpse()` to inspect the `starwars` data in its original format -- as a data frame in R -- and the output we obtain when we convert it to an Arrow Table first by calling `arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 32-bit signed integers, so the underlying data types in Arrow and R are direct analogs of one another. In other cases the differences are purely about the implementation: Arrow and R have different ways to store a vector of strings, but at a high level of abstraction the R character type and the Arrow string type can be viewed as direct analogs. In some cases, however, there are no clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it does not have an analog of POSIXlt; converselt, while R can represent 32 bit signed integers, it does not have an equivalent of a 64 bit unsigned integer.

Review Comment:
   ```suggestion
   Some of these differences are purely cosmetic: integers in R are in fact 32-bit signed integers, so the underlying data types in Arrow and R are direct analogs of one another. In other cases the differences are purely about the implementation: Arrow and R have different ways to store a vector of strings, but at a high level of abstraction the R character type and the Arrow string type can be viewed as direct analogs. In some cases, however, there are no clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it does not have an analog of POSIXlt; converselt, while R can natively represent 32 bit signed integers, it does not have an equivalent of a 64 bit unsigned integer.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028813972


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,163 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with arrow 
+output: rmarkdown::html_vignette
+---
+
+The arrow package provides functions for reading single data files into memory, in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (formerly called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the arrow package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in arrow, see the [cloud storage article](./fs.html).
+
+The arrow package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 

Review Comment:
   ```suggestion
   This enables analysis and processing of larger-than-memory data, and provides 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028627482


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,221 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+For developers interested in learning more about the package structure, see the [developer guide](./developing.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
 
+It is possible to exercise fine grained control over this conversion process. To learn more about the different types and how they are converted, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Arrow (also called Feather) that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 
 
+### Individual files
 
-### Arrow to R
+When the goal is to read a single data file, there are several functions you can use:
 
-| Arrow type        | R type                       |
-|-------------------|------------------------------|
-| boolean           | logical                      |
-| int8              | integer                      |
-| int16             | integer                      |
-| int32             | integer                      |
-| int64             | integer^1^                   |
-| uint8             | integer                      |
-| uint16            | integer                      |
-| uint32            | integer^1^                   |
-| uint64            | integer^1^                   |
-| float16           | -^2^                         |
-| float32           | double                       |
-| float64           | double                       |
-| utf8              | character                    |
-| large_utf8        | character                    |
-| binary            | arrow_binary ^3^             |
-| large_binary      | arrow_large_binary ^3^       |
-| fixed_size_binary | arrow_fixed_size_binary ^3^  |
-| date32            | Date                         |
-| date64            | POSIXct                      |
-| time32            | hms::hms                     |
-| time64            | hms::hms                     |
-| timestamp         | POSIXct                      |
-| duration          | difftime                     |
-| decimal           | double                       |
-| dictionary        | factor^4^                    |
-| list              | arrow_list ^5^               |
-| large_list        | arrow_large_list ^5^         |
-| fixed_size_list   | arrow_fixed_size_list ^5^    |
-| struct            | data.frame                   |
-| null              | vctrs::vctrs_unspecified     |
-| map               | arrow_list ^5^               |
-| union             | -^2^                         |
-
-^1^: These integer types may contain values that exceed the range of R's 
-`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are 
-converted to `double` ("numeric") and `int64` is converted to 
-`bit64::integer64`. This conversion can be disabled (so that `int64` always
-yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Arrow/Feather format
+-   `read_delim_arrow()`: read a delimited text file 
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-^2^: Some Arrow data types do not currently have an R equivalent and will raise an error
-if cast to or mapped to via a schema.
+In every case except JSON, there is a corresponding `write_*()` function 
+that allows you to write data files in the appropriate format. 
 
-^3^: `arrow*_binary` classes are implemented as lists of raw vectors. 
+By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`. 
 
-^4^: Due to the limitation of R factors, Arrow `dictionary` values are coerced
-to string when translated to R if they are not already strings.
+In the example below, we take the `starwars` data provided by the `dplyr` package and write it to a Parquet file using `write_parquet()`
 
-^5^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` 
-with a `ptype` attribute set to what an empty Array of the value type converts to. 
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+
+file_path <- tempfile(fileext = ".parquet")
+write_parquet(starwars, file_path)
+```
 
+We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`) but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):
+
+```{r}
+sw_frame <- read_parquet(file_path)
+sw_table <- read_parquet(file_path, as_data_frame = FALSE)
+sw_table
+```
+
+To learn more about reading and writing individual data files, see the [read/write article](./read_write.html).
+
+### Multi-file data sets
+
+When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The `arrow` package provides a convenient way to read, write, and analyze with data stored in this fashion using the Dataset interface. 

Review Comment:
   I know this is the Multi-file section, but wonder if including that you can read+analyze a single and too-large-for-memory file here might be useful?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1322947133

   At long last I've moved this out of draft status 😁 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on pull request #14514: ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1325701704

   @thisisnic I think all the linting issues are now fixed. And yes, I agree this PR is too big to review easily: I'll work harder at splitting into separate PRs in future 🙂 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1291653378

   > The proposed edits to the vignettes include a lot more code that is executed at build time. This may not be desirable?
   
   Could [ARROW-17655](https://issues.apache.org/jira/browse/ARROW-17655) help with this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1291647638

   > I have not yet checked whether the bootstrap 5 template breaks the script inserting the documentation versions switcher
   
   I'd expect it will (see discussion on https://github.com/apache/arrow/pull/12531).  Happy to take a look at that as I did the original version switcher stuff (equally, happy to leave it with you, just let me know).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1291896244

   @djnavarro Are all of the "moves discussion of..." items copied verbatim?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1016078636


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets
+-   Read JSON files
 
-Conda users can install `arrow` from conda-forge with
+It provides data analysis tools for both in-memory and larger-than-memory data sets
 
-``` shell
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+-   Analyze and process larger-than-memory datasets
+-   Manipulate and analyze Arrow data with `dplyr` verbs
 
-Installing a released version of the `arrow` package requires no
-additional system dependencies. For macOS and Windows, CRAN hosts binary
-packages that contain the Arrow C++ library. On Linux, source package
-installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable
-`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details.
+It provides access to remote filesystems and servers
 
-As of version 10.0.0, `arrow` requires C++17 to build. This means that:
+-   Read and write files in Amazon S3 and Google Cloud Storage buckets
+-   Connect to Arrow Flight servers to transport large datasets over networks  
+    
+Additional features include:
 
-* On Windows, you need `R >= 4.0`. Version 9.0.0 was the last version to support
-R 3.6.
-* On CentOS 7, you can build the latest version of `arrow`,
-but you first need to install a newer compiler than the default system compiler,
-gcc 4.8. See `vignette("install", package = "arrow")` for guidance.
-Note that you only need the newer compiler to build `arrow`:
-installing a binary package, as from RStudio Package Manager,
-or loading a package you've already installed works fine with the system defaults.
+-   Zero-copy data sharing between R and Python
+-   Fine control over column types to work seamlessly
+    with databases and data warehouses
+-   Support for compression codecs including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2
+-   Access and manipulate Arrow objects through low-level bindings
+    to the C++ library
+-   Toolkit for building connectors to other applications
+    and services that use Arrow
 
-### Installing a development version
+## Installation
 
-Development versions of the package (binary and source) are built
-nightly and hosted at <https://nightlies.apache.org/arrow/r/>. To
-install from there:
+Most R users will probably want to install the latest release of `arrow` 
+from CRAN:
 
 ``` r
-install.packages("arrow", repos = c(arrow = "https://nightlies.apache.org/arrow/r", getOption("repos")))
+install.packages("arrow")
 ```
 
-Conda users can install `arrow` nightly builds with
+Alternatively, if you are using conda you can install `arrow` from conda-forge:
 
 ``` shell
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-If you already have a version of `arrow` installed, you can switch to
-the latest nightly development version with
-
-``` r
-arrow::install_arrow(nightly = TRUE)
-```
-
-These nightly package builds are not official Apache releases and are
-not recommended for production use. They may be useful for testing bug
-fixes and new features under active development.
-
-## Usage
-
-Among the many applications of the `arrow` package, two of the most accessible are:
-
--   High-performance reading and writing of data files with multiple
-    file formats and compression codecs, including built-in support for
-    cloud storage
--   Analyzing and manipulating bigger-than-memory data with `dplyr`
-    verbs
-
-The sections below describe these two uses and illustrate them with
-basic examples. The sections below mention two Arrow data structures:
-
--   `Table`: a tabular, column-oriented data structure capable of
-    storing and processing large amounts of data more efficiently than
-    R’s built-in `data.frame` and with SQL-like column data types that
-    afford better interoperability with databases and data warehouse
-    systems
--   `Dataset`: a data structure functionally similar to `Table` but with
-    the capability to work on larger-than-memory data partitioned across
-    multiple files
-
-### Reading and writing data files with `arrow`
-
-The `arrow` package provides functions for reading single data files in
-several common formats. By default, calling any of these functions
-returns an R `data.frame`. To return an Arrow `Table`, set argument
-`as_data_frame = FALSE`.
-
--   `read_parquet()`: read a file in Parquet format
--   `read_feather()`: read a file in Feather format (the Apache Arrow
-    IPC format)
--   `read_delim_arrow()`: read a delimited text file (default delimiter
-    is comma)
--   `read_csv_arrow()`: read a comma-separated values (CSV) file
--   `read_tsv_arrow()`: read a tab-separated values (TSV) file
--   `read_json_arrow()`: read a JSON data file
-
-For writing data to single files, the `arrow` package provides the
-functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
-These can be used with R `data.frame` and Arrow `Table` objects.
-
-For example, let’s write the Star Wars characters data that’s included
-in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
-choice for storing analytic data; it is optimized for reduced file sizes
-and fast read performance, especially for column-based access patterns.
-Parquet is widely supported by many tools and platforms.
-
-First load the `arrow` and `dplyr` packages:
-
-``` r
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-Then write the `data.frame` named `starwars` to a Parquet file at
-`file_path`:
-
-``` r
-file_path <- tempfile()
-write_parquet(starwars, file_path)
-```
-
-Then read the Parquet file into an R `data.frame` named `sw`:
-
-``` r
-sw <- read_parquet(file_path)
-```
-
-R object attributes are preserved when writing data to Parquet or
-Feather files and when reading those files back into R. This enables
-round-trip writing and reading of `sf::sf` objects, R `data.frame`s with
-with `haven::labelled` columns, and `data.frame`s with other custom
-attributes.
-
-For reading and writing larger files or sets of multiple files, `arrow`
-defines `Dataset` objects and provides the functions `open_dataset()`
-and `write_dataset()`, which enable analysis and processing of
-bigger-than-memory data, including the ability to partition data into
-smaller chunks without loading the full data into memory. For examples
-of these functions, see `vignette("dataset", package = "arrow")`.
-
-All these functions can read and write files in the local filesystem or
-in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
-details, see `vignette("fs", package = "arrow")`
-
-### Using `dplyr` with `arrow`
-
-The `arrow` package provides a `dplyr` backend enabling manipulation of
-Arrow tabular data with `dplyr` verbs. To use it, first load both
-packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
-`Dataset` object. For example, read the Parquet file written in the
-previous example into an Arrow `Table` named `sw`:
-
-``` r
-sw <- read_parquet(file_path, as_data_frame = FALSE)
-```
-
-Next, pipe on `dplyr` verbs:
-
-``` r
-result <- sw %>%
-  filter(homeworld == "Tatooine") %>%
-  rename(height_cm = height, mass_kg = mass) %>%
-  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
-  arrange(desc(birth_year)) %>%
-  select(name, height_in, mass_lbs)
-```
-
-The `arrow` package uses lazy evaluation to delay computation until the
-result is required. This speeds up processing by enabling the Arrow C++
-library to perform multiple computations in one operation. `result` is
-an object with class `arrow_dplyr_query` which represents all the
-computations to be performed:
-
-``` r
-result
-#> Table (query)
-#> name: string
-#> height_in: expr
-#> mass_lbs: expr
-#>
-#> * Filter: equal(homeworld, "Tatooine")
-#> * Sorted by birth_year [desc]
-#> See $.data for the source Arrow object
-```
-
-To perform these computations and materialize the result, call
-`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
-suitable for passing to other `arrow` or `dplyr` functions:
-
-``` r
-result %>% compute()
-#> Table
-#> 10 rows x 3 columns
-#> $name <string>
-#> $height_in <double>
-#> $mass_lbs <double>
-```
-
-`collect()` returns an R `data.frame`, suitable for viewing or passing
-to other R functions for analysis or visualization:
-
-``` r
-result %>% collect()
-#> # A tibble: 10 x 3
-#>    name               height_in mass_lbs
-#>    <chr>                  <dbl>    <dbl>
-#>  1 C-3PO                   65.7    165.
-#>  2 Cliegg Lars             72.0     NA
-#>  3 Shmi Skywalker          64.2     NA
-#>  4 Owen Lars               70.1    265.
-#>  5 Beru Whitesun lars      65.0    165.
-#>  6 Darth Vader             79.5    300.
-#>  7 Anakin Skywalker        74.0    185.
-#>  8 Biggs Darklighter       72.0    185.
-#>  9 Luke Skywalker          67.7    170.
-#> 10 R5-D4                   38.2     70.5
+conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
-The `arrow` package works with most single-table `dplyr` verbs, including those
-that compute aggregates.
+In most cases installing the latest release should "just work" without 
+requiring any additional system dependencies, especially if you are using 
+Window or a Mac. For those users, CRAN hosts binary packages that contain 
+the Arrow C++ library upon which the `arrow` package relies, and no 
+additional steps should be required.
 
-```r
-sw %>%
-  group_by(species) %>%
-  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
-  collect()
-```
+There are some special cases to note:
 
-Additionally, equality joins (e.g. `left_join()`, `inner_join()`) are supported
-for joining multiple tables.
+- On Linux the installation process can sometimes be more involved because 
+CRAN does not host binaries for Linux. For more information please see the [installation guide](https://arrow.apache.org/docs/r/articles/install.html).
 
-```r
-jedi <- data.frame(
-  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
-  jedi = c(FALSE, TRUE, TRUE)
-)
-
-sw %>%
-  select(1:11) %>%
-  right_join(jedi) %>%
-  collect()
-```
+- If you are compiling `arrow` from source, please note that as of version 
+10.0.0, `arrow` requires C++17 to build. This has implications on Windows and
+CentOS 7. For Windows users it means you need to be running an R version of 
+4.0 or later. On CentOS 7, it means you need to install a newer compiler 
+than the default system compiler gcc 4.8. See the [installation details article](https://arrow.apache.org/docs/r/articles/developers/install_details.html) for guidance. Note that 
+this does not affect users who are installing a binary version of the package.
 
-Window functions (e.g. `ntile()`) are not yet
-supported. Inside `dplyr` verbs, Arrow offers support for many functions and
-operators, with common functions mapped to their base R and tidyverse
-equivalents. The [changelog](https://arrow.apache.org/docs/r/news/index.html)
-lists many of them. If there are additional functions you would like to see
-implemented, please file an issue as described in the [Getting
-help](#getting-help) section below.
+- Development versions of `arrow` are released nightly. Most users will not 
+need to install nightly builds, but if you do please see the article on [installing nightly builds]([installation guide](https://arrow.apache.org/docs/r/articles/install_nightly.html) for more information.
 
-For `dplyr` queries on `Table` objects, if the `arrow` package detects
-an unimplemented function within a `dplyr` verb, it automatically calls
-`collect()` to return the data as an R `data.frame` before processing
-that `dplyr` verb. For queries on `Dataset` objects (which can be larger
-than memory), it raises an error if the function is unimplemented;
-you need to explicitly tell it to `collect()`.
+## Arrow resources 
 
-### Additional features
-
-Other applications of `arrow` are described in the following vignettes:
-
--   `vignette("python", package = "arrow")`: use `arrow` and
-    `reticulate` to pass data between R and Python
--   `vignette("flight", package = "arrow")`: connect to Arrow Flight RPC
-    servers to send and receive data
--   `vignette("arrow", package = "arrow")`: access and manipulate Arrow
-    objects through low-level bindings to the C++ library
-
-The Arrow for R [cheatsheet](https://github.com/apache/arrow/blob/-/r/cheatsheet/arrow-cheatsheet.pdf) and [Cookbook](https://arrow.apache.org/cookbook/r/index.html) are additional resources for getting started with `arrow`.
+In addition to the official [Arrow R package documentation](https://arrow.apache.org/docs/r/), the [Arrow for R cheatsheet](https://github.com/apache/arrow/blob/-/r/cheatsheet/arrow-cheatsheet.pdf), and the [Apache Arrow R Cookbook](https://arrow.apache.org/cookbook/r/index.html) are useful resources for getting started with `arrow`.
 
 ## Getting help
 
 If you encounter a bug, please file an issue with a minimal reproducible
 example on the [Apache Jira issue
 tracker](https://issues.apache.org/jira/projects/ARROW/issues). Create
-an account or log in, then click **Create** to file an issue. Select the
-project **Apache Arrow (ARROW)**, select the component **R**, and begin
-the issue summary with **`[R]`** followed by a space. For more
-information, see the **Report bugs and propose features** section of the
+an account or log in, then click "Create" to file an issue. Select the
+project "Apache Arrow (ARROW)", select the component "R", and begin
+the issue summary with "[R]" followed by a space. For more
+information, see the "Report bugs and propose features" section of the
 [Contributing to Apache

Review Comment:
   I am so looking forward to this: not gonna lie jira gives me headaches



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jonkeane commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

jonkeane commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1013417565


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets
+-   Read JSON files
 
-Conda users can install `arrow` from conda-forge with
+It provides data analysis tools for both in-memory and larger-than-memory data sets
 
-``` shell
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+-   Analyze and process larger-than-memory datasets
+-   Manipulate and analyze Arrow data with `dplyr` verbs
 
-Installing a released version of the `arrow` package requires no
-additional system dependencies. For macOS and Windows, CRAN hosts binary
-packages that contain the Arrow C++ library. On Linux, source package
-installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable
-`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details.
+It provides access to remote filesystems and servers
 
-As of version 10.0.0, `arrow` requires C++17 to build. This means that:
+-   Read and write files in Amazon S3 and Google Cloud Storage buckets
+-   Connect to Arrow Flight servers to transport large datasets over networks  
+    
+Additional features include:
 
-* On Windows, you need `R >= 4.0`. Version 9.0.0 was the last version to support
-R 3.6.
-* On CentOS 7, you can build the latest version of `arrow`,
-but you first need to install a newer compiler than the default system compiler,
-gcc 4.8. See `vignette("install", package = "arrow")` for guidance.
-Note that you only need the newer compiler to build `arrow`:
-installing a binary package, as from RStudio Package Manager,
-or loading a package you've already installed works fine with the system defaults.
+-   Zero-copy data sharing between R and Python
+-   Fine control over column types to work seamlessly
+    with databases and data warehouses
+-   Support for compression codecs including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2
+-   Access and manipulate Arrow objects through low-level bindings
+    to the C++ library
+-   Toolkit for building connectors to other applications
+    and services that use Arrow
 
-### Installing a development version
+## Installation
 
-Development versions of the package (binary and source) are built
-nightly and hosted at <https://nightlies.apache.org/arrow/r/>. To
-install from there:
+Most R users will probably want to install the latest release of `arrow` 
+from CRAN:
 
 ``` r
-install.packages("arrow", repos = c(arrow = "https://nightlies.apache.org/arrow/r", getOption("repos")))
+install.packages("arrow")
 ```
 
-Conda users can install `arrow` nightly builds with
+Alternatively, if you are using conda you can install `arrow` from conda-forge:
 
 ``` shell
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-If you already have a version of `arrow` installed, you can switch to
-the latest nightly development version with
-
-``` r
-arrow::install_arrow(nightly = TRUE)
-```
-
-These nightly package builds are not official Apache releases and are
-not recommended for production use. They may be useful for testing bug
-fixes and new features under active development.
-
-## Usage
-
-Among the many applications of the `arrow` package, two of the most accessible are:
-
--   High-performance reading and writing of data files with multiple
-    file formats and compression codecs, including built-in support for
-    cloud storage
--   Analyzing and manipulating bigger-than-memory data with `dplyr`
-    verbs
-
-The sections below describe these two uses and illustrate them with
-basic examples. The sections below mention two Arrow data structures:
-
--   `Table`: a tabular, column-oriented data structure capable of
-    storing and processing large amounts of data more efficiently than
-    R’s built-in `data.frame` and with SQL-like column data types that
-    afford better interoperability with databases and data warehouse
-    systems
--   `Dataset`: a data structure functionally similar to `Table` but with
-    the capability to work on larger-than-memory data partitioned across
-    multiple files
-
-### Reading and writing data files with `arrow`
-
-The `arrow` package provides functions for reading single data files in
-several common formats. By default, calling any of these functions
-returns an R `data.frame`. To return an Arrow `Table`, set argument
-`as_data_frame = FALSE`.
-
--   `read_parquet()`: read a file in Parquet format
--   `read_feather()`: read a file in Feather format (the Apache Arrow
-    IPC format)
--   `read_delim_arrow()`: read a delimited text file (default delimiter
-    is comma)
--   `read_csv_arrow()`: read a comma-separated values (CSV) file
--   `read_tsv_arrow()`: read a tab-separated values (TSV) file
--   `read_json_arrow()`: read a JSON data file
-
-For writing data to single files, the `arrow` package provides the
-functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
-These can be used with R `data.frame` and Arrow `Table` objects.
-
-For example, let’s write the Star Wars characters data that’s included
-in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
-choice for storing analytic data; it is optimized for reduced file sizes
-and fast read performance, especially for column-based access patterns.
-Parquet is widely supported by many tools and platforms.
-
-First load the `arrow` and `dplyr` packages:
-
-``` r
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-Then write the `data.frame` named `starwars` to a Parquet file at
-`file_path`:
-
-``` r
-file_path <- tempfile()
-write_parquet(starwars, file_path)
-```
-
-Then read the Parquet file into an R `data.frame` named `sw`:
-
-``` r
-sw <- read_parquet(file_path)
-```
-
-R object attributes are preserved when writing data to Parquet or
-Feather files and when reading those files back into R. This enables
-round-trip writing and reading of `sf::sf` objects, R `data.frame`s with
-with `haven::labelled` columns, and `data.frame`s with other custom
-attributes.
-
-For reading and writing larger files or sets of multiple files, `arrow`
-defines `Dataset` objects and provides the functions `open_dataset()`
-and `write_dataset()`, which enable analysis and processing of
-bigger-than-memory data, including the ability to partition data into
-smaller chunks without loading the full data into memory. For examples
-of these functions, see `vignette("dataset", package = "arrow")`.
-
-All these functions can read and write files in the local filesystem or
-in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
-details, see `vignette("fs", package = "arrow")`
-
-### Using `dplyr` with `arrow`
-
-The `arrow` package provides a `dplyr` backend enabling manipulation of
-Arrow tabular data with `dplyr` verbs. To use it, first load both
-packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
-`Dataset` object. For example, read the Parquet file written in the
-previous example into an Arrow `Table` named `sw`:
-
-``` r
-sw <- read_parquet(file_path, as_data_frame = FALSE)
-```
-
-Next, pipe on `dplyr` verbs:
-
-``` r
-result <- sw %>%
-  filter(homeworld == "Tatooine") %>%
-  rename(height_cm = height, mass_kg = mass) %>%
-  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
-  arrange(desc(birth_year)) %>%
-  select(name, height_in, mass_lbs)
-```
-
-The `arrow` package uses lazy evaluation to delay computation until the
-result is required. This speeds up processing by enabling the Arrow C++
-library to perform multiple computations in one operation. `result` is
-an object with class `arrow_dplyr_query` which represents all the
-computations to be performed:
-
-``` r
-result
-#> Table (query)
-#> name: string
-#> height_in: expr
-#> mass_lbs: expr
-#>
-#> * Filter: equal(homeworld, "Tatooine")
-#> * Sorted by birth_year [desc]
-#> See $.data for the source Arrow object
-```
-
-To perform these computations and materialize the result, call
-`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
-suitable for passing to other `arrow` or `dplyr` functions:
-
-``` r
-result %>% compute()
-#> Table
-#> 10 rows x 3 columns
-#> $name <string>
-#> $height_in <double>
-#> $mass_lbs <double>
-```
-
-`collect()` returns an R `data.frame`, suitable for viewing or passing
-to other R functions for analysis or visualization:
-
-``` r
-result %>% collect()
-#> # A tibble: 10 x 3
-#>    name               height_in mass_lbs
-#>    <chr>                  <dbl>    <dbl>
-#>  1 C-3PO                   65.7    165.
-#>  2 Cliegg Lars             72.0     NA
-#>  3 Shmi Skywalker          64.2     NA
-#>  4 Owen Lars               70.1    265.
-#>  5 Beru Whitesun lars      65.0    165.
-#>  6 Darth Vader             79.5    300.
-#>  7 Anakin Skywalker        74.0    185.
-#>  8 Biggs Darklighter       72.0    185.
-#>  9 Luke Skywalker          67.7    170.
-#> 10 R5-D4                   38.2     70.5
+conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
-The `arrow` package works with most single-table `dplyr` verbs, including those
-that compute aggregates.
+In most cases installing the latest release should "just work" without 

Review Comment:
   This both very minor and very just style, but I get twitchy with quotes like this, how about italics?
   
   ```suggestion
   In most cases installing the latest release should _just work_ without 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1027992996


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)

Review Comment:
   ```suggestion
   - `read_feather()`: read a file in the Apache Arrow IPC format (formerly called the Feather format)
   ```



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.

Review Comment:
   What do you think to the idea of mentioning in this paragraph that the file reader functions read things into memory?  I've been in a few discussions lately where it's been mentioned that the eager/lazy distinction for the `read_*()` functions as compared to `open_dataset()` is a more accurate one than single/multi-files (you can read in a single file via the datasets API), and whilst we haven't started pushing that narrative yet, here could be a good place to start?



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the [cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the [dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding CSV 
+file would have been. This is in part due to the use of file compression: by default, 
+Parquet files written with the `arrow` package use [Snappy compression](https://google.github.io/snappy/) but other options such as gzip 
+are also supported. See `help("write_parquet", package = "arrow")` for more
+information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality is supported in `arrow`:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+with `haven::labelled` columns, and data frame with other custom
+attributes. To learn more about how metadata are handled in `arrow`, the [metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading a only subset of columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, 
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 
+
+Although `read_csv_arrow()` currently has fewer parsing options for dealing 
+with every CSV format variation in the wild than other CSV readers available
+in R, for those files that it can read, it is often significantly faster than 
+other R CSV readers, such as `base::read.csv`, `readr::read_csv`, and
+`data.table::fread`.

Review Comment:
   Lol, are we going there? :P  



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the [cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the [dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding CSV 
+file would have been. This is in part due to the use of file compression: by default, 
+Parquet files written with the `arrow` package use [Snappy compression](https://google.github.io/snappy/) but other options such as gzip 
+are also supported. See `help("write_parquet", package = "arrow")` for more
+information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality is supported in `arrow`:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+with `haven::labelled` columns, and data frame with other custom
+attributes. To learn more about how metadata are handled in `arrow`, the [metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 

Review Comment:
   I like the addition of this context for why it's no longer called Feather, but why it was.



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the [cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the [dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding CSV 
+file would have been. This is in part due to the use of file compression: by default, 
+Parquet files written with the `arrow` package use [Snappy compression](https://google.github.io/snappy/) but other options such as gzip 
+are also supported. See `help("write_parquet", package = "arrow")` for more
+information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality is supported in `arrow`:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+with `haven::labelled` columns, and data frame with other custom
+attributes. To learn more about how metadata are handled in `arrow`, the [metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading a only subset of columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, 
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 

Review Comment:
   Is it worth mentioning that for advanced usage, users can pass Arrow-specific arguments through in the `parse_options`, `read_options`, and `convert_options` parameters (or at least acknowledging that users have access to the Arrow-specific options even if not explicitly mentioning these parameters)?  (I'm anticipating that the answer to this question may well be 'no'!)



##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the [cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the [dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding CSV 
+file would have been. This is in part due to the use of file compression: by default, 
+Parquet files written with the `arrow` package use [Snappy compression](https://google.github.io/snappy/) but other options such as gzip 
+are also supported. See `help("write_parquet", package = "arrow")` for more
+information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality is supported in `arrow`:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+with `haven::labelled` columns, and data frame with other custom
+attributes. To learn more about how metadata are handled in `arrow`, the [metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading a only subset of columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, 
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 
+
+Although `read_csv_arrow()` currently has fewer parsing options for dealing 
+with every CSV format variation in the wild than other CSV readers available
+in R, for those files that it can read, it is often significantly faster than 
+other R CSV readers, such as `base::read.csv`, `readr::read_csv`, and
+`data.table::fread`.
+
+A simple example of writing and reading a CSV file with `arrow` is shown below:
+
+```{r}
+file_path <- tempfile()
+write_csv_arrow(mtcars, file_path)
+read_csv_arrow(file_path, col_select = starts_with("d"))
+```
+
+## JSON format
+
+The `arrow` package supports reading (but not writing) of tabular data from line-delimited JSON, using the `read_json_arrow()` function. A minimal example is shown below:
+
+```{r}
+file_path <- tempfile()
+writeLines('
+    { "hello": 3.5, "world": false, "yo": "thing" }
+    { "hello": 3.25, "world": null }
+    { "hello": 0.0, "world": true, "yo": null }
+  ', file_path, useBytes = TRUE)
+read_json_arrow(file_path)
+```
+
+## Further reading
+
+- To learn more about cloud storage, see the [cloud storage article](./fs.html).
+- To learn more about multi-file datasets, see the [datasets article](./dataset.html).

Review Comment:
   Perhaps link to cookbook chapters?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028625943


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,221 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+For developers interested in learning more about the package structure, see the [developer guide](./developing.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
 
+It is possible to exercise fine grained control over this conversion process. To learn more about the different types and how they are converted, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Arrow (also called Feather) that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 
 
+### Individual files
 
-### Arrow to R
+When the goal is to read a single data file, there are several functions you can use:

Review Comment:
   ```suggestion
   When the goal is to read a single data file into memory, there are several functions you can use:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028603118


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets

Review Comment:
   ```suggestion
   -   Read and write multi-file and larger-than-memory datasets
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028594410


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized

Review Comment:
   Would it be useful to highlight that Arrow is for in-memory and larger-than-memory data?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1029911635


##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,100 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [dplyr](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with dplyr and arrow: we'll start by ensuring both packages are loaded
 
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
 
 ## Example: NYC taxi data
 
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large datasets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available. 

Review Comment:
   These are all great thank you!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1324498095

   okay so we might be ready to merge this? I'm running out of things that I want to try fixing in this iteration of improvements


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #14514: ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1329796734

   Benchmark runs are scheduled for baseline = 409a95ddc2f1b8aac54612da3658b1f36a734469 and contender = 4afe71030cdd9d3103c7b028082ba63bafdf5d27. 4afe71030cdd9d3103c7b028082ba63bafdf5d27 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1b4698d58bbc48c19f58fa0267948f91...3e7c954cf1774e76ac71043e3ecd4629/)
   [Failed :arrow_down:0.13% :arrow_up:4.06%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/57fdaa8c6c4d480f95b4f8c4610ba903...dcc48f7dcb7a4752805cbb4206dfdc80/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/878260d21ad1432ba5a3af1460a78faf...28fe6de9b426492b909c7810dbf1bd04/)
   [Finished :arrow_down:0.28% :arrow_up:0.42%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9e4386d376b14590848baa380d91247a...db366ad4a61041729bfd7daa56d0b19e/)
   Buildkite builds:
   [Finished] [`4afe7103` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1936)
   [Failed] [`4afe7103` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1958)
   [Finished] [`4afe7103` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1928)
   [Finished] [`4afe7103` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1950)
   [Finished] [`409a95dd` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1935)
   [Finished] [`409a95dd` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1957)
   [Finished] [`409a95dd` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1927)
   [Finished] [`409a95dd` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1949)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1006820770


##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks l
 ike this in memory:

Review Comment:
   > I'm not confident I'm doing it well
   
   Hard disagree, I find this a **lot** easier to understand than when I tried to read the Arrow spec a while ago, and IMO it's just a case of iterating on some bits.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1007966036


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.

Review Comment:
   This summary is way clearer, love it.



##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.

Review Comment:
   > there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
   
   This is great, but does leave me asking "how will this inexactness affect me" and "what is 'care' in this context?"  My guess is that will be answered in the content linked to in the next sentence, but how about either rephrasing this a tiny bit to be more clear, or making it more explicit in the next sentence that the information I need will be there?



##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 

Review Comment:
   Nice; I think we need to do more about talking about both Tables and Datasets, so bringing this distinction in at this point is great!



##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
 
+To learn more about data types in Arrow and how they are mapped to R data types, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Feather that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 
 
+### Individual files
 
-### Arrow to R
+When the goal is to read a single data file, there are several functions you can use:
 
-| Arrow type        | R type                       |
-|-------------------|------------------------------|
-| boolean           | logical                      |
-| int8              | integer                      |
-| int16             | integer                      |
-| int32             | integer                      |
-| int64             | integer^1^                   |
-| uint8             | integer                      |
-| uint16            | integer                      |
-| uint32            | integer^1^                   |
-| uint64            | integer^1^                   |
-| float16           | -^2^                         |
-| float32           | double                       |
-| float64           | double                       |
-| utf8              | character                    |
-| large_utf8        | character                    |
-| binary            | arrow_binary ^3^             |
-| large_binary      | arrow_large_binary ^3^       |
-| fixed_size_binary | arrow_fixed_size_binary ^3^  |
-| date32            | Date                         |
-| date64            | POSIXct                      |
-| time32            | hms::hms                     |
-| time64            | hms::hms                     |
-| timestamp         | POSIXct                      |
-| duration          | difftime                     |
-| decimal           | double                       |
-| dictionary        | factor^4^                    |
-| list              | arrow_list ^5^               |
-| large_list        | arrow_large_list ^5^         |
-| fixed_size_list   | arrow_fixed_size_list ^5^    |
-| struct            | data.frame                   |
-| null              | vctrs::vctrs_unspecified     |
-| map               | arrow_list ^5^               |
-| union             | -^2^                         |
-
-^1^: These integer types may contain values that exceed the range of R's 
-`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are 
-converted to `double` ("numeric") and `int64` is converted to 
-`bit64::integer64`. This conversion can be disabled (so that `int64` always
-yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format
+-   `read_delim_arrow()`: read a delimited text file 
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-^2^: Some Arrow data types do not currently have an R equivalent and will raise an error
-if cast to or mapped to via a schema.
+In every case except JSON, there is a corresponding `write_*()` function 
+that allows you to write data files in the appropriate format. 
 
-^3^: `arrow*_binary` classes are implemented as lists of raw vectors. 
+By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`. 
 
-^4^: Due to the limitation of R factors, Arrow `dictionary` values are coerced
-to string when translated to R if they are not already strings.
+In the example below, we take the `starwars` data provided by the `dplyr` package and write it to a Parquet file using `write_parquet()`
 
-^5^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` 
-with a `ptype` attribute set to what an empty Array of the value type converts to. 
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+
+file_path <- tempfile(fileext = ".parquet")
+write_parquet(starwars, file_path)
+```
 
+We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`) but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):
+
+```{r}
+sw_frame <- read_parquet(file_path)
+sw_table <- read_parquet(file_path, as_data_frame = FALSE)
+sw_table
+```
+
+To learn more about reading and writing individual data files, see the [read/write article](./read_write.html).
+
+### Multi-file data sets
+
+When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The `arrow` package provides a convenient way to read, write, and analyze with data stored in this fashion using the Dataset interface. 
+
+To illustrate the concepts, we'll create a nonsense data set with 100000 rows that can be split into 10 subsets:
+
+```{r}
+set.seed(1234)
+nrows <- 100000
+random_data <- data.frame(
+  x = rnorm(nrows), 
+  y = rnorm(nrows),
+  subset = sample(10, nrows, replace = TRUE)
+)
+```
+
+What we might like to do is partition this data and then write it to 10 separate Parquet files, one corresponding to each value of the `subset` column. To do this we first specify the path to a folder into which we will write the data files:
+
+```{r}
+dataset_path <- file.path(tempdir(), "random_data")
+```
+
+We can then use `group_by()` function from `dplyr` to specify that the data will be partitioned using the `subset` column, and then pass the grouped data to `write_dataset()`:
+
+```{r}
+random_data %>%
+  group_by(subset) %>%
+  write_dataset(dataset_path)
+```
+
+This creates a set of 10 files, one for each subset. These files are named according to the "hive partitioning" format as shown below:
+
+```{r}
+list.files(dataset_path, recursive = TRUE)
+```
 
-### R object attributes
+Each of these Parquet files can be opened individually using `read_parquet()` but is often more convenient -- especially for very large data sets -- to scan the folder and "connect" to the data set without loading it into memory. We can do this using `open_dataset()`:
+
+```{r}
+dset <- open_dataset(dataset_path)
+dset
+```
+
+This `dset` object does not store the data in-memory, only some metadata. However, as discussed in the next section, it is possible to analyze the data referred to be `dset` as if it had been loaded.
+
+To learn more about Arrow Datasets, see the [dataset article](./dataset.html).
+
+## Analyzing Arrow data with dplyr
+
+Arrow Tables and Datasets can be analyzed using `dplyr` syntax. This is possible because the `arrow` R package supplies a backend that translates `dplyr` verbs into commands that are understood by the Arrow C++ library, and will similarly translate R expressions that appear within a call to a `dplyr` verb. For example, although the `dset` Dataset is not a data frame (and does not store the data values in memory), you can still pass it to a `dplyr` pipeline like the one shown below:
+
+```{r}
+dset %>%
+  group_by(subset) %>% 
+  summarize(mean_x = mean(x), min_y = min(y)) %>%
+  filter(mean_x > 0) %>%
+  arrange(subset) %>%
+  collect()
+```
+
+Notice that we call `collect()` at the end of the pipeline. No actual computations are performed until `collect()` (or the related `compute()` function) is called. This "lazy evaluation" makes it possible for the Arrow C++ compute engine to optimize how the computations are performed. 
+
+To learn more about analyzing Arrow data, see the [data wrangling article](./data_wrangling.html).
+
+## Connecting to cloud storage

Review Comment:
   I like this succint overview



##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
 
+To learn more about data types in Arrow and how they are mapped to R data types, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Feather that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 

Review Comment:
   Small note here that we are deprecating the usage of "Feather" in favour of "Arrow" or whatever specific phrasing is preferred.



##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
 
+To learn more about data types in Arrow and how they are mapped to R data types, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Feather that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 
 
+### Individual files
 
-### Arrow to R
+When the goal is to read a single data file, there are several functions you can use:
 
-| Arrow type        | R type                       |
-|-------------------|------------------------------|
-| boolean           | logical                      |
-| int8              | integer                      |
-| int16             | integer                      |
-| int32             | integer                      |
-| int64             | integer^1^                   |
-| uint8             | integer                      |
-| uint16            | integer                      |
-| uint32            | integer^1^                   |
-| uint64            | integer^1^                   |
-| float16           | -^2^                         |
-| float32           | double                       |
-| float64           | double                       |
-| utf8              | character                    |
-| large_utf8        | character                    |
-| binary            | arrow_binary ^3^             |
-| large_binary      | arrow_large_binary ^3^       |
-| fixed_size_binary | arrow_fixed_size_binary ^3^  |
-| date32            | Date                         |
-| date64            | POSIXct                      |
-| time32            | hms::hms                     |
-| time64            | hms::hms                     |
-| timestamp         | POSIXct                      |
-| duration          | difftime                     |
-| decimal           | double                       |
-| dictionary        | factor^4^                    |
-| list              | arrow_list ^5^               |
-| large_list        | arrow_large_list ^5^         |
-| fixed_size_list   | arrow_fixed_size_list ^5^    |
-| struct            | data.frame                   |
-| null              | vctrs::vctrs_unspecified     |
-| map               | arrow_list ^5^               |
-| union             | -^2^                         |
-
-^1^: These integer types may contain values that exceed the range of R's 
-`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are 
-converted to `double` ("numeric") and `int64` is converted to 
-`bit64::integer64`. This conversion can be disabled (so that `int64` always
-yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format
+-   `read_delim_arrow()`: read a delimited text file 
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-^2^: Some Arrow data types do not currently have an R equivalent and will raise an error
-if cast to or mapped to via a schema.
+In every case except JSON, there is a corresponding `write_*()` function 
+that allows you to write data files in the appropriate format. 
 
-^3^: `arrow*_binary` classes are implemented as lists of raw vectors. 
+By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`. 
 
-^4^: Due to the limitation of R factors, Arrow `dictionary` values are coerced
-to string when translated to R if they are not already strings.
+In the example below, we take the `starwars` data provided by the `dplyr` package and write it to a Parquet file using `write_parquet()`
 
-^5^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` 
-with a `ptype` attribute set to what an empty Array of the value type converts to. 
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+
+file_path <- tempfile(fileext = ".parquet")
+write_parquet(starwars, file_path)
+```
 
+We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`) but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):
+
+```{r}
+sw_frame <- read_parquet(file_path)
+sw_table <- read_parquet(file_path, as_data_frame = FALSE)
+sw_table
+```
+
+To learn more about reading and writing individual data files, see the [read/write article](./read_write.html).
+
+### Multi-file data sets
+
+When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The `arrow` package provides a convenient way to read, write, and analyze with data stored in this fashion using the Dataset interface. 
+
+To illustrate the concepts, we'll create a nonsense data set with 100000 rows that can be split into 10 subsets:
+
+```{r}
+set.seed(1234)
+nrows <- 100000
+random_data <- data.frame(
+  x = rnorm(nrows), 
+  y = rnorm(nrows),
+  subset = sample(10, nrows, replace = TRUE)
+)
+```
+
+What we might like to do is partition this data and then write it to 10 separate Parquet files, one corresponding to each value of the `subset` column. To do this we first specify the path to a folder into which we will write the data files:
+
+```{r}
+dataset_path <- file.path(tempdir(), "random_data")
+```
+
+We can then use `group_by()` function from `dplyr` to specify that the data will be partitioned using the `subset` column, and then pass the grouped data to `write_dataset()`:
+
+```{r}
+random_data %>%
+  group_by(subset) %>%
+  write_dataset(dataset_path)
+```
+
+This creates a set of 10 files, one for each subset. These files are named according to the "hive partitioning" format as shown below:
+
+```{r}
+list.files(dataset_path, recursive = TRUE)
+```
 
-### R object attributes
+Each of these Parquet files can be opened individually using `read_parquet()` but is often more convenient -- especially for very large data sets -- to scan the folder and "connect" to the data set without loading it into memory. We can do this using `open_dataset()`:
+
+```{r}
+dset <- open_dataset(dataset_path)
+dset
+```
+
+This `dset` object does not store the data in-memory, only some metadata. However, as discussed in the next section, it is possible to analyze the data referred to be `dset` as if it had been loaded.
+
+To learn more about Arrow Datasets, see the [dataset article](./dataset.html).
+

Review Comment:
   Love this section, and the work you've done to add in explicit motivations for why we might want to do different things, and what we should be considering.



##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
 
+To learn more about data types in Arrow and how they are mapped to R data types, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Feather that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 
 
+### Individual files
 
-### Arrow to R
+When the goal is to read a single data file, there are several functions you can use:
 
-| Arrow type        | R type                       |
-|-------------------|------------------------------|
-| boolean           | logical                      |
-| int8              | integer                      |
-| int16             | integer                      |
-| int32             | integer                      |
-| int64             | integer^1^                   |
-| uint8             | integer                      |
-| uint16            | integer                      |
-| uint32            | integer^1^                   |
-| uint64            | integer^1^                   |
-| float16           | -^2^                         |
-| float32           | double                       |
-| float64           | double                       |
-| utf8              | character                    |
-| large_utf8        | character                    |
-| binary            | arrow_binary ^3^             |
-| large_binary      | arrow_large_binary ^3^       |
-| fixed_size_binary | arrow_fixed_size_binary ^3^  |
-| date32            | Date                         |
-| date64            | POSIXct                      |
-| time32            | hms::hms                     |
-| time64            | hms::hms                     |
-| timestamp         | POSIXct                      |
-| duration          | difftime                     |
-| decimal           | double                       |
-| dictionary        | factor^4^                    |
-| list              | arrow_list ^5^               |
-| large_list        | arrow_large_list ^5^         |
-| fixed_size_list   | arrow_fixed_size_list ^5^    |
-| struct            | data.frame                   |
-| null              | vctrs::vctrs_unspecified     |
-| map               | arrow_list ^5^               |
-| union             | -^2^                         |
-
-^1^: These integer types may contain values that exceed the range of R's 
-`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are 
-converted to `double` ("numeric") and `int64` is converted to 
-`bit64::integer64`. This conversion can be disabled (so that `int64` always
-yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format
+-   `read_delim_arrow()`: read a delimited text file 
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-^2^: Some Arrow data types do not currently have an R equivalent and will raise an error
-if cast to or mapped to via a schema.
+In every case except JSON, there is a corresponding `write_*()` function 
+that allows you to write data files in the appropriate format. 
 
-^3^: `arrow*_binary` classes are implemented as lists of raw vectors. 
+By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`. 
 
-^4^: Due to the limitation of R factors, Arrow `dictionary` values are coerced
-to string when translated to R if they are not already strings.
+In the example below, we take the `starwars` data provided by the `dplyr` package and write it to a Parquet file using `write_parquet()`
 
-^5^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` 
-with a `ptype` attribute set to what an empty Array of the value type converts to. 
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+
+file_path <- tempfile(fileext = ".parquet")
+write_parquet(starwars, file_path)
+```
 
+We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`) but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):
+
+```{r}
+sw_frame <- read_parquet(file_path)
+sw_table <- read_parquet(file_path, as_data_frame = FALSE)
+sw_table
+```
+
+To learn more about reading and writing individual data files, see the [read/write article](./read_write.html).
+
+### Multi-file data sets
+
+When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The `arrow` package provides a convenient way to read, write, and analyze with data stored in this fashion using the Dataset interface. 
+
+To illustrate the concepts, we'll create a nonsense data set with 100000 rows that can be split into 10 subsets:
+
+```{r}
+set.seed(1234)
+nrows <- 100000
+random_data <- data.frame(
+  x = rnorm(nrows), 
+  y = rnorm(nrows),
+  subset = sample(10, nrows, replace = TRUE)
+)
+```
+
+What we might like to do is partition this data and then write it to 10 separate Parquet files, one corresponding to each value of the `subset` column. To do this we first specify the path to a folder into which we will write the data files:
+
+```{r}
+dataset_path <- file.path(tempdir(), "random_data")
+```
+
+We can then use `group_by()` function from `dplyr` to specify that the data will be partitioned using the `subset` column, and then pass the grouped data to `write_dataset()`:
+
+```{r}
+random_data %>%
+  group_by(subset) %>%
+  write_dataset(dataset_path)
+```
+
+This creates a set of 10 files, one for each subset. These files are named according to the "hive partitioning" format as shown below:
+
+```{r}
+list.files(dataset_path, recursive = TRUE)
+```
 
-### R object attributes
+Each of these Parquet files can be opened individually using `read_parquet()` but is often more convenient -- especially for very large data sets -- to scan the folder and "connect" to the data set without loading it into memory. We can do this using `open_dataset()`:
+
+```{r}
+dset <- open_dataset(dataset_path)
+dset
+```
+
+This `dset` object does not store the data in-memory, only some metadata. However, as discussed in the next section, it is possible to analyze the data referred to be `dset` as if it had been loaded.
+
+To learn more about Arrow Datasets, see the [dataset article](./dataset.html).
+
+## Analyzing Arrow data with dplyr
+
+Arrow Tables and Datasets can be analyzed using `dplyr` syntax. This is possible because the `arrow` R package supplies a backend that translates `dplyr` verbs into commands that are understood by the Arrow C++ library, and will similarly translate R expressions that appear within a call to a `dplyr` verb. For example, although the `dset` Dataset is not a data frame (and does not store the data values in memory), you can still pass it to a `dplyr` pipeline like the one shown below:
+
+```{r}
+dset %>%
+  group_by(subset) %>% 
+  summarize(mean_x = mean(x), min_y = min(y)) %>%
+  filter(mean_x > 0) %>%
+  arrange(subset) %>%
+  collect()
+```
+
+Notice that we call `collect()` at the end of the pipeline. No actual computations are performed until `collect()` (or the related `compute()` function) is called. This "lazy evaluation" makes it possible for the Arrow C++ compute engine to optimize how the computations are performed. 
+
+To learn more about analyzing Arrow data, see the [data wrangling article](./data_wrangling.html).
+

Review Comment:
   Is it worth adding a note here to where we can find out which functions are supported (https://arrow.apache.org/docs/r/reference/acero.html)?  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1007454150


##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks l
 ike this in memory:

Review Comment:
   Thanks! This one is hard to write!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1291507155

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1005366066


##########
r/vignettes/flight.Rmd:
##########
@@ -1,48 +1,38 @@
 ---
-title: "Connecting to Flight RPC Servers"
+title: "Connecting to a flight server"
+description: >
+  Learn how to efficiently stream Apache Arrow data objects across a 
+  network using Arrow Flight 
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Connecting to Flight RPC Servers}
+  %\VignetteIndexEntry{Connecting to a flight server}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-[**Flight**](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/)
-is a general-purpose client-server framework for high performance
-transport of large datasets over network interfaces, built as part of the
-[Apache Arrow](https://arrow.apache.org) project.
+[Arrow Flight](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) is a general-purpose client-server framework for high performance transport of large datasets over network interfaces, built as part of the Apache Arrow project. It allows for highly efficient data transfer by several means:
 
-Flight allows for highly efficient data transfer as it:
+* Flight removes the need for deserialization during data transfer.
+* Flight allows for parallel data streaming.
+* Flight employs optimizations designed to take advantage of Arrow's columnar format.
 
-* removes the need for deserialization during data transfer
-* allows for parallel data streaming
-* is highly optimized to take advantage of Arrow's columnar format.
+The `arrow` package provides methods for connecting to Flight servers to send and receive data.

Review Comment:
   In previous iterations of doc refactoring, we decided to refer to packages on the first instance with a link, and on the subsequent instances with a link to that package, instead of backticks as it makes the sentence more skimmable (and tbh were just copying [how the dplyr docs do it](https://dplyr.tidyverse.org/articles/programming.html) ;) ) There's a little bit in here about that: https://github.com/apache/arrow/blob/master/r/STYLE.md.



##########
r/vignettes/flight.Rmd:
##########
@@ -1,48 +1,38 @@
 ---
-title: "Connecting to Flight RPC Servers"
+title: "Connecting to a flight server"
+description: >
+  Learn how to efficiently stream Apache Arrow data objects across a 
+  network using Arrow Flight 
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Connecting to Flight RPC Servers}
+  %\VignetteIndexEntry{Connecting to a flight server}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-[**Flight**](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/)
-is a general-purpose client-server framework for high performance
-transport of large datasets over network interfaces, built as part of the
-[Apache Arrow](https://arrow.apache.org) project.
+[Arrow Flight](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) is a general-purpose client-server framework for high performance transport of large datasets over network interfaces, built as part of the Apache Arrow project. It allows for highly efficient data transfer by several means:
 
-Flight allows for highly efficient data transfer as it:
+* Flight removes the need for deserialization during data transfer.
+* Flight allows for parallel data streaming.
+* Flight employs optimizations designed to take advantage of Arrow's columnar format.
 
-* removes the need for deserialization during data transfer
-* allows for parallel data streaming
-* is highly optimized to take advantage of Arrow's columnar format.
+The `arrow` package provides methods for connecting to Flight servers to send and receive data.
 
-The arrow package provides methods for connecting to Flight RPC servers
-to send and receive data.
+## Prerequisites
 
-## Getting Started
-
-The `flight` functions in the package use [reticulate](https://rstudio.github.io/reticulate/) to call methods in the
-[pyarrow](https://arrow.apache.org/docs/python/api/flight.html) Python package.
-
-Before using them for the first time,
-you'll need to be sure you have reticulate and pyarrow installed:
+At present the `arrow` package in R does not supply an independent implementation of Arrow Flight: it works by calling [Flight methods supplied by PyArrow](https://arrow.apache.org/docs/python/api/flight.html) Python, and requires both the [`reticulate`](https://rstudio.github.io/reticulate/) package and the Python PyArrow library to be installed. If you are using them for the first time you can install them like this:

Review Comment:
   Love this phrasing, this is much clearer



##########
r/vignettes/flight.Rmd:
##########
@@ -84,6 +73,13 @@ client %>%
 
 Because `flight_get()` returns an Arrow data structure, you can directly pipe
 its result into a [dplyr](https://dplyr.tidyverse.org/) workflow.
-See `vignette("dataset", package = "arrow")` for more information on working with Arrow objects via a dplyr interface.
+See `vignette("data_wrangling", package = "arrow")` for more information on working with Arrow objects via a `dplyr` interface.
+
+## Further reading
+
+- The specification of the [Flight remote procedure call protocol](https://arrow.apache.org/docs/format/Flight.html) is listed on the Arrow project homepage
+- The Arrow C++ documentation contains a list of [best practices](https://arrow.apache.org/docs/cpp/flight.html#best-practices) for Arrow Flight.
+- A detailed worked example of an Arrow Flight server in Python is provided in the [Apache Arrow Python Cookbook](https://arrow.apache.org/cookbook/py/flight.html).

Review Comment:
   Good call, great addition



##########
r/vignettes/flight.Rmd:
##########
@@ -1,48 +1,38 @@
 ---
-title: "Connecting to Flight RPC Servers"
+title: "Connecting to a flight server"

Review Comment:
   Should "flight" here be capitalised?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1291677557

   As this is a humongous PR, I'm adding checkboxes here so I can work out what I have or have not looked at.
   
   - [ ]     Reduces the content on the README page to the essential points
   - [ ]     Rewrites the "get started" page to focus on common tasks and novice users
   - [ ]     Moves discussion of the Arrow data object hierarchy to a new "data objects" vignette
   - [ ]     Moves discussion of Arrow data types and conversions to a new "data types" vignette
   - [ ]     Moves discussion of schemas and storage of R attributes to a new "metadata" vignette
   - [ ]     Moves discussion of package naming conventions to a new "package conventions" vignette
   - [ ]     Moves discussion of read/write capabilities to a new "reading and writing data" vignette
   - [ ]     Moves discussion of the dplyr back end to a new "data wrangling" vignette
   - [ ]     Edits the "multi-file data sets" vignette to improve readability and to minimize risk of novice users unintentionally downloading the 70GB NYC taxi data by copy/paste errors
   - [ ]     Minor edits to the "python" vignette to improve readability
   - [ ]     Minor edits to the "cloud storage" vignette to improve readability
   - [x]     Minor edits to the "flight" vignette to improve readability
   - [ ]     Inserts a new "data object layout" vignette (in the developer vignettes) to bridge between the R documentation and the Arrow specification page
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1006121776


##########
r/vignettes/flight.Rmd:
##########
@@ -1,48 +1,38 @@
 ---
-title: "Connecting to Flight RPC Servers"
+title: "Connecting to a flight server"
+description: >
+  Learn how to efficiently stream Apache Arrow data objects across a 
+  network using Arrow Flight 
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Connecting to Flight RPC Servers}
+  %\VignetteIndexEntry{Connecting to a flight server}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-[**Flight**](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/)
-is a general-purpose client-server framework for high performance
-transport of large datasets over network interfaces, built as part of the
-[Apache Arrow](https://arrow.apache.org) project.
+[Arrow Flight](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) is a general-purpose client-server framework for high performance transport of large datasets over network interfaces, built as part of the Apache Arrow project. It allows for highly efficient data transfer by several means:
 
-Flight allows for highly efficient data transfer as it:
+* Flight removes the need for deserialization during data transfer.
+* Flight allows for parallel data streaming.
+* Flight employs optimizations designed to take advantage of Arrow's columnar format.
 
-* removes the need for deserialization during data transfer
-* allows for parallel data streaming
-* is highly optimized to take advantage of Arrow's columnar format.
+The `arrow` package provides methods for connecting to Flight servers to send and receive data.

Review Comment:
   > In previous iterations of doc refactoring, we decided to refer to packages on the first instance with a link, and on the subsequent instances with a link to that package, instead of backticks as it makes the sentence more skimmable (and tbh were just copying [how the dplyr docs do it](https://dplyr.tidyverse.org/articles/programming.html) ;) ) There's a little bit in here about that: https://github.com/apache/arrow/blob/master/r/STYLE.md.
   
   Ah thanks -- I missed that. Thanks! I'll update 🙂 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1292725903

   This is great, thank you for taking this on. I'll give the content a close read at some point, but a couple of quick considerations:
   
   * We should be careful about changing the vignette filenames since they map to URLs and URLs make up an API. For some of the lesser vignettes it's maybe not a big deal, but the number of links to `vignette("install")` i.e. https://arrow.apache.org/docs/r/articles/install.html I've made out in the internet (and including twice in our own `r/configure`) make me wary of changing that one in particular.
   * We also should avoid adding .pngs to the R package tarball. We're already at 4.7mb without this change (I only know that because I noticed today when doing the CRAN submission) and the CRAN limit is 5mb. I don't know what the right pkgdown way of handling these extra documents is, but we should do that. We basically are trying to populate the website and don't really care if all of these vignettes ship in the package itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1009171974


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
 
+To learn more about data types in Arrow and how they are mapped to R data types, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Feather that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 
 
+### Individual files
 
-### Arrow to R
+When the goal is to read a single data file, there are several functions you can use:
 
-| Arrow type        | R type                       |
-|-------------------|------------------------------|
-| boolean           | logical                      |
-| int8              | integer                      |
-| int16             | integer                      |
-| int32             | integer                      |
-| int64             | integer^1^                   |
-| uint8             | integer                      |
-| uint16            | integer                      |
-| uint32            | integer^1^                   |
-| uint64            | integer^1^                   |
-| float16           | -^2^                         |
-| float32           | double                       |
-| float64           | double                       |
-| utf8              | character                    |
-| large_utf8        | character                    |
-| binary            | arrow_binary ^3^             |
-| large_binary      | arrow_large_binary ^3^       |
-| fixed_size_binary | arrow_fixed_size_binary ^3^  |
-| date32            | Date                         |
-| date64            | POSIXct                      |
-| time32            | hms::hms                     |
-| time64            | hms::hms                     |
-| timestamp         | POSIXct                      |
-| duration          | difftime                     |
-| decimal           | double                       |
-| dictionary        | factor^4^                    |
-| list              | arrow_list ^5^               |
-| large_list        | arrow_large_list ^5^         |
-| fixed_size_list   | arrow_fixed_size_list ^5^    |
-| struct            | data.frame                   |
-| null              | vctrs::vctrs_unspecified     |
-| map               | arrow_list ^5^               |
-| union             | -^2^                         |
-
-^1^: These integer types may contain values that exceed the range of R's 
-`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are 
-converted to `double` ("numeric") and `int64` is converted to 
-`bit64::integer64`. This conversion can be disabled (so that `int64` always
-yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Feather format
+-   `read_delim_arrow()`: read a delimited text file 
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-^2^: Some Arrow data types do not currently have an R equivalent and will raise an error
-if cast to or mapped to via a schema.
+In every case except JSON, there is a corresponding `write_*()` function 
+that allows you to write data files in the appropriate format. 
 
-^3^: `arrow*_binary` classes are implemented as lists of raw vectors. 
+By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`. 
 
-^4^: Due to the limitation of R factors, Arrow `dictionary` values are coerced
-to string when translated to R if they are not already strings.
+In the example below, we take the `starwars` data provided by the `dplyr` package and write it to a Parquet file using `write_parquet()`
 
-^5^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` 
-with a `ptype` attribute set to what an empty Array of the value type converts to. 
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+
+file_path <- tempfile(fileext = ".parquet")
+write_parquet(starwars, file_path)
+```
 
+We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`) but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):
+
+```{r}
+sw_frame <- read_parquet(file_path)
+sw_table <- read_parquet(file_path, as_data_frame = FALSE)
+sw_table
+```
+
+To learn more about reading and writing individual data files, see the [read/write article](./read_write.html).
+
+### Multi-file data sets
+
+When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The `arrow` package provides a convenient way to read, write, and analyze with data stored in this fashion using the Dataset interface. 
+
+To illustrate the concepts, we'll create a nonsense data set with 100000 rows that can be split into 10 subsets:
+
+```{r}
+set.seed(1234)
+nrows <- 100000
+random_data <- data.frame(
+  x = rnorm(nrows), 
+  y = rnorm(nrows),
+  subset = sample(10, nrows, replace = TRUE)
+)
+```
+
+What we might like to do is partition this data and then write it to 10 separate Parquet files, one corresponding to each value of the `subset` column. To do this we first specify the path to a folder into which we will write the data files:
+
+```{r}
+dataset_path <- file.path(tempdir(), "random_data")
+```
+
+We can then use `group_by()` function from `dplyr` to specify that the data will be partitioned using the `subset` column, and then pass the grouped data to `write_dataset()`:
+
+```{r}
+random_data %>%
+  group_by(subset) %>%
+  write_dataset(dataset_path)
+```
+
+This creates a set of 10 files, one for each subset. These files are named according to the "hive partitioning" format as shown below:
+
+```{r}
+list.files(dataset_path, recursive = TRUE)
+```
 
-### R object attributes
+Each of these Parquet files can be opened individually using `read_parquet()` but is often more convenient -- especially for very large data sets -- to scan the folder and "connect" to the data set without loading it into memory. We can do this using `open_dataset()`:
+
+```{r}
+dset <- open_dataset(dataset_path)
+dset
+```
+
+This `dset` object does not store the data in-memory, only some metadata. However, as discussed in the next section, it is possible to analyze the data referred to be `dset` as if it had been loaded.
+
+To learn more about Arrow Datasets, see the [dataset article](./dataset.html).
+
+## Analyzing Arrow data with dplyr
+
+Arrow Tables and Datasets can be analyzed using `dplyr` syntax. This is possible because the `arrow` R package supplies a backend that translates `dplyr` verbs into commands that are understood by the Arrow C++ library, and will similarly translate R expressions that appear within a call to a `dplyr` verb. For example, although the `dset` Dataset is not a data frame (and does not store the data values in memory), you can still pass it to a `dplyr` pipeline like the one shown below:
+
+```{r}
+dset %>%
+  group_by(subset) %>% 
+  summarize(mean_x = mean(x), min_y = min(y)) %>%
+  filter(mean_x > 0) %>%
+  arrange(subset) %>%
+  collect()
+```
+
+Notice that we call `collect()` at the end of the pipeline. No actual computations are performed until `collect()` (or the related `compute()` function) is called. This "lazy evaluation" makes it possible for the Arrow C++ compute engine to optimize how the computations are performed. 
+
+To learn more about analyzing Arrow data, see the [data wrangling article](./data_wrangling.html).
+

Review Comment:
   Thanks for adding it there too!  Yeah, it's super new and is generated from some funky scripts that Neal put together recently, will be getting the word out about it soon :D



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1009064383


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.

Review Comment:
   That's a good point. I've expanded on it a little by highlighting the fact that POSIXct and POSIXlt objects get handled very differently, hinted at subtle differences between R factors and Arrow dictionaries and then tried to direct readers to the data types article should they want to learn more about the nuances. See what you think!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1011820496


##########
r/vignettes/package_conventions.Rmd:
##########
@@ -0,0 +1,25 @@
+---
+title: "Package conventions"
+description: >
+  Learn how R6 classes are used in `arrow` to wrap the 
+  underlying C++ library, and when to use these objects
+  rather than the R-friendly wrapper functions
+output: rmarkdown::html_vignette
+---
+
+C++ is an object-oriented language, so the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package, these classes are implemented as [`R6`](https://r6.r-lib.org) classes, most of which are exported from the namespace.
+
+## Naming conventions
+
+In order to match the C++ naming conventions, the `R6` classes are named in "TitleCase", e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`.
+
+Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of these factory methods that an R user might most often encounter also have a "snake_case" alias, in order to be more familiar for contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` would do the same as `RecordBatch$create()` above.
+
+The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadFile()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used.
+
+## Further reading
+
+- [Documentation for the Arrow C++ library](https://arrow.apache.org/docs/cpp/)
+- [API reference for the Arrow C++ classes](https://arrow.apache.org/docs/cpp/api.html)
+
+

Review Comment:
   I'm feeling a little bit resistant about this new vignette, if only because having to explain "when to use these objects rather than the R-friendly wrapper functions" might in some cases just be a symptom that we need to make more wrapper functions ;)  
   
   I think this content is well-explained, but can we chat a bit about who are aiming it at and why we're including it?  I want to make sure that we do need it before we incorporate it, as it's more content to need to maintain.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1012361763


##########
r/vignettes/package_conventions.Rmd:
##########
@@ -0,0 +1,25 @@
+---
+title: "Package conventions"
+description: >
+  Learn how R6 classes are used in `arrow` to wrap the 
+  underlying C++ library, and when to use these objects
+  rather than the R-friendly wrapper functions
+output: rmarkdown::html_vignette
+---
+
+C++ is an object-oriented language, so the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package, these classes are implemented as [`R6`](https://r6.r-lib.org) classes, most of which are exported from the namespace.
+
+## Naming conventions
+
+In order to match the C++ naming conventions, the `R6` classes are named in "TitleCase", e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`.
+
+Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of these factory methods that an R user might most often encounter also have a "snake_case" alias, in order to be more familiar for contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` would do the same as `RecordBatch$create()` above.
+
+The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadFile()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used.
+
+## Further reading
+
+- [Documentation for the Arrow C++ library](https://arrow.apache.org/docs/cpp/)
+- [API reference for the Arrow C++ classes](https://arrow.apache.org/docs/cpp/api.html)
+
+

Review Comment:
   So yeah, I'm hesitant about it too. The reason it exists is that we currently have the same content in an even more prominent location: it's on the "get started" vignette, which feels even more obtrusive to me? https://arrow.apache.org/docs/r/articles/arrow.html#class-structure-and-package-conventions
   
   Personally I'm happy to remove it, or else push it somewhere into the developer vignettes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1016078767


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets
+-   Read JSON files
 
-Conda users can install `arrow` from conda-forge with
+It provides data analysis tools for both in-memory and larger-than-memory data sets
 
-``` shell
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+-   Analyze and process larger-than-memory datasets
+-   Manipulate and analyze Arrow data with `dplyr` verbs
 
-Installing a released version of the `arrow` package requires no
-additional system dependencies. For macOS and Windows, CRAN hosts binary
-packages that contain the Arrow C++ library. On Linux, source package
-installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable
-`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details.
+It provides access to remote filesystems and servers
 
-As of version 10.0.0, `arrow` requires C++17 to build. This means that:
+-   Read and write files in Amazon S3 and Google Cloud Storage buckets
+-   Connect to Arrow Flight servers to transport large datasets over networks  
+    
+Additional features include:
 
-* On Windows, you need `R >= 4.0`. Version 9.0.0 was the last version to support
-R 3.6.
-* On CentOS 7, you can build the latest version of `arrow`,
-but you first need to install a newer compiler than the default system compiler,
-gcc 4.8. See `vignette("install", package = "arrow")` for guidance.
-Note that you only need the newer compiler to build `arrow`:
-installing a binary package, as from RStudio Package Manager,
-or loading a package you've already installed works fine with the system defaults.
+-   Zero-copy data sharing between R and Python
+-   Fine control over column types to work seamlessly
+    with databases and data warehouses
+-   Support for compression codecs including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2
+-   Access and manipulate Arrow objects through low-level bindings
+    to the C++ library
+-   Toolkit for building connectors to other applications
+    and services that use Arrow
 
-### Installing a development version
+## Installation
 
-Development versions of the package (binary and source) are built
-nightly and hosted at <https://nightlies.apache.org/arrow/r/>. To
-install from there:
+Most R users will probably want to install the latest release of `arrow` 
+from CRAN:
 
 ``` r
-install.packages("arrow", repos = c(arrow = "https://nightlies.apache.org/arrow/r", getOption("repos")))
+install.packages("arrow")
 ```
 
-Conda users can install `arrow` nightly builds with
+Alternatively, if you are using conda you can install `arrow` from conda-forge:
 
 ``` shell
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-If you already have a version of `arrow` installed, you can switch to
-the latest nightly development version with
-
-``` r
-arrow::install_arrow(nightly = TRUE)
-```
-
-These nightly package builds are not official Apache releases and are
-not recommended for production use. They may be useful for testing bug
-fixes and new features under active development.
-
-## Usage
-
-Among the many applications of the `arrow` package, two of the most accessible are:
-
--   High-performance reading and writing of data files with multiple
-    file formats and compression codecs, including built-in support for
-    cloud storage
--   Analyzing and manipulating bigger-than-memory data with `dplyr`
-    verbs
-
-The sections below describe these two uses and illustrate them with
-basic examples. The sections below mention two Arrow data structures:
-
--   `Table`: a tabular, column-oriented data structure capable of
-    storing and processing large amounts of data more efficiently than
-    R’s built-in `data.frame` and with SQL-like column data types that
-    afford better interoperability with databases and data warehouse
-    systems
--   `Dataset`: a data structure functionally similar to `Table` but with
-    the capability to work on larger-than-memory data partitioned across
-    multiple files
-
-### Reading and writing data files with `arrow`
-
-The `arrow` package provides functions for reading single data files in
-several common formats. By default, calling any of these functions
-returns an R `data.frame`. To return an Arrow `Table`, set argument
-`as_data_frame = FALSE`.
-
--   `read_parquet()`: read a file in Parquet format
--   `read_feather()`: read a file in Feather format (the Apache Arrow
-    IPC format)
--   `read_delim_arrow()`: read a delimited text file (default delimiter
-    is comma)
--   `read_csv_arrow()`: read a comma-separated values (CSV) file
--   `read_tsv_arrow()`: read a tab-separated values (TSV) file
--   `read_json_arrow()`: read a JSON data file
-
-For writing data to single files, the `arrow` package provides the
-functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
-These can be used with R `data.frame` and Arrow `Table` objects.
-
-For example, let’s write the Star Wars characters data that’s included
-in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
-choice for storing analytic data; it is optimized for reduced file sizes
-and fast read performance, especially for column-based access patterns.
-Parquet is widely supported by many tools and platforms.
-
-First load the `arrow` and `dplyr` packages:
-
-``` r
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-Then write the `data.frame` named `starwars` to a Parquet file at
-`file_path`:
-
-``` r
-file_path <- tempfile()
-write_parquet(starwars, file_path)
-```
-
-Then read the Parquet file into an R `data.frame` named `sw`:
-
-``` r
-sw <- read_parquet(file_path)
-```
-
-R object attributes are preserved when writing data to Parquet or
-Feather files and when reading those files back into R. This enables
-round-trip writing and reading of `sf::sf` objects, R `data.frame`s with
-with `haven::labelled` columns, and `data.frame`s with other custom
-attributes.
-
-For reading and writing larger files or sets of multiple files, `arrow`
-defines `Dataset` objects and provides the functions `open_dataset()`
-and `write_dataset()`, which enable analysis and processing of
-bigger-than-memory data, including the ability to partition data into
-smaller chunks without loading the full data into memory. For examples
-of these functions, see `vignette("dataset", package = "arrow")`.
-
-All these functions can read and write files in the local filesystem or
-in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
-details, see `vignette("fs", package = "arrow")`
-
-### Using `dplyr` with `arrow`
-
-The `arrow` package provides a `dplyr` backend enabling manipulation of
-Arrow tabular data with `dplyr` verbs. To use it, first load both
-packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
-`Dataset` object. For example, read the Parquet file written in the
-previous example into an Arrow `Table` named `sw`:
-
-``` r
-sw <- read_parquet(file_path, as_data_frame = FALSE)
-```
-
-Next, pipe on `dplyr` verbs:
-
-``` r
-result <- sw %>%
-  filter(homeworld == "Tatooine") %>%
-  rename(height_cm = height, mass_kg = mass) %>%
-  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
-  arrange(desc(birth_year)) %>%
-  select(name, height_in, mass_lbs)
-```
-
-The `arrow` package uses lazy evaluation to delay computation until the
-result is required. This speeds up processing by enabling the Arrow C++
-library to perform multiple computations in one operation. `result` is
-an object with class `arrow_dplyr_query` which represents all the
-computations to be performed:
-
-``` r
-result
-#> Table (query)
-#> name: string
-#> height_in: expr
-#> mass_lbs: expr
-#>
-#> * Filter: equal(homeworld, "Tatooine")
-#> * Sorted by birth_year [desc]
-#> See $.data for the source Arrow object
-```
-
-To perform these computations and materialize the result, call
-`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
-suitable for passing to other `arrow` or `dplyr` functions:
-
-``` r
-result %>% compute()
-#> Table
-#> 10 rows x 3 columns
-#> $name <string>
-#> $height_in <double>
-#> $mass_lbs <double>
-```
-
-`collect()` returns an R `data.frame`, suitable for viewing or passing
-to other R functions for analysis or visualization:
-
-``` r
-result %>% collect()
-#> # A tibble: 10 x 3
-#>    name               height_in mass_lbs
-#>    <chr>                  <dbl>    <dbl>
-#>  1 C-3PO                   65.7    165.
-#>  2 Cliegg Lars             72.0     NA
-#>  3 Shmi Skywalker          64.2     NA
-#>  4 Owen Lars               70.1    265.
-#>  5 Beru Whitesun lars      65.0    165.
-#>  6 Darth Vader             79.5    300.
-#>  7 Anakin Skywalker        74.0    185.
-#>  8 Biggs Darklighter       72.0    185.
-#>  9 Luke Skywalker          67.7    170.
-#> 10 R5-D4                   38.2     70.5
+conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
-The `arrow` package works with most single-table `dplyr` verbs, including those
-that compute aggregates.
+In most cases installing the latest release should "just work" without 
+requiring any additional system dependencies, especially if you are using 
+Window or a Mac. For those users, CRAN hosts binary packages that contain 
+the Arrow C++ library upon which the `arrow` package relies, and no 
+additional steps should be required.
 
-```r
-sw %>%
-  group_by(species) %>%
-  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
-  collect()
-```
+There are some special cases to note:
 
-Additionally, equality joins (e.g. `left_join()`, `inner_join()`) are supported
-for joining multiple tables.
+- On Linux the installation process can sometimes be more involved because 
+CRAN does not host binaries for Linux. For more information please see the [installation guide](https://arrow.apache.org/docs/r/articles/install.html).
 
-```r
-jedi <- data.frame(
-  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
-  jedi = c(FALSE, TRUE, TRUE)
-)
-
-sw %>%
-  select(1:11) %>%
-  right_join(jedi) %>%
-  collect()
-```
+- If you are compiling `arrow` from source, please note that as of version 
+10.0.0, `arrow` requires C++17 to build. This has implications on Windows and
+CentOS 7. For Windows users it means you need to be running an R version of 
+4.0 or later. On CentOS 7, it means you need to install a newer compiler 
+than the default system compiler gcc 4.8. See the [installation details article](https://arrow.apache.org/docs/r/articles/developers/install_details.html) for guidance. Note that 
+this does not affect users who are installing a binary version of the package.
 
-Window functions (e.g. `ntile()`) are not yet
-supported. Inside `dplyr` verbs, Arrow offers support for many functions and
-operators, with common functions mapped to their base R and tidyverse
-equivalents. The [changelog](https://arrow.apache.org/docs/r/news/index.html)
-lists many of them. If there are additional functions you would like to see
-implemented, please file an issue as described in the [Getting
-help](#getting-help) section below.
+- Development versions of `arrow` are released nightly. Most users will not 
+need to install nightly builds, but if you do please see the article on [installing nightly builds]([installation guide](https://arrow.apache.org/docs/r/articles/install_nightly.html) for more information.

Review Comment:
   yup, typo. thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1022706108


##########
r/vignettes/data_wrangling.Rmd:
##########
@@ -0,0 +1,172 @@
+---
+title: "Data analysis with dplyr syntax"
+description: >
+  Learn how to use the `dplyr` backend supplied by `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides a `dplyr` back end that allows users to manipulate tabular Arrow data (`Table` and `Dataset` objects) using familiar `dplyr` syntax. To use this functionality, make sure that the `arrow` and `dplyr` packages are both loaded. In this article we will take the `starwars` data set included in `dplyr`, convert it to an Arrow Table, and then analyze this data. Note that, although these examples all use an in-memory `Table` object, the same functionality works for an on-disk `Dataset` object with only minor differences in behavior (documented later in the article).

Review Comment:
   Could we use a different term to "back end" here? I've heard different people use the terms "back end", "frontend", "API" and other terms, and I think this can sound a bit ambiguous.



##########
r/vignettes/data_wrangling.Rmd:
##########
@@ -0,0 +1,172 @@
+---
+title: "Data analysis with dplyr syntax"
+description: >
+  Learn how to use the `dplyr` backend supplied by `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides a `dplyr` back end that allows users to manipulate tabular Arrow data (`Table` and `Dataset` objects) using familiar `dplyr` syntax. To use this functionality, make sure that the `arrow` and `dplyr` packages are both loaded. In this article we will take the `starwars` data set included in `dplyr`, convert it to an Arrow Table, and then analyze this data. Note that, although these examples all use an in-memory `Table` object, the same functionality works for an on-disk `Dataset` object with only minor differences in behavior (documented later in the article).
+
+To get started let's load the packages and create the data:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+sw <- arrow_table(starwars, as_data_frame = FALSE)
+```
+
+## One-table dplyr verbs
+
+The `arrow` package provides support for the `dplyr` one-table verbs, allowing users to construct data analysis pipelines in a familiar way. The example below shows the use of `filter()`, `rename()`, `mutate()`, `arrange()` and `select()`:
+
+```{r}
+result <- sw %>%
+  filter(homeworld == "Tatooine") %>%
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
+
+It is important to note that `arrow` users lazy evaluation to delay computation until the result is explicitly requested. This speeds up processing by enabling the Arrow C++ library to perform multiple computations in one operation. As a consequence of this design choice, we have not yet performed computations on the `sw` data have been performed. The `result` variable is an object with class `arrow_dplyr_query` that represents all the computations to be performed:

Review Comment:
   ```suggestion
   It is important to note that `arrow` uses lazy evaluation to delay computation until the result is explicitly requested. This speeds up processing by enabling the Arrow C++ library to perform multiple computations in one operation. As a consequence of this design choice, we have not yet performed computations on the `sw` data. The `result` variable is an object with class `arrow_dplyr_query` that represents all the computations to be performed:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028640176


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the [cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the [dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding CSV 
+file would have been. This is in part due to the use of file compression: by default, 
+Parquet files written with the `arrow` package use [Snappy compression](https://google.github.io/snappy/) but other options such as gzip 
+are also supported. See `help("write_parquet", package = "arrow")` for more
+information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality is supported in `arrow`:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+with `haven::labelled` columns, and data frame with other custom
+attributes. To learn more about how metadata are handled in `arrow`, the [metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading a only subset of columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, 
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 
+
+Although `read_csv_arrow()` currently has fewer parsing options for dealing 
+with every CSV format variation in the wild than other CSV readers available
+in R, for those files that it can read, it is often significantly faster than 
+other R CSV readers, such as `base::read.csv`, `readr::read_csv`, and
+`data.table::fread`.

Review Comment:
   Okay, I've removed that paragraph



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1029856292


##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,100 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [dplyr](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with dplyr and arrow: we'll start by ensuring both packages are loaded
 
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
 
 ## Example: NYC taxi data
 
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large datasets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available. 

Review Comment:
   Offering some very minor edits to ingrain the idea that Tables is in-memory and Datasets out-of-memory, rather than the single/multi-file distinction.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1027961179


##########
r/vignettes/metadata.Rmd:
##########
@@ -0,0 +1,82 @@
+---
+title: "Metadata"
+description: > 
+  Learn how Arrow uses Schemas to document structure of data objects, 
+  and how R metadata are supported in Arrow
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data and metadata object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+## Arrow metadata classes
+
+The `arrow` package defines the following classes for representing metadata:
+
+- A `Schema` is a list of `Field` objects used to describe the structure of a tabular data object; where
+- A `Field` specifies a character string name and a `DataType`; and
+- A `DataType` is an attribute controlling how values are represented
+
+Consider this:
+
+```{r}
+df <- data.frame(x = 1:3, y = c("a", "b", "c"))
+tb <- arrow_table(df)
+tb$schema
+```
+
+The schema that has been automatically inferred could also be manually created:
+
+```{r}
+schema(
+  field(name = "x", type = int32()),
+  field(name = "y", type = utf8())
+)
+```
+
+The `schema()` function allows the following shorthand to define fields:
+
+```{r}
+schema(x = int32(), y = utf8())
+```
+
+Sometimes it is important to specify the schema manually, particularly if you want fine grained control over the Arrow data types:
+
+```{r}
+arrow_table(df, schema = schema(x = int64(), y = utf8()))
+arrow_table(df, schema = schema(x = float64(), y = utf8()))
+```
+
+
+## R object attributes
+
+Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object Schema. Attributes added to objects in this fasnion are stored under the `r` key, as shown below:

Review Comment:
   ```suggestion
   Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object Schema. Attributes added to objects in this fashion are stored under the `r` key, as shown below:
   ```



##########
r/vignettes/metadata.Rmd:
##########
@@ -0,0 +1,82 @@
+---
+title: "Metadata"
+description: > 
+  Learn how Arrow uses Schemas to document structure of data objects, 
+  and how R metadata are supported in Arrow
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data and metadata object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+## Arrow metadata classes
+
+The `arrow` package defines the following classes for representing metadata:
+
+- A `Schema` is a list of `Field` objects used to describe the structure of a tabular data object; where
+- A `Field` specifies a character string name and a `DataType`; and
+- A `DataType` is an attribute controlling how values are represented
+
+Consider this:
+
+```{r}
+df <- data.frame(x = 1:3, y = c("a", "b", "c"))
+tb <- arrow_table(df)
+tb$schema
+```
+
+The schema that has been automatically inferred could also be manually created:
+
+```{r}
+schema(
+  field(name = "x", type = int32()),
+  field(name = "y", type = utf8())
+)
+```
+
+The `schema()` function allows the following shorthand to define fields:
+
+```{r}
+schema(x = int32(), y = utf8())
+```
+
+Sometimes it is important to specify the schema manually, particularly if you want fine grained control over the Arrow data types:

Review Comment:
   ```suggestion
   Sometimes it is important to specify the schema manually, particularly if you want fine-grained control over the Arrow data types:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028614229


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized

Review Comment:
   Agreed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic merged pull request #14514: ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic merged PR #14514:
URL: https://github.com/apache/arrow/pull/14514


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028682421


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the [cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the [dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding CSV 
+file would have been. This is in part due to the use of file compression: by default, 
+Parquet files written with the `arrow` package use [Snappy compression](https://google.github.io/snappy/) but other options such as gzip 
+are also supported. See `help("write_parquet", package = "arrow")` for more
+information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality is supported in `arrow`:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+with `haven::labelled` columns, and data frame with other custom
+attributes. To learn more about how metadata are handled in `arrow`, the [metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading a only subset of columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, 
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 
+
+Although `read_csv_arrow()` currently has fewer parsing options for dealing 
+with every CSV format variation in the wild than other CSV readers available
+in R, for those files that it can read, it is often significantly faster than 
+other R CSV readers, such as `base::read.csv`, `readr::read_csv`, and
+`data.table::fread`.
+
+A simple example of writing and reading a CSV file with `arrow` is shown below:
+
+```{r}
+file_path <- tempfile()
+write_csv_arrow(mtcars, file_path)
+read_csv_arrow(file_path, col_select = starts_with("d"))
+```
+
+## JSON format
+
+The `arrow` package supports reading (but not writing) of tabular data from line-delimited JSON, using the `read_json_arrow()` function. A minimal example is shown below:
+
+```{r}
+file_path <- tempfile()
+writeLines('
+    { "hello": 3.5, "world": false, "yo": "thing" }
+    { "hello": 3.25, "world": null }
+    { "hello": 0.0, "world": true, "yo": null }
+  ', file_path, useBytes = TRUE)
+read_json_arrow(file_path)
+```
+
+## Further reading
+
+- To learn more about cloud storage, see the [cloud storage article](./fs.html).
+- To learn more about multi-file datasets, see the [datasets article](./dataset.html).

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028743223


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the [cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the [dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding CSV 
+file would have been. This is in part due to the use of file compression: by default, 
+Parquet files written with the `arrow` package use [Snappy compression](https://google.github.io/snappy/) but other options such as gzip 
+are also supported. See `help("write_parquet", package = "arrow")` for more
+information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality is supported in `arrow`:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+with `haven::labelled` columns, and data frame with other custom
+attributes. To learn more about how metadata are handled in `arrow`, the [metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading a only subset of columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, 
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 

Review Comment:
   I think it's worth it if we can pull it off without seeming to introduce unnecessary complexity for new users. I've attempted to add a sentence or two linking to the reader options for csv and parquet, so at least there's a trail of breadcrumbs for them to follow if they want to dive into the R6 classes 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1029856743


##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,100 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [dplyr](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with dplyr and arrow: we'll start by ensuring both packages are loaded
 
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
 
 ## Example: NYC taxi data
 
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large datasets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available. 
 
-The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so you can't just
-read it all in and stack it into a single data frame.
+This data set is comprised of 158 distinct Parquet files, each corresponding to a month of data. A single file is typically around 400-500MB in size, and the full data set is about 70GB in size. It is not a small data set -- it is slow to download and does not fit in memory on a typical machine 🙂  -- so we also host a "tiny" version of the NYC taxi data that is formatted in exactly the same way but includes only one out of every thousand entries in the original data set (i.e., individual files are <1MB in size, and the "tiny" data set is only 70MB) 

Review Comment:
   ```suggestion
   This multi-file data set is comprised of 158 distinct Parquet files, each corresponding to a month of data. A single file is typically around 400-500MB in size, and the full data set is about 70GB in size. It is not a small data set -- it is slow to download and does not fit in memory on a typical machine 🙂  -- so we also host a "tiny" version of the NYC taxi data that is formatted in exactly the same way but includes only one out of every thousand entries in the original data set (i.e., individual files are <1MB in size, and the "tiny" data set is only 70MB) 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028628223


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,221 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+For developers interested in learning more about the package structure, see the [developer guide](./developing.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
 
+It is possible to exercise fine grained control over this conversion process. To learn more about the different types and how they are converted, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Arrow (also called Feather) that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 
 
+### Individual files
 
-### Arrow to R
+When the goal is to read a single data file, there are several functions you can use:
 
-| Arrow type        | R type                       |
-|-------------------|------------------------------|
-| boolean           | logical                      |
-| int8              | integer                      |
-| int16             | integer                      |
-| int32             | integer                      |
-| int64             | integer^1^                   |
-| uint8             | integer                      |
-| uint16            | integer                      |
-| uint32            | integer^1^                   |
-| uint64            | integer^1^                   |
-| float16           | -^2^                         |
-| float32           | double                       |
-| float64           | double                       |
-| utf8              | character                    |
-| large_utf8        | character                    |
-| binary            | arrow_binary ^3^             |
-| large_binary      | arrow_large_binary ^3^       |
-| fixed_size_binary | arrow_fixed_size_binary ^3^  |
-| date32            | Date                         |
-| date64            | POSIXct                      |
-| time32            | hms::hms                     |
-| time64            | hms::hms                     |
-| timestamp         | POSIXct                      |
-| duration          | difftime                     |
-| decimal           | double                       |
-| dictionary        | factor^4^                    |
-| list              | arrow_list ^5^               |
-| large_list        | arrow_large_list ^5^         |
-| fixed_size_list   | arrow_fixed_size_list ^5^    |
-| struct            | data.frame                   |
-| null              | vctrs::vctrs_unspecified     |
-| map               | arrow_list ^5^               |
-| union             | -^2^                         |
-
-^1^: These integer types may contain values that exceed the range of R's 
-`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are 
-converted to `double` ("numeric") and `int64` is converted to 
-`bit64::integer64`. This conversion can be disabled (so that `int64` always
-yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+-   `read_parquet()`: read a file in Parquet format
+-   `read_feather()`: read a file in Arrow/Feather format
+-   `read_delim_arrow()`: read a delimited text file 
+-   `read_csv_arrow()`: read a comma-separated values (CSV) file
+-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
+-   `read_json_arrow()`: read a JSON data file
 
-^2^: Some Arrow data types do not currently have an R equivalent and will raise an error
-if cast to or mapped to via a schema.
+In every case except JSON, there is a corresponding `write_*()` function 
+that allows you to write data files in the appropriate format. 
 
-^3^: `arrow*_binary` classes are implemented as lists of raw vectors. 
+By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`. 
 
-^4^: Due to the limitation of R factors, Arrow `dictionary` values are coerced
-to string when translated to R if they are not already strings.
+In the example below, we take the `starwars` data provided by the `dplyr` package and write it to a Parquet file using `write_parquet()`
 
-^5^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of` 
-with a `ptype` attribute set to what an empty Array of the value type converts to. 
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+
+file_path <- tempfile(fileext = ".parquet")
+write_parquet(starwars, file_path)
+```
 
+We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`) but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):
+
+```{r}
+sw_frame <- read_parquet(file_path)
+sw_table <- read_parquet(file_path, as_data_frame = FALSE)
+sw_table
+```
+
+To learn more about reading and writing individual data files, see the [read/write article](./read_write.html).
+
+### Multi-file data sets
+
+When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The `arrow` package provides a convenient way to read, write, and analyze with data stored in this fashion using the Dataset interface. 

Review Comment:
   ```suggestion
   When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The `arrow` package provides the Dataset interface, a convenient way to read, write, and analyze a single data file that is larger-than-memory and multi-file data sets.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028612461


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.
+
+- `read_parquet()`: read a file in Parquet format
+- `read_feather()`: read a file in the Apache Arrow IPC format (also called the Feather format)
+- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
+- `read_csv_arrow()`: read a comma-separated values (CSV) file
+- `read_tsv_arrow()`: read a tab-separated values (TSV) file
+- `read_json_arrow()`: read a JSON data file
+
+For writing data to single files, the `arrow` package provides the
+following functions, which can be used with both R data frames and 
+Arrow Tables:
+
+- `write_parquet()`: write a file in Parquet format
+- `write_feather()`: write a file in Arrow IPC format
+- `write_csv_arrow()`: write a file in CSV format
+
+All these functions can read and write files in the local filesystem or
+to cloud storage. For more on cloud storage support in `arrow`, see the [cloud storage article](./fs.html).
+
+The `arrow` package also supports reading and writing multi-file datasets,
+which enable analysis and processing of larger-than-memory data, and provide 
+the ability to partition data into smaller chunks without loading the full 
+data into memory. For more information on this topic, see the [dataset article](./dataset.html).
+
+## Parquet format
+
+[Apache Parquet](https://parquet.apache.org/) is a popular
+choice for storing analytics data; it is a binary format that is 
+optimized for reduced file sizes and fast read performance, especially 
+for column-based access patterns. The simplest way to read and write
+Parquet data using `arrow` is with the `read_parquet()` and 
+`write_parquet()` functions. To illustrate this, we'll write the 
+`starwars` data included in `dplyr` to a Parquet file, then read it 
+back in. First load the `arrow` and `dplyr` packages:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+Next we'll write the data frame to a Parquet file located at `file_path`:
+
+```{r}
+file_path <- tempfile()
+write_parquet(starwars, file_path)
+```
+
+The size of a Parquet file is typically much smaller than the corresponding CSV 
+file would have been. This is in part due to the use of file compression: by default, 
+Parquet files written with the `arrow` package use [Snappy compression](https://google.github.io/snappy/) but other options such as gzip 
+are also supported. See `help("write_parquet", package = "arrow")` for more
+information.
+
+Having written the Parquet file, we now can read it with `read_parquet()`:
+
+```{r}
+read_parquet(file_path)
+```
+
+The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
+
+```{r}
+read_parquet(file_path, as_data_frame = FALSE)
+```
+
+One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality is supported in `arrow`:
+
+```{r}
+read_parquet(file_path, col_select = c("name", "height", "mass"))
+```
+
+R object attributes are preserved when writing data to Parquet or
+Arrow/Feather files and when reading those files back into R. This enables
+round-trip writing and reading of `sf::sf` objects, R data frames with
+with `haven::labelled` columns, and data frame with other custom
+attributes. To learn more about how metadata are handled in `arrow`, the [metadata article](./metadata.html).
+
+## Arrow/Feather format
+
+The Arrow file format was developed to provide binary columnar 
+serialization for data frames, to make reading and writing data frames 
+efficient, and to make sharing data across data analysis languages easy.
+This file format is sometimes referred to as Feather because it is an
+outgrowth of the original [Feather](https://github.com/wesm/feather) project 
+that has now been moved into the Arrow project itself. You can find the 
+detailed specification of version 2 of the Arrow format -- officially 
+referred to as [the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) --
+on the Arrow specification page. 
+
+The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
+
+```{r}
+file_path <- tempfile()
+write_feather(starwars, file_path)
+```
+
+The `read_feather()` function provides a familiar interface for reading feather files:
+
+```{r}
+read_feather(file_path)
+```
+
+Like the Parquet reader, this reader supports reading a only subset of columns, and can produce Arrow Table output:
+
+```{r}
+read_feather(
+  file = file_path, 
+  col_select = c("name", "height", "mass"), 
+  as_data_frame = FALSE
+)
+```
+
+## CSV format
+
+The read/write capabilities of the `arrow` package also include support for 
+CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, 
+and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read 
+data files, where the Arrow C++ options have been mapped to arguments in a 
+way that mirrors the conventions used in `readr::read_delim()`, with a 
+`col_select` argument inspired by `vroom::vroom()`. 
+
+Although `read_csv_arrow()` currently has fewer parsing options for dealing 
+with every CSV format variation in the wild than other CSV readers available
+in R, for those files that it can read, it is often significantly faster than 
+other R CSV readers, such as `base::read.csv`, `readr::read_csv`, and
+`data.table::fread`.

Review Comment:
   I would prefer not to. It's another one where this copy is preserved from the existing docs (it's currently here: https://arrow.apache.org/docs/r/articles/arrow.html) and I personally think it's unwise. I don't think it helps the community to make benchmarking claims in the documentation. That feels like something to do elsewhere? I'd be very happy to delete this actually



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028673962


##########
r/vignettes/read_write.Rmd:
##########
@@ -0,0 +1,164 @@
+---
+title: "Reading and writing data files"
+description: >
+  Learn how to read and write CSV, Parquet, and Feather files with `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides functions for reading single data files in
+several common formats. By default, calling any of these functions
+returns an R data frame. To return an Arrow Table, set argument
+`as_data_frame = FALSE`.

Review Comment:
   Great!  Steph mentioned earlier that you'd coined "in-memory vs. larger-than-memory" btw; I really like that!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on pull request #14514: ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1325592744

   I approved this before the CI had run, but there are a few failures relating to linting that need to be sorted before we merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1012448675


##########
r/vignettes/package_conventions.Rmd:
##########
@@ -0,0 +1,25 @@
+---
+title: "Package conventions"
+description: >
+  Learn how R6 classes are used in `arrow` to wrap the 
+  underlying C++ library, and when to use these objects
+  rather than the R-friendly wrapper functions
+output: rmarkdown::html_vignette
+---
+
+C++ is an object-oriented language, so the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package, these classes are implemented as [`R6`](https://r6.r-lib.org) classes, most of which are exported from the namespace.
+
+## Naming conventions
+
+In order to match the C++ naming conventions, the `R6` classes are named in "TitleCase", e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`.
+
+Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of these factory methods that an R user might most often encounter also have a "snake_case" alias, in order to be more familiar for contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` would do the same as `RecordBatch$create()` above.
+
+The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadFile()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used.
+
+## Further reading
+
+- [Documentation for the Arrow C++ library](https://arrow.apache.org/docs/cpp/)
+- [API reference for the Arrow C++ classes](https://arrow.apache.org/docs/cpp/api.html)
+
+

Review Comment:
   Another one I mistook as new content, but yeah, let's pop it into the dev content somewhere!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1012447084


##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | `chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | `record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | `arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | `open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of the same name in the underlying Arrow C++ library. It is also worth mentioning that the `arrow` package also defines classes that do not exist in the C++ library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects

Review Comment:
   Sorry, I was reading it as new content as I'd not looked at the "getting started" page in ages and ages!  Honestly, I'd just err on the side of your own judgment in cases like this; I agree this section is for one of the dev vignettes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1007963532


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets
+-   Read JSON files
 
-Conda users can install `arrow` from conda-forge with
+It provides data analysis tools for both in-memory and larger-than-memory data sets
 
-``` shell
-conda install -c conda-forge --strict-channel-priority r-arrow
-```
+-   Analyze and process larger-than-memory datasets
+-   Manipulate and analyze Arrow data with `dplyr` verbs
 
-Installing a released version of the `arrow` package requires no
-additional system dependencies. For macOS and Windows, CRAN hosts binary
-packages that contain the Arrow C++ library. On Linux, source package
-installation will also build necessary C++ dependencies. For a faster,
-more complete installation, set the environment variable
-`NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details.
+It provides access to remote filesystems and servers
 
-As of version 10.0.0, `arrow` requires C++17 to build. This means that:
+-   Read and write files in Amazon S3 and Google Cloud Storage buckets
+-   Connect to Arrow Flight servers to transport large datasets over networks  
+    
+Additional features include:
 
-* On Windows, you need `R >= 4.0`. Version 9.0.0 was the last version to support
-R 3.6.
-* On CentOS 7, you can build the latest version of `arrow`,
-but you first need to install a newer compiler than the default system compiler,
-gcc 4.8. See `vignette("install", package = "arrow")` for guidance.
-Note that you only need the newer compiler to build `arrow`:
-installing a binary package, as from RStudio Package Manager,
-or loading a package you've already installed works fine with the system defaults.
+-   Zero-copy data sharing between R and Python
+-   Fine control over column types to work seamlessly
+    with databases and data warehouses
+-   Support for compression codecs including Snappy, gzip, Brotli,
+    Zstandard, LZ4, LZO, and bzip2
+-   Access and manipulate Arrow objects through low-level bindings
+    to the C++ library
+-   Toolkit for building connectors to other applications
+    and services that use Arrow
 
-### Installing a development version
+## Installation
 
-Development versions of the package (binary and source) are built
-nightly and hosted at <https://nightlies.apache.org/arrow/r/>. To
-install from there:
+Most R users will probably want to install the latest release of `arrow` 
+from CRAN:
 
 ``` r
-install.packages("arrow", repos = c(arrow = "https://nightlies.apache.org/arrow/r", getOption("repos")))
+install.packages("arrow")
 ```
 
-Conda users can install `arrow` nightly builds with
+Alternatively, if you are using conda you can install `arrow` from conda-forge:
 
 ``` shell
-conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
-```
-
-If you already have a version of `arrow` installed, you can switch to
-the latest nightly development version with
-
-``` r
-arrow::install_arrow(nightly = TRUE)
-```
-
-These nightly package builds are not official Apache releases and are
-not recommended for production use. They may be useful for testing bug
-fixes and new features under active development.
-
-## Usage
-
-Among the many applications of the `arrow` package, two of the most accessible are:
-
--   High-performance reading and writing of data files with multiple
-    file formats and compression codecs, including built-in support for
-    cloud storage
--   Analyzing and manipulating bigger-than-memory data with `dplyr`
-    verbs
-
-The sections below describe these two uses and illustrate them with
-basic examples. The sections below mention two Arrow data structures:
-
--   `Table`: a tabular, column-oriented data structure capable of
-    storing and processing large amounts of data more efficiently than
-    R’s built-in `data.frame` and with SQL-like column data types that
-    afford better interoperability with databases and data warehouse
-    systems
--   `Dataset`: a data structure functionally similar to `Table` but with
-    the capability to work on larger-than-memory data partitioned across
-    multiple files
-
-### Reading and writing data files with `arrow`
-
-The `arrow` package provides functions for reading single data files in
-several common formats. By default, calling any of these functions
-returns an R `data.frame`. To return an Arrow `Table`, set argument
-`as_data_frame = FALSE`.
-
--   `read_parquet()`: read a file in Parquet format
--   `read_feather()`: read a file in Feather format (the Apache Arrow
-    IPC format)
--   `read_delim_arrow()`: read a delimited text file (default delimiter
-    is comma)
--   `read_csv_arrow()`: read a comma-separated values (CSV) file
--   `read_tsv_arrow()`: read a tab-separated values (TSV) file
--   `read_json_arrow()`: read a JSON data file
-
-For writing data to single files, the `arrow` package provides the
-functions `write_parquet()`, `write_feather()`, and `write_csv_arrow()`.
-These can be used with R `data.frame` and Arrow `Table` objects.
-
-For example, let’s write the Star Wars characters data that’s included
-in `dplyr` to a Parquet file, then read it back in. Parquet is a popular
-choice for storing analytic data; it is optimized for reduced file sizes
-and fast read performance, especially for column-based access patterns.
-Parquet is widely supported by many tools and platforms.
-
-First load the `arrow` and `dplyr` packages:
-
-``` r
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-Then write the `data.frame` named `starwars` to a Parquet file at
-`file_path`:
-
-``` r
-file_path <- tempfile()
-write_parquet(starwars, file_path)
-```
-
-Then read the Parquet file into an R `data.frame` named `sw`:
-
-``` r
-sw <- read_parquet(file_path)
-```
-
-R object attributes are preserved when writing data to Parquet or
-Feather files and when reading those files back into R. This enables
-round-trip writing and reading of `sf::sf` objects, R `data.frame`s with
-with `haven::labelled` columns, and `data.frame`s with other custom
-attributes.
-
-For reading and writing larger files or sets of multiple files, `arrow`
-defines `Dataset` objects and provides the functions `open_dataset()`
-and `write_dataset()`, which enable analysis and processing of
-bigger-than-memory data, including the ability to partition data into
-smaller chunks without loading the full data into memory. For examples
-of these functions, see `vignette("dataset", package = "arrow")`.
-
-All these functions can read and write files in the local filesystem or
-in Amazon S3 (by passing S3 URIs beginning with `s3://`). For more
-details, see `vignette("fs", package = "arrow")`
-
-### Using `dplyr` with `arrow`
-
-The `arrow` package provides a `dplyr` backend enabling manipulation of
-Arrow tabular data with `dplyr` verbs. To use it, first load both
-packages `arrow` and `dplyr`. Then load data into an Arrow `Table` or
-`Dataset` object. For example, read the Parquet file written in the
-previous example into an Arrow `Table` named `sw`:
-
-``` r
-sw <- read_parquet(file_path, as_data_frame = FALSE)
-```
-
-Next, pipe on `dplyr` verbs:
-
-``` r
-result <- sw %>%
-  filter(homeworld == "Tatooine") %>%
-  rename(height_cm = height, mass_kg = mass) %>%
-  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
-  arrange(desc(birth_year)) %>%
-  select(name, height_in, mass_lbs)
-```
-
-The `arrow` package uses lazy evaluation to delay computation until the
-result is required. This speeds up processing by enabling the Arrow C++
-library to perform multiple computations in one operation. `result` is
-an object with class `arrow_dplyr_query` which represents all the
-computations to be performed:
-
-``` r
-result
-#> Table (query)
-#> name: string
-#> height_in: expr
-#> mass_lbs: expr
-#>
-#> * Filter: equal(homeworld, "Tatooine")
-#> * Sorted by birth_year [desc]
-#> See $.data for the source Arrow object
-```
-
-To perform these computations and materialize the result, call
-`compute()` or `collect()`. `compute()` returns an Arrow `Table`,
-suitable for passing to other `arrow` or `dplyr` functions:
-
-``` r
-result %>% compute()
-#> Table
-#> 10 rows x 3 columns
-#> $name <string>
-#> $height_in <double>
-#> $mass_lbs <double>
-```
-
-`collect()` returns an R `data.frame`, suitable for viewing or passing
-to other R functions for analysis or visualization:
-
-``` r
-result %>% collect()
-#> # A tibble: 10 x 3
-#>    name               height_in mass_lbs
-#>    <chr>                  <dbl>    <dbl>
-#>  1 C-3PO                   65.7    165.
-#>  2 Cliegg Lars             72.0     NA
-#>  3 Shmi Skywalker          64.2     NA
-#>  4 Owen Lars               70.1    265.
-#>  5 Beru Whitesun lars      65.0    165.
-#>  6 Darth Vader             79.5    300.
-#>  7 Anakin Skywalker        74.0    185.
-#>  8 Biggs Darklighter       72.0    185.
-#>  9 Luke Skywalker          67.7    170.
-#> 10 R5-D4                   38.2     70.5
+conda install -c conda-forge --strict-channel-priority r-arrow
 ```
 
-The `arrow` package works with most single-table `dplyr` verbs, including those
-that compute aggregates.
+In most cases installing the latest release should "just work" without 
+requiring any additional system dependencies, especially if you are using 
+Window or a Mac. For those users, CRAN hosts binary packages that contain 
+the Arrow C++ library upon which the `arrow` package relies, and no 
+additional steps should be required.
 
-```r
-sw %>%
-  group_by(species) %>%
-  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
-  collect()
-```
+There are some special cases to note:
 
-Additionally, equality joins (e.g. `left_join()`, `inner_join()`) are supported
-for joining multiple tables.
+- On Linux the installation process can sometimes be more involved because 
+CRAN does not host binaries for Linux. For more information please see the [installation guide](https://arrow.apache.org/docs/r/articles/install.html).
 
-```r
-jedi <- data.frame(
-  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
-  jedi = c(FALSE, TRUE, TRUE)
-)
-
-sw %>%
-  select(1:11) %>%
-  right_join(jedi) %>%
-  collect()
-```
+- If you are compiling `arrow` from source, please note that as of version 
+10.0.0, `arrow` requires C++17 to build. This has implications on Windows and
+CentOS 7. For Windows users it means you need to be running an R version of 
+4.0 or later. On CentOS 7, it means you need to install a newer compiler 
+than the default system compiler gcc 4.8. See the [installation details article](https://arrow.apache.org/docs/r/articles/developers/install_details.html) for guidance. Not that 

Review Comment:
   ```suggestion
   than the default system compiler gcc 4.8. See the [installation details article](https://arrow.apache.org/docs/r/articles/developers/install_details.html) for guidance. Note that 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1006134236


##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks l
 ike this in memory:

Review Comment:
   Yeah, let me think about it. On the one hand I feel like it would be weird to do a full digression into "memory addresses can be rewritten from hex notation to a decimal number..." and then talk about the reasons why we want all data blocks to start (and stop) at an address that is a multiple of 64 bytes. That seems like a long and unhelpful tangent (especially since I'm right at the edge of my own knowledge trying to understand why it actually matters!) but at the same time the Arrow spec page does go into this detail and treats it as if it's assumed knowledge. So I feel almost obligated to try to unpack it in the R docs just so that readers of this vignette will be able to read the Arrow spec and not get completely confused. 
   
   ugh. it's a mess. this vignette is the one I'm least certain about -- I feel like we do need it to bridge the yawning chasm between the R docs and the Arrow spec, but I'm not confident I'm doing it well 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1009076215


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.
 
+To learn more about data types in Arrow and how they are mapped to R data types, see the [data types](./data_types.html) article. 
 
-^1^: `float64` and `double` are the same concept and data type in Arrow C++; 
-however, only `float64()` is used in arrow as the function `double()` already 
-exists in base R
 
-^2^: If the character vector exceeds 2GB of strings, it will be converted to a 
-`large_utf8` Arrow type
+## Reading and writing data
 
-^3^: Only lists where all elements are the same type are able to be translated 
-to Arrow list type (which is a "list of" some type).
+One of the main ways to use `arrow` is to read and write data files in
+several common formats. The `arrow` package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Feather that are not widely supported in other packages. In addition, the `arrow` package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 

Review Comment:
   lol, I have to stop using "see what you think" as my way of saying "this is what I think we should do but happy to pivot if you think we need to" 😁 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1022660802


##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,95 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
-
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
-
-## Example: NYC taxi data
-
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
-
-The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so you can't just
-read it all in and stack it into a single data frame.
-
-In Windows and macOS binary packages, S3 support is included.
-On Linux, when installing from source, S3 support is not enabled by default,
-and it has additional system requirements.
-See `vignette("install", package = "arrow")` for details.
-To see if your arrow installation has S3 support, run:
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [`dplyr`](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with `dplyr` and `arrow`: we'll start by ensuring both packages are loaded
 
 ```{r}
-arrow::arrow_with_s3()
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-Even with S3 support enabled, network speed will be a bottleneck unless your
-machine is located in the same AWS region as the data. So, for this vignette,
-we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi"
-directory.
+## Example: NYC taxi data
 
-### Retrieving data from a public Amazon S3 bucket
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large datasets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available. 
 
-If your arrow build has S3 support, you can sync the data locally with:
+This data set is comprised of 158 distinct Parquet files, each corresponding to a month of data. A single file is typically around 400-500MB in size, and the full data set is about 70GB in size. It is not a small data set -- it is slow to download and does not fit in memory on a typical machine 🙂  -- so we also host a "tiny" version of the NYC taxi data that is formatted in exactly the same way but includes only one out of every thousand entries in the original data set (i.e., individual files are <1MB in size, and the "tiny" data set is only 70MB) 

Review Comment:
   This change makes sense - same dataset but smaller version. 



##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,95 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
-
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
-
-## Example: NYC taxi data
-
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
-
-The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so you can't just
-read it all in and stack it into a single data frame.
-
-In Windows and macOS binary packages, S3 support is included.
-On Linux, when installing from source, S3 support is not enabled by default,
-and it has additional system requirements.
-See `vignette("install", package = "arrow")` for details.
-To see if your arrow installation has S3 support, run:
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [`dplyr`](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with `dplyr` and `arrow`: we'll start by ensuring both packages are loaded
 
 ```{r}
-arrow::arrow_with_s3()
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-Even with S3 support enabled, network speed will be a bottleneck unless your
-machine is located in the same AWS region as the data. So, for this vignette,
-we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi"
-directory.
+## Example: NYC taxi data
 
-### Retrieving data from a public Amazon S3 bucket
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large datasets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available. 
 
-If your arrow build has S3 support, you can sync the data locally with:
+This data set is comprised of 158 distinct Parquet files, each corresponding to a month of data. A single file is typically around 400-500MB in size, and the full data set is about 70GB in size. It is not a small data set -- it is slow to download and does not fit in memory on a typical machine 🙂  -- so we also host a "tiny" version of the NYC taxi data that is formatted in exactly the same way but includes only one out of every thousand entries in the original data set (i.e., individual files are <1MB in size, and the "tiny" data set is only 70MB) 
 
-```{r, eval = FALSE}
-arrow::copy_files("s3://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-# Alternatively, with GCS:
-arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-```
+If you have Amazon S3 and/or Google Cloud Storage support enabled in `arrow` (true for most users; see links at the end of this article if you need to troubleshoot this), you can connect to the "tiny taxi data" with either of the following commands:

Review Comment:
   Why either of these commands, what's the difference between the two?



##########
r/vignettes/dataset.Rmd:
##########
@@ -186,34 +124,9 @@ month: int32
 ")
 ```
 
-## Querying the dataset
-
-Up to this point, you haven't loaded any data. You've walked directories to find
-files, you've parsed file paths to identify partitions, and you've read the
-headers of the Parquet files to inspect their schemas so that you can make sure
-they all are as expected.
+## Querying Datasets
 
-In the current release, arrow supports the dplyr verbs:
-
- * `mutate()` and `transmute()`,
- * `select()`, `rename()`, and `relocate()`,
- * `filter()`,
- * `arrange()`,
- * `union()` and `union_all()`,
- * `left_join()`, `right_join()`, `full_join()`, `inner_join()`, and `anti_join()`,
- * `group_by()` and `summarise()`.
-
-At any point in a chain, you can use `collect()` to pull the selected subset of
-the data into an in-memory R data frame. 
-
-Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
-in your query on an Arrow Dataset. In that case, the arrow package raises an error. However,
-for dplyr queries on Arrow Table objects (which are already in memory), the
-package automatically calls `collect()` before processing that dplyr verb.
-
-Here's an example: suppose that you are curious about tipping behavior among the
-longest taxi rides. Let's find the median tip percentage for rides with
-fares greater than $100 in 2015, broken down by the number of passengers:
+Now that we have a Dataset object that refers to out data, we can construct `dplyr`-style queries. This is possible because `arrow` supplies a back end that allows users to manipulate tabular Arrow data using `dplyr` verbs. Here's an example: suppose you are curious about tipping behavior in the longest taxi rides. Let's find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers:

Review Comment:
   ```suggestion
   Now that we have a Dataset object that refers to our data, we can construct `dplyr`-style queries. This is possible because `arrow` supplies a back end that allows users to manipulate tabular Arrow data using `dplyr` verbs. Here's an example: suppose you are curious about tipping behavior in the longest taxi rides. Let's find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers:
   ```



##########
r/vignettes/dataset.Rmd:
##########
@@ -548,4 +451,11 @@ Most file formats have magic numbers which are written at the end.  This means a
 partial file write can safely be detected and discarded.  The CSV file format does
 not have any such concept and a partially written CSV file may be detected as valid.
 
+## Further reading
+
+- To learn about cloud storage, see the [cloud storage article](./fs.html).
+- To learn about `dplyr` with `arrow`, see the [data wrangling article](./data_wrangling.html).
+- To learn about reading and writing data, see the [read/write article](./read_write.html).
+- To manually enable cloud support on Linux, see the article on [installation on Linux](./install.html).
+- To learn about schemas and metadata, see the [metadata article](./metadata.html).

Review Comment:
   Do you reckon it might be worth linking to the cookbook chapter on datasets here too?



##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,95 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
-
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
-
-## Example: NYC taxi data
-
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
-
-The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so you can't just
-read it all in and stack it into a single data frame.
-
-In Windows and macOS binary packages, S3 support is included.
-On Linux, when installing from source, S3 support is not enabled by default,
-and it has additional system requirements.
-See `vignette("install", package = "arrow")` for details.
-To see if your arrow installation has S3 support, run:
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [`dplyr`](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with `dplyr` and `arrow`: we'll start by ensuring both packages are loaded
 
 ```{r}
-arrow::arrow_with_s3()
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-Even with S3 support enabled, network speed will be a bottleneck unless your
-machine is located in the same AWS region as the data. So, for this vignette,
-we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi"
-directory.
+## Example: NYC taxi data
 
-### Retrieving data from a public Amazon S3 bucket
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large datasets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available. 
 
-If your arrow build has S3 support, you can sync the data locally with:
+This data set is comprised of 158 distinct Parquet files, each corresponding to a month of data. A single file is typically around 400-500MB in size, and the full data set is about 70GB in size. It is not a small data set -- it is slow to download and does not fit in memory on a typical machine 🙂  -- so we also host a "tiny" version of the NYC taxi data that is formatted in exactly the same way but includes only one out of every thousand entries in the original data set (i.e., individual files are <1MB in size, and the "tiny" data set is only 70MB) 
 
-```{r, eval = FALSE}
-arrow::copy_files("s3://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-# Alternatively, with GCS:
-arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-```
+If you have Amazon S3 and/or Google Cloud Storage support enabled in `arrow` (true for most users; see links at the end of this article if you need to troubleshoot this), you can connect to the "tiny taxi data" with either of the following commands:
 
-If your arrow build doesn't have S3 support, you can download the files
-with the additional code shown below.  Since these are large files, 
-you may need to increase R's download timeout from the default of 60 seconds, e.g.
-`options(timeout = 300)`.
-
-```{r, eval = FALSE}
-bucket <- "https://voltrondata-labs-datasets.s3.us-east-2.amazonaws.com"
-for (year in 2009:2022) {
-  if (year == 2022) {
-    # We only have through Feb 2022 there
-    months <- 1:2
-  } else {
-    months <- 1:12
-  }
-  for (month in months) {
-    dataset_path <- file.path("nyc-taxi", paste0("year=", year), paste0("month=", month))
-    dir.create(dataset_path, recursive = TRUE)
-    try(download.file(
-      paste(bucket, dataset_path, "part-0.parquet", sep = "/"),
-      file.path(dataset_path, "part-0.parquet"),
-      mode = "wb"
-    ), silent = TRUE)
-  }
-}
+```r
+bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi-tiny")
+bucket <- gs_bucket("voltrondata-labs-datasets/nyc-taxi-tiny", anonymous = TRUE)
 ```
 
-Note that these download steps in the vignette are not executed: if you want to run
-with live data, you'll have to do it yourself separately.
-Given the size, if you're running this locally and don't have a fast connection,
-feel free to grab only a year or two of data.
+If you want to use the full data set, replace `nyc-taxi-tiny` with `nyc-taxi` in the code above. Apart from size -- and with it the cost in time, bandwidth usage, and CPU cycles -- there is no difference in the two versions of the data: you can test your code using the tiny taxi data and then check how it scales using the full data set.
 
-If you don't have the taxi data downloaded, the vignette will still run and will
-yield previously cached output for reference. To be explicit about which version
-is running, let's check whether you're running with live data:
+To make a local copy of the data set stored in the `bucket` to a folder called `"nyc-taxi"`, use the `copy_files()` function:
 
-```{r}
-dir.exists("nyc-taxi")
+```r
+copy_files(from = bucket, to = "nyc-taxi")
 ```
 
-## Opening the dataset
+For the purposes of this article, we assume that the NYC taxi dataset (either the full data or the tiny version) has been downloaded locally and exists in an `"nyc-taxi"` directory. 
 
-Because dplyr is not necessary for many Arrow workflows,
-it is an optional (`Suggests`) dependency. So, to work with Datasets,
-you need to load both arrow and dplyr.
+## Opening Datasets
 
-```{r}
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-The first step is to create a Dataset object, pointing at the directory of data.
+The first step in the process is to create a Dataset object that points at the data directory:
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi")
 ```
 
-The file format for `open_dataset()` is controlled by the `format` parameter, 
-which has a default value of `"parquet"`.  If you had a directory
-of Arrow format files, you could instead specify `format = "arrow"` in the call.
+It is important to note that when we do this, the data values are not loaded into memory. Instead, Arrow scans the data directory to find relevant files, parses the file paths looking for a "Hive-style partitioning" (see below), and reads headers of the data files to construct a Schema that contains metadata describing the structure of the data. For more information about Schemas see the [metadata article](./metadata.html).
 
-Other supported formats include: 
+Two questions naturally follow from this: what kind of files does `open_dataset()` look for, and what structure does it expect to find in the file paths? Let's start by looking at the file types.
 
-* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+By default `open_dataset()` looks for Parquet files but you can override this using the `format` argument. For example if the data were encoded as CSV files we could set `format = "csv"` to connect to the data. The Arrow Dataset interface supports several file formats including: 

Review Comment:
   +1 for explicitly mentioning parquet is default and the others are not



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1005609362


##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.

Review Comment:
   ```suggestion
   - The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
   ```



##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks l
 ike this in memory:

Review Comment:
   ```suggestion
   However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian (i.e. right-to-left). To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is 
 what the buffer looks like this in memory:
   ```
   Maybe?
   
   (it's not showing up on the preview but I added "(i.e. right-to-left)")



##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks l
 ike this in memory:
+
+::: {.grid}
+::: {.g-col-6}
+|Byte 0 (validity bitmap) | Bytes 1-63            |
+|-------------------------|-----------------------|
+| `00011101`              | `0` (padding)         |
+:::
+:::
+
+## Data buffer
+
+The data buffer, like the validity bitmap, is padded out to a length of 64 bytes to preserve natural alignment. Here's the diagram showing the physical layout:
+
+::: {.grid}
+::: {.g-col-12}
+| Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
+|-----------|-------------|------------|-------------|-------------|-------------|
+| `1`       | unspecified | `2`        | `4`         | `8`         | unspecified |
+:::
+:::
+
+Each integer occupies 4 bytes, as per the requirements of a 32-bit signed integer. Notice that the bytes associate with the missing value are left unspecified: space is allocated for the value but those bytes are not filled. 

Review Comment:
   ```suggestion
   Each integer occupies 4 bytes, as per the requirements of a 32-bit signed integer. Notice that the bytes associated with the missing value are left unspecified: space is allocated for the value but those bytes are not filled. 
   ```



##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks l
 ike this in memory:

Review Comment:
   ```suggestion
   However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multiple of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks
  like this in memory:
   ```



##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks l
 ike this in memory:

Review Comment:
   Something about "allocated memory addresses are a multiple of the data block sizes" isn't just quite clicking for me.  I'm definitely misinterpreting this, as my first thoughts are "but memory addresses look like '0x7fff324d73e0' or whatever, what does that have to do with data block sizes?"  I'm definitely doing something stupid here in how I'm interpreting the wording, but I'm not sure what extra words in that sentence would stop me from doing it.



##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks l
 ike this in memory:
+
+::: {.grid}
+::: {.g-col-6}
+|Byte 0 (validity bitmap) | Bytes 1-63            |
+|-------------------------|-----------------------|
+| `00011101`              | `0` (padding)         |
+:::
+:::
+
+## Data buffer
+
+The data buffer, like the validity bitmap, is padded out to a length of 64 bytes to preserve natural alignment. Here's the diagram showing the physical layout:
+
+::: {.grid}
+::: {.g-col-12}
+| Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
+|-----------|-------------|------------|-------------|-------------|-------------|
+| `1`       | unspecified | `2`        | `4`         | `8`         | unspecified |
+:::
+:::
+
+Each integer occupies 4 bytes, as per the requirements of a 32-bit signed integer. Notice that the bytes associate with the missing value are left unspecified: space is allocated for the value but those bytes are not filled. 
+
+## Offset buffer
+
+Some types of Arrow array include a third buffer known as the offset buffer. This is most frequently encountered in the context of string arrays, such as this one:
+
+```{r}
+string_array <- Array$create(c("hello", "amazing", "and", "cruel", "world")) 
+string_array
+```
+
+Using the same schematic notation as before, this is the structure of the object. It has the same metadata as before but as shown below, there are now three buffers:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_string.png")
+```
+
+To understand the role of the offset buffer, it helps to note the format of the data buffer for a string array: it concatenates all strings end to end in one contiguous section of memory. For the `string_array` object, the contents of the data buffer would look like one long utf8-encoded string:
+
+```
+helloamazingandcruelworld
+```
+
+Because individual strings can be of variable length, the role of the offset buffer is to specify where the boundaries between the slots are. The second slot in our array is the string `"amazing"`. If the positions in the data array are indexed like this
+
+|  h |  e |  l |  l |  o |  a |  m |  a |  z |  i |  n |  g |  a |  n |  d | ... |
+| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --- |
+|  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 | ... |
+
+then we can see that the string of interest begins at position 5 and ends at position 11. The offset buffer consists integers that store these break point locations. For `string_array` it might look like this:
+
+```
+0 5 12 15 20 25
+```
+
+The difference between the `utf8()` data type and the `large_utf8()` data type is that these the `utf8()` data type stores these as 32-bit integers whereas the `large_utf8()` type stores them as 64-bit integers.
+
+## Chunked arrays
+
+Arrays are immutable objects: once an Array has been initialized the valuse it stores cannot be altered. This ensures that multiple entities can safely refer to an Array via pointers, and not run the risk that the values will change. Using immutable Arrays makes it possible for Arrow to avoid unnecessary copies of data objects. 
+
+There are limitations to immutable Arrays, most notably when new batches of data arrive. Because an array is immutable, you can't add the new information to an existing array. The only thing you can do if you don't want to disturb or copy your existing array is create a new array that contains the new data. Doing that preserves the immutability of arrays and doesn't lead to any unnecessary copying but now we have a new problem: the data are now split across two arrays. Each array contains only one "chunk" of the data. What would be ideal is an abstraction layer that allows us to treat these two Arrays as though they were a single "Array-like" object.

Review Comment:
   "but now we have a new problem: the data are now split across two arrays" - could we remove one of those "now"s?



##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a  pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account of how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multople of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks l
 ike this in memory:
+
+::: {.grid}
+::: {.g-col-6}
+|Byte 0 (validity bitmap) | Bytes 1-63            |
+|-------------------------|-----------------------|
+| `00011101`              | `0` (padding)         |
+:::
+:::
+
+## Data buffer
+
+The data buffer, like the validity bitmap, is padded out to a length of 64 bytes to preserve natural alignment. Here's the diagram showing the physical layout:
+
+::: {.grid}
+::: {.g-col-12}
+| Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
+|-----------|-------------|------------|-------------|-------------|-------------|
+| `1`       | unspecified | `2`        | `4`         | `8`         | unspecified |
+:::
+:::
+
+Each integer occupies 4 bytes, as per the requirements of a 32-bit signed integer. Notice that the bytes associate with the missing value are left unspecified: space is allocated for the value but those bytes are not filled. 
+
+## Offset buffer
+
+Some types of Arrow array include a third buffer known as the offset buffer. This is most frequently encountered in the context of string arrays, such as this one:
+
+```{r}
+string_array <- Array$create(c("hello", "amazing", "and", "cruel", "world")) 
+string_array
+```
+
+Using the same schematic notation as before, this is the structure of the object. It has the same metadata as before but as shown below, there are now three buffers:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_string.png")
+```
+
+To understand the role of the offset buffer, it helps to note the format of the data buffer for a string array: it concatenates all strings end to end in one contiguous section of memory. For the `string_array` object, the contents of the data buffer would look like one long utf8-encoded string:
+
+```
+helloamazingandcruelworld
+```
+
+Because individual strings can be of variable length, the role of the offset buffer is to specify where the boundaries between the slots are. The second slot in our array is the string `"amazing"`. If the positions in the data array are indexed like this
+
+|  h |  e |  l |  l |  o |  a |  m |  a |  z |  i |  n |  g |  a |  n |  d | ... |
+| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --- |
+|  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 | ... |
+
+then we can see that the string of interest begins at position 5 and ends at position 11. The offset buffer consists integers that store these break point locations. For `string_array` it might look like this:
+
+```
+0 5 12 15 20 25
+```
+
+The difference between the `utf8()` data type and the `large_utf8()` data type is that these the `utf8()` data type stores these as 32-bit integers whereas the `large_utf8()` type stores them as 64-bit integers.
+
+## Chunked arrays
+
+Arrays are immutable objects: once an Array has been initialized the valuse it stores cannot be altered. This ensures that multiple entities can safely refer to an Array via pointers, and not run the risk that the values will change. Using immutable Arrays makes it possible for Arrow to avoid unnecessary copies of data objects. 

Review Comment:
   ```suggestion
   Arrays are immutable objects: once an Array has been initialized the values it stores cannot be altered. This ensures that multiple entities can safely refer to an Array via pointers, and not run the risk that the values will change. Using immutable Arrays makes it possible for Arrow to avoid unnecessary copies of data objects. 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1007451768


##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides [reticulate](https://rstudio.github.io/reticulate/) methods for passing data between R and Python within the same process. This vignette provides a brief overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of functionality that is better supported in Python than in R at the current state of development. For example, at one point in time the R `arrow` package didn't support `concat_arrays()` but PyArrow did, so this would have been a good use case at that time. At the time of current writing PyArrow has more comprehensive support for [Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- but see `vignette("flight", package = "arrow")` -- so that would be another instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass data objects between R and Python. With large data sets, it can be quite costly -- in terms of time and CPU cycles -- to perform the copy and covert operations required to translate a native data structure in R (e.g., a data frame) to an analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. Because Arrow data objects such as Tables have the same in-memory format in R and Python, it is possible to perform "zero-copy" data transfers, in which only the metadata needs to be passed between languages. As illustrated later, this drastically improves performance. 
 
-## Installing
+## Installing PyArrow
 
-To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
-To install it in a virtualenv,
+To use Arrow in Python, the `pyarrow` library needs to be installed. For example, you may wish to create a Python [virtual environment](https://docs.python.org/3/library/venv.html) with the `pyarrow` library. A virtual environment is a specific Python installation created for one project or purpose. It is a good practice to use specific environments in Python so that updating a package doesn't impact packages in other projects.
+
+You can perform the set up from within R. Let's suppose you want to call your virtual environment something like `my-pyarrow-env`. Your setup code would look like this: 
 
 ```r
-library(reticulate)
-virtualenv_create("arrow-env")
-install_pyarrow("arrow-env")
+virtualenv_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
 ```
 
-If you want to install a development version of `pyarrow`,
-add `nightly = TRUE`:
+If you want to install a development version of `pyarrow` to the virtual environment, add `nightly = TRUE` to the `install_pyarrow()` command:
 
 ```r
-install_pyarrow("arrow-env", nightly = TRUE)
+install_pyarrow("my-pyarrow-env", nightly = TRUE)
 ```
 
-A virtualenv or a virtual environment is a specific Python installation
-created for one project or purpose. It is a good practice to use
-specific environments in Python so that updating a package doesn't
-impact packages in other projects.
+Note that you don't have to use virtual environments. If you prefer [conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/environments.html), you can use this setup code:
 
-`install_pyarrow()` also works with `conda` environments
-(`conda_create()` instead of `virtualenv_create()`).
+```r
+conda_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
+```
 
-For more on installing and configuring Python,
-see the [reticulate docs](https://rstudio.github.io/reticulate/articles/python_packages.html).
+To learn more about installing and configuring Python from R,
+see the [reticulate documentation](https://rstudio.github.io/reticulate/articles/python_packages.html), which discusses the topic in more detail.
 
-## Using
+## Importing PyArrow
 
-To start, load `arrow` and `reticulate`, and then import `pyarrow`.
+Assuming that `arrow` and `reticulate` are both loaded in R, your first step is to make sure that the correct Python environment is being used. To do that, use a command like this:
+
+```r
+use_virtualenv("my-pyarrow-env") # virtualenv users
+use_condaenv("my-pyarrow-env")   # conda users
+```

Review Comment:
   ha yeah I'll split



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1294226138

   | @eitsupi Is it OK to use base pipes (|>) that only works with R4.1 or later in vignettes?
   
   Probably not! I've reverted to magrittr pipes now 😁 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] eitsupi commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

eitsupi commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1007998182


##########
r/vignettes/data_wrangling.Rmd:
##########
@@ -0,0 +1,172 @@
+---
+title: "Data analysis with dplyr syntax"
+description: >
+  Learn how to use the `dplyr` backend supplied by `arrow` 
+output: rmarkdown::html_vignette
+---
+
+The `arrow` package provides a `dplyr` back end that allows users to manipulate tabular Arrow data (`Table` and `Dataset` objects) using familiar `dplyr` syntax. To use this functionality, make sure that the `arrow` and `dplyr` packages are both loaded. In this article we will take the `starwars` data set included in `dplyr`, convert it to an Arrow Table, and then analyze this data. Note that, although these examples all use an in-memory `Table` object, the same functionality works for an on-disk `Dataset` object with only minor differences in behavior (documented later in the article).
+
+To get started let's load the packages and create the data:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+sw <- arrow_table(starwars, as_data_frame = FALSE)
+```
+
+## One-table dplyr verbs
+
+The `arrow` package provides support for the `dplyr` one-table verbs, allowing users to construct data analysis pipelines in a familiar way. The example below shows the use of `filter()`, `rename()`, `mutate()`, `arrange()` and `select()`:
+
+```{r}
+result <- sw %>%
+  filter(homeworld == "Tatooine") %>%
+  rename(height_cm = height, mass_kg = mass) %>%
+  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
+  arrange(desc(birth_year)) %>%
+  select(name, height_in, mass_lbs)
+```
+
+It is important to note that `arrow` users lazy evaluation to delay computation until the result is explicitly requested. This speeds up processing by enabling the Arrow C++ library to perform multiple computations in one operation. As a consequence of this design choice, we have not yet performed computations on the `sw` data have been performed. The `result` variable is an object with class `arrow_dplyr_query` that represents all the computations to be performed:
+
+```{r}
+result
+```
+
+To perform these computations and materialize the result, we call
+`compute()` or `collect()`. The difference between the two determines what kind of object will be returned. Calling `compute()` returns an Arrow Table, suitable for passing to other `arrow` or `dplyr` functions:
+
+```{r}
+compute(result)
+```
+
+In contrast, `collect()` returns an R data frame, suitable for viewing or passing to other R functions for analysis or visualization:
+
+```{r}
+collect(result)
+```
+
+The `arrow` package has broad support for single-table `dplyr` verbs, including those that compute aggregates. For example, it supports `group_by()` and `summarize()`, as well as commonly-used convenience functions such as `count()`:
+
+```{r}
+sw %>%
+  group_by(species) %>%
+  summarize(mean_height = mean(height, na.rm = TRUE)) %>%
+  collect()
+
+sw %>% 
+  count(gender) %>%
+  collect()
+```
+
+Note, however, that window functions such as `ntile()` are not yet supported. 
+
+## Two-table dplyr verbs
+
+Equality joins (e.g. `left_join()`, `inner_join()`) are supported for joining multiple tables. This is illustrated below:
+
+```{r}
+jedi <- data.frame(
+  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
+  jedi = c(FALSE, TRUE, TRUE)
+)
+
+sw %>%
+  select(1:3) %>%
+  right_join(jedi) %>%
+  collect()
+```
+
+## Expressions within dplyr verbs
+
+Inside `dplyr` verbs, Arrow offers support for many functions and operators, with common functions mapped to their base R and tidyverse equivalents. The [changelog](https://arrow.apache.org/docs/r/news/index.html) lists many of them. If there are additional functions you would like to see implemented, please file an issue as described in the [Getting help](https://arrow.apache.org/docs/r/#getting-help) guidelines.
+
+## Registering custom bindings
+
+The `arrow` package makes it possible for users to supply bindings for custom functions in some situations using `register_scalar_function()`. To operate correctly, the to-be-registered function must have `context` as its first argument, as required by the query engine. For example, suppose we wanted to implement a function that converts a string to snake case (a greatly simplified version of `janitor::make_clean_names()`). The function could be written as follows:
+
+```{r}
+to_snake_name <- function(context, string) {
+  replace <- c(`'` = "", `"` = "", `-` = "", `\\.` = "_", ` ` = "_")
+  string %>% 
+    stringr::str_replace_all(replace) %>%
+    stringr::str_to_lower() %>% 
+    stringi::stri_trans_general(id = "Latin-ASCII")
+}
+```
+
+To call this within an `arrow`/`dplyr` pipeline, it needs to be registered:
+
+```{r}
+register_scalar_function(
+  name = "to_snake_name",
+  fun = to_snake_name,
+  in_type = utf8(),
+  out_type = utf8(),
+  auto_convert = TRUE
+)
+```
+
+In this expression, the `name` argument specifies the name by which it will be recognized in the context of the `arrow`/`dplyr` pipeline and `fun` is the function itself. The `in_type` and `out_type` arguments are used to specify the expected data type for the input and output, and `auto_convert` specifies whether `arrow` should automatically convert any R inputs to their Arrow equivalents. 
+
+Once registered, the following works:
+
+```{r}
+sw %>% 
+  transmute(name, snake_name = to_snake_name(name)) %>%

Review Comment:
   How about use `mutate(.keep = "none")` instead of `transmute()`?
   tidyverse/dplyr#6414



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1012447084


##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | `chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | `record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | `arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | `open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of the same name in the underlying Arrow C++ library. It is also worth mentioning that the `arrow` package also defines classes that do not exist in the C++ library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects

Review Comment:
   Sorry, I was reading it as new content as I'd not looking at the getting started page in ages and ages!  Honestly, I'd just err on the side of your own judgment in cases like this; I agree this section is for one of the dev vignettes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1011863788


##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | `chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | `record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | `arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | `open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of the same name in the underlying Arrow C++ library. It is also worth mentioning that the `arrow` package also defines classes that do not exist in the C++ library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects

Review Comment:
   Is there benefit to mentioning these classes to Arrow users who aren't package developers? 



##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |

Review Comment:
   We can use the convenience function `as_arrow_array()` to create Arrays from R vectors.



##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional rectangular data structures used to store tabular data. For columnar, one-dimensional data, the `Array` and `ChunkedArray` classes are provided. Finally, `Scalar` objects represent individual values. The table below summarizes these objects and shows how you can create new instances using the [`R6`](https://r6.r-lib.org/) class object, as well as convenience functions that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | Convenience function                          |
+| --- | -------------- | ----------------------------------------------| --------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |                                               |
+| 1   | `Array`        | `Array$create(vector, type)`                  |                                               |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | `chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | `record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | `arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | `open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of the same name in the underlying Arrow C++ library. It is also worth mentioning that the `arrow` package also defines classes that do not exist in the C++ library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects
+
+In addition to these data objects, `arrow` defines the following classes for representing metadata:
+
+- A `Schema` is a list of `Field` objects used to describe the structure of a tabular data object; where
+- A `Field` specifies a character string name and a `DataType`; and
+- A `DataType` is an attribute controlling how values are represented
+
+To learn more about the metadata classes, see the [metadata article](./metadata.html).
+
+## Scalars
+
+A Scalar object is simply a single value that can be of any type. It might be an integer, a string, a timestamp, or any of the different `DataType` objects that Arrow supports. Most users of the `arrow` R package are unlikely to create Scalars directly, but should there be a need you can do this by calling the `Scalar$create()` method:
+
+```{r}
+Scalar$create("hello")
+```
+
+
+## Arrays
+
+Array objects are ordered sets of Scalar values. As with Scalars most users will not need to create Arrays directly, but if the need arises there is an `Array$create()` method that allows you to create new Arrays:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+```{r}
+string_array <- Array$create(c("hello", "amazing", "and", "cruel", "world"))
+string_array
+```
+
+An Array can be subset using square brackets as shown below:
+
+```{r}
+string_array[4:5]
+```
+
+Arrays are immutable objects: once an Array has been created it cannot be modified or extended. 
+
+## Chunked Arrays
+
+In practice, most users of the `arrow` R package are likely to use Chunked Arrays rather than simple Arrays. Under the hood, a Chunked Array is a collection of one or more Arrays that can be indexed _as if_ they were a single Array. The reasons that Arrow provides this functionality are described in the [data object layout article](./developers/data_object_layout.html) but for the present purposes it is sufficient to notice that Chunked Arrays behave like Arrays in regular data analysis.
+
+To illustrate, let's use the `chunked_array()` function:
+
+```{r}
+chunked_string_array <- chunked_array(
+  string_array,
+  c("I", "love", "you")
+)
+```
+
+The `chunked_array()` function is just a wrapper around the functionality that `ChunkedArray$create()` provides. Let's print the object:
+
+```{r}
+chunked_string_array
+```
+
+The double bracketing in this output is intended to highlight the fact that Chunked Arrays are wrappers around one or more Arrays. However, although comprised of multiple distinct Arrays, a Chunked Array can be indexed as if they were laid end-to-end in a single "vector-like" object. This is illustrated below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_indexing.png")
+```
+
+We can use `chunked_string_array` to illustrate this: 
+
+```{r}
+chunked_string_array[4:7]
+```
+
+An important thing to note is that "chunking" is not semantically meaningful. It is an implementation detail only: users should never treat the chunk as a meaningful unit. Writing the data to disk, for example, often results in the data being organized into different chunks. Similarly, two Chunked Arrays that contain the same values assigned to different chunks are deemed equivalent. To illustrate this we can create a Chunked Array that contains the same four same four values as `chunked_string_array[4:7]`, but organized into one chunk rather than split into two:
+
+```{r}
+cruel_world <- chunked_array(c("cruel", "world", "I", "love"))
+cruel_world
+```
+
+Testing for equality using `==` produces an element-wise comparison, and the result is a new Chunked Array of four (boolean type) `true` values:
+
+```{r}
+cruel_world == chunked_string_array[4:7]
+```
+
+In short, the intention is that users interact with Chunked Arrays as if they are ordinary one-dimensional data structures without ever having to think much about the underlying chunking arrangement. 
+
+Chunked Arrays are mutable, in a specific sense: Arrays can be added and removed from a Chunked Array.
+
+## Record Batches
+
+A Record Batch is tabular data structure comprised of named Arrays. Record Batches are a fundamental unit for data interchange in Arrow, but are not typically used for data analysis. Tables and Datasets are usually more convenient in analytic contexts.
+
+These Arrays can be of different types but must all be the same length. Each Array is referred to as one of the "fields" or "columns" of the Record Batch. You can create a Record Batch using the `record_batch()` function or by using the `RecordBatch$create()` method. These functions are flexible and can accept inputs in several formats: you can pass a data frame, one or more named vectors, an input stream, or even a raw vector containing appropriate binary data. For example:
+
+```{r}
+rb <- record_batch(
+  strs = string_array, 
+  ints = integer_array,
+  dbls = c(1.1, 3.2, 0.2, NA, 11)
+)
+rb
+```
+
+This is a Record Batch containing 5 rows and 3 columns, and its conceptual structure is shown below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./record_batch.png")
+```
+
+The `arrow` package supplies a `$` method for Record Batch objects, used to extract a single column by name:
+
+```{r}
+rb$strs
+```
+
+You can use double brackets `[[` to refer to columns by position. The `rb$ints` array is the second column in our Record Batch so we can extract it with this:
+
+```{r}
+rb[[2]]
+```
+
+There is also `[` method that allows you to extract subsets of a record batch in the same way you would for a data frame. The command `rb[1:3, 1:2]` extracts the first three rows and the first two columns:
+
+```{r}
+rb[1:3, 1:2]
+```
+
+Record Batches cannot be concatenated: because they are comprised of Arrays, and Arrays are immutable objects, new rows cannot be added to Record Batch once created.
+
+## Tables
+
+A Table is comprised of named Chunked Arrays, in the same way that a Record Batch is comprised of named Arrays. You can subset Tables with `$`, `[[`, and `[` the same way you can for Record Batches. Unlike Record Batches, Tables can be concatenated (because they are comprised of Chunked Arrays). Suppose a second Record Batch arrives:
+
+```{r}
+new_rb <- record_batch(
+  strs = c("I", "love", "you"), 
+  ints = c(5L, 0L, 0L),
+  dbls = c(7.1, -0.1, 2)
+)
+```
+
+It is not possible to create a Record Batch that appends the data from `new_rb` to the data in `rb`, not without creating entirely new objects in memory. With Tables, however, we can:
+
+```{r}
+df <- arrow_table(rb)
+new_df <- arrow_table(new_rb)
+```
+
+We now have the two fragments of the data set represented as Tables. The difference between the Table and the Record Batch is that the columns are all represented as Chunked Arrays. Each Array from the original Record Batch is one chunk in the corresponding Chunked Array in the Table:
+
+```{r}
+rb$strs
+df$strs
+```
+
+It's the same underlying data -- and indeed the same immutable Array is referenced by both -- just enclosed by a new, flexible Chunked Array wrapper. However, it is this wrapper that allows us to concatenate Tables:
+
+```{r}
+concat_tables(df, new_df)
+```
+
+The resulting object is shown schematically below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./table.png")
+```
+

Review Comment:
   Do we perhaps also want a section on Datasets here as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1012456105


##########
r/vignettes/arrow.Rmd:
##########
@@ -1,227 +1,222 @@
 ---
-title: "Using the Arrow C++ Library in R"
-description: "This document describes the low-level interface to the Apache Arrow C++ library in R and reviews the patterns and conventions of the R package."
+title: "Get started with Arrow"
+description: >
+  An overview of the Apache Arrow project and the `arrow` R package
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Using the Arrow C++ Library in R}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-The Apache Arrow C++ library provides rich, powerful features for working with columnar data. The `arrow` R package provides both a low-level interface to the C++ library and some higher-level, R-flavored tools for working with it. This vignette provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
+Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
 
-# Features
+The `arrow` package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
 
-## Multi-file datasets
+## Package conventions
 
-The `arrow` package lets you work efficiently with large, multi-file datasets
-using `dplyr` methods. See `vignette("dataset", package = "arrow")` for an overview.
+The `arrow` R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the `arrow` R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:
 
-## Reading and writing files
+- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
+- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
+- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
 
-`arrow` provides some simple functions for using the Arrow C++ library to read and write files.
-These functions are designed to drop into your normal R workflow
-without requiring any knowledge of the Arrow C++ library
-and use naming conventions and arguments that follow popular R packages, particularly `readr`.
-The readers return `data.frame`s
-(or if you use the `tibble` package, they will act like `tbl_df`s),
-and the writers take `data.frame`s.
+This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because `arrow` also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:
 
-Importantly, `arrow` provides basic read and write support for the [Apache
-Parquet](https://parquet.apache.org/) columnar data file format.
+- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
+- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
 
-```r
-library(arrow)
-df <- read_parquet("path/to/file.parquet")
+All the examples used in this article rely on this high-level interface.
+
+To learn more, see the article on [package conventions](./package_conventions.html).
+
+
+## Tabular data in Arrow 
+
+A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the `arrow` R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
+
+```{r}
+library(arrow, warn.conflicts = FALSE)
+
+dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
+dat
 ```
 
-Just as you can read, you can write Parquet files:
+You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:
 
-```r
-write_parquet(df, "path/to/different_file.parquet")
+```{r}
+dat[1:2, 1:2]
 ```
 
-The `arrow` package also includes a faster and more robust implementation of the
-[Feather](https://github.com/wesm/feather) file format, providing `read_feather()` and
-`write_feather()`. This implementation depends
-on the same underlying C++ library as the Python version does,
-resulting in more reliable and consistent behavior across the two languages, as
-well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).
-`arrow` also by default writes the Feather V2 format
-([the Arrow IPC file format](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format)),
-which supports a wider range of data types, as well as compression.
-
-For CSV and line-delimited JSON, there are `read_csv_arrow()` and `read_json_arrow()`, respectively.
-While `read_csv_arrow()` currently has fewer parsing options for dealing with
-every CSV format variation in the wild, for the files it can read, it is
-often significantly faster than other R CSV readers, such as
-`base::read.csv`, `readr::read_csv`, and `data.table::fread`.
-
-## Working with Arrow data in Python
-
-Using [`reticulate`](https://rstudio.github.io/reticulate/), `arrow` lets you
-share data between R and Python (`pyarrow`) efficiently, enabling you to take
-advantage of the vibrant ecosystem of Python packages that build on top of
-Apache Arrow. See `vignette("python", package = "arrow")` for details.
+Along the same lines, the `$` operator can be used to extract named columns:
 
-## Access to Arrow messages, buffers, and streams
+```{r}
+dat$y
+```
 
-The `arrow` package also provides many lower-level bindings to the C++ library, which enable you
-to access and manipulate Arrow objects. You can use these to build connectors
-to other applications and services that use Arrow. One example is Spark: the
-[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
-move data to and from Spark, yielding [significant performance
-gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 
 
-# Object hierarchy
-
-## Metadata objects
-
-Arrow defines the following classes for representing metadata:
-
-| Class      | Description                                        | How to create an instance        |
-| ---------- | -------------------------------------------------- | -------------------------------- |
-| `DataType` | attribute controlling how values are represented   | functions in `help("data-type")` |
-| `Field`    | a character string name and a `DataType`           | `field(name, type)`              |
-| `Schema`   | list of `Field`s                                   | `schema(...)`                    |
-
-## Data objects
-
-Arrow defines the following classes for representing zero-dimensional (scalar),
-one-dimensional (array/vector-like), and two-dimensional (tabular/data
-frame-like) data:
-
-| Dim | Class          | Description                               | How to create an instance                                                                             |
-| --- | -------------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------|
-| 0   | `Scalar`       | single value and its `DataType`           | `Scalar$create(value, type)`                                                                          |
-| 1   | `Array`        | vector of values and its `DataType`       | `Array$create(vector, type)`                                                                          | 
-| 1   | `ChunkedArray` | vectors of values and their `DataType`    | `ChunkedArray$create(..., type)` or alias `chunked_array(..., type)`                                  |
-| 2   | `RecordBatch`  | list of `Array`s with a `Schema`          | `RecordBatch$create(...)` or alias `record_batch(...)`                                                |
-| 2   | `Table`        | list of `ChunkedArray` with a `Schema`    | `Table$create(...)`, alias `arrow_table(...)`, or `arrow::read_*(file, as_data_frame = FALSE)`        |
-| 2   | `Dataset`      | list of `Table`s  with the same `Schema`  | `Dataset$create(sources, schema)` or alias `open_dataset(sources, schema)`                            |
-
-Each of these is defined as an `R6` class in the `arrow` R package and
-corresponds to a class of the same name in the Arrow C++ library. The `arrow`
-package provides a variety of `R6` and S3 methods for interacting with instances
-of these classes.
-
-For convenience, the `arrow` package also defines several synthetic classes that
-do not exist in the C++ library, including:
-
-* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
-* `ArrowTabular`: inherited by `RecordBatch` and `Table`
-* `ArrowObject`: inherited by all Arrow objects
-
-# Internals
-
-## Mapping of R <--> Arrow types
-
-Arrow has a rich data type system that includes direct parallels with R's data types and much more.
+Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 
 
-In the tables, entries with a `-` are not currently implemented.
+To learn more about the different data object classes in `arrow`, see the article on [data objects](./data_objects.html).
 
-### R to Arrow
+## Converting Tables to data frames
 
-| R type                   | Arrow type |
-|--------------------------|------------|
-| logical                  | boolean    |
-| integer                  | int32      |
-| double ("numeric")       | float64^1^ |
-| character                | utf8^2^    |
-| factor                   | dictionary |
-| raw                      | uint8      |
-| Date                     | date32     |
-| POSIXct                  | timestamp  |
-| POSIXlt                  | struct     |
-| data.frame               | struct     |
-| list^3^                  | list       |
-| bit64::integer64         | int64      |
-| hms::hms                 | time32     |
-| difftime                 | duration   |
-| vctrs::vctrs_unspecified | null       |
+Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`
 
+```{r}
+as.data.frame(dat)
+```
+
+When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 
+In most instances the data conversion takes place automatically and without friction: a column stored as a timestamp in Arrow becomes a POSIXct vector in R, for example. However, there are some instances where the mapping between Arrow data types and R data types is not exact and care is required.

Review Comment:
   Ambiguity in my head totally gone now, thanks! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1005556357


##########
r/vignettes/cloud_storage.Rmd:
##########
@@ -0,0 +1,329 @@
+---
+title: "Using cloud storage (S3, GCS)"
+description: >
+  Learn how to work with data sets stored in an 
+  Amazon S3 bucket or on Google Cloud Storage 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Using cloud storage (S3, GCS)}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+Working with data stored in cloud storage systems like [Amazon Simple Storage Service](https://docs.aws.amazon.com/s3/) (S3) and [Google Cloud Storage](https://cloud.google.com/storage/docs) (GCS) is a very common task. Because of this, the Arrow C++ library provides a toolkit aimed to make it as simple to work with cloud storage as it is to work with the local filesystem.

Review Comment:
   TIL why S3 is called S3 :D



##########
r/vignettes/cloud_storage.Rmd:
##########
@@ -0,0 +1,329 @@
+---
+title: "Using cloud storage (S3, GCS)"
+description: >
+  Learn how to work with data sets stored in an 
+  Amazon S3 bucket or on Google Cloud Storage 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Using cloud storage (S3, GCS)}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+Working with data stored in cloud storage systems like [Amazon Simple Storage Service](https://docs.aws.amazon.com/s3/) (S3) and [Google Cloud Storage](https://cloud.google.com/storage/docs) (GCS) is a very common task. Because of this, the Arrow C++ library provides a toolkit aimed to make it as simple to work with cloud storage as it is to work with the local filesystem.
+
+To make this work, the Arrow C++ library contains a general-purpose interface for file systems, and the `arrow` package exposes this interface to R users. For instance, if you want to you can create a `LocalFileSystem` object that allows you to interact with the local file system in the usual ways: copying, moving, and deleting files, obtaining information about files and folders, and so on (see `help("FileSystem", package = "arrow")` for details). In general you probably don't need this functionality because you already have tools for working with your local file system, but this interface becomes much more useful in the context of remote file systems. Currently there is a specific implementation for Amazon S3 provided by the `S3FileSystem` class, and another one for Google Cloud Storage provided by `GcsFileSystem`.
+
+This vignette provides an overview of working with both S3 and GCS data using the Arrow toolkit. 
+
+## S3 and GCS support on Linux
+
+Before you start, make sure that your `arrow` install has support for S3 and/or GCS enabled. For most users this will be true by default, because the Windows and MacOS binary packages hosted on CRAN include S3 and GCS support. You can check whether support is enabled via helper functions:
+
+```r
+arrow_with_s3()
+arrow_with_gcs()
+```
+
+If these return `TRUE` then the relevant support is enabled.
+
+In some cases you may find that your system does not have support enabled. The most common case for this occurs on Linux when installing `arrow` from source. In this situation S3 and GCS support is not always enabled by default, and there are additional system requirements involved. See `vignette("install_linux", package = "arrow")` for details on how to resolve this.
+
+## Connecting to cloud storage
+
+One way of working with filesystems is to create `?FileSystem` objects. 
+`?S3FileSystem` objects can be created with the `s3_bucket()` function, which
+automatically detects the bucket's AWS region. Similarly, `?GcsFileSystem` objects
+can be created with the `gs_bucket()` function. The resulting
+`FileSystem` will consider paths relative to the bucket's path (so for example
+you don't need to prefix the bucket path when listing a directory).
+
+With a `FileSystem` object, you can point to specific files in it with the `$path()` method
+and pass the result to file readers and writers (`read_parquet()`, `write_feather()`, et al.).
+
+Often the reason users work with cloud storage in real world analysis is to access large data sets. An example of this is discussed in `vignette("dataset", package = "arrow")`, but new users may prefer to work with a much smaller data set while learning how the `arrow` cloud storage interface works. To that end, the examples in this vignette rely on a multi-file Parquet dataset that stores a copy of the `diamonds` data made available through the [`ggplot2`](https://ggplot2.tidyverse.org/) package, documented in `help("diamonds", package = "ggplot2")`. The cloud storage version of this data set consists of 5 Parquet files totaling less than 1MB in size.
+
+The diamonds data set is hosted on both S3 and GCS, in a bucket named `voltrondata-labs-datasets`. To create an S3FileSystem object that refers to that bucket, use the following command:
+
+```r
+bucket <- s3_bucket("voltrondata-labs-datasets")
+```
+
+To do this for the GCS version of the data, the command is as follows:
+
+```r
+bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
+```
+
+Note that `anonymous = TRUE` is required for GCS if credentials have not been configured. 
+
+<!-- TODO: update GCS note above if ARROW-17097 is addressed -->
+
+Within this bucket there is a folder called `diamonds`. We can call `bucket$ls("diamonds")` to list the files stored in this folder, or `bucket$ls("diamonds", recursive = TRUE)` to recursively search subfolders. Note that on GCS, you should always set `recursive = TRUE` because directories often don't appear in the results.
+
+Here's what we get when we list the files stored in the GCS bucket:
+
+``` r
+bucket$ls("diamonds", recursive = TRUE)
+```
+
+``` r
+## [1] "diamonds/cut=Fair/part-0.parquet"     
+## [2] "diamonds/cut=Good/part-0.parquet"     
+## [3] "diamonds/cut=Ideal/part-0.parquet"    
+## [4] "diamonds/cut=Premium/part-0.parquet"  
+## [5] "diamonds/cut=Very Good/part-0.parquet"
+```
+
+There are 5 Parquet files here, one corresponding to each of the "cut" categories in the `diamonds` data set. We can specify the path to a specific file by calling `bucket$path()`:
+
+``` r
+parquet_good <- bucket$path("diamonds/cut=Good/part-0.parquet")
+```
+
+We can use `read_parquet()` to read from this path directly into R:
+
+``` r
+diamonds_good <- read_parquet(parquet_good)
+diamonds_good
+```
+
+``` r
+## # A tibble: 4,906 × 9
+##    carat color clarity depth table price     x     y     z
+##    <dbl> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
+##  1  0.23 E     VS1      56.9    65   327  4.05  4.07  2.31
+##  2  0.31 J     SI2      63.3    58   335  4.34  4.35  2.75
+##  3  0.3  J     SI1      64      55   339  4.25  4.28  2.73
+##  4  0.3  J     SI1      63.4    54   351  4.23  4.29  2.7 
+##  5  0.3  J     SI1      63.8    56   351  4.23  4.26  2.71
+##  6  0.3  I     SI2      63.3    56   351  4.26  4.3   2.71
+##  7  0.23 F     VS1      58.2    59   402  4.06  4.08  2.37
+##  8  0.23 E     VS1      64.1    59   402  3.83  3.85  2.46
+##  9  0.31 H     SI1      64      54   402  4.29  4.31  2.75
+## 10  0.26 D     VS2      65.2    56   403  3.99  4.02  2.61
+## # … with 4,896 more rows
+## # ℹ Use `print(n = ...)` to see more rows
+```
+
+Note that this will be slower to read than if the file were local.
+
+<!-- though if you're running on a machine in the same AWS region as the file in S3,
+the cost of reading the data over the network should be much lower. -->
+
+
+<!--
+See `help(FileSystem)` for a list of options that `s3_bucket()`/`S3FileSystem$create()`
+and `gs_bucket()`/`GcsFileSystem$create()` can take.
+
+The object that `s3_bucket()` and `gs_bucket()` return is technically a `SubTreeFileSystem`, 
+which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be 
+useful for holding a reference to a subdirectory somewhere (on S3, GCS, or elsewhere).
+
+One way to get a subtree is to call the `$cd()` method on a `FileSystem`
+
+```r
+june2019 <- bucket$cd("nyc-taxi/year=2019/month=6")
+df <- read_parquet(june2019$path("part-0.parquet"))
+```
+
+`SubTreeFileSystem` can also be made from a URI:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6")
+```
+-->
+
+
+
+## Connecting directly with a URI
+
+In most use cases, the easiest and most natural way to connect to cloud storage in `arrow` is to use the FileSystem objects returned by `s3_bucket()` and `gs_bucket()`, especially when multiple file operations are required. However, in some cases you may want to download a file directly by specifying the URI. This is permitted by `arrow`, and functions like `read_parquet()`, `write_feather()`, `open_dataset()` etc will all accept URIs to cloud resources hosted on S3 or GCS. The format of an S3 URI is as follows:
+
+```
+s3://[access_key:secret_key@]bucket/path[?region=]
+```
+
+For GCS, the URI format looks like this:
+
+```
+gs://[access_key:secret_key@]bucket/path
+gs://anonymous@bucket/path
+```
+
+For example, the Parquet file storing the "good cut" diamonds that we downloaded earlier in the vignette is available on both S3 and CGS. The relevant URIs are as follows:
+
+```r
+uri <- "s3://voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"
+uri <- "gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"
+```
+
+Note that "anonymous" is required on GCS for public buckets. Regardless of which version you use, you can pass this URI to `read_parquet()` as if the file were stored locally:
+
+```r
+df <- read_parquet(uri)
+```
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+is equivalent to:
+
+```r
+bucket <- S3FileSystem$create(
+  endpoint_override="https://storage.googleapis.com",
+  allow_bucket_creation=TRUE
+)
+bucket$path("voltrondata-labs-datasets/")
+```
+
+Both tell the `S3FileSystem` object that it should allow the creation of new buckets 
+and to talk to Google Storage instead of S3. The latter works because GCS implements an 
+S3-compatible API -- see [File systems that emulate S3](#file-systems-that-emulate-s3) 
+below -- but if you want better support for GCS you should refer to a `GcsFileSystem` 
+bt using a URI that starts with `gs://`. 

Review Comment:
   ```suggestion
   but using a URI that starts with `gs://`. 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1307173135

   Moving my checkboxes down here so I can find them! :joy:
   
   - [x]     Reduces the content on the README page to the essential points
   - [x]     Rewrites the "get started" page to focus on common tasks and novice users
   - [x]     Moves discussion of the Arrow data object hierarchy to a new "data objects" vignette
   - [ ]     Moves discussion of Arrow data types and conversions to a new "data types" vignette
   - [ ]     Moves discussion of schemas and storage of R attributes to a new "metadata" vignette
   - [x]     Moves discussion of package naming conventions to a new "package conventions" vignette
   - [ ]     Moves discussion of read/write capabilities to a new "reading and writing data" vignette
   - [ ]     Moves discussion of the dplyr back end to a new "data wrangling" vignette
   - [ ]     Edits the "multi-file data sets" vignette to improve readability and to minimize risk of novice users unintentionally downloading the 70GB NYC taxi data by copy/paste errors
   - [x]     Minor edits to the "python" vignette to improve readability
   - [x]     Minor edits to the "cloud storage" vignette to improve readability
   - [x]     Minor edits to the "flight" vignette to improve readability
   - [x]     Inserts a new "data object layout" vignette (in the developer vignettes) to bridge between the R documentation and the Arrow specification page
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1023516220


##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,95 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
-
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
-
-## Example: NYC taxi data
-
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
-
-The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so you can't just
-read it all in and stack it into a single data frame.
-
-In Windows and macOS binary packages, S3 support is included.
-On Linux, when installing from source, S3 support is not enabled by default,
-and it has additional system requirements.
-See `vignette("install", package = "arrow")` for details.
-To see if your arrow installation has S3 support, run:
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar  [`dplyr`](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with `dplyr` and `arrow`: we'll start by ensuring both packages are loaded
 
 ```{r}
-arrow::arrow_with_s3()
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-Even with S3 support enabled, network speed will be a bottleneck unless your
-machine is located in the same AWS region as the data. So, for this vignette,
-we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi"
-directory.
+## Example: NYC taxi data
 
-### Retrieving data from a public Amazon S3 bucket
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large datasets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available. 
 
-If your arrow build has S3 support, you can sync the data locally with:
+This data set is comprised of 158 distinct Parquet files, each corresponding to a month of data. A single file is typically around 400-500MB in size, and the full data set is about 70GB in size. It is not a small data set -- it is slow to download and does not fit in memory on a typical machine 🙂  -- so we also host a "tiny" version of the NYC taxi data that is formatted in exactly the same way but includes only one out of every thousand entries in the original data set (i.e., individual files are <1MB in size, and the "tiny" data set is only 70MB) 
 
-```{r, eval = FALSE}
-arrow::copy_files("s3://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-# Alternatively, with GCS:
-arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-```
+If you have Amazon S3 and/or Google Cloud Storage support enabled in `arrow` (true for most users; see links at the end of this article if you need to troubleshoot this), you can connect to the "tiny taxi data" with either of the following commands:

Review Comment:
   updated to clarify that `s3_bucket()` refers to Amazon S3 copy of the data and `gs_bucket()` refers to Google Cloud copy



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1023520893


##########
r/vignettes/dataset.Rmd:
##########
@@ -548,4 +451,11 @@ Most file formats have magic numbers which are written at the end.  This means a
 partial file write can safely be detected and discarded.  The CSV file format does
 not have any such concept and a partially written CSV file may be detected as valid.
 
+## Further reading
+
+- To learn about cloud storage, see the [cloud storage article](./fs.html).
+- To learn about `dplyr` with `arrow`, see the [data wrangling article](./data_wrangling.html).
+- To learn about reading and writing data, see the [read/write article](./read_write.html).
+- To manually enable cloud support on Linux, see the article on [installation on Linux](./install.html).
+- To learn about schemas and metadata, see the [metadata article](./metadata.html).

Review Comment:
   Good suggestion! Done!  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1028607398


##########
r/README.md:
##########
@@ -1,331 +1,104 @@
-# arrow
+# arrow <img src="https://arrow.apache.org/img/arrow-logo_hex_black-txt_white-bg.png" align="right" alt="" width="120" />
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
 [![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
 
-**[Apache Arrow](https://arrow.apache.org/) is a cross-language
-development platform for in-memory data.** It specifies a standardized
+[Apache Arrow](https://arrow.apache.org/) is a cross-language
+development platform for in-memory data. It specifies a standardized
 language-independent columnar memory format for flat and hierarchical
 data, organized for efficient analytic operations on modern hardware. It
 also provides computational libraries and zero-copy streaming messaging
 and interprocess communication.
 
-**The `arrow` package exposes an interface to the Arrow C++ library,
-enabling access to many of its features in R.** It provides low-level
+The `arrow` R package exposes an interface to the Arrow C++ library,
+enabling access to many of its features in R. It provides low-level
 access to the Arrow C++ library API and higher-level access through a
 `{dplyr}` backend and familiar R functions.
 
 ## What can the `arrow` package do?
 
--   Read and write **Parquet files** (`read_parquet()`,
-    `write_parquet()`), an efficient and widely used columnar format
--   Read and write **Feather files** (`read_feather()`,
-    `write_feather()`), a format optimized for speed and
-    interoperability
--   Analyze, process, and write **multi-file, larger-than-memory
-    datasets** (`open_dataset()`, `write_dataset()`)
--   Read **large CSV and JSON files** with excellent **speed and
-    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
--   Write CSV files (`write_csv_arrow()`)
--   Manipulate and analyze Arrow data with **`dplyr` verbs**
--   Read and write files in **Amazon S3** and **Google Cloud Storage**
-    buckets with no additional function calls
--   Exercise **fine control over column types** for seamless
-    interoperability with databases and data warehouse systems
--   Use **compression codecs** including Snappy, gzip, Brotli,
-    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
--   Enable **zero-copy data sharing** between **R and Python**
--   Connect to **Arrow Flight** RPC servers to send and receive large
-    datasets over networks
--   Access and manipulate Arrow objects through **low-level bindings**
-    to the C++ library
--   Provide a **toolkit for building connectors** to other applications
-    and services that use Arrow
-
-## Installation
+The `arrow` package provides functionality for a wide range of data analysis
+tasks. It allows users to read and write data in a variety formats:
 
-### Installing the latest release version
-
-Install the latest release of `arrow` from CRAN with
-
-``` r
-install.packages("arrow")
-```
+-   Read and write Parquet files, an efficient and widely used columnar format
+-   Read and write Feather files, a format optimized for speed and
+    interoperability
+-   Read and write CSV files with excellent speed and efficiency
+-   Read and write multi-file larger-than-memory datasets

Review Comment:
   I wonder if `and` is appropriate here. One common use case I have seen is working with one, too-large-for-memory tabular file with Arrow, and you can use arrow::dataset() on multi-files even when all the data would fit into memory.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on pull request #14514: ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

thisisnic commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1324892920

   I've removed the WIP flag from the issue title to run the CI on it.
   
   I'm happy to give the contents a thumbs up and defer any more changes to follow-up tickets.
   
   Before I approve it, I'm going to pull it locally and build it there just to give it a final look over all together.
   
   @nealrichardson - did you want to give this a look over as well?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] stephhazlitt commented on pull request #14514: ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

stephhazlitt commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1325553691

   One thing I noticed, which might be worth a follow up ticket (or maybe I am being too picky) -- in the `Articles` drop down menu readers can no longer see that there are `Developer guides` articles. A user needs a second click through `More articles...` and then scroll a distance. I wonder if the drop down can be tailored a bit to show the categories at first glance?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djnavarro commented on pull request #14514: ARROW-17887: [R][Doc] Improve readability of the Get Started and README pages

Posted by GitBox <gi...@apache.org>.

djnavarro commented on PR #14514:
URL: https://github.com/apache/arrow/pull/14514#issuecomment-1325704109

   @stephhazlitt Yeah I noticed that issue with the developer vignettes earlier. It would be nice if we could have a menu item that linked to that subsection of the articles page. That way devs would be able to click through with about the same level of ease as they have under the current docs where dev vignettes are tucked into a submenu. I haven't quite worked out how to do that with the bootstrap5 pkgdown templates yet. Tempted to push that to a separate PR 😁   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org