Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/21 11:32:10 UTC

[GitHub] [arrow] thisisnic opened a new pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette [WIP]

thisisnic opened a new pull request #10765:
URL: https://github.com/apache/arrow/pull/10765


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682603777



##########
File path: r/STYLE.md
##########
@@ -0,0 +1,38 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Style
+
+This is a style guide to writing documentation for arrow.
+
+## Coding style
+
+Please use the [tidyverse coding style](https://style.tidyverse.org/).
+
+## Referring to external packages
+
+When referring to external packages, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g.

Review comment:
       ```suggestion
   When referring to external packages in documentation, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g.
   ```

##########
File path: r/STYLE.md
##########
@@ -0,0 +1,38 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Style
+
+This is a style guide to writing documentation for arrow.
+
+## Coding style
+
+Please use the [tidyverse coding style](https://style.tidyverse.org/).
+
+## Referring to external packages
+
+When referring to external packages, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g.
+
+* "The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets.  This vignette introduces Datasets and shows how to use dplyr to analyze them."
+
+## Data frames
+
+When referring to the concept, use the phrase "data frame", whereas when referring to an object of that class or when the class is important, write `data.frame`, e.g.
+
+* "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatchs, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables."

Review comment:
       ```suggestion
   * "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatches, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables."
   ```
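
    A minimal sketch of the grouped-tibble behaviour that example sentence refers to (illustrative only, not part of the PR diff; the `mtcars_by_cyl` output path is made up, and it assumes `write_dataset()` picks up the grouping variables of a grouped tibble as the default partition keys):

    ```r
    library(arrow)
    library(dplyr)

    # Grouping before writing should make write_dataset() use the grouping
    # column (cyl) as the partition key, producing cyl=4/, cyl=6/ and cyl=8/
    # subdirectories under the (hypothetical) output path.
    mtcars %>%
      as_tibble() %>%
      group_by(cyl) %>%
      write_dataset("mtcars_by_cyl")
    ```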

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -159,37 +171,37 @@ See $metadata for additional Schema metadata
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
 in which the partition variable names are included in the path segments.
-If we had saved our files in paths like
+If you had saved your files in paths like:
 
 ```
 year=2009/month=01/data.parquet
 year=2009/month=02/data.parquet
 ...
 ```
 
-we would not have had to provide the names in `partitioning`:
-we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
+you would not have had to provide the names in `partitioning`;
+you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
 would have been detected automatically.
 
 ## Querying the dataset
 
-Up to this point, we haven't loaded any data: we have walked directories to find
-files, we've parsed file paths to identify partitions, and we've read the
-headers of the Parquet files to inspect their schemas so that we can make sure
-they all line up.
+Up to this point, you haven't loaded any data.  You've walked directories to find

Review comment:
       ```suggestion
   Up to this point, you haven't loaded any data. You've walked directories to find
   ```

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -159,37 +171,37 @@ See $metadata for additional Schema metadata
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
 in which the partition variable names are included in the path segments.
-If we had saved our files in paths like
+If you had saved your files in paths like:
 
 ```
 year=2009/month=01/data.parquet
 year=2009/month=02/data.parquet
 ...
 ```
 
-we would not have had to provide the names in `partitioning`:
-we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
+you would not have had to provide the names in `partitioning`;
+you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
 would have been detected automatically.
 
 ## Querying the dataset
 
-Up to this point, we haven't loaded any data: we have walked directories to find
-files, we've parsed file paths to identify partitions, and we've read the
-headers of the Parquet files to inspect their schemas so that we can make sure
-they all line up.
+Up to this point, you haven't loaded any data.  You've walked directories to find
+files, you've parsed file paths to identify partitions, and you've read the
+headers of the Parquet files to inspect their schemas so that you can make sure
+they all are as expected.
 
-In the current release, `arrow` supports the dplyr verbs `mutate()`, 
+In the current release, arrow supports the dplyr verbs `mutate()`, 
 `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and 
 `arrange()`. Aggregation is not yet supported, so before you call `summarise()`
 or other verbs with aggregate functions, use `collect()` to pull the selected
 subset of the data into an in-memory R data frame.
 
-If you attempt to call unsupported `dplyr` verbs or unimplemented functions in
-your query on an Arrow Dataset, the `arrow` package raises an error. However,
-for `dplyr` queries on `Table` objects (which are typically smaller in size) the
-package automatically calls `collect()` before processing that `dplyr` verb.
+Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
+in your query on an Arrow Dataset. In that case, the arrow package raises an error. However,
+for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the

Review comment:
       ```suggestion
   for dplyr queries on Arrow Table objects (which are already in memory), the
   ```
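
    As context for the quoted vignette text above, a short sketch of the collect-before-summarise pattern it describes (illustrative; it assumes the `ds` Dataset and the `tip_amount`/`total_amount` columns from the vignette, and an arrow version in which `summarise()` is not yet pushed down to Datasets):

    ```r
    ds %>%
      filter(year == 2015) %>%
      select(tip_amount, total_amount) %>%
      collect() %>%   # pull the filtered subset into an R data frame first
      summarise(mean_tip_pct = mean(tip_amount / total_amount, na.rm = TRUE))
    ```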

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -228,12 +240,11 @@ cat("
 ")
 ```
 
-We just selected a subset out of a dataset with around 2 billion rows, computed
-a new column, and aggregated on it in under 2 seconds on my laptop. How does
+You've just selected a subset out of a dataset with around 2 billion rows, computed
+a new column, and aggregated it in under 2 seconds on most modern laptops. How does

Review comment:
       ```suggestion
   a new column, and aggregated it in under 2 seconds on a modern laptop. How does
   ```

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -259,47 +270,58 @@ See $.data for the source Arrow object
 ")
 ```
 
-This returns instantly and shows the manipulations you've made, without
+This code returns an output instantly and shows the manipulations you've made, without
 loading data from the files. Because the evaluation of these queries is deferred,
 you can build up a query that selects down to a small subset without generating
 intermediate datasets that would potentially be large.
 
 Second, all work is pushed down to the individual data files,
 and depending on the file format, chunks of data within the files. As a result,
-we can select a subset of data from a much larger dataset by collecting the
-smaller slices from each file--we don't have to load the whole dataset in memory
-in order to slice from it.
+you can select a subset of data from a much larger dataset by collecting the
+smaller slices from each file - you don't have to load the whole dataset in 

Review comment:
       em-dash
   ```suggestion
   smaller slices from each file—you don't have to load the whole dataset in 
   ```

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -259,47 +270,58 @@ See $.data for the source Arrow object
 ")
 ```
 
-This returns instantly and shows the manipulations you've made, without
+This code returns an output instantly and shows the manipulations you've made, without
 loading data from the files. Because the evaluation of these queries is deferred,
 you can build up a query that selects down to a small subset without generating
 intermediate datasets that would potentially be large.
 
 Second, all work is pushed down to the individual data files,
 and depending on the file format, chunks of data within the files. As a result,
-we can select a subset of data from a much larger dataset by collecting the
-smaller slices from each file--we don't have to load the whole dataset in memory
-in order to slice from it.
+you can select a subset of data from a much larger dataset by collecting the
+smaller slices from each file - you don't have to load the whole dataset in 
+memory to slice from it.
 
-Third, because of partitioning, we can ignore some files entirely.
+Third, because of partitioning, you can ignore some files entirely.
 In this example, by filtering `year == 2015`, all files corresponding to other years
-are immediately excluded: we don't have to load them in order to find that no
+are immediately excluded: you don't have to load them in order to find that no
 rows match the filter. Relatedly, since Parquet files contain row groups with
-statistics on the data within, there may be entire chunks of data we can
+statistics on the data within, there may be entire chunks of data you can
 avoid scanning because they have no rows where `total_amount > 100`.
 
 ## More dataset options
 
 There are a few ways you can control the Dataset creation to adapt to special use cases.
-For one, if you are working with a single file or a set of files that are not
-all in the same directory, you can provide a file path or a vector of multiple
-file paths to `open_dataset()`. This is useful if, for example, you have a
-single CSV file that is too big to read into memory. You could pass the file
-path to `open_dataset()`, use `group_by()` to partition the Dataset into
-manageable chunks, then use `write_dataset()` to write each chunk to a separate
-Parquet file---all without needing to read the full CSV file into R.
-
-You can specify a `schema` argument to `open_dataset()` to declare the columns
-and their data types. This is useful if you have data files that have different
-storage schema (for example, a column could be `int32` in one and `int8` in another)
-and you want to ensure that the resulting Dataset has a specific type.
-To be clear, it's not necessary to specify a schema, even in this example of
-mixed integer types, because the Dataset constructor will reconcile differences like these.
-The schema specification just lets you declare what you want the result to be.
+
+### Work with files in a directory
+
+If you are working with a single file or a set of files that are not all in the 
+same directory, you can provide a file path or a vector of multiple file paths 
+to `open_dataset()`. This is useful if, for example, you have a single CSV file 
+that is too big to read into memory. You could pass the file path to 
+`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, 
+then use `write_dataset()` to write each chunk to a separate Parquet file - all 

Review comment:
       ```suggestion
   then use `write_dataset()` to write each chunk to a separate Parquet file—all 
   ```
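
    A rough sketch of the CSV-to-partitioned-Parquet workflow that paragraph describes (illustrative only; `big.csv`, `parquet_dir`, and the `year` column are hypothetical, and it assumes the grouping variable is used as the partition key):

    ```r
    library(arrow)
    library(dplyr)

    # Open the oversized CSV lazily as a Dataset, then write it back out as
    # Parquet files partitioned by year, without reading the whole CSV into R.
    open_dataset("big.csv", format = "csv") %>%
      group_by(year) %>%
      write_dataset("parquet_dir", format = "parquet")
    ```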

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       more em-dashes
   
   ```suggestion
   data object—an Arrow Table or RecordBatch, or an R data frame—and write
   ```







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679181551



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       you may want to intersect that with `names(formals(readr::read_delim))` since some of those are arrow function args
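
    One way to read that suggestion, sketched out (illustrative only; it compares the argument names of the two exported readers rather than arrow's internal parse-options helper):

    ```r
    # Argument names shared by readr::read_delim() and arrow::read_delim_arrow(),
    # i.e. the readr-style options that are also arrow function arguments.
    intersect(
      names(formals(readr::read_delim)),
      names(formals(arrow::read_delim_arrow))
    )
    ```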







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679805001



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       I'm confused by @nealrichardson's comment; can you please rephrase it?
   
   In the meantime, it looks like only 5 of the parsing options are supported and more are unsupported, so I feel like I'm better off explicitly listing the ones that are supported.







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679144335



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       I had a look in the docs and found this in `open_dataset()`: 
   > additional arguments passed to dataset_factory() when sources is a directory path/URI or vector of file paths/URIs
   
   Is this what you meant, as in it only works for certain values of `sources`, or something else?
   
   







[GitHub] [arrow] github-actions[bot] commented on pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette [WIP]

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#issuecomment-884119572


   https://issues.apache.org/jira/browse/ARROW-13399





[GitHub] [arrow] nealrichardson closed pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #10765:
URL: https://github.com/apache/arrow/pull/10765


   





[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679285857



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       I know 😞 I think I'd go with "data frame" so as to (in theory) avoid the "you said data.frame but it's a tibble" objection. That said, my reasoning for picking one or the other is more about what's in my head when I'm writing, and you're right: the distinction is not perfect in the real world, nor may the reader share my distinction.







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679808079



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       I'm leaning towards this interpretation too, so I have added guidance in `STYLE.md`.







[GitHub] [arrow] jonkeane commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r678290277



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -8,10 +8,10 @@ vignette: >
 ---
 
 Apache Arrow lets you work efficiently with large, multi-file datasets.
-The `arrow` R package provides a `dplyr` interface to Arrow Datasets,

Review comment:
       Something we've discussed elsewhere (I remember this came up when working on a blog post, for example) is that we should come up with a style for naming packages like this. I've come around to liking `{pkg}`, since it's now clear and unambiguous, at least in the R community, what that means. It looks like the tidyverse doesn't use anything (or adds " package" at the end most of the time, if we consider that a marker), so maybe we should follow that lead (like you did here). Or we could stick with backticks. 
   
   I don't have super strong opinions, but would be happy to pick one and go with it

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       I could be misremembering (or remembering the issue but not that it was resolved), but I thought there were _some_ options that weren't compatible with the dataset version of CSV reading. We don't have to list them here, but if that is true and those caveats are documented elsewhere (like the docs for `read_csv_arrow()` or `open_dataset()`), maybe a link to wherever they live would be nice.

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -20,34 +20,36 @@ and what is on the immediate development roadmap.
 The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
 is widely used in big data exercises and competitions.
 For demonstration purposes, we have hosted a Parquet-formatted version
-of about 10 years of the trip data in a public Amazon S3 bucket.
+of about ten years of the trip data in a public Amazon S3 bucket.
 
 The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so we can't just
+format. That's bigger than memory on most people's computers, so you can't just
 read it all in and stack it into a single data frame.
 
-In Windows and macOS binary packages, S3 support is included.
-On Linux when installing from source, S3 support is not enabled by default,
+In Windows (for R > 3.6) and macOS binary packages, S3 support is included.
+On Linux, when installing from source, S3 support is not enabled by default,
 and it has additional system requirements.
 See `vignette("install", package = "arrow")` for details.
-To see if your `arrow` installation has S3 support, run
+To see if your arrow installation has S3 support, run:
 
 ```{r}
 arrow::arrow_with_s3()
 ```
 
-Even with S3 support enabled network, speed will be a bottleneck unless your
+Even with S3 support enabled, network speed will be a bottleneck unless your
 machine is located in the same AWS region as the data. So, for this vignette,
 we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi"

Review comment:
       ```suggestion
   we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi"
   ```

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first
+path segment gives the value for `year`, and the second segment is `month`.
 Every row in `2009/01/data.parquet` has a value of 2009 for `year`
-and 1 for `month`, even though those columns may not actually be present in the file.
+and 1 for `month`, even though those columns may not be present in the file.
 
-Indeed, when we look at the dataset, we see that in addition to the columns present
+Indeed, when you look at the dataset, you can see that in addition to the columns present
 in every file, there are also columns `year` and `month`.

Review comment:
       This might be beating a dead horse, but maybe it would be good to repeat "even though they are not present in the files themselves" here too? 

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -159,37 +166,37 @@ See $metadata for additional Schema metadata
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
 in which the partition variable names are included in the path segments.
-If we had saved our files in paths like
+If you had saved your files in paths like:
 
 ```
 year=2009/month=01/data.parquet
 year=2009/month=02/data.parquet
 ...
 ```
 
-we would not have had to provide the names in `partitioning`:
-we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
+you would not have had to provide the names in `partitioning`;
+you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
 would have been detected automatically.
 
 ## Querying the dataset
 
-Up to this point, we haven't loaded any data: we have walked directories to find
-files, we've parsed file paths to identify partitions, and we've read the
-headers of the Parquet files to inspect their schemas so that we can make sure
-they all line up.
+Up to this point, you haven't loaded any data.  You've walked directories to find
+files, you've parsed file paths to identify partitions, and you've read the
+headers of the Parquet files to inspect their schemas so that you can make sure
+they all are as expected.
 
-In the current release, `arrow` supports the dplyr verbs `mutate()`, 
+In the current release, arrow supports the dplyr verbs `mutate()`, 
 `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and 
 `arrange()`. Aggregation is not yet supported, so before you call `summarise()`
 or other verbs with aggregate functions, use `collect()` to pull the selected
 subset of the data into an in-memory R data frame.
 
-If you attempt to call unsupported `dplyr` verbs or unimplemented functions in
-your query on an Arrow Dataset, the `arrow` package raises an error. However,
-for `dplyr` queries on `Table` objects (which are typically smaller in size) the
-package automatically calls `collect()` before processing that `dplyr` verb.
+Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
+in your query on an Arrow Dataset. In that case, the arrow package raises an error. However,
+for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the
+package automatically calls `collect()` before processing that dplyr verb.
 
-Here's an example. Suppose I was curious about tipping behavior among the
+Here's an example. Suppose that you are curious about tipping behavior among the

Review comment:
       ```suggestion
   Here's an example: Suppose that you are curious about tipping behavior among the
   ```
   
    Minor and stylistic; feel free to disregard.

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       ```suggestion
   data object - an Arrow Table or RecordBatch, or an R data.frame - and write
   ```
   
    This is another one we should probably standardize on. This isn't perfect, but as a metric: I see 47 results for `data.frame` in arrow/r/man but only 5 results for `data frame`.

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first

Review comment:
       I wonder if it might be clearer / easier to wade through if we use `c("year", "month")` instead of "a character vector"? That way the values are right here, and it will be obvious to many R users what that is, even if they don't have "character vector" in their vocabulary.  







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679253544



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       How about in this example here, @nealrichardson? I was thinking `data frame`, as we're referring to both `data.frame` and `tibble::tibble` objects, but then they both are, or inherit from, class `data.frame`, so I wasn't sure what makes the most sense.







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679145325



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first

Review comment:
       Done







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679074273



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first
+path segment gives the value for `year`, and the second segment is `month`.
 Every row in `2009/01/data.parquet` has a value of 2009 for `year`
-and 1 for `month`, even though those columns may not actually be present in the file.
+and 1 for `month`, even though those columns may not be present in the file.
 
-Indeed, when we look at the dataset, we see that in addition to the columns present
+Indeed, when you look at the dataset, you can see that in addition to the columns present
 in every file, there are also columns `year` and `month`.

Review comment:
       Not beating a dead horse at all; being explicit about these things makes it a lot easier to understand with less effort.







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682609463



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       more em-dashes
   
   ```suggestion
   data object—an Arrow Table or RecordBatch, or an R data frame—and write
   ```







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679823029



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -8,10 +8,10 @@ vignette: >
 ---
 
 Apache Arrow lets you work efficiently with large, multi-file datasets.
-The `arrow` R package provides a `dplyr` interface to Arrow Datasets,
-as well as other tools for interactive exploration of Arrow data.
+The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
+and other tools for interactive exploration of Arrow data.
 
-This vignette introduces Datasets and shows how to use `dplyr` to analyze them.
+This vignette introduces Datasets and shows how to use dplyr to analyze them.
 It describes both what is possible to do with Arrow now
 and what is on the immediate development roadmap.

Review comment:
       Does it? I think we should delete this sentence but it'd be good to hear other people's thoughts on this first.







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682605964



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -159,37 +171,37 @@ See $metadata for additional Schema metadata
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
 in which the partition variable names are included in the path segments.
-If we had saved our files in paths like
+If you had saved your files in paths like:
 
 ```
 year=2009/month=01/data.parquet
 year=2009/month=02/data.parquet
 ...
 ```
 
-we would not have had to provide the names in `partitioning`:
-we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
+you would not have had to provide the names in `partitioning`;
+you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
 would have been detected automatically.
 
 ## Querying the dataset
 
-Up to this point, we haven't loaded any data: we have walked directories to find
-files, we've parsed file paths to identify partitions, and we've read the
-headers of the Parquet files to inspect their schemas so that we can make sure
-they all line up.
+Up to this point, you haven't loaded any data.  You've walked directories to find

Review comment:
       ```suggestion
   Up to this point, you haven't loaded any data. You've walked directories to find
   ```







[GitHub] [arrow] jonkeane commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679178422



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       I was thinking about this bit: https://github.com/apache/arrow/blob/master/r/R/dataset-format.R#L135-L153. 
   
   Running the code in the middle of that, I get:
   
   ```
   >   # Catch any readr-style options specified with full option names that are
   >   # supported by read_delim_arrow() (and its wrappers) but are not yet
   >   # supported here
   >   unsup_readr_opts <- setdiff(
   +     names(formals(read_delim_arrow)),
   +     names(formals(readr_to_csv_parse_options))
   +   )
   > unsup_readr_opts
    [1] "file"              "schema"            "col_names"         "col_types"         "col_select"       
    [6] "na"                "quoted_na"         "skip"              "parse_options"     "convert_options"  
   [11] "read_options"      "as_data_frame"     "timestamp_parsers"
   ```
   
   







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r674293424



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -20,34 +20,36 @@ and what is on the immediate development roadmap.
 The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
 is widely used in big data exercises and competitions.
 For demonstration purposes, we have hosted a Parquet-formatted version
-of about 10 years of the trip data in a public Amazon S3 bucket.
+of about ten years of the trip data in a public Amazon S3 bucket.
 
 The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so we can't just
+format. That's bigger than memory on most people's computers, so you can't just
 read it all in and stack it into a single data frame.
 
-In Windows and macOS binary packages, S3 support is included.
-On Linux when installing from source, S3 support is not enabled by default,
+In Windows (for R > 3.6) and macOS binary packages, S3 support is included.
+On Linux, when installing from source, S3 support is not enabled by default,
 and it has additional system requirements.
 See `vignette("install", package = "arrow")` for details.
-To see if your `arrow` installation has S3 support, run
+To see if your __arrow__ installation has S3 support, run:
 
 ```{r}
 arrow::arrow_with_s3()
 ```
 
-Even with S3 support enabled network, speed will be a bottleneck unless your
+Even with an S3 support enabled network, speed will be a bottleneck unless your

Review comment:
       ```suggestion
   Even with S3 support enabled, network speed will be a bottleneck unless your
   ```
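
As a hedged aside, this is roughly what using the S3 support looks like in practice; the bucket URI below is an assumption (check the vignette text for the canonical location), and the copy is only attempted when `arrow_with_s3()` returns `TRUE`.

```r
library(arrow)

# Only attempt the copy if this build of arrow was compiled with S3 support.
if (arrow_with_s3()) {
  # Bucket URI is illustrative; copies one year of the taxi data locally.
  copy_files("s3://ursa-labs-taxi-data/2019", "nyc-taxi/2019")
}
```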







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682603777



##########
File path: r/STYLE.md
##########
@@ -0,0 +1,38 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Style
+
+This is a style guide to writing documentation for arrow.
+
+## Coding style
+
+Please use the [tidyverse coding style](https://style.tidyverse.org/).
+
+## Referring to external packages
+
+When referring to external packages, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g.

Review comment:
       ```suggestion
   When referring to external packages in documentation, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g.
   ```







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679073277



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       I have no recollection of that but I'll have a search on JIRA and see what I can find
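
For reference, a small sketch of the format aliases listed in the hunk above (directory path is hypothetical):

```r
library(arrow)

# "feather" and "ipc" are aliases for "arrow", so these calls are equivalent.
ds_a <- open_dataset("path/to/arrow-files", format = "arrow")
ds_b <- open_dataset("path/to/arrow-files", format = "feather")
```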







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679073927



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first

Review comment:
       Great suggestion, being explicit about this can only help understanding.
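
One way to be even more explicit than the character-vector form is to pass a schema for the partition fields; a hedged sketch, with the types an assumption:

```r
library(arrow)

# Equivalent to partitioning = c("year", "month"), but with the partition
# field types spelled out rather than inferred from the paths.
ds <- open_dataset(
  "nyc-taxi",
  partitioning = schema(year = int32(), month = int32())
)
```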







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679284777



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       This is an example of both that I found and was trying to emulate: https://r4ds.had.co.nz/tibbles.html
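
For reference, a minimal sketch of the grouped-data-frame behaviour mentioned in this thread (output directory is hypothetical): grouping variables become the partition keys by default when writing a data frame or tibble.

```r
library(arrow)
library(dplyr)

# mtcars is grouped by cyl, so write_dataset() partitions the output by cyl.
mtcars %>%
  group_by(cyl) %>%
  write_dataset("mtcars-by-cyl", format = "parquet")
```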







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679145745



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first
+path segment gives the value for `year`, and the second segment is `month`.
 Every row in `2009/01/data.parquet` has a value of 2009 for `year`
-and 1 for `month`, even though those columns may not actually be present in the file.
+and 1 for `month`, even though those columns may not be present in the file.
 
-Indeed, when we look at the dataset, we see that in addition to the columns present
+Indeed, when you look at the dataset, you can see that in addition to the columns present
 in every file, there are also columns `year` and `month`.

Review comment:
       Done
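
A quick, hedged check of the behaviour described in the hunk above, assuming `ds` was opened with `partitioning = c("year", "month")`:

```r
# The partition fields appear alongside the columns stored in the files.
names(ds)   # should include "year" and "month"
```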







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682604473



##########
File path: r/STYLE.md
##########
@@ -0,0 +1,38 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Style
+
+This is a style guide to writing documentation for arrow.
+
+## Coding style
+
+Please use the [tidyverse coding style](https://style.tidyverse.org/).
+
+## Referring to external packages
+
+When referring to external packages, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g.
+
+* "The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets.  This vignette introduces Datasets and shows how to use dplyr to analyze them."
+
+## Data frames
+
+When referring to the concept, use the phrase "data frame", whereas when referring to an object of that class or when the class is important, write `data.frame`, e.g.
+
+* "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatchs, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables."

Review comment:
       ```suggestion
   * "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatches, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables."
   ```







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679139203



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -8,10 +8,10 @@ vignette: >
 ---
 
 Apache Arrow lets you work efficiently with large, multi-file datasets.
-The `arrow` R package provides a `dplyr` interface to Arrow Datasets,

Review comment:
       Actually: just curly braces, or curly braces + backticks?







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679797010



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -8,10 +8,10 @@ vignette: >
 ---
 
 Apache Arrow lets you work efficiently with large, multi-file datasets.
-The `arrow` R package provides a `dplyr` interface to Arrow Datasets,

Review comment:
       Braces are more of a Twitter/blog post thing - the PR is now updated with `STYLE.md`, which contains some guidelines on this and suggests using the tidyverse style.







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679948944



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       Many of the args in `unsup_readr_opts` aren't actually from readr, they're just arguments to read_delim_arrow that aren't in readr_to_csv_parse_options. (In practice where this code is run, this doesn't matter because if you supplied `as_data_frame` as an argument, it will have matched that and not be in the `...` passed in here.) We only care about the ones that are readr options, so:
   
   ```
   > intersect(unsup_readr_opts, names(formals(readr::read_delim)))
   [1] "file"       "col_names"  "col_types"  "col_select" "na"        
   [6] "quoted_na"  "skip" 
   ```







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679147098



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -159,37 +166,37 @@ See $metadata for additional Schema metadata
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
 in which the partition variable names are included in the path segments.
-If we had saved our files in paths like
+If you had saved your files in paths like:
 
 ```
 year=2009/month=01/data.parquet
 year=2009/month=02/data.parquet
 ...
 ```
 
-we would not have had to provide the names in `partitioning`:
-we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
+you would not have had to provide the names in `partitioning`;
+you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
 would have been detected automatically.
 
 ## Querying the dataset
 
-Up to this point, we haven't loaded any data: we have walked directories to find
-files, we've parsed file paths to identify partitions, and we've read the
-headers of the Parquet files to inspect their schemas so that we can make sure
-they all line up.
+Up to this point, you haven't loaded any data.  You've walked directories to find
+files, you've parsed file paths to identify partitions, and you've read the
+headers of the Parquet files to inspect their schemas so that you can make sure
+they all are as expected.
 
-In the current release, `arrow` supports the dplyr verbs `mutate()`, 
+In the current release, arrow supports the dplyr verbs `mutate()`, 
 `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and 
 `arrange()`. Aggregation is not yet supported, so before you call `summarise()`
 or other verbs with aggregate functions, use `collect()` to pull the selected
 subset of the data into an in-memory R data frame.
 
-If you attempt to call unsupported `dplyr` verbs or unimplemented functions in
-your query on an Arrow Dataset, the `arrow` package raises an error. However,
-for `dplyr` queries on `Table` objects (which are typically smaller in size) the
-package automatically calls `collect()` before processing that `dplyr` verb.
+Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
+in your query on an Arrow Dataset. In that case, the arrow package raises an error. However,
+for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the
+package automatically calls `collect()` before processing that dplyr verb.
 
-Here's an example. Suppose I was curious about tipping behavior among the
+Here's an example. Suppose that you are curious about tipping behavior among the

Review comment:
       Done
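
For reference, a hedged sketch of the Hive-style layout discussed in the hunk above: `write_dataset()` writes `key=value` directories by default, so the result can be reopened without naming the partitions (output path is hypothetical).

```r
library(arrow)

# Rewrite the dataset with Hive-style (key=value) directory names.
write_dataset(ds, "nyc-taxi-hive", partitioning = c("year", "month"))

# The partition names and values are read back from the paths automatically.
ds2 <- open_dataset("nyc-taxi-hive")
```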







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679950661



##########
File path: r/STYLE.md
##########
@@ -0,0 +1,19 @@
+# Style

Review comment:
       Need to add the ASF license header here







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682606477



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -159,37 +171,37 @@ See $metadata for additional Schema metadata
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
 in which the partition variable names are included in the path segments.
-If we had saved our files in paths like
+If you had saved your files in paths like:
 
 ```
 year=2009/month=01/data.parquet
 year=2009/month=02/data.parquet
 ...
 ```
 
-we would not have had to provide the names in `partitioning`:
-we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
+you would not have had to provide the names in `partitioning`;
+you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
 would have been detected automatically.
 
 ## Querying the dataset
 
-Up to this point, we haven't loaded any data: we have walked directories to find
-files, we've parsed file paths to identify partitions, and we've read the
-headers of the Parquet files to inspect their schemas so that we can make sure
-they all line up.
+Up to this point, you haven't loaded any data.  You've walked directories to find
+files, you've parsed file paths to identify partitions, and you've read the
+headers of the Parquet files to inspect their schemas so that you can make sure
+they all are as expected.
 
-In the current release, `arrow` supports the dplyr verbs `mutate()`, 
+In the current release, arrow supports the dplyr verbs `mutate()`, 
 `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and 
 `arrange()`. Aggregation is not yet supported, so before you call `summarise()`
 or other verbs with aggregate functions, use `collect()` to pull the selected
 subset of the data into an in-memory R data frame.
 
-If you attempt to call unsupported `dplyr` verbs or unimplemented functions in
-your query on an Arrow Dataset, the `arrow` package raises an error. However,
-for `dplyr` queries on `Table` objects (which are typically smaller in size) the
-package automatically calls `collect()` before processing that `dplyr` verb.
+Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
+in your query on an Arrow Dataset. In that case, the arrow package raises an error. However,
+for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the

Review comment:
       ```suggestion
   for dplyr queries on Arrow Table objects (which are already in memory), the
   ```
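
A hedged sketch of the collect-before-summarise pattern described above; the column names follow the vignette's taxi example and are otherwise an assumption:

```r
library(arrow)
library(dplyr)

ds %>%
  filter(year == 2015, total_amount > 100) %>%
  select(tip_amount, total_amount) %>%
  mutate(tip_pct = 100 * tip_amount / total_amount) %>%
  collect() %>%                      # pull the reduced subset into R first
  summarise(median_tip_pct = median(tip_pct), n = n())
```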







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679253544



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       How about in this example here, @nealrichardson? I was thinking `data frame`, as we're referring to both `data.frame` and `tibble::tibble` objects, but then they both are, or inherit from, class `data.frame`, so I wasn't sure what makes most sense.







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682608693



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -259,47 +270,58 @@ See $.data for the source Arrow object
 ")
 ```
 
-This returns instantly and shows the manipulations you've made, without
+This code returns an output instantly and shows the manipulations you've made, without
 loading data from the files. Because the evaluation of these queries is deferred,
 you can build up a query that selects down to a small subset without generating
 intermediate datasets that would potentially be large.
 
 Second, all work is pushed down to the individual data files,
 and depending on the file format, chunks of data within the files. As a result,
-we can select a subset of data from a much larger dataset by collecting the
-smaller slices from each file--we don't have to load the whole dataset in memory
-in order to slice from it.
+you can select a subset of data from a much larger dataset by collecting the
+smaller slices from each file - you don't have to load the whole dataset in 
+memory to slice from it.
 
-Third, because of partitioning, we can ignore some files entirely.
+Third, because of partitioning, you can ignore some files entirely.
 In this example, by filtering `year == 2015`, all files corresponding to other years
-are immediately excluded: we don't have to load them in order to find that no
+are immediately excluded: you don't have to load them in order to find that no
 rows match the filter. Relatedly, since Parquet files contain row groups with
-statistics on the data within, there may be entire chunks of data we can
+statistics on the data within, there may be entire chunks of data you can
 avoid scanning because they have no rows where `total_amount > 100`.
 
 ## More dataset options
 
 There are a few ways you can control the Dataset creation to adapt to special use cases.
-For one, if you are working with a single file or a set of files that are not
-all in the same directory, you can provide a file path or a vector of multiple
-file paths to `open_dataset()`. This is useful if, for example, you have a
-single CSV file that is too big to read into memory. You could pass the file
-path to `open_dataset()`, use `group_by()` to partition the Dataset into
-manageable chunks, then use `write_dataset()` to write each chunk to a separate
-Parquet file---all without needing to read the full CSV file into R.
-
-You can specify a `schema` argument to `open_dataset()` to declare the columns
-and their data types. This is useful if you have data files that have different
-storage schema (for example, a column could be `int32` in one and `int8` in another)
-and you want to ensure that the resulting Dataset has a specific type.
-To be clear, it's not necessary to specify a schema, even in this example of
-mixed integer types, because the Dataset constructor will reconcile differences like these.
-The schema specification just lets you declare what you want the result to be.
+
+### Work with files in a directory
+
+If you are working with a single file or a set of files that are not all in the 
+same directory, you can provide a file path or a vector of multiple file paths 
+to `open_dataset()`. This is useful if, for example, you have a single CSV file 
+that is too big to read into memory. You could pass the file path to 
+`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, 
+then use `write_dataset()` to write each chunk to a separate Parquet file - all 

Review comment:
       ```suggestion
   then use `write_dataset()` to write each chunk to a separate Parquet file—all 
   ```
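
For reference, a minimal sketch of the giant-CSV workflow described in the hunk above; the file path and grouping column are hypothetical.

```r
library(arrow)
library(dplyr)

# Point open_dataset() at the single large CSV rather than reading it into R.
big_csv <- open_dataset("data/huge-file.csv", format = "csv")

# Grouping determines the partitions; each chunk is written as Parquet.
big_csv %>%
  group_by(year) %>%
  write_dataset("data/partitioned", format = "parquet")
```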







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682607073



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -228,12 +240,11 @@ cat("
 ")
 ```
 
-We just selected a subset out of a dataset with around 2 billion rows, computed
-a new column, and aggregated on it in under 2 seconds on my laptop. How does
+You've just selected a subset out of a dataset with around 2 billion rows, computed
+a new column, and aggregated it in under 2 seconds on most modern laptops. How does

Review comment:
       ```suggestion
   a new column, and aggregated it in under 2 seconds on a modern laptop. How does
   ```







[GitHub] [arrow] thisisnic commented on pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#issuecomment-886668591


   I've also removed a lot of the backticks around "arrow" and "dplyr" as I think they make the text harder to scan.  I took a look at the tidyverse packages and the convention in their vignettes seems to be to link to any external packages the first time they're mentioned, and then subsequently just treat the package names as if they are words.





[GitHub] [arrow] nealrichardson closed pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #10765:
URL: https://github.com/apache/arrow/pull/10765


   





[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679950373



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -8,10 +8,10 @@ vignette: >
 ---
 
 Apache Arrow lets you work efficiently with large, multi-file datasets.
-The `arrow` R package provides a `dplyr` interface to Arrow Datasets,
-as well as other tools for interactive exploration of Arrow data.
+The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
+and other tools for interactive exploration of Arrow data.
 
-This vignette introduces Datasets and shows how to use `dplyr` to analyze them.
+This vignette introduces Datasets and shows how to use dplyr to analyze them.
 It describes both what is possible to do with Arrow now
 and what is on the immediate development roadmap.

Review comment:
       At one point this was true (the discussion at the end talked about what's not yet implemented but coming) but perhaps that's no longer accurate.







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r681966895



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       Right, I get you, I think. Now that I'm explicitly specifying the arguments that *can* be passed through rather than those that *can't*, I don't think I need to make any more changes here on account of the above comments?







[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679072617



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -8,10 +8,10 @@ vignette: >
 ---
 
 Apache Arrow lets you work efficiently with large, multi-file datasets.
-The `arrow` R package provides a `dplyr` interface to Arrow Datasets,

Review comment:
       I think your suggestion of backticks + curly brackets around package names could be a good shout, as it makes the text that tiny bit more scannable than backticks alone and would help us distinguish between Arrow (the project and/or C++ implementation) and arrow (the R package).







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679173516



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       Here's how I think about it when writing: "`data.frame`" is a proper noun, meaning literally that class of R object, while "data frame" is the more conceptual noun. In some cases, one or the other is obviously correct; in others, either might work.







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682608301



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -259,47 +270,58 @@ See $.data for the source Arrow object
 ")
 ```
 
-This returns instantly and shows the manipulations you've made, without
+This code returns an output instantly and shows the manipulations you've made, without
 loading data from the files. Because the evaluation of these queries is deferred,
 you can build up a query that selects down to a small subset without generating
 intermediate datasets that would potentially be large.
 
 Second, all work is pushed down to the individual data files,
 and depending on the file format, chunks of data within the files. As a result,
-we can select a subset of data from a much larger dataset by collecting the
-smaller slices from each file--we don't have to load the whole dataset in memory
-in order to slice from it.
+you can select a subset of data from a much larger dataset by collecting the
+smaller slices from each file - you don't have to load the whole dataset in 

Review comment:
       em-dash
   ```suggestion
   smaller slices from each file—you don't have to load the whole dataset in 
   ```
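
A hedged illustration of the deferred evaluation described above (filter values follow the surrounding discussion):

```r
library(arrow)
library(dplyr)

# Building the query is instantaneous; no files are read at this point.
q <- ds %>%
  filter(year == 2015, total_amount > 100) %>%
  select(year, month, total_amount)

q             # prints a description of the pending query, not the data
# collect(q)  # would scan only the files/row groups that can match the filter
```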







[GitHub] [arrow] nealrichardson commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679948944



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       Many of the args in `unsup_readr_opts` aren't actually from readr, they're just arguments to read_delim_arrow that aren't in readr_to_csv_parse_options. We only care about the ones that are readr options, so:
   
   ```
   > intersect(unsup_readr_opts, names(formals(readr::read_delim)))
   [1] "file"       "col_names"  "col_types"  "col_select" "na"        
   [6] "quoted_na"  "skip" 
   ```



