Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/28 15:47:43 UTC

[GitHub] [arrow] jonkeane commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

jonkeane commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r678290277



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -8,10 +8,10 @@ vignette: >
 ---
 
 Apache Arrow lets you work efficiently with large, multi-file datasets.
-The `arrow` R package provides a `dplyr` interface to Arrow Datasets,

Review comment:
      Something we've discussed elsewhere (I remember this came up when working on a blog post, for example) is that we should settle on a style for naming packages like this. I've come around to liking `{pkg}`, since it's clear and, at least in the R community, unambiguous. It looks like the tidyverse doesn't use any markup (or appends " package" most of the time, if we count that as a marker), so maybe we should follow that lead (like you did here). Or we could stick with backticks.
  
  I don't have super strong opinions, but I'd be happy to pick one and go with it.

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
      I could be misremembering (or remembering the issue but not that it was resolved), but I thought there were _some_ options that weren't compatible with the dataset version of CSV reading. We don't have to list them here, but if that is true and those options are documented elsewhere (like the docs for `read_csv_arrow()` or `open_dataset()`), a link to wherever those caveats are documented would be nice.
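
For reference, a hedged sketch of the pass-through behavior being discussed (the "nyc-taxi-csv" directory and the pipe delimiter are made up for illustration; any options that turn out not to be supported for datasets would presumably error here):

```r
# Hedged sketch: readr-style parse options such as `delim` and `quote`
# are forwarded by open_dataset() to the text/CSV file format, much as
# they would be passed to read_csv_arrow().
library(arrow)

ds_csv <- open_dataset(
  "nyc-taxi-csv",   # hypothetical directory of delimited text files
  format = "text",
  delim = "|",      # pipe-delimited, for illustration
  quote = "\""
)
```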

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -20,34 +20,36 @@ and what is on the immediate development roadmap.
 The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
 is widely used in big data exercises and competitions.
 For demonstration purposes, we have hosted a Parquet-formatted version
-of about 10 years of the trip data in a public Amazon S3 bucket.
+of about ten years of the trip data in a public Amazon S3 bucket.
 
 The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so we can't just
+format. That's bigger than memory on most people's computers, so you can't just
 read it all in and stack it into a single data frame.
 
-In Windows and macOS binary packages, S3 support is included.
-On Linux when installing from source, S3 support is not enabled by default,
+In Windows (for R > 3.6) and macOS binary packages, S3 support is included.
+On Linux, when installing from source, S3 support is not enabled by default,
 and it has additional system requirements.
 See `vignette("install", package = "arrow")` for details.
-To see if your `arrow` installation has S3 support, run
+To see if your arrow installation has S3 support, run:
 
 ```{r}
 arrow::arrow_with_s3()
 ```
 
-Even with S3 support enabled network, speed will be a bottleneck unless your
+Even with S3 support enabled, network speed will be a bottleneck unless your
 machine is located in the same AWS region as the data. So, for this vignette,
 we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi"

Review comment:
       ```suggestion
   we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi"
   ```

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first
+path segment gives the value for `year`, and the second segment is `month`.
 Every row in `2009/01/data.parquet` has a value of 2009 for `year`
-and 1 for `month`, even though those columns may not actually be present in the file.
+and 1 for `month`, even though those columns may not be present in the file.
 
-Indeed, when we look at the dataset, we see that in addition to the columns present
+Indeed, when you look at the dataset, you can see that in addition to the columns present
 in every file, there are also columns `year` and `month`.

Review comment:
       This might be beating a dead horse, but maybe it would be good to repeat "even though they are not present in the files themselves" here too? 
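
A small sketch of what that could look like in the vignette (output elided; the point is just that the schema lists the partition columns even though they come from the file paths, not the files):

```r
# Sketch: the printed schema includes `year` and `month` alongside the
# columns read from the Parquet files, even though those two columns are
# derived from the directory names rather than stored in the files.
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
ds$schema
```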

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -159,37 +166,37 @@ See $metadata for additional Schema metadata
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
 in which the partition variable names are included in the path segments.
-If we had saved our files in paths like
+If you had saved your files in paths like:
 
 ```
 year=2009/month=01/data.parquet
 year=2009/month=02/data.parquet
 ...
 ```
 
-we would not have had to provide the names in `partitioning`:
-we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
+you would not have had to provide the names in `partitioning`;
+you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
 would have been detected automatically.
 
 ## Querying the dataset
 
-Up to this point, we haven't loaded any data: we have walked directories to find
-files, we've parsed file paths to identify partitions, and we've read the
-headers of the Parquet files to inspect their schemas so that we can make sure
-they all line up.
+Up to this point, you haven't loaded any data.  You've walked directories to find
+files, you've parsed file paths to identify partitions, and you've read the
+headers of the Parquet files to inspect their schemas so that you can make sure
+they all are as expected.
 
-In the current release, `arrow` supports the dplyr verbs `mutate()`, 
+In the current release, arrow supports the dplyr verbs `mutate()`, 
 `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and 
 `arrange()`. Aggregation is not yet supported, so before you call `summarise()`
 or other verbs with aggregate functions, use `collect()` to pull the selected
 subset of the data into an in-memory R data frame.
 
-If you attempt to call unsupported `dplyr` verbs or unimplemented functions in
-your query on an Arrow Dataset, the `arrow` package raises an error. However,
-for `dplyr` queries on `Table` objects (which are typically smaller in size) the
-package automatically calls `collect()` before processing that `dplyr` verb.
+Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
+in your query on an Arrow Dataset. In that case, the arrow package raises an error. However,
+for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the
+package automatically calls `collect()` before processing that dplyr verb.
 
-Here's an example. Suppose I was curious about tipping behavior among the
+Here's an example. Suppose that you are curious about tipping behavior among the

Review comment:
       ```suggestion
   Here's an example: Suppose that you are curious about tipping behavior among the
   ```
   
  Minor and stylistic; feel free to disregard.

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning is up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       ```suggestion
   data object - an Arrow Table or RecordBatch, or an R data.frame - and write
   ```
   
  This is another one we should probably standardize on. It isn't a perfect metric, but I see 47 results for `data.frame` in arrow/r/man and only 5 for `data frame`.
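
For what it's worth, a minimal sketch of the data.frame path through `write_dataset()` (using `mtcars` and a throwaway output directory purely for illustration):

```r
# Minimal sketch: write_dataset() accepts an R data.frame (as well as an
# Arrow Table, RecordBatch, or Dataset) and writes it out as a
# partitioned dataset on disk.
library(arrow)

write_dataset(mtcars, "mtcars-parquet", partitioning = "cyl")
list.files("mtcars-parquet", recursive = TRUE)
```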

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first

Review comment:
      I wonder if it might be clearer / easier to wade through if we use `c("year", "month")` instead of "a character vector"? That way the values are right there, and it will be obvious to many R users what that is, even if they don't have "character vector" in their vocabulary.
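
Something like the following, i.e. showing the literal call rather than describing it (this mirrors the call already used earlier in the vignette):

```r
# The first path segment supplies `year`, the second supplies `month`.
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
```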




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org