You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/03 17:35:42 UTC
[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r681966895



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       Right, I get you, I think. I think that now I'm explicitly specifying the arguments that *can* be passed through rather than *can't*, I don't think I need to make any more changes here on account of the above comments?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org