Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/12 20:06:28 UTC

[GitHub] [arrow] thisisnic commented on a change in pull request #9748: ARROW-11729: [R] Add examples to datasets documentation

thisisnic commented on a change in pull request #9748:
URL: https://github.com/apache/arrow/pull/9748#discussion_r631349759



##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,49 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We start by creating temporary directories.
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' 
+#' # We can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.

Review comment:
       Really minor point, but I feel like this could be clearer if we run through the one_part_dir example in its entirety (including list.files) first, and only then move on to the two_part_dir example. That way the reader has fewer lines of code to get through before they have successfully run something and can see what's going on.
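
       For illustration, one possible reading of this suggestion is sketched below as plain R rather than roxygen lines (in dataset-write.R each line would carry the `#'` prefix). It reuses only the calls already in the diff and assumes the arrow package is installed and attached; the exact comment wording is hypothetical.

```r
library(arrow)  # assumption: arrow is installed; the roxygen example relies on the package itself

# Start with a single temporary directory and finish that example first.
one_part_dir <- tempfile()

# Partition by the values in one column (here: "cyl").
# This creates a structure of the form cyl=X/part-Z.parquet.
write_dataset(mtcars, one_part_dir, partitioning = "cyl")

# Check what was just saved before introducing anything new.
list.files(one_part_dir, recursive = TRUE)

# Only then move on to partitioning by multiple columns.
# This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
two_part_dir <- tempfile()
write_dataset(mtcars, two_part_dir, partitioning = c("cyl", "gear"))
list.files(two_part_dir, recursive = TRUE)
```

       With this ordering the reader sees the output of `list.files()` after four lines of code instead of after both writes.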

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,49 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We start by creating temporary directories.
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' 
+#' # We can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.
+#' write_dataset(mtcars, one_part_dir, partitioning = "cyl")
+#'
+#' # We can also partition by the values in multiple columns.
+#' # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
+#' write_dataset(mtcars, two_part_dir, partitioning = c("cyl", "gear"))
+#'
+#' # In the two previous examples we would have:
+#' # X = \{4,6,8\}, the number of cylinders.
+#' # Y = \{3,4,5\}, the number of forward gears.
+#' # Z = \{0,1,2\}, the number of saved parts, starting from 0.
+#' 
+#' # And we can check what we just saved.
+#' list.files(one_part_dir, recursive = TRUE)
+#' list.files(two_part_dir, recursive = TRUE)
+#'
+#' # We can do the same as the previous call with two variables combining both
+#' # arrow and dplyr, so the example is just a repetition with different steps.
+#' # We shall do it exactly as above and then with a slight change to the
+#' # output.
+#'
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  d <- mtcars %>% group_by(cyl, gear)
+#'
+#'  # Write a structure cyl=X/gear=Y/part-Z.parquet.
+#'  two_part_dir_2 <- tempfile()
+#'  d %>% write_dataset(two_part_dir_2)
+#'  list.files(two_part_dir_2, recursive = TRUE)
+#'
+#'  # We can also turn off the Hive-style directory naming where the column name
+#'  # is included with the value for each directory with `hive_style = FALSE`.
+#'
+#'  # Write a structure X/Y/part-Z.parquet.
+#'  two_part_dir_3 <- tempfile()
+#'  d %>% write_dataset(two_part_dir_3, hive_style = FALSE)
+#'  list.files(two_part_dir_3, recursive = TRUE)
+#' }
 #' @export
 write_dataset <- function(dataset,
                           path,

Review comment:
       The only other general comment I have to make is that I think this is some really good documentation as it covers a lot of ground and explains things in detail.

Review comment:
       Another general suggestion: could we consider using second-person ("you") pronouns instead of first-person plural ("we") pronouns? It can help give documentation a conversational tone, and I've seen it used as a convention in lots of style guides.
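
       As a rough sketch of what that change could look like, here is the dplyr part of the example with its comments reworded in the second person. The wording is hypothetical, and `library(dplyr)` is an assumption added here so that `%>%` and `group_by()` resolve when the snippet is run on its own; the calls themselves are taken from the diff.

```r
library(arrow)
library(dplyr)  # assumption: attached here so %>% and group_by() are available

# You can also let dplyr grouping define the partitions.
d <- mtcars %>% group_by(cyl, gear)

# This writes a structure of the form cyl=X/gear=Y/part-Z.parquet.
two_part_dir_2 <- tempfile()
d %>% write_dataset(two_part_dir_2)
list.files(two_part_dir_2, recursive = TRUE)

# You can turn off the Hive-style directory naming, where the column name
# is included with the value for each directory, with `hive_style = FALSE`.
# This writes a structure of the form X/Y/part-Z.parquet.
two_part_dir_3 <- tempfile()
d %>% write_dataset(two_part_dir_3, hive_style = FALSE)
list.files(two_part_dir_3, recursive = TRUE)
```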

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,49 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We start by creating temporary directories.
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' 
+#' # We can write datasets partitioned by the values in a column (here: "cyl").
+#' # This creates a structure of the form cyl=X/part-Z.parquet.
+#' write_dataset(mtcars, one_part_dir, partitioning = "cyl")
+#'
+#' # We can also partition by the values in multiple columns.
+#' # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
+#' write_dataset(mtcars, two_part_dir, partitioning = c("cyl", "gear"))
+#'
+#' # In the two previous examples we would have:
+#' # X = \{4,6,8\}, the number of cylinders.
+#' # Y = \{3,4,5\}, the number of forward gears.
+#' # Z = \{0,1,2\}, the number of saved parts, starting from 0.
+#' 
+#' # And we can check what we just saved.
+#' list.files(one_part_dir, recursive = TRUE)

Review comment:
       I think this is a great way to help demonstrate what we've just done.



