Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/24 20:57:49 UTC

[GitHub] [arrow] nealrichardson opened a new pull request #8041: ARROW-8001: [R][Dataset] Bindings for dataset writing

nealrichardson opened a new pull request #8041:
URL: https://github.com/apache/arrow/pull/8041


   * r/R/dataset.R is broken out into smaller files. I did this in the first commit, isolated from the behavior changes, so if you view the diff excluding the first commit, it's easier to see what has changed
   * Normalize paths in `write_dataset()` as was done in `open_dataset()` in ARROW-9743
   * Add bindings to create `InMemoryDataset` and use them in `write_dataset()` so that you can write a `data.frame`, `RecordBatch`, or `Table`
   * Allow writing a subset of columns, using the selection from a previous `select()` call by default; renaming columns is not supported (see the sketch after this list)
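A rough sketch of the usage these changes enable (the output paths here are hypothetical, and `mtcars` is just a stand-in data.frame):

```r
library(arrow)
library(dplyr)

# An R data.frame is wrapped in an InMemoryDataset and written out
write_dataset(mtcars, "mtcars-feather", format = "feather")

# Only the selected columns are written; the subset is inferred from
# the preceding select() call (renaming columns is not supported)
mtcars %>%
  select(mpg, cyl, hp) %>%
  write_dataset("mtcars-subset", format = "feather")
```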


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8041: ARROW-8001: [R][Dataset] Bindings for dataset writing

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8041:
URL: https://github.com/apache/arrow/pull/8041#discussion_r476555171



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -281,3 +284,79 @@ this would mean you could point to an S3 bucket of Parquet data and a directory
 of CSVs on the local file system and query them together as a single dataset.
 To create a multi-source dataset, provide a list of datasets to `open_dataset()`
 instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, ds2)`.
+
+## Writing datasets
+
+As you can see, querying a large dataset can be quite fast, especially when it is stored in an efficient binary columnar format like Parquet or Feather and when it is partitioned into separate files based on the value of a column commonly used in filtering. However, we don't always get our data delivered to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data is cleaning it up and reshaping it into a more usable form.
+
+The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+
+Assume we have a version of the NYC Taxi data as CSV:
+
+```r
+ds <- open_dataset("nyc-taxi/csv/", format = "csv")
+```
+
+We can write it to a new location and translate the files to the Feather format
+by calling `write_dataset()` on it:
+
+```r
+write_dataset(ds, "nyc-taxi/feather", format = "feather")
+```
+
+Next, let's imagine that the "payment_type" column is something we often filter on,
+so we want to partition the data by that variable. By doing so, when we filter
+the resulting dataset on `payment_type == 3`, we would only have to look at the
+files that we know contain only rows where payment_type is 3.

Review comment:
       ```suggestion
   so we want to partition the data by that variable. By doing so we ensure that a filter like
   `payment_type == 3` will touch only a subset of files where payment_type is always 3.
   ```
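The hunk above is cut off before the corresponding code; a hedged sketch of the partitioned write being described (assuming `write_dataset()` takes a `partitioning` argument naming the column to partition by):

```r
# Rewrite the dataset partitioned by payment_type, so a filter like
# payment_type == 3 only needs to read files under that partition
write_dataset(ds, "nyc-taxi/feather", format = "feather",
              partitioning = "payment_type")
```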

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -281,3 +284,79 @@ this would mean you could point to an S3 bucket of Parquet data and a directory
 of CSVs on the local file system and query them together as a single dataset.
 To create a multi-source dataset, provide a list of datasets to `open_dataset()`
 instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, ds2)`.
+
+## Writing datasets
+
+As you can see, querying a large dataset can be quite fast, especially when it is stored in an efficient binary columnar format like Parquet or Feather and when it is partitioned into separate files based on the value of a column commonly used in filtering. However, we don't always get our data delivered to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data is cleaning it up and reshaping it into a more usable form.

Review comment:
       ```suggestion
   As you can see, querying a large dataset can be made quite fast by storage in an
   efficient binary columnar format like Parquet or Feather and partitioning based on
   columns commonly used for filtering. However, we don't always get our data delivered
   to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
   is cleaning it up and reshaping it into a more usable form.
   ```







[GitHub] [arrow] nealrichardson closed pull request #8041: ARROW-8001: [R][Dataset] Bindings for dataset writing

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #8041:
URL: https://github.com/apache/arrow/pull/8041


   





[GitHub] [arrow] github-actions[bot] commented on pull request #8041: ARROW-8001: [R][Dataset] Bindings for dataset writing

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8041:
URL: https://github.com/apache/arrow/pull/8041#issuecomment-679365736


   https://issues.apache.org/jira/browse/ARROW-8001

