Posted to commits@arrow.apache.org by th...@apache.org on 2021/09/09 16:38:14 UTC

[arrow-cookbook] branch main updated: ARROW-13709: Reading JSON in R recipe (#64)

This is an automated email from the ASF dual-hosted git repository.

thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git


The following commit(s) were added to refs/heads/main by this push:
     new 894ec77  ARROW-13709: Reading JSON in R recipe (#64)
894ec77 is described below

commit 894ec7789c4d850271ee80ba6f83514a56f8fcf9
Author: Nic <th...@gmail.com>
AuthorDate: Thu Sep 9 16:38:09 2021 +0000

    ARROW-13709: Reading JSON in R recipe (#64)
    
    * Ensure that test chunks are not rendered
    
    * Add code to delete any temporarily generated files, add recipe for reading JSON
    
    * Rephrase
---
 r/content/index.Rmd                    |   1 +
 r/content/reading_and_writing_data.Rmd | 157 +++++++++++++++++++++++----------
 2 files changed, 111 insertions(+), 47 deletions(-)

diff --git a/r/content/index.Rmd b/r/content/index.Rmd
index 39711b8..66faf58 100644
--- a/r/content/index.Rmd
+++ b/r/content/index.Rmd
@@ -10,6 +10,7 @@ library(testthat)
 library(dplyr)
 # Include test 
 knitr::opts_template$set(test = list(
+  include = FALSE,
   test = TRUE,
   eval = params$inline_test_output
 ))
diff --git a/r/content/reading_and_writing_data.Rmd b/r/content/reading_and_writing_data.Rmd
index 4efceae..13a5832 100644
--- a/r/content/reading_and_writing_data.Rmd
+++ b/r/content/reading_and_writing_data.Rmd
@@ -1,8 +1,8 @@
 # Reading and Writing Data
 
-This chapter contains recipes related to reading and writing data using Apache Arrow.  When reading data using Apache Arrow, there are 2 different ways you may choose to read in the data:
-1. a `tibble`
-2. an Arrow Table
+This chapter contains recipes related to reading and writing data using Apache 
+Arrow.  When reading files into R using Apache Arrow, you can choose to read in 
+your file as either a `tibble` or as an Arrow Table object.
 
 There are a number of circumstances in which you may want to read in the data as an Arrow Table:
 * your dataset is large and if you load it into memory, it may lead to performance issues
@@ -11,7 +11,9 @@ There are a number of circumstances in which you may want to read in the data as
 
 ## Converting from a tibble to an Arrow Table
 
-You can convert an existing `tibble` or `data.frame` into an Arrow Table.
+You want to convert an existing `tibble` or `data.frame` into an Arrow Table.
+
+### Solution
 
 ```{r, table_create_from_tibble}
 air_table <- Table$create(airquality)
@@ -25,7 +27,11 @@ test_that("table_create_from_tibble chunk works as expected", {
 
 ## Converting data from an Arrow Table to a tibble
 
-You may want to convert an Arrow Table to a tibble to view the data or work with it in your usual analytics pipeline.  You can use either `dplyr::collect()` or `as.data.frame()` to do this.
+You want to convert an Arrow Table to a tibble to view the data or work with it
+in your usual analytics pipeline.  You can use either `dplyr::collect()` or 
+`as.data.frame()` to do this.
+
+### Solution
 
 ```{r, collect_table}
 air_tibble <- dplyr::collect(air_table)
@@ -37,11 +43,12 @@ test_that("collect_table chunk works as expected", {
 })
 ```
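The prose above names `as.data.frame()` as an alternative to `dplyr::collect()`; a minimal sketch of that route, assuming the same `air_table` object from the earlier chunk, looks like this:

```r
# Minimal sketch: base-R style conversion of the same Arrow Table,
# equivalent in effect to dplyr::collect() above
air_df <- as.data.frame(air_table)
```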
 
-## Reading and Writing Parquet Files
+## Writing a Parquet file
+
+You want to write Parquet files to disk.
 
-### Writing a Parquet file
+### Solution
 
-You can write Parquet files to disk using `arrow::write_parquet()`.
 ```{r, write_parquet}
 # Create table
 my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
@@ -54,9 +61,11 @@ test_that("write_parquet chunk works as expected", {
 })
 ```
  
-### Reading a Parquet file
+## Reading a Parquet file
 
-Given a Parquet file, it can be read back in by using `arrow::read_parquet()`.
+You want to read a Parquet file.
+
+### Solution
 
 ```{r, read_parquet}
 parquet_tbl <- read_parquet("my_table.parquet")
@@ -78,6 +87,9 @@ test_that("read_parquet_2 works as expected", {
   expect_s3_class(parquet_tbl, "data.frame")
 })
 ```
+
+### Discussion
+
 If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.
 
 ```{r, read_parquet_table}
@@ -94,18 +106,25 @@ test_that("read_parquet_table_class works as expected", {
 })
 ```
 
-### How to read a Parquet file from S3 
+## Read a Parquet file from S3 
+
+You want to read a Parquet file from S3.
 
-You can open a Parquet file saved on S3 by calling `read_parquet()` and passing the relevant URI as the `file` argument.
+### Solution
 
 ```{r, read_parquet_s3, eval = FALSE}
 df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
 ```
+
+### See also
+
 For more in-depth instructions, including how to work with S3 buckets which require authentication, you can find a guide to reading and writing to/from S3 buckets here: https://arrow.apache.org/docs/r/articles/fs.html.
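For the authenticated case the linked guide is the reference; as a minimal sketch, the same public bucket can also be opened through an explicit filesystem object, where `anonymous = TRUE` is the piece you would swap for credential options on a private bucket:

```r
library(arrow)

# Minimal sketch: open the same public bucket via an explicit filesystem
# object. For a private bucket, replace anonymous = TRUE with the credential
# options described in the guide linked above.
bucket <- s3_bucket("ursa-labs-taxi-data", anonymous = TRUE)
df <- read_parquet(bucket$path("2019/06/data.parquet"))
```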
 
-### How to filter columns while reading a Parquet file 
+## Filter columns while reading a Parquet file 
+
+You want to specify which columns to include when reading in a Parquet file.
 
-When reading in a Parquet file, you can specify which columns to read in via the `col_select` argument.
+### Solution
 
 ```{r, read_parquet_filter}
 # Create table to read back in 
@@ -123,28 +142,11 @@ test_that("read_parquet_filter works as expected", {
 })
 ```
 
-## Reading and Writing Feather files 
+## Write an IPC/Feather V2 file
 
-### Write an IPC/Feather V2 file
+You want to write a Feather file in the IPC/Feather V2 format.
 
-The Arrow IPC file format is identical to the Feather version 2 format.  If you call `write_arrow()`, you will get a warning telling you to use `write_feather()` instead.
-
-```{r, write_arrow}
-# Create table
-my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
-write_arrow(my_table, "my_table.arrow")
-```
-```{r, test_write_arrow, opts.label = "test"}
-test_that("write_arrow chunk works as expected", {
-  expect_true(file.exists("my_table.arrow"))
-  expect_warning(
-    write_arrow(iris, "my_table.arrow"),
-    regexp = "Use 'write_ipc_stream' or 'write_feather' instead."
-  )
-})
-```
-
-Instead, you can use `write_feather()`.
+### Solution
 
 ```{r, write_feather}
 my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
@@ -155,7 +157,7 @@ test_that("write_feather chunk works as expected", {
   expect_true(file.exists("my_table.arrow"))
 })
 ```
-### Write a Feather (version 1) file
+### Discussion
 
 For legacy support, you can write data in the original Feather format by setting the `version` parameter to `1`.
 
@@ -169,11 +171,15 @@ write_feather(mtcars, "my_table.feather", version = 1)
 test_that("write_feather1 chunk works as expected", {
   expect_true(file.exists("my_table.feather"))
 })
+
+unlink("my_table.feather")
 ```
 
-### Read a Feather file
+## Read a Feather file
 
-You can read Feather files in via `read_feather()`.
+You want to read a Feather file.
+
+### Solution
 
 ```{r, read_feather}
 my_feather_tbl <- read_feather("my_table.arrow")
@@ -182,15 +188,23 @@ my_feather_tbl <- read_feather("my_table.arrow")
 test_that("read_feather chunk works as expected", {
   expect_identical(dplyr::collect(my_feather_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
 })
+unlink("my_table.arrow")
 ```
 
-## Reading and Writing Streaming IPC Files
+## Write Streaming IPC Files
+
+You want to write to the IPC stream format.
 
-You can write to the IPC stream format using `write_ipc_stream()`.
+### Solution
 
 ```{r, write_ipc_stream}
 # Create table
-my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+my_table <- Table$create(
+  tibble::tibble(
+    group = c("A", "B", "C"),
+    score = c(99, 97, 99)
+    )
+)
 # Write to IPC stream format
 write_ipc_stream(my_table, "my_table.arrows")
 ```
@@ -199,15 +213,23 @@ test_that("write_ipc_stream chunk works as expected", {
   expect_true(file.exists("my_table.arrows"))
 })
 ```
-You can read from IPC stream format using `read_ipc_stream()`.
 
+## Read Streaming IPC Files
+
+You want to read from the IPC stream format.
+
+### Solution
 ```{r, read_ipc_stream}
 my_ipc_stream <- arrow::read_ipc_stream("my_table.arrows")
 ```
 ```{r, test_read_ipc_stream, opts.label = "test"}
 test_that("read_ipc_stream chunk works as expected", {
-  expect_equal(my_ipc_stream, tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
+  expect_equal(
+    my_ipc_stream,
+    tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))
+  )
 })
+unlink("my_table.arrows")
 ```
 
 ## Reading and Writing CSV files 
@@ -233,13 +255,48 @@ my_csv <- read_csv_arrow("cars.csv", as_data_frame = FALSE)
 test_that("read_csv_arrow chunk works as expected", {
   expect_equivalent(dplyr::collect(my_csv), cars)
 })
+unlink("cars.csv")
+```
+
+## Read JSON files 
+
+You want to read a JSON file.
+
+### Solution
+
+```{r, read_json_arrow}
+# Create a file to read back in 
+tf <- tempfile()
+writeLines('
+    {"country": "United Kingdom", "code": "GB", "long": -3.44, "lat": 55.38}
+    {"country": "France", "code": "FR", "long": 2.21, "lat": 46.23}
+    {"country": "Germany", "code": "DE", "long": 10.45, "lat": 51.17}
+  ', tf, useBytes = TRUE)
+
+# Read in the data
+countries <- read_json_arrow(tf, col_select = c("country", "long", "lat"))
+countries
+```
+```{r, test_read_json_arrow, opts.label = "test"}
+test_that("read_json_arrow chunk works as expected", {
+  expect_equivalent(
+    countries,
+    tibble::tibble(
+      country = c("United Kingdom", "France", "Germany"),
+      long = c(-3.44, 2.21, 10.45),
+      lat = c(55.38, 46.23, 51.17)
+    )
+  )
+})
+unlink(tf)
 ```
 
-## Reading and Writing Partitioned Data 
 
-### Writing Partitioned Data
+## Write Partitioned Data
 
-You can use `write_dataset()` to save data to disk in partitions based on columns in the data.
+You want to save data to disk in partitions based on columns in the data.
+
+### Solution
 
 ```{r, write_dataset}
 write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
@@ -267,9 +324,11 @@ Each of these folders contains 1 or more Parquet files containing the relevant p
 list.files("airquality_partitioned/Month=5/Day=10")
 ```
 
-### Reading Partitioned Data
+## Reading Partitioned Data
+
+You want to read partitioned data.
 
-You can use `open_dataset()` to read partitioned data.
+### Solution
 
 ```{r, open_dataset}
 # Read data from directory
@@ -285,3 +344,7 @@ test_that("open_dataset chunk works as expected", {
 })
 ```
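A common next step with a dataset opened this way is to query it lazily with dplyr before materialising results; here is a minimal sketch, using an illustrative variable name since the opening chunk's assignment is not shown above:

```r
library(arrow)
library(dplyr)

# Minimal sketch: `air_data` is an illustrative name for the dataset opened
# from the partitioned directory; filter()/select() are evaluated lazily and
# only the matching partitions are read when collect() runs.
air_data <- open_dataset("airquality_partitioned")
may_temps <- air_data %>%
  filter(Month == 5) %>%
  select(Day, Temp) %>%
  collect()
```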
 
+```{r}
+unlink("airquality_partitioned", recursive = TRUE)
+```
+