Posted to issues@arrow.apache.org by "Mauricio 'Pachá' Vargas Sepúlveda (Jira)" <ji...@apache.org> on 2021/08/12 15:19:00 UTC

[jira] [Created] (ARROW-13616) [R] Cheat Sheet Structure

Mauricio 'Pachá' Vargas Sepúlveda created ARROW-13616:
---------------------------------------------------------

             Summary: [R] Cheat Sheet Structure
                 Key: ARROW-13616
                 URL: https://issues.apache.org/jira/browse/ARROW-13616
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 5.0.0
            Reporter: Mauricio 'Pachá' Vargas Sepúlveda


h1. Front page
h2. About

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

The arrow R package integrates with dplyr and lets you work with multiple storage formats, as well as with data in AWS S3 and similar cloud storage systems.
h2. Installation

Our goal is to make the package just work on Windows, Mac and Linux.

*On Windows and Mac:*

{{install.packages("arrow")}}

*On Linux:*

{{Sys.setenv(NOT_CRAN = TRUE)}}

{{install.packages("arrow")}}
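To check the version and which optional capabilities (compression codecs, S3 support, and so on) your build includes, recent arrow releases provide {{arrow_info()}}:

{{# print version and capabilities of the installed build}}
{{library(arrow)}}
{{arrow_info()}}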
To update the package, follow the same steps.
h2. Import
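The examples below assume that both packages are attached:

{{library(arrow)}}
{{library(dplyr)}}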

To read Parquet/Feather data from a directory, you can specify a partitioning for efficient filtering:

{{d <- open_dataset("nyc-taxi",}}
{{  partitioning = c("year", "month"))}}

For *single files* you can do either:
{{read_parquet("gapminder.parquet")}}
{{read_feather("gapminder.feather")}}
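Both return a tibble by default; to keep the data in Arrow memory instead, pass {{as_data_frame = FALSE}}:

{{# returns an Arrow Table rather than a tibble}}
{{read_parquet("gapminder.parquet",}}
{{  as_data_frame = FALSE)}}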

Arrow can also read large CSV and JSON files with excellent speed and efficiency: 
{{read_csv_arrow("gapminder.csv")}}
{{read_json_arrow("gapminder.json")}}
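For other delimiters there is {{read_delim_arrow()}}; a minimal sketch with a hypothetical tab-separated file:

{{# "gapminder.tsv" is a hypothetical file}}
{{read_delim_arrow("gapminder.tsv",}}
{{  delim = "\t")}}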

h2. dplyr compatibility

Combining Arrow with dplyr allows efficient reading, since dplyr filters "know" which files to read and which to skip based on the partitioning:

{{d %>%}}
{{  filter(year == 2009, month == 1) %>%}}
{{  collect() %>%}}
{{  group_by(year, month) %>%}}
{{  summarise(mean_amount = mean(total_amount))}}

{{collect()}} converts Arrow objects into regular tibbles, which lets you use the data with your existing visualisation and analysis workflows.
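For example, a minimal sketch assuming ggplot2 and the nyc-taxi columns used above:

{{library(ggplot2)}}
{{d %>%}}
{{  filter(year == 2009, month == 1) %>%}}
{{  collect() %>%}}
{{  ggplot(aes(total_amount)) +}}
{{  geom_histogram()}}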

Working with Arrow in R shares most of the characteristics of working with SQL databases in R through RPostgres and similar packages.

Hint: if an operation is not implemented in Arrow (yet), you can {{collect()}} first and then apply the operation to the resulting tibble. For example, {{mutate()}} is implemented, while {{summarise()}} and {{distinct()}} are planned for later releases.
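A minimal sketch of that pattern with {{distinct()}}:

{{# distinct() is not yet available on Arrow objects,}}
{{# so bring the filtered rows into R first}}
{{d %>%}}
{{  filter(year == 2009) %>%}}
{{  collect() %>%}}
{{  distinct(month)}}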
h2. Export

When saving a tibble to Parquet format, the default partitioning is based on any groups in the tibble. To save with partitioning:

{{d2 %>%}}
{{  write_dataset("nyc-summary",}}
{{    hive_style = FALSE)}}

This will create folders like 2015/01, 2015/02, and so on. Hint: experiment with setting {{hive_style = TRUE}}.
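If {{d2}} is not grouped, the partition columns can also be passed explicitly through the {{partitioning}} argument of {{write_dataset()}} (this assumes {{d2}} has year and month columns):

{{d2 %>%}}
{{  write_dataset("nyc-summary",}}
{{    partitioning = c("year", "month"),}}
{{    hive_style = FALSE)}}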

You can also save without partitioning:

{{write_parquet(d2, "d2.parquet")}}
{{write_feather(d2, "d2.feather")}}


h2. S3 support

You can read files from S3 filesystems without having to download them first:

{{d2 <- open_dataset(}}
{{  "s3://ursa-labs-taxi-data",}}
{{  partitioning = c("year", "month"))}}

You can also copy the data to your computer:

{{copy_files(}}
{{  "s3://ursa-labs-taxi-data",}}
{{  "~/nyc-taxi")}}
h1. Back page
h2. Generic S3 filesystems?
h2. Specific writing operations?
h2. More on dplyr compatibility?
h2. Mention something you would like to see here

--
This message was sent by Atlassian Jira
(v8.3.4#803005)