You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2021/11/03 19:09:04 UTC

[GitHub] [arrow-site] nealrichardson commented on a change in pull request #158: R blog post

nealrichardson commented on a change in pull request #158:
URL: https://github.com/apache/arrow-site/pull/158#discussion_r742248863



##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,140 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a whole dataset, but 6.0.0 now allows you to aggregate across groups with `group_by()`. These are usable both with in-memory Arrow tables as well as across partitioned datasets. As usual, Arrow will read and process data in chunks and in parallel when possible to produce results much faster than one could by loading it all into memory then processing and even better, allows for operations that wouldn’t fit into memory on a single machine.

Review comment:
       I don't think 5.0.0 supported summarize at all, did it?

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,140 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+

Review comment:
       We need an intro paragraph

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,140 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a whole dataset, but 6.0.0 now allows you to aggregate across groups with `group_by()`. These are usable both with in-memory Arrow tables as well as across partitioned datasets. As usual, Arrow will read and process data in chunks and in parallel when possible to produce results much faster than one could by loading it all into memory then processing and even better, allows for operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this functionality - for the next release we’ll be looking to profile and optimize to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),` `max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability are also supported and currently return approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+    ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  left_join(
+    arrow_table(nycflights13::airlines)
+   ) %>%
+  collect()
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to the way that dplyr pipelines are constructed and executed in R. There are almost no changes (besides new capabilities) that one would run into, but the improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead you must call `collect()` or `compute()` to evaluate the query. This is inline with how datasets worked before as well as how a number of other dplyr backends work.
+the order of dataset queries is no longer deterministic. If you need a stable sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand off an Arrow Dataset or query object to duckdb for further querying using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and want to avoid the worst-of-the-worst delays. To do this, we can use `percent_rank()`; however that requires a window function which isn’t yet available in Arrow, so let’s try sending the data to DuckDB to do that, then pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+ 
+# use duckdb to do a mutate after a group_by

Review comment:
       ```suggestion
   ```

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,140 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a whole dataset, but 6.0.0 now allows you to aggregate across groups with `group_by()`. These are usable both with in-memory Arrow tables as well as across partitioned datasets. As usual, Arrow will read and process data in chunks and in parallel when possible to produce results much faster than one could by loading it all into memory then processing and even better, allows for operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this functionality - for the next release we’ll be looking to profile and optimize to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),` `max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability are also supported and currently return approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+    ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  left_join(
+    arrow_table(nycflights13::airlines)
+   ) %>%
+  collect()
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to the way that dplyr pipelines are constructed and executed in R. There are almost no changes (besides new capabilities) that one would run into, but the improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead you must call `collect()` or `compute()` to evaluate the query. This is inline with how datasets worked before as well as how a number of other dplyr backends work.
+the order of dataset queries is no longer deterministic. If you need a stable sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand off an Arrow Dataset or query object to duckdb for further querying using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB data (or the result of a query) to arrow with `to_arrow()`.

Review comment:
       Doesn't duckdb also push down projection to arrow? So duckdb queries on arrow datasets would still be able to exploit partitions etc.

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,140 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a whole dataset, but 6.0.0 now allows you to aggregate across groups with `group_by()`. These are usable both with in-memory Arrow tables as well as across partitioned datasets. As usual, Arrow will read and process data in chunks and in parallel when possible to produce results much faster than one could by loading it all into memory then processing and even better, allows for operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this functionality - for the next release we’ll be looking to profile and optimize to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),` `max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability are also supported and currently return approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+    ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  left_join(
+    arrow_table(nycflights13::airlines)
+   ) %>%
+  collect()
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to the way that dplyr pipelines are constructed and executed in R. There are almost no changes (besides new capabilities) that one would run into, but the improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead you must call `collect()` or `compute()` to evaluate the query. This is inline with how datasets worked before as well as how a number of other dplyr backends work.
+the order of dataset queries is no longer deterministic. If you need a stable sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand off an Arrow Dataset or query object to duckdb for further querying using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and want to avoid the worst-of-the-worst delays. To do this, we can use `percent_rank()`; however that requires a window function which isn’t yet available in Arrow, so let’s try sending the data to DuckDB to do that, then pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+ 
+# use duckdb to do a mutate after a group_by
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%

Review comment:
       Why not just make this one long dplyr pipeline?

##########
File path: _posts/2021-11-01-r-6.0.0.md
##########
@@ -0,0 +1,140 @@
+---
+layout: post
+title: Apache Arrow R 6.0.0 Release
+date: "2021-11-01"
+author: 
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Grouped aggregation
+
+Aggregations can now be made across groups using dplyr’s `group_by() %>% summarise()` syntax. Arrow 5.0.0 allowed `summarise()` to aggregate across a whole dataset, but 6.0.0 now allows you to aggregate across groups with `group_by()`. These are usable both with in-memory Arrow tables as well as across partitioned datasets. As usual, Arrow will read and process data in chunks and in parallel when possible to produce results much faster than one could by loading it all into memory then processing and even better, allows for operations that wouldn’t fit into memory on a single machine.
+
+The focus this release has been on the initial implementation of this functionality - for the next release we’ll be looking to profile and optimize to enhance performance.
+
+Supported aggregation functions include `n()`, `n_distinct()`, `min(),` `max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability are also supported and currently return approximate results using the t-digest algorithm.
+
+# Joins 
+
+Multiple Arrow tables and datasets can now be joined in queries. 
+
+Say I want to get a table of all the flights from JFK to Las Vegas Airport on 
+9th October 2013, with the full name of the airline included.
+
+```r
+arrow_table(nycflights13::flights) %>%
+  filter(
+    year == 2013,
+    month == 10,
+    day == 9,
+    origin == "JFK",
+    dest == "LAS"
+    ) %>%
+  select(dep_time, arr_time, carrier) %>%
+  left_join(
+    arrow_table(nycflights13::airlines)
+   ) %>%
+  collect()
+```
+
+# Big changes to the execution of queries under-the-hood 
+
+Both of the first two points are driven by a large under-the-hood change to the way that dplyr pipelines are constructed and executed in R. There are almost no changes (besides new capabilities) that one would run into, but the improvement unlocked grouped aggregation, joins, and much more to come!
+
+A few small but notable changes came with this:
+`summarise()` when used with Arrow tables does not eagerly evaluate, instead you must call `collect()` or `compute()` to evaluate the query. This is inline with how datasets worked before as well as how a number of other dplyr backends work.
+the order of dataset queries is no longer deterministic. If you need a stable sort order, you should use `arrange()` at the end of your query.
+
+# Integration with DuckDB
+
+If you have the [duckdb](https://duckdb.org/) package installed, you can hand off an Arrow Dataset or query object to duckdb for further querying using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before `to_duckdb()` is evaluated in Arrow.  You can also hand off DuckDB data (or the result of a query) to arrow with `to_arrow()`.
+
+In the example below, we are looking at flights between NYC and Chicago, and want to avoid the worst-of-the-worst delays. To do this, we can use `percent_rank()`; however that requires a window function which isn’t yet available in Arrow, so let’s try sending the data to DuckDB to do that, then pull it back into Arrow:
+
+```r
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+ 
+# use duckdb to do a mutate after a group_by
+flights_filtered <- arrow_table(nycflights13::flights) %>%
+  select(carrier, origin, dest, arr_delay) %>%
+  # arriving early doesn't matter, so call negative delays 0
+  mutate(arr_delay = pmax(arr_delay, 0)) %>%
+  to_duckdb() %>%
+  # for each carrier-origin-dest, take the worst 5% of delays
+  group_by(carrier, origin, dest) %>%
+  mutate(arr_delay_rank = percent_rank(arr_delay)) %>%
+  filter(arr_delay_rank > 0.95)
+
+# pull data back into arrow to complete analysis
+flights_filtered %>%
+  to_arrow() %>%
+  # now summarise to get mean/min
+  group_by(carrier, origin, dest) %>%
+  summarise(arr_delay_mean = mean(arr_delay), arr_delay_min = min(arr_delay), num_flights = n()) %>%
+  filter(dest %in% c("ORD", "MDW")) %>%
+  arrange(desc(arr_delay_mean)) %>%
+  collect()
+```

Review comment:
       show the output?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org