Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/24 12:22:27 UTC

[GitHub] [arrow-cookbook] vibhatha opened a new pull request #155: [Python] Add a Python Cookbook recipe on group_by + sort

vibhatha opened a new pull request #155:
URL: https://github.com/apache/arrow-cookbook/pull/155


   This PR includes a new section to address the issue: https://github.com/apache/arrow-cookbook/issues/110


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-cookbook] vibhatha commented on pull request #155: [Python] Add a Python Cookbook recipe on group_by + sort

Posted by GitBox <gi...@apache.org>.
vibhatha commented on pull request #155:
URL: https://github.com/apache/arrow-cookbook/pull/155#issuecomment-1058079172


   @amol- Thank you for the feedback, I will address these points. 





[GitHub] [arrow-cookbook] vibhatha commented on a change in pull request #155: [Python] Add a Python Cookbook recipe on group_by + sort

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #155:
URL: https://github.com/apache/arrow-cookbook/pull/155#discussion_r818692078



##########
File path: python/source/data.rst
##########
@@ -294,6 +294,146 @@ using :meth:`pyarrow.Table.set_column`
     item: [["Potato","Bean","Cucumber","Eggs"]]
     new_amount: [[30,20,15,40]]
 
+Group and Sort a Table
+======================
+
+If you have a table which needs to be grouped by a particular key, 
+you can use :meth:`pyarrow.Table.group_by` followed by an aggregation
+operation :meth:`pyarrow.TableGroupBy.aggregate`.
+
+For example, let’s say we have some data with a set of keys and a
+value associated with each key, and we want to group the data by
+those keys and apply an aggregate function such as ``sum`` to compute
+the total of the values for each unique key.
+
+.. testcode::
+
+  import pyarrow as pa
+
+  table = pa.table([
+       pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
+       pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
+      ], names=["keys", "values"])
+
+  print(table)
+
+.. testoutput::
+
+    pyarrow.Table
+    keys: string
+    values: int64
+    ----
+    keys: [["a","a","b","b","c","d","e","c"]]
+    values: [[11,20,3,4,5,1,4,10]]
+
+Now let's apply a group-by operation. Note that a group-by
+operation returns a :class:`pyarrow.TableGroupBy` object, which provides
+the aggregation method :meth:`pyarrow.TableGroupBy.aggregate`.
+
+.. testcode::
+
+  grouped_table = table.group_by("keys")
+
+  print(type(grouped_table))
+
+.. testoutput::
+
+    <class 'pyarrow.lib.TableGroupBy'>
+
+The output shows the type of the grouped object. Now that the table is
+grouped by the field ``keys``, let's apply the aggregate operation
+``sum`` to the values in the column ``values``. Note that an
+aggregation operation is paired with a column name.
+
+.. testcode::
+
+  aggregated_table = grouped_table.aggregate([("values", "sum")])
+
+  print(aggregated_table)

Review comment:
       Yes, that makes sense. I will remove this part and redirect the user to the API docs for more details. 







[GitHub] [arrow-cookbook] amol- commented on a change in pull request #155: [Python] Add a Python Cookbook recipe on group_by + sort

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #155:
URL: https://github.com/apache/arrow-cookbook/pull/155#discussion_r818670658



##########
File path: python/source/data.rst
##########
@@ -294,6 +294,146 @@ using :meth:`pyarrow.Table.set_column`
     item: [["Potato","Bean","Cucumber","Eggs"]]
     new_amount: [[30,20,15,40]]
 
+Group and Sort a Table
+======================
+
+If you have a table which needs to be grouped by a particular key, 
+you can use :meth:`pyarrow.Table.group_by` followed by an aggregation
+operation :meth:`pyarrow.TableGroupBy.aggregate`.
+
+For example, let’s say we have some data with a set of keys and a
+value associated with each key, and we want to group the data by
+those keys and apply an aggregate function such as ``sum`` to compute
+the total of the values for each unique key.
+
+.. testcode::
+
+  import pyarrow as pa
+
+  table = pa.table([
+       pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
+       pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
+      ], names=["keys", "values"])
+
+  print(table)
+
+.. testoutput::
+
+    pyarrow.Table
+    keys: string
+    values: int64
+    ----
+    keys: [["a","a","b","b","c","d","e","c"]]
+    values: [[11,20,3,4,5,1,4,10]]
+
+Now let's apply a group-by operation. Note that a group-by
+operation returns a :class:`pyarrow.TableGroupBy` object, which provides
+the aggregation method :meth:`pyarrow.TableGroupBy.aggregate`.
+
+.. testcode::
+
+  grouped_table = table.group_by("keys")
+
+  print(type(grouped_table))
+
+.. testoutput::
+
+    <class 'pyarrow.lib.TableGroupBy'>
+
+The output shows the type of the grouped object. Now that the table is
+grouped by the field ``keys``, let's apply the aggregate operation
+``sum`` to the values in the column ``values``. Note that an
+aggregation operation is paired with a column name.
+
+.. testcode::
+
+  aggregated_table = grouped_table.aggregate([("values", "sum")])
+
+  print(aggregated_table)

Review comment:
       This and previous code blocks should probably be collapsed. The purpose of the recipes is to showcase an immediately copy/pastable code block that people can use in their codebase or to play around with the feature. So the topic of the recipe shouldn't be divided in multiple code blocks.







[GitHub] [arrow-cookbook] amol- merged pull request #155: [Python] Add a Python Cookbook recipe on group_by + sort

Posted by GitBox <gi...@apache.org>.
amol- merged pull request #155:
URL: https://github.com/apache/arrow-cookbook/pull/155


   





[GitHub] [arrow-cookbook] vibhatha commented on a change in pull request #155: [Python] Add a Python Cookbook recipe on group_by + sort

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #155:
URL: https://github.com/apache/arrow-cookbook/pull/155#discussion_r818691016



##########
File path: python/source/data.rst
##########
@@ -294,6 +294,146 @@ using :meth:`pyarrow.Table.set_column`
     item: [["Potato","Bean","Cucumber","Eggs"]]
     new_amount: [[30,20,15,40]]
 
+Group and Sort a Table
+======================
+
+If you have a table which needs to be grouped by a particular key, 
+you can use :meth:`pyarrow.Table.group_by` followed by an aggregation
+operation :meth:`pyarrow.TableGroupBy.aggregate`.
+
+For example, let’s say we have some data with a set of keys and a
+value associated with each key, and we want to group the data by
+those keys and apply an aggregate function such as ``sum`` to compute
+the total of the values for each unique key.
+
+.. testcode::
+
+  import pyarrow as pa
+
+  table = pa.table([
+       pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
+       pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
+      ], names=["keys", "values"])
+
+  print(table)
+
+.. testoutput::
+
+    pyarrow.Table
+    keys: string
+    values: int64
+    ----
+    keys: [["a","a","b","b","c","d","e","c"]]
+    values: [[11,20,3,4,5,1,4,10]]
+
+Now let's apply a group-by operation. Note that a group-by
+operation returns a :class:`pyarrow.TableGroupBy` object, which provides
+the aggregation method :meth:`pyarrow.TableGroupBy.aggregate`.
+
+.. testcode::
+
+  grouped_table = table.group_by("keys")
+
+  print(type(grouped_table))
+
+.. testoutput::
+
+    <class 'pyarrow.lib.TableGroupBy'>
+
+The output shows the type of the grouped object. Now that the table is
+grouped by the field ``keys``, let's apply the aggregate operation
+``sum`` to the values in the column ``values``. Note that an
+aggregation operation is paired with a column name.
+
+.. testcode::
+
+  aggregated_table = grouped_table.aggregate([("values", "sum")])
+
+  print(aggregated_table)
+
+.. testoutput::
+
+    pyarrow.Table
+    values_sum: int64
+    keys: string
+    ----
+    values_sum: [[31,7,15,1,4]]
+    keys: [["a","b","c","d","e"]]
+
+Note that the new table names the aggregated column ``values_sum``,
+which is formed from the column name and the aggregation operation name.
+
+Aggregation operations can be applied with options. Let's take a case where
+we have null values in our dataset, but we want to count the values
+in each group, excluding the null values.
+
+A sample dataset can be formed as follows. 
+
+.. testcode::
+
+  import pyarrow as pa
+
+  table = pa.table([
+        pa.array(["a", "a", "b", "b", "b", "c", "d", "d", "e", "c"]),
+        pa.array([None, 20, 3, 4, 5, 6, 10, 1, 4, None]),
+        ], names=["keys", "values"])
+
+  print(table)
+
+.. testoutput::
+
+    pyarrow.Table
+    keys: string
+    values: int64
+    ----
+    keys: [["a","a","b","b","b","c","d","d","e","c"]]
+    values: [[null,20,3,4,5,6,10,1,4,null]]
+
+Let's apply an aggregation operation ``count`` with the option to exclude
+null values. 
+
+.. testcode::
+
+  import pyarrow.compute as pc
+
+  grouped_table = table.group_by("keys").aggregate(
+    [("values", 
+    "count",
+    pc.CountOptions(mode="only_valid"))]
+  )
+
+  print(grouped_table)
+
+.. testoutput::
+
+    pyarrow.Table
+    values_count: int64
+    keys: string
+    ----
+    values_count: [[1,3,1,2,1]]
+    keys: [["a","b","c","d","e"]]
+
+So far we have discussed how to apply the group-by operation
+to a table. Another important operation on grouped data
+is sorting. Data can be sorted in either ``ascending`` or ``descending`` order.

Review comment:
       Sure, I will add a separate recipe for this. 







[GitHub] [arrow-cookbook] amol- commented on a change in pull request #155: [Python] Add a Python Cookbook recipe on group_by + sort

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #155:
URL: https://github.com/apache/arrow-cookbook/pull/155#discussion_r818672623



##########
File path: python/source/data.rst
##########
@@ -294,6 +294,146 @@ using :meth:`pyarrow.Table.set_column`
     item: [["Potato","Bean","Cucumber","Eggs"]]
     new_amount: [[30,20,15,40]]
 
+Group and Sort a Table
+======================
+
+If you have a table which needs to be grouped by a particular key, 
+you can use :meth:`pyarrow.Table.group_by` followed by an aggregation
+operation :meth:`pyarrow.TableGroupBy.aggregate`.
+
+For example, let’s say we have some data with a set of keys and a
+value associated with each key, and we want to group the data by
+those keys and apply an aggregate function such as ``sum`` to compute
+the total of the values for each unique key.
+
+.. testcode::
+
+  import pyarrow as pa
+
+  table = pa.table([
+       pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
+       pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
+      ], names=["keys", "values"])
+
+  print(table)
+
+.. testoutput::
+
+    pyarrow.Table
+    keys: string
+    values: int64
+    ----
+    keys: [["a","a","b","b","c","d","e","c"]]
+    values: [[11,20,3,4,5,1,4,10]]
+
+Now let's apply a group-by operation. Note that a group-by
+operation returns a :class:`pyarrow.TableGroupBy` object, which provides
+the aggregation method :meth:`pyarrow.TableGroupBy.aggregate`.
+
+.. testcode::
+
+  grouped_table = table.group_by("keys")
+
+  print(type(grouped_table))
+
+.. testoutput::
+
+    <class 'pyarrow.lib.TableGroupBy'>
+
+The output shows the type of the grouped object. Now that the table is
+grouped by the field ``keys``, let's apply the aggregate operation
+``sum`` to the values in the column ``values``. Note that an
+aggregation operation is paired with a column name.
+
+.. testcode::
+
+  aggregated_table = grouped_table.aggregate([("values", "sum")])
+
+  print(aggregated_table)

Review comment:
       If you want to make sure the reader has a chance to get an explanation you can link to the docs ( https://arrow.apache.org/docs/python/compute.html#grouped-aggregations ) so that the reader can get a proper explanation of how things work, but the purpose of the cookbook is not to explain things but to provide an immediately usable code snippet to solve the target problem.







[GitHub] [arrow-cookbook] vibhatha commented on pull request #155: [Python] Add a Python Cookbook recipe on group_by + sort

Posted by GitBox <gi...@apache.org>.
vibhatha commented on pull request #155:
URL: https://github.com/apache/arrow-cookbook/pull/155#issuecomment-1058202848


   @amol- updated the PR with some modifications. 





[GitHub] [arrow-cookbook] amol- commented on a change in pull request #155: [Python] Add a Python Cookbook recipe on group_by + sort

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #155:
URL: https://github.com/apache/arrow-cookbook/pull/155#discussion_r818669250



##########
File path: python/source/data.rst
##########
@@ -294,6 +294,146 @@ using :meth:`pyarrow.Table.set_column`
     item: [["Potato","Bean","Cucumber","Eggs"]]
     new_amount: [[30,20,15,40]]
 
+Group and Sort a Table
+======================
+
+If you have a table which needs to be grouped by a particular key, 
+you can use :meth:`pyarrow.Table.group_by` followed by an aggregation
+operation :meth:`pyarrow.TableGroupBy.aggregate`.
+
+For example, let’s say we have some data with a set of keys and a
+value associated with each key, and we want to group the data by
+those keys and apply an aggregate function such as ``sum`` to compute
+the total of the values for each unique key.
+
+.. testcode::
+
+  import pyarrow as pa
+
+  table = pa.table([
+       pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
+       pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
+      ], names=["keys", "values"])
+
+  print(table)
+
+.. testoutput::
+
+    pyarrow.Table
+    keys: string
+    values: int64
+    ----
+    keys: [["a","a","b","b","c","d","e","c"]]
+    values: [[11,20,3,4,5,1,4,10]]
+
+Now let's apply a group-by operation. Note that a group-by
+operation returns a :class:`pyarrow.TableGroupBy` object, which provides
+the aggregation method :meth:`pyarrow.TableGroupBy.aggregate`.
+
+.. testcode::
+
+  grouped_table = table.group_by("keys")
+
+  print(type(grouped_table))
+
+.. testoutput::
+
+    <class 'pyarrow.lib.TableGroupBy'>
+
+The output shows the type of the grouped object. Now that the table is
+grouped by the field ``keys``, let's apply the aggregate operation
+``sum`` to the values in the column ``values``. Note that an
+aggregation operation is paired with a column name.
+
+.. testcode::
+
+  aggregated_table = grouped_table.aggregate([("values", "sum")])
+
+  print(aggregated_table)
+
+.. testoutput::
+
+    pyarrow.Table
+    values_sum: int64
+    keys: string
+    ----
+    values_sum: [[31,7,15,1,4]]
+    keys: [["a","b","c","d","e"]]
+
+Note that the new table names the aggregated column ``values_sum``,
+which is formed from the column name and the aggregation operation name.
+
+Aggregation operations can be applied with options. Let's take a case where
+we have null values in our dataset, but we want to count the values
+in each group, excluding the null values.
+
+A sample dataset can be formed as follows. 
+
+.. testcode::
+
+  import pyarrow as pa
+
+  table = pa.table([
+        pa.array(["a", "a", "b", "b", "b", "c", "d", "d", "e", "c"]),
+        pa.array([None, 20, 3, 4, 5, 6, 10, 1, 4, None]),
+        ], names=["keys", "values"])
+
+  print(table)
+
+.. testoutput::
+
+    pyarrow.Table
+    keys: string
+    values: int64
+    ----
+    keys: [["a","a","b","b","b","c","d","d","e","c"]]
+    values: [[null,20,3,4,5,6,10,1,4,null]]
+
+Let's apply an aggregation operation ``count`` with the option to exclude
+null values. 
+
+.. testcode::
+
+  import pyarrow.compute as pc
+
+  grouped_table = table.group_by("keys").aggregate(
+    [("values", 
+    "count",
+    pc.CountOptions(mode="only_valid"))]
+  )
+
+  print(grouped_table)
+
+.. testoutput::
+
+    pyarrow.Table
+    values_count: int64
+    keys: string
+    ----
+    values_count: [[1,3,1,2,1]]
+    keys: [["a","b","c","d","e"]]
+
+So far we have discussed how to apply the group-by operation
+to a table. Another important operation on grouped data
+is sorting. Data can be sorted in either ``ascending`` or ``descending`` order.

Review comment:
       This should be a separate recipe. Each recipe should explain a single thing. We can add a reference to the other recipe from the group+aggregate one so that readers know that results can be sorted.



