Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/24 17:38:15 UTC

[GitHub] [arrow] westonpace commented on a diff in pull request #14482: ARROW-18137: [Python][Docs] adding info about TableGroupBy.aggregation with empty list

westonpace commented on code in PR #14482:
URL: https://github.com/apache/arrow/pull/14482#discussion_r1003583810


##########
python/pyarrow/table.pxi:
##########
@@ -5282,6 +5282,7 @@ class TableGroupBy:
 list[tuple(str, str, FunctionOptions)]
             List of tuples made of aggregation column names followed
             by function names and optionally aggregation function options.
+            Pass empty list to imitate drop_duplicates pandas function.

Review Comment:
   It's not quite the same though. pandas `drop_duplicates` keeps the columns that are not key columns. By default it keeps the first row in each group, though this is configurable through the `keep` parameter. An empty aggregation list, by contrast, returns only the key columns. For example:
   
   ```
   >>> import pyarrow as pa
   >>> tab = pa.Table.from_pydict({"x": [1, 1, 1, 2, 2], "y": ["a", "b", "c", "d", "e"]})
   >>> pa.TableGroupBy(tab, "x").aggregate([])
   pyarrow.Table
   x: int64
   ----
   x: [[1,2]]
   ```
   
   With `drop_duplicates` you would also get `y: [["a", "d"]]`. You can roughly imitate this with the `one` aggregation function, which picks an arbitrary value from a non-key column for each group, so it is not guaranteed to be the first one ("first" and "last" are difficult concepts within datasets at the moment).
   
   ```
   >>> pa.TableGroupBy(tab, "x").aggregate([("y", "one")])
   pyarrow.Table
   y_one: string
   x: int64
   ----
   y_one: [["a","d"]]
   x: [[1,2]]
   ```
   
   Either way, maybe this should be:
   
   ```suggestion
               Pass empty list to get a single row for each group.
   ```


