Posted to issues@beam.apache.org by "Beam JIRA Bot (Jira)" <ji...@apache.org> on 2021/03/11 17:19:01 UTC
[jira] [Commented] (BEAM-11393) Support grouping by a Series

    [ https://issues.apache.org/jira/browse/BEAM-11393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299736#comment-17299736 ] 

Beam JIRA Bot commented on BEAM-11393:
--------------------------------------

This issue is P2 but has been unassigned without any comment for 60 days so it has been labeled "stale-P2". If this issue is still affecting you, we care! Please comment and remove the label. Otherwise, in 14 days the issue will be moved to P3.

Please see https://beam.apache.org/contribute/jira-priorities/ for a detailed explanation of what these priorities mean.


> Support grouping by a Series
> ----------------------------
>
>                 Key: BEAM-11393
>                 URL: https://issues.apache.org/jira/browse/BEAM-11393
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Priority: P2
>              Labels: stale-P2
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Grouping by a Series (e.g. {{df.groupby(df.column)}}, {{series.groupby(other_series)}}) does not work. The previous implementation relied on aligning the index between the two deferred frames, but one or both frames may have duplicate values in their index, which leads to the following error at execution time:
> {code}
>     Traceback (most recent call last):                                                                                                                                                                                                  
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 237, in fix                                                                                                           
>         computed = self.compute(to_compute)                                                                                                                                                                                             
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 195, in compute_using_session
>         return {                                                                                                                                                                                                                        
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 196, in <dictcomp>                                              
>         name: frame._expr.evaluate_at(session)                                                                     
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at                        
>         return self._func(*(session.evaluate(arg) for arg in self._args))                                          
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in <genexpr>                                             
>         return self._func(*(session.evaluate(arg) for arg in self._args))                                          
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 144, in evaluate                           
>         result = evaluate_with(input_partitioning)
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 114, in evaluate_with
>         results.append(session.evaluate(expr))                                                                                                                                                                                          
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 42, in evaluate
>         self._bindings[expr] = expr.evaluate_at(self)                                                                                                                                                                                   
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at
>         return self._func(*(session.evaluate(arg) for arg in self._args))                                                                                                                                                               
>       File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/frames.py", line 149, in set_index
>         df, by = df.align(by, axis=0, join='inner')                                                                                                                                                                                     
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/frame.py", line 3962, in align
>         return super().align(
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8559, in align                                   
>         return self._align_series(                        
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8681, in _align_series                                                      
>         fdata = fdata.reindex_indexer(join_index, lidx, axis=1)
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1276, in reindex_indexer
>         self.axes[axis]._can_reindex(indexer)             
>       File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3289, in _can_reindex
>         raise ValueError("cannot reindex from a duplicate axis")
>     ValueError: cannot reindex from a duplicate axis           
> {code}
> Discovered in https://github.com/apache/beam/pull/13401, GHA run: https://github.com/apache/beam/runs/1445605501
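
To illustrate why index alignment breaks down when labels repeat, here is a minimal pure-Python sketch. It is not Beam or pandas code: the helpers `duplicate_labels` and `inner_join_on_labels` and the sample data are hypothetical, and the join is a simplified stand-in for the alignment step named in the description. It only shows that matching rows by label stops being one-to-one once a label is duplicated.

```python
from collections import Counter


def duplicate_labels(index):
    """Return the labels that occur more than once in an index, sorted."""
    counts = Counter(index)
    return sorted(label for label, n in counts.items() if n > 1)


def inner_join_on_labels(left, right):
    """Pair up rows of two (label, value) lists whose labels match.

    With unique labels each left row matches at most one right row.
    With duplicate labels every matching pair is emitted, so a row can
    no longer be tied to a single grouping key.
    """
    return [(label, lv, rv)
            for label, lv in left
            for rlabel, rv in right
            if label == rlabel]


# Hypothetical data: rows of a frame and the grouping keys from a Series,
# both carrying the duplicated index label 0.
rows = [(0, "a"), (0, "b"), (1, "c")]
keys = [(0, "x"), (0, "y"), (1, "z")]

print(duplicate_labels([label for label, _ in rows]))

joined = inner_join_on_labels(rows, keys)
print(len(rows), len(joined))
```

In this sketch the three input rows turn into five joined rows, because each row labelled 0 matches both keys labelled 0. That ambiguity is the reason a label-based alignment cannot decide which key belongs to which row.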



--
This message was sent by Atlassian Jira
(v8.3.4#803005)