
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/06/29 15:41:28 UTC

[GitHub] [beam] TheNeuralBit commented on a change in pull request #15089: [BEAM-12533] Add simple `repr` for `DeferredDataFrame` and `DeferredSeries`

TheNeuralBit commented on a change in pull request #15089:
URL: https://github.com/apache/beam/pull/15089#discussion_r660743021



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -143,6 +143,14 @@ def wrapper(self, *args, **kwargs):
 
 
 class DeferredDataFrameOrSeries(frame_base.DeferredFrame):
+  def _render_indexes(self):

Review comment:
       > It may be better to keep things simple here and do `indexes={...}` even for a single index. Though TBH I am not sure why `index.name` and `index.names` are separate attributes (is this different between a pandas index and a deferred index)?
   
   This is true in pandas as well. If a DataFrame or Series has multiple indexes, a separate type is used ([MultiIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html)), which doesn't have a meaningful `name` attribute. The `names` attribute always works, though; it's just a single-element list in the non-MultiIndex case.
   
   I tend to agree it would be nice to keep this simpler. I'm hesitant, though: I think the single-index case is much more common, so I want it to look correct.
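
To make the trade-off concrete, here is a minimal sketch of the two renderings being discussed. It is not the code in this PR: `render_indexes` is a hypothetical helper, and it assumes its input is a list shaped like pandas' `index.names` (one element for a plain index, several for a MultiIndex).

```python
def render_indexes(names):
    """Render index names for a data-less repr (hypothetical helper).

    `names` is assumed to look like pandas' `index.names`:
    one element for a plain index, several for a MultiIndex.
    """
    names = list(names)
    if len(names) == 1:
        # Common case: keep the singular form so it reads naturally.
        return f"index={names[0]!r}"
    # Multiple levels: show all names in one list.
    return f"indexes={names!r}"


print(render_indexes(["bazzy"]))            # index='bazzy'
print(render_indexes(["bazzy", "barbar"]))  # indexes=['bazzy', 'barbar']
print(render_indexes([None]))               # index=None
```

The simpler alternative suggested in the quoted comment would drop the `len(names) == 1` branch and always emit the `indexes=[...]` form.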
    
   > Related question, why can't we use `repr(index)`? I wonder if there is more to an index than just its name? For example, if there are different types of index (which I gather there are) could that information be useful to the user?
   
   Yeah, good question. A pandas index does have more information: a type, plus actual data. pandas's `repr(index)` shows all of this (name, type, and data), e.g.:
   
   ```
   In [4]: df.set_index('bazzy').index
   Out[4]: Int64Index([1, 2, 3, 4, 5, 6], dtype='int64', name='bazzy')
   
   In [9]: df.set_index(['bazzy', 'barbar']).index
   Out[9]: 
   MultiIndex([(1, 'A'),
               (2, 'B'),
               (3, 'C'),
               (4, 'A'),
               (5, 'B'),
               (6, 'C')],
              names=['bazzy', 'barbar'])
   ```
   
   In the Beam case we don't have any data to show, so `repr(index)` should show just the name and type. However, I don't think it would make sense to re-use this in the repr for DataFrame and Series:
   1. DataFrame columns also have type information. If we include index types, we should include the column types as well, which gets pretty verbose (without a more formatted output).
   2. pandas doesn't show types for columns or indexes in its repr for DataFrame.
   
   Note that users can always inspect other attributes to get type information, e.g. with `df.dtypes`.
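
As a rough illustration of that last point, the snippet below uses plain pandas to read type information from attributes rather than from the repr. The DataFrame contents and the `foo` column are made up for the example, and it only demonstrates pandas itself; how closely the deferred Beam objects mirror these attributes is not shown here.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"bazzy": [1, 2, 3], "foo": [0.5, 1.5, 2.5]})

# Column types, one entry per column.
print(df.dtypes)

# Index name and type, available separately from the DataFrame repr.
indexed = df.set_index("bazzy")
print(list(indexed.index.names))
print(indexed.index.dtype)
```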
   
   
   




