You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/07/13 18:59:49 UTC
[GitHub] [beam] aaltay commented on a change in pull request #15165: [BEAM-12593] Verify DataFrame API on pandas 1.3.0

aaltay commented on a change in pull request #15165:
URL: https://github.com/apache/beam/pull/15165#discussion_r669031002



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3403,10 +3406,16 @@ def value_counts(self, subset=None, sort=False, normalize=False,
           "ordering on the dataset which likely will not be preserved.",
           reason="order-sensitive")
     columns = subset or list(self.columns)
-    result = self.groupby(columns).size()
+
+    if dropna:
+      dropped = self.dropna()
+    else:
+      dropped = self
+
+    result = dropped.groupby(columns, dropna=dropna).size()
 
     if normalize:
-      return result/self.dropna().length()
+      return result/dropped.length()

Review comment:
       This looks correct, but to be explicit this is changing the normalized value for dropna=False. Is this intentional?

##########
File path: sdks/python/apache_beam/dataframe/pandas_doctests_test.py
##########
@@ -386,6 +394,7 @@ def test_series_tests(self):
             ],
             'pandas.core.series.Series.fillna': [
                 "df.fillna(method='ffill')",
+                'df.fillna(method="ffill")',

Review comment:
       duplicate?

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3392,7 +3395,7 @@ def melt(self, ignore_index, **kwargs):
 
   @frame_base.with_docs_from(pd.DataFrame)
   def value_counts(self, subset=None, sort=False, normalize=False,
-                   ascending=False):
+                   ascending=False, dropna=True):

Review comment:
       There was a note about "dropna=False is new in pandas 1.3" below. Should we match that?

##########
File path: sdks/python/apache_beam/dataframe/pandas_doctests_test.py
##########
@@ -491,12 +503,13 @@ def test_series_tests(self):
                 # Inspection after modification.
                 's'
             ],
+            'pandas.core.series.Series.resample': ['df'],
         })
     self.assertEqual(result.failed, 0)
 
   def test_string_tests(self):
-    PD_VERSION = tuple(int(v) for v in pd.__version__.split('.'))
-    if PD_VERSION < (1, 2, 0):
+    PD_VERSION = tuple(int(v) for v in pd.__version__.split('.')[:2])

Review comment:
       Related to `PD_VERSION`:
   - Should we define this in one common place?
   - And use a consistent way to get it, either tuple(int(... or tuple(map(... for consistency?
   - And is it worth keeping the patch version or not? (probably not.)

##########
File path: sdks/python/apache_beam/dataframe/pandas_doctests_test.py
##########
@@ -69,6 +69,7 @@ def test_ndframe_tests(self):
             ],
             'pandas.core.generic.NDFrame.fillna': [
                 "df.fillna(method='ffill')",
+                'df.fillna(method="ffill")',

Review comment:
       duplicate?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org