Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/06/08 23:18:47 UTC

[GitHub] [beam] TheNeuralBit commented on a change in pull request #14853: [BEAM-12379] Verify proxies in frames_test.py, and address some proxy errors

TheNeuralBit commented on a change in pull request #14853:
URL: https://github.com/apache/beam/pull/14853#discussion_r647858103



##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -122,7 +123,12 @@ def _run_test(self, func, *args, distributed=True, nonparallel=False):
             generated twice, once outside of an allow_non_parallel_operations
             block (to verify NonParallelOperation is raised), and again inside
             of an allow_non_parallel_operations block to actually generate an
-            expression to verify."""
+            expression to verify.
+        check_proxy (bool): Whether or not to check that the proxy of the
+            generated expression matches the actual result, defaults to True.

Review comment:
       So proxies are used for tracking data types at pipeline construction time. We generate one for every expression in the DataFrame expression tree. Sometimes we construct them manually, but usually we "compute" them by calling the expression's function with empty inputs, namely the proxies of the input expressions. This is nice because it leverages pandas' input validation, e.g. pandas will raise an error in proxy generation if the user tries to get the mean() of a non-numeric column, or tries to groupby() a column that doesn't exist.
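
A rough sketch of the "compute the proxy from empty inputs" idea, using plain pandas only. The helper `compute_proxy` and the column names are made up for illustration; this is not how Beam's expression classes are actually written, just the general shape of the technique. It shows only the missing-column case of the validation described above.

```python
import pandas as pd

# Hypothetical input proxy: an empty DataFrame that carries only
# column names and dtypes, no rows.
input_proxy = pd.DataFrame({
    'name': pd.Series(dtype='object'),
    'visits': pd.Series(dtype='int64'),
})


def compute_proxy(func, *input_proxies):
    """Illustrative helper: run the expression's function on empty inputs."""
    return func(*input_proxies)


# The result is empty too, but its dtype reflects what pandas would produce
# (true division of an integer column yields a float column).
ratio_proxy = compute_proxy(lambda df: df['visits'] / 2, input_proxy)

# pandas validates the call even though there is no data: grouping by a
# column that does not exist raises at "construction time".
try:
    compute_proxy(lambda df: df.groupby('missing').sum(), input_proxy)
    validation_error = None
except KeyError as e:
    validation_error = e
```

Here `ratio_proxy` has no rows but already has the output dtype, and `validation_error` holds the `KeyError` pandas raised for the bad groupby key.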
   
   So the impact of having a mismatched proxy is that we may not properly validate the expression. We could allow an expression that will fail at execution time, or disallow an expression that would have worked at execution time.
   
   We also use the proxies for converting back to Beam types in `to_pcollection`. So a mismatched proxy there would put the wrong types in the output PCollection's schema.
   
   I think these mismatches are mostly just due to having the wrong numeric type, so they should be harmless.
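
To make "wrong numeric type" concrete, here is a minimal sketch of the kind of comparison a proxy check could do. `proxy_mismatches` is a hypothetical helper written for this example, not the check in frames_test.py, and the proxy and result frames are constructed by hand rather than produced by a real expression.

```python
import pandas as pd


def proxy_mismatches(proxy, actual):
    """Hypothetical check: map each shared column whose dtype differs
    to a (proxy_dtype, actual_dtype) pair."""
    return {
        col: (proxy[col].dtype, actual[col].dtype)
        for col in proxy.columns
        if col in actual.columns and proxy[col].dtype != actual[col].dtype
    }


# Hand-built example: the proxy claims 'total' is an integer column,
# but the result computed on real data is a float column.
proxy = pd.DataFrame({
    'id': pd.Series(dtype='int64'),
    'total': pd.Series(dtype='int64'),
})
actual = pd.DataFrame({'id': [1, 2], 'total': [1.5, 2.0]})

mismatches = proxy_mismatches(proxy, actual)
```

Only `total` is reported, as an int64 vs. float64 difference; a schema derived from this proxy would declare an integer field where the data is actually floating point.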




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org