You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/08/10 23:32:52 UTC

[GitHub] [beam] pabloem commented on a change in pull request #12420: Extending ApproximateQuantiles functionality to deal with non-uniform weights.

pabloem commented on a change in pull request #12420:
URL: https://github.com/apache/beam/pull/12420#discussion_r468063036



##########
File path: sdks/python/apache_beam/transforms/stats.py
##########
@@ -263,30 +265,38 @@ class Globally(PTransform):
 
     Args:
       num_quantiles: number of elements in the resulting quantiles values list.
+      weighted: (optional) if set to True, the transform returns weighted
+        quantiles. The input PCollection is then expected to contain tuples of
+        input values with the corresponding weight.

Review comment:
       Perhaps under the ApproximateQuantiles transform pydoc, rather than on these args (so it'll work for perKey and globally)? Up to you.

##########
File path: sdks/python/apache_beam/transforms/stats.py
##########
@@ -576,17 +633,34 @@ def _interpolate(self, i_buffers, count, step, offset):
     weighted_element = next(sorted_elem)
     current = weighted_element[1]
     j = 0
-    while j < count:
-      target = j * step + offset
-      j = j + 1
-      try:
-        while current <= target:
-          weighted_element = next(sorted_elem)
-          current = current + weighted_element[1]
-      except StopIteration:
-        pass
-      new_elements.append(weighted_element[0])
-    return new_elements
+    if self._weighted:

Review comment:
       I wonder if it's possible to do this without duplicating the code in both branches? I understand that separating branches may help performance, but it decreases on readability...  WDYT?

##########
File path: sdks/python/apache_beam/transforms/stats.py
##########
@@ -263,30 +265,38 @@ class Globally(PTransform):
 
     Args:
       num_quantiles: number of elements in the resulting quantiles values list.
+      weighted: (optional) if set to True, the transform returns weighted
+        quantiles. The input PCollection is then expected to contain tuples of
+        input values with the corresponding weight.
       key: (optional) Key is  a mapping of elements to a comparable key, similar
         to the key argument of Python's sorting methods.
       reverse: (optional) whether to order things smallest to largest, rather
         than largest to smallest
     """
-    def __init__(self, num_quantiles, key=None, reverse=False):
+    def __init__(self, num_quantiles, weighted=False, key=None, reverse=False):

Review comment:
       Can we add weighted as the last argument? So we'll have a change that's backwards compatible

##########
File path: sdks/python/apache_beam/transforms/stats.py
##########
@@ -263,30 +265,38 @@ class Globally(PTransform):
 
     Args:
       num_quantiles: number of elements in the resulting quantiles values list.
+      weighted: (optional) if set to True, the transform returns weighted
+        quantiles. The input PCollection is then expected to contain tuples of
+        input values with the corresponding weight.

Review comment:
       Perhaps show an example of this in a code snippet?

##########
File path: sdks/python/apache_beam/transforms/stats.py
##########
@@ -398,8 +424,8 @@ class ApproximateQuantilesCombineFn(CombineFn):
   http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1
   &type=pdf
 
-  The default error bound is (1 / N), though in practice the accuracy
-  tends to be much better.
+  The default error bound is (1 / N) for uniformly distributed data and 1e-2 for

Review comment:
       I don't know much math, so I wonder: How come the error bound for the approximation for weighted elements is not f(N)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org