Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/06/02 00:56:47 UTC

[GitHub] [beam] danthev commented on a change in pull request #14723: [BEAM-12272] Python - Backport Firestore connector's ramp-up throttling to Datastore connector

danthev commented on a change in pull request #14723:
URL: https://github.com/apache/beam/pull/14723#discussion_r643578566



##########
File path: sdks/python/apache_beam/io/gcp/datastore/v1new/datastoreio.py
##########
@@ -276,15 +277,33 @@ class _Mutate(PTransform):
   Only idempotent Datastore mutation operations (upsert and delete) are
   supported, as the commits are retried when failures occur.
   """
-  def __init__(self, mutate_fn):
+
+  # Default hint for the expected number of workers in the ramp-up throttling
+  # step for write or delete operations.
+  _DEFAULT_HINT_NUM_WORKERS = 500

Review comment:
       I haven't been able to find a reference to throttling counters in the Python Dataflow runner code. I've added a counter with the same name as the others (`cumulativeThrottlingSeconds`) to the DoFn for potential future-proofing. @chamikaramj, let me know if that is alright.
   
   I have done a few more tests with the default value of 500, and even then autoscaling quickly went up to whatever maximum I set (69 was the quota maximum for my test account). We originally picked such a high default to make sure that unconfigured one-off pipelines don't cause much overload even at 2000 workers; if the worker count tops out at a much lower number, the user can always reconfigure it.
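
   For illustration, a minimal sketch of how such a counter might be reported from a Beam Python DoFn via the metrics API. Only the counter name `cumulativeThrottlingSeconds` comes from the comment above; the class name, namespace, and backoff value are hypothetical:

       import time

       from apache_beam.metrics.metric import Metrics
       from apache_beam.transforms import DoFn

       class ThrottlingReportingFn(DoFn):  # hypothetical name, not the PR's code
         def __init__(self):
           # Same counter name as the other connectors, per the comment above.
           self._throttled_secs = Metrics.counter(
               'datastoreio', 'cumulativeThrottlingSeconds')

         def process(self, element):
           backoff_secs = 5  # placeholder: would be supplied by throttling logic
           time.sleep(backoff_secs)
           # Report time spent throttled so the runner can surface it.
           self._throttled_secs.inc(backoff_secs)
           yield element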

##########
File path: sdks/python/apache_beam/io/gcp/datastore/v1new/rampup_throttling_fn.py
##########
@@ -0,0 +1,80 @@
+import datetime
+import logging
+import time
+from typing import TypeVar
+
+from apache_beam import typehints
+from apache_beam.io.gcp.datastore.v1new import util
+from apache_beam.transforms import DoFn
+from apache_beam.utils.retry import FuzzedExponentialIntervals
+
+T = TypeVar('T')
+
+_LOG = logging.getLogger(__name__)
+
+
+@typehints.with_input_types(T)
+@typehints.with_output_types(T)
+class RampupThrottlingFn(DoFn):
+  """A ``DoFn`` that throttles ramp-up following an exponential function.
+
+  An implementation of a client-side throttler that enforces a gradual ramp-up,
+  broadly in line with Datastore best practices. See also
+  https://cloud.google.com/datastore/docs/best-practices#ramping_up_traffic.
+  """
+  def to_runner_api_parameter(self, unused_context):
+    from apache_beam.internal import pickler
+    config = {
+        'num_workers': self._num_workers,
+    }
+    return 'beam:fn:rampup_throttling:v0', pickler.dumps(config)
+
+  _BASE_BUDGET = 500
+  _RAMP_UP_INTERVAL = datetime.timedelta(minutes=5)
+
+  def __init__(self, num_workers, *unused_args, **unused_kwargs):
+    """Initializes a ramp-up throttler transform.
+
+     Args:
+       num_workers: A hint for the expected number of workers, used to derive
+                    the local rate limit.
+     """
+    super(RampupThrottlingFn, self).__init__(*unused_args, **unused_kwargs)
+    self._num_workers = num_workers
+    self._successful_ops = util.MovingSum(window_ms=1000, bucket_ms=1000)
+    self._first_instant = datetime.datetime.now()
+
+  def _calc_max_ops_budget(
+      self,
+      first_instant: datetime.datetime,
+      current_instant: datetime.datetime):
+    """Function that returns per-second budget according to best practices.
+
+    The exact function is `500 / num_shards * 1.5^max(0, (x-5)/5)`, where x is

Review comment:
       Done.
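
   For reference, a standalone sketch of the budget formula quoted in the truncated docstring above, assuming `x` is the number of minutes since the first request (consistent with `_RAMP_UP_INTERVAL` being 5 minutes) and `num_shards` maps to the worker-count hint. This illustrates the formula only, not the connector's implementation:

       import datetime

       _BASE_BUDGET = 500
       _RAMP_UP_INTERVAL = datetime.timedelta(minutes=5)

       def calc_max_ops_budget(first_instant, current_instant, num_workers):
         # 500 / num_workers * 1.5^max(0, (x - 5) / 5), x in minutes elapsed.
         elapsed = current_instant - first_instant
         growth = max(0.0, (elapsed - _RAMP_UP_INTERVAL) / _RAMP_UP_INTERVAL)
         return (_BASE_BUDGET / num_workers) * (1.5**growth)

       # E.g. with the default hint of 500 workers: 1 op/s per worker for the
       # first 5 minutes, then 1.5 ops/s at minute 10, 2.25 at minute 15, ...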




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org