Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 22:00:09 UTC

[GitHub] [beam] kennknowles opened a new issue, #19013: Reduce Pub/Sub publishing latency

kennknowles opened a new issue, #19013:
URL: https://github.com/apache/beam/issues/19013

   The current implementation of the `PubsubUnboundedSink` uses a global window with a trigger that fires on a fixed batch size of 1000 elements or after 2 seconds of processing time. The elements are then spread across 100 random shards and grouped via a `GroupByKey` transform, and the result is pushed into a `DoFn` which performs the actual publishing step.
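
   For illustration, the batching described above corresponds roughly to the following windowing configuration. This is a simplified sketch based on this description, not the actual `PubsubUnboundedSink` source, and `PubsubMessage` stands for the sink's message type:

   ```
   import org.joda.time.Duration;

   import org.apache.beam.sdk.transforms.windowing.AfterFirst;
   import org.apache.beam.sdk.transforms.windowing.AfterPane;
   import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
   import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
   import org.apache.beam.sdk.transforms.windowing.Repeatedly;
   import org.apache.beam.sdk.transforms.windowing.Window;

   // Global window that fires a pane after 1000 elements or 2 seconds of
   // processing time, whichever comes first. The fired panes are then keyed
   // by a random shard in [0, 100) and grouped with GroupByKey before the
   // publishing DoFn runs.
   Window<PubsubMessage> batchingWindow =
       Window.<PubsubMessage>into(new GlobalWindows())
           .triggering(Repeatedly.forever(AfterFirst.of(
               AfterPane.elementCountAtLeast(1000),
               AfterProcessingTime.pastFirstElementInPane()
                   .plusDelayOf(Duration.standardSeconds(2)))))
           .withAllowedLateness(Duration.ZERO)
           .discardingFiredPanes();
   ```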
   
   For low-latency use cases (tens or hundreds of milliseconds), this logic performs poorly: the transform steps described above introduce a latency of around 1.2 seconds.
   
   There are several possibilities to improve the Pub/Sub sink, for example:
   
   Let the following parameters be configured via `PipelineOptions`:
    * `pubsubBatchSize`: Approx. maximum number of elements in a Pub/Sub publishing batch
    * `pubsubDelayThreshold`: Max. processing time duration before firing the sharding window
    * `pubsubShardCount`: The number of shards to create before publishing
   
   This would allow tweaking the Pub/Sub sink for different throughput and message-size scenarios in the pipeline; a possible options interface is sketched below.
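
   A minimal sketch of what such an options interface could look like. The interface name, getter names, and defaults are illustrative and not part of the existing Beam API; the default values mirror the current hard-coded behaviour (1000 elements, 2 seconds, 100 shards):

   ```
   import org.apache.beam.sdk.options.Default;
   import org.apache.beam.sdk.options.Description;
   import org.apache.beam.sdk.options.PipelineOptions;

   // Hypothetical options interface for configuring the Pub/Sub sink.
   public interface PubsubSinkOptions extends PipelineOptions {

       @Description("Approx. maximum number of elements in a Pub/Sub publishing batch")
       @Default.Integer(1000)
       Integer getPubsubBatchSize();
       void setPubsubBatchSize(Integer value);

       @Description("Max. processing time in milliseconds before firing the sharding window")
       @Default.Long(2000)
       Long getPubsubDelayThresholdMillis();
       void setPubsubDelayThresholdMillis(Long value);

       @Description("Number of shards to create before publishing")
       @Default.Integer(100)
       Integer getPubsubShardCount();
       void setPubsubShardCount(Integer value);
   }
   ```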
   
   However, if the throughput is small (< 100 elements/s), this mechanism is still quite slow. Looking at the Java client `com.google.cloud:google-cloud-pubsub`, its `Publisher` class supports a wide range of options to optimize its batching behaviour. Using it would make it possible to drop the window-plus-`GroupByKey` step and let the publisher itself handle the batching.
   
   Consider the following `DoFn` for publishing messages to Pub/Sub using that client:
   ```
   import java.io.IOException;

   import org.apache.beam.sdk.options.ValueProvider;
   import org.apache.beam.sdk.transforms.DoFn;
   import org.apache.beam.sdk.transforms.display.DisplayData;

   import com.google.api.gax.batching.BatchingSettings;
   import com.google.cloud.pubsub.v1.Publisher;
   import com.google.pubsub.v1.PubsubMessage;
   import com.google.pubsub.v1.TopicName;

   // Duration here is the threetenbp type used by the gax BatchingSettings API.
   import org.threeten.bp.Duration;

   // Note: PubsubMessage is the client library's com.google.pubsub.v1.PubsubMessage,
   // not Beam's org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage.
   class PublishFn extends DoFn<PubsubMessage, Void> {

       // The Publisher is not serializable, so it is created in @Setup on each worker.
       private transient Publisher publisher;

       private final ValueProvider<String> topicPath;

       public PublishFn(final ValueProvider<String> topicPath) {
           this.topicPath = topicPath;
       }

       @Setup
       public void setup() throws IOException {
           // Let the client library do the batching instead of a windowed GroupByKey.
           // (Newer client versions use Publisher.newBuilder instead of defaultBuilder.)
           publisher = Publisher.defaultBuilder(TopicName.parse(topicPath.get()))
                   .setBatchingSettings(BatchingSettings.newBuilder()
                           .setRequestByteThreshold(40000L)
                           .setElementCountThreshold(1000L)
                           .setDelayThreshold(Duration.ofMillis(50))
                           .build())
                   .build();
       }

       @ProcessElement
       public void processElement(final ProcessContext context) {
           // Publishes asynchronously; the publisher batches according to the settings above.
           publisher.publish(context.element());
       }

       @Teardown
       public void teardown() throws Exception {
           // Flush pending messages and release client resources.
           publisher.shutdown();
       }

       @Override
       public void populateDisplayData(final DisplayData.Builder builder) {
           builder.add(DisplayData.item("topic", topicPath));
       }
   }
   ```
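
   For completeness, a rough sketch of how such a `DoFn` could be wired into a pipeline. The class name, the bounded `Create` source, and the topic path are placeholders for illustration; in practice the upstream would be an unbounded `PCollection`:

   ```
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
   import org.apache.beam.sdk.options.PipelineOptionsFactory;
   import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
   import org.apache.beam.sdk.transforms.Create;
   import org.apache.beam.sdk.transforms.ParDo;

   import com.google.protobuf.ByteString;
   import com.google.pubsub.v1.PubsubMessage;

   public class LowLatencyPublishExample {
       public static void main(String[] args) {
           Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

           pipeline
               // Placeholder source; ProtoCoder comes from beam-sdks-java-extensions-protobuf.
               .apply(Create.of(PubsubMessage.newBuilder()
                           .setData(ByteString.copyFromUtf8("hello"))
                           .build())
                       .withCoder(ProtoCoder.of(PubsubMessage.class)))
               // Publish directly from the DoFn, bypassing the windowed GroupByKey batching.
               .apply("PublishToPubsub", ParDo.of(new PublishFn(
                       StaticValueProvider.of("projects/my-project/topics/my-topic"))));

           pipeline.run();
       }
   }
   ```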
   
   In a small test, this resulted in a publish latency of around 50 to 70 ms, compared to 1000 to 1200 ms with the original `PubsubUnboundedSink`.
   
   I can understand that the windowing mechanism could lead to better performance and throughput in a scenario with a high number of elements per second. However, it would be nice to enable a "low-latency mode", using the provided code as an example.
   
   Imported from Jira [BEAM-4829](https://issues.apache.org/jira/browse/BEAM-4829). Original Jira may contain additional context.
   Reported by: JanHicken.

