Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/05/07 21:32:14 UTC

[GitHub] [beam] jaketf commented on a change in pull request #11596: [BEAM-9856] Optimization/hl7v2 io list messages

jaketf commented on a change in pull request #11596:
URL: https://github.com/apache/beam/pull/11596#discussion_r421807059



##########
File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java
##########
@@ -472,24 +548,120 @@ public void initClient() throws IOException {
       this.client = new HttpHealthcareApiClient();
     }
 
+    @GetInitialRestriction
+    public OrderedTimeRange getEarliestToLatestRestriction(@Element String hl7v2Store)
+        throws IOException {
+      from = this.client.getEarliestHL7v2SendTime(hl7v2Store, this.filter);
+      // filters are [from, to) to match logic of OffsetRangeTracker but need latest element to be
+      // included in results set to add an extra ms to the upper bound.
+      to = this.client.getLatestHL7v2SendTime(hl7v2Store, this.filter).plus(1);
+      return new OrderedTimeRange(from, to);
+    }
+
+    @NewTracker
+    public OrderedTimeRangeTracker newTracker(@Restriction OrderedTimeRange timeRange) {
+      return timeRange.newTracker();
+    }
+
+    @SplitRestriction
+    public void split(
+        @Restriction OrderedTimeRange timeRange, OutputReceiver<OrderedTimeRange> out) {
+      // TODO(jaketf) How to pick optimal values for desiredNumOffsetsPerSplit ?

Review comment:
       Unfortunately, in this use case dynamic splitting would be crucial because we can't know the distribution of data in the restriction dimension (sendTime) ahead of time.
   
   For example, a hospital is likely to be much busier during the daytime and on weekdays than at night or on weekends (though never dormant, due to ICU and emergency services). "Daytime" varies with hospital location, weekdays are subject to holidays, etc.
   
   The data distribution in sendTime may also show significant spikes if one of the upstream systems populating sendTime has to backfill after a maintenance period and, rather than responsibly setting this field to event time, sets all of the sendTimes to a short range of backfill processing time (this is suboptimal behavior for that system, but sometimes a reality).
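
To make the concern concrete, here is a minimal standalone sketch (plain Java, no Beam dependency) of what a purely static, equal-width split of the sendTime range does when the data is skewed. Everything in it is hypothetical and only for illustration: the class `StaticSplitSkew`, its helper methods, and the synthetic data (one message per hour over a day, plus a backfill burst of 1000 messages stamped within a single minute) are assumptions, not code from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of HL7v2IO or this PR. */
public class StaticSplitSkew {

  /** Splits [fromMs, toMs) into at most numSplits equal-width, half-open ranges. */
  public static List<long[]> splitEvenly(long fromMs, long toMs, int numSplits) {
    List<long[]> ranges = new ArrayList<>();
    // Ceiling division so the ranges always cover the full interval.
    long width = (toMs - fromMs + numSplits - 1) / numSplits;
    for (long start = fromMs; start < toMs; start += width) {
      ranges.add(new long[] {start, Math.min(start + width, toMs)});
    }
    return ranges;
  }

  /** Counts how many sendTimes fall into each half-open range [start, end). */
  public static int[] countPerRange(List<long[]> ranges, long[] sendTimes) {
    int[] counts = new int[ranges.size()];
    for (long t : sendTimes) {
      for (int i = 0; i < ranges.size(); i++) {
        if (t >= ranges.get(i)[0] && t < ranges.get(i)[1]) {
          counts[i]++;
          break;
        }
      }
    }
    return counts;
  }

  /** Synthetic data: one message per hour for 24h, plus a 1000-message backfill burst. */
  public static long[] syntheticSendTimes() {
    long hourMs = 3_600_000L;
    long[] times = new long[24 + 1000];
    for (int i = 0; i < 24; i++) {
      times[i] = i * hourMs;
    }
    // Assumed backfill: all 1000 messages stamped within one minute starting at hour 3.
    for (int i = 0; i < 1000; i++) {
      times[24 + i] = 3 * hourMs + i * 60L;
    }
    return times;
  }

  public static void main(String[] args) {
    long dayMs = 24 * 3_600_000L;
    List<long[]> ranges = splitEvenly(0L, dayMs, 4);
    int[] counts = countPerRange(ranges, syntheticSendTimes());
    System.out.println(Arrays.toString(counts));
  }
}
```

With these synthetic inputs the four equal-width ranges receive 1006, 6, 6 and 6 messages, so one worker would do nearly all of the work. No fixed choice of split width avoids this without knowing where the burst is, which is why being able to split further at runtime matters here.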




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org