You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/04/28 22:16:57 UTC

[GitHub] [beam] jaketf commented on a change in pull request #11538: [BEAM-9831] Improve performance and UX for HL7v2IO

jaketf commented on a change in pull request #11538:
URL: https://github.com/apache/beam/pull/11538#discussion_r416957276



##########
File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java
##########
@@ -437,6 +444,20 @@ private Message fetchMessage(HealthcareApiClient client, String msgId)
           .apply(Create.of(this.hl7v2Stores))
           .apply(ParDo.of(new ListHL7v2MessagesFn(this.filter)))
           .setCoder(new HL7v2MessageCoder())
+          // Listing takes a long time for each input element (HL7v2 store) because it has to
+          // paginate through results in a single thread / ProcessElement call in order to keep
+          // track of page token.
+          // Eagerly emit data on 1 second intervals so downstream processing can get started before
+          // all of the list results have been paginated through.

Review comment:
       Each "page" of responses is a collection of messages. It don't think it make sense to page through all the pages (dropping the real data) to then re-fetch it in the downstream parallelized step. 
   
   In testing w/ customer when pointing at an HL7v2 store with many, many messages (and therefore pages) they reported 
   before this change:
   there was a long time before any elements were output. so long that they gave up and killed the pipeline. 
   after this change: 
   there was data coming out more regularly.
   
   This could have been a misunderstanding or a bad test scenario.
   I will try to come up with a test that reproduces this behavior.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org