You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@beam.apache.org by "Chloe Thonin (JIRA)" <ji...@apache.org> on 2019/04/02 15:17:00 UTC

[jira] [Created] (BEAM-6970) Java SDK : How to bounded messages from PubSub ?

Chloe Thonin created BEAM-6970:
----------------------------------

             Summary: Java SDK : How to bounded messages from PubSub ? 
                 Key: BEAM-6970
                 URL: https://issues.apache.org/jira/browse/BEAM-6970
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow, sdk-java-core
    Affects Versions: 2.11.0
         Environment: Beam worker : Dataflow (GCP)
            Reporter: Chloe Thonin


I want to set up a dataflow Pipeline in Streaming mode.

My dataflow do theses tasks :
 * Read messages from pubsub
 * Build my "Document" object
 * Insert this Document into BigQuery
 * Store the initial message into Google Cloud Storage

The code successfully build and run, but it takes a lot of time for some messages.

I think it takes a lot of time because it treats pusub message one by one :

 ->  I build a PCollection<Documents> with messages from pubsub.
{code:java}
PCollection<Documents> rawtickets = pipeline
        .apply("Read PubSub Events", PubsubIO.readStrings().fromTopic(TOPIC))
        .apply("Make Document ", ParDo.of(new DoFn<String, Documents>() {
                                              @ProcessElement                                              public void processElement(ProcessContext c) throws Exception {
                                     String rawXMl = c.element();
                                 /** My code for build my object "Document" **/
                                     c.output(rawXml);
                                     }
                                 }
        ))
        .setCoder(AvroCoder.of(Documents.class));

{code}
 

Here a picture of my complete process :

[Complete Process |https://zupimages.net/up/19/14/w4df.png]

We can see the latency of the process after reading messages.

Concretely, I search to create a Pcollection of N element. I have tried methods founds here : [https://beam.apache.org/documentation/programming-guide/#triggers] but it doesn't group  my pubsub messages.

 

How I can batch this process  ? 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)