You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@beam.apache.org by Talat Uyarer <tu...@paloaltonetworks.com> on 2020/06/01 20:59:36 UTC

Re: Pipeline Processing Time

Sorry for the late response. Where does the beam set that timestamp field
on element ? Is it set whenever KafkaIO reads that element ? And also I
have a windowing function on my pipeline. Does the timestamp field change
for any kind of operation ? On pipeline I have the following steps: KafkaIO
-> Format Conversion Pardo -> SQL Filter -> Windowing Step -> Custom Sink.
If timestamp set in KafkaIO, Can I see process time by now() - timestamp in
Custom Sink ?

Thanks

On Thu, May 28, 2020 at 2:07 PM Luke Cwik <lc...@google.com> wrote:

> Dataflow provides msec counters for each transform that executes. You
> should be able to get them from stackdriver and see them from the Dataflow
> UI.
>
> You need to keep track of the timestamp of the element as it flows through
> the system as part of data that goes alongside the element. You can use the
> element's timestamp[1] if that makes sense (it might not if you intend to
> use a timestamp that is from the kafka record itself and the record's
> timestamp isn't the same as the ingestion timestamp). Unless you are
> writing your own sink, the sink won't track the processing time at all so
> you'll need to add a ParDo that goes right before it that writes the timing
> information to wherever you want (a counter, your own metrics database,
> logs, ...).
>
> 1:
> https://github.com/apache/beam/blob/018e889829e300ab9f321da7e0010ff0011a73b1/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L257
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_beam_blob_018e889829e300ab9f321da7e0010ff0011a73b1_sdks_java_core_src_main_java_org_apache_beam_sdk_transforms_DoFn.java-23L257&d=DwMFaQ&c=V9IgWpI5PvzTw83UyHGVSoW3Uc1MFWe5J8PTfkrzVSo&r=BkW1L6EF7ergAVYDXCo-3Vwkpy6qjsWAz7_GD7pAR8g&m=1202mTv7BP1KzcBJECS98dr7u5riw0NHdl8rT8I6Ego&s=cPdnrK4r-tVd0iAO6j7eAAbDPISOdazEYBrPoC9cQOo&e=>
>
>
> On Thu, May 28, 2020 at 1:12 PM Talat Uyarer <tu...@paloaltonetworks.com>
> wrote:
>
>> Yes I am trying to track how long it takes for a single element to be
>> ingested into the pipeline until it is output somewhere.
>>
>> My pipeline is unbounded. I am using KafkaIO. I did not think about CPU
>> time. if there is a way to track it too, it would be useful to improve my
>> metrics.
>>
>> On Thu, May 28, 2020 at 12:52 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> What do you mean by processing time?
>>>
>>> Are you trying to track how long it takes for a single element to be
>>> ingested into the pipeline until it is output somewhere?
>>> Do you have a bounded pipeline and want to know how long all the
>>> processing takes?
>>> Do you care about how much CPU time is being consumed in aggregate for
>>> all the processing that your pipeline is doing?
>>>
>>>
>>> On Thu, May 28, 2020 at 11:01 AM Talat Uyarer <
>>> tuyarer@paloaltonetworks.com> wrote:
>>>
>>>> I am using Dataflow Runner. The pipeline read from kafkaIO and send
>>>> Http. I could not find any metadata field on the element to set first read
>>>> time.
>>>>
>>>> On Thu, May 28, 2020 at 10:44 AM Kyle Weaver <kc...@google.com>
>>>> wrote:
>>>>
>>>>> Which runner are you using?
>>>>>
>>>>> On Thu, May 28, 2020 at 1:43 PM Talat Uyarer <
>>>>> tuyarer@paloaltonetworks.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a pipeline which has 5 steps. What is the best way to measure
>>>>>> processing time for my pipeline?
>>>>>>
>>>>>> Thnaks
>>>>>>
>>>>>

Re: Pipeline Processing Time

Posted by Talat Uyarer <tu...@paloaltonetworks.com>.
Thank you Luke and Reuven for helping me. Now I can see my pipeline
processing time for each record.

On Wed, Jun 3, 2020 at 9:25 AM Reuven Lax <re...@google.com> wrote:

> Note: you need to tag the timestamp parameter to @ProcessElement with
> the @Timestamp annotation.
>
> On Mon, Jun 1, 2020 at 3:31 PM Luke Cwik <lc...@google.com> wrote:
>
>> You can configure KafkaIO to use some data from the record as the
>> elements timestamp. See the KafkaIO javadoc around the TimestampPolicy[1],
>> the default is current processing time.
>> You can access the timestamp of the element by adding
>> "org.joda.time.Instant timestamp" as a parameter to your @ProcessElement,
>> see this javadoc for additional details[2]. You could then compute now() -
>> timestamp to calculate processing time.
>>
>> 1:
>> https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/io/kafka/TimestampPolicy.html
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__beam.apache.org_releases_javadoc_2.21.0_org_apache_beam_sdk_io_kafka_TimestampPolicy.html&d=DwMFaQ&c=V9IgWpI5PvzTw83UyHGVSoW3Uc1MFWe5J8PTfkrzVSo&r=BkW1L6EF7ergAVYDXCo-3Vwkpy6qjsWAz7_GD7pAR8g&m=KuUWakZ-xaVGYfsw7YGz1WBOLIlpBHikvRxgZs9vWn0&s=1Sp349fe5C5l4ttxy9iNBlkzoO-9RX_qrvVllkk-PGg&e=>
>> 2:
>> https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/transforms/DoFn.ProcessElement.html
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__beam.apache.org_releases_javadoc_2.21.0_org_apache_beam_sdk_transforms_DoFn.ProcessElement.html&d=DwMFaQ&c=V9IgWpI5PvzTw83UyHGVSoW3Uc1MFWe5J8PTfkrzVSo&r=BkW1L6EF7ergAVYDXCo-3Vwkpy6qjsWAz7_GD7pAR8g&m=KuUWakZ-xaVGYfsw7YGz1WBOLIlpBHikvRxgZs9vWn0&s=nkJq_weo7lrd-JzTEw5PeCC-dkivOJ6AlRxLFXwnMMM&e=>
>>
>> On Mon, Jun 1, 2020 at 2:00 PM Talat Uyarer <tu...@paloaltonetworks.com>
>> wrote:
>>
>>> Sorry for the late response. Where does the beam set that timestamp
>>> field on element ? Is it set whenever KafkaIO reads that element ?
>>>
>> And also I have a windowing function on my pipeline. Does the timestamp
>>> field change for any kind of operation ? On pipeline I have the
>>> following steps: KafkaIO -> Format Conversion Pardo -> SQL Filter ->
>>> Windowing Step -> Custom Sink. If timestamp set in KafkaIO, Can I see
>>> process time by now() - timestamp in Custom Sink ?
>>>
>>>
>> Thanks
>>>
>>> On Thu, May 28, 2020 at 2:07 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> Dataflow provides msec counters for each transform that executes. You
>>>> should be able to get them from stackdriver and see them from the Dataflow
>>>> UI.
>>>>
>>>> You need to keep track of the timestamp of the element as it flows
>>>> through the system as part of data that goes alongside the element. You can
>>>> use the element's timestamp[1] if that makes sense (it might not if you
>>>> intend to use a timestamp that is from the kafka record itself and the
>>>> record's timestamp isn't the same as the ingestion timestamp). Unless you
>>>> are writing your own sink, the sink won't track the processing time at all
>>>> so you'll need to add a ParDo that goes right before it that writes the
>>>> timing information to wherever you want (a counter, your own metrics
>>>> database, logs, ...).
>>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/018e889829e300ab9f321da7e0010ff0011a73b1/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L257
>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_beam_blob_018e889829e300ab9f321da7e0010ff0011a73b1_sdks_java_core_src_main_java_org_apache_beam_sdk_transforms_DoFn.java-23L257&d=DwMFaQ&c=V9IgWpI5PvzTw83UyHGVSoW3Uc1MFWe5J8PTfkrzVSo&r=BkW1L6EF7ergAVYDXCo-3Vwkpy6qjsWAz7_GD7pAR8g&m=1202mTv7BP1KzcBJECS98dr7u5riw0NHdl8rT8I6Ego&s=cPdnrK4r-tVd0iAO6j7eAAbDPISOdazEYBrPoC9cQOo&e=>
>>>>
>>>>
>>>> On Thu, May 28, 2020 at 1:12 PM Talat Uyarer <
>>>> tuyarer@paloaltonetworks.com> wrote:
>>>>
>>>>> Yes I am trying to track how long it takes for a single element to be
>>>>> ingested into the pipeline until it is output somewhere.
>>>>>
>>>>> My pipeline is unbounded. I am using KafkaIO. I did not think about
>>>>> CPU time. if there is a way to track it too, it would be useful to improve
>>>>> my metrics.
>>>>>
>>>>> On Thu, May 28, 2020 at 12:52 PM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> What do you mean by processing time?
>>>>>>
>>>>>> Are you trying to track how long it takes for a single element to be
>>>>>> ingested into the pipeline until it is output somewhere?
>>>>>> Do you have a bounded pipeline and want to know how long all the
>>>>>> processing takes?
>>>>>> Do you care about how much CPU time is being consumed in aggregate
>>>>>> for all the processing that your pipeline is doing?
>>>>>>
>>>>>>
>>>>>> On Thu, May 28, 2020 at 11:01 AM Talat Uyarer <
>>>>>> tuyarer@paloaltonetworks.com> wrote:
>>>>>>
>>>>>>> I am using Dataflow Runner. The pipeline read from kafkaIO and send
>>>>>>> Http. I could not find any metadata field on the element to set first read
>>>>>>> time.
>>>>>>>
>>>>>>> On Thu, May 28, 2020 at 10:44 AM Kyle Weaver <kc...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Which runner are you using?
>>>>>>>>
>>>>>>>> On Thu, May 28, 2020 at 1:43 PM Talat Uyarer <
>>>>>>>> tuyarer@paloaltonetworks.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have a pipeline which has 5 steps. What is the best way to
>>>>>>>>> measure processing time for my pipeline?
>>>>>>>>>
>>>>>>>>> Thnaks
>>>>>>>>>
>>>>>>>>

Re: Pipeline Processing Time

Posted by Reuven Lax <re...@google.com>.
Note: you need to tag the timestamp parameter to @ProcessElement with
the @Timestamp annotation.

On Mon, Jun 1, 2020 at 3:31 PM Luke Cwik <lc...@google.com> wrote:

> You can configure KafkaIO to use some data from the record as the elements
> timestamp. See the KafkaIO javadoc around the TimestampPolicy[1], the
> default is current processing time.
> You can access the timestamp of the element by adding
> "org.joda.time.Instant timestamp" as a parameter to your @ProcessElement,
> see this javadoc for additional details[2]. You could then compute now() -
> timestamp to calculate processing time.
>
> 1:
> https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/io/kafka/TimestampPolicy.html
> 2:
> https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/transforms/DoFn.ProcessElement.html
>
> On Mon, Jun 1, 2020 at 2:00 PM Talat Uyarer <tu...@paloaltonetworks.com>
> wrote:
>
>> Sorry for the late response. Where does the beam set that timestamp field
>> on element ? Is it set whenever KafkaIO reads that element ?
>>
> And also I have a windowing function on my pipeline. Does the timestamp
>> field change for any kind of operation ? On pipeline I have the
>> following steps: KafkaIO -> Format Conversion Pardo -> SQL Filter ->
>> Windowing Step -> Custom Sink. If timestamp set in KafkaIO, Can I see
>> process time by now() - timestamp in Custom Sink ?
>>
>>
> Thanks
>>
>> On Thu, May 28, 2020 at 2:07 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> Dataflow provides msec counters for each transform that executes. You
>>> should be able to get them from stackdriver and see them from the Dataflow
>>> UI.
>>>
>>> You need to keep track of the timestamp of the element as it flows
>>> through the system as part of data that goes alongside the element. You can
>>> use the element's timestamp[1] if that makes sense (it might not if you
>>> intend to use a timestamp that is from the kafka record itself and the
>>> record's timestamp isn't the same as the ingestion timestamp). Unless you
>>> are writing your own sink, the sink won't track the processing time at all
>>> so you'll need to add a ParDo that goes right before it that writes the
>>> timing information to wherever you want (a counter, your own metrics
>>> database, logs, ...).
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/018e889829e300ab9f321da7e0010ff0011a73b1/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L257
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_beam_blob_018e889829e300ab9f321da7e0010ff0011a73b1_sdks_java_core_src_main_java_org_apache_beam_sdk_transforms_DoFn.java-23L257&d=DwMFaQ&c=V9IgWpI5PvzTw83UyHGVSoW3Uc1MFWe5J8PTfkrzVSo&r=BkW1L6EF7ergAVYDXCo-3Vwkpy6qjsWAz7_GD7pAR8g&m=1202mTv7BP1KzcBJECS98dr7u5riw0NHdl8rT8I6Ego&s=cPdnrK4r-tVd0iAO6j7eAAbDPISOdazEYBrPoC9cQOo&e=>
>>>
>>>
>>> On Thu, May 28, 2020 at 1:12 PM Talat Uyarer <
>>> tuyarer@paloaltonetworks.com> wrote:
>>>
>>>> Yes I am trying to track how long it takes for a single element to be
>>>> ingested into the pipeline until it is output somewhere.
>>>>
>>>> My pipeline is unbounded. I am using KafkaIO. I did not think about CPU
>>>> time. if there is a way to track it too, it would be useful to improve my
>>>> metrics.
>>>>
>>>> On Thu, May 28, 2020 at 12:52 PM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> What do you mean by processing time?
>>>>>
>>>>> Are you trying to track how long it takes for a single element to be
>>>>> ingested into the pipeline until it is output somewhere?
>>>>> Do you have a bounded pipeline and want to know how long all the
>>>>> processing takes?
>>>>> Do you care about how much CPU time is being consumed in aggregate for
>>>>> all the processing that your pipeline is doing?
>>>>>
>>>>>
>>>>> On Thu, May 28, 2020 at 11:01 AM Talat Uyarer <
>>>>> tuyarer@paloaltonetworks.com> wrote:
>>>>>
>>>>>> I am using Dataflow Runner. The pipeline read from kafkaIO and send
>>>>>> Http. I could not find any metadata field on the element to set first read
>>>>>> time.
>>>>>>
>>>>>> On Thu, May 28, 2020 at 10:44 AM Kyle Weaver <kc...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Which runner are you using?
>>>>>>>
>>>>>>> On Thu, May 28, 2020 at 1:43 PM Talat Uyarer <
>>>>>>> tuyarer@paloaltonetworks.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a pipeline which has 5 steps. What is the best way to
>>>>>>>> measure processing time for my pipeline?
>>>>>>>>
>>>>>>>> Thnaks
>>>>>>>>
>>>>>>>

Re: Pipeline Processing Time

Posted by Luke Cwik <lc...@google.com>.
You can configure KafkaIO to use some data from the record as the elements
timestamp. See the KafkaIO javadoc around the TimestampPolicy[1], the
default is current processing time.
You can access the timestamp of the element by adding
"org.joda.time.Instant timestamp" as a parameter to your @ProcessElement,
see this javadoc for additional details[2]. You could then compute now() -
timestamp to calculate processing time.

1:
https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/io/kafka/TimestampPolicy.html
2:
https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/transforms/DoFn.ProcessElement.html

On Mon, Jun 1, 2020 at 2:00 PM Talat Uyarer <tu...@paloaltonetworks.com>
wrote:

> Sorry for the late response. Where does the beam set that timestamp field
> on element ? Is it set whenever KafkaIO reads that element ?
>
And also I have a windowing function on my pipeline. Does the timestamp
> field change for any kind of operation ? On pipeline I have the
> following steps: KafkaIO -> Format Conversion Pardo -> SQL Filter ->
> Windowing Step -> Custom Sink. If timestamp set in KafkaIO, Can I see
> process time by now() - timestamp in Custom Sink ?
>
>
Thanks
>
> On Thu, May 28, 2020 at 2:07 PM Luke Cwik <lc...@google.com> wrote:
>
>> Dataflow provides msec counters for each transform that executes. You
>> should be able to get them from stackdriver and see them from the Dataflow
>> UI.
>>
>> You need to keep track of the timestamp of the element as it flows
>> through the system as part of data that goes alongside the element. You can
>> use the element's timestamp[1] if that makes sense (it might not if you
>> intend to use a timestamp that is from the kafka record itself and the
>> record's timestamp isn't the same as the ingestion timestamp). Unless you
>> are writing your own sink, the sink won't track the processing time at all
>> so you'll need to add a ParDo that goes right before it that writes the
>> timing information to wherever you want (a counter, your own metrics
>> database, logs, ...).
>>
>> 1:
>> https://github.com/apache/beam/blob/018e889829e300ab9f321da7e0010ff0011a73b1/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L257
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_beam_blob_018e889829e300ab9f321da7e0010ff0011a73b1_sdks_java_core_src_main_java_org_apache_beam_sdk_transforms_DoFn.java-23L257&d=DwMFaQ&c=V9IgWpI5PvzTw83UyHGVSoW3Uc1MFWe5J8PTfkrzVSo&r=BkW1L6EF7ergAVYDXCo-3Vwkpy6qjsWAz7_GD7pAR8g&m=1202mTv7BP1KzcBJECS98dr7u5riw0NHdl8rT8I6Ego&s=cPdnrK4r-tVd0iAO6j7eAAbDPISOdazEYBrPoC9cQOo&e=>
>>
>>
>> On Thu, May 28, 2020 at 1:12 PM Talat Uyarer <
>> tuyarer@paloaltonetworks.com> wrote:
>>
>>> Yes I am trying to track how long it takes for a single element to be
>>> ingested into the pipeline until it is output somewhere.
>>>
>>> My pipeline is unbounded. I am using KafkaIO. I did not think about CPU
>>> time. if there is a way to track it too, it would be useful to improve my
>>> metrics.
>>>
>>> On Thu, May 28, 2020 at 12:52 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> What do you mean by processing time?
>>>>
>>>> Are you trying to track how long it takes for a single element to be
>>>> ingested into the pipeline until it is output somewhere?
>>>> Do you have a bounded pipeline and want to know how long all the
>>>> processing takes?
>>>> Do you care about how much CPU time is being consumed in aggregate for
>>>> all the processing that your pipeline is doing?
>>>>
>>>>
>>>> On Thu, May 28, 2020 at 11:01 AM Talat Uyarer <
>>>> tuyarer@paloaltonetworks.com> wrote:
>>>>
>>>>> I am using Dataflow Runner. The pipeline read from kafkaIO and send
>>>>> Http. I could not find any metadata field on the element to set first read
>>>>> time.
>>>>>
>>>>> On Thu, May 28, 2020 at 10:44 AM Kyle Weaver <kc...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Which runner are you using?
>>>>>>
>>>>>> On Thu, May 28, 2020 at 1:43 PM Talat Uyarer <
>>>>>> tuyarer@paloaltonetworks.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a pipeline which has 5 steps. What is the best way to measure
>>>>>>> processing time for my pipeline?
>>>>>>>
>>>>>>> Thnaks
>>>>>>>
>>>>>>