Posted to user@beam.apache.org by Josh <jo...@gmail.com> on 2017/08/03 15:17:56 UTC

PubSubIO withTimestampAttribute - what are the implications?

Hi all,

We've been running a few streaming Beam jobs on Dataflow, where each job is
consuming from PubSub via PubSubIO. Each job does something like this:

PubsubIO.readMessagesWithAttributes()
            .withIdAttribute("unique_id")
            .withTimestampAttribute("timestamp");

My understanding of `withTimestampAttribute` is that the value of the
"timestamp" attribute on each PubSub message is used as the element's event
timestamp, and Beam derives its watermark from those timestamps - so any
windowing we do in the job uses "event time" rather than "processing time".

My question is: is my understanding correct, and does using
`withTimestampAttribute` have any effect in a job that doesn't do any
windowing? I have a feeling it may also have an effect on Dataflow's
autoscaling, since I think Dataflow scales up when the watermark timestamp
lags behind, but I'm not sure about this.

The reason I'm concerned about this is that we've been using it in all
our Dataflow jobs, and have now realised that whenever
`withTimestampAttribute` is used, Dataflow creates an additional PubSub
subscription (suffixed with `__streaming_dataflow_internal`), which appears
to be doubling our PubSub costs (since we pay per subscription)! So I want to
remove `withTimestampAttribute` from jobs where possible, but first want to
understand the implications.
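
For anyone wanting to verify this in their own project, here is a sketch of
how the extra subscriptions can be spotted with the google-cloud-pubsub
admin client (the project ID is a placeholder, and I'm assuming the suffix
is exactly the one we observed):

import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.pubsub.v1.ProjectName;
import com.google.pubsub.v1.Subscription;

public class ListInternalSubscriptions {
  public static void main(String[] args) throws Exception {
    try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
      for (Subscription s :
          client.listSubscriptions(ProjectName.of("my-project")).iterateAll()) {
        // Dataflow's internal tracking subscriptions carry this suffix
        // (as observed on our jobs; placeholder assumption, not a documented API).
        if (s.getName().endsWith("__streaming_dataflow_internal")) {
          System.out.println(s.getName());
        }
      }
    }
  }
}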

Thanks for any advice,
Josh

Re: PubSubIO withTimestampAttribute - what are the implications?

Posted by Josh <jo...@gmail.com>.
Ok great, thanks Lukasz. I will try turning off the timestamp attribute on
some of these jobs then!

Re: PubSubIO withTimestampAttribute - what are the implications?

Posted by Lukasz Cwik <lc...@google.com>.
To my knowledge, autoscaling depends on how many messages are backlogged
within Pubsub and is independent of the second subscription (the second
subscription is really there to compute a better watermark).

Re: PubSubIO withTimestampAttribute - what are the implications?

Posted by jo...@gmail.com.
Thanks Lukasz, that's good to know! It sounds like we can halve our PubSub costs then!

Just to clarify: if I were to remove withTimestampAttribute from a job, the watermark would always be up to date (processing time), even if the job starts to lag behind (a build-up of unacknowledged PubSub messages). In that case, would Dataflow's autoscaling still scale up? I thought the autoscaler scales up because the watermark lags behind, but is it also aware of the backlog of unacknowledged PubSub messages?

Re: PubSubIO withTimestampAttribute - what are the implications?

Posted by Lukasz Cwik <lc...@google.com>.
Your understanding is correct - the data watermark only matters for
windowing. It will not affect autoscaling. If the pipeline is not doing
any windowing, triggering, etc., then there is no need to pay the cost of
the second subscription.
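
In other words, for a job with no event-time logic the read can drop the
attribute entirely. A sketch, reusing the setup from the earlier example
(the subscription path is a placeholder):

// Without withTimestampAttribute, elements are stamped with the time
// they are read (processing time), and Dataflow does not need the
// second internal tracking subscription.
PCollection<PubsubMessage> messages = pipeline.apply("ReadFromPubsub",
    PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/my-project/subscriptions/my-subscription")
        .withIdAttribute("unique_id"));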
