Posted to user@beam.apache.org by Andrew Jones <an...@andrew-jones.com> on 2018/01/03 11:09:29 UTC

Data guarantees PubSub to GCS

Hi,

I'd like to confirm Beam's data guarantees when used with Google Cloud PubSub and Cloud Storage and running on Dataflow. I can't find any explicit documentation on it.

If the Beam job is running successfully, then I believe all data will be delivered to GCS at least once. If I stop the job with 'Drain', then any inflight data will be processed and saved.

What happens if the Beam job is not running successfully, and maybe throwing exceptions? Will the data still be available in PubSub when I cancel (not drain) the job? Does a drain work successfully if the data cannot be written to GCS because of the exceptions?
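
For concreteness, here is a minimal sketch of this kind of PubSub-to-GCS pipeline using the Beam Java SDK; the subscription, bucket path, window size and shard count are illustrative placeholders, not the actual job.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class PubsubToGcs {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply("ReadFromPubsub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
     // Streaming file writes need a window to decide when files can be closed.
     .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
     .apply("WriteToGcs",
            TextIO.write()
                .to("gs://my-bucket/output/events")
                .withWindowedWrites()
                .withNumShards(10));

    p.run();
  }
}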

Thanks,
Andrew

Re: Data guarantees PubSub to GCS

Posted by Pablo Estrada <pa...@google.com>.
Oops - well, sorry about that! Glad Luke was able to clarify.
Best.
-P.

On Thu, Jan 4, 2018, 1:20 PM Lukasz Cwik <lc...@google.com> wrote:

> That is correct Derek, Google Cloud Dataflow will only ack the message to
> Pubsub when a bundle completes.
> For very simple pipelines fusion will make it so that all downstream
> actions will happen within the same bundle.
>
> On Thu, Jan 4, 2018 at 1:00 PM, Derek Hao Hu <ph...@gmail.com>
> wrote:
>
>> Hi Pablo,
>>
>> *Regarding elements coming from PubSub into your pipeline:*
>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
>> subscription, and you won't be able to retrieve it again from PubSub on the
>> same subscription.
>>
>> This part differs from my understanding of consuming Pub/Sub messages in
>> the Dataflow pipeline. I think the message will only be committed when a
>> PCollection in the pipeline gets materialized (
>> https://stackoverflow.com/questions/41727106/when-does-dataflow-acknowledge-a-message-of-batched-items-from-pubsubio),
>> which means that, if the pipeline is not complicated, fusion optimization
>> will fuse multiple stages together, and if any of these stages throws an
>> exception, the Pub/Sub message won't be acknowledged. I've also verified
>> this behavior.
>>
>> Let me know if my understanding is correct. :)
>>
>> Thanks,
>>
>> Derek
>>
>>
>> On Thu, Jan 4, 2018 at 11:42 AM, Pablo Estrada <pa...@google.com>
>> wrote:
>>
>>> I am not a streaming expert, but I will answer according to how I
>>> understand the system, and others can correct me if I get something wrong.
>>>
>>> *Regarding elements coming from PubSub into your pipeline:*
>>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
>>> subscription, and you won't be able to retrieve it again from PubSub on the
>>> same subscription.
>>>
>>> *Regarding elements stuck within your pipeline:*
>>> Bundles in a streaming pipeline are executed and committed individually.
>>> This means that one bundle may be stuck, while all other bundles may be
>>> moving forward in your pipeline. In a case like this, you won't be able to
>>> drain the pipeline because there is one bundle that can't be drained out
>>> (because exceptions are thrown every time processing for it is attempted).
>>> On the other hand, if you cancel your pipeline, then the information
>>> regarding the progress made by each bundle will be lost, so you will drop
>>> the data that was stuck within your pipeline, and was never written out.
>>> (That data was also acked in your PubSub subscription, so it won't come out
>>> from PubSub if you reattach to the same subscription later). - So cancel
>>> may not be what you're looking for either.
>>>
>>> For cases like these, what you'd need to do is to live-update your
>>> pipeline with code that can handle the problems in your current pipeline.
>>> This new code will replace the code in your pipeline stages, and then
>>> Dataflow will continue processing of your data in the state that it was
>>> before the update. This means that if there's one bundle that was stuck, it
>>> will be retried against the new code, and it will finally make progress
>>> across your pipeline.
>>>
>>> If you want to completely change, or stop your pipeline without dropping
>>> stuck bundles, you will still need to live-update it, and then drain it.
>>>
>>> I hope that was clear. Let me know if you need more clarification - and
>>> perhaps others will have more to add / correct.
>>> Best!
>>> -P.
>>>
>>> On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <
>>> andrew+beam@andrew-jones.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to confirm Beam's data guarantees when used with Google Cloud
>>>> PubSub and Cloud Storage and running on Dataflow. I can't find any explicit
>>>> documentation on it.
>>>>
>>>> If the Beam job is running successfully, then I believe all data will
>>>> be delivered to GCS at least once. If I stop the job with 'Drain', then any
>>>> inflight data will be processed and saved.
>>>>
>>>> What happens if the Beam job is not running successfully, and maybe
>>>> throwing exceptions? Will the data still be available in PubSub when I
>>>> cancel (not drain) the job? Does a drain work successfully if the data
>>>> cannot be written to GCS because of the exceptions?
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>
>>
>>
>> --
>> Derek Hao Hu
>>
>> Software Engineer | Snapchat
>> Snap Inc.
>>
>
>

Re: Data guarantees PubSub to GCS

Posted by Lukasz Cwik <lc...@google.com>.
That is correct Derek, Google Cloud Dataflow will only ack the message to
Pubsub when a bundle completes.
For very simple pipelines fusion will make it so that all downstream
actions will happen within the same bundle.
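
As a small hypothetical illustration (the DoFn and its failure condition below are made up, not part of any real pipeline in this thread): a parse step sitting between a Pub/Sub read and a GCS write would be fused with them, so an exception thrown in it keeps the message unacked.

import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical parse step fused between the Pub/Sub read and the GCS write.
// Dataflow only acks the Pub/Sub message when the bundle containing it
// commits, so an exception thrown here (or in any other step fused into the
// same bundle) leaves the message unacked and the work is retried.
class ParseEvent extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String raw = c.element();
    if (raw.isEmpty()) {
      // Failing the bundle: the element's Pub/Sub ack never happens.
      throw new IllegalArgumentException("Malformed event");
    }
    c.output(raw.trim());
  }
}

// Wired in between the read and the write with:
//   .apply("ParseEvent", ParDo.of(new ParseEvent()))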

On Thu, Jan 4, 2018 at 1:00 PM, Derek Hao Hu <ph...@gmail.com> wrote:

> Hi Pablo,
>
> *Regarding elements coming from PubSub into your pipeline:*
> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
> subscription, and you won't be able to retrieve it again from PubSub on the
> same subscription.
>
> This part differs from my understanding of consuming Pub/Sub messages in
> the Dataflow pipeline. I think the message will only be committed when a
> PCollection in the pipeline gets materialized (
> https://stackoverflow.com/questions/41727106/when-does-dataflow-acknowledge-a-message-of-batched-items-from-pubsubio),
> which means that, if the pipeline is not complicated, fusion optimization
> will fuse multiple stages together, and if any of these stages throws an
> exception, the Pub/Sub message won't be acknowledged. I've also verified
> this behavior.
>
> Let me know if my understanding is correct. :)
>
> Thanks,
>
> Derek
>
>
> On Thu, Jan 4, 2018 at 11:42 AM, Pablo Estrada <pa...@google.com> wrote:
>
>> I am not a streaming expert, but I will answer according to how I
>> understand the system, and others can correct me if I get something wrong.
>>
>> *Regarding elements coming from PubSub into your pipeline:*
>> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
>> subscription, and you won't be able to retrieve it again from PubSub on the
>> same subscription.
>>
>> *Regarding elements stuck within your pipeline:*
>> Bundles in a streaming pipeline are executed and committed individually.
>> This means that one bundle may be stuck, while all other bundles may be
>> moving forward in your pipeline. In a case like this, you won't be able to
>> drain the pipeline because there is one bundle that can't be drained out
>> (because exceptions are thrown every time processing for it is attempted).
>> On the other hand, if you cancel your pipeline, then the information
>> regarding the progress made by each bundle will be lost, so you will drop
>> the data that was stuck within your pipeline, and was never written out.
>> (That data was also acked in your PubSub subscription, so it won't come out
>> from PubSub if you reattach to the same subscription later). - So cancel
>> may not be what you're looking for either.
>>
>> For cases like these, what you'd need to do is to live-update your
>> pipeline with code that can handle the problems in your current pipeline.
>> This new code will replace the code in your pipeline stages, and then
>> Dataflow will continue processing of your data in the state that it was
>> before the update. This means that if there's one bundle that was stuck, it
>> will be retried against the new code, and it will finally make progress
>> across your pipeline.
>>
>> If you want to completely change, or stop your pipeline without dropping
>> stuck bundles, you will still need to live-update it, and then drain it.
>>
>> I hope that was clear. Let me know if you need more clarification - and
>> perhaps others will have more to add / correct.
>> Best!
>> -P.
>>
>> On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <an...@andrew-jones.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'd like to confirm Beam's data guarantees when used with Google Cloud
>>> PubSub and Cloud Storage and running on Dataflow. I can't find any explicit
>>> documentation on it.
>>>
>>> If the Beam job is running successfully, then I believe all data will be
>>> delivered to GCS at least once. If I stop the job with 'Drain', then any
>>> inflight data will be processed and saved.
>>>
>>> What happens if the Beam job is not running successfully, and maybe
>>> throwing exceptions? Will the data still be available in PubSub when I
>>> cancel (not drain) the job? Does a drain work successfully if the data
>>> cannot be written to GCS because of the exceptions?
>>>
>>> Thanks,
>>> Andrew
>>>
>>
>
>
> --
> Derek Hao Hu
>
> Software Engineer | Snapchat
> Snap Inc.
>

Re: Data guarantees PubSub to GCS

Posted by Derek Hao Hu <ph...@gmail.com>.
Hi Pablo,

*Regarding elements coming from PubSub into your pipeline:*
Once the data enters your pipeline, it is 'acknowledged' on your PubSub
subscription, and you won't be able to retrieve it again from PubSub on the
same subscription.

This part differs from my understanding of consuming Pub/Sub messages in
the Dataflow pipeline. I think the message will only be committed when a
PCollection in the pipeline gets materialized (
https://stackoverflow.com/questions/41727106/when-does-dataflow-acknowledge-a-message-of-batched-items-from-pubsubio),
which means that, if the pipeline is not complicated, fusion optimization
will fuse multiple stages together, and if any of these stages throws an
exception, the Pub/Sub message won't be acknowledged. I've also verified
this behavior.

Let me know if my understanding is correct. :)

Thanks,

Derek


On Thu, Jan 4, 2018 at 11:42 AM, Pablo Estrada <pa...@google.com> wrote:

> I am not a streaming expert, but I will answer according to how I
> understand the system, and others can correct me if I get something wrong.
>
> *Regarding elements coming from PubSub into your pipeline:*
> Once the data enters your pipeline, it is 'acknowledged' on your PubSub
> subscription, and you won't be able to retrieve it again from PubSub on the
> same subscription.
>
> *Regarding elements stuck within your pipeline:*
> Bundles in a streaming pipeline are executed and committed individually.
> This means that one bundle may be stuck, while all other bundles may be
> moving forward in your pipeline. In a case like this, you won't be able to
> drain the pipeline because there is one bundle that can't be drained out
> (because exceptions are thrown every time processing for it is attempted).
> On the other hand, if you cancel your pipeline, then the information
> regarding the progress made by each bundle will be lost, so you will drop
> the data that was stuck within your pipeline, and was never written out.
> (That data was also acked in your PubSub subscription, so it won't come out
> from PubSub if you reattach to the same subscription later). - So cancel
> may not be what you're looking for either.
>
> For cases like these, what you'd need to do is to live-update your
> pipeline with code that can handle the problems in your current pipeline.
> This new code will replace the code in your pipeline stages, and then
> Dataflow will continue processing of your data in the state that it was
> before the update. This means that if there's one bundle that was stuck, it
> will be retried against the new code, and it will finally make progress
> across your pipeline.
>
> If you want to completely change, or stop your pipeline without dropping
> stuck bundles, you will still need to live-update it, and then drain it.
>
> I hope that was clear. Let me know if you need more clarification - and
> perhaps others will have more to add / correct.
> Best!
> -P.
>
> On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <an...@andrew-jones.com>
> wrote:
>
>> Hi,
>>
>> I'd like to confirm Beam's data guarantees when used with Google Cloud
>> PubSub and Cloud Storage and running on Dataflow. I can't find any explicit
>> documentation on it.
>>
>> If the Beam job is running successfully, then I believe all data will be
>> delivered to GCS at least once. If I stop the job with 'Drain', then any
>> inflight data will be processed and saved.
>>
>> What happens if the Beam job is not running successfully, and maybe
>> throwing exceptions? Will the data still be available in PubSub when I
>> cancel (not drain) the job? Does a drain work successfully if the data
>> cannot be written to GCS because of the exceptions?
>>
>> Thanks,
>> Andrew
>>
>


-- 
Derek Hao Hu

Software Engineer | Snapchat
Snap Inc.

Re: Data guarantees PubSub to GCS

Posted by Pablo Estrada <pa...@google.com>.
I am not a streaming expert, but I will answer according to how I
understand the system, and others can correct me if I get something wrong.

*Regarding elements coming from PubSub into your pipeline:*
Once the data enters your pipeline, it is 'acknowledged' on your PubSub
subscription, and you won't be able to retrieve it again from PubSub on the
same subscription.

*Regarding elements stuck within your pipeline:*
Bundles in a streaming pipeline are executed and committed individually.
This means that one bundle may be stuck, while all other bundles may be
moving forward in your pipeline. In a case like this, you won't be able to
drain the pipeline because there is one bundle that can't be drained out
(because exceptions are thrown every time processing for it is attempted).
On the other hand, if you cancel your pipeline, then the information
regarding the progress made by each bundle will be lost, so you will drop
the data that was stuck within your pipeline, and was never written out.
(That data was also acked in your PubSub subscription, so it won't come out
from PubSub if you reattach to the same subscription later). - So cancel
may not be what you're looking for either.

For cases like these, what you'd need to do is to live-update your pipeline
with code that can handle the problems in your current pipeline. This new
code will replace the code in your pipeline stages, and then Dataflow will
continue processing your data from the state it was in before the
update. This means that if there's one bundle that was stuck, it will be
retried against the new code, and it will finally make progress across your
pipeline.

If you want to completely change, or stop your pipeline without dropping
stuck bundles, you will still need to live-update it, and then drain it.
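
As a rough, hypothetical sketch of what such a live-update could look like with the Beam Java SDK (the step name, job name and parse() logic below are placeholders; the two key ideas are catching the failure instead of rethrowing it, and relaunching with the Dataflow update option):

import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// New version of the previously failing step: it catches the exception and
// drops the bad element (it could also route it to a side output) instead of
// failing the bundle forever.
class SafeParseEvent extends DoFn<String, String> {
  private static final Logger LOG = LoggerFactory.getLogger(SafeParseEvent.class);

  @ProcessElement
  public void processElement(ProcessContext c) {
    try {
      c.output(parse(c.element()));  // parse() stands in for the original logic
    } catch (Exception e) {
      LOG.error("Dropping unparseable element", e);
    }
  }

  private static String parse(String raw) {
    if (raw.isEmpty()) {
      throw new IllegalArgumentException("Malformed event");
    }
    return raw.trim();
  }
}

// When relaunching the pipeline with the fixed DoFn, ask Dataflow to update
// the running job in place rather than start a new one (the job name must
// match the job being replaced):
//
//   DataflowPipelineOptions options =
//       PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
//   options.setJobName("pubsub-to-gcs");  // same name as the running job
//   options.setUpdate(true);              // equivalent to passing --update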

I hope that was clear. Let me know if you need more clarification - and
perhaps others will have more to add / correct.
Best!
-P.

On Wed, Jan 3, 2018 at 3:09 AM Andrew Jones <an...@andrew-jones.com>
wrote:

> Hi,
>
> I'd like to confirm Beam's data guarantees when used with Google Cloud
> PubSub and Cloud Storage and running on Dataflow. I can't find any explicit
> documentation on it.
>
> If the Beam job is running successfully, then I believe all data will be
> delivered to GCS at least once. If I stop the job with 'Drain', then any
> inflight data will be processed and saved.
>
> What happens if the Beam job is not running successfully, and maybe
> throwing exceptions? Will the data still be available in PubSub when I
> cancel (not drain) the job? Does a drain work successfully if the data
> cannot be written to GCS because of the exceptions?
>
> Thanks,
> Andrew
>