Posted to user@beam.apache.org by KV 59 <kv...@gmail.com> on 2021/08/24 14:43:44 UTC

Processing historical data

Hi,

I have a Beam streaming pipeline processing live data from PubSub using
sliding windows on event timestamps. I want to recompute the metrics for
historical data in BigQuery. What are my options?
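
For context, the pipeline shape is roughly as follows (the subscription
name, window sizes, and metric computation below are simplified
placeholders, not the real job):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    public class LiveMetrics {
      public static void main(String[] args) {
        Pipeline p =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply("ReadLive", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/live-events"))
         // Sliding event-time windows; the sizes here are placeholders.
         .apply(Window.<String>into(
             SlidingWindows.of(Duration.standardMinutes(10))
                 .every(Duration.standardMinutes(1))))
         // Stand-in for the real per-window metric computation.
         .apply("ComputeMetrics", Count.perElement());
        p.run();
      }
    }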

I have looked at
https://stackoverflow.com/questions/56702878/how-to-use-apache-beam-to-process-historic-time-series-data
and I have a couple of questions:

1. Can I use the same instance of the streaming pipeline? I don't think
so, as the watermark would be way past the historical event timestamps.

2. Could I possibly split the pipeline and use one branch for historical
data and one for the live streaming data?

I am trying hard to avoid standing up parallel infrastructure to
process the historical data.

Any inputs would be very much appreciated

Thanks
Kishore

Re: Processing historical data

Posted by tr...@gmail.com.
It also makes sense to have separate jobs for reprocessing historical
data, because they will have different scale requirements, and a batch
job can parallelize more efficiently.
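
For example, a separate batch backfill could read BigQuery directly and
re-assign event timestamps from a row field, with no Pub/Sub replay at
all. A rough sketch, assuming a hypothetical table, hypothetical
"event_ts" (an ISO 8601 string) and "metric_key" fields, and
placeholder window sizes:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.WithTimestamps;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    public class HistoricalBackfill {
      public static void main(String[] args) {
        Pipeline p =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply("ReadHistory", BigQueryIO.readTableRows()
                .from("my-project:dataset.events"))
         // Re-establish event time from a column. In a bounded job the
         // watermark only advances once the input is exhausted, so none
         // of the historical data counts as late.
         .apply(WithTimestamps.of((TableRow row) ->
             Instant.parse((String) row.get("event_ts"))))
         .apply(MapElements.into(TypeDescriptors.strings())
             .via((TableRow row) -> (String) row.get("metric_key")))
         .apply(Window.<String>into(
             SlidingWindows.of(Duration.standardMinutes(10))
                 .every(Duration.standardMinutes(1))))
         // Stand-in for the real metric computation.
         .apply("ComputeMetrics", Count.perElement());
        p.run();
      }
    }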


Re: Processing historical data

Posted by tr...@gmail.com.
Add a pipeline option, and make the main IO configurable for either
Pub/Sub or BigQuery.

You can't use the same running streaming job, for the watermark reason
you mentioned.

You should be able to keep the entire pipeline identical (basically a
single composite PTransform) except for the input PCollection.
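
A minimal sketch of that structure, with hypothetical option, table,
and field names; only the input differs, and all shared logic sits in
one composite PTransform:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;

    public class MetricsJob {

      // Hypothetical flag choosing the source; everything downstream
      // of the source is shared between live and backfill runs.
      public interface Options extends PipelineOptions {
        @Default.String("pubsub")
        String getSourceType();
        void setSourceType(String value);
      }

      // The entire shared pipeline as one composite PTransform.
      static class ComputeMetrics
          extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
        @Override
        public PCollection<KV<String, Long>> expand(PCollection<String> events) {
          return events
              .apply(Window.<String>into(
                  SlidingWindows.of(Duration.standardMinutes(10))
                      .every(Duration.standardMinutes(1))))
              .apply(Count.perElement()); // stand-in for the real metrics
        }
      }

      public static void main(String[] args) {
        Options opts = PipelineOptionsFactory.fromArgs(args).as(Options.class);
        Pipeline p = Pipeline.create(opts);

        // Only the input PCollection differs between the two runs.
        PCollection<String> events;
        if ("pubsub".equals(opts.getSourceType())) {
          events = p.apply(PubsubIO.readStrings()
              .fromSubscription("projects/my-project/subscriptions/live-events"));
        } else {
          // Event timestamps would still need to be assigned from a row
          // field, as in the batch sketch above.
          events = p.apply(BigQueryIO.readTableRows()
                  .from("my-project:dataset.events"))
              .apply(MapElements.into(TypeDescriptors.strings())
                  .via((TableRow row) -> (String) row.get("metric_key")));
        }

        events.apply("ComputeMetrics", new ComputeMetrics());
        p.run();
      }
    }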


Re: Processing historical data

Posted by Evan Galpin <ev...@gmail.com>.
Hi Kishore,

You may be able to introduce a new pipeline which reads from BQ and
publishes to PubSub, as you mentioned. By default, elements read via
PubSub will have a timestamp equal to the publish time of the message
(established internally by the PubSub service).

Are you using a custom withTimestampAttribute? If not, I believe there
should be no issue with the watermark or late data. As per the javadoc
for PubSub Read [1]:

> Note that the system can guarantee that no late data will ever be seen
when it assigns timestamps by arrival time (i.e. timestampAttribute is not
provided).

[1]
https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html#withTimestampAttribute-java.lang.String-
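
For illustration, here is a minimal sketch of the two read modes (the
subscription and attribute names are made up):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReplayReads {
      public static void main(String[] args) {
        Pipeline p =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Default: no withTimestampAttribute, so each element is stamped
        // with its Pub/Sub publish time and the service can guarantee no
        // late data.
        PCollection<String> byArrivalTime =
            p.apply("ReadByArrival", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/backfill"));

        // Alternative: trust a sender-set attribute as the event
        // timestamp. Watermark and late-data behavior then depend on
        // publish order.
        PCollection<String> byEventTime =
            p.apply("ReadByEventTime", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/backfill")
                .withTimestampAttribute("event_ts"));

        p.run();
      }
    }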

Hope that helps.

Thanks,
Evan


Re: Processing historical data

Posted by Ankur Goenka <go...@google.com>.
Sorry for the late response.
You can potentially use event time to make it work. This can be done by
adding a timestamp attribute when republishing the data to Pub/Sub [1],
which will take care of the watermark.
For option 2, you can have a job with two different sources, and I
would expect the two separate branches to handle the watermark
correctly. However, this would be effectively the same as running two
separate jobs, so I would advise against it.

[1] https://cloud.google.com/pubsub/docs/reference/rest/v1/PubsubMessage
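
A rough sketch of that republish job (the table, field, topic, and
attribute names are hypothetical; note PubsubIO expects the attribute
value to be milliseconds since the epoch or an RFC 3339 string):

    import com.google.api.services.bigquery.model.TableRow;
    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class RepublishHistory {
      public static void main(String[] args) {
        Pipeline p =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply("ReadHistory", BigQueryIO.readTableRows()
                .from("my-project:dataset.events"))
         .apply("ToPubsubMessage", MapElements
                .into(TypeDescriptor.of(PubsubMessage.class))
                .via((TableRow row) -> new PubsubMessage(
                    // Placeholder serialization of the row payload.
                    row.toString().getBytes(StandardCharsets.UTF_8),
                    // Carry the original event time in an attribute,
                    // assuming "event_ts" holds an RFC 3339 string.
                    Collections.singletonMap(
                        "event_ts", (String) row.get("event_ts")))))
         .setCoder(PubsubMessageWithAttributesCoder.of())
         .apply("Publish", PubsubIO.writeMessages()
                .to("projects/my-project/topics/backfill"));
        p.run();
      }
    }

The streaming side would then read with .withTimestampAttribute("event_ts")
so that its watermark follows the replayed event times.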


Re: Processing historical data

Posted by KV 59 <kv...@gmail.com>.
Hi Ankur,

Thanks for the response. I have a follow-up to options 1 & 2.

Even if I were able to stream the historical data to PubSub and restart
the job, I believe it would still not work because of the watermarks.
Am I right?

As for option 2, even if I were able to make an incompatible change (by
doing a hard restart), it would still be a challenge because of the
watermarks.

Thanks
Kishore

Re: Processing historical data

Posted by Ankur Goenka <go...@google.com>.
I suppose historical data processing will be a one-time activity, so it
will be best to use a batch job to process the historical data.
As for the options you mentioned:
Option 1 is not feasible, as you would have to update the pipeline, and
I believe the update would not be compatible because of the source
change.
Option 2 also requires changes to the first job, which can be done
using an update or a drain-and-restart, so it has the same problem
while being more complicated.

Thanks,
Ankur
