Posted to user@beam.apache.org by Kenneth Knowles <kl...@google.com> on 2017/03/14 04:42:56 UTC

Re: Batch loading for streaming pipelines

This seems like a good topic for user@ so I've moved it there (dev@ to BCC).

You can get a bounded PCollection from KafkaIO via either of
.withMaxNumRecords(<n>) or .withMaxReadTime(<duration>).
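
For example, roughly (an untested sketch; the broker, topic, and record
cap are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.kafka.common.serialization.StringDeserializer;

    Pipeline p = Pipeline.create();

    // Once the record cap is hit (or the read time elapses), the source
    // reports itself finished, so the pipeline can run as a batch job.
    PCollection<KV<String, String>> records = p.apply(
        KafkaIO.<String, String>read()
            .withBootstrapServers("broker-1:9092")  // placeholder
            .withTopic("my-logs")                   // placeholder
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withMaxNumRecords(1000000)
            // or: .withMaxReadTime(Duration.standardHours(1))
            .withoutMetadata());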

Whether or not that will meet your use case would depend on more details of
what you are computing. Periodic batch jobs are harder to get right. In
particular, the time you stop reading and the end of a window (esp.
sessions) are not likely to coincide, so you'll need to deal with that.
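
One rough way to deal with it (an untested sketch; it assumes fixed
windows, a windowed PCollection<KV<String, String>> named "windowed",
and that you record the instant the read stopped) is to drop any window
that may still have been open at the cutoff:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Instant;

    // Assumption: the time at which this batch run stopped reading.
    final Instant cutoff = Instant.now();

    PCollection<KV<String, String>> closedWindowsOnly = windowed.apply(
        "DropPartialWindows",
        ParDo.of(new DoFn<KV<String, String>, KV<String, String>>() {
          @ProcessElement
          public void process(ProcessContext c, BoundedWindow window) {
            // Keep an element only if its window ended before we stopped
            // reading; later windows may still be missing data.
            if (window.maxTimestamp().isBefore(cutoff)) {
              c.output(c.element());
            }
          }
        }));

The records you drop this way would need to be re-read by the next run,
which is another thing periodic batch jobs make you think about.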

Kenn

On Mon, Mar 13, 2017 at 6:09 PM, Arpan Jain <ja...@gmail.com> wrote:

> Hi,
>
> We run multiple streaming pipelines using Cloud Dataflow that read from
> Kafka and write to BigQuery. We don't mind a few hours' delay and are
> thinking of avoiding the costs associated with streaming data into
> BigQuery. Is there already support (or a future plan) for such a
> scenario? If not, then I guess I will implement one of the following options:
> * A BoundedSource implementation for Kafka so that we can run this in
> batch mode.
> * The streaming job writes to GCS and then a BQ load job writes to
> BigQuery.
>
> Thanks!
>
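
For the second option above (streaming job writes to GCS, then a load
job into BigQuery), TextIO supports windowed writes on unbounded input,
along these lines (untested sketch; the bucket, path, window size, and
shard count are placeholders, and "lines" is assumed to be a
PCollection<String> of the log lines):

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    lines
        // Hourly windows, so each load covers a closed hour of data.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
        // Windowed writes need an explicit shard count on unbounded input.
        .apply(TextIO.write()
            .to("gs://my-bucket/logs/output")  // placeholder path
            .withWindowedWrites()
            .withNumShards(10));

A load job scheduled outside the pipeline, something like

    bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable \
        "gs://my-bucket/logs/output*"

could then pick up each closed window's files (assuming the lines are
newline-delimited JSON).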

Re: Batch loading for streaming pipelines

Posted by Arpan Jain <ja...@gmail.com>.
Thanks for the reply! We are using these pipelines to read structured log
lines from Kafka and store them in BigQuery.

.withMaxNumRecords(<n>) or .withMaxReadTime(<duration>) aren't that useful
for us because they do not remember how much they have read in previous
runs.
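
One workaround we may try (untested, and I am not sure how the bounded
wrapper interacts with offset commits, so treat this as an assumption):
give the consumer a fixed group.id and enable auto-commit, hoping that
each bounded run resumes from the offsets the previous run committed:

    import com.google.common.collect.ImmutableMap;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.kafka.common.serialization.StringDeserializer;

    KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092")  // placeholder
        .withTopic("my-logs")                   // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // Assumption: with a fixed group.id and auto-commit, each
        // bounded run resumes from the previously committed offsets.
        .updateConsumerProperties(ImmutableMap.<String, Object>of(
            "group.id", "batch-loader",         // placeholder
            "enable.auto.commit", true))
        .withMaxNumRecords(1000000);

If that assumption doesn't hold, we would have to track offsets ourselves.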


On Mon, Mar 13, 2017 at 9:42 PM, Kenneth Knowles <kl...@google.com.invalid>
wrote:

> This seems like a good topic for user@ so I've moved it there (dev@ to
> BCC).
>
> You can get a bounded PCollection from KafkaIO via either of
> .withMaxNumRecords(<n>) or .withMaxReadTime(<duration>).
>
> Whether or not that will meet your use case would depend on more details of
> what you are computing. Periodic batch jobs are harder to get right. In
> particular, the time you stop reading and the end of a window (esp.
> sessions) are not likely to coincide, so you'll need to deal with that.
>
> Kenn
>
> On Mon, Mar 13, 2017 at 6:09 PM, Arpan Jain <ja...@gmail.com> wrote:
>
> > Hi,
> >
> > We run multiple streaming pipelines using Cloud Dataflow that read from
> > Kafka and write to BigQuery. We don't mind a few hours' delay and are
> > thinking of avoiding the costs associated with streaming data into
> > BigQuery. Is there already support (or a future plan) for such a
> > scenario? If not, then I guess I will implement one of the following
> > options:
> > * A BoundedSource implementation for Kafka so that we can run this in
> > batch mode.
> > * The streaming job writes to GCS and then a BQ load job writes to
> > BigQuery.
> >
> > Thanks!
> >
>