Posted to dev@beam.apache.org by Maximilian Michels <mx...@apache.org> on 2016/05/02 11:54:56 UTC

Re: [DISCUSS] Beam IO & runners native IO

Yes, I would expect sinks to provide additional interfaces similar to
sources, e.g. for checkpointing. We could also use the
startBundle/processElement/finishBundle lifecycle methods to implement
checkpointing. I just wonder if we want to make it more explicit.
Also, does it make sense that sinks can return a PCollection? You can
return PDone, but you don't have to.

Since sinks are fundamental in streaming pipelines, it just seemed odd
to me that there is no dedicated interface. I understand more clearly
now that it is not viewed as crucial, because we can use existing
primitives to create sinks. In a way, that might be elegant but also
less explicit.
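
To make this concrete, the bundle-lifecycle approach I have in mind looks
roughly like the sketch below (old-style DoFn overrides; MyDbClient and its
methods are just made-up placeholders for whatever external system the sink
talks to):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

public class DbWriteFn extends DoFn<String, Void> {

  private transient MyDbClient client;   // made-up client for the external system
  private transient List<String> buffer;

  @Override
  public void startBundle(Context c) throws Exception {
    client = MyDbClient.connect();       // open a connection per bundle
    buffer = new ArrayList<>();
  }

  @Override
  public void processElement(ProcessContext c) {
    buffer.add(c.element());             // stage writes for this bundle
  }

  @Override
  public void finishBundle(Context c) throws Exception {
    client.writeBatch(buffer);           // flush once, when the bundle finishes
    client.close();
  }
}

That works, but the flush is implicit in the bundle boundaries rather than
being a contract the runner can reason about, which is why I wonder about
making it explicit.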

On Fri, Apr 29, 2016 at 11:00 PM, Frances Perry <fj...@google.com.invalid> wrote:
>>
>> @Frances Sources are not simple DoFns. They add additional
>> functionality, e.g. checkpointing, watermark generation, creating
>> splits. If we want sinks to be portable, we should think about a
>> dedicated interface. At least for the checkpointing.
>>
>
> We might be mixing sources and sinks in this conversation. ;-) Sources
> definitely provide additional functionality as you mentioned. But at least
> currently, sinks don't provide any new primitive functionality. Are you
> suggesting there needs to be a checkpointing interface for sinks beyond
> DoFn's bundle finalization? (Note that the existing Write for batch is just
> a PTransform based around ParDo.)

Re: [DISCUSS] Beam IO & runners native IO

Posted by Dan Halperin <dh...@google.com.INVALID>.
Certainly, there are a number of patterns for which carefully co-designing
the pipeline structure with the output system gives the best semantics. It
may be worth looking at the (batch) Write transform to see how the sink is
split into

1) single initialize
2) parallel write
3) single finalize

and at FileBasedSink to see how it uses temp files and idempotent rename to
get correctness. The above are all done within the existing model's
semantics, using Create and side inputs and the
startBundle/processElement/finishBundle design patterns.
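
In very rough strokes (made-up class names, no error handling, and the real
Write/FileBasedSink is considerably more involved), steps 2) and 3) look
something like:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.apache.beam.sdk.transforms.DoFn;

// 2) parallel write: each bundle writes its own temp file and emits the path.
class WriteBundleToTempFn extends DoFn<String, String> {
  private final String tempDir;   // produced by the single "initialize" step
  private transient Path tempPath;
  private transient BufferedWriter writer;

  WriteBundleToTempFn(String tempDir) { this.tempDir = tempDir; }

  @Override
  public void startBundle(Context c) throws IOException {
    tempPath = Files.createTempFile(Paths.get(tempDir), "bundle-", ".tmp");
    writer = Files.newBufferedWriter(tempPath, StandardCharsets.UTF_8);
  }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    writer.write(c.element());
    writer.newLine();
  }

  @Override
  public void finishBundle(Context c) throws IOException {
    writer.close();
    c.output(tempPath.toString());  // hand the temp file to the finalizer
  }
}

// 3) single finalize: one element carrying all the temp paths (how they get
// gathered into a single element is omitted here) is renamed into place.
// Because this step only renames complete files, a retry never exposes
// partially written output.
class FinalizeFn extends DoFn<Iterable<String>, Void> {
  private final String finalDir;

  FinalizeFn(String finalDir) { this.finalDir = finalDir; }

  @Override
  public void processElement(ProcessContext c) throws IOException {
    for (String temp : c.element()) {
      Path src = Paths.get(temp);
      Files.move(src, Paths.get(finalDir).resolve(src.getFileName()),
          StandardCopyOption.ATOMIC_MOVE);
    }
  }
}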

Admittedly, this is harder for some external systems, and the correctness
guarantees can depend on the APIs they provide. E.g., whether they support
atomic bulk operations such as a database bulk load or a filesystem rename.

I need to spend some time reviewing the Flink RollingSink, but it may be
that we can get close within the model already, or that the right
application of the forthcoming state APIs will help with the internal
checkpointing that Aljoscha is probably referencing.

Dan

On Tue, May 3, 2016 at 10:15 AM, Raghu Angadi <ra...@google.com.invalid>
wrote:

> Agreed. finishBundle() helps but cannot guarantee consistent state.
>
> On Tue, May 3, 2016 at 1:49 AM, Maximilian Michels <mx...@apache.org> wrote:
>
> > Correct, Kafka doesn't support rollbacks of the producer. In Flink
> > there is the RollingSink which supports transactional rolling files.
> > Admittedly, that is the only one. Still, checkpointing sinks in Beam
> > could be useful for users who are concerned about exactly once
> > semantics. I'm not sure whether we can implement something similar
> > with the bundle mechanism.
> >
> > On Mon, May 2, 2016 at 11:50 PM, Raghu Angadi
> > <ra...@google.com.invalid> wrote:
> > > What are good examples of streaming sinks that support checkpointing
> > > (or transactions/rollbacks)? I don't think Kafka supports a rollback.
> > >
> > > On Mon, May 2, 2016 at 2:54 AM, Maximilian Michels <mx...@apache.org>
> > > wrote:
> > >
> > >> Yes, I would expect sinks to provide additional interfaces similar to
> > >> sources, e.g. for checkpointing. We could also use the
> > >> startBundle/processElement/finishBundle lifecycle methods to implement
> > >> checkpointing. I just wonder if we want to make it more explicit.
> > >> Also, does it make sense that sinks can return a PCollection? You can
> > >> return PDone, but you don't have to.
> > >>
> > >> Since sinks are fundamental in streaming pipelines, it just seemed odd
> > >> to me that there is no dedicated interface. I understand more clearly
> > >> now that it is not viewed as crucial, because we can use existing
> > >> primitives to create sinks. In a way, that might be elegant but also
> > >> less explicit.
> > >>
> > >> On Fri, Apr 29, 2016 at 11:00 PM, Frances Perry
> > >> <fjp@google.com.invalid> wrote:
> > >> >>
> > >> >> @Frances Sources are not simple DoFns. They add additional
> > >> >> functionality, e.g. checkpointing, watermark generation, creating
> > >> >> splits. If we want sinks to be portable, we should think about a
> > >> >> dedicated interface. At least for the checkpointing.
> > >> >>
> > >> >
> > >> > We might be mixing sources and sinks in this conversation. ;-)
> > >> > Sources definitely provide additional functionality as you
> > >> > mentioned. But at least currently, sinks don't provide any new
> > >> > primitive functionality. Are you suggesting there needs to be a
> > >> > checkpointing interface for sinks beyond DoFn's bundle finalization?
> > >> > (Note that the existing Write for batch is just a PTransform based
> > >> > around ParDo.)
> > >>
> >
>

Re: [DISCUSS] Beam IO & runners native IO

Posted by Raghu Angadi <ra...@google.com.INVALID>.
Agreed. finishBundle() helps but cannot guarantee consistent state.

On Tue, May 3, 2016 at 1:49 AM, Maximilian Michels <mx...@apache.org> wrote:

> Correct, Kafka doesn't support rollbacks of the producer. In Flink
> there is the RollingSink which supports transactional rolling files.
> Admittedly, that is the only one. Still, checkpointing sinks in Beam
> could be useful for users who are concerned about exactly once
> semantics. I'm not sure whether we can implement something similar
> with the bundle mechanism.
>
> On Mon, May 2, 2016 at 11:50 PM, Raghu Angadi
> <ra...@google.com.invalid> wrote:
> > What are good examples of streaming sinks that support checkpointing (or
> > transactions/rollbacks)? I don't think Kafka supports a rollback.
> >
> > On Mon, May 2, 2016 at 2:54 AM, Maximilian Michels <mx...@apache.org>
> > wrote:
> >
> >> Yes, I would expect sinks to provide additional interfaces similar to
> >> sources, e.g. for checkpointing. We could also use the
> >> startBundle/processElement/finishBundle lifecycle methods to implement
> >> checkpointing. I just wonder if we want to make it more explicit.
> >> Also, does it make sense that sinks can return a PCollection? You can
> >> return PDone, but you don't have to.
> >>
> >> Since sinks are fundamental in streaming pipelines, it just seemed odd
> >> to me that there is no dedicated interface. I understand more clearly
> >> now that it is not viewed as crucial, because we can use existing
> >> primitives to create sinks. In a way, that might be elegant but also
> >> less explicit.
> >>
> >> On Fri, Apr 29, 2016 at 11:00 PM, Frances Perry
> >> <fjp@google.com.invalid> wrote:
> >> >>
> >> >> @Frances Sources are not simple DoFns. They add additional
> >> >> functionality, e.g. checkpointing, watermark generation, creating
> >> >> splits. If we want sinks to be portable, we should think about a
> >> >> dedicated interface. At least for the checkpointing.
> >> >>
> >> >
> >> > We might be mixing sources and sinks in this conversation. ;-)
> >> > Sources definitely provide additional functionality as you mentioned.
> >> > But at least currently, sinks don't provide any new primitive
> >> > functionality. Are you suggesting there needs to be a checkpointing
> >> > interface for sinks beyond DoFn's bundle finalization? (Note that the
> >> > existing Write for batch is just a PTransform based around ParDo.)
> >>
>

Re: [DISCUSS] Beam IO & runners native IO

Posted by Maximilian Michels <mx...@apache.org>.
Correct, Kafka doesn't support rollbacks of the producer. In Flink
there is the RollingSink which supports transactional rolling files.
Admittedly, that is the only one. Still, checkpointing sinks in Beam
could be useful for users who are concerned about exactly once
semantics. I'm not sure whether we can implement something similar
with the bundle mechanism.
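
Just to illustrate the kind of hook I have in mind (this interface does not
exist anywhere in Beam, the names are invented), a checkpointing runner
could drive something like:

// Purely hypothetical -- sketched only to show what an explicit contract
// could look like, in the spirit of Flink's RollingSink.
public interface CheckpointableSink<T> {

  // Stage a write for the current checkpoint interval.
  void write(T element) throws Exception;

  // Called when the runner takes a checkpoint: make the staged writes
  // durable (e.g. flush them to a pending file) and return a handle.
  SinkCheckpoint snapshot() throws Exception;

  // Called once the checkpoint has completed everywhere: atomically commit
  // the staged writes, e.g. rename the pending file to its final name.
  void commit(SinkCheckpoint checkpoint) throws Exception;

  // Called if the checkpoint is abandoned: discard the staged writes.
  void abort(SinkCheckpoint checkpoint) throws Exception;

  // Opaque, serializable handle to whatever one snapshot staged.
  interface SinkCheckpoint extends java.io.Serializable {}
}

The commit/abort step tied to checkpoint completion is the part I don't see
how to express with bundles alone.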

On Mon, May 2, 2016 at 11:50 PM, Raghu Angadi
<ra...@google.com.invalid> wrote:
> What are good examples of streaming sinks that support checkpointing (or
> transactions/rollbacks)? I don't think Kafka supports a rollback.
>
> On Mon, May 2, 2016 at 2:54 AM, Maximilian Michels <mx...@apache.org> wrote:
>
>> Yes, I would expect sinks to provide additional interfaces similar to
>> sources, e.g. for checkpointing. We could also use the
>> startBundle/processElement/finishBundle lifecycle methods to implement
>> checkpointing. I just wonder if we want to make it more explicit.
>> Also, does it make sense that sinks can return a PCollection? You can
>> return PDone, but you don't have to.
>>
>> Since sinks are fundamental in streaming pipelines, it just seemed odd
>> to me that there is no dedicated interface. I understand more clearly
>> now that it is not viewed as crucial, because we can use existing
>> primitives to create sinks. In a way, that might be elegant but also
>> less explicit.
>>
>> On Fri, Apr 29, 2016 at 11:00 PM, Frances Perry <fj...@google.com.invalid>
>> wrote:
>> >>
>> >> @Frances Sources are not simple DoFns. They add additional
>> >> functionality, e.g. checkpointing, watermark generation, creating
>> >> splits. If we want sinks to be portable, we should think about a
>> >> dedicated interface. At least for the checkpointing.
>> >>
>> >
>> > We might be mixing sources and sinks in this conversation. ;-) Sources
>> > definitely provide additional functionality as you mentioned. But at
>> > least currently, sinks don't provide any new primitive functionality.
>> > Are you suggesting there needs to be a checkpointing interface for sinks
>> > beyond DoFn's bundle finalization? (Note that the existing Write for
>> > batch is just a PTransform based around ParDo.)
>>

Re: [DISCUSS] Beam IO & runners native IO

Posted by Raghu Angadi <ra...@google.com.INVALID>.
What are good examples of streaming sinks that support checkpointing (or
transactions/rollbacks)? I don't think Kafka supports a rollback.

On Mon, May 2, 2016 at 2:54 AM, Maximilian Michels <mx...@apache.org> wrote:

> Yes, I would expect sinks to provide additional interfaces similar to
> sources, e.g. for checkpointing. We could also use the
> startBundle/processElement/finishBundle lifecycle methods to implement
> checkpointing. I just wonder if we want to make it more explicit.
> Also, does it make sense that sinks can return a PCollection? You can
> return PDone, but you don't have to.
>
> Since sinks are fundamental in streaming pipelines, it just seemed odd
> to me that there is no dedicated interface. I understand more clearly
> now that it is not viewed as crucial, because we can use existing
> primitives to create sinks. In a way, that might be elegant but also
> less explicit.
>
> On Fri, Apr 29, 2016 at 11:00 PM, Frances Perry <fj...@google.com.invalid>
> wrote:
> >>
> >> @Frances Sources are not simple DoFns. They add additional
> >> functionality, e.g. checkpointing, watermark generation, creating
> >> splits. If we want sinks to be portable, we should think about a
> >> dedicated interface. At least for the checkpointing.
> >>
> >
> > We might be mixing sources and sinks in this conversation. ;-) Sources
> > definitely provide additional functionality as you mentioned. But at
> > least currently, sinks don't provide any new primitive functionality.
> > Are you suggesting there needs to be a checkpointing interface for sinks
> > beyond DoFn's bundle finalization? (Note that the existing Write for
> > batch is just a PTransform based around ParDo.)
>