Posted to user@flink.apache.org by Josh <jo...@gmail.com> on 2016/03/12 00:54:44 UTC

External DB as sink - with processing guarantees

Hi all,

I want to use an external data store (DynamoDB) as a sink with Flink. It looks like there's no connector for Dynamo at the moment, so I have two questions:

1. Is it easy to write my own sink for Flink, and are there any docs on how to do this?
2. If I do this, will I still get Flink's processing guarantees? That is, can I be sure that every tuple has contributed to the DynamoDB state either at-least-once or exactly-once?

Thanks for any advice,
Josh

Re: External DB as sink - with processing guarantees

Posted by Josh <jo...@gmail.com>.
Hi Fabian,

Thanks, that's very helpful. Actually most of my writes will be idempotent, so I guess that means I'll get the exactly-once guarantee using the Hadoop OutputFormat!
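
To make sure I've understood the distinction, here is a rough sketch (the table and attribute names are made up, and client/item/key would be set up elsewhere): an overwrite-style PutItem is safe to replay, while an ADD-style UpdateItem is not.

import com.amazonaws.services.dynamodbv2.model.AttributeAction;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.AttributeValueUpdate;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

// Idempotent: PutItem replaces the whole item, so a replayed record
// just writes the same state again.
client.putItem(new PutItemRequest("my-table", item));

// Not idempotent: an ADD update increments the stored value, so a
// replayed record would be counted twice.
client.updateItem(new UpdateItemRequest()
        .withTableName("my-table")
        .withKey(key)
        .addAttributeUpdatesEntry("count", new AttributeValueUpdate(
                new AttributeValue().withN("1"), AttributeAction.ADD)));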

Thanks,
Josh

> On 12 Mar 2016, at 09:14, Fabian Hueske <fh...@gmail.com> wrote:
> 
> Hi Josh,
> 
> Flink can guarantee exactly-once processing within its data flow given that the data sources allow replaying data from a specific position in the stream. For example, Flink's Kafka Consumer supports exactly-once.
> 
> Flink achieves exactly-once processing by resetting operator state to a consistent state and replaying data. This means that data might actually be processed more than once, but the operator state will reflect exactly-once semantics because it was reset. Ensuring exactly-once end-to-end is difficult, because Flink does not control (and cannot reset) the state of the sinks. By default, data can be sent more than once to a sink, resulting in at-least-once semantics at the sink.
> 
> This issue can be addressed if the sink provides transactional writes (previous writes can be undone) or if the writes are idempotent (applying them several times does not change the result). Transactional support would need to be integrated with Flink's SinkFunction; Hadoop OutputFormats do not provide this. I am not familiar with the details of DynamoDB, but you would need to implement a SinkFunction with transactional support or use idempotent writes if you want to achieve exactly-once results.
> 
> Best, Fabian

Re: External DB as sink - with processing guarantees

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Josh,

Flink can guarantee exactly-once processing within its data flow given that
the data sources allow replaying data from a specific position in the
stream. For example, Flink's Kafka Consumer supports exactly-once.

Flink achieves exactly-once processing by resetting operator state to a
consistent state and replaying data. This means that data might actually be
processed more than once, but the operator state will reflect exactly-once
semantics because it was reset. Ensuring exactly-once end-to-end is
difficult, because Flink does not control (and cannot reset) the state of
the sinks. By default, data can be sent more than once to a sink, resulting
in at-least-once semantics at the sink.
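
For completeness: this only applies if checkpointing is enabled, which it
is not by default. A minimal sketch (the interval is a made-up example):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Take a checkpoint every 5 seconds. On failure, operator state is
// restored from the latest checkpoint and the sources replay from there.
env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);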

This issue can be addressed if the sink provides transactional writes
(previous writes can be undone) or if the writes are idempotent (applying
them several times does not change the result). Transactional support would
need to be integrated with Flink's SinkFunction; Hadoop OutputFormats do
not provide this. I am not familiar with the details of DynamoDB, but you
would need to implement a SinkFunction with transactional support or use
idempotent writes if you want to achieve exactly-once results.
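
To sketch the idempotent variant (I have not used DynamoDB, so the table
and attribute names below are invented): a sink that derives the item key
deterministically from the record, so that a replayed record simply
overwrites the same item.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.HashMap;
import java.util.Map;

public class IdempotentDynamoDBSink extends RichSinkFunction<Tuple2<String, Long>> {

    private transient AmazonDynamoDB client;

    @Override
    public void open(Configuration parameters) {
        // Credentials and region come from the default provider chain here.
        client = new AmazonDynamoDBClient();
    }

    @Override
    public void invoke(Tuple2<String, Long> record) {
        Map<String, AttributeValue> item = new HashMap<>();
        // The key is derived from the record itself, not generated, so
        // writing the same record twice overwrites the same item.
        item.put("id", new AttributeValue().withS(record.f0));
        item.put("value", new AttributeValue().withN(record.f1.toString()));
        client.putItem(new PutItemRequest("my-table", item));
    }

    @Override
    public void close() {
        if (client != null) {
            client.shutdown();
        }
    }
}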

Best, Fabian

2016-03-12 9:57 GMT+01:00 Josh <jo...@gmail.com>:

> Thanks Nick, that sounds good. I would still like to understand what
> determines the processing guarantee, though. Say I use a DynamoDB Hadoop
> OutputFormat with Flink: how do I know what guarantee I have? And if it's
> at-least-once, is there a way to adapt it to achieve exactly-once?
>
> Thanks,
> Josh

Re: External DB as sink - with processing guarantees

Posted by Josh <jo...@gmail.com>.
Thanks Nick, that sounds good. I would still like to understand what determines the processing guarantee, though. Say I use a DynamoDB Hadoop OutputFormat with Flink: how do I know what guarantee I have? And if it's at-least-once, is there a way to adapt it to achieve exactly-once?

Thanks,
Josh

> On 12 Mar 2016, at 02:46, Nick Dimiduk <nd...@gmail.com> wrote:
> 
> Pretty much anything you can write to from a Hadoop MapReduce program can be a Flink destination. Just plug in the OutputFormat and go.
> 
> Re: output semantics, your mileage may vary. Flink should do you fine for at-least-once.

Re: External DB as sink - with processing guarantees

Posted by Nick Dimiduk <nd...@gmail.com>.
Pretty much anything you can write to from a Hadoop MapReduce program can
be a Flink destination. Just plug in the OutputFormat and go.

Re: output semantics, your mileage may vary. Flink should do you fine for
at-least-once.
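
The wiring looks roughly like this (sketched with a plain TextOutputFormat;
a DynamoDB OutputFormat would slot in the same way):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<Text, Text>> result = env.fromElements(
        new Tuple2<>(new Text("key"), new Text("value")));

// Wrap the Hadoop OutputFormat so Flink can drive it like a native one.
Job job = Job.getInstance();
FileOutputFormat.setOutputPath(job, new Path("/tmp/flink-out"));
result.output(new HadoopOutputFormat<Text, Text>(
        new TextOutputFormat<Text, Text>(), job));

env.execute();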
