
Posted to dev@flink.apache.org by "Matthias J. Sax" <mj...@apache.org> on 2016/02/11 20:28:39 UTC

StateBackend

Hi,

In Flink it is possible to have different backends for operator state. I
am wondering what the best approach for different state backends would be.

Let's assume the backend is a database server. The following questions
arise:
  - Should the database server be started manually by the user, or can
Flink start the server automatically when it is used?
    (this seems to be the approach for RocksDB as an embedded server)
  - Should each job use the same or an individual backup server (or maybe a
mix of both)?

I personally think that Flink should not start up a backup server but
assume that it is available when the job is submitted. This also allows the
user to start up multiple instances of the backup server and
choose which one to use for each job individually.

What do you think about it? I ask because of the current PR for Redis as
StateBackend:
https://github.com/apache/flink/pull/1617

There is no embedded mode for Redis as for RocksDB.

-Matthias

Re: StateBackend

Posted by "Matthias J. Sax" <mj...@apache.org>.

Thanks for the input.

Just to clarify my understanding: by "Flink-embedded [...] scale out as
the Flink job scales out", you mean that each TM hosts an embedded state
backend service, i.e., all those instances form a logically single but
distributed backend service? How is it ensured that the state is stored
reliably (i.e., not on the same machine the state belongs to)? Is this
handled by the service automatically, or is it Flink's responsibility?

What do you mean by "They work nicely with savepoints, because every
Flink job has a copy of the state"?

The classification itself makes sense. I guess we should reflect this
in the documentation. Not sure if the code can/should reflect this -- I
doubt it.

-Matthias


On 02/16/2016 10:32 AM, Stephan Ewen wrote:
> I think this is actually a pretty good question. Right now, there are two
> different types of state backends:
> 
>   (1) Flink-embedded: They are independent of external services, scale out
> as the Flink job scales out, and are really mainly a way of storing and
> backing up key/value state.
>        Examples: MemoryStateBackend, FsStateBackend, RocksDBStateBackend
>        They work nicely with savepoints, because every Flink job has a
> copy of the state.
> 
>   (2) Flink-connected: The state is outside Flink. The systems need to run
> separately and don't scale with Flink.
>        Example: DBStateBackend
>        One advantage they currently have is that the state in Flink is small,
> so checkpoints and restores are very cheap.
> 
> 
> I think we should start classifying the state backends like this.
> 
> 
> Greetings,
> Stephan

Re: StateBackend

Posted by Stephan Ewen <se...@apache.org>.

I think this is actually a pretty good question. Right now, there are two
different types of state backends:

  (1) Flink-embedded: They are independent of external services, scale out
as the Flink job scales out, and are really mainly a way of storing and
backing up key/value state.
       Examples: MemoryStateBackend, FsStateBackend, RocksDBStateBackend
       They work nicely with savepoints, because every Flink job has a
copy of the state.

  (2) Flink-connected: The state is outside Flink. The systems need to run
separately and don't scale with Flink.
       Example: DBStateBackend
       One advantage they currently have is that the state in Flink is small,
so checkpoints and restores are very cheap.


I think we should start classifying the state backends like this.


Greetings,
Stephan


On Mon, Feb 15, 2016 at 3:11 PM, Aljoscha Krettek <al...@apache.org>
wrote:

> Hi,
> sorry about not answering, but I wanted to wait since I had already voiced my
> opinion on the PR.
>
> I think it is better to assume an already running Redis because it is
> easier to work around clashes between running Redis instances (ports, data
> directory, and such). Then, however, care needs to be taken to make sure
> that the state inside the one Redis instance does not clash.
>
> Cheers,
> Aljoscha

Re: StateBackend

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,
sorry about not answering, but I wanted to wait since I had already voiced my opinion on the PR.

I think it is better to assume an already running Redis because it is easier to work around clashes between running Redis instances (ports, data directory, and such). Then, however, care needs to be taken to make sure that the state inside the one Redis instance does not clash.
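
To illustrate the clash concern: a minimal sketch of one possible way to keep the state of several jobs apart inside a single Redis instance, by prefixing every key with a job ID and an operator name. The class NamespacedStateKey, its methods, and the key layout are hypothetical and are not taken from Flink or from the PR; the sketch only builds the key strings and does not talk to Redis.

```java
import java.util.Objects;

// Hypothetical helper, not part of Flink or the Redis PR: builds keys that
// are namespaced per job and operator, so that several jobs can share one
// Redis instance without overwriting each other's state.
final class NamespacedStateKey {
    private static final char SEP = ':';

    private NamespacedStateKey() {}

    // Escape the separator (and the escape character itself) so that
    // ("a:b", "c") and ("a", "b:c") can never produce the same key.
    static String escape(String part) {
        Objects.requireNonNull(part, "part");
        return part.replace("\\", "\\\\").replace(":", "\\:");
    }

    // Key layout (an assumption): <jobId>:<operatorName>:<stateKey>
    static String of(String jobId, String operatorName, String stateKey) {
        return escape(jobId) + SEP + escape(operatorName) + SEP + escape(stateKey);
    }
}
```

With this layout, NamespacedStateKey.of("job-1", "counter", "user42") yields job-1:counter:user42, so two jobs that use the same operator name and state key still write to different Redis keys.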

Cheers,
Aljoscha
> On 15 Feb 2016, at 14:53, Matthias J. Sax <mj...@apache.org> wrote:
> 
> Anyone?
> 
> Otherwise, I suggest moving forward with the PR using the
> assumption that Redis must be started manually.
> 
> -Matthias

Re: StateBackend

Posted by "Matthias J. Sax" <mj...@apache.org>.

Anyone?

Otherwise, I suggest moving forward with the PR using the
assumption that Redis must be started manually.

-Matthias

On 02/11/2016 08:28 PM, Matthias J. Sax wrote:
> Hi,
> 
> In Flink it is possible to have different backends for operator state. I
> am wondering what the best approach for different state backends would be.
> 
> Let's assume the backend is a database server. The following questions
> arise:
>   - Should the database server be started manually by the user, or can
> Flink start the server automatically when it is used?
>     (this seems to be the approach for RocksDB as an embedded server)
>   - Should each job use the same or an individual backup server (or maybe a
> mix of both)?
> 
> I personally think that Flink should not start up a backup server but
> assume that it is available when the job is submitted. This also allows the
> user to start up multiple instances of the backup server and
> choose which one to use for each job individually.
> 
> What do you think about it? I ask because of the current PR for Redis as
> StateBackend:
> https://github.com/apache/flink/pull/1617
> 
> There is no embedded mode for Redis as for RocksDB.
> 
> -Matthias
>