Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2017/10/25 08:31:03 UTC

State snapshotting when source is finite

Hi all,
in my current use case I'd like to improve one step of our batch pipeline.
There's one specific job that ingests a tabular dataset (of Rows) and
explodes it into a set of RDF statements (as Tuples). The objects we output
are containers of those Tuples (grouped by a field).
Flink stateful streaming could be a perfect fit here because we could
incrementally grow the state of those containers without having to
spend a lot of time performing GET operations against an external key-value
store.
The big problem here is that the sources are finite and the state of the
job is lost once the job ends, whereas I expected Flink to
snapshot the state of its operators before exiting.

This idea was inspired by
https://data-artisans.com/blog/queryable-state-use-case-demo#no-external-store,
with the difference that one can resume the state of the stateful
application only when required.
Do you think it would be possible to support such a use case (which we
can summarize as "periodic batch jobs that pick up where they left off")?
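
To make the desired behavior concrete, here is a minimal plain-Python sketch of
"periodic batch jobs that pick up where they left off". It is not Flink code and
uses no Flink API: the `run_batch` function and the JSON snapshot file are
hypothetical stand-ins for a stateful job and its savepoint. Each run restores the
containers from the previous snapshot, grows them from a finite input, and writes
a new snapshot once the input is exhausted.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_batch(rows, snapshot_path):
    """Process a finite batch of (group_key, statement) pairs.

    Hypothetical illustration only: the snapshot file plays the role that a
    savepoint would play for a stateful streaming job.
    """
    # Restore the containers from the previous run, if a snapshot exists.
    containers = defaultdict(list)
    if os.path.exists(snapshot_path):
        with open(snapshot_path) as f:
            containers.update(json.load(f))

    # Incrementally grow each container, keyed by the grouping field.
    for key, statement in rows:
        containers[key].append(statement)

    # The finite source is exhausted: snapshot the state before exiting.
    # Write to a temporary file first, then rename it, so that a crash
    # cannot leave a partially written snapshot behind.
    tmp_path = snapshot_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(containers, f)
    os.replace(tmp_path, snapshot_path)
    return dict(containers)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "state.json")
    run_batch([("s1", ["p1", "o1"]), ("s2", ["p2", "o2"])], path)
    # The second run starts from the state left behind by the first one.
    print(run_batch([("s1", ["p3", "o3"])], path))
```

The point of the sketch is only the lifecycle (restore, process a finite input,
snapshot on exit); in a real job the snapshot would have to be taken by the
framework for all operators consistently.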

Best,
Flavio

Re: State snapshotting when source is finite

Posted by Flavio Pompermaier <po...@okkam.it>.

Done: https://issues.apache.org/jira/browse/FLINK-7930

Best,
Flavio

On Thu, Oct 26, 2017 at 10:52 AM, Till Rohrmann <tr...@apache.org>
wrote:

> Hi Flavio,
>
> this kind of feature is indeed useful and is currently not supported by
> Flink. I think, however, that it is a bit tricky to implement,
> because Tasks cannot currently initiate checkpoints/savepoints on their
> own. Supporting it would entail some changes to the lifecycle of a Task and an
> extra communication step with the JobManager. None of this is impossible, though.
>
> Please open a JIRA issue with the description of the problem where we can
> continue the discussion.
>
> Cheers,
> Till

Re: State snapshotting when source is finite

Posted by Till Rohrmann <tr...@apache.org>.

Hi Flavio,

this kind of feature is indeed useful and is currently not supported by Flink.
I think, however, that it is a bit tricky to implement, because
Tasks cannot currently initiate checkpoints/savepoints on their own. Supporting it
would entail some changes to the lifecycle of a Task and an extra
communication step with the JobManager. None of this is impossible, though.

Please open a JIRA issue with the description of the problem where we can
continue the discussion.

Cheers,
Till

On Thu, Oct 26, 2017 at 9:58 AM, Fabian Hueske <fh...@gmail.com> wrote:

> Hi Flavio,
>
> Thanks for bringing up this topic.
> I think running periodic jobs with state that gets restored from and persisted
> to a savepoint is a very valid use case and would fit the "stream is a
> superset of batch" story quite well.
> I'm not sure whether this behavior is already supported, but I think it would
> be a desirable feature.
>
> I'm looping in Till and Aljoscha who might have some thoughts on this as
> well.
> Depending on the discussion we should open a JIRA for this feature.
>
> Cheers, Fabian

Re: State snapshotting when source is finite

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Flavio,

Thanks for bringing up this topic.
I think running periodic jobs with state that gets restored from and persisted
to a savepoint is a very valid use case and would fit the "stream is a
superset of batch" story quite well.
I'm not sure whether this behavior is already supported, but I think it would be
a desirable feature.

I'm looping in Till and Aljoscha who might have some thoughts on this as
well.
Depending on the discussion we should open a JIRA for this feature.

Cheers, Fabian
