Posted to user@flink.apache.org by Heath Albritton <ha...@harm.org> on 2019/02/08 19:49:55 UTC

Re: long lived standalone job session cluster in kubernetes

Has any progress been made on this?  There are a number of folks in
the community looking to help out.


-H

On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann <tr...@apache.org> wrote:
>
> Hi Derek,
>
> there is this issue [1] which tracks the active Kubernetes integration. Jin Sun already started implementing some parts of it. There should also be some PRs open for it. Please check them out.
>
> [1] https://issues.apache.org/jira/browse/FLINK-9953
>
> Cheers,
> Till
>
> On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee <de...@gmail.com> wrote:
>>
>> Sounds good.
>>
>> Is someone working on this automation today?
>>
>> If not, although my time is tight, I may be able to work on a PR to get us started down the path of a Kubernetes-native cluster mode.
>>
>>
>> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>>
>> Hi Derek,
>>
>> What I would recommend is triggering the cancel-with-savepoint command [1]. This will create a savepoint and terminate the job execution. Next, you simply respawn the job cluster, providing it with the savepoint to resume from.
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
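As a rough sketch, the cancel-with-savepoint step could be scripted around the standard Flink CLI (`flink cancel -s <targetDirectory> <jobID>`); the job ID and savepoint directory below are placeholders, not values from this thread:

```python
# Sketch of the cancel-with-savepoint step, assuming the standard Flink CLI.
# The job ID and savepoint directory are placeholders for your environment.
import subprocess

def cancel_with_savepoint(job_id: str, savepoint_dir: str, dry_run: bool = True):
    """Build (and optionally run) the cancel-with-savepoint command."""
    cmd = ["flink", "cancel", "-s", savepoint_dir, job_id]
    if dry_run:
        return cmd  # only show what would run
    # On a real cluster this triggers a savepoint, cancels the job, and
    # prints the path of the completed savepoint on success.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

print(cancel_with_savepoint("a1b2c3d4", "s3://my-bucket/savepoints"))
```

The savepoint path printed by the real command is what you would then hand to the new job cluster on startup.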
>>
>> Cheers,
>> Till
>>
>> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <an...@data-artisans.com> wrote:
>>>
>>> Hi Derek,
>>>
>>> I think your automation steps look good.
>>> Recreating deployments should not take long
>>> and as you mention, this way you can avoid unpredictable old/new version collisions.
>>>
>>> Best,
>>> Andrey
>>>
>>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dw...@apache.org> wrote:
>>> >
>>> > Hi Derek,
>>> >
>>> > I am not an expert in kubernetes, so I will cc Till, who should be able
>>> > to help you more.
>>> >
>>> > As for automating a similar process, I would recommend having a
>>> > look at the dA platform [1], which is built on top of Kubernetes.
>>> >
>>> > Best,
>>> >
>>> > Dawid
>>> >
>>> > [1] https://data-artisans.com/platform-overview
>>> >
>>> > On 30/11/2018 02:10, Derek VerLee wrote:
>>> >>
>>> >> I'm looking at the job cluster mode; it looks great and I am
>>> >> considering migrating our jobs off our "legacy" session cluster and
>>> >> into Kubernetes.
>>> >>
>>> >> I do need to ask some questions because I haven't found a lot of
>>> >> details in the documentation about how it works yet, and I gave up
>>> >> following the DI around in the code after a while.
>>> >>
>>> >> Let's say I have a deployment for the job "leader" in HA with ZK, and
>>> >> another deployment for the taskmanagers.
>>> >>
>>> >> I want to upgrade the code or configuration and start from a
>>> >> savepoint, in an automated way.
>>> >>
>>> >> Best I can figure, I cannot just update the deployment resources in
>>> >> kubernetes and allow the containers to restart in an arbitrary order.
>>> >>
>>> >> Instead, I expect sequencing is important, something along the lines
>>> >> of this:
>>> >>
>>> >> 1. issue savepoint command on leader
>>> >> 2. wait for savepoint
>>> >> 3. destroy all leader and taskmanager containers
>>> >> 4. deploy new leader, with savepoint url
>>> >> 5. deploy new taskmanagers
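The five-step sequence above could be driven by a small orchestration script. The client below is a hypothetical stub, not a real Flink or Kubernetes API; it only records the order of actions to illustrate the sequencing:

```python
# Hypothetical sketch of the upgrade sequencing described above.
# FakeCluster stands in for real calls (e.g. Flink's REST API and the
# Kubernetes API); it only records which action ran in which order.

class FakeCluster:
    def __init__(self):
        self.actions = []

    # Steps 1-2: trigger a savepoint on the leader and wait for completion.
    def cancel_with_savepoint(self, job_id, savepoint_dir):
        self.actions.append(f"savepoint:{job_id}")
        return f"{savepoint_dir}/savepoint-0001"

    # Step 3: tear down all leader and taskmanager containers.
    def destroy_deployments(self):
        self.actions.append("destroy:leader")
        self.actions.append("destroy:taskmanagers")

    # Steps 4-5: redeploy the leader with the savepoint URL, then taskmanagers.
    def deploy(self, component, savepoint=None):
        suffix = f":{savepoint}" if savepoint else ""
        self.actions.append(f"deploy:{component}{suffix}")


def upgrade(cluster, job_id, savepoint_dir):
    savepoint = cluster.cancel_with_savepoint(job_id, savepoint_dir)
    cluster.destroy_deployments()
    cluster.deploy("leader", savepoint=savepoint)
    cluster.deploy("taskmanagers")
    return savepoint


cluster = FakeCluster()
upgrade(cluster, "job-1", "s3://savepoints")
print(cluster.actions)
```

Destroying everything before redeploying (rather than rolling pods) is what avoids the old/new version collision mentioned below.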
>>> >>
>>> >>
>>> >> For example, I imagine old taskmanagers (with an old version of my
>>> >> job) attaching to the new leader and causing a problem.
>>> >>
>>> >> Does that sound right, or am I overthinking it?
>>> >>
>>> >> If not, has anyone tried implementing any automation for this yet?
>>> >>
>>> >
>>>

Re: long lived standalone job session cluster in kubernetes

Posted by Till Rohrmann <tr...@apache.org>.
Hi Heath,

I think some of the PRs are already open and ready for review [1, 2].

[1] https://issues.apache.org/jira/browse/FLINK-10932
[2] https://issues.apache.org/jira/browse/FLINK-10935

Cheers,
Till


Re: long lived standalone job session cluster in kubernetes

Posted by Heath Albritton <ha...@harm.org>.
Great, my team is eager to get started.  I’m curious what progress has been made so far?

-H


Re: long lived standalone job session cluster in kubernetes

Posted by Chunhui Shi <cs...@apache.org>.
Hi Heath and Till, thanks for offering to help review this feature. I
just reassigned the JIRAs to myself after an offline discussion with Jin. Let
us work together to get Kubernetes integrated natively with Flink. Thanks.


Re: long lived standalone job session cluster in kubernetes

Posted by Till Rohrmann <tr...@apache.org>.
Alright, I'll get back to you once the PRs are open. Thanks a lot for your
help :-)

Cheers,
Till


Re: long lived standalone job session cluster in kubernetes

Posted by Heath Albritton <ha...@harm.org>.
My team and I are keen to help out with testing and review as soon as there is a pull request.

-H


Re: long lived standalone job session cluster in kubernetes

Posted by Till Rohrmann <tr...@apache.org>.
Hi Heath,

I just learned that people from Alibaba already made some good progress
with FLINK-9953. I'm currently talking to them in order to see how we can
merge this contribution into Flink as fast as possible. Since I'm quite
busy due to the upcoming release I hope that other community members will
help out with the reviewing once the PRs are opened.

Cheers,
Till
