Posted to dev@flink.apache.org by Márton Balassi <ba...@gmail.com> on 2022/02/01 14:26:18 UTC

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Hi team,

Thank you for the great feedback, Thomas has updated the FLIP page
accordingly. If you are comfortable with the currently existing design and
depth in the FLIP [1] I suggest moving forward to the voting stage - once
that reaches a positive conclusion it lets us create the separate code
repository under the flink project for the operator.

I encourage everyone to keep improving the details in the meantime, however
I believe given the existing design and the general sentiment on this
thread that the most efficient path from here is starting the
implementation so that we can collectively iterate over it.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator

On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <th...@apache.org> wrote:

> Hi Xintong,
>
> Thanks for the feedback and please see responses below -->
>
> On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <to...@gmail.com>
> wrote:
>
> > Thanks Thomas for drafting this FLIP, and everyone for the discussion.
> >
> > I also have a few questions and comments.
> >
> > ## Job Submission
> > Deploying a Flink session cluster via kubectl & CR and then submitting jobs
> > to the cluster via Flink CLI / REST is probably the approach that requires
> > the least effort. However, I'd like to point out 2 weaknesses.
> > 1. A lot of users use Flink in perjob/application modes. For these users,
> > having to run the job in two steps (deploy the cluster, and submit the job)
> > is not that convenient.
> > 2. One of our motivations is being able to manage Flink applications'
> > lifecycles with kubectl. Submitting jobs from the CLI does not seem aligned
> > with this motivation.
> > I think it's probably worth supporting job submission via kubectl & CR
> > in the first version, both together with deploying the cluster, as in
> > perjob/application mode, and after deploying the cluster, as in session
> > mode.
> >
>
> The intention is to support application management through operator and CR,
> which means there won't be any 2 step submission process, which as you
> allude to would defeat the purpose of this project. The CR example shows
> the application part. Please note that the bare cluster support is an
> *additional* feature for scenarios that require external job management. Is
> there anything on the FLIP page that creates a different impression?
>
>
> >
> > ## Versioning
> > Which Flink versions does the operator plan to support?
> > 1. Native K8s deployment was first introduced in Flink 1.10
> > 2. Native K8s HA was introduced in Flink 1.12
> > 3. Pod template support was introduced in Flink 1.13
> > 4. There were some changes to the Flink docker image entrypoint script in,
> > IIRC, Flink 1.13
> >
>
> Great, thanks for providing this. It is important for the compatibility
> going forward also. We are targeting Flink 1.14.x upwards. Before the
> operator is ready there will be another Flink release. Let's see if anyone
> is interested in earlier versions?
>
>
> >
> > ## Compatibility
> > What kind of API compatibility can we commit to? It's probably fine to have
> > alpha / beta version APIs that allow incompatible future changes for the
> > first version. But eventually we would need to guarantee backwards
> > compatibility, so that an early version CR can work with a new version
> > operator.
> >
>
> Another great point and please let me include that on the FLIP page. ;-)
>
> I think we should allow incompatible changes for the first one or two
> versions, similar to how other major features have evolved recently, such
> as FLIP-27.
>
> Would be great to get broader feedback on this one.
>
> Cheers,
> Thomas
>
>
>
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <th...@apache.org> wrote:
> >
> > > Thanks for the feedback!
> > >
> > > >
> > > > # 1 Flink Native vs Standalone integration
> > > > Maybe we should make this more clear in the FLIP but we agreed to do the
> > > > first version of the operator based on the native integration.
> > > > While this clearly does not cover all use-cases and requirements, it seems
> > > > this would lead to a much smaller initial effort and a nicer first version.
> > > >
> > >
> > > I'm also leaning towards the native integration, as long as it reduces the
> > > MVP effort. Ultimately the operator will need to also support the
> > > standalone mode. I would like to gain more confidence that native
> > > integration reduces the effort. While it cuts the effort to handle the TM
> > > pod creation, some mapping code from the CR to the native integration
> > > client and config needs to be created. As mentioned in the FLIP, native
> > > integration requires the Flink job manager to have access to the k8s API to
> > > create pods, which in some scenarios may be seen as unfavorable.
> > >
> > > > > > > # Pod Template
> > > > > > > Is the pod template in the CR the same as what Flink already
> > > > > > > supports [4]? If so, I am afraid that not every arbitrary field
> > > > > > > (e.g. cpu/memory resources) will take effect.
> > >
> > > Yes, pod template would look almost identical. There are a few settings
> > > that the operator will control (and that may need to be blacklisted), but
> > > in general we would not want to place restrictions. I think a mechanism
> > > where a pod template is merged from multiple layers would also be
> > > interesting to make this more flexible.
> > >
> > > Cheers,
> > > Thomas
> > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Yangze Guo <ka...@gmail.com>.
Thanks @Thomas and @Gyula.
+1 to only introduce necessary and reasonable shorthand proxy parameters.

Best,
Yangze Guo

On Tue, Feb 8, 2022 at 12:47 PM Thomas Weise <th...@apache.org> wrote:
>
> @Yangze thanks for bringing up the configuration priority. This is
> quite important indeed and should be mentioned in the FLIP.
>
> I agree with the sentiment that whenever possible we should use the
> native configuration directly (either Flink native settings or k8s pod
> template), rather than introducing proxy parameters in the CRD. That
> certainly applies to taskManager.taskSlots which can be specified
> directly under flinkConfiguration.
>
> Thanks @Alexis for the pointers!
>
> Regarding memory: I'm leaning towards starting from total memory at
> the k8s resource level and letting Flink derive components by default. For
> many users that would be a more intuitive approach than specifying the
> components. So container memory -> taskmanager.memory.process.size ->
> <Flink calculates components> [1]
>
> With that approach we could also extract the resource spec from the
> pod template. However, setting memory is almost always necessary,
> whereas defining a pod template is not. Having the
> shorthand proxy parameter may be a good compromise.
>
> Cheers,
> Thomas
>
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/
>
> On Mon, Feb 7, 2022 at 4:33 AM Alexis Sarda-Espinosa
> <al...@microfocus.com> wrote:
> >
> > Danny Cranmer mentioned they are interested in standalone mode, and I am too, so I just wanted to say that if that development starts in parallel, I might be able to contribute a little.
> >
> > Regarding the CRD, I agree it would be nice to avoid as many "duplications" as possible if pod templates are to be used. In my PoC I even tried to make use of existing configuration options like kubernetes.container.image & pipeline.jars [1]. For CPU/Memory resources, the discussion in [2] might be relevant.
> >
> > [1] https://github.com/MicroFocus/opsb-flink-k8s-operator/blob/main/kubernetes/sample_batch_job.yaml
> > [2] https://issues.apache.org/jira/browse/FLINK-24150
> >
> > Regards,
> > Alexis.
> >
> > -----Original Message-----
> > From: K Fred <yu...@gmail.com>
> > Sent: Montag, 7. Februar 2022 09:36
> > To: dev@flink.apache.org
> > Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator
> >
> > Hi Gyula!
> >
> > You are right. I think some common flink config options can be put in the CR, other expert settings continue to be overwritten by flink, and then the user can choose to customize the configuration.
> >
> > Best Wishes,
> > Peng Yuan
> >
> > On Mon, Feb 7, 2022 at 4:16 PM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > Hi Yangze!
> > >
> > > This is not set in stone at the moment but the way I think it should
> > > work is that first class config options in the CR should always take
> > > precedence over the Flink config.
> > >
> > > In general we should not introduce too many arbitrary config options
> > > that duplicate the flink configs without good reasons but the ones we
> > > introduce should overwrite flink configs.
> > >
> > > We should discuss and decide together what config options to keep in
> > > the flink conf and what to bring on the CR level. Resource related
> > > ones are difficult because on one hand they are integral to every
> > > application, on the other hand there are many expert settings that we
> > > should probably leave in the conf.
> > >
> > > Cheers,
> > > Gyula
> > >
> > >
> > >
> > > On Mon, Feb 7, 2022 at 8:28 AM Yangze Guo <ka...@gmail.com> wrote:
> > >
> > > > Thanks everyone for the great effort. The FLIP looks really good.
> > > >
> > > > > I just want to confirm the configuration priority in the CR example.
> > > > > It seems the resource requests or "taskManager.taskSlots" will be
> > > > > translated to Flink internal config, e.g.
> > > > > "taskmanager.memory.process.size" and
> > > > > "taskmanager.numberOfTaskSlots", and will override the values in
> > > > > "flinkConfiguration". Am I understanding this correctly?
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Mon, Feb 7, 2022 at 10:22 AM Xintong Song <to...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Sorry for the late reply. We were out due to the public holidays
> > > > > > in China.
> > > > >
> > > > > @Thomas,
> > > > >
> > > > > > > The intention is to support application management through operator
> > > > > > > and CR, which means there won't be any 2 step submission process,
> > > > > > > which as you allude to would defeat the purpose of this project. The
> > > > > > > CR example shows the application part. Please note that the bare
> > > > > > > cluster support is an *additional* feature for scenarios that require
> > > > > > > external job management. Is there anything on the FLIP page that
> > > > > > > creates a different impression?
> > > > > > >
> > > > >
> > > > > > Sounds good to me. I don't remember what created the impression of
> > > > > > 2 step submission back then. I revisited the latest version of this
> > > > > > FLIP and it looks good to me.
> > > > >
> > > > > @Gyula,
> > > > >
> > > > > > > Versioning:
> > > > > > > Versioning will be independent from Flink and the operator will
> > > > > > > depend on a fixed flink version (in every given operator version).
> > > > > > > This should be the exact same setup as with Stateful Functions
> > > > > > > (https://github.com/apache/flink-statefun). So independent release
> > > > > > > cycle but still within the Flink umbrella.
> > > > > >
> > > > >
> > > > > > Does this mean that if someone wants to upgrade Flink to a version
> > > > > > that is released after the operator version that is being used, they
> > > > > > would need to upgrade the operator version first?
> > > > > > I'm not questioning this, just trying to make sure I'm understanding
> > > > > > this correctly.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Feb 7, 2022 at 3:14 AM Gyula Fóra <gy...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Thank you Alexis,
> > > > > >
> > > > > > > Will definitely check this out. You are right, Kotlin makes it
> > > > > > > difficult to adopt pieces of this code directly but I think it will
> > > > > > > be good to get inspiration for the architecture and look at how
> > > > > > > particular problems have been solved. It will be a great help for
> > > > > > > us I am sure.
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > > > On Sat, Feb 5, 2022 at 12:28 PM Alexis Sarda-Espinosa <
> > > > > > alexis.sarda-espinosa@microfocus.com> wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > > just wanted to mention that my employer agreed to open source
> > > > > > > > the PoC I developed:
> > > > > > > > https://github.com/MicroFocus/opsb-flink-k8s-operator
> > > > > > > >
> > > > > > > > I understand the concern for maintainability, so Gradle &
> > > > > > > > Kotlin might not be appealing to you, but at least it gives you
> > > > > > > > another reference. The Helm resources in particular might be
> > > > > > > > useful.
> > > > > > > >
> > > > > > > > There are bits and pieces there referring to Flink sessions,
> > > > > > > > but those are just placeholders, the functioning parts use
> > > > > > > > application mode with native integration.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Alexis.
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Thomas Weise <th...@apache.org>
> > > > > > > Sent: Saturday, February 5, 2022 2:41 AM
> > > > > > > To: dev <de...@flink.apache.org>
> > > > > > > Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes
> > > Operator
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Thanks for the continued feedback and discussion. Looks like
> > > > > > > we are ready to start a VOTE, I will initiate it shortly.
> > > > > > >
> > > > > > > In parallel it would be good to find the repository name.
> > > > > > >
> > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > >
> > > > > > > I thought "flink-operator" could be a bit misleading since the
> > > > > > > term operator already has a meaning in Flink.
> > > > > > >
> > > > > > > I also considered "flink-k8s-operator" but that would be
> > > > > > > almost identical to existing operator implementations and
> > > > > > > could lead to confusion in the future.
> > > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Thomas
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra
> > > > > > > <gy...@gmail.com>
> > > > wrote:
> > > > > > > >
> > > > > > > > Hi Danny,
> > > > > > > >
> > > > > > > > > So far we have been focusing our dev efforts on the initial
> > > > > > > > > native implementation with the team.
> > > > > > > > > If the discussion and vote goes well for this FLIP we are
> > > > > > > > > looking forward to contributing the initial version sometime
> > > > > > > > > next week (fingers crossed).
> > > > > > > > >
> > > > > > > > > At that point I think we can already start the dev work to
> > > > > > > > > support the standalone mode as well, especially if you can
> > > > > > > > > dedicate some effort to pushing that side.
> > > > > > > > > Working together on this sounds like a great idea and we
> > > > > > > > > should start as soon as possible! :)
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > > dannycranmer@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > > I have been discussing this one with my team. We are
> > > > > > > > > > interested in the Standalone mode, and are willing to
> > > > > > > > > > contribute towards the implementation.
> > > > > > > > > > Potentially we can work together to support both modes in
> > > > > > > > > > parallel?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > > gyula.fora@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Danny!
> > > > > > > > > >
> > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > >
> > > > > > > > > > Versioning:
> > > > > > > > > > > Versioning will be independent from Flink and the
> > > > > > > > > > > operator will depend on a fixed flink version (in every
> > > > > > > > > > > given operator version).
> > > > > > > > > > > This should be the exact same setup as with Stateful
> > > > > > > > > > > Functions (https://github.com/apache/flink-statefun). So
> > > > > > > > > > > independent release cycle but still within the Flink
> > > > > > > > > > > umbrella.
> > > > > > > > > >
> > > > > > > > > > Deployment error handling:
> > > > > > > > > > > I think that's a very good point, as general exception
> > > > > > > > > > > handling for the different failure scenarios is a tricky
> > > > > > > > > > > problem. I think the exception classifiers and retry
> > > > > > > > > > > strategies could avoid a lot of manual intervention from
> > > > > > > > > > > the user. We will definitely need to add something like
> > > > > > > > > > > this. Once we have the repo created with the initial
> > > > > > > > > > > operator code we should open some tickets for this and
> > > > > > > > > > > put it on the short term roadmap!
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Gyula
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > > > dannycranmer@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey team,
> > > > > > > > > > >
> > > > > > > > > > > > Great work on the FLIP, I am looking forward to this
> > > > > > > > > > > > one. I agree that we can move forward to the voting
> > > > > > > > > > > > stage.
> > > > > > > > > > >
> > > > > > > > > > > > I have general feedback around how we will handle job
> > > > > > > > > > > > submission failure and retry. As discussed in the
> > > > > > > > > > > > Rejected Alternatives section, we can use Java to
> > > > > > > > > > > > handle job submission failures from the Flink client.
> > > > > > > > > > > > It would be useful to have the ability to configure
> > > > > > > > > > > > exception classifiers and retry strategy as part of
> > > > > > > > > > > > operator configuration.
> > > > > > > > > > >
> > > > > > > > > > > > Given this will be in a separate GitHub repository, I
> > > > > > > > > > > > am curious how the versioning strategy will work in
> > > > > > > > > > > > relation to the Flink version. Do we have any other
> > > > > > > > > > > > components with a similar setup I can look at? Will
> > > > > > > > > > > > the operator version track Flink, or will it use its
> > > > > > > > > > > > own versioning strategy with a Flink version support
> > > > > > > > > > > > matrix, or similar?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Thomas Weise <th...@apache.org>.
@Yangze thanks for bringing up the configuration priority. This is
quite important indeed and should be mentioned in the FLIP.

I agree with the sentiment that whenever possible we should use the
native configuration directly (either Flink native settings or k8s pod
template), rather than introducing proxy parameters in the CRD. That
certainly applies to taskManager.taskSlots which can be specified
directly under flinkConfiguration.
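
To make the priority question concrete, here is a minimal sketch. It assumes the rule Gyula proposed further down the thread (first-class CR fields win over flinkConfiguration), which is not settled. The mapping table and the function name effective_flink_config are hypothetical and only for illustration; the CR field names are the ones used in this discussion.

```python
# Hypothetical mapping from first-class CR fields to native Flink keys.
# Only taskManager.taskSlots is discussed in this thread; the table
# itself is an assumption for illustration.
FIRST_CLASS_TO_FLINK = {
    ("taskManager", "taskSlots"): "taskmanager.numberOfTaskSlots",
}


def effective_flink_config(cr_spec):
    """Return the Flink config that would be handed to Flink.

    Starts from spec.flinkConfiguration and then applies first-class
    CR fields on top, so the first-class value wins on conflict.
    """
    config = dict(cr_spec.get("flinkConfiguration", {}))
    for (section, field), flink_key in FIRST_CLASS_TO_FLINK.items():
        value = cr_spec.get(section, {}).get(field)
        if value is not None:
            config[flink_key] = str(value)
    return config


spec = {
    "flinkConfiguration": {"taskmanager.numberOfTaskSlots": "2"},
    "taskManager": {"taskSlots": 4},
}
print(effective_flink_config(spec))
```

If the proxy parameter is dropped from the CRD, the loop has nothing to apply and flinkConfiguration passes through unchanged.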

Thanks @Alexis for the pointers!

Regarding memory: I'm leaning towards starting from total memory at
the k8s resource level and letting Flink derive components by default. For
many users that would be a more intuitive approach than specifying the
components. So container memory -> taskmanager.memory.process.size ->
<Flink calculates components> [1]
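
The first arrow in that chain could look roughly like the sketch below. It is an assumption about how an operator might do the conversion, not a description of any existing implementation; the function names are hypothetical. It handles only plain integer Kubernetes quantities and rounds down to whole mebibytes, relying on Flink reading the "m" suffix as mebibytes.

```python
import re

# Decimal and binary suffixes used by Kubernetes resource quantities.
_K8S_UNITS = {
    "": 1,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
}


def k8s_memory_to_bytes(quantity):
    """Parse a plain integer k8s memory quantity such as '2Gi' or '512Mi'."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _K8S_UNITS:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(match.group(1)) * _K8S_UNITS[match.group(2)]


def process_size_from_container_memory(quantity):
    """Derive a value for taskmanager.memory.process.size.

    Rounds down to whole mebibytes so the Flink process never claims
    more than the container memory.
    """
    mebibytes = k8s_memory_to_bytes(quantity) // (1024**2)
    return f"{mebibytes}m"


print(process_size_from_container_memory("2Gi"))
```

Everything after that (heap, managed memory, network buffers and so on) is left to Flink's own defaults, as described in [1].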

With that approach we could also extract the resource spec from the
pod template. However, memory needs to be set in nearly every
deployment, whereas defining a pod template is not always necessary,
so having the shorthand proxy parameter may be a good compromise.

Cheers,
Thomas

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/

On Mon, Feb 7, 2022 at 4:33 AM Alexis Sarda-Espinosa
<al...@microfocus.com> wrote:
>
> Danny Cranmer mentioned they are interested in standalone mode, and I am too, so I just wanted to say that if that development starts in parallel, I might be able to contribute a little.
>
> Regarding the CRD, I agree it would be nice to avoid as many "duplications" as possible if pod templates are to be used. In my PoC I even tried to make use of existing configuration options like kubernetes.container.image & pipeline.jars [1]. For CPU/Memory resources, the discussion in [2] might be relevant.
>
> [1] https://github.com/MicroFocus/opsb-flink-k8s-operator/blob/main/kubernetes/sample_batch_job.yaml
> [2] https://issues.apache.org/jira/browse/FLINK-24150
>
> Regards,
> Alexis.
>
> -----Original Message-----
> From: K Fred <yu...@gmail.com>
> Sent: Monday, 7 February 2022 09:36
> To: dev@flink.apache.org
> Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator
>
> Hi Gyula!
>
> You are right. I think some common flink config options can be put in the CR, other expert settings continue to be overwritten by flink, and then the user can choose to customize the configuration.
>
> Best Wishes,
> Peng Yuan
>
> On Mon, Feb 7, 2022 at 4:16 PM Gyula Fóra <gy...@gmail.com> wrote:
>
> > Hi Yangze!
> >
> > This is not set in stone at the moment but the way I think it should
> > work is that first class config options in the CR should always take
> > precedence over the Flink config.
> >
> > In general we should not introduce too many arbitrary config options
> > that duplicate the flink configs without good reasons but the ones we
> > introduce should overwrite flink configs.
> >
> > We should discuss and decide together what config options to keep in
> > the flink conf and what to bring on the CR level. Resource related
> > ones are difficult because on one hand they are integral to every
> > application, on the other hand there are many expert settings that we
> > should probably leave in the conf.
> >
> > Cheers,
> > Gyula
> >
> >
> >
> > On Mon, Feb 7, 2022 at 8:28 AM Yangze Guo <ka...@gmail.com> wrote:
> >
> > > Thanks everyone for the great effort. The FLIP looks really good.
> > >
> > > I just want to make sure the configuration priority in the CR example.
> > > > It seems the requests resources or "taskManager.taskSlots" will be
> > > transferred to Flink internal config, e.g.
> > > "taskmanager.memory.process.size" and
> > > "taskmanager.numberOfTaskSlots", and override the one in
> > > "flinkConfiguration". Am I understanding this correctly?
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Mon, Feb 7, 2022 at 10:22 AM Xintong Song <to...@gmail.com>
> > > wrote:
> > > >
> > > > Sorry for the late reply. We were out due to the public holidays
> > > > in
> > > China.
> > > >
> > > > @Thomas,
> > > >
> > > > The intention is to support application management through
> > > > operator and
> > > CR,
> > > > > which means there won't be any 2 step submission process, which
> > > > > as
> > you
> > > > > allude to would defeat the purpose of this project. The CR
> > > > > example
> > > shows
> > > > > the application part. Please note that the bare cluster support
> > > > > is an
> > > > > *additional* feature for scenarios that require external job
> > > management. Is
> > > > > there anything on the FLIP page that creates a different impression?
> > > > >
> > > >
> > > > Sounds good to me. I don't remember what created the impression of
> > > > 2
> > step
> > > > submission back then. I revisited the latest version of this FLIP
> > > > and
> > it
> > > > looks good to me.
> > > >
> > > > @Gyula,
> > > >
> > > > Versioning:
> > > > > Versioning will be independent from Flink and the operator will
> > depend
> > > on a
> > > > > fixed flink version (in every given operator version).
> > > > > This should be the exact same setup as with Stateful Functions (
> > > > > https://github.com/apache/flink-statefun). So independent
> > > > > release
> > > cycle
> > > > > but
> > > > > still within the Flink umbrella.
> > > > >
> > > >
> > > > Does this mean if someone wants to upgrade Flink to a version that
> > > > is released after the operator version that is being used, he/she
> > > > would
> > need
> > > > to upgrade the operator version first?
> > > > I'm not questioning this, just trying to make sure I'm
> > > > understanding
> > this
> > > > correctly.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Mon, Feb 7, 2022 at 3:14 AM Gyula Fóra <gy...@gmail.com>
> > wrote:
> > > >
> > > > > Thank you Alexis,
> > > > >
> > > > > Will definitely check this out. You are right, Kotlin makes it
> > > difficult to
> > > > > adopt pieces of this code directly but I think it will be good
> > > > > to get inspiration for the architecture and look at how
> > > > > particular problems
> > > have
> > > > > been solved. It will be a great help for us I am sure.
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > On Sat, Feb 5, 2022 at 12:28 PM Alexis Sarda-Espinosa <
> > > > > alexis.sarda-espinosa@microfocus.com> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > just wanted to mention that my employer agreed to open source
> > > > > > the
> > > PoC I
> > > > > > developed:
> > > > > > https://github.com/MicroFocus/opsb-flink-k8s-operator
> > > > > >
> > > > > > I understand the concern for maintainability, so Gradle &
> > > > > > Kotlin
> > > might
> > > > > not
> > > > > > be appealing to you, but at least it gives you another reference.
> > The
> > > > > Helm
> > > > > > resources in particular might be useful.
> > > > > >
> > > > > > There are bits and pieces there referring to Flink sessions,
> > > > > > but
> > > those
> > > > > are
> > > > > > just placeholders, the functioning parts use application mode
> > > > > > with
> > > native
> > > > > > integration.
> > > > > >
> > > > > > Regards,
> > > > > > Alexis.
> > > > > >
> > > > > > ________________________________
> > > > > > From: Thomas Weise <th...@apache.org>
> > > > > > Sent: Saturday, February 5, 2022 2:41 AM
> > > > > > To: dev <de...@flink.apache.org>
> > > > > > Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes
> > Operator
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Thanks for the continued feedback and discussion. Looks like
> > > > > > we are ready to start a VOTE, I will initiate it shortly.
> > > > > >
> > > > > > In parallel it would be good to find the repository name.
> > > > > >
> > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > >
> > > > > > I thought "flink-operator" could be a bit misleading since the
> > > > > > term operator already has a meaning in Flink.
> > > > > >
> > > > > > I also considered "flink-k8s-operator" but that would be
> > > > > > almost identical to existing operator implementations and
> > > > > > could lead to confusion in the future.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > Thanks,
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra
> > > > > > <gy...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > Hi Danny,
> > > > > > >
> > > > > > > So far we have been focusing our dev efforts on the initial
> > native
> > > > > > > implementation with the team.
> > > > > > > If the discussion and vote goes well for this FLIP we are
> > > > > > > looking
> > > > > forward
> > > > > > > to contributing the initial version sometime next week
> > > > > > > (fingers
> > > > > crossed).
> > > > > > >
> > > > > > > At that point I think we can already start the dev work to
> > support
> > > the
> > > > > > > standalone mode as well, especially if you can dedicate some
> > > effort to
> > > > > > > pushing that side.
> > > > > > > Working together on this sounds like a great idea and we
> > > > > > > should
> > > start
> > > > > as
> > > > > > > soon as possible! :)
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > dannycranmer@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I have been discussing this one with my team. We are
> > > > > > > > interested
> > > in
> > > > > the
> > > > > > > > Standalone mode, and are willing to contribute towards the
> > > > > > implementation.
> > > > > > > > Potentially we can work together to support both modes in
> > > parallel?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > gyula.fora@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Danny!
> > > > > > > > >
> > > > > > > > > Thanks for the feedback :)
> > > > > > > > >
> > > > > > > > > Versioning:
> > > > > > > > > Versioning will be independent from Flink and the
> > > > > > > > > operator
> > will
> > > > > > depend
> > > > > > > > on a
> > > > > > > > > fixed flink version (in every given operator version).
> > > > > > > > > This should be the exact same setup as with Stateful
> > Functions
> > > (
> > > > > > > > > https://github.com/apache/flink-statefun). So
> > > > > > > > > independent
> > > release
> > > > > > cycle
> > > > > > > > > but
> > > > > > > > > still within the Flink umbrella.
> > > > > > > > >
> > > > > > > > > Deployment error handling:
> > > > > > > > > I think that's a very good point, as general exception
> > > handling for
> > > > > > the
> > > > > > > > > different failure scenarios is a tricky problem. I think
> > > > > > > > > the
> > > > > > exception
> > > > > > > > > classifiers and retry strategies could avoid a lot of
> > > > > > > > > manual
> > > > > > intervention
> > > > > > > > > from the user. We will definitely need to add something
> > > > > > > > > like
> > > this.
> > > > > > Once
> > > > > > > > we
> > > > > > > > > have the repo created with the initial operator code we
> > should
> > > open
> > > > > > some
> > > > > > > > > tickets for this and put it on the short term roadmap!
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > > dannycranmer@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey team,
> > > > > > > > > >
> > > > > > > > > > Great work on the FLIP, I am looking forward to this
> > > > > > > > > > one. I
> > > agree
> > > > > > that
> > > > > > > > we
> > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > >
> > > > > > > > > > I have general feedback around how we will handle job
> > > submission
> > > > > > > > failure
> > > > > > > > > > and retry. As discussed in the Rejected Alternatives
> > > section, we
> > > > > > can
> > > > > > > > use
> > > > > > > > > > Java to handle job submission failures from the Flink
> > > client. It
> > > > > > would
> > > > > > > > be
> > > > > > > > > > useful to have the ability to configure exception
> > > classifiers and
> > > > > > retry
> > > > > > > > > > strategy as part of operator configuration.
> > > > > > > > > >
> > > > > > > > > > Given this will be in a separate Github repository I
> > > > > > > > > > am
> > > curious
> > > > > how
> > > > > > > > > the
> > > > > > > > > > versioning strategy will work in relation to the Flink
> > > version?
> > > > > Do
> > > > > > we
> > > > > > > > > have
> > > > > > > > > > any other components with a similar setup I can look at?
> > > Will the
> > > > > > > > > operator
> > > > > > > > > > version track Flink or will it use its own versioning
> > > strategy
> > > > > > with a
> > > > > > > > > Flink
> > > > > > > > > > version support matrix, or similar?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi team,
> > > > > > > > > > >
> > > > > > > > > > > Thank you for the great feedback, Thomas has updated
> > > > > > > > > > > the
> > > FLIP
> > > > > > page
> > > > > > > > > > > accordingly. If you are comfortable with the
> > > > > > > > > > > currently
> > > existing
> > > > > > > > design
> > > > > > > > > > and
> > > > > > > > > > > depth in the FLIP [1] I suggest moving forward to
> > > > > > > > > > > the
> > > voting
> > > > > > stage -
> > > > > > > > > once
> > > > > > > > > > > that reaches a positive conclusion it lets us create
> > > > > > > > > > > the
> > > > > separate
> > > > > > > > code
> > > > > > > > > > > repository under the flink project for the operator.
> > > > > > > > > > >
> > > > > > > > > > > I encourage everyone to keep improving the details
> > > > > > > > > > > in the
> > > > > > meantime,
> > > > > > > > > > however
> > > > > > > > > > > I believe given the existing design and the general
> > > sentiment
> > > > > on
> > > > > > this
> > > > > > > > > > > thread that the most efficient path from here is
> > > > > > > > > > > starting
> > > the
> > > > > > > > > > > implementation so that we can collectively iterate
> > > > > > > > > > > over
> > it.
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduc
> > e+Flink+Kubernetes+Operator
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > > thw@apache.org>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback and please see responses
> > > > > > > > > > > > below
> > > -->
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > > > > tonysong820@gmail.com
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and
> > > > > > > > > > > > > everyone
> > for
> > > the
> > > > > > > > > > discussion.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > Deploying a Flink session cluster via kubectl & CR
> > and
> > > then
> > > > > > > > > > submitting
> > > > > > > > > > > > jobs
> > > > > > > > > > > > > to the cluster via Flink cli / REST is probably
> > > > > > > > > > > > > the
> > > > > approach
> > > > > > that
> > > > > > > > > > > > requires
> > > > > > > > > > > > > the least effort. However, I'd like to point out
> > > > > > > > > > > > > 2
> > > > > > weaknesses.
> > > > > > > > > > > > > 1. A lot of users use Flink in
> > > > > > > > > > > > > perjob/application
> > > modes.
> > > > > For
> > > > > > > > these
> > > > > > > > > > > users,
> > > > > > > > > > > > > having to run the job in two steps (deploy the
> > > cluster, and
> > > > > > > > submit
> > > > > > > > > > the
> > > > > > > > > > > > job)
> > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > 2. One of our motivations is being able to
> > > > > > > > > > > > > manage
> > Flink
> > > > > > > > > applications'
> > > > > > > > > > > > > lifecycles with kubectl. Submitting jobs from
> > > > > > > > > > > > > cli
> > > sounds
> > > > > not
> > > > > > > > > aligned
> > > > > > > > > > > with
> > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > I think it's probably worth it to support
> > > > > > > > > > > > > submitting
> > > jobs
> > > > > via
> > > > > > > > > > kubectl &
> > > > > > > > > > > > CR
> > > > > > > > > > > > > in the first version, both together with
> > > > > > > > > > > > > deploying
> > the
> > > > > > cluster
> > > > > > > > like
> > > > > > > > > > in
> > > > > > > > > > > > > perjob/application mode and after deploying the
> > cluster
> > > > > like
> > > > > > in
> > > > > > > > > > session
> > > > > > > > > > > > > mode.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > The intention is to support application management
> > > through
> > > > > > operator
> > > > > > > > > and
> > > > > > > > > > > CR,
> > > > > > > > > > > > which means there won't be any 2 step submission
> > process,
> > > > > > which as
> > > > > > > > > you
> > > > > > > > > > > > allude to would defeat the purpose of this
> > > > > > > > > > > > project. The
> > > CR
> > > > > > example
> > > > > > > > > > shows
> > > > > > > > > > > > the application part. Please note that the bare
> > > > > > > > > > > > cluster
> > > > > > support is
> > > > > > > > an
> > > > > > > > > > > > *additional* feature for scenarios that require
> > external
> > > job
> > > > > > > > > > management.
> > > > > > > > > > > Is
> > > > > > > > > > > > there anything on the FLIP page that creates a
> > different
> > > > > > > > impression?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > Which Flink versions does the operator plan to
> > support?
> > > > > > > > > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > > > > > > > > > > 4. There were some changes to the Flink docker image
> > > > > > > > > > > > > > entrypoint script in, IIRC, Flink 1.13
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Great, thanks for providing this. It is important
> > > > > > > > > > > > for
> > the
> > > > > > > > > compatibility
> > > > > > > > > > > > going forward also. We are targeting Flink 1.14.x
> > > upwards.
> > > > > > Before
> > > > > > > > the
> > > > > > > > > > > > operator is ready there will be another Flink release.
> > > Let's
> > > > > > see if
> > > > > > > > > > > anyone
> > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > What kind of API compatibility we can commit to? It's
> > > > > > probably
> > > > > > > > fine
> > > > > > > > > > to
> > > > > > > > > > > > have
> > > > > > > > > > > > > alpha / beta version APIs that allow
> > > > > > > > > > > > > incompatible
> > > future
> > > > > > changes
> > > > > > > > > for
> > > > > > > > > > > the
> > > > > > > > > > > > > first version. But eventually we would need to
> > > guarantee
> > > > > > > > backwards
> > > > > > > > > > > > > compatibility, so that an early version CR can
> > > > > > > > > > > > > work
> > > with a
> > > > > > new
> > > > > > > > > > version
> > > > > > > > > > > > > operator.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Another great point and please let me include that
> > > > > > > > > > > > on
> > the
> > > > > FLIP
> > > > > > > > page.
> > > > > > > > > > ;-)
> > > > > > > > > > > >
> > > > > > > > > > > > I think we should allow incompatible changes for
> > > > > > > > > > > > the
> > > first
> > > > > one
> > > > > > or
> > > > > > > > two
> > > > > > > > > > > > versions, similar to how other major features have
> > > evolved
> > > > > > > > recently,
> > > > > > > > > > such
> > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > >
> > > > > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <
> > > > > thw@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > > > Maybe we should make this more clear in the
> > > > > > > > > > > > > > > FLIP
> > > but we
> > > > > > > > agreed
> > > > > > > > > to
> > > > > > > > > > > do
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > first version of the operator based on the
> > > > > > > > > > > > > > > native
> > > > > > > > integration.
> > > > > > > > > > > > > > > While this clearly does not cover all
> > > > > > > > > > > > > > > use-cases
> > and
> > > > > > > > > requirements,
> > > > > > > > > > > it
> > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > this would lead to a much smaller initial
> > > > > > > > > > > > > > > effort
> > > and a
> > > > > > nicer
> > > > > > > > > > first
> > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm also leaning towards the native
> > > > > > > > > > > > > > integration, as
> > > long
> > > > > > as it
> > > > > > > > > > > reduces
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > MVP effort. Ultimately the operator will need
> > > > > > > > > > > > > > to
> > also
> > > > > > support
> > > > > > > > the
> > > > > > > > > > > > > > standalone mode. I would like to gain more
> > confidence
> > > > > that
> > > > > > > > native
> > > > > > > > > > > > > > integration reduces the effort. While it cuts
> > > > > > > > > > > > > > the
> > > effort
> > > > > to
> > > > > > > > > handle
> > > > > > > > > > > the
> > > > > > > > > > > > TM
> > > > > > > > > > > > > > pod creation, some mapping code from the CR to
> > > > > > > > > > > > > > the
> > > native
> > > > > > > > > > integration
> > > > > > > > > > > > > > client and config needs to be created. As
> > > > > > > > > > > > > > mentioned
> > > in
> > > > > the
> > > > > > > > FLIP,
> > > > > > > > > > > native
> > > > > > > > > > > > > > integration requires the Flink job manager to
> > > > > > > > > > > > > > have
> > > access
> > > > > > to
> > > > > > > > the
> > > > > > > > > > k8s
> > > > > > > > > > > > API
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > create pods, which in some scenarios may be
> > > > > > > > > > > > > > seen as
> > > > > > > > unfavorable.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > Is the pod template in CR same with what
> > Flink
> > > has
> > > > > > > > already
> > > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > > > > cpu/memory
> > > > > > > > > > > > resources)
> > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, pod template would look almost identical.
> > There
> > > are
> > > > > a
> > > > > > few
> > > > > > > > > > > settings
> > > > > > > > > > > > > > that the operator will control (and that may
> > > > > > > > > > > > > > need
> > to
> > > be
> > > > > > > > > > blacklisted),
> > > > > > > > > > > > but
> > > > > > > > > > > > > > in general we would not want to place
> > restrictions. I
> > > > > > think a
> > > > > > > > > > > mechanism
> > > > > > > > > > > > > > where a pod template is merged from multiple
> > > > > > > > > > > > > > layers
> > > would
> > > > > > also
> > > > > > > > be
> > > > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >

RE: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Alexis Sarda-Espinosa <al...@microfocus.com>.
Danny Cranmer mentioned they are interested in standalone mode, and I am too, so I just wanted to say that if that development starts in parallel, I might be able to contribute a little.

Regarding the CRD, I agree it would be nice to avoid as many "duplications" as possible if pod templates are to be used. In my PoC I even tried to make use of existing configuration options like kubernetes.container.image & pipeline.jars [1]. For CPU/Memory resources, the discussion in [2] might be relevant.

[1] https://github.com/MicroFocus/opsb-flink-k8s-operator/blob/main/kubernetes/sample_batch_job.yaml
[2] https://issues.apache.org/jira/browse/FLINK-24150

Regards,
Alexis.

-----Original Message-----
From: K Fred <yu...@gmail.com> 
Sent: Monday, 7 February 2022 09:36
To: dev@flink.apache.org
Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Hi Gyula!

You are right. I think some common flink config options can be put in the CR, other expert settings continue to be overwritten by flink, and then the user can choose to customize the configuration.

Best Wishes,
Peng Yuan

On Mon, Feb 7, 2022 at 4:16 PM Gyula Fóra <gy...@gmail.com> wrote:

> Hi Yangze!
>
> This is not set in stone at the moment but the way I think it should 
> work is that first class config options in the CR should always take 
> precedence over the Flink config.
>
> In general we should not introduce too many arbitrary config options 
> that duplicate the flink configs without good reasons but the ones we 
> introduce should overwrite flink configs.
>
> We should discuss and decide together what config options to keep in 
> the flink conf and what to bring on the CR level. Resource related 
> ones are difficult because on one hand they are integral to every 
> application, on the other hand there are many expert settings that we 
> should probably leave in the conf.
>
> Cheers,
> Gyula
>
>
>
> On Mon, Feb 7, 2022 at 8:28 AM Yangze Guo <ka...@gmail.com> wrote:
>
> > Thanks everyone for the great effort. The FLIP looks really good.
> >
> > I just want to make sure the configuration priority in the CR example.
> > It seems the requests resources or "taskManager.taskSlots" will be
> > transferred to Flink internal config, e.g.
> > "taskmanager.memory.process.size" and 
> > "taskmanager.numberOfTaskSlots", and override the one in 
> > "flinkConfiguration". Am I understanding this correctly?
> >
> > Best,
> > Yangze Guo
> >
> > On Mon, Feb 7, 2022 at 10:22 AM Xintong Song <to...@gmail.com>
> > wrote:
> > >
> > > Sorry for the late reply. We were out due to the public holidays 
> > > in
> > China.
> > >
> > > @Thomas,
> > >
> > > The intention is to support application management through 
> > > operator and
> > CR,
> > > > which means there won't be any 2 step submission process, which 
> > > > as
> you
> > > > allude to would defeat the purpose of this project. The CR 
> > > > example
> > shows
> > > > the application part. Please note that the bare cluster support 
> > > > is an
> > > > *additional* feature for scenarios that require external job
> > management. Is
> > > > there anything on the FLIP page that creates a different impression?
> > > >
> > >
> > > Sounds good to me. I don't remember what created the impression of 
> > > 2
> step
> > > submission back then. I revisited the latest version of this FLIP 
> > > and
> it
> > > looks good to me.
> > >
> > > @Gyula,
> > >
> > > Versioning:
> > > > Versioning will be independent from Flink and the operator will
> depend
> > on a
> > > > fixed flink version (in every given operator version).
> > > > This should be the exact same setup as with Stateful Functions ( 
> > > > https://github.com/apache/flink-statefun). So independent 
> > > > release
> > cycle
> > > > but
> > > > still within the Flink umbrella.
> > > >
> > >
> > > Does this mean if someone wants to upgrade Flink to a version that 
> > > is released after the operator version that is being used, he/she 
> > > would
> need
> > > to upgrade the operator version first?
> > > I'm not questioning this, just trying to make sure I'm 
> > > understanding
> this
> > > correctly.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Mon, Feb 7, 2022 at 3:14 AM Gyula Fóra <gy...@gmail.com>
> wrote:
> > >
> > > > Thank you Alexis,
> > > >
> > > > Will definitely check this out. You are right, Kotlin makes it
> > difficult to
> > > > adopt pieces of this code directly but I think it will be good 
> > > > to get inspiration for the architecture and look at how 
> > > > particular problems
> > have
> > > > been solved. It will be a great help for us I am sure.
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Sat, Feb 5, 2022 at 12:28 PM Alexis Sarda-Espinosa < 
> > > > alexis.sarda-espinosa@microfocus.com> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > just wanted to mention that my employer agreed to open source 
> > > > > the
> > PoC I
> > > > > developed: 
> > > > > https://github.com/MicroFocus/opsb-flink-k8s-operator
> > > > >
> > > > > I understand the concern for maintainability, so Gradle & 
> > > > > Kotlin
> > might
> > > > not
> > > > > be appealing to you, but at least it gives you another reference.
> The
> > > > Helm
> > > > > resources in particular might be useful.
> > > > >
> > > > > There are bits and pieces there referring to Flink sessions, 
> > > > > but
> > those
> > > > are
> > > > > just placeholders, the functioning parts use application mode 
> > > > > with
> > native
> > > > > integration.
> > > > >
> > > > > Regards,
> > > > > Alexis.
> > > > >
> > > > > ________________________________
> > > > > From: Thomas Weise <th...@apache.org>
> > > > > Sent: Saturday, February 5, 2022 2:41 AM
> > > > > To: dev <de...@flink.apache.org>
> > > > > Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes
> Operator
> > > > >
> > > > > Hi,
> > > > >
> > > > > Thanks for the continued feedback and discussion. Looks like 
> > > > > we are ready to start a VOTE, I will initiate it shortly.
> > > > >
> > > > > In parallel it would be good to find the repository name.
> > > > >
> > > > > My suggestion would be: flink-kubernetes-operator
> > > > >
> > > > > I thought "flink-operator" could be a bit misleading since the 
> > > > > term operator already has a meaning in Flink.
> > > > >
> > > > > I also considered "flink-k8s-operator" but that would be 
> > > > > almost identical to existing operator implementations and 
> > > > > could lead to confusion in the future.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra 
> > > > > <gy...@gmail.com>
> > wrote:
> > > > > >
> > > > > > Hi Danny,
> > > > > >
> > > > > > So far we have been focusing our dev efforts on the initial
> native
> > > > > > implementation with the team.
> > > > > > If the discussion and vote goes well for this FLIP we are 
> > > > > > looking
> > > > forward
> > > > > > to contributing the initial version sometime next week 
> > > > > > (fingers
> > > > crossed).
> > > > > >
> > > > > > At that point I think we can already start the dev work to
> support
> > the
> > > > > > standalone mode as well, especially if you can dedicate some
> > effort to
> > > > > > pushing that side.
> > > > > > Working together on this sounds like a great idea and we 
> > > > > > should
> > start
> > > > as
> > > > > > soon as possible! :)
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > dannycranmer@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > I have been discussing this one with my team. We are 
> > > > > > > interested
> > in
> > > > the
> > > > > > > Standalone mode, and are willing to contribute towards the
> > > > > implementation.
> > > > > > > Potentially we can work together to support both modes in
> > parallel?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> gyula.fora@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Danny!
> > > > > > > >
> > > > > > > > Thanks for the feedback :)
> > > > > > > >
> > > > > > > > Versioning:
> > > > > > > > Versioning will be independent from Flink and the 
> > > > > > > > operator
> will
> > > > > depend
> > > > > > > on a
> > > > > > > > fixed flink version (in every given operator version).
> > > > > > > > This should be the exact same setup as with Stateful
> Functions
> > (
> > > > > > > > https://github.com/apache/flink-statefun). So 
> > > > > > > > independent
> > release
> > > > > cycle
> > > > > > > > but
> > > > > > > > still within the Flink umbrella.
> > > > > > > >
> > > > > > > > Deployment error handling:
> > > > > > > > I think that's a very good point, as general exception
> > handling for
> > > > > the
> > > > > > > > different failure scenarios is a tricky problem. I think 
> > > > > > > > the
> > > > > exception
> > > > > > > > classifiers and retry strategies could avoid a lot of 
> > > > > > > > manual
> > > > > intervention
> > > > > > > > from the user. We will definitely need to add something 
> > > > > > > > like
> > this.
> > > > > Once
> > > > > > > we
> > > > > > > > have the repo created with the initial operator code we
> should
> > open
> > > > > some
> > > > > > > > tickets for this and put it on the short term roadmap!
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > dannycranmer@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey team,
> > > > > > > > >
> > > > > > > > > Great work on the FLIP, I am looking forward to this 
> > > > > > > > > one. I
> > agree
> > > > > that
> > > > > > > we
> > > > > > > > > can move forward to the voting stage.
> > > > > > > > >
> > > > > > > > > I have general feedback around how we will handle job
> > submission
> > > > > > > failure
> > > > > > > > > and retry. As discussed in the Rejected Alternatives
> > section, we
> > > > > can
> > > > > > > use
> > > > > > > > > Java to handle job submission failures from the Flink
> > client. It
> > > > > would
> > > > > > > be
> > > > > > > > > useful to have the ability to configure exception
> > classifiers and
> > > > > retry
> > > > > > > > > strategy as part of operator configuration.
> > > > > > > > >
> > > > > > > > > > Given this will be in a separate GitHub repository I
> > > > > > > > > > am
> > curious
> > > > how
> > > > > > > > the
> > > > > > > > > versioning strategy will work in relation to the Flink
> > version?
> > > > Do
> > > > > we
> > > > > > > > have
> > > > > > > > > any other components with a similar setup I can look at?
> > Will the
> > > > > > > > operator
> > > > > > > > > version track Flink or will it use its own versioning
> > strategy
> > > > > with a
> > > > > > > > Flink
> > > > > > > > > version support matrix, or similar?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > balassi.marton@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi team,
> > > > > > > > > >
> > > > > > > > > > Thank you for the great feedback, Thomas has updated 
> > > > > > > > > > the
> > FLIP
> > > > > page
> > > > > > > > > > accordingly. If you are comfortable with the 
> > > > > > > > > > currently
> > existing
> > > > > > > design
> > > > > > > > > and
> > > > > > > > > > depth in the FLIP [1] I suggest moving forward to 
> > > > > > > > > > the
> > voting
> > > > > stage -
> > > > > > > > once
> > > > > > > > > > that reaches a positive conclusion it lets us create 
> > > > > > > > > > the
> > > > separate
> > > > > > > code
> > > > > > > > > > repository under the flink project for the operator.
> > > > > > > > > >
> > > > > > > > > > I encourage everyone to keep improving the details 
> > > > > > > > > > in the
> > > > > meantime,
> > > > > > > > > however
> > > > > > > > > > I believe given the existing design and the general
> > sentiment
> > > > on
> > > > > this
> > > > > > > > > > thread that the most efficient path from here is 
> > > > > > > > > > starting
> > the
> > > > > > > > > > implementation so that we can collectively iterate 
> > > > > > > > > > over
> it.
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > >
> > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > thw@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback and please see responses 
> > > > > > > > > > > below
> > -->
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > > > tonysong820@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and 
> > > > > > > > > > > > everyone
> for
> > the
> > > > > > > > > discussion.
> > > > > > > > > > > >
> > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > >
> > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > Deploying a Flink session cluster via kubectl & CR
> and
> > then
> > > > > > > > > submitting
> > > > > > > > > > > jobs
> > > > > > > > > > > > to the cluster via Flink cli / REST is probably 
> > > > > > > > > > > > the
> > > > approach
> > > > > that
> > > > > > > > > > > requires
> > > > > > > > > > > > the least effort. However, I'd like to point out 
> > > > > > > > > > > > 2
> > > > > weaknesses.
> > > > > > > > > > > > 1. A lot of users use Flink in 
> > > > > > > > > > > > perjob/application
> > modes.
> > > > For
> > > > > > > these
> > > > > > > > > > users,
> > > > > > > > > > > > having to run the job in two steps (deploy the
> > cluster, and
> > > > > > > submit
> > > > > > > > > the
> > > > > > > > > > > job)
> > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > 2. One of our motivations is being able to 
> > > > > > > > > > > > manage
> Flink
> > > > > > > > applications'
> > > > > > > > > > > > lifecycles with kubectl. Submitting jobs from 
> > > > > > > > > > > > cli
> > sounds
> > > > not
> > > > > > > > aligned
> > > > > > > > > > with
> > > > > > > > > > > > this motivation.
> > > > > > > > > > > > I think it's probably worth it to support 
> > > > > > > > > > > > submitting
> > jobs
> > > > via
> > > > > > > > > kubectl &
> > > > > > > > > > > CR
> > > > > > > > > > > > in the first version, both together with 
> > > > > > > > > > > > deploying
> the
> > > > > cluster
> > > > > > > like
> > > > > > > > > in
> > > > > > > > > > > > perjob/application mode and after deploying the
> cluster
> > > > like
> > > > > in
> > > > > > > > > session
> > > > > > > > > > > > mode.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The intention is to support application management
> > through
> > > > > operator
> > > > > > > > and
> > > > > > > > > > CR,
> > > > > > > > > > > which means there won't be any 2 step submission
> process,
> > > > > which as
> > > > > > > > you
> > > > > > > > > > > allude to would defeat the purpose of this 
> > > > > > > > > > > project. The
> > CR
> > > > > example
> > > > > > > > > shows
> > > > > > > > > > > the application part. Please note that the bare 
> > > > > > > > > > > cluster
> > > > > support is
> > > > > > > an
> > > > > > > > > > > *additional* feature for scenarios that require
> external
> > job
> > > > > > > > > management.
> > > > > > > > > > Is
> > > > > > > > > > > there anything on the FLIP page that creates a
> different
> > > > > > > impression?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > Which Flink versions does the operator plan to
> support?
> > > > > > > > > > > > > 1. Native K8s deployment was first introduced
> > > > > > > > > > > > in
> > Flink
> > > > 1.10
> > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > > 3. The Pod template support was introduced in Flink
> > 1.13
> > > > > > > > > > > > > 4. There were some changes to the Flink Docker
> > > > > > > > > > > > image
> > > > > entrypoint
> > > > > > > > script
> > > > > > > > > > in,
> > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Great, thanks for providing this. It is important 
> > > > > > > > > > > for
> the
> > > > > > > > compatibility
> > > > > > > > > > > going forward also. We are targeting Flink 1.14.x
> > upwards.
> > > > > Before
> > > > > > > the
> > > > > > > > > > > operator is ready there will be another Flink release.
> > Let's
> > > > > see if
> > > > > > > > > > anyone
> > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > What kind of API compatibility can we commit to? It's
> > > > > probably
> > > > > > > fine
> > > > > > > > > to
> > > > > > > > > > > have
> > > > > > > > > > > > alpha / beta version APIs that allow 
> > > > > > > > > > > > incompatible
> > future
> > > > > changes
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > first version. But eventually we would need to
> > guarantee
> > > > > > > backwards
> > > > > > > > > > > > compatibility, so that an early version CR can 
> > > > > > > > > > > > work
> > with a
> > > > > new
> > > > > > > > > version
> > > > > > > > > > > > operator.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Another great point and please let me include that 
> > > > > > > > > > > on
> the
> > > > FLIP
> > > > > > > page.
> > > > > > > > > ;-)
> > > > > > > > > > >
> > > > > > > > > > > I think we should allow incompatible changes for 
> > > > > > > > > > > the
> > first
> > > > one
> > > > > or
> > > > > > > two
> > > > > > > > > > > versions, similar to how other major features have
> > evolved
> > > > > > > recently,
> > > > > > > > > such
> > > > > > > > > > > as FLIP-27.
> > > > > > > > > > >
> > > > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Thomas
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you~
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <
> > > > thw@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > > > Maybe we should make this more clear in the FLIP
> > but we
> > > > > > > agreed
> > > > > > > > to
> > > > > > > > > > do
> > > > > > > > > > > > the
> > > > > > > > > > > > > > first version of the operator based on the 
> > > > > > > > > > > > > > native
> > > > > > > integration.
> > > > > > > > > > > > > > While this clearly does not cover all 
> > > > > > > > > > > > > > use-cases
> and
> > > > > > > > requirements,
> > > > > > > > > > it
> > > > > > > > > > > > > seems
> > > > > > > > > > > > > > this would lead to a much smaller initial 
> > > > > > > > > > > > > > effort
> > and a
> > > > > nicer
> > > > > > > > > first
> > > > > > > > > > > > > version.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm also leaning towards the native 
> > > > > > > > > > > > > integration, as
> > long
> > > > > as it
> > > > > > > > > > reduces
> > > > > > > > > > > > the
> > > > > > > > > > > > > MVP effort. Ultimately the operator will need 
> > > > > > > > > > > > > to
> also
> > > > > support
> > > > > > > the
> > > > > > > > > > > > > standalone mode. I would like to gain more
> confidence
> > > > that
> > > > > > > native
> > > > > > > > > > > > > integration reduces the effort. While it cuts 
> > > > > > > > > > > > > the
> > effort
> > > > to
> > > > > > > > handle
> > > > > > > > > > the
> > > > > > > > > > > TM
> > > > > > > > > > > > > pod creation, some mapping code from the CR to 
> > > > > > > > > > > > > the
> > native
> > > > > > > > > integration
> > > > > > > > > > > > > client and config needs to be created. As 
> > > > > > > > > > > > > mentioned
> > in
> > > > the
> > > > > > > FLIP,
> > > > > > > > > > native
> > > > > > > > > > > > > integration requires the Flink job manager to 
> > > > > > > > > > > > > have
> > access
> > > > > to
> > > > > > > the
> > > > > > > > > k8s
> > > > > > > > > > > API
> > > > > > > > > > > > to
> > > > > > > > > > > > > create pods, which in some scenarios may be 
> > > > > > > > > > > > > seen as
> > > > > > > unfavorable.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > # Pod Template
> > > > > > > > > > > > > > > > Is the pod template in CR same with what
> Flink
> > has
> > > > > > > already
> > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > > > cpu/memory
> > > > > > > > > > > resources)
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, pod template would look almost identical.
> There
> > are
> > > > a
> > > > > few
> > > > > > > > > > settings
> > > > > > > > > > > > > that the operator will control (and that may 
> > > > > > > > > > > > > need
> to
> > be
> > > > > > > > > blacklisted),
> > > > > > > > > > > but
> > > > > > > > > > > > > in general we would not want to place
> restrictions. I
> > > > > think a
> > > > > > > > > > mechanism
> > > > > > > > > > > > > where a pod template is merged from multiple 
> > > > > > > > > > > > > layers
> > would
> > > > > also
> > > > > > > be
> > > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Thomas
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by K Fred <yu...@gmail.com>.
Hi Gyula!

You are right. I think some common Flink config options can be exposed in the
CR, while the other expert settings stay in the Flink configuration, so the
user can choose what to customize.
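
To make the precedence discussed in this thread concrete, here is a minimal
sketch (not the operator's actual code). It assumes that first-class CR
fields, when set, overwrite the matching keys from "flinkConfiguration". The
two Flink keys are the ones Yangze named; the CR field paths, the mapping
table, and the function name are hypothetical and only for illustration.

```python
# Hypothetical mapping from first-class CR fields to Flink config keys.
# The Flink keys are the ones named in this thread; the CR field paths are
# illustrative assumptions, not the FLIP's final schema.
CR_FIELD_TO_FLINK_KEY = {
    "taskManager.taskSlots": "taskmanager.numberOfTaskSlots",
    "taskManager.resource.memory": "taskmanager.memory.process.size",
}


def effective_flink_config(flink_configuration, cr_fields):
    """Return the config that would be handed to Flink.

    Starts from the user's flinkConfiguration and lets every first-class
    CR field that is set overwrite the corresponding Flink key.
    """
    effective = dict(flink_configuration)  # copy, do not mutate the input
    for cr_field, flink_key in CR_FIELD_TO_FLINK_KEY.items():
        value = cr_fields.get(cr_field)
        if value is not None:
            effective[flink_key] = str(value)
    return effective


# The CR sets task slots, so it wins over flinkConfiguration; keys without
# a first-class CR field pass through untouched.
config = effective_flink_config(
    {"taskmanager.numberOfTaskSlots": "2", "state.backend": "rocksdb"},
    {"taskManager.taskSlots": 4},
)
```

Expert settings that have no first-class CR field pass through unchanged, so
users can still customize them in the Flink configuration.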

Best Wishes,
Peng Yuan

On Mon, Feb 7, 2022 at 4:16 PM Gyula Fóra <gy...@gmail.com> wrote:

> Hi Yangze!
>
> This is not set in stone at the moment but the way I think it should work
> is that first class config options in the CR should always take precedence
> over the Flink config.
>
> In general we should not introduce too many arbitrary config options that
> duplicate the flink configs without good reasons but the ones we introduce
> should overwrite flink configs.
>
> We should discuss and decide together what config options to keep in the
> flink conf and what to bring on the CR level. Resource related ones are
> difficult because on one hand they are integral to every application, on
> the other hand there are many expert settings that we should probably leave
> in the conf.
>
> Cheers,
> Gyula
>
>
>
> On Mon, Feb 7, 2022 at 8:28 AM Yangze Guo <ka...@gmail.com> wrote:
>
> > Thanks everyone for the great effort. The FLIP looks really good.
> >
> > I just want to make sure the configuration priority in the CR example.
> > It seems the requested resources or "taskManager.taskSlots" will be
> > transferred to Flink internal config, e.g.
> > "taskmanager.memory.process.size" and "taskmanager.numberOfTaskSlots",
> > and override the one in "flinkConfiguration". Am I understanding this
> > correctly?
> >
> > Best,
> > Yangze Guo
> >
> > On Mon, Feb 7, 2022 at 10:22 AM Xintong Song <to...@gmail.com>
> > wrote:
> > >
> > > Sorry for the late reply. We were out due to the public holidays in
> > China.
> > >
> > > @Thomas,
> > >
> > > > > The intention is to support application management through operator and
> > CR,
> > > > which means there won't be any 2 step submission process, which as
> you
> > > > allude to would defeat the purpose of this project. The CR example
> > shows
> > > > the application part. Please note that the bare cluster support is an
> > > > *additional* feature for scenarios that require external job
> > management. Is
> > > > there anything on the FLIP page that creates a different impression?
> > > >
> > >
> > > Sounds good to me. I don't remember what created the impression of 2
> step
> > > submission back then. I revisited the latest version of this FLIP and
> it
> > > looks good to me.
> > >
> > > @Gyula,
> > >
> > > > > Versioning:
> > > > Versioning will be independent from Flink and the operator will
> depend
> > on a
> > > > fixed flink version (in every given operator version).
> > > > This should be the exact same setup as with Stateful Functions (
> > > > https://github.com/apache/flink-statefun). So independent release
> > cycle
> > > > but
> > > > still within the Flink umbrella.
> > > >
> > >
> > > Does this mean if someone wants to upgrade Flink to a version that is
> > > released after the operator version that is being used, he/she would
> need
> > > to upgrade the operator version first?
> > > I'm not questioning this, just trying to make sure I'm understanding
> this
> > > correctly.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Mon, Feb 7, 2022 at 3:14 AM Gyula Fóra <gy...@gmail.com>
> wrote:
> > >
> > > > Thank you Alexis,
> > > >
> > > > Will definitely check this out. You are right, Kotlin makes it
> > difficult to
> > > > adopt pieces of this code directly but I think it will be good to get
> > > > inspiration for the architecture and look at how particular problems
> > have
> > > > been solved. It will be a great help for us I am sure.
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Sat, Feb 5, 2022 at 12:28 PM Alexis Sarda-Espinosa <
> > > > alexis.sarda-espinosa@microfocus.com> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > just wanted to mention that my employer agreed to open source the
> > PoC I
> > > > > developed: https://github.com/MicroFocus/opsb-flink-k8s-operator
> > > > >
> > > > > I understand the concern for maintainability, so Gradle & Kotlin
> > might
> > > > not
> > > > > be appealing to you, but at least it gives you another reference.
> The
> > > > Helm
> > > > > resources in particular might be useful.
> > > > >
> > > > > There are bits and pieces there referring to Flink sessions, but
> > those
> > > > are
> > > > > just placeholders, the functioning parts use application mode with
> > native
> > > > > integration.
> > > > >
> > > > > Regards,
> > > > > Alexis.
> > > > >
> > > > > ________________________________
> > > > > From: Thomas Weise <th...@apache.org>
> > > > > Sent: Saturday, February 5, 2022 2:41 AM
> > > > > To: dev <de...@flink.apache.org>
> > > > > Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes
> Operator
> > > > >
> > > > > Hi,
> > > > >
> > > > > Thanks for the continued feedback and discussion. Looks like we are
> > > > > ready to start a VOTE, I will initiate it shortly.
> > > > >
> > > > > In parallel it would be good to find the repository name.
> > > > >
> > > > > My suggestion would be: flink-kubernetes-operator
> > > > >
> > > > > I thought "flink-operator" could be a bit misleading since the term
> > > > > operator already has a meaning in Flink.
> > > > >
> > > > > I also considered "flink-k8s-operator" but that would be almost
> > > > > identical to existing operator implementations and could lead to
> > > > > confusion in the future.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com>
> > wrote:
> > > > > >
> > > > > > Hi Danny,
> > > > > >
> > > > > > So far we have been focusing our dev efforts on the initial
> native
> > > > > > implementation with the team.
> > > > > > If the discussion and vote goes well for this FLIP we are looking
> > > > forward
> > > > > > to contributing the initial version sometime next week (fingers
> > > > crossed).
> > > > > >
> > > > > > At that point I think we can already start the dev work to
> support
> > the
> > > > > > standalone mode as well, especially if you can dedicate some
> > effort to
> > > > > > pushing that side.
> > > > > > Working together on this sounds like a great idea and we should
> > start
> > > > as
> > > > > > soon as possible! :)
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > dannycranmer@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > I have been discussing this one with my team. We are interested
> > in
> > > > the
> > > > > > > Standalone mode, and are willing to contribute towards the
> > > > > implementation.
> > > > > > > Potentially we can work together to support both modes in
> > parallel?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> gyula.fora@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Danny!
> > > > > > > >
> > > > > > > > Thanks for the feedback :)
> > > > > > > >
> > > > > > > > Versioning:
> > > > > > > > Versioning will be independent from Flink and the operator
> will
> > > > > depend
> > > > > > > on a
> > > > > > > > fixed flink version (in every given operator version).
> > > > > > > > This should be the exact same setup as with Stateful
> Functions
> > (
> > > > > > > > https://github.com/apache/flink-statefun). So independent
> > release
> > > > > cycle
> > > > > > > > but
> > > > > > > > still within the Flink umbrella.
> > > > > > > >
> > > > > > > > Deployment error handling:
> > > > > > > > I think that's a very good point, as general exception
> > handling for
> > > > > the
> > > > > > > > different failure scenarios is a tricky problem. I think the
> > > > > exception
> > > > > > > > classifiers and retry strategies could avoid a lot of manual
> > > > > intervention
> > > > > > > > from the user. We will definitely need to add something like
> > this.
> > > > > Once
> > > > > > > we
> > > > > > > > have the repo created with the initial operator code we
> should
> > open
> > > > > some
> > > > > > > > tickets for this and put it on the short term roadmap!
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > dannycranmer@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey team,
> > > > > > > > >
> > > > > > > > > Great work on the FLIP, I am looking forward to this one. I
> > agree
> > > > > that
> > > > > > > we
> > > > > > > > > can move forward to the voting stage.
> > > > > > > > >
> > > > > > > > > I have general feedback around how we will handle job
> > submission
> > > > > > > failure
> > > > > > > > > and retry. As discussed in the Rejected Alternatives
> > section, we
> > > > > can
> > > > > > > use
> > > > > > > > > Java to handle job submission failures from the Flink
> > client. It
> > > > > would
> > > > > > > be
> > > > > > > > > useful to have the ability to configure exception
> > classifiers and
> > > > > retry
> > > > > > > > > strategy as part of operator configuration.
> > > > > > > > >
> > > > > > > > > > Given this will be in a separate GitHub repository I am
> > curious
> > > > how
> > > > > > > > the
> > > > > > > > > versioning strategy will work in relation to the Flink
> > version?
> > > > Do
> > > > > we
> > > > > > > > have
> > > > > > > > > any other components with a similar setup I can look at?
> > Will the
> > > > > > > > operator
> > > > > > > > > version track Flink or will it use its own versioning
> > strategy
> > > > > with a
> > > > > > > > Flink
> > > > > > > > > version support matrix, or similar?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > balassi.marton@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi team,
> > > > > > > > > >
> > > > > > > > > > Thank you for the great feedback, Thomas has updated the
> > FLIP
> > > > > page
> > > > > > > > > > accordingly. If you are comfortable with the currently
> > existing
> > > > > > > design
> > > > > > > > > and
> > > > > > > > > > depth in the FLIP [1] I suggest moving forward to the
> > voting
> > > > > stage -
> > > > > > > > once
> > > > > > > > > > that reaches a positive conclusion it lets us create the
> > > > separate
> > > > > > > code
> > > > > > > > > > repository under the flink project for the operator.
> > > > > > > > > >
> > > > > > > > > > I encourage everyone to keep improving the details in the
> > > > > meantime,
> > > > > > > > > however
> > > > > > > > > > I believe given the existing design and the general
> > sentiment
> > > > on
> > > > > this
> > > > > > > > > > thread that the most efficient path from here is starting
> > the
> > > > > > > > > > implementation so that we can collectively iterate over
> it.
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > >
> > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > thw@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback and please see responses below
> > -->
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > > > tonysong820@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and everyone
> for
> > the
> > > > > > > > > discussion.
> > > > > > > > > > > >
> > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > >
> > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > Deploying a Flink session cluster via kubectl & CR
> and
> > then
> > > > > > > > > submitting
> > > > > > > > > > > jobs
> > > > > > > > > > > > to the cluster via Flink cli / REST is probably the
> > > > approach
> > > > > that
> > > > > > > > > > > requires
> > > > > > > > > > > > the least effort. However, I'd like to point out 2
> > > > > weaknesses.
> > > > > > > > > > > > 1. A lot of users use Flink in perjob/application
> > modes.
> > > > For
> > > > > > > these
> > > > > > > > > > users,
> > > > > > > > > > > > having to run the job in two steps (deploy the
> > cluster, and
> > > > > > > submit
> > > > > > > > > the
> > > > > > > > > > > job)
> > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > 2. One of our motivations is being able to manage
> Flink
> > > > > > > > applications'
> > > > > > > > > > > > lifecycles with kubectl. Submitting jobs from cli
> > sounds
> > > > not
> > > > > > > > aligned
> > > > > > > > > > with
> > > > > > > > > > > > this motivation.
> > > > > > > > > > > > I think it's probably worth it to support submitting
> > jobs
> > > > via
> > > > > > > > > kubectl &
> > > > > > > > > > > CR
> > > > > > > > > > > > in the first version, both together with deploying
> the
> > > > > cluster
> > > > > > > like
> > > > > > > > > in
> > > > > > > > > > > > perjob/application mode and after deploying the
> cluster
> > > > like
> > > > > in
> > > > > > > > > session
> > > > > > > > > > > > mode.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The intention is to support application management
> > through
> > > > > operator
> > > > > > > > and
> > > > > > > > > > CR,
> > > > > > > > > > > which means there won't be any 2 step submission
> process,
> > > > > which as
> > > > > > > > you
> > > > > > > > > > > allude to would defeat the purpose of this project. The
> > CR
> > > > > example
> > > > > > > > > shows
> > > > > > > > > > > the application part. Please note that the bare cluster
> > > > > support is
> > > > > > > an
> > > > > > > > > > > *additional* feature for scenarios that require
> external
> > job
> > > > > > > > > management.
> > > > > > > > > > Is
> > > > > > > > > > > there anything on the FLIP page that creates a
> different
> > > > > > > impression?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > Which Flink versions does the operator plan to
> support?
> > > > > > > > > > > > > 1. Native K8s deployment was first introduced in
> > Flink
> > > > 1.10
> > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > 3. The Pod template support was introduced in Flink
> > 1.13
> > > > > > > > > > > > > 4. There were some changes to the Flink Docker image
> > > > > entrypoint
> > > > > > > > script
> > > > > > > > > > in,
> > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Great, thanks for providing this. It is important for
> the
> > > > > > > > compatibility
> > > > > > > > > > > going forward also. We are targeting Flink 1.14.x
> > upwards.
> > > > > Before
> > > > > > > the
> > > > > > > > > > > operator is ready there will be another Flink release.
> > Let's
> > > > > see if
> > > > > > > > > > anyone
> > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > What kind of API compatibility can we commit to? It's
> > > > > probably
> > > > > > > fine
> > > > > > > > > to
> > > > > > > > > > > have
> > > > > > > > > > > > alpha / beta version APIs that allow incompatible
> > future
> > > > > changes
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > first version. But eventually we would need to
> > guarantee
> > > > > > > backwards
> > > > > > > > > > > > compatibility, so that an early version CR can work
> > with a
> > > > > new
> > > > > > > > > version
> > > > > > > > > > > > operator.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Another great point and please let me include that on
> the
> > > > FLIP
> > > > > > > page.
> > > > > > > > > ;-)
> > > > > > > > > > >
> > > > > > > > > > > I think we should allow incompatible changes for the
> > first
> > > > one
> > > > > or
> > > > > > > two
> > > > > > > > > > > versions, similar to how other major features have
> > evolved
> > > > > > > recently,
> > > > > > > > > such
> > > > > > > > > > > as FLIP-27.
> > > > > > > > > > >
> > > > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Thomas
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you~
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <
> > > > thw@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > > Maybe we should make this more clear in the FLIP
> > but we
> > > > > > > agreed
> > > > > > > > to
> > > > > > > > > > do
> > > > > > > > > > > > the
> > > > > > > > > > > > > > first version of the operator based on the native
> > > > > > > integration.
> > > > > > > > > > > > > > While this clearly does not cover all use-cases
> and
> > > > > > > > requirements,
> > > > > > > > > > it
> > > > > > > > > > > > > seems
> > > > > > > > > > > > > > this would lead to a much smaller initial effort
> > and a
> > > > > nicer
> > > > > > > > > first
> > > > > > > > > > > > > version.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm also leaning towards the native integration, as
> > long
> > > > > as it
> > > > > > > > > > reduces
> > > > > > > > > > > > the
> > > > > > > > > > > > > MVP effort. Ultimately the operator will need to
> also
> > > > > support
> > > > > > > the
> > > > > > > > > > > > > standalone mode. I would like to gain more
> confidence
> > > > that
> > > > > > > native
> > > > > > > > > > > > > integration reduces the effort. While it cuts the
> > effort
> > > > to
> > > > > > > > handle
> > > > > > > > > > the
> > > > > > > > > > > TM
> > > > > > > > > > > > > pod creation, some mapping code from the CR to the
> > native
> > > > > > > > > integration
> > > > > > > > > > > > > client and config needs to be created. As mentioned
> > in
> > > > the
> > > > > > > FLIP,
> > > > > > > > > > native
> > > > > > > > > > > > > integration requires the Flink job manager to have
> > access
> > > > > to
> > > > > > > the
> > > > > > > > > k8s
> > > > > > > > > > > API
> > > > > > > > > > > > to
> > > > > > > > > > > > > create pods, which in some scenarios may be seen as
> > > > > > > unfavorable.
> > > > > > > > > > > > >
> > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > Is the pod template in CR same with what
> Flink
> > has
> > > > > > > already
> > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > > > cpu/memory
> > > > > > > > > > > resources)
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, pod template would look almost identical.
> There
> > are
> > > > a
> > > > > few
> > > > > > > > > > settings
> > > > > > > > > > > > > that the operator will control (and that may need
> to
> > be
> > > > > > > > > blacklisted),
> > > > > > > > > > > but
> > > > > > > > > > > > > in general we would not want to place
> restrictions. I
> > > > > think a
> > > > > > > > > > mechanism
> > > > > > > > > > > > > where a pod template is merged from multiple layers
> > would
> > > > > also
> > > > > > > be
> > > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Thomas
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Yangze!

This is not set in stone at the moment, but the way I think it should work
is that first-class config options in the CR always take precedence
over the Flink config.

In general we should not introduce too many arbitrary config options that
duplicate the Flink configs without good reason, but the ones we do
introduce should override the corresponding Flink configs.
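
To make the proposed precedence concrete, here is a minimal sketch in
Python. It is not operator code. The two Flink keys and the
"flinkConfiguration" and "taskManager.taskSlots" names come from this
thread; the "taskManager.memory" field and the function name are
assumptions made up for illustration.

```python
def effective_flink_config(spec):
    """Build the Flink config for a CR spec.

    Start from the free-form flinkConfiguration map, then let the
    first-class CR fields override any duplicated keys.
    """
    config = dict(spec.get("flinkConfiguration", {}))
    task_manager = spec.get("taskManager", {})

    # First-class field named in this thread; wins over flinkConfiguration.
    if "taskSlots" in task_manager:
        config["taskmanager.numberOfTaskSlots"] = str(task_manager["taskSlots"])

    # Hypothetical first-class memory field (assumption, not from the FLIP).
    if "memory" in task_manager:
        config["taskmanager.memory.process.size"] = task_manager["memory"]

    return config


spec = {
    "flinkConfiguration": {
        "taskmanager.numberOfTaskSlots": "4",  # overridden by taskSlots below
        "state.backend": "rocksdb",            # untouched expert setting
    },
    "taskManager": {"taskSlots": 2, "memory": "2048m"},
}
result = effective_flink_config(spec)
```

With this ordering, expert settings that only exist in the Flink conf pass
through unchanged, and a key set in both places resolves to the CR value.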

We should discuss and decide together which config options to keep in the
Flink conf and which to bring to the CR level. Resource-related ones are
difficult because on the one hand they are integral to every application,
and on the other hand there are many expert settings that we should
probably leave in the conf.

Cheers,
Gyula



On Mon, Feb 7, 2022 at 8:28 AM Yangze Guo <ka...@gmail.com> wrote:

> Thanks everyone for the great effort. The FLIP looks really good.
>
> I just want to confirm the configuration priority in the CR example.
> It seems the requested resources or "taskManager.taskSlots" will be
> translated to Flink internal config, e.g.
> "taskmanager.memory.process.size" and "taskmanager.numberOfTaskSlots",
> and will override the values in "flinkConfiguration". Am I understanding
> this correctly?
>
> Best,
> Yangze Guo
>
> On Mon, Feb 7, 2022 at 10:22 AM Xintong Song <to...@gmail.com>
> wrote:
> >
> > Sorry for the late reply. We were out due to the public holidays in
> China.
> >
> > @Thomas,
> >
> > The intention is to support application management through operator and
> CR,
> > > which means there won't be any 2 step submission process, which as you
> > > allude to would defeat the purpose of this project. The CR example
> shows
> > > the application part. Please note that the bare cluster support is an
> > > *additional* feature for scenarios that require external job
> management. Is
> > > there anything on the FLIP page that creates a different impression?
> > >
> >
> > Sounds good to me. I don't remember what created the impression of 2 step
> > submission back then. I revisited the latest version of this FLIP and it
> > looks good to me.
> >
> > @Gyula,
> >
> > Versioning:
> > > Versioning will be independent from Flink and the operator will depend
> on a
> > > fixed flink version (in every given operator version).
> > > This should be the exact same setup as with Stateful Functions (
> > > https://github.com/apache/flink-statefun). So independent release
> cycle
> > > but
> > > still within the Flink umbrella.
> > >
> >
> > Does this mean if someone wants to upgrade Flink to a version that is
> > released after the operator version that is being used, he/she would need
> > to upgrade the operator version first?
> > I'm not questioning this, just trying to make sure I'm understanding this
> > correctly.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Mon, Feb 7, 2022 at 3:14 AM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > Thank you Alexis,
> > >
> > > Will definitely check this out. You are right, Kotlin makes it
> difficult to
> > > adopt pieces of this code directly but I think it will be good to get
> > > inspiration for the architecture and look at how particular problems
> have
> > > been solved. It will be a great help for us I am sure.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Sat, Feb 5, 2022 at 12:28 PM Alexis Sarda-Espinosa <
> > > alexis.sarda-espinosa@microfocus.com> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > just wanted to mention that my employer agreed to open source the
> PoC I
> > > > developed: https://github.com/MicroFocus/opsb-flink-k8s-operator
> > > >
> > > > I understand the concern for maintainability, so Gradle & Kotlin
> might
> > > not
> > > > be appealing to you, but at least it gives you another reference. The
> > > Helm
> > > > resources in particular might be useful.
> > > >
> > > > There are bits and pieces there referring to Flink sessions, but
> those
> > > are
> > > > just placeholders, the functioning parts use application mode with
> native
> > > > integration.
> > > >
> > > > Regards,
> > > > Alexis.
> > > >
> > > > ________________________________
> > > > From: Thomas Weise <th...@apache.org>
> > > > Sent: Saturday, February 5, 2022 2:41 AM
> > > > To: dev <de...@flink.apache.org>
> > > > Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator
> > > >
> > > > Hi,
> > > >
> > > > Thanks for the continued feedback and discussion. Looks like we are
> > > > ready to start a VOTE, I will initiate it shortly.
> > > >
> > > > In parallel it would be good to find the repository name.
> > > >
> > > > My suggestion would be: flink-kubernetes-operator
> > > >
> > > > I thought "flink-operator" could be a bit misleading since the term
> > > > operator already has a meaning in Flink.
> > > >
> > > > I also considered "flink-k8s-operator" but that would be almost
> > > > identical to existing operator implementations and could lead to
> > > > confusion in the future.
> > > >
> > > > Thoughts?
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > >
> > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com>
> wrote:
> > > > >
> > > > > Hi Danny,
> > > > >
> > > > > So far we have been focusing our dev efforts on the initial native
> > > > > implementation with the team.
> > > > > If the discussion and vote goes well for this FLIP we are looking
> > > forward
> > > > > to contributing the initial version sometime next week (fingers
> > > crossed).
> > > > >
> > > > > At that point I think we can already start the dev work to support
> the
> > > > > standalone mode as well, especially if you can dedicate some
> effort to
> > > > > pushing that side.
> > > > > Working together on this sounds like a great idea and we should
> start
> > > as
> > > > > soon as possible! :)
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> dannycranmer@apache.org>
> > > > > wrote:
> > > > >
> > > > > > I have been discussing this one with my team. We are interested
> in
> > > the
> > > > > > Standalone mode, and are willing to contribute towards the
> > > > implementation.
> > > > > > Potentially we can work together to support both modes in
> parallel?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gy...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi Danny!
> > > > > > >
> > > > > > > Thanks for the feedback :)
> > > > > > >
> > > > > > > Versioning:
> > > > > > > Versioning will be independent from Flink and the operator will
> > > > depend
> > > > > > on a
> > > > > > > fixed flink version (in every given operator version).
> > > > > > > This should be the exact same setup as with Stateful Functions
> (
> > > > > > > https://github.com/apache/flink-statefun). So independent
> release
> > > > cycle
> > > > > > > but
> > > > > > > still within the Flink umbrella.
> > > > > > >
> > > > > > > Deployment error handling:
> > > > > > > I think that's a very good point, as general exception
> handling for
> > > > the
> > > > > > > different failure scenarios is a tricky problem. I think the
> > > > exception
> > > > > > > classifiers and retry strategies could avoid a lot of manual
> > > > intervention
> > > > > > > from the user. We will definitely need to add something like
> this.
> > > > Once
> > > > > > we
> > > > > > > have the repo created with the initial operator code we should
> open
> > > > some
> > > > > > > tickets for this and put it on the short term roadmap!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > dannycranmer@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey team,
> > > > > > > >
> > > > > > > > Great work on the FLIP, I am looking forward to this one. I
> agree
> > > > that
> > > > > > we
> > > > > > > > can move forward to the voting stage.
> > > > > > > >
> > > > > > > > I have general feedback around how we will handle job
> submission
> > > > > > failure
> > > > > > > > and retry. As discussed in the Rejected Alternatives
> section, we
> > > > can
> > > > > > use
> > > > > > > > Java to handle job submission failures from the Flink
> client. It
> > > > would
> > > > > > be
> > > > > > > > useful to have the ability to configure exception
> classifiers and
> > > > retry
> > > > > > > > strategy as part of operator configuration.
> > > > > > > >
> > > > > > > > Given this will be in a separate Github repository I am
> curious
> > > how
> > > > > > > the
> > > > > > > > versioning strategy will work in relation to the Flink
> version?
> > > Do
> > > > we
> > > > > > > have
> > > > > > > > any other components with a similar setup I can look at?
> Will the
> > > > > > > operator
> > > > > > > > version track Flink or will it use its own versioning
> strategy
> > > > with a
> > > > > > > Flink
> > > > > > > > version support matrix, or similar?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > balassi.marton@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi team,
> > > > > > > > >
> > > > > > > > > Thank you for the great feedback, Thomas has updated the
> FLIP
> > > > page
> > > > > > > > > accordingly. If you are comfortable with the currently
> existing
> > > > > > design
> > > > > > > > and
> > > > > > > > > depth in the FLIP [1] I suggest moving forward to the
> voting
> > > > stage -
> > > > > > > once
> > > > > > > > > that reaches a positive conclusion it lets us create the
> > > separate
> > > > > > code
> > > > > > > > > repository under the flink project for the operator.
> > > > > > > > >
> > > > > > > > > I encourage everyone to keep improving the details in the
> > > > meantime,
> > > > > > > > however
> > > > > > > > > I believe given the existing design and the general
> sentiment
> > > on
> > > > this
> > > > > > > > > thread that the most efficient path from here is starting
> the
> > > > > > > > > implementation so that we can collectively iterate over it.
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > >
> > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> thw@apache.org>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > > Hi Xintong,
> > > > > > > > > >
> > > > > > > > > > Thanks for the feedback and please see responses below
> -->
> > > > > > > > > >
> > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > > tonysong820@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks Thomas for drafting this FLIP, and everyone for
> the
> > > > > > > > discussion.
> > > > > > > > > > >
> > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > >
> > > > > > > > > > > ## Job Submission
> > > > > > > > > > > Deploying a Flink session cluster via kubectl & CR and
> then
> > > > > > > > submitting
> > > > > > > > > > jobs
> > > > > > > > > > > to the cluster via Flink cli / REST is probably the
> > > approach
> > > > that
> > > > > > > > > > requires
> > > > > > > > > > > the least effort. However, I'd like to point out 2
> > > > weaknesses.
> > > > > > > > > > > 1. A lot of users use Flink in perjob/application
> modes.
> > > For
> > > > > > these
> > > > > > > > > users,
> > > > > > > > > > > having to run the job in two steps (deploy the
> cluster, and
> > > > > > submit
> > > > > > > > the
> > > > > > > > > > job)
> > > > > > > > > > > is not that convenient.
> > > > > > > > > > > 2. One of our motivations is being able to manage Flink
> > > > > > > applications'
> > > > > > > > > > > lifecycles with kubectl. Submitting jobs from cli
> sounds
> > > not
> > > > > > > aligned
> > > > > > > > > with
> > > > > > > > > > > this motivation.
> > > > > > > > > > > I think it's probably worth it to support submitting
> jobs
> > > via
> > > > > > > > kubectl &
> > > > > > > > > > CR
> > > > > > > > > > > in the first version, both together with deploying the
> > > > cluster
> > > > > > like
> > > > > > > > in
> > > > > > > > > > > perjob/application mode and after deploying the cluster
> > > like
> > > > in
> > > > > > > > session
> > > > > > > > > > > mode.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > The intention is to support application management
> through
> > > > operator
> > > > > > > and
> > > > > > > > > CR,
> > > > > > > > > > which means there won't be any 2 step submission process,
> > > > which as
> > > > > > > you
> > > > > > > > > > allude to would defeat the purpose of this project. The
> CR
> > > > example
> > > > > > > > shows
> > > > > > > > > > the application part. Please note that the bare cluster
> > > > support is
> > > > > > an
> > > > > > > > > > *additional* feature for scenarios that require external
> job
> > > > > > > > management.
> > > > > > > > > Is
> > > > > > > > > > there anything on the FLIP page that creates a different
> > > > > > impression?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ## Versioning
> > > > > > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > > > > > > 1. Native K8s deployment was first introduced in
> Flink
> > > 1.10
> > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > 3. The Pod template support was introduced in Flink
> 1.13
> > > > > > > > > > > > 4. There were some changes to the Flink Docker image
> > > > entrypoint
> > > > > > > script
> > > > > > > > > in,
> > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Great, thanks for providing this. It is important for the
> > > > > > > compatibility
> > > > > > > > > > going forward also. We are targeting Flink 1.14.x
> upwards.
> > > > Before
> > > > > > the
> > > > > > > > > > operator is ready there will be another Flink release.
> Let's
> > > > see if
> > > > > > > > > anyone
> > > > > > > > > > is interested in earlier versions?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > What kind of API compatibility can we commit to? It's
> > > > probably
> > > > > > fine
> > > > > > > > to
> > > > > > > > > > have
> > > > > > > > > > > alpha / beta version APIs that allow incompatible
> future
> > > > changes
> > > > > > > for
> > > > > > > > > the
> > > > > > > > > > > first version. But eventually we would need to
> guarantee
> > > > > > backwards
> > > > > > > > > > > compatibility, so that an early version CR can work
> with a
> > > > new
> > > > > > > > version
> > > > > > > > > > > operator.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Another great point and please let me include that on the
> > > FLIP
> > > > > > page.
> > > > > > > > ;-)
> > > > > > > > > >
> > > > > > > > > > I think we should allow incompatible changes for the
> first
> > > one
> > > > or
> > > > > > two
> > > > > > > > > > versions, similar to how other major features have
> evolved
> > > > > > recently,
> > > > > > > > such
> > > > > > > > > > as FLIP-27.
> > > > > > > > > >
> > > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Thomas
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <
> > > thw@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > Maybe we should make this more clear in the FLIP
> but we
> > > > > > agreed
> > > > > > > to
> > > > > > > > > do
> > > > > > > > > > > the
> > > > > > > > > > > > > first version of the operator based on the native
> > > > > > integration.
> > > > > > > > > > > > > While this clearly does not cover all use-cases and
> > > > > > > requirements,
> > > > > > > > > it
> > > > > > > > > > > > seems
> > > > > > > > > > > > > this would lead to a much smaller initial effort
> and a
> > > > nicer
> > > > > > > > first
> > > > > > > > > > > > version.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I'm also leaning towards the native integration, as
> long
> > > > as it
> > > > > > > > > reduces
> > > > > > > > > > > the
> > > > > > > > > > > > MVP effort. Ultimately the operator will need to also
> > > > support
> > > > > > the
> > > > > > > > > > > > standalone mode. I would like to gain more confidence
> > > that
> > > > > > native
> > > > > > > > > > > > integration reduces the effort. While it cuts the
> effort
> > > to
> > > > > > > handle
> > > > > > > > > the
> > > > > > > > > > TM
> > > > > > > > > > > > pod creation, some mapping code from the CR to the
> native
> > > > > > > > integration
> > > > > > > > > > > > client and config needs to be created. As mentioned
> in
> > > the
> > > > > > FLIP,
> > > > > > > > > native
> > > > > > > > > > > > integration requires the Flink job manager to have
> access
> > > > to
> > > > > > the
> > > > > > > > k8s
> > > > > > > > > > API
> > > > > > > > > > > to
> > > > > > > > > > > > create pods, which in some scenarios may be seen as
> > > > > > unfavorable.
> > > > > > > > > > > >
> > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > Is the pod template in CR same with what Flink
> has
> > > > > > already
> > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > > cpu/memory
> > > > > > > > > > resources)
> > > > > > > > > > > > > could
> > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, pod template would look almost identical. There
> are
> > > a
> > > > few
> > > > > > > > > settings
> > > > > > > > > > > > that the operator will control (and that may need to
> be
> > > > > > > > blacklisted),
> > > > > > > > > > but
> > > > > > > > > > > > in general we would not want to place restrictions. I
> > > > think a
> > > > > > > > > mechanism
> > > > > > > > > > > > where a pod template is merged from multiple layers
> would
> > > > also
> > > > > > be
> > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Yangze Guo <ka...@gmail.com>.
Thanks everyone for the great effort. The FLIP looks really good.

I just want to confirm the configuration priority in the CR example.
It seems the requested resources or "taskManager.taskSlots" will be
translated to Flink internal config, e.g.
"taskmanager.memory.process.size" and "taskmanager.numberOfTaskSlots",
and will override the values in "flinkConfiguration". Am I understanding
this correctly?

Best,
Yangze Guo

On Mon, Feb 7, 2022 at 10:22 AM Xintong Song <to...@gmail.com> wrote:
>
> Sorry for the late reply. We were out due to the public holidays in China.
>
> @Thomas,
>
> The intention is to support application management through operator and CR,
> > which means there won't be any 2 step submission process, which as you
> > allude to would defeat the purpose of this project. The CR example shows
> > the application part. Please note that the bare cluster support is an
> > *additional* feature for scenarios that require external job management. Is
> > there anything on the FLIP page that creates a different impression?
> >
>
> Sounds good to me. I don't remember what created the impression of 2 step
> submission back then. I revisited the latest version of this FLIP and it
> looks good to me.
>
> @Gyula,
>
> Versioning:
> > Versioning will be independent from Flink and the operator will depend on a
> > fixed flink version (in every given operator version).
> > This should be the exact same setup as with Stateful Functions (
> > https://github.com/apache/flink-statefun). So independent release cycle
> > but
> > still within the Flink umbrella.
> >
>
> Does this mean if someone wants to upgrade Flink to a version that is
> released after the operator version that is being used, he/she would need
> to upgrade the operator version first?
> I'm not questioning this, just trying to make sure I'm understanding this
> correctly.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Feb 7, 2022 at 3:14 AM Gyula Fóra <gy...@gmail.com> wrote:
>
> > Thank you Alexis,
> >
> > Will definitely check this out. You are right, Kotlin makes it difficult to
> > adopt pieces of this code directly but I think it will be good to get
> > inspiration for the architecture and look at how particular problems have
> > been solved. It will be a great help for us I am sure.
> >
> > Cheers,
> > Gyula
> >
> > On Sat, Feb 5, 2022 at 12:28 PM Alexis Sarda-Espinosa <
> > alexis.sarda-espinosa@microfocus.com> wrote:
> >
> > > Hi everyone,
> > >
> > > just wanted to mention that my employer agreed to open source the PoC I
> > > developed: https://github.com/MicroFocus/opsb-flink-k8s-operator
> > >
> > > I understand the concern for maintainability, so Gradle & Kotlin might
> > not
> > > be appealing to you, but at least it gives you another reference. The
> > Helm
> > > resources in particular might be useful.
> > >
> > > There are bits and pieces there referring to Flink sessions, but those
> > are
> > > just placeholders, the functioning parts use application mode with native
> > > integration.
> > >
> > > Regards,
> > > Alexis.
> > >
> > > ________________________________
> > > From: Thomas Weise <th...@apache.org>
> > > Sent: Saturday, February 5, 2022 2:41 AM
> > > To: dev <de...@flink.apache.org>
> > > Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator
> > >
> > > Hi,
> > >
> > > Thanks for the continued feedback and discussion. Looks like we are
> > > ready to start a VOTE, I will initiate it shortly.
> > >
> > > In parallel it would be good to find the repository name.
> > >
> > > My suggestion would be: flink-kubernetes-operator
> > >
> > > I thought "flink-operator" could be a bit misleading since the term
> > > operator already has a meaning in Flink.
> > >
> > > I also considered "flink-k8s-operator" but that would be almost
> > > identical to existing operator implementations and could lead to
> > > confusion in the future.
> > >
> > > Thoughts?
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > >
> > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com> wrote:
> > > >
> > > > Hi Danny,
> > > >
> > > > So far we have been focusing our dev efforts on the initial native
> > > > implementation with the team.
> > > > If the discussion and vote goes well for this FLIP we are looking
> > forward
> > > > to contributing the initial version sometime next week (fingers
> > crossed).
> > > >
> > > > At that point I think we can already start the dev work to support the
> > > > standalone mode as well, especially if you can dedicate some effort to
> > > > pushing that side.
> > > > Working together on this sounds like a great idea and we should start
> > as
> > > > soon as possible! :)
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <da...@apache.org>
> > > > wrote:
> > > >
> > > > > I have been discussing this one with my team. We are interested in
> > the
> > > > > Standalone mode, and are willing to contribute towards the
> > > implementation.
> > > > > Potentially we can work together to support both modes in parallel?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gy...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Danny!
> > > > > >
> > > > > > Thanks for the feedback :)
> > > > > >
> > > > > > Versioning:
> > > > > > Versioning will be independent from Flink and the operator will
> > > depend
> > > > > on a
> > > > > > fixed flink version (in every given operator version).
> > > > > > This should be the exact same setup as with Stateful Functions (
> > > > > > https://github.com/apache/flink-statefun). So independent release
> > > cycle
> > > > > > but
> > > > > > still within the Flink umbrella.
> > > > > >
> > > > > > Deployment error handling:
> > > > > > I think that's a very good point, as general exception handling for
> > > the
> > > > > > different failure scenarios is a tricky problem. I think the
> > > exception
> > > > > > classifiers and retry strategies could avoid a lot of manual
> > > intervention
> > > > > > from the user. We will definitely need to add something like this.
> > > Once
> > > > > we
> > > > > > have the repo created with the initial operator code we should open
> > > some
> > > > > > tickets for this and put it on the short term roadmap!
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > dannycranmer@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Hey team,
> > > > > > >
> > > > > > > Great work on the FLIP, I am looking forward to this one. I agree
> > > that
> > > > > we
> > > > > > > can move forward to the voting stage.
> > > > > > >
> > > > > > > I have general feedback around how we will handle job submission
> > > > > failure
> > > > > > > and retry. As discussed in the Rejected Alternatives section, we
> > > can
> > > > > use
> > > > > > > Java to handle job submission failures from the Flink client. It
> > > would
> > > > > be
> > > > > > > useful to have the ability to configure exception classifiers and
> > > retry
> > > > > > > strategy as part of operator configuration.
> > > > > > >
> > > > > > > Given this will be in a separate Github repository I am curious
> > how
> > > > > > the
> > > > > > > versioning strategy will work in relation to the Flink version?
> > Do
> > > we
> > > > > > have
> > > > > > > any other components with a similar setup I can look at? Will the
> > > > > > operator
> > > > > > > version track Flink or will it use its own versioning strategy
> > > with a
> > > > > > Flink
> > > > > > > version support matrix, or similar?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > balassi.marton@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi team,
> > > > > > > >
> > > > > > > > Thank you for the great feedback, Thomas has updated the FLIP
> > > page
> > > > > > > > accordingly. If you are comfortable with the currently existing
> > > > > design
> > > > > > > and
> > > > > > > > depth in the FLIP [1] I suggest moving forward to the voting
> > > stage -
> > > > > > once
> > > > > > > > that reaches a positive conclusion it lets us create the
> > separate
> > > > > code
> > > > > > > > repository under the flink project for the operator.
> > > > > > > >
> > > > > > > > I encourage everyone to keep improving the details in the
> > > meantime,
> > > > > > > however
> > > > > > > > I believe given the existing design and the general sentiment
> > on
> > > this
> > > > > > > > thread that the most efficient path from here is starting the
> > > > > > > > implementation so that we can collectively iterate over it.
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > >
> > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <th...@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > HI Xintong,
> > > > > > > > >
> > > > > > > > > Thanks for the feedback and please see responses below -->
> > > > > > > > >
> > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > tonysong820@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the
> > > > > > > discussion.
> > > > > > > > > >
> > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > >
> > > > > > > > > > ## Job Submission
> > > > > > > > > > Deploying a Flink session cluster via kubectl & CR and then
> > > > > > > submitting
> > > > > > > > > jobs
> > > > > > > > > > to the cluster via Flink cli / REST is probably the
> > approach
> > > that
> > > > > > > > > requires
> > > > > > > > > > the least effort. However, I'd like to point out 2
> > > weaknesses.
> > > > > > > > > > 1. A lot of users use Flink in perjob/application modes.
> > For
> > > > > these
> > > > > > > > users,
> > > > > > > > > > having to run the job in two steps (deploy the cluster, and
> > > > > submit
> > > > > > > the
> > > > > > > > > job)
> > > > > > > > > > is not that convenient.
> > > > > > > > > > 2. One of our motivations is being able to manage Flink
> > > > > > applications'
> > > > > > > > > > lifecycles with kubectl. Submitting jobs from cli sounds
> > not
> > > > > > aligned
> > > > > > > > with
> > > > > > > > > > this motivation.
> > > > > > > > > > I think it's probably worth it to support submitting jobs
> > via
> > > > > > > kubectl &
> > > > > > > > > CR
> > > > > > > > > > in the first version, both together with deploying the
> > > cluster
> > > > > like
> > > > > > > in
> > > > > > > > > > perjob/application mode and after deploying the cluster
> > like
> > > in
> > > > > > > session
> > > > > > > > > > mode.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > The intention is to support application management through
> > > operator
> > > > > > and
> > > > > > > > CR,
> > > > > > > > > which means there won't be any 2 step submission process,
> > > which as
> > > > > > you
> > > > > > > > > allude to would defeat the purpose of this project. The CR
> > > example
> > > > > > > shows
> > > > > > > > > the application part. Please note that the bare cluster
> > > support is
> > > > > an
> > > > > > > > > *additional* feature for scenarios that require external job
> > > > > > > management.
> > > > > > > > Is
> > > > > > > > > there anything on the FLIP page that creates a different
> > > > > impression?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ## Versioning
> > > > > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > > > > 1. Native K8s deployment was firstly introduced in Flink
> > 1.10
> > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > > > > > > 4. There was some changes to the Flink docker image
> > > entrypoint
> > > > > > script
> > > > > > > > in,
> > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Great, thanks for providing this. It is important for the
> > > > > > compatibility
> > > > > > > > > going forward also. We are targeting Flink 1.14.x upwards.
> > > Before
> > > > > the
> > > > > > > > > operator is ready there will be another Flink release. Let's
> > > see if
> > > > > > > > anyone
> > > > > > > > > is interested in earlier versions?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ## Compatibility
> > > > > > > > > > What kind of API compatibility we can commit to? It's
> > > probably
> > > > > fine
> > > > > > > to
> > > > > > > > > have
> > > > > > > > > > alpha / beta version APIs that allow incompatible future
> > > changes
> > > > > > for
> > > > > > > > the
> > > > > > > > > > first version. But eventually we would need to guarantee
> > > > > backwards
> > > > > > > > > > compatibility, so that an early version CR can work with a
> > > new
> > > > > > > version
> > > > > > > > > > operator.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Another great point and please let me include that on the
> > FLIP
> > > > > page.
> > > > > > > ;-)
> > > > > > > > >
> > > > > > > > > I think we should allow incompatible changes for the first
> > one
> > > or
> > > > > two
> > > > > > > > > versions, similar to how other major features have evolved
> > > > > recently,
> > > > > > > such
> > > > > > > > > as FLIP-27.
> > > > > > > > >
> > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Thomas
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <
> > thw@apache.org
> > > >
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > Maybe we should make this more clear in the FLIP but we
> > > > > agreed
> > > > > > to
> > > > > > > > do
> > > > > > > > > > the
> > > > > > > > > > > > first version of the operator based on the native
> > > > > integration.
> > > > > > > > > > > > While this clearly does not cover all use-cases and
> > > > > > requirements,
> > > > > > > > it
> > > > > > > > > > > seems
> > > > > > > > > > > > this would lead to a much smaller initial effort and a
> > > nicer
> > > > > > > first
> > > > > > > > > > > version.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'm also leaning towards the native integration, as long
> > > as it
> > > > > > > > reduces
> > > > > > > > > > the
> > > > > > > > > > > MVP effort. Ultimately the operator will need to also
> > > support
> > > > > the
> > > > > > > > > > > standalone mode. I would like to gain more confidence
> > that
> > > > > native
> > > > > > > > > > > integration reduces the effort. While it cuts the effort
> > to
> > > > > > handle
> > > > > > > > the
> > > > > > > > > TM
> > > > > > > > > > > pod creation, some mapping code from the CR to the native
> > > > > > > integration
> > > > > > > > > > > client and config needs to be created. As mentioned in
> > the
> > > > > FLIP,
> > > > > > > > native
> > > > > > > > > > > integration requires the Flink job manager to have access
> > > to
> > > > > the
> > > > > > > k8s
> > > > > > > > > API
> > > > > > > > > > to
> > > > > > > > > > > create pods, which in some scenarios may be seen as
> > > > > unfavorable.
> > > > > > > > > > >
> > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > Is the pod template in CR same with what Flink has
> > > > > already
> > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > cpu/memory
> > > > > > > > > resources)
> > > > > > > > > > > > could
> > > > > > > > > > > > > > take effect.
> > > > > > > > > > >
> > > > > > > > > > > Yes, pod template would look almost identical. There are
> > a
> > > few
> > > > > > > > settings
> > > > > > > > > > > that the operator will control (and that may need to be
> > > > > > > blacklisted),
> > > > > > > > > but
> > > > > > > > > > > in general we would not want to place restrictions. I
> > > think a
> > > > > > > > mechanism
> > > > > > > > > > > where a pod template is merged from multiple layers would
> > > also
> > > > > be
> > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Thomas
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Xintong Song <to...@gmail.com>.
Sorry for the late reply. We were out due to the public holidays in China.

@Thomas,

> The intention is to support application management through operator and CR,
> which means there won't be any 2 step submission process, which as you
> allude to would defeat the purpose of this project. The CR example shows
> the application part. Please note that the bare cluster support is an
> *additional* feature for scenarios that require external job management. Is
> there anything on the FLIP page that creates a different impression?
>

Sounds good to me. I don't remember what created the impression of a two-step
submission back then. I revisited the latest version of this FLIP and it
looks good to me.

@Gyula,

> Versioning:
> Versioning will be independent from Flink and the operator will depend on a
> fixed flink version (in every given operator version).
> This should be the exact same setup as with Stateful Functions (
> https://github.com/apache/flink-statefun). So independent release cycle but
> still within the Flink umbrella.
>

Does this mean that if someone wants to upgrade Flink to a version released
after the operator version they are using, they would need to upgrade the
operator first?
I'm not questioning this, just trying to make sure I understand it
correctly.

Thank you~

Xintong Song




Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
Thank you Alexis,

Will definitely check this out. You are right, Kotlin makes it difficult to
adopt pieces of this code directly, but I think it will be good to get
inspiration for the architecture and look at how particular problems have
been solved. I am sure it will be a great help for us.

Cheers,
Gyula

On Sat, Feb 5, 2022 at 12:28 PM Alexis Sarda-Espinosa <
alexis.sarda-espinosa@microfocus.com> wrote:

> Hi everyone,
>
> just wanted to mention that my employer agreed to open source the PoC I
> developed: https://github.com/MicroFocus/opsb-flink-k8s-operator
>
> I understand the concern for maintainability, so Gradle & Kotlin might not
> be appealing to you, but at least it gives you another reference. The Helm
> resources in particular might be useful.
>
> There are bits and pieces there referring to Flink sessions, but those are
> just placeholders, the functioning parts use application mode with native
> integration.
>
> Regards,
> Alexis.
>
> ________________________________
> From: Thomas Weise <th...@apache.org>
> Sent: Saturday, February 5, 2022 2:41 AM
> To: dev <de...@flink.apache.org>
> Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator
>
> Hi,
>
> Thanks for the continued feedback and discussion. Looks like we are
> ready to start a VOTE, I will initiate it shortly.
>
> In parallel it would be good to find the repository name.
>
> My suggestion would be: flink-kubernetes-operator
>
> I thought "flink-operator" could be a bit misleading since the term
> operator already has a meaning in Flink.
>
> I also considered "flink-k8s-operator" but that would be almost
> identical to existing operator implementations and could lead to
> confusion in the future.
>
> Thoughts?
>
> Thanks,
> Thomas
>
>
>
> On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > Hi Danny,
> >
> > So far we have been focusing our dev efforts on the initial native
> > implementation with the team.
> > If the discussion and vote goes well for this FLIP we are looking forward
> > to contributing the initial version sometime next week (fingers crossed).
> >
> > At that point I think we can already start the dev work to support the
> > standalone mode as well, especially if you can dedicate some effort to
> > pushing that side.
> > Working together on this sounds like a great idea and we should start as
> > soon as possible! :)
> >
> > Cheers,
> > Gyula
> >
> > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <da...@apache.org>
> > wrote:
> >
> > > I have been discussing this one with my team. We are interested in the
> > > Standalone mode, and are willing to contribute towards the implementation.
> > > Potentially we can work together to support both modes in parallel?
> > >
> > > Thanks,
> > >
> > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gy...@gmail.com> wrote:
> > >
> > > > Hi Danny!
> > > >
> > > > Thanks for the feedback :)
> > > >
> > > > Versioning:
> > > > Versioning will be independent from Flink and the operator will depend on a
> > > > fixed flink version (in every given operator version).
> > > > This should be the exact same setup as with Stateful Functions (
> > > > https://github.com/apache/flink-statefun). So independent release cycle but
> > > > still within the Flink umbrella.
> > > >
> > > > Deployment error handling:
> > > > I think that's a very good point, as general exception handling for the
> > > > different failure scenarios is a tricky problem. I think the exception
> > > > classifiers and retry strategies could avoid a lot of manual intervention
> > > > from the user. We will definitely need to add something like this. Once we
> > > > have the repo created with the initial operator code we should open some
> > > > tickets for this and put it on the short term roadmap!
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <dannycranmer@apache.org>
> > > > wrote:
> > > >
> > > > > Hey team,
> > > > >
> > > > > Great work on the FLIP, I am looking forward to this one. I agree that we
> > > > > can move forward to the voting stage.
> > > > >
> > > > > I have general feedback around how we will handle job submission failure
> > > > > and retry. As discussed in the Rejected Alternatives section, we can use
> > > > > Java to handle job submission failures from the Flink client. It would be
> > > > > useful to have the ability to configure exception classifiers and retry
> > > > > strategy as part of operator configuration.
> > > > >
> > > > > Given this will be in a separate GitHub repository I am curious how the
> > > > > versioning strategy will work in relation to the Flink version? Do we have
> > > > > any other components with a similar setup I can look at? Will the operator
> > > > > version track Flink or will it use its own versioning strategy with a Flink
> > > > > version support matrix, or similar?
> > > > >
> > > > > Thanks,
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <balassi.marton@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi team,
> > > > > >
> > > > > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > > > > accordingly. If you are comfortable with the currently existing design and
> > > > > > depth in the FLIP [1] I suggest moving forward to the voting stage - once
> > > > > > that reaches a positive conclusion it lets us create the separate code
> > > > > > repository under the flink project for the operator.
> > > > > >
> > > > > > I encourage everyone to keep improving the details in the meantime, however
> > > > > > I believe given the existing design and the general sentiment on this
> > > > > > thread that the most efficient path from here is starting the
> > > > > > implementation so that we can collectively iterate over it.
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > >
> > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <th...@apache.org> wrote:
> > > > > > >
> > > > > > > Hi Xintong,
> > > > > > >
> > > > > > > Thanks for the feedback and please see responses below -->
> > > > > > >
> > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <tonysong820@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the discussion.
> > > > > > > >
> > > > > > > > I also have a few questions and comments.
> > > > > > > >
> > > > > > > > ## Job Submission
> > > > > > > > Deploying a Flink session cluster via kubectl & CR and then submitting jobs
> > > > > > > > to the cluster via Flink cli / REST is probably the approach that requires
> > > > > > > > the least effort. However, I'd like to point out 2 weaknesses.
> > > > > > > > 1. A lot of users use Flink in perjob/application modes. For these users,
> > > > > > > > having to run the job in two steps (deploy the cluster, and submit the job)
> > > > > > > > is not that convenient.
> > > > > > > > 2. One of our motivations is being able to manage Flink applications'
> > > > > > > > lifecycles with kubectl. Submitting jobs from cli sounds not aligned with
> > > > > > > > this motivation.
> > > > > > > > I think it's probably worth it to support submitting jobs via kubectl & CR
> > > > > > > > in the first version, both together with deploying the cluster like in
> > > > > > > > perjob/application mode and after deploying the cluster like in session
> > > > > > > > mode.
> > > > > > > >
> > > > > > >
> > > > > > > The intention is to support application management through
> operator
> > > > and
> > > > > > CR,
> > > > > > > which means there won't be any 2 step submission process,
> which as
> > > > you
> > > > > > > allude to would defeat the purpose of this project. The CR
> example
> > > > > shows
> > > > > > > the application part. Please note that the bare cluster
> support is
> > > an
> > > > > > > *additional* feature for scenarios that require external job
> > > > > management.
> > > > > > Is
> > > > > > > there anything on the FLIP page that creates a different
> > > impression?
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > ## Versioning
> > > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > > 1. Native K8s deployment was firstly introduced in Flink 1.10
> > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > > > > > 4. There were some changes to the Flink docker image
> entrypoint
> > > > script
> > > > > > in,
> > > > > > > > IIRC, Flink 1.13
> > > > > > > >
> > > > > > >
> > > > > > > Great, thanks for providing this. It is important for the
> > > > compatibility
> > > > > > > going forward also. We are targeting Flink 1.14.x upwards.
> Before
> > > the
> > > > > > > operator is ready there will be another Flink release. Let's
> see if
> > > > > > anyone
> > > > > > > is interested in earlier versions?
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > ## Compatibility
> > > > > > > > > What kind of API compatibility can we commit to? It's
> probably
> > > fine
> > > > > to
> > > > > > > have
> > > > > > > > alpha / beta version APIs that allow incompatible future
> changes
> > > > for
> > > > > > the
> > > > > > > > first version. But eventually we would need to guarantee
> > > backwards
> > > > > > > > compatibility, so that an early version CR can work with a
> new
> > > > > version
> > > > > > > > operator.
> > > > > > > >
> > > > > > >
> > > > > > > Another great point and please let me include that on the FLIP
> > > page.
> > > > > ;-)
> > > > > > >
> > > > > > > I think we should allow incompatible changes for the first one
> or
> > > two
> > > > > > > versions, similar to how other major features have evolved
> > > recently,
> > > > > such
> > > > > > > as FLIP-27.
> > > > > > >
> > > > > > > Would be great to get broader feedback on this one.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Thomas
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <thw@apache.org
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback!
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > Maybe we should make this more clear in the FLIP but we
> > > agreed
> > > > to
> > > > > > do
> > > > > > > > the
> > > > > > > > > > first version of the operator based on the native
> > > integration.
> > > > > > > > > > While this clearly does not cover all use-cases and
> > > > requirements,
> > > > > > it
> > > > > > > > > seems
> > > > > > > > > > this would lead to a much smaller initial effort and a
> nicer
> > > > > first
> > > > > > > > > version.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm also leaning towards the native integration, as long
> as it
> > > > > > reduces
> > > > > > > > the
> > > > > > > > > MVP effort. Ultimately the operator will need to also
> support
> > > the
> > > > > > > > > standalone mode. I would like to gain more confidence that
> > > native
> > > > > > > > > integration reduces the effort. While it cuts the effort to
> > > > handle
> > > > > > the
> > > > > > > TM
> > > > > > > > > pod creation, some mapping code from the CR to the native
> > > > > integration
> > > > > > > > > client and config needs to be created. As mentioned in the
> > > FLIP,
> > > > > > native
> > > > > > > > > integration requires the Flink job manager to have access
> to
> > > the
> > > > > k8s
> > > > > > > API
> > > > > > > > to
> > > > > > > > > create pods, which in some scenarios may be seen as
> > > unfavorable.
> > > > > > > > >
> > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > Is the pod template in CR same with what Flink has
> > > already
> > > > > > > > > > supported[4]?
> > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> cpu/memory
> > > > > > > resources)
> > > > > > > > > > could
> > > > > > > > > > > > take effect.
> > > > > > > > >
> > > > > > > > > Yes, pod template would look almost identical. There are a
> few
> > > > > > settings
> > > > > > > > > that the operator will control (and that may need to be
> > > > > blacklisted),
> > > > > > > but
> > > > > > > > > in general we would not want to place restrictions. I
> think a
> > > > > > mechanism
> > > > > > > > > where a pod template is merged from multiple layers would
> also
> > > be
> > > > > > > > > interesting to make this more flexible.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Thomas
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Alexis Sarda-Espinosa <al...@microfocus.com>.
Hi everyone,

Just wanted to mention that my employer agreed to open source the PoC I developed: https://github.com/MicroFocus/opsb-flink-k8s-operator

I understand the concern for maintainability, so Gradle & Kotlin might not be appealing to you, but at least it gives you another reference. The Helm resources in particular might be useful.

There are bits and pieces there referring to Flink sessions, but those are just placeholders; the functioning parts use application mode with native integration.

Regards,
Alexis.

________________________________
From: Thomas Weise <th...@apache.org>
Sent: Saturday, February 5, 2022 2:41 AM
To: dev <de...@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Hi,

Thanks for the continued feedback and discussion. Looks like we are
ready to start a VOTE, I will initiate it shortly.

In parallel it would be good to find the repository name.

My suggestion would be: flink-kubernetes-operator

I thought "flink-operator" could be a bit misleading since the term
operator already has a meaning in Flink.

I also considered "flink-k8s-operator" but that would be almost
identical to existing operator implementations and could lead to
confusion in the future.

Thoughts?

Thanks,
Thomas



On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com> wrote:
>
> Hi Danny,
>
> So far we have been focusing our dev efforts on the initial native
> implementation with the team.
> If the discussion and vote goes well for this FLIP we are looking forward
> to contributing the initial version sometime next week (fingers crossed).
>
> At that point I think we can already start the dev work to support the
> standalone mode as well, especially if you can dedicate some effort to
> pushing that side.
> Working together on this sounds like a great idea and we should start as
> soon as possible! :)
>
> Cheers,
> Gyula
>
> On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <da...@apache.org>
> wrote:
>
> > I have been discussing this one with my team. We are interested in the
> > Standalone mode, and are willing to contribute towards the implementation.
> > Potentially we can work together to support both modes in parallel?
> >
> > Thanks,
> >
> > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > Hi Danny!
> > >
> > > Thanks for the feedback :)
> > >
> > > Versioning:
> > > Versioning will be independent from Flink and the operator will depend
> > on a
> > > fixed flink version (in every given operator version).
> > > This should be the exact same setup as with Stateful Functions (
> > > https://github.com/apache/flink-statefun). So independent release cycle
> > > but
> > > still within the Flink umbrella.
> > >
> > > Deployment error handling:
> > > I think that's a very good point, as general exception handling for the
> > > different failure scenarios is a tricky problem. I think the exception
> > > classifiers and retry strategies could avoid a lot of manual intervention
> > > from the user. We will definitely need to add something like this. Once
> > we
> > > have the repo created with the initial operator code we should open some
> > > tickets for this and put it on the short term roadmap!
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <da...@apache.org>
> > > wrote:
> > >
> > > > Hey team,
> > > >
> > > > Great work on the FLIP, I am looking forward to this one. I agree that
> > we
> > > > can move forward to the voting stage.
> > > >
> > > > I have general feedback around how we will handle job submission
> > failure
> > > > and retry. As discussed in the Rejected Alternatives section, we can
> > use
> > > > Java to handle job submission failures from the Flink client. It would
> > be
> > > > useful to have the ability to configure exception classifiers and retry
> > > > strategy as part of operator configuration.
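
One way to picture the configurable exception classifiers and retry strategy mentioned above is the minimal sketch below. It is written in Python purely for brevity (the operator itself is Java), and every name in it (RetryPolicy, submit_with_retry, the choice of ConnectionError and TimeoutError as retryable) is a hypothetical illustration, not something taken from the FLIP or the prototype.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple, Type


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical operator-level retry settings (all names are illustrative)."""
    max_attempts: int = 3
    initial_backoff_s: float = 1.0
    backoff_multiplier: float = 2.0
    # Exception types classified as transient, i.e. worth retrying.
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError)


def submit_with_retry(submit: Callable[[], str], policy: RetryPolicy,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `submit` until it succeeds, a non-retryable error occurs,
    or the attempt budget is exhausted."""
    backoff = policy.initial_backoff_s
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return submit()
        except policy.retryable:
            if attempt == policy.max_attempts:
                raise  # budget exhausted: surface the last transient error
            sleep(backoff)
            backoff *= policy.backoff_multiplier
        # Any other exception is treated as permanent and propagates at once.
    raise ValueError("max_attempts must be at least 1")
```

The classifier here is just a tuple of exception types; a real operator would probably need something richer (for example matching on error messages or causes), which is part of what would need to be designed.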
> > > >
> > > > Given this will be in a separate GitHub repository I am curious how the
> > > > versioning strategy will work in relation to the Flink version? Do we
> > > have
> > > > any other components with a similar setup I can look at? Will the
> > > operator
> > > > version track Flink or will it use its own versioning strategy with a
> > > Flink
> > > > version support matrix, or similar?
> > > >
> > > > Thanks,
> > > >
> > > >
> > > >
> > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > balassi.marton@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi team,
> > > > >
> > > > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > > > accordingly. If you are comfortable with the currently existing
> > design
> > > > and
> > > > > depth in the FLIP [1] I suggest moving forward to the voting stage -
> > > once
> > > > > that reaches a positive conclusion it lets us create the separate
> > code
> > > > > repository under the flink project for the operator.
> > > > >
> > > > > I encourage everyone to keep improving the details in the meantime,
> > > > however
> > > > > I believe given the existing design and the general sentiment on this
> > > > > thread that the most efficient path from here is starting the
> > > > > implementation so that we can collectively iterate over it.
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > >
> > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > > > Hi Xintong,
> > > > > >
> > > > > > Thanks for the feedback and please see responses below -->
> > > > > >
> > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > tonysong820@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the
> > > > discussion.
> > > > > > >
> > > > > > > I also have a few questions and comments.
> > > > > > >
> > > > > > > ## Job Submission
> > > > > > > Deploying a Flink session cluster via kubectl & CR and then
> > > > submitting
> > > > > > jobs
> > > > > > > to the cluster via Flink cli / REST is probably the approach that
> > > > > > requires
> > > > > > > the least effort. However, I'd like to point out 2 weaknesses.
> > > > > > > 1. A lot of users use Flink in perjob/application modes. For
> > these
> > > > > users,
> > > > > > > having to run the job in two steps (deploy the cluster, and
> > submit
> > > > the
> > > > > > job)
> > > > > > > is not that convenient.
> > > > > > > 2. One of our motivations is being able to manage Flink
> > > applications'
> > > > > > > lifecycles with kubectl. Submitting jobs from cli sounds not
> > > aligned
> > > > > with
> > > > > > > this motivation.
> > > > > > > I think it's probably worth it to support submitting jobs via
> > > > kubectl &
> > > > > > CR
> > > > > > > in the first version, both together with deploying the cluster
> > like
> > > > in
> > > > > > > perjob/application mode and after deploying the cluster like in
> > > > session
> > > > > > > mode.
> > > > > > >
> > > > > >
> > > > > > The intention is to support application management through operator
> > > and
> > > > > CR,
> > > > > > which means there won't be any 2 step submission process, which as
> > > you
> > > > > > allude to would defeat the purpose of this project. The CR example
> > > > shows
> > > > > > the application part. Please note that the bare cluster support is
> > an
> > > > > > *additional* feature for scenarios that require external job
> > > > management.
> > > > > Is
> > > > > > there anything on the FLIP page that creates a different
> > impression?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > ## Versioning
> > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > 1. Native K8s deployment was firstly introduced in Flink 1.10
> > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > > > 4. There were some changes to the Flink docker image entrypoint
> > > script
> > > > > in,
> > > > > > > IIRC, Flink 1.13
> > > > > > >
> > > > > >
> > > > > > Great, thanks for providing this. It is important for the
> > > compatibility
> > > > > > going forward also. We are targeting Flink 1.14.x upwards. Before
> > the
> > > > > > operator is ready there will be another Flink release. Let's see if
> > > > > anyone
> > > > > > is interested in earlier versions?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > ## Compatibility
> > > > > > > What kind of API compatibility can we commit to? It's probably
> > fine
> > > > to
> > > > > > have
> > > > > > > alpha / beta version APIs that allow incompatible future changes
> > > for
> > > > > the
> > > > > > > first version. But eventually we would need to guarantee
> > backwards
> > > > > > > compatibility, so that an early version CR can work with a new
> > > > version
> > > > > > > operator.
> > > > > > >
> > > > > >
> > > > > > Another great point and please let me include that on the FLIP
> > page.
> > > > ;-)
> > > > > >
> > > > > > I think we should allow incompatible changes for the first one or
> > two
> > > > > > versions, similar to how other major features have evolved
> > recently,
> > > > such
> > > > > > as FLIP-27.
> > > > > >
> > > > > > Would be great to get broader feedback on this one.
> > > > > >
> > > > > > Cheers,
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <th...@apache.org>
> > > wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback!
> > > > > > > >
> > > > > > > > >
> > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > Maybe we should make this more clear in the FLIP but we
> > agreed
> > > to
> > > > > do
> > > > > > > the
> > > > > > > > > first version of the operator based on the native
> > integration.
> > > > > > > > > While this clearly does not cover all use-cases and
> > > requirements,
> > > > > it
> > > > > > > > seems
> > > > > > > > > this would lead to a much smaller initial effort and a nicer
> > > > first
> > > > > > > > version.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm also leaning towards the native integration, as long as it
> > > > > reduces
> > > > > > > the
> > > > > > > > MVP effort. Ultimately the operator will need to also support
> > the
> > > > > > > > standalone mode. I would like to gain more confidence that
> > native
> > > > > > > > integration reduces the effort. While it cuts the effort to
> > > handle
> > > > > the
> > > > > > TM
> > > > > > > > pod creation, some mapping code from the CR to the native
> > > > integration
> > > > > > > > client and config needs to be created. As mentioned in the
> > FLIP,
> > > > > native
> > > > > > > > integration requires the Flink job manager to have access to
> > the
> > > > k8s
> > > > > > API
> > > > > > > to
> > > > > > > > create pods, which in some scenarios may be seen as
> > unfavorable.
> > > > > > > >
> > > > > > > >  > > > # Pod Template
> > > > > > > > > > > Is the pod template in CR same with what Flink has
> > already
> > > > > > > > > supported[4]?
> > > > > > > > > > > Then I am afraid not the arbitrary field(e.g. cpu/memory
> > > > > > resources)
> > > > > > > > > could
> > > > > > > > > > > take effect.
> > > > > > > >
> > > > > > > > Yes, pod template would look almost identical. There are a few
> > > > > settings
> > > > > > > > that the operator will control (and that may need to be
> > > > blacklisted),
> > > > > > but
> > > > > > > > in general we would not want to place restrictions. I think a
> > > > > mechanism
> > > > > > > > where a pod template is merged from multiple layers would also
> > be
> > > > > > > > interesting to make this more flexible.
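
The idea of merging a pod template from multiple layers can be sketched as a recursive dictionary merge. This is only an illustration under stated assumptions: the function name merge_layers and the layer contents are hypothetical, Python is used for brevity, and lists are simply replaced rather than merged by name as a Kubernetes strategic merge would do.

```python
import copy
from typing import Any, Dict


def merge_layers(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge `override` into a copy of `base`.

    Nested dicts are merged key by key; any other value (lists included)
    from `override` replaces the value in `base`.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_layers(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical layers: a template shared by all pods plus a JobManager-specific one.
common = {"metadata": {"labels": {"team": "data"}}, "spec": {"serviceAccount": "flink"}}
job_manager = {"metadata": {"labels": {"role": "jobmanager"}},
               "spec": {"nodeSelector": {"pool": "jm"}}}
effective = merge_layers(common, job_manager)
```

Settings that the operator controls itself could then be applied as a final layer, so that they always win over user-supplied values.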
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Thomas
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Márton Balassi <ba...@gmail.com>.
Good catch, Yang Wang and Gyula, on the Java version. I personally prefer
that we simply do not support Java 8 for the operator; since it is a net
new project, we are better off starting support at Java 11 right away.

As Gyula outlined above, it is important to note that it only affects the
operator (and the operator container image), not existing or new Flink jobs.

On Tue, Feb 15, 2022 at 1:50 PM Gyula Fóra <gy...@gmail.com> wrote:

> Hi Devs,
>
> Yang Wang discovered that the current prototype is not compatible with Java
> 8, only with Java 11 and upwards.
>
> The reason for this is that the Java Operator SDK itself is unfortunately not
> Java 8 compatible.
>
> Given that Java 8 is on the road to deprecation and that the operator runs
> as a containerized deployment, are there any concerns regarding making the
> target Java version 11?
> This affects only the Kubernetes operator itself; deployed Flink clusters and
> jobs should still work with Java 8.
>
> Cheers,
> Gyula
>
>
> On Tue, Feb 15, 2022 at 1:06 PM Yang Wang <da...@gmail.com> wrote:
>
> > I also lean towards not introducing the savepoint/checkpoint related fields
> > to the job spec, especially in the very beginning of flink-kubernetes-operator.
> >
> >
> > Best,
> > Yang
> >
> > On Tue, Feb 15, 2022 at 7:02 PM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > Hi Peng Yuan!
> > >
> > > While I do agree that savepoint path is a very important production
> > > configuration there are a lot of other things that come to my mind:
> > >  - savepoint dir
> > >  - checkpoint dir
> > >  - checkpoint interval/timeout
> > >  - high availability settings (provider/storagedir etc)
> > >
> > > just to name a few...
> > >
> > > While these are all production critical, they have nice clean Flink config
> > > settings to go with them. If we start introducing these to the job spec we
> > > only get confusion about priority order etc., and it is going to be hard to
> > > change or remove them in the future. In any case we should validate that
> > > these configs exist in cases where users use a stateful upgrade mode, for
> > > example.
> > > This is something we need to add for sure.
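
The validation described above (checking that the relevant configs exist when a stateful upgrade mode is used) could look roughly like the sketch below. It is in Python for brevity only. The configuration keys are existing Flink options, but which of them would be required, the upgrade mode names, and the function name are all assumptions made for illustration.

```python
from typing import Dict, List

# Existing Flink configuration keys; treating exactly these two as mandatory
# for stateful upgrades is an assumption of this sketch.
REQUIRED_FOR_STATEFUL_UPGRADE = (
    "state.savepoints.dir",
    "state.checkpoints.dir",
)


def validate_upgrade_config(upgrade_mode: str, flink_conf: Dict[str, str]) -> List[str]:
    """Return human-readable errors; an empty list means the config is acceptable.

    `upgrade_mode` values are hypothetical: "stateless" needs no state-related
    config, anything else is treated as stateful.
    """
    if upgrade_mode == "stateless":
        return []
    return [
        f"'{key}' must be set in the Flink configuration when upgrade mode is '{upgrade_mode}'"
        for key in REQUIRED_FOR_STATEFUL_UPGRADE
        if not flink_conf.get(key, "").strip()
    ]
```

Keeping the settings in the Flink configuration and only validating them avoids having two places that can define the same value.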
> > >
> > > As for the other options you mentioned like automatic savepoint
> > generation
> > > for instance, those deserve an independent discussion of their own I
> > > believe :)
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Tue, Feb 15, 2022 at 11:23 AM K Fred <yu...@gmail.com>
> wrote:
> > >
> > > > Hi Matyas!
> > > >
> > > > Thanks for your reply!
> > > > For scenarios 1 and 3, I couldn't agree more with the podTemplate
> > > > solution; I missed this part.
> > > > For savepoint related configuration, I think it's very important that it
> > > > can be specified in the JobSpec, because savepoints are a very common
> > > > configuration for upgrading a job; if it is placed in the JobSpec it is
> > > > clearly visible and configurable by the user. In addition, other advanced
> > > > properties can be put into flinkConfiguration, customized by expert users.
> > > > A bunch of savepoint configuration as follows:
> > > >
> > > > > fromSavepoint: Job restart from
> > > > >
> > > > > autoSavepointSecond: Automatically take a savepoint to the
> > > > > `savepointsDir` every n seconds.
> > > > >
> > > > > savepointsDir: Savepoints dir where to store automatically taken
> > > > > savepoints
> > > > >
> > > > > savepointGeneration: Update savepoint generation of job status for a
> > > > > running job (should be defined in JobStatus)
> > > >
> > > > Best wishes,
> > > > Peng Yuan.
> > > >
> > > > On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <
> matyas.orhidi@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi Peng,
> > > > >
> > > > > Thanks for your feedback. Regarding scenarios 1 and 3, the podTemplate
> > > > > functionality in the operator could cover both. We also need to be
> > > > > careful about introducing proxy parameters in the CRD spec. The savepoint
> > > > > path is usually accompanied by a bunch of other configurations, for
> > > > > example, so users need to use configuration params anyway. What do you
> > > > > think?
> > > > >
> > > > > Best,
> > > > > Matyas
> > > > >
> > > > > On Tue, Feb 15, 2022 at 8:58 AM K Fred <yu...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi Gyula!
> > > > > >
> > > > > > I have reviewed the prototype design of flink-kubernetes-operator
> > you
> > > > > > submitted, and I have the following questions:
> > > > > >
> > > > > > 1. Can support for pulling the Flink jar package from a sidecar be
> > > > > > added to the JobSpec? Just like this:
> > > > > >
> > > > > > > initContainers:
> > > > > > >       - name: downloader
> > > > > > >         image: curlimages/curl
> > > > > > >         env:
> > > > > > >           - name: JAR_URL
> > > > > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > > > > >           - name: DEST_PATH
> > > > > > >             value: /cache/flink-app.jar
> > > > > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > > > > >
> > > > > > 2. Can we add a savepoint path property to the job specification?
> > > > > > 3. Can we add an extra port to the JobManagerSpec and TaskManagerSpec to
> > > > > > expose some service, such as Prometheus? The property could look like this:
> > > > > >
> > > > > > > extraPorts:
> > > > > > >       - name: prom
> > > > > > >         containerPort: 9249
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best wishes,
> > > > > > Peng Yuan
> > > > > >
> > > > > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gy...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > Hi Flink Devs!
> > > > > > >
> > > > > > > We would like to present to you the first prototype of the
> > > > > > > flink-kubernetes-operator that was built based on the FLIP and the
> > > > > > > discussion on this mail thread. We would also like to call out some
> > > > > > > design decisions that we have made regarding architecture components
> > > > > > > that were not explicitly mentioned in the FLIP document/thread and give
> > > > > > > you the opportunity to raise any concerns here.
> > > > > > >
> > > > > > > You can find the initial prototype here:
> > > > > > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > > > > >
> > > > > > > We will leave the PR open for 1-2 days before merging to let people
> > > > > > > comment on it, but please be mindful that this is an initial prototype
> > > > > > > with many rough edges. It is not intended to be a complete
> > > > > > > implementation of the FLIP specs as that will take some more work from
> > > > > > > all of us :)
> > > > > > >
> > > > > > >
> > > > > > > *Prototype feature set:*
> > > > > > > The prototype contains a basic working version of the
> > > > > > > flink-kubernetes-operator that supports deployment and lifecycle
> > > > > > > management of a stateful native flink application. We have basic
> > > > > > > support for stateful and stateless upgrades, UI ingress, pod templates
> > > > > > > etc. Error handling at this point is largely missing.
> > > > > > >
> > > > > > >
> > > > > > > *Features / design decisions that were not explicitly discussed in
> > > > > > > this thread*
> > > > > > >
> > > > > > > *Basic Admission control using a Webhook*
> > > > > > > Standard resource admission control in Kubernetes to validate and
> > > > > > > potentially reject resources is done through Webhooks.
> > > > > > > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > > > > This is a necessary mechanism to give the user an upfront error when
> > > > > > > an incorrect resource was submitted. In the Flink operator's case we
> > > > > > > need to validate that the FlinkDeployment yaml actually makes sense
> > > > > > > and does not contain erroneous config options that would inevitably
> > > > > > > lead to deployment/job failures.
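
To make the admission flow concrete, here is a minimal sketch of how a validating webhook turns an AdmissionReview request into an allow/deny response. The request/response envelope (apiVersion admission.k8s.io/v1, uid, allowed, status) follows the Kubernetes admission webhook contract; everything else is assumed for illustration: the checked fields (spec.image, spec.job.parallelism), the function names, and the use of Python instead of the Java/Netty endpoint of the prototype.

```python
from typing import Any, Dict, List


def validate_deployment(resource: Dict[str, Any]) -> List[str]:
    """Hypothetical checks on a FlinkDeployment-like resource (field names assumed)."""
    errors = []
    spec = resource.get("spec", {})
    parallelism = spec.get("job", {}).get("parallelism", 1)
    if not isinstance(parallelism, int) or parallelism < 1:
        errors.append("spec.job.parallelism must be a positive integer")
    if not spec.get("image"):
        errors.append("spec.image must be set")
    return errors


def review(admission_review: Dict[str, Any]) -> Dict[str, Any]:
    """Build the AdmissionReview response for a validating webhook request."""
    request = admission_review["request"]
    errors = validate_deployment(request.get("object") or {})
    # The response must echo the request uid and state whether it is allowed.
    response: Dict[str, Any] = {"uid": request["uid"], "allowed": not errors}
    if errors:
        # The message is what the user sees from kubectl on rejection.
        response["status"] = {"code": 400, "message": "; ".join(errors)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

The point of rejecting at this stage is that the user gets the error directly from kubectl, instead of discovering it later from a failed deployment.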
> > > > > > >
> > > > > > > We have implemented a simple webhook that we can use for this type of
> > > > > > > validation, as a separate maven module (flink-kubernetes-webhook). The
> > > > > > > webhook is an optional component and can be enabled or disabled during
> > > > > > > deployment. To avoid pulling in new external dependencies we have used
> > > > > > > the Flink Shaded Netty module to build the simple rest endpoint
> > > > > > > required. If the community feels that Netty adds unnecessary
> > > > > > > complexity to the webhook implementation we are open to alternative
> > > > > > > backends such as Spring Boot, for instance, which would practically
> > > > > > > eliminate all the boilerplate.
> > > > > > >
> > > > > > >
> > > > > > > *Helm Chart for deployment*
> > > > > > > Helm charts provide an industry standard way of managing kubernetes
> > > > > > > deployments. We have created a helm chart prototype that can be used
> > > > > > > to deploy the operator together with all required resources. The helm
> > > > > > > chart allows easy configuration for things like images, namespaces etc
> > > > > > > and flags to control specific parts of the deployment such as RBAC or
> > > > > > > the webhook.
> > > > > > >
> > > > > > > The helm chart provided is intended to be a first version that worked
> > > > > > > for us during development but we expect to have a lot of iterations on
> > > > > > > it based on the feedback from the community.
> > > > > > >
> > > > > > > *Acknowledgment*
> > > > > > > We would like to thank everyone who has provided support and valuable
> > > > > > > feedback on this FLIP.
> > > > > > > We would also like to thank Yang Wang & Alexis Sarda-Espinosa
> > > > > > > specifically for making their operators open source and available to
> > > > > > > us which had a big impact on the FLIP and the prototype.
> > > > > > >
> > > > > > > We are looking forward to continuing development on the operator
> > > > > > > together with the broader community.
> > > > > > > All work will be tracked using the ASF Jira from now on.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <yuanpengfred@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Gyula,
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > It's great to see the project getting started and I can't
> wait
> > to
> > > > see
> > > > > > the
> > > > > > > > PR and start contributing code.😄😄😄
> > > > > > > >
> > > > > > > > Best Wishes!
> > > > > > > > Peng Yuan
> > > > > > > >
> > > > > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <
> > gyula.fora@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Peng Yuan!
> > > > > > > > >
> > > > > > > > > The repo is already created:
> > > > > > > > > https://github.com/apache/flink-kubernetes-operator
> > > > > > > > >
> > > > > > > > > We will open the PR with the initial prototype later today,
> > > stay
> > > > > > tuned
> > > > > > > in
> > > > > > > > > this thread! :)
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <
> > yuanpengfred@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > Has the project of flink-kubernetes-operator been created
> > in
> > > > > > github?
> > > > > > > > > >
> > > > > > > > > > Peng Yuan
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <
> > > > gyula.fora@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I agree with flink-kubernetes-operator as the repo name
> > :)
> > > > > > > > > > > Don't have any better idea
> > > > > > > > > > >
> > > > > > > > > > > Gyula
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <
> > > thw@apache.org>
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the continued feedback and discussion. It looks like we are
> > > > > > > > > > > > ready to start a VOTE; I will initiate it shortly.
> > > > > > > > > > > >
> > > > > > > > > > > > In parallel it would be good to settle on the repository name.
> > > > > > > > > > > >
> > > > > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > > > > >
> > > > > > > > > > > > I thought "flink-operator" could be a bit misleading since the term
> > > > > > > > > > > > operator already has a meaning in Flink.
> > > > > > > > > > > >
> > > > > > > > > > > > I also considered "flink-k8s-operator" but that would be almost
> > > > > > > > > > > > identical to existing operator implementations and could lead to
> > > > > > > > > > > > confusion in the future.
> > > > > > > > > > > >
> > > > > > > > > > > > Thoughts?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gyula.fora@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Danny,
> > > > > > > > > > > > >
> > > > > > > > > > > > > So far we have been focusing our dev efforts on the initial native
> > > > > > > > > > > > > implementation with the team.
> > > > > > > > > > > > > If the discussion and vote go well for this FLIP we are looking forward
> > > > > > > > > > > > > to contributing the initial version sometime next week (fingers crossed).
> > > > > > > > > > > > >
> > > > > > > > > > > > > At that point I think we can already start the dev work to support the
> > > > > > > > > > > > > standalone mode as well, especially if you can dedicate some effort to
> > > > > > > > > > > > > pushing that side.
> > > > > > > > > > > > > Working together on this sounds like a great idea and we should start as
> > > > > > > > > > > > > soon as possible! :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Gyula
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <dannycranmer@apache.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I have been discussing this one with my team. We are interested in the
> > > > > > > > > > > > > > Standalone mode, and are willing to contribute towards the implementation.
> > > > > > > > > > > > > > Potentially we can work together to support both modes in parallel?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gyula.fora@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Versioning:
> > > > > > > > > > > > > > > Versioning will be independent from Flink and the operator will depend on a
> > > > > > > > > > > > > > > fixed Flink version (in every given operator version).
> > > > > > > > > > > > > > > This should be the exact same setup as with Stateful Functions
> > > > > > > > > > > > > > > (https://github.com/apache/flink-statefun). So independent release cycle but
> > > > > > > > > > > > > > > still within the Flink umbrella.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > > > > I think that's a very good point, as general exception handling for the
> > > > > > > > > > > > > > > different failure scenarios is a tricky problem. I think the exception
> > > > > > > > > > > > > > > classifiers and retry strategies could avoid a lot of manual intervention
> > > > > > > > > > > > > > > from the user. We will definitely need to add something like this. Once we
> > > > > > > > > > > > > > > have the repo created with the initial operator code we should open some
> > > > > > > > > > > > > > > tickets for this and put it on the short term roadmap!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <dannycranmer@apache.org> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Great work on the FLIP, I am looking forward to this one. I agree that we
> > > > > > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I have general feedback around how we will handle job submission failure
> > > > > > > > > > > > > > > > and retry. As discussed in the Rejected Alternatives section, we can use
> > > > > > > > > > > > > > > > Java to handle job submission failures from the Flink client. It would be
> > > > > > > > > > > > > > > > useful to have the ability to configure exception classifiers and retry
> > > > > > > > > > > > > > > > strategy as part of operator configuration.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Given this will be in a separate GitHub repository I am curious how the
> > > > > > > > > > > > > > > > versioning strategy will work in relation to the Flink version? Do we have
> > > > > > > > > > > > > > > > any other components with a similar setup I can look at? Will the operator
> > > > > > > > > > > > > > > > version track Flink or will it use its own versioning strategy with a Flink
> > > > > > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <balassi.marton@gmail.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > > > > > > > > > > > > > > > accordingly. If you are comfortable with the currently existing design and
> > > > > > > > > > > > > > > > > depth in the FLIP [1] I suggest moving forward to the voting stage - once
> > > > > > > > > > > > > > > > > that reaches a positive conclusion it lets us create the separate code
> > > > > > > > > > > > > > > > > repository under the flink project for the operator.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I encourage everyone to keep improving the details in the meantime, however
> > > > > > > > > > > > > > > > > I believe given the existing design and the general sentiment on this
> > > > > > > > > > > > > > > > > thread that the most efficient path from here is starting the
> > > > > > > > > > > > > > > > > implementation so that we can collectively iterate over it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas
> > Weise <
> > > > > > > > > > thw@apache.org>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > HI Xintong,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the feedback and please see
> > > > responses
> > > > > > > below
> > > > > > > > > -->
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong
> > > Song <
> > > > > > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks Thomas for drafting this FLIP,
> and
> > > > > > everyone
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I also have a few questions and
> comments.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > > > > Deploying a Flink session cluster via
> > > > kubectl &
> > > > > > CR
> > > > > > > > and
> > > > > > > > > > then
> > > > > > > > > > > > > > > > submitting
> > > > > > > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > > > > > to the cluster via Flink cli / REST is
> > > > probably
> > > > > > the
> > > > > > > > > > > approach
> > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > requires
> > > > > > > > > > > > > > > > > > > the least effort. However, I'd like to
> > > point
> > > > > out
> > > > > > 2
> > > > > > > > > > > > weaknesses.
> > > > > > > > > > > > > > > > > > > 1. A lot of users use Flink in
> > > > > perjob/application
> > > > > > > > > modes.
> > > > > > > > > > > For
> > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > users,
> > > > > > > > > > > > > > > > > > > having to run the job in two steps
> > (deploy
> > > > the
> > > > > > > > cluster,
> > > > > > > > > > and
> > > > > > > > > > > > > > submit
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > job)
> > > > > > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > > > > > 2. One of our motivations is being able
> > to
> > > > > manage
> > > > > > > > Flink
> > > > > > > > > > > > > > > applications'
> > > > > > > > > > > > > > > > > > > lifecycles with kubectl. Submitting
> jobs
> > > from
> > > > > cli
> > > > > > > > > sounds
> > > > > > > > > > > not
> > > > > > > > > > > > > > > aligned
> > > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > > > > > I think it's probably worth it to
> support
> > > > > > > submitting
> > > > > > > > > jobs
> > > > > > > > > > > via
> > > > > > > > > > > > > > > > kubectl &
> > > > > > > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > > > > > in the first version, both together
> with
> > > > > > deploying
> > > > > > > > the
> > > > > > > > > > > > cluster
> > > > > > > > > > > > > > like
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > perjob/application mode and after
> > deploying
> > > > the
> > > > > > > > cluster
> > > > > > > > > > > like
> > > > > > > > > > > > in
> > > > > > > > > > > > > > > > session
> > > > > > > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > The intention is to support application
> > > > > management
> > > > > > > > > through
> > > > > > > > > > > > operator
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > CR,
> > > > > > > > > > > > > > > > > > which means there won't be any 2 step
> > > > submission
> > > > > > > > process,
> > > > > > > > > > > > which as
> > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > allude to would defeat the purpose of
> this
> > > > > project.
> > > > > > > The
> > > > > > > > > CR
> > > > > > > > > > > > example
> > > > > > > > > > > > > > > > shows
> > > > > > > > > > > > > > > > > > the application part. Please note that
> the
> > > bare
> > > > > > > cluster
> > > > > > > > > > > > support is
> > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > > *additional* feature for scenarios that
> > > require
> > > > > > > > external
> > > > > > > > > > job
> > > > > > > > > > > > > > > > management.
> > > > > > > > > > > > > > > > > Is
> > > > > > > > > > > > > > > > > > there anything on the FLIP page that
> > creates
> > > a
> > > > > > > > different
> > > > > > > > > > > > > > impression?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > > > > Which Flink versions does the operator
> > plan
> > > > to
> > > > > > > > support?
> > > > > > > > > > > > > > > > > > > 1. Native K8s deployment was firstly
> > > > introduced
> > > > > > in
> > > > > > > > > Flink
> > > > > > > > > > > 1.10
> > > > > > > > > > > > > > > > > > > 2. Native K8s HA was introduced in
> Flink
> > > 1.12
> > > > > > > > > > > > > > > > > > > 3. The Pod template support was
> > introduced
> > > in
> > > > > > Flink
> > > > > > > > > 1.13
> > > > > > > > > > > > > > > > > > > 4. There was some changes to the Flink
> > > docker
> > > > > > image
> > > > > > > > > > > > entrypoint
> > > > > > > > > > > > > > > script
> > > > > > > > > > > > > > > > > in,
> > > > > > > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Great, thanks for providing this. It is
> > > > important
> > > > > > for
> > > > > > > > the
> > > > > > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > > > going forward also. We are targeting
> Flink
> > > > 1.14.x
> > > > > > > > > upwards.
> > > > > > > > > > > > Before
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > operator is ready there will be another
> > Flink
> > > > > > > release.
> > > > > > > > > > Let's
> > > > > > > > > > > > see if
> > > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > > > What kind of API compatibility we can
> > > commit
> > > > > to?
> > > > > > > It's
> > > > > > > > > > > > probably
> > > > > > > > > > > > > > fine
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > > alpha / beta version APIs that allow
> > > > > incompatible
> > > > > > > > > future
> > > > > > > > > > > > changes
> > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > first version. But eventually we would
> > need
> > > > to
> > > > > > > > > guarantee
> > > > > > > > > > > > > > backwards
> > > > > > > > > > > > > > > > > > > compatibility, so that an early version
> > CR
> > > > can
> > > > > > work
> > > > > > > > > with
> > > > > > > > > > a
> > > > > > > > > > > > new
> > > > > > > > > > > > > > > > version
> > > > > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Another great point and please let me
> > include
> > > > > that
> > > > > > on
> > > > > > > > the
> > > > > > > > > > > FLIP
> > > > > > > > > > > > > > page.
> > > > > > > > > > > > > > > > ;-)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I think we should allow incompatible
> > changes
> > > > for
> > > > > > the
> > > > > > > > > first
> > > > > > > > > > > one
> > > > > > > > > > > > or
> > > > > > > > > > > > > > two
> > > > > > > > > > > > > > > > > > versions, similar to how other major
> > features
> > > > > have
> > > > > > > > > evolved
> > > > > > > > > > > > > > recently,
> > > > > > > > > > > > > > > > such
> > > > > > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Would be great to get broader feedback on
> > > this
> > > > > one.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas
> > > Weise
> > > > <
> > > > > > > > > > > thw@apache.org
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone
> > > > integration
> > > > > > > > > > > > > > > > > > > > > Maybe we should make this more
> clear
> > in
> > > > the
> > > > > > > FLIP
> > > > > > > > > but
> > > > > > > > > > we
> > > > > > > > > > > > > > agreed
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > first version of the operator based
> > on
> > > > the
> > > > > > > native
> > > > > > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > > > > > While this clearly does not cover
> all
> > > > > > use-cases
> > > > > > > > and
> > > > > > > > > > > > > > > requirements,
> > > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > > > > > this would lead to a much smaller
> > > initial
> > > > > > > effort
> > > > > > > > > and
> > > > > > > > > > a
> > > > > > > > > > > > nicer
> > > > > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I'm also leaning towards the native
> > > > > > integration,
> > > > > > > as
> > > > > > > > > > long
> > > > > > > > > > > > as it
> > > > > > > > > > > > > > > > > reduces
> > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > MVP effort. Ultimately the operator
> > will
> > > > need
> > > > > > to
> > > > > > > > also
> > > > > > > > > > > > support
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > standalone mode. I would like to gain
> > > more
> > > > > > > > confidence
> > > > > > > > > > > that
> > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > > integration reduces the effort. While
> > it
> > > > cuts
> > > > > > the
> > > > > > > > > > effort
> > > > > > > > > > > to
> > > > > > > > > > > > > > > handle
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > TM
> > > > > > > > > > > > > > > > > > > > pod creation, some mapping code from
> > the
> > > CR
> > > > > to
> > > > > > > the
> > > > > > > > > > native
> > > > > > > > > > > > > > > > integration
> > > > > > > > > > > > > > > > > > > > client and config needs to be
> created.
> > As
> > > > > > > mentioned
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > > > FLIP,
> > > > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > > integration requires the Flink job
> > > manager
> > > > to
> > > > > > > have
> > > > > > > > > > access
> > > > > > > > > > > > to
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > k8s
> > > > > > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > create pods, which in some scenarios
> > may
> > > be
> > > > > > seen
> > > > > > > as
> > > > > > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > > > > Is the pod template in CR same
> > with
> > > > > what
> > > > > > > > Flink
> > > > > > > > > > has
> > > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > > > > > > > > Then I am afraid not the
> > arbitrary
> > > > > > > field(e.g.
> > > > > > > > > > > > cpu/memory
> > > > > > > > > > > > > > > > > > resources)
> > > > > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Yes, pod template would look almost
> > > > > identical.
> > > > > > > > There
> > > > > > > > > > are
> > > > > > > > > > > a
> > > > > > > > > > > > few
> > > > > > > > > > > > > > > > > settings
> > > > > > > > > > > > > > > > > > > > that the operator will control (and
> > that
> > > > may
> > > > > > need
> > > > > > > > to
> > > > > > > > > be
> > > > > > > > > > > > > > > > blacklisted),
> > > > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > > > in general we would not want to place
> > > > > > > > restrictions. I
> > > > > > > > > > > > think a
> > > > > > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > > > > > where a pod template is merged from
> > > > multiple
> > > > > > > layers
> > > > > > > > > > would
> > > > > > > > > > > > also
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > interesting to make this more
> flexible.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Konstantin,

I completely agree with the general philosophy that if a resource exists it
should be "running", or I would rather say "doing its thing", whatever that
means for a particular resource.

We followed this design principle when we decided to have only 2 "desired
states", running and suspended, and not to have states like canceled etc.

Based on our own use cases and the feedback that we received from others,
temporarily suspending a streaming job deployment is part of its regular
lifecycle. The job still exists and will continue after suspension, but it
signifies a state where data processing should be paused for whatever
reason.

The current suspend mechanism is also difficult to do manually if we remove
it from the operator:
 1. We need to implement a cancel-with-savepoint operation and expose this
to the user
 2. The user needs to manually look up the savepoint
 3. Create a new resource later

Adding 1.) is basically equivalent to the current implementation but would
actually expose an operation that feels much more unnatural compared to a
suspend.
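
To make the difference concrete, here is a minimal sketch of the two-state
model in Java. All names in it (DesiredState, JobDeployment, SuspendSketch,
the placeholder savepoint path) are hypothetical and only for illustration;
they are not the operator's actual API. The point is that the operator, not
the user, remembers the savepoint between suspend and resume:

```java
// Hypothetical sketch only: these names are illustrative, not the operator's API.
enum DesiredState { RUNNING, SUSPENDED }

final class JobDeployment {
    DesiredState desired = DesiredState.RUNNING; // what the user asks for in the resource
    boolean jobRunning = false;                  // what is observed on the cluster
    String lastSavepoint = null;                 // remembered by the operator, not the user
}

final class SuspendSketch {
    // Moves the observed state one step toward the desired state and
    // returns a label for the action taken.
    static String reconcile(JobDeployment d) {
        if (d.desired == DesiredState.SUSPENDED && d.jobRunning) {
            // Stand-in for cancel-with-savepoint; the path is a placeholder.
            d.lastSavepoint = "placeholder-savepoint-path";
            d.jobRunning = false;
            return "SUSPENDED_WITH_SAVEPOINT";
        }
        if (d.desired == DesiredState.RUNNING && !d.jobRunning) {
            // Resume from the remembered savepoint if there is one.
            d.jobRunning = true;
            return d.lastSavepoint == null ? "STARTED_FRESH" : "RESTORED_FROM_SAVEPOINT";
        }
        return "NO_CHANGE";
    }
}
```

With this shape the user only flips the desired state on the existing
resource; steps 2 and 3 from the list above disappear because the savepoint
lookup stays inside the reconcile step.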

Cheers,
Gyula

On Wed, Feb 16, 2022 at 11:16 AM Konstantin Knauf <kn...@apache.org> wrote:

> Hi Gyula,
>
> sorry for joining late. One comment on the API design for consideration: we
> are using the job.state as kind of a "desired state", right? This is quite
> uncommon in Kubernetes to my knowledge. In Kubernetes almost always the
> fact that a resource exists means that it should be "running". The only API
> that I am aware of that has something like "suspended" is a Kubernetes Job
> (https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job),
> which looks retrofitted to me.
>
> Cheers,
>
> Konstantin
>
> On Wed, Feb 16, 2022 at 10:52 AM Gyula Fóra <gy...@gmail.com> wrote:
>
> > Hi All!
> >
> > Thank you all for reviewing the PR and already helping to make it better. I
> > have opened a bunch of Jira tickets under
> > https://issues.apache.org/jira/browse/FLINK-25963 based on some comments
> > and incomplete features in general.
> >
> > Given that there were no major objections about the prototype, I will merge
> > it now so we can start collaborating.
> >
> > Cheers,
> > Gyula
> >
> > On Wed, Feb 16, 2022 at 3:52 AM Yang Wang <da...@gmail.com> wrote:
> >
> > > Thanks for the explanation.
> > > Given that this is unrelated to the Java version used by Flink itself,
> > > starting with Java 11 for the flink-kubernetes-operator makes sense to me.
> > >
> > >
> > > Best,
> > > Yang
> > >
> > > On Tue, Feb 15, 2022 at 11:57 PM Thomas Weise <th...@apache.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > At this point I see no reason to support Java 8 for a new project.
> > > > Java 8 is being phased out, we should start with 11.
> > > >
> > > > Also, since the operator isn't a library but effectively just a Docker
> > > > image, the ability to change the Java version isn't as critical as it
> > > > is for Flink core, which needs to run in many different environments.
> > > >
> > > > Cheers,
> > > > Thomas
> > > >
> > > > On Tue, Feb 15, 2022 at 4:50 AM Gyula Fóra <gy...@gmail.com> wrote:
> > > > >
> > > > > Hi Devs,
> > > > >
> > > > > Yang Wang discovered that the current prototype is not compatible with
> > > > > Java 8 but only 11 and upwards.
> > > > >
> > > > > The reason for this is that the Java Operator SDK itself is unfortunately
> > > > > not Java 8 compatible.
> > > > >
> > > > > Given that Java 8 is on the road to deprecation and that the operator runs
> > > > > as a containerized deployment, are there any concerns regarding making the
> > > > > target Java version 11?
> > > > > This should not affect deployed Flink clusters and jobs; those should still
> > > > > work with Java 8. Only the Kubernetes operator itself is affected.
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > >
> > > > > On Tue, Feb 15, 2022 at 1:06 PM Yang Wang <da...@gmail.com> wrote:
> > > > >
> > > > > > I also lean towards not introducing the savepoint/checkpoint related
> > > > > > fields to the job spec, especially at the very beginning of
> > > > > > flink-kubernetes-operator.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Yang
> > > > > >
> > > > > > On Tue, Feb 15, 2022 at 7:02 PM Gyula Fóra <gy...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Peng Yuan!
> > > > > > >
> > > > > > > While I do agree that savepoint path is a very important production
> > > > > > > configuration there are a lot of other things that come to my mind:
> > > > > > >  - savepoint dir
> > > > > > >  - checkpoint dir
> > > > > > >  - checkpoint interval/timeout
> > > > > > >  - high availability settings (provider/storagedir etc)
> > > > > > >
> > > > > > > just to name a few...
> > > > > > >
> > > > > > > While these are all production critical, they have nice clean Flink
> > > > > > > config settings to go with them. If we start introducing these to the
> > > > > > > job spec we only get confusion about priority order etc. and it is going
> > > > > > > to be hard to change or remove them in the future. In any case we should
> > > > > > > validate that these configs exist in cases where users use a stateful
> > > > > > > upgrade mode, for example. This is something we need to add for sure.
> > > > > > >
> > > > > > > As for the other options you mentioned, like automatic savepoint
> > > > > > > generation for instance, those deserve an independent discussion of
> > > > > > > their own I believe :)
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Tue, Feb 15, 2022 at 11:23 AM K Fred <yuanpengfred@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Matyas!
> > > > > > > >
> > > > > > > > Thanks for your reply!
> > > > > > > > For scenarios 1 and 3, I couldn't agree more with the podTemplate
> > > > > > > > solution; I missed this part.
> > > > > > > > For savepoint-related configuration, I think it's very important that
> > > > > > > > it can be specified in JobSpec, because the savepoint is a very common
> > > > > > > > configuration for upgrading a job; if it is placed in JobSpec it is
> > > > > > > > obvious to the user how to configure it. In addition, other advanced
> > > > > > > > properties can be put into flinkConfiguration, customized by expert
> > > > > > > > users.
> > > > > > > > A set of savepoint configuration options follows:
> > > > > > > >
> > > > > > > > > fromSavepoint: Job restart from
> > > > > > > > > autoSavepointSecond: Automatically take a savepoint to the
> > > > > > > > > `savepointsDir` every n seconds.
> > > > > > > > > savepointsDir: Directory in which to store automatically taken
> > > > > > > > > savepoints.
> > > > > > > > > savepointGeneration: Update the savepoint generation in the job
> > > > > > > > > status for a running job (should be defined in JobStatus).
> > > > > > > >
> > > > > > > >
> > > > > > > > Best wishes,
> > > > > > > > Peng Yuan.
> > > > > > > >
> > > > > > > > On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <matyas.orhidi@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Peng,
> > > > > > > > >
> > > > > > > > > > Thanks for your feedback. Regarding scenarios 1 and 3, the
> > > > > > > > > > podTemplate functionality in the operator could cover both. We
> > > > > > > > > > also need to be careful about introducing proxy parameters in
> > > > > > > > > > the CRD spec. The savepoint path, for example, is usually
> > > > > > > > > > accompanied by a bunch of other configurations, so users need
> > > > > > > > > > to use configuration params anyway. What do you think?
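
A rough sketch of how a pod template might cover scenarios 1 and 3, reusing the init container and the port from the question quoted below. The jar-cache volume and its mounts are assumptions added here so that the downloaded jar is visible to the Flink container. flink-main-container is the name Flink's native Kubernetes pod templates expect for the main container. Where exactly podTemplate sits in the CR is also an assumption.

```yaml
# Sketch only: a pod template fragment, assumed to live under the
# FlinkDeployment spec. Volume name and mount path are hypothetical.
podTemplate:
  apiVersion: v1
  kind: Pod
  metadata:
    name: pod-template
  spec:
    initContainers:
      # Scenario 1: fetch the job jar before the Flink container starts.
      - name: downloader
        image: curlimages/curl
        env:
          - name: JAR_URL
            value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
          - name: DEST_PATH
            value: /cache/flink-app.jar
        command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
        volumeMounts:
          - name: jar-cache
            mountPath: /cache
    containers:
      - name: flink-main-container
        # Scenario 3: expose an extra port, e.g. for a Prometheus reporter.
        ports:
          - name: prom
            containerPort: 9249
        volumeMounts:
          - name: jar-cache
            mountPath: /cache
    volumes:
      - name: jar-cache
        emptyDir: {}
```

With this approach no dedicated jar-download or extraPorts field is needed in the CRD, at the cost of a more verbose CR.
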
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Matyas
> > > > > > > > >
> > > > > > > > > On Tue, Feb 15, 2022 at 8:58 AM K Fred <
> > yuanpengfred@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Gyula!
> > > > > > > > > >
> > > > > > > > > > I have reviewed the prototype design of
> > > > flink-kubernetes-operator
> > > > > > you
> > > > > > > > > > submitted, and I have the following questions:
> > > > > > > > > >
> > > > > > > > > > > 1. Can the JobSpec support a Flink jar that is pulled in by a
> > > > > > > > > > > sidecar container? Something like this:
> > > > > > > > > >
> > > > > > > > > > > initContainers:
> > > > > > > > > > >       - name: downloader
> > > > > > > > > > >         image: curlimages/curl
> > > > > > > > > > >         env:
> > > > > > > > > > >           - name: JAR_URL
> > > > > > > > > > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > > > > > > > > >           - name: DEST_PATH
> > > > > > > > > > >             value: /cache/flink-app.jar
> > > > > > > > > > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > > > > > > > > >
> > > > > > > > > > > 2. Can we add a savepoint path property to the job specification?
> > > > > > > > > > > 3. Can we add an extra port to the JobManagerSpec and
> > > > > > > > > > > TaskManagerSpec to expose a service such as Prometheus? The
> > > > > > > > > > > property could look like this:
> > > > > > > > > >
> > > > > > > > > > > extraPorts:
> > > > > > > > > > >       - name: prom
> > > > > > > > > > >         containerPort: 9249
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best wishes,
> > > > > > > > > > Peng Yuan
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <
> > > gyfora@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Flink Devs!
> > > > > > > > > > >
> > > > > > > > > > > We would like to present to you the first prototype of
> > the
> > > > > > > > > > > flink-kubernetes-operator that was built based on the
> > FLIP
> > > > and
> > > > > > the
> > > > > > > > > > > discussion on this mail thread. We would also like to
> > call
> > > > out
> > > > > > some
> > > > > > > > > > design
> > > > > > > > > > > decisions that we have made regarding architecture
> > > components
> > > > > > that
> > > > > > > > were
> > > > > > > > > > not
> > > > > > > > > > > explicitly mentioned in the FLIP document/thread and
> give
> > > > you the
> > > > > > > > > > > opportunity to raise any concerns here.
> > > > > > > > > > >
> > > > > > > > > > > You can find the initial prototype here:
> > > > > > > > > > >
> > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > > > > > > > > >
> > > > > > > > > > > We will leave the PR open for 1-2 days before merging
> to
> > > let
> > > > > > people
> > > > > > > > > > comment
> > > > > > > > > > > on it, but please be mindful that this is an initial
> > > > prototype
> > > > > > with
> > > > > > > > > many
> > > > > > > > > > > rough edges. It is not intended to be a complete
> > > > implementation
> > > > > > of
> > > > > > > > the
> > > > > > > > > > FLIP
> > > > > > > > > > > specs as that will take some more work from all of us
> :)
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > *Prototype feature set:*
> > > > > > > > > > > > The prototype contains a basic working version of the
> > > > > > > > > > > > flink-kubernetes-operator that supports deployment and
> > > > > > > > > > > > lifecycle management of a stateful native flink application.
> > > > > > > > > > > > We have basic support for stateful and stateless upgrades, UI
> > > > > > > > > > > > ingress, pod templates etc. Error handling at this point is
> > > > > > > > > > > > largely missing.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > *Features / design decisions that were not explicitly discussed
> > > > > > > > > > > > in this thread*
> > > > > > > > > > >
> > > > > > > > > > > > *Basic Admission control using a Webhook*
> > > > > > > > > > > > Standard resource admission control in Kubernetes to validate
> > > > > > > > > > > > and potentially reject resources is done through Webhooks.
> > > > > > > > > > > >
> > > > > > > > > > > > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > > > > > > > > This is a necessary mechanism to give the user an
> upfront
> > > > error
> > > > > > > when
> > > > > > > > an
> > > > > > > > > > > incorrect resource was submitted. In the Flink
> operator's
> > > > case we
> > > > > > > > need
> > > > > > > > > to
> > > > > > > > > > > validate that the FlinkDeployment yaml actually makes
> > sense
> > > > and
> > > > > > > does
> > > > > > > > > not
> > > > > > > > > > > contain erroneous config options that would inevitably
> > lead
> > > > to
> > > > > > > > > > > deployment/job failures.
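
As a sketch of what registering such a validating webhook with Kubernetes typically looks like: the resource kind and its required fields below are standard Kubernetes (admissionregistration.k8s.io/v1), while the webhook name, service name, namespace, path, and the flink.apache.org API group are assumptions made for illustration and are not taken from the prototype.

```yaml
# Sketch only: all names below are hypothetical unless noted otherwise.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: flink-operator-validation
webhooks:
  - name: validate.flinkdeployment.example.org   # must be a fully qualified name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail          # reject the resource if the webhook is unreachable
    clientConfig:
      service:
        name: flink-operator-webhook-service
        namespace: default
        path: /validate
      # caBundle: supplied at deploy time (base64-encoded CA certificate)
    rules:
      - apiGroups: ["flink.apache.org"]          # assumed API group of the CRD
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["flinkdeployments"]          # assumed plural resource name
        scope: Namespaced
```

With a configuration of this shape, the API server calls the webhook on every create or update of a matching resource and rejects the request up front if validation fails.
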
> > > > > > > > > > >
> > > > > > > > > > > We have implemented a simple webhook that we can use
> for
> > > this
> > > > > > type
> > > > > > > of
> > > > > > > > > > > validation, as a separate maven module
> > > > > > (flink-kubernetes-webhook).
> > > > > > > > The
> > > > > > > > > > > webhook is an optional component and can be enabled or
> > > > disabled
> > > > > > > > during
> > > > > > > > > > > deployment. To avoid pulling in new external
> dependencies
> > > we
> > > > have
> > > > > > > > used
> > > > > > > > > > the
> > > > > > > > > > > Flink Shaded Netty module to build the simple rest
> > endpoint
> > > > > > > required.
> > > > > > > > > If
> > > > > > > > > > > the community feels that Netty adds unnecessary
> > complexity
> > > > to the
> > > > > > > > > webhook
> > > > > > > > > > > > implementation, we are open to alternative backends such as
> > > > > > > > > > > > Spring Boot, which would practically eliminate all the
> > > > > > > > > > > > boilerplate.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > *Helm Chart for deployment*
> > > > > > > > > > > > Helm charts provide an industry standard way of managing
> > > > > > > > > > > > Kubernetes deployments. We have created a helm
> > > chart
> > > > > > > > prototype
> > > > > > > > > > > that can be used to deploy the operator together with
> all
> > > > > > required
> > > > > > > > > > > resources. The helm chart allows easy configuration for
> > > > things
> > > > > > like
> > > > > > > > > > images,
> > > > > > > > > > > namespaces etc and flags to control specific parts of
> the
> > > > > > > deployment
> > > > > > > > > such
> > > > > > > > > > > as RBAC or the webhook.
> > > > > > > > > > >
> > > > > > > > > > > The helm chart provided is intended to be a first
> version
> > > > that
> > > > > > > worked
> > > > > > > > > for
> > > > > > > > > > > us during development but we expect to have a lot of
> > > > iterations
> > > > > > on
> > > > > > > it
> > > > > > > > > > based
> > > > > > > > > > > on the feedback from the community.
> > > > > > > > > > >
> > > > > > > > > > > *Acknowledgment*
> > > > > > > > > > > We would like to thank everyone who has provided
> support
> > > and
> > > > > > > valuable
> > > > > > > > > > > feedback on this FLIP.
> > > > > > > > > > > We would also like to thank Yang Wang & Alexis
> > > Sarda-Espinosa
> > > > > > > > > > specifically
> > > > > > > > > > > for making their operators open source and available to
> > us
> > > > which
> > > > > > > had
> > > > > > > > a
> > > > > > > > > > big
> > > > > > > > > > > impact on the FLIP and the prototype.
> > > > > > > > > > >
> > > > > > > > > > > We are looking forward to continuing development on the
> > > > operator
> > > > > > > > > together
> > > > > > > > > > > with the broader community.
> > > > > > > > > > > All work will be tracked using the ASF Jira from now
> on.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Gyula
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <
> > > > yuanpengfred@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Gyula,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > > It's great to see the project getting started and I
> > can't
> > > > wait
> > > > > > to
> > > > > > > > see
> > > > > > > > > > the
> > > > > > > > > > > > PR and start contributing code.😄😄😄
> > > > > > > > > > > >
> > > > > > > > > > > > Best Wishes!
> > > > > > > > > > > > Peng Yuan
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <
> > > > > > gyula.fora@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Peng Yuan!
> > > > > > > > > > > > >
> > > > > > > > > > > > > The repo is already created:
> > > > > > > > > > > > >
> https://github.com/apache/flink-kubernetes-operator
> > > > > > > > > > > > >
> > > > > > > > > > > > > We will open the PR with the initial prototype
> later
> > > > today,
> > > > > > > stay
> > > > > > > > > > tuned
> > > > > > > > > > > in
> > > > > > > > > > > > > this thread! :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Gyula
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <
> > > > > > yuanpengfred@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Has the project of flink-kubernetes-operator been
> > > > created
> > > > > > in
> > > > > > > > > > github?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Peng Yuan
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <
> > > > > > > > gyula.fora@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I agree with flink-kubernetes-operator as the
> > repo
> > > > name
> > > > > > :)
> > > > > > > > > > > > > > > Don't have any better idea
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <
> > > > > > > thw@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the continued feedback and
> > discussion.
> > > > Looks
> > > > > > > > like
> > > > > > > > > we
> > > > > > > > > > > are
> > > > > > > > > > > > > > > > ready to start a VOTE, I will initiate it
> > > shortly.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In parallel it would be good to find the
> > > repository
> > > > > > name.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > My suggestion would be:
> > flink-kubernetes-operator
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I thought "flink-operator" could be a bit
> > > > misleading
> > > > > > > since
> > > > > > > > > the
> > > > > > > > > > > term
> > > > > > > > > > > > > > > > operator already has a meaning in Flink.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I also considered "flink-k8s-operator" but
> that
> > > > would
> > > > > > be
> > > > > > > > > almost
> > > > > > > > > > > > > > > > identical to existing operator
> implementations
> > > and
> > > > > > could
> > > > > > > > lead
> > > > > > > > > > to
> > > > > > > > > > > > > > > > confusion in the future.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <
> > > > > > > > > > gyula.fora@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Danny,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > So far we have been focusing our dev
> efforts
> > on
> > > > the
> > > > > > > > initial
> > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > implementation with the team.
> > > > > > > > > > > > > > > > > If the discussion and vote goes well for
> this
> > > > FLIP we
> > > > > > > are
> > > > > > > > > > > looking
> > > > > > > > > > > > > > > forward
> > > > > > > > > > > > > > > > > to contributing the initial version
> sometime
> > > next
> > > > > > week
> > > > > > > > > > (fingers
> > > > > > > > > > > > > > > crossed).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > At that point I think we can already start
> > the
> > > > dev
> > > > > > work
> > > > > > > > to
> > > > > > > > > > > > support
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > standalone mode as well, especially if you
> > can
> > > > > > dedicate
> > > > > > > > > some
> > > > > > > > > > > > effort
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > pushing that side.
> > > > > > > > > > > > > > > > > Working together on this sounds like a
> great
> > > > idea and
> > > > > > > we
> > > > > > > > > > should
> > > > > > > > > > > > > start
> > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > > soon as possible! :)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny
> Cranmer
> > <
> > > > > > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I have been discussing this one with my
> > team.
> > > > We
> > > > > > are
> > > > > > > > > > > interested
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > Standalone mode, and are willing to
> > > contribute
> > > > > > > towards
> > > > > > > > > the
> > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > > > > > Potentially we can work together to
> support
> > > > both
> > > > > > > modes
> > > > > > > > in
> > > > > > > > > > > > > parallel?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula
> Fóra <
> > > > > > > > > > > > gyula.fora@gmail.com>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Versioning:
> > > > > > > > > > > > > > > > > > > Versioning will be independent from
> Flink
> > > > and the
> > > > > > > > > > operator
> > > > > > > > > > > > will
> > > > > > > > > > > > > > > > depend
> > > > > > > > > > > > > > > > > > on a
> > > > > > > > > > > > > > > > > > > fixed flink version (in every given
> > > operator
> > > > > > > > version).
> > > > > > > > > > > > > > > > > > > This should be the exact same setup as
> > with
> > > > > > > Stateful
> > > > > > > > > > > > Functions
> > > > > > > > > > > > > (
> > > > > > > > > > > > > > > > > > >
> https://github.com/apache/flink-statefun
> > ).
> > > > So
> > > > > > > > > > independent
> > > > > > > > > > > > > > release
> > > > > > > > > > > > > > > > cycle
> > > > > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > > still within the Flink umbrella.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > > > > > > > > I think that's a very good point, as
> > > general
> > > > > > > > exception
> > > > > > > > > > > > handling
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > different failure scenarios is a tricky
> > > > problem.
> > > > > > I
> > > > > > > > > think
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > > > > classifiers and retry strategies could
> > > avoid
> > > > a
> > > > > > lot
> > > > > > > of
> > > > > > > > > > > manual
> > > > > > > > > > > > > > > > intervention
> > > > > > > > > > > > > > > > > > > from the user. We will definitely need
> to
> > > add
> > > > > > > > something
> > > > > > > > > > > like
> > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > Once
> > > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > have the repo created with the initial
> > > > operator
> > > > > > > code
> > > > > > > > we
> > > > > > > > > > > > should
> > > > > > > > > > > > > > open
> > > > > > > > > > > > > > > > some
> > > > > > > > > > > > > > > > > > > tickets for this and put it on the
> short
> > > term
> > > > > > > > roadmap!
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny
> > > Cranmer
> > > > <
> > > > > > > > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Great work on the FLIP, I am looking
> > > > forward to
> > > > > > > > this
> > > > > > > > > > > one. I
> > > > > > > > > > > > > > agree
> > > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I have general feedback around how we
> > > will
> > > > > > handle
> > > > > > > > job
> > > > > > > > > > > > > > submission
> > > > > > > > > > > > > > > > > > failure
> > > > > > > > > > > > > > > > > > > > and retry. As discussed in the
> Rejected
> > > > > > > > Alternatives
> > > > > > > > > > > > section,
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > > > > Java to handle job submission
> failures
> > > > from the
> > > > > > > > Flink
> > > > > > > > > > > > client.
> > > > > > > > > > > > > > It
> > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > useful to have the ability to
> configure
> > > > > > exception
> > > > > > > > > > > > classifiers
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > retry
> > > > > > > > > > > > > > > > > > > > strategy as part of operator
> > > configuration.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Given this will be in a separate
> Github
> > > > > > > repository
> > > > > > > > I
> > > > > > > > > am
> > > > > > > > > > > > > curious
> > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > versioning strategy will work in
> > relation
> > > > to
> > > > > > the
> > > > > > > > > Flink
> > > > > > > > > > > > > version?
> > > > > > > > > > > > > > > Do
> > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > > > any other components with a similar
> > setup
> > > > I can
> > > > > > > > look
> > > > > > > > > > at?
> > > > > > > > > > > > Will
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > operator
> > > > > > > > > > > > > > > > > > > > version track Flink or will it use
> its
> > > own
> > > > > > > > versioning
> > > > > > > > > > > > > strategy
> > > > > > > > > > > > > > > > with a
> > > > > > > > > > > > > > > > > > > Flink
> > > > > > > > > > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton
> > > > Balassi <
> > > > > > > > > > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thank you for the great feedback,
> > > Thomas
> > > > has
> > > > > > > > > updated
> > > > > > > > > > > the
> > > > > > > > > > > > > FLIP
> > > > > > > > > > > > > > > > page
> > > > > > > > > > > > > > > > > > > > > accordingly. If you are comfortable
> > > with
> > > > the
> > > > > > > > > > currently
> > > > > > > > > > > > > > existing
> > > > > > > > > > > > > > > > > > design
> > > > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > depth in the FLIP [1] I suggest
> > moving
> > > > > > forward
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > > > > voting
> > > > > > > > > > > > > > > > stage -
> > > > > > > > > > > > > > > > > > > once
> > > > > > > > > > > > > > > > > > > > > that reaches a positive conclusion
> it
> > > > lets us
> > > > > > > > > create
> > > > > > > > > > > the
> > > > > > > > > > > > > > > separate
> > > > > > > > > > > > > > > > > > code
> > > > > > > > > > > > > > > > > > > > > repository under the flink project
> > for
> > > > the
> > > > > > > > > operator.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I encourage everyone to keep
> > improving
> > > > the
> > > > > > > > details
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > meantime,
> > > > > > > > > > > > > > > > > > > > however
> > > > > > > > > > > > > > > > > > > > > I believe given the existing design
> > and
> > > > the
> > > > > > > > general
> > > > > > > > > > > > > sentiment
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > > > > thread that the most efficient path
> > > from
> > > > here
> > > > > > > is
> > > > > > > > > > > starting
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > implementation so that we can
> > > > collectively
> > > > > > > > iterate
> > > > > > > > > > over
> > > > > > > > > > > > it.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM
> > Thomas
> > > > > > Weise <
> > > > > > > > > > > > > > thw@apache.org>
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thanks for the feedback and
> please
> > > see
> > > > > > > > responses
> > > > > > > > > > > below
> > > > > > > > > > > > > -->
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM
> > > > Xintong
> > > > > > > Song <
> > > > > > > > > > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Thanks Thomas for drafting this
> > > > FLIP, and
> > > > > > > > > > everyone
> > > > > > > > > > > > for
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I also have a few questions and
> > > > comments.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > > > > > > > > Deploying a Flink session
> cluster
> > > via
> > > > > > > > kubectl &
> > > > > > > > > > CR
> > > > > > > > > > > > and
> > > > > > > > > > > > > > then
> > > > > > > > > > > > > > > > > > > > submitting
> > > > > > > > > > > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > > > > > > > > > to the cluster via Flink cli /
> > REST
> > > > is
> > > > > > > > probably
> > > > > > > > > > the
> > > > > > > > > > > > > > > approach
> > > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > > > > > requires
> > > > > > > > > > > > > > > > > > > > > > > the least effort. However, I'd
> > like
> > > > to
> > > > > > > point
> > > > > > > > > out
> > > > > > > > > > 2
> > > > > > > > > > > > > > > > weaknesses.
> > > > > > > > > > > > > > > > > > > > > > > 1. A lot of users use Flink in
> > > > > > > > > perjob/application
> > > > > > > > > > > > > modes.
> > > > > > > > > > > > > > > For
> > > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > users,
> > > > > > > > > > > > > > > > > > > > > > > having to run the job in two
> > steps
> > > > > > (deploy
> > > > > > > > the
> > > > > > > > > > > > cluster,
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > submit
> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > job)
> > > > > > > > > > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > > > > > > > > > 2. One of our motivations is
> > being
> > > > able
> > > > > > to
> > > > > > > > > manage
> > > > > > > > > > > > Flink
> > > > > > > > > > > > > > > > > > > applications'
> > > > > > > > > > > > > > > > > > > > > > > lifecycles with kubectl.
> > Submitting
> > > > jobs
> > > > > > > from
> > > > > > > > > cli
> > > > > > > > > > > > > sounds
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > > aligned
> > > > > > > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > > > > > > > > > I think it's probably worth it
> to
> > > > support
> > > > > > > > > > > submitting
> > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > via
> > > > > > > > > > > > > > > > > > > > kubectl &
> > > > > > > > > > > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > > > > > > > > > in the first version, both
> > together
> > > > with
> > > > > > > > > > deploying
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > cluster
> > > > > > > > > > > > > > > > > > like
> > > > > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > > > > > perjob/application mode and
> after
> > > > > > deploying
> > > > > > > > the
> > > > > > > > > > > > cluster
> > > > > > > > > > > > > > > like
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > > session
> > > > > > > > > > > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > The intention is to support
> > > application
> > > > > > > > > management
> > > > > > > > > > > > > through
> > > > > > > > > > > > > > > > operator
> > > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > CR,
> > > > > > > > > > > > > > > > > > > > > > which means there won't be any 2
> > step
> > > > > > > > submission
> > > > > > > > > > > > process,
> > > > > > > > > > > > > > > > which as
> > > > > > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > > > > > allude to would defeat the
> purpose
> > of
> > > > this
> > > > > > > > > project.
> > > > > > > > > > > The
> > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > > example
> > > > > > > > > > > > > > > > > > > > shows
> > > > > > > > > > > > > > > > > > > > > > the application part. Please note
> > > that
> > > > the
> > > > > > > bare
> > > > > > > > > > > cluster
> > > > > > > > > > > > > > > > support is
> > > > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > > > > > > *additional* feature for
> scenarios
> > > that
> > > > > > > require
> > > > > > > > > > > > external
> > > > > > > > > > > > > > job
> > > > > > > > > > > > > > > > > > > > management.
> > > > > > > > > > > > > > > > > > > > > Is
> > > > > > > > > > > > > > > > > > > > > > there anything on the FLIP page
> > that
> > > > > > creates
> > > > > > > a
> > > > > > > > > > > > different
> > > > > > > > > > > > > > > > > > impression?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > > > > > > > > Which Flink versions does the
> > > > operator
> > > > > > plan
> > > > > > > > to
> > > > > > > > > > > > support?
> > > > > > > > > > > > > > > > > > > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > > > > > > > > > > > > > > > > > 2. Native K8s HA was introduced
> > in
> > > > Flink
> > > > > > > 1.12
> > > > > > > > > > > > > > > > > > > > > > > 3. The Pod template support was
> > > > > > introduced
> > > > > > > in
> > > > > > > > > > Flink
> > > > > > > > > > > > > 1.13
> > > > > > > > > > > > > > > > > > > > > > > > 4. There were some changes to the Flink docker image entrypoint
> > > > > > > > > > > > > > > > > > > > > > > > script in, IIRC, Flink 1.13
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Great, thanks for providing this.
> > It
> > > is
> > > > > > > > important
> > > > > > > > > > for
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > > > > > > > going forward also. We are
> > targeting
> > > > Flink
> > > > > > > > 1.14.x
> > > > > > > > > > > > > upwards.
> > > > > > > > > > > > > > > > Before
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > operator is ready there will be
> > > another
> > > > > > Flink
> > > > > > > > > > > release.
> > > > > > > > > > > > > > Let's
> > > > > > > > > > > > > > > > see if
> > > > > > > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > > > > is interested in earlier
> versions?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > > > > > > > What kind of API compatibility
> we
> > > can
> > > > > > > commit
> > > > > > > > > to?
> > > > > > > > > > > It's
> > > > > > > > > > > > > > > > probably
> > > > > > > > > > > > > > > > > > fine
> > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > > > > > > alpha / beta version APIs that
> > > allow
> > > > > > > > > incompatible
> > > > > > > > > > > > > future
> > > > > > > > > > > > > > > > changes
> > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > first version. But eventually
> we
> > > > would
> > > > > > need
> > > > > > > > to
> > > > > > > > > > > > > guarantee
> > > > > > > > > > > > > > > > > > backwards
> > > > > > > > > > > > > > > > > > > > > > > compatibility, so that an early
> > > > version
> > > > > > CR
> > > > > > > > can
> > > > > > > > > > work
> > > > > > > > > > > > > with
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > new
> > > > > > > > > > > > > > > > > > > > version
> > > > > > > > > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Another great point and please
> let
> > me
> > > > > > include
> > > > > > > > > that
> > > > > > > > > > on
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > FLIP
> > > > > > > > > > > > > > > > > > page.
> > > > > > > > > > > > > > > > > > > > ;-)
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > I think we should allow
> > incompatible
> > > > > > changes
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > > first
> > > > > > > > > > > > > > > one
> > > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > > two
> > > > > > > > > > > > > > > > > > > > > > versions, similar to how other
> > major
> > > > > > features
> > > > > > > > > have
> > > > > > > > > > > > > evolved
> > > > > > > > > > > > > > > > > > recently,
> > > > > > > > > > > > > > > > > > > > such
> > > > > > > > > > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Would be great to get broader
> > > feedback
> > > > on
> > > > > > > this
> > > > > > > > > one.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM
> > > > Thomas
> > > > > > > Weise
> > > > > > > > <
> > > > > > > > > > > > > > > thw@apache.org
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > # 1 Flink Native vs
> > Standalone
> > > > > > > > integration
> > > > > > > > > > > > > > > > > > > > > > > > > Maybe we should make this
> > more
> > > > clear
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > FLIP
> > > > > > > > > > > > > but
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > agreed
> > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > first version of the
> operator
> > > > based
> > > > > > on
> > > > > > > > the
> > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > > > > > > > > > While this clearly does not
> > > > cover all
> > > > > > > > > > use-cases
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > requirements,
> > > > > > > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > > > > > > > > > this would lead to a much
> > > smaller
> > > > > > > initial
> > > > > > > > > > > effort
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > nicer
> > > > > > > > > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > I'm also leaning towards the
> > > native
> > > > > > > > > > integration,
> > > > > > > > > > > as
> > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > as it
> > > > > > > > > > > > > > > > > > > > > reduces
> > > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > MVP effort. Ultimately the
> > > operator
> > > > > > will
> > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > > > also
> > > > > > > > > > > > > > > > support
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > standalone mode. I would like
> > to
> > > > gain
> > > > > > > more
> > > > > > > > > > > > confidence
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > > > > > > integration reduces the
> effort.
> > > > While
> > > > > > it
> > > > > > > > cuts
> > > > > > > > > > the
> > > > > > > > > > > > > > effort
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > handle
> > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > TM
> > > > > > > > > > > > > > > > > > > > > > > > pod creation, some mapping
> code
> > > > from
> > > > > > the
> > > > > > > CR
> > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > > integration
> > > > > > > > > > > > > > > > > > > > > > > > client and config needs to be
> > > > created.
> > > > > > As
> > > > > > > > > > > mentioned
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > FLIP,
> > > > > > > > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > > > > > > integration requires the
> Flink
> > > job
> > > > > > > manager
> > > > > > > > to
> > > > > > > > > > > have
> > > > > > > > > > > > > > access
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > k8s
> > > > > > > > > > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > > > create pods, which in some
> > > > scenarios
> > > > > > may
> > > > > > > be
> > > > > > > > > > seen
> > > > > > > > > > > as
> > > > > > > > > > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > > > > > > > > Is the pod template in
> CR
> > > > same
> > > > > > with
> > > > > > > > > what
> > > > > > > > > > > > Flink
> > > > > > > > > > > > > > has
> > > > > > > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > > > > > > > > > > > > Then I am afraid not
> the
> > > > > > arbitrary
> > > > > > > > > > > field(e.g.
> > > > > > > > > > > > > > > > cpu/memory
> > > > > > > > > > > > > > > > > > > > > > resources)
> > > > > > > > > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Yes, pod template would look
> > > almost
> > > > > > > > > identical.
> > > > > > > > > > > > There
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > few
> > > > > > > > > > > > > > > > > > > > > settings
> > > > > > > > > > > > > > > > > > > > > > > > that the operator will
> control
> > > (and
> > > > > > that
> > > > > > > > may
> > > > > > > > > > need
> > > > > > > > > > > > to
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > blacklisted),
> > > > > > > > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > > > > > > > in general we would not want
> to
> > > > place
> > > > > > > > > > > > restrictions. I
> > > > > > > > > > > > > > > > think a
> > > > > > > > > > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > > > > > > > > > where a pod template is
> merged
> > > from
> > > > > > > > multiple
> > > > > > > > > > > layers
> > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > also
> > > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > > > > interesting to make this more
> > > > flexible.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > > > > > > >
>
> --
>
> Konstantin Knauf
>
> https://twitter.com/snntrable
>
> https://github.com/knaufk
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Konstantin Knauf <kn...@apache.org>.
Hi Gyula,

Sorry for joining late. One comment on the API design for consideration: we
are using job.state as a kind of "desired state", right? To my knowledge,
this is quite uncommon in Kubernetes. In Kubernetes, the fact that a resource
exists almost always means that it should be "running". The only API that I
am aware of that has something like "suspended" is a Kubernetes Job
(https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job),
which looks retrofitted to me.
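
For reference, the field in question on a Kubernetes Job is `spec.suspend`. A
minimal sketch of how it is used is below; the Job name, image, and command
are placeholders chosen for illustration and are not taken from this thread
or from the operator prototype.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job          # placeholder name
spec:
  suspend: true              # desired state: the Job exists, but no Pods are started
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox     # placeholder image
          command: ["sh", "-c", "echo hello"]
```

Setting `suspend` back to `false` lets the Job controller create Pods again.
Whether a comparable field belongs on the Flink CR is exactly the open
question here.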

Cheers,

Konstantin

On Wed, Feb 16, 2022 at 10:52 AM Gyula Fóra <gy...@gmail.com> wrote:

> Hi All!
>
> Thank you all for reviewing the PR and already helping to make it better. I
> have opened a bunch of Jira tickets under
> https://issues.apache.org/jira/browse/FLINK-25963 based on some comments
> and incomplete features in general.
>
> Given that there were no major objections about the prototype, I will merge
> it now so we can start collaborating together.
>
> Cheers,
> Gyula
>
> On Wed, Feb 16, 2022 at 3:52 AM Yang Wang <da...@gmail.com> wrote:
>
> > Thanks for the explanation.
> > Given that it is unrelated to the Java version in Flink,
> > starting with Java 11 for the flink-kubernetes-operator makes sense to me.
> >
> >
> > Best,
> > Yang
> >
> > On Tue, Feb 15, 2022 at 23:57, Thomas Weise <th...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > At this point I see no reason to support Java 8 for a new project.
> > > Java 8 is being phased out, we should start with 11.
> > >
> > > Also, since the operator isn't a library but effectively just a docker
> > > image, the ability to change the Java version isn't as critical as it
> > > is for Flink core, which needs to run in many different environments.
> > >
> > > Cheers,
> > > Thomas
> > >
> > > On Tue, Feb 15, 2022 at 4:50 AM Gyula Fóra <gy...@gmail.com> wrote:
> > > >
> > > > Hi Devs,
> > > >
> > > > Yang Wang discovered that the current prototype is not compatible with
> > > > Java 8, only with Java 11 and upwards.
> > > >
> > > > The reason for this is that the Java Operator SDK itself is
> > > > unfortunately not Java 8 compatible.
> > > >
> > > > Given that Java 8 is on the road to deprecation and that the operator
> > > > runs as a containerized deployment, are there any concerns regarding
> > > > making the target Java version 11?
> > > > This should not affect deployed Flink clusters and jobs, which should
> > > > still work with Java 8; it only affects the Kubernetes operator itself.
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > >
> > > > On Tue, Feb 15, 2022 at 1:06 PM Yang Wang <da...@gmail.com> wrote:
> > > >
> > > > > I also lean towards not introducing the savepoint/checkpoint related
> > > > > fields to the job spec, especially at the very beginning of the
> > > > > flink-kubernetes-operator.
> > > > >
> > > > >
> > > > > Best,
> > > > > Yang
> > > > >
> > > > > On Tue, Feb 15, 2022 at 19:02, Gyula Fóra <gy...@gmail.com> wrote:
> > > > >
> > > > > > Hi Peng Yuan!
> > > > > >
> > > > > > While I do agree that the savepoint path is a very important
> > > > > > production configuration, there are a lot of other things that come
> > > > > > to my mind:
> > > > > >  - savepoint dir
> > > > > >  - checkpoint dir
> > > > > >  - checkpoint interval/timeout
> > > > > >  - high availability settings (provider/storagedir etc)
> > > > > >
> > > > > > just to name a few...
> > > > > >
> > > > > > While these are all production critical, they have nice clean Flink
> > > > > > config settings to go with them. If we start introducing these to the
> > > > > > job spec, we only get confusion about priority order etc., and it is
> > > > > > going to be hard to change or remove them in the future. In any case,
> > > > > > we should validate that these configs exist in cases where users use
> > > > > > a stateful upgrade mode, for example. This is something we need to
> > > > > > add for sure.
> > > > > >
> > > > > > As for the other options you mentioned, like automatic savepoint
> > > > > > generation for instance, those deserve an independent discussion of
> > > > > > their own, I believe :)
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > > > On Tue, Feb 15, 2022 at 11:23 AM K Fred <yu...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Matyas!
> > > > > > >
> > > > > > > Thanks for your reply!
> > > > > > > For scenarios 1 and 3, I couldn't agree more with the podTemplate
> > > > > > > solution; I missed this part.
> > > > > > > For the savepoint-related configuration, I think it is very
> > > > > > > important that it can be specified in the JobSpec, because a
> > > > > > > savepoint is a very common configuration for upgrading a job. If
> > > > > > > it is placed in the JobSpec, the user can configure it in an
> > > > > > > obvious way. In addition, other advanced properties can be put
> > > > > > > into flinkConfiguration, customized by expert users.
> > > > > > > A set of savepoint configuration options follows:
> > > > > > >
> > > > > > >  - fromSavepoint: the job restarts from this savepoint
> > > > > > >  - autoSavepointSecond: automatically take a savepoint to the
> > > > > > >    `savepointsDir` every n seconds
> > > > > > >  - savepointsDir: the directory where automatically taken
> > > > > > >    savepoints are stored
> > > > > > >  - savepointGeneration: update the savepoint generation of the job
> > > > > > >    status for a running job (should be defined in JobStatus)
> > > > > > >
> > > > > > >
> > > > > > > Best wishes,
> > > > > > > Peng Yuan.
> > > > > > >
> > > > > > > On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás
> > > > > > > <matyas.orhidi@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Peng,
> > > > > > > >
> > > > > > > > Thanks for your feedback. Regarding scenarios 1 and 3, the
> > > > > > > > podTemplate functionality in the operator could cover both. We
> > > > > > > > also need to be careful about introducing proxy parameters in
> > > > > > > > the CRD spec. The savepoint path, for example, is usually
> > > > > > > > accompanied by a bunch of other configurations, so users need
> > > > > > > > to use configuration params anyway. What do you think?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Matyas
> > > > > > > >
> > > > > > > > On Tue, Feb 15, 2022 at 8:58 AM K Fred <yuanpengfred@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Gyula!
> > > > > > > > >
> > > > > > > > > I have reviewed the prototype design of the
> > > > > > > > > flink-kubernetes-operator you submitted, and I have the
> > > > > > > > > following questions:
> > > > > > > > >
> > > > > > > > > 1. Can the JobSpec support pulling the Flink jar package via
> > > > > > > > > a sidecar? Just like this:
> > > > > > > > >
> > > > > > > > > > initContainers:
> > > > > > > > > >       - name: downloader
> > > > > > > > > >         image: curlimages/curl
> > > > > > > > > >         env:
> > > > > > > > > >           - name: JAR_URL
> > > > > > > > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > > > > > > > >           - name: DEST_PATH
> > > > > > > > > >             value: /cache/flink-app.jar
> > > > > > > > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > > > > > > > >
> > > > > > > > > 2. Can we add a savepoint path property to the job
> > > > > > > > > specification?
> > > > > > > > > 3. Can we add an extra port to the JobManagerSpec and
> > > > > > > > > TaskManagerSpec to expose some services, such as Prometheus?
> > > > > > > > > The property could look like this:
> > > > > > > > >
> > > > > > > > > > extraPorts:
> > > > > > > > > >       - name: prom
> > > > > > > > > >         containerPort: 9249
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best wishes,
> > > > > > > > > Peng Yuan
> > > > > > > > >
> > > > > > > > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gyfora@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Flink Devs!
> > > > > > > > > >
> > > > > > > > > > We would like to present to you the first prototype of the
> > > > > > > > > > flink-kubernetes-operator that was built based on the FLIP
> > > > > > > > > > and the discussion on this mail thread. We would also like
> > > > > > > > > > to call out some design decisions that we have made
> > > > > > > > > > regarding architecture components that were not explicitly
> > > > > > > > > > mentioned in the FLIP document/thread and give you the
> > > > > > > > > > opportunity to raise any concerns here.
> > > > > > > > > >
> > > > > > > > > > You can find the initial prototype here:
> > > > > > > > > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > > > > > > > >
> > > > > > > > > > We will leave the PR open for 1-2 days before merging to let
> > > > > > > > > > people comment on it, but please be mindful that this is an
> > > > > > > > > > initial prototype with many rough edges. It is not intended
> > > > > > > > > > to be a complete implementation of the FLIP specs, as that
> > > > > > > > > > will take some more work from all of us :)
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > *Prototype feature set:*
> > > > > > > > > > The prototype contains a basic working version of the
> > > > > > > > > > flink-kubernetes-operator that supports deployment and
> > > > > > > > > > lifecycle management of a stateful native Flink application.
> > > > > > > > > > We have basic support for stateful and stateless upgrades,
> > > > > > > > > > UI ingress, pod templates etc. Error handling at this point
> > > > > > > > > > is largely missing.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > *Features / design decisions that were not explicitly
> > > > > > > > > > discussed in this thread*
> > > > > > > > > >
> > > > > > > > > > *Basic admission control using a webhook*
> > > > > > > > > > Standard resource admission control in Kubernetes, to
> > > > > > > > > > validate and potentially reject resources, is done through
> > > > > > > > > > webhooks:
> > > > > > > > > > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > > > > > > > This is a necessary mechanism to give the user an upfront
> > > > > > > > > > error when an incorrect resource was submitted. In the Flink
> > > > > > > > > > operator's case we need to validate that the FlinkDeployment
> > > > > > > > > > YAML actually makes sense and does not contain erroneous
> > > > > > > > > > config options that would inevitably lead to deployment/job
> > > > > > > > > > failures.
> > > > > > > > > >
> > > > > > > > > > We have implemented a simple webhook that we can use for
> > > > > > > > > > this type of validation, as a separate Maven module
> > > > > > > > > > (flink-kubernetes-webhook). The webhook is an optional
> > > > > > > > > > component and can be enabled or disabled during deployment.
> > > > > > > > > > To avoid pulling in new external dependencies we have used
> > > > > > > > > > the Flink Shaded Netty module to build the simple REST
> > > > > > > > > > endpoint required. If the community feels that Netty adds
> > > > > > > > > > unnecessary complexity to the webhook implementation, we are
> > > > > > > > > > open to alternative backends such as Spring Boot, for
> > > > > > > > > > instance, which would practically eliminate all the
> > > > > > > > > > boilerplate.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > *Helm chart for deployment*
> > > > > > > > > > Helm charts provide an industry standard way of managing
> > > > > > > > > > Kubernetes deployments. We have created a Helm chart
> > > > > > > > > > prototype that can be used to deploy the operator together
> > > > > > > > > > with all required resources. The Helm chart allows easy
> > > > > > > > > > configuration for things like images, namespaces etc., and
> > > > > > > > > > flags to control specific parts of the deployment such as
> > > > > > > > > > RBAC or the webhook.
> > > > > > > > > >
> > > > > > > > > > The Helm chart provided is intended to be a first version
> > > > > > > > > > that worked for us during development, but we expect to have
> > > > > > > > > > a lot of iterations on it based on the feedback from the
> > > > > > > > > > community.
> > > > > > > > > >
> > > > > > > > > > *Acknowledgment*
> > > > > > > > > > We would like to thank everyone who has provided support and
> > > > > > > > > > valuable feedback on this FLIP.
> > > > > > > > > > We would also like to thank Yang Wang & Alexis
> > > > > > > > > > Sarda-Espinosa specifically for making their operators open
> > > > > > > > > > source and available to us, which had a big impact on the
> > > > > > > > > > FLIP and the prototype.
> > > > > > > > > >
> > > > > > > > > > We are looking forward to continuing development on the
> > > > > > > > > > operator together with the broader community.
> > > > > > > > > > All work will be tracked using the ASF Jira from now on.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Gyula
> > > > > > > > > >
> > > > > > > > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <yuanpengfred@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Gyula,
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > > > It's great to see the project getting started and I can't
> > > > > > > > > > > wait to see the PR and start contributing code.😄😄😄
> > > > > > > > > > >
> > > > > > > > > > > Best Wishes!
> > > > > > > > > > > Peng Yuan
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra
> > > > > > > > > > > <gyula.fora@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Peng Yuan!
> > > > > > > > > > > >
> > > > > > > > > > > > The repo is already created:
> > > > > > > > > > > > https://github.com/apache/flink-kubernetes-operator
> > > > > > > > > > > >
> > > > > > > > > > > > We will open the PR with the initial prototype later
> > > > > > > > > > > > today, stay tuned in this thread! :)
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Gyula
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred
> > > > > > > > > > > > <yuanpengfred@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Has the flink-kubernetes-operator project been created
> > > > > > > > > > > > > on GitHub?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Peng Yuan
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra
> > > > > > > > > > > > > <gyula.fora@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I agree with flink-kubernetes-operator as the repo
> > > > > > > > > > > > > > name :)
> > > > > > > > > > > > > > Don't have any better idea
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise
> > > > > > > > > > > > > > <thw@apache.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the continued feedback and discussion.
> > > > > > > > > > > > > > > Looks like we are ready to start a VOTE, I will
> > > > > > > > > > > > > > > initiate it shortly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In parallel it would be good to find the
> > > > > > > > > > > > > > > repository name.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I thought "flink-operator" could be a bit
> > > > > > > > > > > > > > > misleading since the term operator already has a
> > > > > > > > > > > > > > > meaning in Flink.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I also considered "flink-k8s-operator" but that
> > > > > > > > > > > > > > > would be almost identical to existing operator
> > > > > > > > > > > > > > > implementations and could lead to confusion in
> > > > > > > > > > > > > > > the future.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra
> > > > > > > > > > > > > > > <gyula.fora@gmail.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Danny,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > So far we have been focusing our dev efforts on
> > > > > > > > > > > > > > > > the initial native implementation with the team.
> > > > > > > > > > > > > > > > If the discussion and vote go well for this
> > > > > > > > > > > > > > > > FLIP, we are looking forward to contributing the
> > > > > > > > > > > > > > > > initial version sometime next week (fingers
> > > > > > > > > > > > > > > > crossed).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > At that point I think we can already start the
> > > > > > > > > > > > > > > > dev work to support the standalone mode as well,
> > > > > > > > > > > > > > > > especially if you can dedicate some effort to
> > > > > > > > > > > > > > > > pushing that side.
> > > > > > > > > > > > > > > > Working together on this sounds like a great
> > > > > > > > > > > > > > > > idea and we should start as soon as possible! :)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer
> > > > > > > > > > > > > > > > <dannycranmer@apache.org> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I have been discussing this one with my team.
> > > > > > > > > > > > > > > > > We are interested in the Standalone mode, and
> > > > > > > > > > > > > > > > > are willing to contribute towards the
> > > > > > > > > > > > > > > > > implementation. Potentially we can work
> > > > > > > > > > > > > > > > > together to support both modes in parallel?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra
> > > > > > > > > > > > > > > > > <gyula.fora@gmail.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Versioning:
> > > > > > > > > > > > > > > > > > Versioning will be independent from Flink and
> > > > > > > > > > > > > > > > > > the operator will depend on a fixed Flink
> > > > > > > > > > > > > > > > > > version (in every given operator version).
> > > > > > > > > > > > > > > > > > This should be the exact same setup as with
> > > > > > > > > > > > > > > > > > Stateful Functions
> > > > > > > > > > > > > > > > > > (https://github.com/apache/flink-statefun).
> > > > > > > > > > > > > > > > > > So an independent release cycle, but still
> > > > > > > > > > > > > > > > > > within the Flink umbrella.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > > > > > > > I think that's a very good point, as general
> > > > > > > > > > > > > > > > > > exception handling for the different failure
> > > > > > > > > > > > > > > > > > scenarios is a tricky problem. I think the
> > > > > > > > > > > > > > > > > > exception classifiers and retry strategies
> > > > > > > > > > > > > > > > > > could avoid a lot of manual intervention
> > > > > > > > > > > > > > > > > > from the user. We will definitely need to
> > > > > > > > > > > > > > > > > > add something like this. Once we have the
> > > > > > > > > > > > > > > > > > repo created with the initial operator code,
> > > > > > > > > > > > > > > > > > we should open some tickets for this and put
> > > > > > > > > > > > > > > > > > it on the short term roadmap!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny
> > Cranmer
> > > <
> > > > > > > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Great work on the FLIP, I am looking
> > > forward to
> > > > > > > this
> > > > > > > > > > one. I
> > > > > > > > > > > > > agree
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I have general feedback around how we
> > will
> > > > > handle
> > > > > > > job
> > > > > > > > > > > > > submission
> > > > > > > > > > > > > > > > > failure
> > > > > > > > > > > > > > > > > > > and retry. As discussed in the Rejected
> > > > > > > Alternatives
> > > > > > > > > > > section,
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > > > Java to handle job submission failures
> > > from the
> > > > > > > Flink
> > > > > > > > > > > client.
> > > > > > > > > > > > > It
> > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > useful to have the ability to configure
> > > > > exception
> > > > > > > > > > > classifiers
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > retry
> > > > > > > > > > > > > > > > > > > strategy as part of operator
> > configuration.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Given this will be in a separate Github
> > > > > > repository
> > > > > > > I
> > > > > > > > am
> > > > > > > > > > > > curious
> > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > versioning strategy will work in
> relation
> > > to
> > > > > the
> > > > > > > > Flink
> > > > > > > > > > > > version?
> > > > > > > > > > > > > > Do
> > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > > any other components with a similar
> setup
> > > I can
> > > > > > > look
> > > > > > > > > at?
> > > > > > > > > > > Will
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > operator
> > > > > > > > > > > > > > > > > > > version track Flink or will it use its
> > own
> > > > > > > versioning
> > > > > > > > > > > > strategy
> > > > > > > > > > > > > > > with a
> > > > > > > > > > > > > > > > > > Flink
> > > > > > > > > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton
> > > Balassi <
> > > > > > > > > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thank you for the great feedback,
> > Thomas
> > > has
> > > > > > > > updated
> > > > > > > > > > the
> > > > > > > > > > > > FLIP
> > > > > > > > > > > > > > > page
> > > > > > > > > > > > > > > > > > > > accordingly. If you are comfortable
> > with
> > > the
> > > > > > > > > currently
> > > > > > > > > > > > > existing
> > > > > > > > > > > > > > > > > design
> > > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > depth in the FLIP [1] I suggest
> moving
> > > > > forward
> > > > > > to
> > > > > > > > the
> > > > > > > > > > > > voting
> > > > > > > > > > > > > > > stage -
> > > > > > > > > > > > > > > > > > once
> > > > > > > > > > > > > > > > > > > > that reaches a positive conclusion it
> > > lets us
> > > > > > > > create
> > > > > > > > > > the
> > > > > > > > > > > > > > separate
> > > > > > > > > > > > > > > > > code
> > > > > > > > > > > > > > > > > > > > repository under the flink project
> for
> > > the
> > > > > > > > operator.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I encourage everyone to keep
> improving
> > > the
> > > > > > > details
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > > > > meantime,
> > > > > > > > > > > > > > > > > > > however
> > > > > > > > > > > > > > > > > > > > I believe given the existing design
> and
> > > the
> > > > > > > general
> > > > > > > > > > > > sentiment
> > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > > > thread that the most efficient path
> > from
> > > here
> > > > > > is
> > > > > > > > > > starting
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > implementation so that we can
> > > collectively
> > > > > > > iterate
> > > > > > > > > over
> > > > > > > > > > > it.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM
> Thomas
> > > > > Weise <
> > > > > > > > > > > > > thw@apache.org>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thanks for the feedback and please
> > see
> > > > > > > responses
> > > > > > > > > > below
> > > > > > > > > > > > -->
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM
> > > Xintong
> > > > > > Song <
> > > > > > > > > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thanks Thomas for drafting this
> > > FLIP, and
> > > > > > > > > everyone
> > > > > > > > > > > for
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > I also have a few questions and
> > > comments.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > > > > > > > Deploying a Flink session cluster
> > via
> > > > > > > kubectl &
> > > > > > > > > CR
> > > > > > > > > > > and
> > > > > > > > > > > > > then
> > > > > > > > > > > > > > > > > > > submitting
> > > > > > > > > > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > > > > > > > > to the cluster via Flink cli /
> REST
> > > is
> > > > > > > probably
> > > > > > > > > the
> > > > > > > > > > > > > > approach
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > > > > requires
> > > > > > > > > > > > > > > > > > > > > > the least effort. However, I'd
> like
> > > to
> > > > > > point
> > > > > > > > out
> > > > > > > > > 2
> > > > > > > > > > > > > > > weaknesses.
> > > > > > > > > > > > > > > > > > > > > > 1. A lot of users use Flink in
> > > > > > > > perjob/application
> > > > > > > > > > > > modes.
> > > > > > > > > > > > > > For
> > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > users,
> > > > > > > > > > > > > > > > > > > > > > having to run the job in two
> steps
> > > > > (deploy
> > > > > > > the
> > > > > > > > > > > cluster,
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > submit
> > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > job)
> > > > > > > > > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > > > > > > > > 2. One of our motivations is
> being
> > > able
> > > > > to
> > > > > > > > manage
> > > > > > > > > > > Flink
> > > > > > > > > > > > > > > > > > applications'
> > > > > > > > > > > > > > > > > > > > > > lifecycles with kubectl.
> Submitting
> > > jobs
> > > > > > from
> > > > > > > > cli
> > > > > > > > > > > > sounds
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > aligned
> > > > > > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > > > > > > > > I think it's probably worth it to
> > > support
> > > > > > > > > > submitting
> > > > > > > > > > > > jobs
> > > > > > > > > > > > > > via
> > > > > > > > > > > > > > > > > > > kubectl &
> > > > > > > > > > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > > > > > > > > in the first version, both
> together
> > > with
> > > > > > > > > deploying
> > > > > > > > > > > the
> > > > > > > > > > > > > > > cluster
> > > > > > > > > > > > > > > > > like
> > > > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > > > > perjob/application mode and after
> > > > > deploying
> > > > > > > the
> > > > > > > > > > > cluster
> > > > > > > > > > > > > > like
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > session
> > > > > > > > > > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > The intention is to support
> > application
> > > > > > > > management
> > > > > > > > > > > > through
> > > > > > > > > > > > > > > operator
> > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > CR,
> > > > > > > > > > > > > > > > > > > > > which means there won't be any 2
> step
> > > > > > > submission
> > > > > > > > > > > process,
> > > > > > > > > > > > > > > which as
> > > > > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > > > > allude to would defeat the purpose
> of
> > > this
> > > > > > > > project.
> > > > > > > > > > The
> > > > > > > > > > > > CR
> > > > > > > > > > > > > > > example
> > > > > > > > > > > > > > > > > > > shows
> > > > > > > > > > > > > > > > > > > > > the application part. Please note
> > that
> > > the
> > > > > > bare
> > > > > > > > > > cluster
> > > > > > > > > > > > > > > support is
> > > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > > > > > *additional* feature for scenarios
> > that
> > > > > > require
> > > > > > > > > > > external
> > > > > > > > > > > > > job
> > > > > > > > > > > > > > > > > > > management.
> > > > > > > > > > > > > > > > > > > > Is
> > > > > > > > > > > > > > > > > > > > > there anything on the FLIP page
> that
> > > > > creates
> > > > > > a
> > > > > > > > > > > different
> > > > > > > > > > > > > > > > > impression?
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > > > > > > > Which Flink versions does the
> > > operator
> > > > > plan
> > > > > > > to
> > > > > > > > > > > support?
> > > > > > > > > > > > > > > > > > > > > > 1. Native K8s deployment was
> > firstly
> > > > > > > introduced
> > > > > > > > > in
> > > > > > > > > > > > Flink
> > > > > > > > > > > > > > 1.10
> > > > > > > > > > > > > > > > > > > > > > 2. Native K8s HA was introduced
> in
> > > Flink
> > > > > > 1.12
> > > > > > > > > > > > > > > > > > > > > > 3. The Pod template support was
> > > > > introduced
> > > > > > in
> > > > > > > > > Flink
> > > > > > > > > > > > 1.13
> > > > > > > > > > > > > > > > > > > > > > > > 4. There were some changes to the
> > > Flink
> > > > > > docker
> > > > > > > > > image
> > > > > > > > > > > > > > > entrypoint
> > > > > > > > > > > > > > > > > > script
> > > > > > > > > > > > > > > > > > > > in,
> > > > > > > > > > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Great, thanks for providing this.
> It
> > is
> > > > > > > important
> > > > > > > > > for
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > > > > > > going forward also. We are
> targeting
> > > Flink
> > > > > > > 1.14.x
> > > > > > > > > > > > upwards.
> > > > > > > > > > > > > > > Before
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > operator is ready there will be
> > another
> > > > > Flink
> > > > > > > > > > release.
> > > > > > > > > > > > > Let's
> > > > > > > > > > > > > > > see if
> > > > > > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > > > > > > What kind of API compatibility we
> > can
> > > > > > commit
> > > > > > > > to?
> > > > > > > > > > It's
> > > > > > > > > > > > > > > probably
> > > > > > > > > > > > > > > > > fine
> > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > > > > > alpha / beta version APIs that
> > allow
> > > > > > > > incompatible
> > > > > > > > > > > > future
> > > > > > > > > > > > > > > changes
> > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > first version. But eventually we
> > > would
> > > > > need
> > > > > > > to
> > > > > > > > > > > > guarantee
> > > > > > > > > > > > > > > > > backwards
> > > > > > > > > > > > > > > > > > > > > > compatibility, so that an early
> > > version
> > > > > CR
> > > > > > > can
> > > > > > > > > work
> > > > > > > > > > > > with
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > > new
> > > > > > > > > > > > > > > > > > > version
> > > > > > > > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Another great point and please let
> me
> > > > > include
> > > > > > > > that
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > > > > FLIP
> > > > > > > > > > > > > > > > > page.
> > > > > > > > > > > > > > > > > > > ;-)
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I think we should allow
> incompatible
> > > > > changes
> > > > > > > for
> > > > > > > > > the
> > > > > > > > > > > > first
> > > > > > > > > > > > > > one
> > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > two
> > > > > > > > > > > > > > > > > > > > > versions, similar to how other
> major
> > > > > features
> > > > > > > > have
> > > > > > > > > > > > evolved
> > > > > > > > > > > > > > > > > recently,
> > > > > > > > > > > > > > > > > > > such
> > > > > > > > > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Would be great to get broader
> > feedback
> > > on
> > > > > > this
> > > > > > > > one.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM
> > > Thomas
> > > > > > Weise
> > > > > > > <
> > > > > > > > > > > > > > thw@apache.org
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > # 1 Flink Native vs
> Standalone
> > > > > > > integration
> > > > > > > > > > > > > > > > > > > > > > > > Maybe we should make this
> more
> > > clear
> > > > > in
> > > > > > > the
> > > > > > > > > > FLIP
> > > > > > > > > > > > but
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > agreed
> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > first version of the operator
> > > based
> > > > > on
> > > > > > > the
> > > > > > > > > > native
> > > > > > > > > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > > > > > > > > While this clearly does not
> > > cover all
> > > > > > > > > use-cases
> > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > requirements,
> > > > > > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > > > > > > > > this would lead to a much
> > smaller
> > > > > > initial
> > > > > > > > > > effort
> > > > > > > > > > > > and
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > > nicer
> > > > > > > > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I'm also leaning towards the
> > native
> > > > > > > > > integration,
> > > > > > > > > > as
> > > > > > > > > > > > > long
> > > > > > > > > > > > > > > as it
> > > > > > > > > > > > > > > > > > > > reduces
> > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > MVP effort. Ultimately the
> > operator
> > > > > will
> > > > > > > need
> > > > > > > > > to
> > > > > > > > > > > also
> > > > > > > > > > > > > > > support
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > standalone mode. I would like
> to
> > > gain
> > > > > > more
> > > > > > > > > > > confidence
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > > > > > integration reduces the effort.
> > > While
> > > > > it
> > > > > > > cuts
> > > > > > > > > the
> > > > > > > > > > > > > effort
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > handle
> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > TM
> > > > > > > > > > > > > > > > > > > > > > > pod creation, some mapping code
> > > from
> > > > > the
> > > > > > CR
> > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > integration
> > > > > > > > > > > > > > > > > > > > > > > client and config needs to be
> > > created.
> > > > > As
> > > > > > > > > > mentioned
> > > > > > > > > > > > in
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > FLIP,
> > > > > > > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > > > > > integration requires the Flink
> > job
> > > > > > manager
> > > > > > > to
> > > > > > > > > > have
> > > > > > > > > > > > > access
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > k8s
> > > > > > > > > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > > create pods, which in some
> > > scenarios
> > > > > may
> > > > > > be
> > > > > > > > > seen
> > > > > > > > > > as
> > > > > > > > > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > > > > > > > Is the pod template in CR
> > > same
> > > > > with
> > > > > > > > what
> > > > > > > > > > > Flink
> > > > > > > > > > > > > has
> > > > > > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > > > > > > > > > > > Then I am afraid not the
> > > > > arbitrary
> > > > > > > > > > field(e.g.
> > > > > > > > > > > > > > > cpu/memory
> > > > > > > > > > > > > > > > > > > > > resources)
> > > > > > > > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Yes, pod template would look
> > almost
> > > > > > > > identical.
> > > > > > > > > > > There
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > few
> > > > > > > > > > > > > > > > > > > > settings
> > > > > > > > > > > > > > > > > > > > > > > that the operator will control
> > (and
> > > > > that
> > > > > > > may
> > > > > > > > > need
> > > > > > > > > > > to
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > blacklisted),
> > > > > > > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > > > > > > in general we would not want to
> > > place
> > > > > > > > > > > restrictions. I
> > > > > > > > > > > > > > > think a
> > > > > > > > > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > > > > > > > > where a pod template is merged
> > from
> > > > > > > multiple
> > > > > > > > > > layers
> > > > > > > > > > > > > would
> > > > > > > > > > > > > > > also
> > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > > > interesting to make this more
> > > flexible.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
>


-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
Hi All!

Thank you all for reviewing the PR and already helping to make it better. I
have opened a bunch of Jira tickets under
https://issues.apache.org/jira/browse/FLINK-25963 based on some comments
and incomplete features in general.

Given that there were no major objections about the prototype, I will merge
it now so we can start collaborating together.

Cheers,
Gyula

On Wed, Feb 16, 2022 at 3:52 AM Yang Wang <da...@gmail.com> wrote:

> Thanks for the explanation.
> Given that this is unrelated to the Java version used by Flink,
> starting with Java 11 for the flink-kubernetes-operator makes sense to me.
>
>
> Best,
> Yang
>
> On Tue, Feb 15, 2022 at 11:57 PM Thomas Weise <th...@apache.org> wrote:
>
> > Hi,
> >
> > At this point I see no reason to support Java 8 for a new project.
> > Java 8 is being phased out, we should start with 11.
> >
> > Also, since the operator isn't a library but effectively just a docker
> > image, the ability to change the Java version isn't as critical as it
> > is for Flink core, which needs to run in many different environments.
> >
> > Cheers,
> > Thomas
> >
> > On Tue, Feb 15, 2022 at 4:50 AM Gyula Fóra <gy...@gmail.com> wrote:
> > >
> > > Hi Devs,
> > >
> > > Yang Wang discovered that the current prototype is not compatible with
> > > Java 8, only with Java 11 and upwards.
> > >
> > > The reason for this is that the Java Operator SDK itself is unfortunately
> > > not Java 8 compatible.
> > >
> > > Given that Java 8 is on the road to deprecation and that the operator runs
> > > as a containerized deployment, are there any concerns regarding making the
> > > target Java version 11?
> > > This should not affect deployed Flink clusters and jobs, which should still
> > > work with Java 8; it only affects the Kubernetes operator itself.
> > >
> > > Cheers,
> > > Gyula
> > >
> > >
> > > On Tue, Feb 15, 2022 at 1:06 PM Yang Wang <da...@gmail.com> wrote:
> > >
> > > > I also lean towards not introducing the savepoint/checkpoint related
> > > > fields to the job spec, especially at the very beginning of the
> > > > flink-kubernetes-operator.
> > > >
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > On Tue, Feb 15, 2022 at 7:02 PM Gyula Fóra <gy...@gmail.com> wrote:
> > > >
> > > > > Hi Peng Yuan!
> > > > >
> > > > > While I do agree that savepoint path is a very important production
> > > > > configuration there are a lot of other things that come to my mind:
> > > > >  - savepoint dir
> > > > >  - checkpoint dir
> > > > >  - checkpoint interval/timeout
> > > > >  - high availability settings (provider/storagedir etc)
> > > > >
> > > > > just to name a few...
> > > > >
> > > > > > While these are all production critical, they have nice clean Flink config
> > > > > > settings to go with them. If we start introducing these to the job spec, we
> > > > > > only get confusion about priority order etc., and it is going to be hard to
> > > > > > change or remove them in the future. In any case, we should validate that
> > > > > > these configs exist in cases where users use a stateful upgrade mode, for
> > > > > > example. This is something we need to add for sure.
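> > > > > >
> > > > > > As a rough illustration of the point above, the settings listed could all
> > > > > > be expressed as plain Flink configuration rather than as dedicated job spec
> > > > > > fields. The keys below are standard Flink options; the nesting under
> > > > > > `flinkConfiguration` and every path and value are placeholders assumed for
> > > > > > this sketch, not part of the prototype:
> > > > > >
> > > > > > ```yaml
> > > > > > # Sketch only: bucket names, intervals and the exact CR nesting are assumptions.
> > > > > > flinkConfiguration:
> > > > > >   # Savepoint and checkpoint locations
> > > > > >   state.savepoints.dir: "s3://example-bucket/savepoints"
> > > > > >   state.checkpoints.dir: "s3://example-bucket/checkpoints"
> > > > > >   # Checkpoint interval and timeout
> > > > > >   execution.checkpointing.interval: "60s"
> > > > > >   execution.checkpointing.timeout: "10min"
> > > > > >   # High availability provider and storage directory
> > > > > >   high-availability: "org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory"
> > > > > >   high-availability.storageDir: "s3://example-bucket/ha"
> > > > > > ```
> > > > > >
> > > > > > With a layout like this, validation for a stateful upgrade mode would only
> > > > > > need to check that the relevant keys are present.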
> > > > >
> > > > > > As for the other options you mentioned, like automatic savepoint generation
> > > > > > for instance, those deserve an independent discussion of their own, I
> > > > > > believe :)
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > > On Tue, Feb 15, 2022 at 11:23 AM K Fred <yu...@gmail.com> wrote:
> > > > >
> > > > > > Hi Matyas!
> > > > > >
> > > > > > Thanks for your reply!
> > > > > > > For scenarios 1 and 3, I couldn't agree more with the podTemplate
> > > > > > > solution; I missed this part.
> > > > > > > For savepoint related configuration, I think it is very important for it
> > > > > > > to be specified in JobSpec, because savepoint is a very common
> > > > > > > configuration for upgrading a job. If it is placed in JobSpec, it can be
> > > > > > > configured by the user in an obvious way. In addition, other advanced
> > > > > > > properties can be put into flinkConfiguration, customized by expert users.
> > > > > > > A bunch of savepoint configuration as follows:
> > > > > > >
> > > > > > >  - fromSavepoint: Job restart from
> > > > > > >  - autoSavepointSecond: Automatically take a savepoint to the
> > > > > > >    `savepointsDir` every n seconds.
> > > > > > >  - savepointsDir: Savepoints dir where to store automatically taken
> > > > > > >    savepoints
> > > > > > >  - savepointGeneration: Update savepoint generation of job status for a
> > > > > > >    running job (should be defined in JobStatus)
> > > > > >
> > > > > > Best wishes,
> > > > > > Peng Yuan.
> > > > > >
> > > > > > > On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <matyas.orhidi@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Peng,
> > > > > > >
> > > > > > > > Thanks for your feedback. Regarding scenarios 1 and 3, the podTemplate
> > > > > > > > functionality in the operator could cover both. We also need to be careful
> > > > > > > > about introducing proxy parameters in the CRD spec. The savepoint path is
> > > > > > > > usually accompanied by a bunch of other configurations, for example, so
> > > > > > > > users need to use configuration params anyway. What do you think?
> > > > > > > Best,
> > > > > > > Matyas
> > > > > > >
> > > > > > > > On Tue, Feb 15, 2022 at 8:58 AM K Fred <yuanpengfred@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Gyula!
> > > > > > > >
> > > > > > > > > I have reviewed the prototype design of the flink-kubernetes-operator you
> > > > > > > > > submitted, and I have the following questions:
> > > > > > > > >
> > > > > > > > > 1. Can support for pulling the Flink jar via a sidecar be added to the
> > > > > > > > > JobSpec? Just like this:
> > > > > > > > >
> > > > > > > > > > initContainers:
> > > > > > > > > >       - name: downloader
> > > > > > > > > >         image: curlimages/curl
> > > > > > > > > >         env:
> > > > > > > > > >           - name: JAR_URL
> > > > > > > > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > > > > > > > >           - name: DEST_PATH
> > > > > > > > > >             value: /cache/flink-app.jar
> > > > > > > > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > > > > > > > >
> > > > > > > > > 2. Can we add a savepoint path property to the job specification?
> > > > > > > > > 3. Can we add an extra port to the JobManagerSpec and TaskManagerSpec to
> > > > > > > > > expose some service, such as Prometheus? The property could look like this:
> > > > > > > > >
> > > > > > > > > > extraPorts:
> > > > > > > > > >       - name: prom
> > > > > > > > > >         containerPort: 9249
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Best wishes,
> > > > > > > > Peng Yuan
> > > > > > > >
> > > > > > > > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gyfora@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hi Flink Devs!
> > > > > > > > >
> > > > > > > > > > We would like to present to you the first prototype of the
> > > > > > > > > > flink-kubernetes-operator that was built based on the FLIP and the
> > > > > > > > > > discussion on this mail thread. We would also like to call out some
> > > > > > > > > > design decisions that we have made regarding architecture components
> > > > > > > > > > that were not explicitly mentioned in the FLIP document/thread and give
> > > > > > > > > > you the opportunity to raise any concerns here.
> > > > > > > > >
> > > > > > > > > You can find the initial prototype here:
> > > > > > > > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > > > > > > >
> > > > > > > > > We will leave the PR open for 1-2 days before merging to
> let
> > > > people
> > > > > > > > comment
> > > > > > > > > on it, but please be mindful that this is an initial
> > prototype
> > > > with
> > > > > > > many
> > > > > > > > > rough edges. It is not intended to be a complete
> > implementation
> > > > of
> > > > > > the
> > > > > > > > FLIP
> > > > > > > > > specs as that will take some more work from all of us :)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > *Prototype feature set:* The prototype contains a basic
> > working
> > > > > > version
> > > > > > > of
> > > > > > > > > the flink-kubernetes-operator that supports deployment and
> > > > > lifecycle
> > > > > > > > > management of a stateful native flink application. We have
> > basic
> > > > > > > support
> > > > > > > > > for stateful and stateless upgrades, UI ingress, pod
> > templates
> > > > etc.
> > > > > > > Error
> > > > > > > > > handling at this point is largely missing.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *Features / design decisions that were not explicitly
> > discussed
> > > > in
> > > > > > this
> > > > > > > > > thread*
> > > > > > > > >
> > > > > > > > > > *Basic Admission control using a Webhook* Standard resource
> > > > > admission
> > > > > > > > > control in Kubernetes to validate and potentially reject
> > > > resources
> > > > > is
> > > > > > > > done
> > > > > > > > > through Webhooks.
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > > > > > > This is a necessary mechanism to give the user an upfront
> > error
> > > > > when
> > > > > > an
> > > > > > > > > incorrect resource was submitted. In the Flink operator's
> > case we
> > > > > > need
> > > > > > > to
> > > > > > > > > validate that the FlinkDeployment yaml actually makes sense
> > and
> > > > > does
> > > > > > > not
> > > > > > > > > contain erroneous config options that would inevitably lead
> > to
> > > > > > > > > deployment/job failures.
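
To illustrate the kind of upfront check described above, here is a minimal sketch in Java. It is not the code in flink-kubernetes-webhook: the class name DeploymentValidator and the spec keys (image, parallelism, upgradeMode, savepointDir) are assumptions made for the example, and the spec is modelled as a flat string map rather than the real FlinkDeployment resource.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical validator for a deployment spec. The key names used here are
 * illustrative assumptions, not the operator's actual CRD schema.
 */
public class DeploymentValidator {

    /** Returns a list of error messages; an empty list means the spec is accepted. */
    public static List<String> validate(Map<String, String> spec) {
        List<String> errors = new ArrayList<>();

        // A deployment without a container image can never start.
        String image = spec.get("image");
        if (image == null || image.trim().isEmpty()) {
            errors.add("image must be set");
        }

        // Parallelism is optional, but if present it must be a positive integer.
        String parallelism = spec.get("parallelism");
        if (parallelism != null) {
            try {
                if (Integer.parseInt(parallelism) < 1) {
                    errors.add("parallelism must be at least 1");
                }
            } catch (NumberFormatException e) {
                errors.add("parallelism must be an integer");
            }
        }

        // A stateful upgrade needs somewhere to write the savepoint.
        String upgradeMode = spec.getOrDefault("upgradeMode", "stateless");
        if (upgradeMode.equals("savepoint") && !spec.containsKey("savepointDir")) {
            errors.add("upgradeMode 'savepoint' requires savepointDir");
        }
        return errors;
    }
}
```

A webhook endpoint would reject the resource whenever the returned list is non-empty and pass the messages back to the user, so the error shows up at kubectl apply time instead of as a failed deployment later.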
> > > > > > > > >
> > > > > > > > > We have implemented a simple webhook that we can use for
> this
> > > > type
> > > > > of
> > > > > > > > > validation, as a separate maven module
> > > > (flink-kubernetes-webhook).
> > > > > > The
> > > > > > > > > webhook is an optional component and can be enabled or
> > disabled
> > > > > > during
> > > > > > > > > deployment. To avoid pulling in new external dependencies
> we
> > have
> > > > > > used
> > > > > > > > the
> > > > > > > > > Flink Shaded Netty module to build the simple rest endpoint
> > > > > required.
> > > > > > > If
> > > > > > > > > the community feels that Netty adds unnecessary complexity
> > to the
> > > > > > > webhook
> > > > > > > > > implementation we are open to alternative backends such as
> > > > > > Spring Boot
> > > > > > > for
> > > > > > > > > instance which would practically eliminate all the
> > boilerplate.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > *Helm Chart for deployment* Helm charts provide an industry
> > > > standard
> > > > > > way
> > > > > > > > of
> > > > > > > > > managing kubernetes deployments. We have created a helm
> chart
> > > > > > prototype
> > > > > > > > > that can be used to deploy the operator together with all
> > > > required
> > > > > > > > > resources. The helm chart allows easy configuration for
> > things
> > > > like
> > > > > > > > images,
> > > > > > > > > namespaces etc and flags to control specific parts of the
> > > > > deployment
> > > > > > > such
> > > > > > > > > as RBAC or the webhook.
> > > > > > > > >
> > > > > > > > > The helm chart provided is intended to be a first version
> > that
> > > > > worked
> > > > > > > for
> > > > > > > > > us during development but we expect to have a lot of
> > iterations
> > > > on
> > > > > it
> > > > > > > > based
> > > > > > > > > on the feedback from the community.
> > > > > > > > >
> > > > > > > > > *Acknowledgment*
> > > > > > > > > We would like to thank everyone who has provided support
> and
> > > > > valuable
> > > > > > > > > feedback on this FLIP.
> > > > > > > > > We would also like to thank Yang Wang & Alexis
> Sarda-Espinosa
> > > > > > > > specifically
> > > > > > > > > for making their operators open source and available to us
> > which
> > > > > had
> > > > > > a
> > > > > > > > big
> > > > > > > > > impact on the FLIP and the prototype.
> > > > > > > > >
> > > > > > > > > We are looking forward to continuing development on the
> > operator
> > > > > > > together
> > > > > > > > > with the broader community.
> > > > > > > > > All work will be tracked using the ASF Jira from now on.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <
> > yuanpengfred@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Gyula,
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > > It's great to see the project getting started and I can't
> > wait
> > > > to
> > > > > > see
> > > > > > > > the
> > > > > > > > > > PR and start contributing code.😄😄😄
> > > > > > > > > >
> > > > > > > > > > Best Wishes!
> > > > > > > > > > Peng Yuan
> > > > > > > > > >
> > > > > > > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <
> > > > gyula.fora@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Peng Yuan!
> > > > > > > > > > >
> > > > > > > > > > > The repo is already created:
> > > > > > > > > > > https://github.com/apache/flink-kubernetes-operator
> > > > > > > > > > >
> > > > > > > > > > > We will open the PR with the initial prototype later
> > today,
> > > > > stay
> > > > > > > > tuned
> > > > > > > > > in
> > > > > > > > > > > this thread! :)
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Gyula
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <
> > > > yuanpengfred@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi All,
> > > > > > > > > > > >
> > > > > > > > > > > > Has the project of flink-kubernetes-operator been
> > created
> > > > in
> > > > > > > > github?
> > > > > > > > > > > >
> > > > > > > > > > > > Peng Yuan
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <
> > > > > > gyula.fora@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I agree with flink-kubernetes-operator as the repo
> > name
> > > > :)
> > > > > > > > > > > > > Don't have any better idea
> > > > > > > > > > > > >
> > > > > > > > > > > > > Gyula
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <
> > > > > thw@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the continued feedback and discussion.
> > Looks
> > > > > > like
> > > > > > > we
> > > > > > > > > are
> > > > > > > > > > > > > > ready to start a VOTE, I will initiate it
> shortly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In parallel it would be good to find the
> repository
> > > > name.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I thought "flink-operator" could be a bit
> > misleading
> > > > > since
> > > > > > > the
> > > > > > > > > term
> > > > > > > > > > > > > > operator already has a meaning in Flink.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I also considered "flink-k8s-operator" but that
> > would
> > > > be
> > > > > > > almost
> > > > > > > > > > > > > > identical to existing operator implementations
> and
> > > > could
> > > > > > lead
> > > > > > > > to
> > > > > > > > > > > > > > confusion in the future.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <
> > > > > > > > gyula.fora@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Danny,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So far we have been focusing our dev efforts on
> > the
> > > > > > initial
> > > > > > > > > > native
> > > > > > > > > > > > > > > implementation with the team.
> > > > > > > > > > > > > > > > If the discussion and vote go well for this
> > FLIP we
> > > > > are
> > > > > > > > > looking
> > > > > > > > > > > > > forward
> > > > > > > > > > > > > > > to contributing the initial version sometime
> next
> > > > week
> > > > > > > > (fingers
> > > > > > > > > > > > > crossed).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > At that point I think we can already start the
> > dev
> > > > work
> > > > > > to
> > > > > > > > > > support
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > standalone mode as well, especially if you can
> > > > dedicate
> > > > > > > some
> > > > > > > > > > effort
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > pushing that side.
> > > > > > > > > > > > > > > Working together on this sounds like a great
> > idea and
> > > > > we
> > > > > > > > should
> > > > > > > > > > > start
> > > > > > > > > > > > > as
> > > > > > > > > > > > > > > soon as possible! :)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I have been discussing this one with my team.
> > We
> > > > are
> > > > > > > > > interested
> > > > > > > > > > > in
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > Standalone mode, and are willing to
> contribute
> > > > > towards
> > > > > > > the
> > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > > > Potentially we can work together to support
> > both
> > > > > modes
> > > > > > in
> > > > > > > > > > > parallel?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > > > > > > > > > gyula.fora@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Versioning:
> > > > > > > > > > > > > > > > > Versioning will be independent from Flink
> > and the
> > > > > > > > operator
> > > > > > > > > > will
> > > > > > > > > > > > > > depend
> > > > > > > > > > > > > > > > on a
> > > > > > > > > > > > > > > > > fixed flink version (in every given
> operator
> > > > > > version).
> > > > > > > > > > > > > > > > > This should be the exact same setup as with
> > > > > Stateful
> > > > > > > > > > Functions
> > > > > > > > > > > (
> > > > > > > > > > > > > > > > > https://github.com/apache/flink-statefun).
> > So
> > > > > > > > independent
> > > > > > > > > > > > release
> > > > > > > > > > > > > > cycle
> > > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > still within the Flink umbrella.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > > > > > > I think that's a very good point, as
> general
> > > > > > exception
> > > > > > > > > > handling
> > > > > > > > > > > > for
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > different failure scenarios is a tricky
> > problem.
> > > > I
> > > > > > > think
> > > > > > > > > the
> > > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > > classifiers and retry strategies could
> avoid
> > a
> > > > lot
> > > > > of
> > > > > > > > > manual
> > > > > > > > > > > > > > intervention
> > > > > > > > > > > > > > > > > from the user. We will definitely need to
> add
> > > > > > something
> > > > > > > > > like
> > > > > > > > > > > > this.
> > > > > > > > > > > > > > Once
> > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > have the repo created with the initial
> > operator
> > > > > code
> > > > > > we
> > > > > > > > > > should
> > > > > > > > > > > > open
> > > > > > > > > > > > > > some
> > > > > > > > > > > > > > > > > tickets for this and put it on the short
> term
> > > > > > roadmap!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny
> Cranmer
> > <
> > > > > > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Great work on the FLIP, I am looking
> > forward to
> > > > > > this
> > > > > > > > > one. I
> > > > > > > > > > > > agree
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I have general feedback around how we
> will
> > > > handle
> > > > > > job
> > > > > > > > > > > > submission
> > > > > > > > > > > > > > > > failure
> > > > > > > > > > > > > > > > > > and retry. As discussed in the Rejected
> > > > > > Alternatives
> > > > > > > > > > section,
> > > > > > > > > > > > we
> > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > > Java to handle job submission failures
> > from the
> > > > > > Flink
> > > > > > > > > > client.
> > > > > > > > > > > > It
> > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > useful to have the ability to configure
> > > > exception
> > > > > > > > > > classifiers
> > > > > > > > > > > > and
> > > > > > > > > > > > > > retry
> > > > > > > > > > > > > > > > > > strategy as part of operator
> configuration.
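
A rough sketch of what a configurable exception classifier plus retry strategy could look like in Java. Nothing here comes from the FLIP or the prototype: RetryingSubmitter, submitWithRetry, and the parameter names are hypothetical, and the exponential backoff is just one possible strategy.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper; names and backoff policy are assumptions for illustration. */
public class RetryingSubmitter {

    /**
     * Runs the action, retrying while the classifier marks the failure as retryable.
     * The wait time doubles after every failed attempt. The last failure is rethrown
     * once maxAttempts is reached or the classifier rejects the exception.
     */
    public static <T> T submitWithRetry(
            Supplier<T> action,
            Predicate<RuntimeException> isRetryable,
            int maxAttempts,
            long initialBackoffMillis) {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !isRetryable.test(e)) {
                    throw e; // give up: out of attempts or not a transient failure
                }
                sleep(backoff);
                backoff *= 2;
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            // Preserve the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

The classifier and the two numeric parameters are the pieces that would be exposed through operator configuration, for example treating connection timeouts as retryable while failing fast on an invalid job spec.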
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Given this will be in a separate Github
> > > > > repository
> > > > > > I
> > > > > > > am
> > > > > > > > > > > curious
> > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > versioning strategy will work in relation
> > to
> > > > the
> > > > > > > Flink
> > > > > > > > > > > version?
> > > > > > > > > > > > > Do
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > any other components with a similar setup
> > I can
> > > > > > look
> > > > > > > > at?
> > > > > > > > > > Will
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > operator
> > > > > > > > > > > > > > > > > > version track Flink or will it use its
> own
> > > > > > versioning
> > > > > > > > > > > strategy
> > > > > > > > > > > > > > with a
> > > > > > > > > > > > > > > > > Flink
> > > > > > > > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton
> > Balassi <
> > > > > > > > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thank you for the great feedback,
> Thomas
> > has
> > > > > > > updated
> > > > > > > > > the
> > > > > > > > > > > FLIP
> > > > > > > > > > > > > > page
> > > > > > > > > > > > > > > > > > > accordingly. If you are comfortable
> with
> > the
> > > > > > > > currently
> > > > > > > > > > > > existing
> > > > > > > > > > > > > > > > design
> > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > depth in the FLIP [1] I suggest moving
> > > > forward
> > > > > to
> > > > > > > the
> > > > > > > > > > > voting
> > > > > > > > > > > > > > stage -
> > > > > > > > > > > > > > > > > once
> > > > > > > > > > > > > > > > > > > that reaches a positive conclusion it
> > lets us
> > > > > > > create
> > > > > > > > > the
> > > > > > > > > > > > > separate
> > > > > > > > > > > > > > > > code
> > > > > > > > > > > > > > > > > > > repository under the flink project for
> > the
> > > > > > > operator.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I encourage everyone to keep improving
> > the
> > > > > > details
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > > > > meantime,
> > > > > > > > > > > > > > > > > > however
> > > > > > > > > > > > > > > > > > > I believe given the existing design and
> > the
> > > > > > general
> > > > > > > > > > > sentiment
> > > > > > > > > > > > > on
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > > thread that the most efficient path
> from
> > here
> > > > > is
> > > > > > > > > starting
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > implementation so that we can
> > collectively
> > > > > > iterate
> > > > > > > > over
> > > > > > > > > > it.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas
> > > > Weise <
> > > > > > > > > > > > thw@apache.org>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the feedback and please
> see
> > > > > > responses
> > > > > > > > > below
> > > > > > > > > > > -->
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM
> > Xintong
> > > > > Song <
> > > > > > > > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thanks Thomas for drafting this
> > FLIP, and
> > > > > > > > everyone
> > > > > > > > > > for
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I also have a few questions and
> > comments.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > > > > > > Deploying a Flink session cluster
> via
> > > > > > kubectl &
> > > > > > > > CR
> > > > > > > > > > and
> > > > > > > > > > > > then
> > > > > > > > > > > > > > > > > > submitting
> > > > > > > > > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > > > > > > > to the cluster via Flink cli / REST
> > is
> > > > > > probably
> > > > > > > > the
> > > > > > > > > > > > > approach
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > > > requires
> > > > > > > > > > > > > > > > > > > > > the least effort. However, I'd like
> > to
> > > > > point
> > > > > > > out
> > > > > > > > 2
> > > > > > > > > > > > > > weaknesses.
> > > > > > > > > > > > > > > > > > > > > 1. A lot of users use Flink in
> > > > > > > perjob/application
> > > > > > > > > > > modes.
> > > > > > > > > > > > > For
> > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > users,
> > > > > > > > > > > > > > > > > > > > > having to run the job in two steps
> > > > (deploy
> > > > > > the
> > > > > > > > > > cluster,
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > > submit
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > job)
> > > > > > > > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > > > > > > > 2. One of our motivations is being
> > able
> > > > to
> > > > > > > manage
> > > > > > > > > > Flink
> > > > > > > > > > > > > > > > > applications'
> > > > > > > > > > > > > > > > > > > > > lifecycles with kubectl. Submitting
> > jobs
> > > > > from
> > > > > > > cli
> > > > > > > > > > > sounds
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > aligned
> > > > > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > > > > > > > I think it's probably worth it to
> > support
> > > > > > > > > submitting
> > > > > > > > > > > jobs
> > > > > > > > > > > > > via
> > > > > > > > > > > > > > > > > > kubectl &
> > > > > > > > > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > > > > > > > in the first version, both together
> > with
> > > > > > > > deploying
> > > > > > > > > > the
> > > > > > > > > > > > > > cluster
> > > > > > > > > > > > > > > > like
> > > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > > > perjob/application mode and after
> > > > deploying
> > > > > > the
> > > > > > > > > > cluster
> > > > > > > > > > > > > like
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > session
> > > > > > > > > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > The intention is to support
> application
> > > > > > > management
> > > > > > > > > > > through
> > > > > > > > > > > > > > operator
> > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > CR,
> > > > > > > > > > > > > > > > > > > > which means there won't be any 2 step
> > > > > > submission
> > > > > > > > > > process,
> > > > > > > > > > > > > > which as
> > > > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > > > allude to would defeat the purpose of
> > this
> > > > > > > project.
> > > > > > > > > The
> > > > > > > > > > > CR
> > > > > > > > > > > > > > example
> > > > > > > > > > > > > > > > > > shows
> > > > > > > > > > > > > > > > > > > > the application part. Please note
> that
> > the
> > > > > bare
> > > > > > > > > cluster
> > > > > > > > > > > > > > support is
> > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > > > > *additional* feature for scenarios
> that
> > > > > require
> > > > > > > > > > external
> > > > > > > > > > > > job
> > > > > > > > > > > > > > > > > > management.
> > > > > > > > > > > > > > > > > > > Is
> > > > > > > > > > > > > > > > > > > > there anything on the FLIP page that
> > > > creates
> > > > > a
> > > > > > > > > > different
> > > > > > > > > > > > > > > > impression?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > > > > > > Which Flink versions does the
> > operator
> > > > plan
> > > > > > to
> > > > > > > > > > support?
> > > > > > > > > > > > > > > > > > > > > 1. Native K8s deployment was
> first
> > > > > > introduced
> > > > > > > > in
> > > > > > > > > > > Flink
> > > > > > > > > > > > > 1.10
> > > > > > > > > > > > > > > > > > > > > 2. Native K8s HA was introduced in
> > Flink
> > > > > 1.12
> > > > > > > > > > > > > > > > > > > > > 3. The Pod template support was
> > > > introduced
> > > > > in
> > > > > > > > Flink
> > > > > > > > > > > 1.13
> > > > > > > > > > > > > > > > > > > > > > 4. There were some changes to the
> > Flink
> > > > > docker
> > > > > > > > image
> > > > > > > > > > > > > > entrypoint
> > > > > > > > > > > > > > > > > script
> > > > > > > > > > > > > > > > > > > in,
> > > > > > > > > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Great, thanks for providing this. It
> is
> > > > > > important
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > > > > > going forward also. We are targeting
> > Flink
> > > > > > 1.14.x
> > > > > > > > > > > upwards.
> > > > > > > > > > > > > > Before
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > operator is ready there will be
> another
> > > > Flink
> > > > > > > > > release.
> > > > > > > > > > > > Let's
> > > > > > > > > > > > > > see if
> > > > > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > > > > > > What kind of API compatibility can
> we
> > > > > commit
> > > > > > > to?
> > > > > > > > > It's
> > > > > > > > > > > > > > probably
> > > > > > > > > > > > > > > > fine
> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > > > > alpha / beta version APIs that
> allow
> > > > > > > incompatible
> > > > > > > > > > > future
> > > > > > > > > > > > > > changes
> > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > first version. But eventually we
> > would
> > > > need
> > > > > > to
> > > > > > > > > > > guarantee
> > > > > > > > > > > > > > > > backwards
> > > > > > > > > > > > > > > > > > > > > compatibility, so that an early
> > version
> > > > CR
> > > > > > can
> > > > > > > > work
> > > > > > > > > > > with
> > > > > > > > > > > > a
> > > > > > > > > > > > > > new
> > > > > > > > > > > > > > > > > > version
> > > > > > > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Another great point and please let me
> > > > include
> > > > > > > that
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > > > FLIP
> > > > > > > > > > > > > > > > page.
> > > > > > > > > > > > > > > > > > ;-)
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I think we should allow incompatible
> > > > changes
> > > > > > for
> > > > > > > > the
> > > > > > > > > > > first
> > > > > > > > > > > > > one
> > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > two
> > > > > > > > > > > > > > > > > > > > versions, similar to how other major
> > > > features
> > > > > > > have
> > > > > > > > > > > evolved
> > > > > > > > > > > > > > > > recently,
> > > > > > > > > > > > > > > > > > such
> > > > > > > > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Would be great to get broader
> feedback
> > on
> > > > > this
> > > > > > > one.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM
> > Thomas
> > > > > Weise
> > > > > > <
> > > > > > > > > > > > > thw@apache.org
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone
> > > > > > integration
> > > > > > > > > > > > > > > > > > > > > > > Maybe we should make this more
> > clear
> > > > in
> > > > > > the
> > > > > > > > > FLIP
> > > > > > > > > > > but
> > > > > > > > > > > > we
> > > > > > > > > > > > > > > > agreed
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > first version of the operator
> > based
> > > > on
> > > > > > the
> > > > > > > > > native
> > > > > > > > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > > > > > > > While this clearly does not
> > cover all
> > > > > > > > use-cases
> > > > > > > > > > and
> > > > > > > > > > > > > > > > > requirements,
> > > > > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > > > > > > > this would lead to a much
> smaller
> > > > > initial
> > > > > > > > > effort
> > > > > > > > > > > and
> > > > > > > > > > > > a
> > > > > > > > > > > > > > nicer
> > > > > > > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I'm also leaning towards the native integration, as long as it
> > > > > > > > > > > > > > > > > > > > > > > reduces the MVP effort. Ultimately the operator will need to also
> > > > > > > > > > > > > > > > > > > > > > > support the standalone mode. I would like to gain more confidence
> > > > > > > > > > > > > > > > > > > > > > > that native integration reduces the effort. While it cuts the
> > > > > > > > > > > > > > > > > > > > > > > effort to handle the TM pod creation, some mapping code from the
> > > > > > > > > > > > > > > > > > > > > > > CR to the native integration client and config needs to be
> > > > > > > > > > > > > > > > > > > > > > > created. As mentioned in the FLIP, native integration requires the
> > > > > > > > > > > > > > > > > > > > > > > Flink job manager to have access to the k8s API to create pods,
> > > > > > > > > > > > > > > > > > > > > > > which in some scenarios may be seen as unfavorable.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > > > > > > > Is the pod template in the CR the same as what Flink already
> > > > > > > > > > > > > > > > > > > > > > > > > > supports [4]? If so, I am afraid that not every arbitrary field
> > > > > > > > > > > > > > > > > > > > > > > > > > (e.g. cpu/memory resources) would take effect.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Yes, pod template would look almost identical. There are a few
> > > > > > > > > > > > > > > > > > > > > > > settings that the operator will control (and that may need to be
> > > > > > > > > > > > > > > > > > > > > > > blacklisted), but in general we would not want to place
> > > > > > > > > > > > > > > > > > > > > > > restrictions. I think a mechanism where a pod template is merged
> > > > > > > > > > > > > > > > > > > > > > > from multiple layers would also be interesting to make this more
> > > > > > > > > > > > > > > > > > > > > > > flexible.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Yang Wang <da...@gmail.com>.
Thanks for the explanation.
Given that this is unrelated to the Java version used by Flink itself,
starting with Java 11 for the flink-kubernetes-operator makes sense to me.


Best,
Yang

Thomas Weise <th...@apache.org> wrote on Tue, Feb 15, 2022 at 23:57:

> Hi,
>
> At this point I see no reason to support Java 8 for a new project.
> Java 8 is being phased out, we should start with 11.
>
> Also, since the operator isn't a library but effectively just a docker
> image, the ability to change the Java version isn't as critical as it
> is for Flink core, which needs to run in many different environments.
>
> Cheers,
> Thomas
>
> On Tue, Feb 15, 2022 at 4:50 AM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > Hi Devs,
> >
> > Yang Wang discovered that the current prototype is not compatible with
> > Java 8, only with Java 11 and upwards.
> >
> > The reason is that the Java Operator SDK itself is unfortunately not
> > Java 8 compatible.
> >
> > Given that Java 8 is on the road to deprecation and that the operator
> > runs as a containerized deployment, are there any concerns regarding
> > making the target Java version 11?
> > This should not affect deployed Flink clusters and jobs, which should
> > still work with Java 8. It only affects the Kubernetes operator itself.
> >
> > Cheers,
> > Gyula
> >
> >
> > On Tue, Feb 15, 2022 at 1:06 PM Yang Wang <da...@gmail.com> wrote:
> >
> > > > I also lean towards not introducing the savepoint/checkpoint related
> > > > fields to the job spec, especially at the very beginning of
> > > > flink-kubernetes-operator.
> > >
> > >
> > > Best,
> > > Yang
> > >
> > > Gyula Fóra <gy...@gmail.com> wrote on Tue, Feb 15, 2022 at 19:02:
> > >
> > > > Hi Peng Yuan!
> > > >
> > > > While I do agree that savepoint path is a very important production
> > > > configuration there are a lot of other things that come to my mind:
> > > >  - savepoint dir
> > > >  - checkpoint dir
> > > >  - checkpoint interval/timeout
> > > >  - high availability settings (provider/storagedir etc)
> > > >
> > > > just to name a few...
> > > >
> > > > > While these are all production critical, they have nice clean Flink
> > > > > config settings to go with them. If we start introducing these to the
> > > > > job spec we only get confusion about priority order etc., and they are
> > > > > going to be hard to change or remove in the future. In any case we
> > > > > should validate that these configs exist in cases where users use a
> > > > > stateful upgrade mode, for example.
> > > > > This is something we need to add for sure.
> > > >
> > > > > As for the other options you mentioned, like automatic savepoint
> > > > > generation for instance, those deserve an independent discussion of
> > > > > their own, I believe :)
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Tue, Feb 15, 2022 at 11:23 AM K Fred <yu...@gmail.com>
> wrote:
> > > >
> > > > > Hi Matyas!
> > > > >
> > > > > Thanks for your reply!
> > > > > > For scenarios 1 and 3, I couldn't agree more with the podTemplate
> > > > > > solution. I missed this part.
> > > > > > For savepoint related configuration, I think it is very important
> > > > > > that it can be specified in JobSpec, because savepoints are very
> > > > > > commonly configured when upgrading a job. If the setting is placed
> > > > > > in JobSpec, it is obvious to the user how to configure it. In
> > > > > > addition, other advanced properties can be put into
> > > > > > flinkConfiguration, customized by expert users.
> > > > > > A set of savepoint configuration options could be as follows:
> > > > > >
> > > > > >  - fromSavepoint: the savepoint the job restarts from
> > > > > >  - autoSavepointSecond: automatically take a savepoint to the
> > > > > >    `savepointsDir` every n seconds
> > > > > >  - savepointsDir: directory where automatically taken savepoints
> > > > > >    are stored
> > > > > >  - savepointGeneration: update the savepoint generation of the job
> > > > > >    status for a running job (should be defined in JobStatus)
> > > > >
> > > > >
> > > > > Best wishes,
> > > > > Peng Yuan.
> > > > >
> > > > > On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <
> matyas.orhidi@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Peng,
> > > > > >
> > > > > > > Thanks for your feedback. Regarding scenarios 1 and 3, the
> > > > > > > podTemplate functionality in the operator could cover both. We
> > > > > > > also need to be careful about introducing proxy parameters in the
> > > > > > > CRD spec. The savepoint path, for example, is usually accompanied
> > > > > > > by a bunch of other configurations, so users need to use
> > > > > > > configuration params anyway. What do you think?
> > > > > >
> > > > > > Best,
> > > > > > Matyas
> > > > > >
> > > > > > On Tue, Feb 15, 2022 at 8:58 AM K Fred <yu...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi Gyula!
> > > > > > >
> > > > > > > > I have reviewed the prototype design of flink-kubernetes-operator
> > > > > > > > you submitted, and I have the following questions:
> > > > > > > >
> > > > > > > > 1. Can the JobSpec support pulling the Flink job JAR through a
> > > > > > > > sidecar (init container)? Something like this:
> > > > > > >
> > > > > > > > > initContainers:
> > > > > > > > >       - name: downloader
> > > > > > > > >         image: curlimages/curl
> > > > > > > > >         env:
> > > > > > > > >           - name: JAR_URL
> > > > > > > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > > > > > > >           - name: DEST_PATH
> > > > > > > > >             value: /cache/flink-app.jar
> > > > > > > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > > > > > >
> > > > > > > > 2. Can we add a savepoint path property to the job specification?
> > > > > > > > 3. Can we add an extra port to the JobManagerSpec and
> > > > > > > > TaskManagerSpec to expose some service, such as Prometheus? The
> > > > > > > > property could look like this:
> > > > > > >
> > > > > > > > extraPorts:
> > > > > > > >       - name: prom
> > > > > > > >         containerPort: 9249
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Best wishes,
> > > > > > > Peng Yuan
> > > > > > >
> > > > > > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gyfora@apache.org
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Flink Devs!
> > > > > > > >
> > > > > > > > > We would like to present to you the first prototype of the
> > > > > > > > > flink-kubernetes-operator that was built based on the FLIP and the
> > > > > > > > > discussion on this mail thread. We would also like to call out some
> > > > > > > > > design decisions that we have made regarding architecture components
> > > > > > > > > that were not explicitly mentioned in the FLIP document/thread and
> > > > > > > > > give you the opportunity to raise any concerns here.
> > > > > > > >
> > > > > > > > You can find the initial prototype here:
> > > > > > > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > > > > > >
> > > > > > > > > We will leave the PR open for 1-2 days before merging to let people
> > > > > > > > > comment on it, but please be mindful that this is an initial
> > > > > > > > > prototype with many rough edges. It is not intended to be a complete
> > > > > > > > > implementation of the FLIP specs as that will take some more work
> > > > > > > > > from all of us :)
> > > > > > > >
> > > > > > > >
> > > > > > > > > *Prototype feature set:*
> > > > > > > > > The prototype contains a basic working version of the
> > > > > > > > > flink-kubernetes-operator that supports deployment and lifecycle
> > > > > > > > > management of a stateful native Flink application. We have basic
> > > > > > > > > support for stateful and stateless upgrades, UI ingress, pod
> > > > > > > > > templates etc. Error handling at this point is largely missing.
> > > > > > > >
> > > > > > > >
> > > > > > > > > *Features / design decisions that were not explicitly discussed in
> > > > > > > > > this thread*
> > > > > > > > >
> > > > > > > > > *Basic Admission control using a Webhook*
> > > > > > > > > Standard resource admission control in Kubernetes to validate and
> > > > > > > > > potentially reject resources is done through Webhooks.
> > > > > > > > > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > > > > > > This is a necessary mechanism to give the user an upfront error when
> > > > > > > > > an incorrect resource was submitted. In the Flink operator's case we
> > > > > > > > > need to validate that the FlinkDeployment yaml actually makes sense
> > > > > > > > > and does not contain erroneous config options that would inevitably
> > > > > > > > > lead to deployment/job failures.
> > > > > > > >
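> > > > > > > > > As a rough illustration of the mechanism, a validating webhook is
> > > > > > > > > registered with the API server through a
> > > > > > > > > `ValidatingWebhookConfiguration`. The sketch below uses the standard
> > > > > > > > > Kubernetes resource; the webhook name, service name, path, API group
> > > > > > > > > and resource plural are assumptions and not taken from the prototype:
> > > > > > > > >
> > > > > > > > > ```yaml
> > > > > > > > > apiVersion: admissionregistration.k8s.io/v1
> > > > > > > > > kind: ValidatingWebhookConfiguration
> > > > > > > > > metadata:
> > > > > > > > >   name: flink-operator-validation              # hypothetical
> > > > > > > > > webhooks:
> > > > > > > > >   - name: validate.flinkdeployment.example.org # hypothetical
> > > > > > > > >     admissionReviewVersions: ["v1"]
> > > > > > > > >     sideEffects: None
> > > > > > > > >     failurePolicy: Fail          # reject the resource if the webhook is unreachable
> > > > > > > > >     clientConfig:
> > > > > > > > >       service:
> > > > > > > > >         name: flink-operator-webhook-service   # hypothetical
> > > > > > > > >         namespace: default
> > > > > > > > >         path: /validate                        # hypothetical
> > > > > > > > >       # caBundle omitted; the API server needs it to trust the webhook's TLS certificate
> > > > > > > > >     rules:
> > > > > > > > >       - apiGroups: ["flink.apache.org"]        # assumed CRD group
> > > > > > > > >         apiVersions: ["*"]
> > > > > > > > >         operations: ["CREATE", "UPDATE"]
> > > > > > > > >         resources: ["flinkdeployments"]        # assumed plural name
> > > > > > > > > ```
> > > > > > > > >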
> > > > > > > > > We have implemented a simple webhook that we can use for this type
> > > > > > > > > of validation, as a separate maven module (flink-kubernetes-webhook).
> > > > > > > > > The webhook is an optional component and can be enabled or disabled
> > > > > > > > > during deployment. To avoid pulling in new external dependencies we
> > > > > > > > > have used the Flink Shaded Netty module to build the simple rest
> > > > > > > > > endpoint required. If the community feels that Netty adds
> > > > > > > > > unnecessary complexity to the webhook implementation we are open to
> > > > > > > > > alternative backends such as Spring Boot for instance, which would
> > > > > > > > > practically eliminate all the boilerplate.
> > > > > > > >
> > > > > > > >
> > > > > > > > > *Helm Chart for deployment*
> > > > > > > > > Helm charts provide an industry standard way of managing Kubernetes
> > > > > > > > > deployments. We have created a helm chart prototype that can be used
> > > > > > > > > to deploy the operator together with all required resources. The
> > > > > > > > > helm chart allows easy configuration for things like images,
> > > > > > > > > namespaces etc. and flags to control specific parts of the
> > > > > > > > > deployment such as RBAC or the webhook.
> > > > > > > >
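> > > > > > > > > To make the idea of such flags concrete, a values file for a chart
> > > > > > > > > like this might look roughly as follows. Every key, the image
> > > > > > > > > location and the namespace below are hypothetical and only
> > > > > > > > > illustrate the kind of switches described above, not the actual
> > > > > > > > > chart:
> > > > > > > > >
> > > > > > > > > ```yaml
> > > > > > > > > # values.yaml (all keys hypothetical)
> > > > > > > > > image:
> > > > > > > > >   repository: example.org/flink-kubernetes-operator
> > > > > > > > >   tag: latest
> > > > > > > > > operatorNamespace: flink-operator
> > > > > > > > > rbac:
> > > > > > > > >   create: true       # set to false to manage RBAC resources yourself
> > > > > > > > > webhook:
> > > > > > > > >   create: true       # set to false to deploy without the validating webhook
> > > > > > > > > ```
> > > > > > > > >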
> > > > > > > > > The helm chart provided is intended to be a first version that
> > > > > > > > > worked for us during development but we expect to have a lot of
> > > > > > > > > iterations on it based on the feedback from the community.
> > > > > > > >
> > > > > > > > *Acknowledgment*
> > > > > > > > > We would like to thank everyone who has provided support and
> > > > > > > > > valuable feedback on this FLIP.
> > > > > > > > > We would also like to thank Yang Wang & Alexis Sarda-Espinosa
> > > > > > > > > specifically for making their operators open source and available to
> > > > > > > > > us, which had a big impact on the FLIP and the prototype.
> > > > > > > > >
> > > > > > > > > We are looking forward to continuing development on the operator
> > > > > > > > > together with the broader community.
> > > > > > > > > All work will be tracked using the ASF Jira from now on.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <
> yuanpengfred@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Gyula,
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > > It's great to see the project getting started and I can't wait to
> > > > > > > > > > see the PR and start contributing code.😄😄😄
> > > > > > > > >
> > > > > > > > > Best Wishes!
> > > > > > > > > Peng Yuan
> > > > > > > > >
> > > > > > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <
> > > gyula.fora@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Peng Yuan!
> > > > > > > > > >
> > > > > > > > > > The repo is already created:
> > > > > > > > > > https://github.com/apache/flink-kubernetes-operator
> > > > > > > > > >
> > > > > > > > > > > We will open the PR with the initial prototype later today, stay
> > > > > > > > > > > tuned in this thread! :)
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Gyula
> > > > > > > > > >
> > > > > > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <
> > > yuanpengfred@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > > > > Has the flink-kubernetes-operator project been created on GitHub?
> > > > > > > > > > >
> > > > > > > > > > > Peng Yuan
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <
> > > > > gyula.fora@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > I agree with flink-kubernetes-operator as the repo name :)
> > > > > > > > > > > > > Don't have any better idea
> > > > > > > > > > > >
> > > > > > > > > > > > Gyula
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <
> > > > thw@apache.org>
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the continued feedback and discussion. Looks like we are
> > > > > > > > > > > > > > ready to start a VOTE, I will initiate it shortly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In parallel it would be good to find the repository name.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I thought "flink-operator" could be a bit misleading since the term
> > > > > > > > > > > > > > operator already has a meaning in Flink.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I also considered "flink-k8s-operator" but that would be almost
> > > > > > > > > > > > > > identical to existing operator implementations and could lead to
> > > > > > > > > > > > > > confusion in the future.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Thomas
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <
> > > > > > > gyula.fora@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Danny,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So far we have been focusing our dev efforts on the initial native
> > > > > > > > > > > > > > > implementation with the team.
> > > > > > > > > > > > > > > If the discussion and vote goes well for this FLIP we are looking
> > > > > > > > > > > > > > > forward to contributing the initial version sometime next week
> > > > > > > > > > > > > > > (fingers crossed).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > At that point I think we can already start the dev work to support
> > > > > > > > > > > > > > > the standalone mode as well, especially if you can dedicate some
> > > > > > > > > > > > > > > effort to pushing that side.
> > > > > > > > > > > > > > > Working together on this sounds like a great idea and we should
> > > > > > > > > > > > > > > start as soon as possible! :)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I have been discussing this one with my team. We are interested in
> > > > > > > > > > > > > > > > the Standalone mode, and are willing to contribute towards the
> > > > > > > > > > > > > > > > implementation. Potentially we can work together to support both
> > > > > > > > > > > > > > > > modes in parallel?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > > > > > > > > gyula.fora@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Versioning:
> > > > > > > > > > > > > > > > > Versioning will be independent from Flink and the operator will
> > > > > > > > > > > > > > > > > depend on a fixed Flink version (in every given operator version).
> > > > > > > > > > > > > > > > > This should be the exact same setup as with Stateful Functions
> > > > > > > > > > > > > > > > > (https://github.com/apache/flink-statefun). So independent release
> > > > > > > > > > > > > > > > > cycle but still within the Flink umbrella.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > > > > > > I think that's a very good point, as general exception handling for
> > > > > > > > > > > > > > > > > the different failure scenarios is a tricky problem. I think the
> > > > > > > > > > > > > > > > > exception classifiers and retry strategies could avoid a lot of
> > > > > > > > > > > > > > > > > manual intervention from the user. We will definitely need to add
> > > > > > > > > > > > > > > > > something like this. Once we have the repo created with the initial
> > > > > > > > > > > > > > > > > operator code we should open some tickets for this and put it on
> > > > > > > > > > > > > > > > > the short term roadmap!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer
> <
> > > > > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Great work on the FLIP, I am looking forward to this one. I agree
> > > > > > > > > > > > > > > > > > that we can move forward to the voting stage.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I have general feedback around how we will handle job submission
> > > > > > > > > > > > > > > > > > failure and retry. As discussed in the Rejected Alternatives
> > > > > > > > > > > > > > > > > > section, we can use Java to handle job submission failures from
> > > > > > > > > > > > > > > > > > the Flink client. It would be useful to have the ability to
> > > > > > > > > > > > > > > > > > configure exception classifiers and retry strategy as part of
> > > > > > > > > > > > > > > > > > operator configuration.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Given this will be in a separate GitHub repository I am curious how
> > > > > > > > > > > > > > > > > > the versioning strategy will work in relation to the Flink version.
> > > > > > > > > > > > > > > > > > Do we have any other components with a similar setup I can look at?
> > > > > > > > > > > > > > > > > > Will the operator version track Flink or will it use its own
> > > > > > > > > > > > > > > > > > versioning strategy with a Flink version support matrix, or
> > > > > > > > > > > > > > > > > > similar?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton
> Balassi <
> > > > > > > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > > > > > > > > > > > > > > > > > accordingly. If you are comfortable with the currently existing
> > > > > > > > > > > > > > > > > > > design and depth in the FLIP [1] I suggest moving forward to the
> > > > > > > > > > > > > > > > > > > voting stage - once that reaches a positive conclusion it lets us
> > > > > > > > > > > > > > > > > > > create the separate code repository under the flink project for
> > > > > > > > > > > > > > > > > > > the operator.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I encourage everyone to keep improving the details in the
> > > > > > > > > > > > > > > > > > > meantime, however I believe given the existing design and the
> > > > > > > > > > > > > > > > > > > general sentiment on this thread that the most efficient path from
> > > > > > > > > > > > > > > > > > > here is starting the implementation so that we can collectively
> > > > > > > > > > > > > > > > > > > iterate over it.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas
> > > Weise <
> > > > > > > > > > > thw@apache.org>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the feedback and please see responses below -->
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM
> Xintong
> > > > Song <
> > > > > > > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the
> > > > > > > > > > > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > > > > > > Deploying a Flink session cluster via kubectl & CR and then
> > > > > > > > > > > > > > > > > > > > > submitting jobs to the cluster via Flink cli / REST is probably
> > > > > > > > > > > > > > > > > > > > > the approach that requires the least effort. However, I'd like to
> > > > > > > > > > > > > > > > > > > > > point out 2 weaknesses.
> > > > > > > > > > > > > > > > > > > > > 1. A lot of users use Flink in perjob/application modes. For these
> > > > > > > > > > > > > > > > > > > > > users, having to run the job in two steps (deploy the cluster, and
> > > > > > > > > > > > > > > > > > > > > submit the job) is not that convenient.
> > > > > > > > > > > > > > > > > > > > > 2. One of our motivations is being able to manage Flink
> > > > > > > > > > > > > > > > > > > > > applications' lifecycles with kubectl. Submitting jobs from cli
> > > > > > > > > > > > > > > > > > > > > sounds not aligned with this motivation.
> > > > > > > > > > > > > > > > > > > > > I think it's probably worth it to support submitting jobs via
> > > > > > > > > > > > > > > > > > > > > kubectl & CR in the first version, both together with deploying
> > > > > > > > > > > > > > > > > > > > > the cluster like in perjob/application mode and after deploying
> > > > > > > > > > > > > > > > > > > > > the cluster like in session mode.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > The intention is to support application management through
> > > > > > > > > > > > > > > > > > > > operator and CR, which means there won't be any 2 step submission
> > > > > > > > > > > > > > > > > > > > process, which as you allude to would defeat the purpose of this
> > > > > > > > > > > > > > > > > > > > project. The CR example shows the application part. Please note
> > > > > > > > > > > > > > > > > > > > that the bare cluster support is an *additional* feature for
> > > > > > > > > > > > > > > > > > > > scenarios that require external job management. Is there anything
> > > > > > > > > > > > > > > > > > > > on the FLIP page that creates a different impression?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > > > > > Which Flink versions does the
> operator
> > > plan
> > > > > to
> > > > > > > > > support?
> > > > > > > > > > > > > > > > > > > > > > 1. Native K8s deployment was first
> > > > > introduced
> > > > > > > in
> > > > > > > > > > Flink
> > > > > > > > > > > > 1.10
> > > > > > > > > > > > > > > > > > > > 2. Native K8s HA was introduced in
> Flink
> > > > 1.12
> > > > > > > > > > > > > > > > > > > > 3. The Pod template support was
> > > introduced
> > > > in
> > > > > > > Flink
> > > > > > > > > > 1.13
> > > > > > > > > > > > > > > > > > > > > > 4. There were some changes to the
> Flink
> > > > docker
> > > > > > > image
> > > > > > > > > > > > > entrypoint
> > > > > > > > > > > > > > > > script
> > > > > > > > > > > > > > > > > > in,
> > > > > > > > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Great, thanks for providing this. It is
> > > > > important
> > > > > > > for
> > > > > > > > > the
> > > > > > > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > > > > going forward also. We are targeting
> Flink
> > > > > 1.14.x
> > > > > > > > > > upwards.
> > > > > > > > > > > > > Before
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > operator is ready there will be another
> > > Flink
> > > > > > > > release.
> > > > > > > > > > > Let's
> > > > > > > > > > > > > see if
> > > > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > > > > > > What kind of API compatibility can we
> > > > commit
> > > > > > to?
> > > > > > > > It's
> > > > > > > > > > > > > probably
> > > > > > > > > > > > > > > fine
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > > > alpha / beta version APIs that allow
> > > > > > incompatible
> > > > > > > > > > future
> > > > > > > > > > > > > changes
> > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > first version. But eventually we
> would
> > > need
> > > > > to
> > > > > > > > > > guarantee
> > > > > > > > > > > > > > > backwards
> > > > > > > > > > > > > > > > > > > > compatibility, so that an early
> version
> > > CR
> > > > > can
> > > > > > > work
> > > > > > > > > > with
> > > > > > > > > > > a
> > > > > > > > > > > > > new
> > > > > > > > > > > > > > > > > version
> > > > > > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Another great point and please let me
> > > include
> > > > > > that
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > > FLIP
> > > > > > > > > > > > > > > page.
> > > > > > > > > > > > > > > > > ;-)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I think we should allow incompatible
> > > changes
> > > > > for
> > > > > > > the
> > > > > > > > > > first
> > > > > > > > > > > > one
> > > > > > > > > > > > > or
> > > > > > > > > > > > > > > two
> > > > > > > > > > > > > > > > > > > versions, similar to how other major
> > > features
> > > > > > have
> > > > > > > > > > evolved
> > > > > > > > > > > > > > > recently,
> > > > > > > > > > > > > > > > > such
> > > > > > > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Would be great to get broader feedback
> on
> > > > this
> > > > > > one.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM
> Thomas
> > > > Weise
> > > > > <
> > > > > > > > > > > > thw@apache.org
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone
> > > > > integration
> > > > > > > > > > > > > > > > > > > > > > Maybe we should make this more
> clear
> > > in
> > > > > the
> > > > > > > > FLIP
> > > > > > > > > > but
> > > > > > > > > > > we
> > > > > > > > > > > > > > > agreed
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > first version of the operator
> based
> > > on
> > > > > the
> > > > > > > > native
> > > > > > > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > > > > > > While this clearly does not
> cover all
> > > > > > > use-cases
> > > > > > > > > and
> > > > > > > > > > > > > > > > requirements,
> > > > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > > > > > > this would lead to a much smaller
> > > > initial
> > > > > > > > effort
> > > > > > > > > > and
> > > > > > > > > > > a
> > > > > > > > > > > > > nicer
> > > > > > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I'm also leaning towards the native
> > > > > > > integration,
> > > > > > > > as
> > > > > > > > > > > long
> > > > > > > > > > > > > as it
> > > > > > > > > > > > > > > > > > reduces
> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > MVP effort. Ultimately the operator
> > > will
> > > > > need
> > > > > > > to
> > > > > > > > > also
> > > > > > > > > > > > > support
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > standalone mode. I would like to
> gain
> > > > more
> > > > > > > > > confidence
> > > > > > > > > > > > that
> > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > > > integration reduces the effort.
> While
> > > it
> > > > > cuts
> > > > > > > the
> > > > > > > > > > > effort
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > handle
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > TM
> > > > > > > > > > > > > > > > > > > > > pod creation, some mapping code
> from
> > > the
> > > > CR
> > > > > > to
> > > > > > > > the
> > > > > > > > > > > native
> > > > > > > > > > > > > > > > > integration
> > > > > > > > > > > > > > > > > > > > > client and config needs to be
> created.
> > > As
> > > > > > > > mentioned
> > > > > > > > > > in
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > FLIP,
> > > > > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > > > integration requires the Flink job
> > > > manager
> > > > > to
> > > > > > > > have
> > > > > > > > > > > access
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > k8s
> > > > > > > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > create pods, which in some
> scenarios
> > > may
> > > > be
> > > > > > > seen
> > > > > > > > as
> > > > > > > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > > > > > > Is the pod template in the CR the same as what
> > > > > > > > > > > > > > > > > > > > > > > > > Flink already supports[4]? If so, I am afraid
> > > > > > > > > > > > > > > > > > > > > > > > > that arbitrary fields (e.g. cpu/memory
> > > > > > > > > > > > > > > > > > > > > > > > > resources) could not take effect.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Yes, pod template would look almost
> > > > > > identical.
> > > > > > > > > There
> > > > > > > > > > > are
> > > > > > > > > > > > a
> > > > > > > > > > > > > few
> > > > > > > > > > > > > > > > > > settings
> > > > > > > > > > > > > > > > > > > > > that the operator will control (and
> > > that
> > > > > may
> > > > > > > need
> > > > > > > > > to
> > > > > > > > > > be
> > > > > > > > > > > > > > > > > blacklisted),
> > > > > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > > > > in general we would not want to
> place
> > > > > > > > > restrictions. I
> > > > > > > > > > > > > think a
> > > > > > > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > > > > > > where a pod template is merged from
> > > > > multiple
> > > > > > > > layers
> > > > > > > > > > > would
> > > > > > > > > > > > > also
> > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > interesting to make this more
> flexible.
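
The layered pod template idea mentioned above can be illustrated with a small recursive merge. This is only a sketch in Python, not operator code: `merge_layers` and the example layers are hypothetical names, and replacing lists wholesale (instead of, say, merging containers by name) is a simplifying assumption.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_layers(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two pod template layers; values in `override` win.

    Nested dicts are merged key by key. Any other value, including
    lists such as `containers`, is replaced as a whole (an assumption
    made to keep the sketch short).
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_layers(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical layers: a cluster-wide default and a per-job template.
default_layer = {
    "metadata": {"labels": {"team": "data"}},
    "spec": {"serviceAccountName": "flink"},
}
job_layer = {
    "metadata": {"labels": {"app": "wordcount"}},
    "spec": {"serviceAccountName": "flink-job"},
}
effective = merge_layers(default_layer, job_layer)
```

Settings that the operator controls itself could be applied as a final layer, so that they always take precedence over user-provided templates.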
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Thomas Weise <th...@apache.org>.
Hi,

At this point I see no reason to support Java 8 for a new project.
Java 8 is being phased out, so we should start with 11.

Also, since the operator isn't a library but effectively just a docker
image, the ability to change the Java version isn't as critical as it
is for Flink core, which needs to run in many different environments.

Cheers,
Thomas

On Tue, Feb 15, 2022 at 4:50 AM Gyula Fóra <gy...@gmail.com> wrote:
>
> Hi Devs,
>
> Yang Wang discovered that the current prototype is not compatible with Java
> 8; it requires Java 11 or later.
>
> The reason is that the Java Operator SDK itself is unfortunately not
> compatible with Java 8.
>
> Given that Java 8 is on the road to deprecation and that the operator runs
> as a containerized deployment, are there any concerns regarding making the
> target java version 11?
> This affects only the Kubernetes operator itself; deployed Flink clusters and
> jobs should still work with Java 8.
>
> Cheers,
> Gyula
>
>
> On Tue, Feb 15, 2022 at 1:06 PM Yang Wang <da...@gmail.com> wrote:
>
> > I also lean towards not introducing the savepoint/checkpoint related fields
> > in the job spec, especially at the very beginning of flink-kubernetes-operator.
> >
> >
> > Best,
> > Yang
> >
> > On Tue, Feb 15, 2022 at 19:02, Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > Hi Peng Yuan!
> > >
> > > While I do agree that savepoint path is a very important production
> > > configuration there are a lot of other things that come to my mind:
> > >  - savepoint dir
> > >  - checkpoint dir
> > >  - checkpoint interval/timeout
> > >  - high availability settings (provider/storagedir etc)
> > >
> > > just to name a few...
> > >
> > > While these are all production critical, they have nice clean Flink config
> > > settings to go with them. If we start introducing these to the JobSpec we
> > > only get confusion about priority order etc., and it is going to be hard to
> > > change or remove them in the future. In any case, we should validate that
> > > these configs exist in cases where users use a stateful upgrade mode, for
> > > example. This is something we need to add for sure.
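
The validation described above could look roughly like the following. This is a hedged sketch in Python, not the operator's implementation: `state.savepoints.dir` and `state.checkpoints.dir` are existing Flink configuration keys, but the function name, the mode names and the choice of which keys are required for a stateful upgrade are assumptions made for illustration.

```python
from typing import Dict, List

# Existing Flink config keys; treating exactly these as mandatory for
# stateful upgrades is an assumption of this sketch.
REQUIRED_FOR_STATEFUL = ("state.savepoints.dir", "state.checkpoints.dir")


def validate_upgrade_config(upgrade_mode: str, flink_conf: Dict[str, str]) -> List[str]:
    """Return human-readable errors; an empty list means the spec is valid."""
    if upgrade_mode != "stateful":
        # Stateless upgrades do not need any state-related settings.
        return []
    return [
        f"upgrade mode '{upgrade_mode}' requires '{key}' in flinkConfiguration"
        for key in REQUIRED_FOR_STATEFUL
        if not flink_conf.get(key, "").strip()
    ]
```

Keeping the settings in the Flink configuration and only validating their presence avoids a second source of truth in the JobSpec, which is the priority-order concern raised above.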
> > >
> > > As for the other options you mentioned like automatic savepoint
> > generation
> > > for instance, those deserve an independent discussion of their own I
> > > believe :)
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Tue, Feb 15, 2022 at 11:23 AM K Fred <yu...@gmail.com> wrote:
> > >
> > > > Hi Matyas!
> > > >
> > > > Thanks for your reply!
> > > > For scenarios 1 and 3, I couldn't agree more with the podTemplate
> > > > solution; I missed this part.
> > > > For the savepoint-related configuration, I think it is very important to
> > > > specify it in the JobSpec, because savepoints are very commonly configured
> > > > when upgrading a job; if they are placed in the JobSpec, they are obvious
> > > > for the user to configure. In addition, other advanced properties can be
> > > > put into flinkConfiguration, customized by expert users.
> > > > A set of savepoint configuration options could look as follows:
> > > >
> > > > fromSavepoint: the savepoint the job restarts from.
> > > >
> > > > autoSavepointSecond: automatically take a savepoint to the
> > > > `savepointsDir` every n seconds.
> > > >
> > > > savepointsDir: the directory where automatically taken savepoints are
> > > > stored.
> > > >
> > > > savepointGeneration: update the savepoint generation of the job status
> > > > for a running job (should be defined in JobStatus).
> > > >
> > > >
> > > > Best wishes,
> > > > Peng Yuan.
> > > >
> > > > On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <matyas.orhidi@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi Peng,
> > > > >
> > > > > Thanks for your feedback. Regarding scenarios 1 and 3, the podTemplate
> > > > > functionality in the operator could cover both. We also need to be
> > > > > careful about introducing proxy parameters in the CRD spec. The
> > > > > savepoint path, for example, is usually accompanied by a bunch of other
> > > > > configurations, so users need to use configuration params anyway. What
> > > > > do you think?
> > > > >
> > > > > Best,
> > > > > Matyas
> > > > >
> > > > > On Tue, Feb 15, 2022 at 8:58 AM K Fred <yu...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi Gyula!
> > > > > >
> > > > > > I have reviewed the prototype design of flink-kubernetes-operator
> > you
> > > > > > submitted, and I have the following questions:
> > > > > >
> > > > > > 1. Can support for pulling the Flink jar through a sidecar be added to
> > > > > > the JobSpec? For example:
> > > > > >
> > > > > > > initContainers:
> > > > > > >       - name: downloader
> > > > > > >         image: curlimages/curl
> > > > > > >         env:
> > > > > > >           - name: JAR_URL
> > > > > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > > > > >           - name: DEST_PATH
> > > > > > >             value: /cache/flink-app.jar
> > > > > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > > > > >
> > > > > > 2. Can we add a savepoint path property to the job specification?
> > > > > > 3. Can we add an extra port to the JobManagerSpec and TaskManagerSpec to
> > > > > > expose some service, such as Prometheus? The property could look like this:
> > > > > >
> > > > > > > extraPorts:
> > > > > > >       - name: prom
> > > > > > >         containerPort: 9249
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best wishes,
> > > > > > Peng Yuan
> > > > > >
> > > > > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gy...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > Hi Flink Devs!
> > > > > > >
> > > > > > > We would like to present to you the first prototype of the
> > > > > > > flink-kubernetes-operator that was built based on the FLIP and
> > the
> > > > > > > discussion on this mail thread. We would also like to call out
> > some
> > > > > > design
> > > > > > > decisions that we have made regarding architecture components
> > that
> > > > were
> > > > > > not
> > > > > > > explicitly mentioned in the FLIP document/thread and give you the
> > > > > > > opportunity to raise any concerns here.
> > > > > > >
> > > > > > > You can find the initial prototype here:
> > > > > > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > > > > >
> > > > > > > We will leave the PR open for 1-2 days before merging to let
> > people
> > > > > > comment
> > > > > > > on it, but please be mindful that this is an initial prototype
> > with
> > > > > many
> > > > > > > rough edges. It is not intended to be a complete implementation
> > of
> > > > the
> > > > > > FLIP
> > > > > > > specs as that will take some more work from all of us :)
> > > > > > >
> > > > > > >
> > > > > > > *Prototype feature set:*
> > > > > > > The prototype contains a basic working version of the
> > > > > > > flink-kubernetes-operator that supports deployment and lifecycle
> > > > > > > management of a stateful native Flink application. We have basic support
> > > > > > > for stateful and stateless upgrades, UI ingress, pod templates etc. Error
> > > > > > > handling at this point is largely missing.
> > > > > > >
> > > > > > >
> > > > > > > *Features / design decisions that were not explicitly discussed in this
> > > > > > > thread*
> > > > > > >
> > > > > > > *Basic Admission control using a Webhook*
> > > > > > > Standard resource admission control in Kubernetes to validate and
> > > > > > > potentially reject resources is done through Webhooks.
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > > > > This is a necessary mechanism to give the user an upfront error
> > > when
> > > > an
> > > > > > > incorrect resource was submitted. In the Flink operator's case we
> > > > need
> > > > > to
> > > > > > > validate that the FlinkDeployment yaml actually makes sense and
> > > does
> > > > > not
> > > > > > > contain erroneous config options that would inevitably lead to
> > > > > > > deployment/job failures.
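
To make the webhook's role concrete, the following sketch shows the shape of the reply a validating webhook sends back to the Kubernetes API server. It is written in Python for brevity and is not taken from the flink-kubernetes-webhook module; `admission_response` is a hypothetical helper, and the HTTP 400 status code for rejections is an assumption. The `AdmissionReview` fields (`uid`, `allowed`, `status`) follow the `admission.k8s.io/v1` API.

```python
from typing import Any, Dict, List


def admission_response(review: Dict[str, Any], errors: List[str]) -> Dict[str, Any]:
    """Build the AdmissionReview reply for an incoming AdmissionReview request."""
    response: Dict[str, Any] = {
        # The uid must be copied from the request so the API server can match them.
        "uid": review["request"]["uid"],
        "allowed": not errors,
    }
    if errors:
        # The message is surfaced to the user, e.g. by kubectl apply.
        response["status"] = {"code": 400, "message": "; ".join(errors)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

With this shape, a rejected FlinkDeployment fails at submission time with the validation message, instead of failing later during deployment.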
> > > > > > >
> > > > > > > We have implemented a simple webhook that we can use for this
> > type
> > > of
> > > > > > > validation, as a separate maven module
> > (flink-kubernetes-webhook).
> > > > The
> > > > > > > webhook is an optional component and can be enabled or disabled
> > > > during
> > > > > > > deployment. To avoid pulling in new external dependencies we have
> > > > used
> > > > > > the
> > > > > > > Flink Shaded Netty module to build the simple rest endpoint
> > > required.
> > > > > If
> > > > > > > the community feels that Netty adds unnecessary complexity to the
> > > > > webhook
> > > > > > > implementation we are open to alternative backends such as
> > > Springboot
> > > > > for
> > > > > > > instance which would practically eliminate all the boilerplate.
> > > > > > >
> > > > > > >
> > > > > > > *Helm Chart for deployment*
> > > > > > > Helm charts provide an industry standard way of managing Kubernetes
> > > > > > > deployments. We have created a helm chart prototype that can be used to
> > > > > > > deploy the operator together with all required resources. The helm chart
> > > > > > > allows easy configuration of things like images, namespaces etc. and
> > > > > > > provides flags to control specific parts of the deployment such as RBAC
> > > > > > > or the webhook.
> > > > > > >
> > > > > > > The helm chart provided is intended to be a first version that
> > > worked
> > > > > for
> > > > > > > us during development but we expect to have a lot of iterations
> > on
> > > it
> > > > > > based
> > > > > > > on the feedback from the community.
> > > > > > >
> > > > > > > *Acknowledgment*
> > > > > > > We would like to thank everyone who has provided support and
> > > valuable
> > > > > > > feedback on this FLIP.
> > > > > > > We would also like to thank Yang Wang & Alexis Sarda-Espinosa
> > > > > > specifically
> > > > > > > for making their operators open source and available to us which
> > > had
> > > > a
> > > > > > big
> > > > > > > impact on the FLIP and the prototype.
> > > > > > >
> > > > > > > We are looking forward to continuing development on the operator
> > > > > together
> > > > > > > with the broader community.
> > > > > > > All work will be tracked using the ASF Jira from now on.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <yu...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Gyula,
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > It's great to see the project getting started and I can't wait
> > to
> > > > see
> > > > > > the
> > > > > > > > PR and start contributing code.😄😄😄
> > > > > > > >
> > > > > > > > Best Wishes!
> > > > > > > > Peng Yuan
> > > > > > > >
> > > > > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <
> > gyula.fora@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Peng Yuan!
> > > > > > > > >
> > > > > > > > > The repo is already created:
> > > > > > > > > https://github.com/apache/flink-kubernetes-operator
> > > > > > > > >
> > > > > > > > > We will open the PR with the initial prototype later today,
> > > stay
> > > > > > tuned
> > > > > > > in
> > > > > > > > > this thread! :)
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <
> > yuanpengfred@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > Has the project of flink-kubernetes-operator been created
> > in
> > > > > > github?
> > > > > > > > > >
> > > > > > > > > > Peng Yuan
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <
> > > > gyula.fora@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I agree with flink-kubernetes-operator as the repo name
> > :)
> > > > > > > > > > > Don't have any better idea
> > > > > > > > > > >
> > > > > > > > > > > Gyula
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <
> > > thw@apache.org>
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the continued feedback and discussion. Looks
> > > > like
> > > > > we
> > > > > > > are
> > > > > > > > > > > > ready to start a VOTE, I will initiate it shortly.
> > > > > > > > > > > >
> > > > > > > > > > > > In parallel it would be good to find the repository
> > name.
> > > > > > > > > > > >
> > > > > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > > > > >
> > > > > > > > > > > > I thought "flink-operator" could be a bit misleading
> > > since
> > > > > the
> > > > > > > term
> > > > > > > > > > > > operator already has a meaning in Flink.
> > > > > > > > > > > >
> > > > > > > > > > > > I also considered "flink-k8s-operator" but that would
> > be
> > > > > almost
> > > > > > > > > > > > identical to existing operator implementations and
> > could
> > > > lead
> > > > > > to
> > > > > > > > > > > > confusion in the future.
> > > > > > > > > > > >
> > > > > > > > > > > > Thoughts?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <
> > > > > > gyula.fora@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Danny,
> > > > > > > > > > > > >
> > > > > > > > > > > > > So far we have been focusing our dev efforts on the
> > > > initial
> > > > > > > > native
> > > > > > > > > > > > > implementation with the team.
> > > > > > > > > > > > > If the discussion and vote goes well for this FLIP we
> > > are
> > > > > > > looking
> > > > > > > > > > > forward
> > > > > > > > > > > > > to contributing the initial version sometime next
> > week
> > > > > > (fingers
> > > > > > > > > > > crossed).
> > > > > > > > > > > > >
> > > > > > > > > > > > > At that point I think we can already start the dev
> > work
> > > > to
> > > > > > > > support
> > > > > > > > > > the
> > > > > > > > > > > > > standalone mode as well, especially if you can
> > dedicate
> > > > > some
> > > > > > > > effort
> > > > > > > > > > to
> > > > > > > > > > > > > pushing that side.
> > > > > > > > > > > > > Working together on this sounds like a great idea and
> > > we
> > > > > > should
> > > > > > > > > start
> > > > > > > > > > > as
> > > > > > > > > > > > > soon as possible! :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Gyula
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I have been discussing this one with my team. We
> > are
> > > > > > > interested
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > > > Standalone mode, and are willing to contribute
> > > towards
> > > > > the
> > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > Potentially we can work together to support both
> > > modes
> > > > in
> > > > > > > > > parallel?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > > > > > > > gyula.fora@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Versioning:
> > > > > > > > > > > > > > > Versioning will be independent from Flink and the
> > > > > > operator
> > > > > > > > will
> > > > > > > > > > > > depend
> > > > > > > > > > > > > > on a
> > > > > > > > > > > > > > > fixed flink version (in every given operator
> > > > version).
> > > > > > > > > > > > > > > This should be the exact same setup as with
> > > Stateful
> > > > > > > > Functions
> > > > > > > > > (
> > > > > > > > > > > > > > > https://github.com/apache/flink-statefun). So
> > > > > > independent
> > > > > > > > > > release
> > > > > > > > > > > > cycle
> > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > still within the Flink umbrella.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > > > > I think that's a very good point, as general
> > > > exception
> > > > > > > > handling
> > > > > > > > > > for
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > different failure scenarios is a tricky problem.
> > I
> > > > > think
> > > > > > > the
> > > > > > > > > > > > exception
> > > > > > > > > > > > > > > classifiers and retry strategies could avoid a
> > lot
> > > of
> > > > > > > manual
> > > > > > > > > > > > intervention
> > > > > > > > > > > > > > > from the user. We will definitely need to add
> > > > something
> > > > > > > like
> > > > > > > > > > this.
> > > > > > > > > > > > Once
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > have the repo created with the initial operator
> > > code
> > > > we
> > > > > > > > should
> > > > > > > > > > open
> > > > > > > > > > > > some
> > > > > > > > > > > > > > > tickets for this and put it on the short term
> > > > roadmap!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Great work on the FLIP, I am looking forward to
> > > > this
> > > > > > > one. I
> > > > > > > > > > agree
> > > > > > > > > > > > that
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I have general feedback around how we will
> > handle
> > > > job
> > > > > > > > > > submission
> > > > > > > > > > > > > > failure
> > > > > > > > > > > > > > > > and retry. As discussed in the Rejected
> > > > Alternatives
> > > > > > > > section,
> > > > > > > > > > we
> > > > > > > > > > > > can
> > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > Java to handle job submission failures from the
> > > > Flink
> > > > > > > > client.
> > > > > > > > > > It
> > > > > > > > > > > > would
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > useful to have the ability to configure
> > exception
> > > > > > > > classifiers
> > > > > > > > > > and
> > > > > > > > > > > > retry
> > > > > > > > > > > > > > > > strategy as part of operator configuration.
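
The configurable exception classifier and retry strategy suggested above could be sketched as follows. This is an illustrative Python sketch under stated assumptions, not a proposal for the operator's actual API: `submit_with_retry`, its parameters and the exponential backoff policy are all hypothetical.

```python
import time
from typing import Callable, Iterable, Type, TypeVar

T = TypeVar("T")


def submit_with_retry(
    submit: Callable[[], T],
    retryable: Iterable[Type[BaseException]],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `submit`, retrying only on exception types classified as retryable.

    The delay doubles after every failed attempt. Exceptions that are not
    listed in `retryable` propagate immediately, without any retry.
    """
    retryable_types = tuple(retryable)
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except retryable_types:
            if attempt == max_attempts:
                raise  # Retries exhausted: surface the last failure.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The classifier here is just a list of exception types; a real configuration would likely need to distinguish transient failures (e.g. an unreachable API server) from permanent ones (e.g. an invalid job spec).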
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Given this will be in a separate Github
> > > repository
> > > > I
> > > > > am
> > > > > > > > > curious
> > > > > > > > > > > how
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > versioning strategy will work in relation to
> > the
> > > > > Flink
> > > > > > > > > version?
> > > > > > > > > > > Do
> > > > > > > > > > > > we
> > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > any other components with a similar setup I can
> > > > look
> > > > > > at?
> > > > > > > > Will
> > > > > > > > > > the
> > > > > > > > > > > > > > > operator
> > > > > > > > > > > > > > > > version track Flink or will it use its own
> > > > versioning
> > > > > > > > > strategy
> > > > > > > > > > > > with a
> > > > > > > > > > > > > > > Flink
> > > > > > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thank you for the great feedback, Thomas has updated
> > > > > > > > > > > > > > > > > > the FLIP page accordingly. If you are comfortable
> > > > > > > > > > > > > > > > > > with the currently existing design and depth in the
> > > > > > > > > > > > > > > > > > FLIP [1] I suggest moving forward to the voting
> > > > > > > > > > > > > > > > > > stage - once that reaches a positive conclusion it
> > > > > > > > > > > > > > > > > > lets us create the separate code repository under
> > > > > > > > > > > > > > > > > > the flink project for the operator.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I encourage everyone to keep improving the details
> > > > > > > > > > > > > > > > > > in the meantime, however I believe given the
> > > > > > > > > > > > > > > > > > existing design and the general sentiment on this
> > > > > > > > > > > > > > > > > > thread that the most efficient path from here is
> > > > > > > > > > > > > > > > > > starting the implementation so that we can
> > > > > > > > > > > > > > > > > > collectively iterate over it.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise
> > > > > > > > > > > > > > > > > > <thw@apache.org> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for the feedback and please see responses
> > > > > > > > > > > > > > > > > > > below -->
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song
> > > > > > > > > > > > > > > > > > > <tonysong820@gmail.com> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and everyone
> > > > > > > > > > > > > > > > > > > > for the discussion.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > > > > > Deploying a Flink session cluster via kubectl & CR
> > > > > > > > > > > > > > > > > > > > and then submitting jobs to the cluster via Flink
> > > > > > > > > > > > > > > > > > > > cli / REST is probably the approach that requires
> > > > > > > > > > > > > > > > > > > > the least effort. However, I'd like to point out 2
> > > > > > > > > > > > > > > > > > > > weaknesses.
> > > > > > > > > > > > > > > > > > > > 1. A lot of users use Flink in perjob/application
> > > > > > > > > > > > > > > > > > > > modes. For these users, having to run the job in
> > > > > > > > > > > > > > > > > > > > two steps (deploy the cluster, and submit the job)
> > > > > > > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > > > > > > 2. One of our motivations is being able to manage
> > > > > > > > > > > > > > > > > > > > Flink applications' lifecycles with kubectl.
> > > > > > > > > > > > > > > > > > > > Submitting jobs from cli sounds not aligned with
> > > > > > > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > > > > > > I think it's probably worth it to support
> > > > > > > > > > > > > > > > > > > > submitting jobs via kubectl & CR in the first
> > > > > > > > > > > > > > > > > > > > version, both together with deploying the cluster
> > > > > > > > > > > > > > > > > > > > like in perjob/application mode and after deploying
> > > > > > > > > > > > > > > > > > > > the cluster like in session mode.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > The intention is to support application management
> > > > > > > > > > > > > > > > > > > through operator and CR, which means there won't be
> > > > > > > > > > > > > > > > > > > any 2 step submission process, which as you allude
> > > > > > > > > > > > > > > > > > > to would defeat the purpose of this project. The CR
> > > > > > > > > > > > > > > > > > > example shows the application part. Please note that
> > > > > > > > > > > > > > > > > > > the bare cluster support is an *additional* feature
> > > > > > > > > > > > > > > > > > > for scenarios that require external job management.
> > > > > > > > > > > > > > > > > > > Is there anything on the FLIP page that creates a
> > > > > > > > > > > > > > > > > > > different impression?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > > > > > Which Flink versions does the operator plan to
> > > > > > > > > > > > > > > > > > > > support?
> > > > > > > > > > > > > > > > > > > > 1. Native K8s deployment was first introduced in
> > > > > > > > > > > > > > > > > > > > Flink 1.10
> > > > > > > > > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > > > > > > > > > 3. The Pod template support was introduced in Flink
> > > > > > > > > > > > > > > > > > > > 1.13
> > > > > > > > > > > > > > > > > > > > 4. There were some changes to the Flink docker image
> > > > > > > > > > > > > > > > > > > > entrypoint script in, IIRC, Flink 1.13
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Great, thanks for providing this. It is important for
> > > > > > > > > > > > > > > > > > > the compatibility going forward also. We are
> > > > > > > > > > > > > > > > > > > targeting Flink 1.14.x upwards. Before the operator
> > > > > > > > > > > > > > > > > > > is ready there will be another Flink release. Let's
> > > > > > > > > > > > > > > > > > > see if anyone is interested in earlier versions?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > > > > What kind of API compatibility can we commit to?
> > > > > > > > > > > > > > > > > > > > It's probably fine to have alpha / beta version
> > > > > > > > > > > > > > > > > > > > APIs that allow incompatible future changes for the
> > > > > > > > > > > > > > > > > > > > first version. But eventually we would need to
> > > > > > > > > > > > > > > > > > > > guarantee backwards compatibility, so that an early
> > > > > > > > > > > > > > > > > > > > version CR can work with a new version operator.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Another great point and please let me include that on
> > > > > > > > > > > > > > > > > > > the FLIP page. ;-)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I think we should allow incompatible changes for the
> > > > > > > > > > > > > > > > > > > first one or two versions, similar to how other major
> > > > > > > > > > > > > > > > > > > features have evolved recently, such as FLIP-27.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise
> > > > > > > > > > > > > > > > > > > > <thw@apache.org> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > > > > > > > > > > Maybe we should make this more clear in the FLIP
> > > > > > > > > > > > > > > > > > > > > > but we agreed to do the first version of the
> > > > > > > > > > > > > > > > > > > > > > operator based on the native integration.
> > > > > > > > > > > > > > > > > > > > > > While this clearly does not cover all use-cases
> > > > > > > > > > > > > > > > > > > > > > and requirements, it seems this would lead to a
> > > > > > > > > > > > > > > > > > > > > > much smaller initial effort and a nicer first
> > > > > > > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I'm also leaning towards the native integration, as
> > > > > > > > > > > > > > > > > > > > > long as it reduces the MVP effort. Ultimately the
> > > > > > > > > > > > > > > > > > > > > operator will need to also support the standalone
> > > > > > > > > > > > > > > > > > > > > mode. I would like to gain more confidence that
> > > > > > > > > > > > > > > > > > > > > native integration reduces the effort. While it
> > > > > > > > > > > > > > > > > > > > > cuts the effort to handle the TM pod creation, some
> > > > > > > > > > > > > > > > > > > > > mapping code from the CR to the native integration
> > > > > > > > > > > > > > > > > > > > > client and config needs to be created. As mentioned
> > > > > > > > > > > > > > > > > > > > > in the FLIP, native integration requires the Flink
> > > > > > > > > > > > > > > > > > > > > job manager to have access to the k8s API to create
> > > > > > > > > > > > > > > > > > > > > pods, which in some scenarios may be seen as
> > > > > > > > > > > > > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > > > > > Is the pod template in CR same with what Flink
> > > > > > > > > > > > > > > > > > > > > > > > has already supported[4]?
> > > > > > > > > > > > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > > > > > > > > > > > > > > > > > > > > > > cpu/memory resources) could take effect.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Yes, pod template would look almost identical.
> > > > > > > > > > > > > > > > > > > > > There are a few settings that the operator will
> > > > > > > > > > > > > > > > > > > > > control (and that may need to be blacklisted), but
> > > > > > > > > > > > > > > > > > > > > in general we would not want to place restrictions.
> > > > > > > > > > > > > > > > > > > > > I think a mechanism where a pod template is merged
> > > > > > > > > > > > > > > > > > > > > from multiple layers would also be interesting to
> > > > > > > > > > > > > > > > > > > > > make this more flexible.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Devs,

Yang Wang discovered that the current prototype is not compatible with Java
8; it requires Java 11 or later.

The reason is that the Java Operator SDK itself is unfortunately not Java 8
compatible.

Given that Java 8 is on the road to deprecation and that the operator runs
as a containerized deployment, are there any concerns about making Java 11
the target version?
This only affects the Kubernetes operator itself. Deployed Flink clusters
and jobs are not affected and should still work with Java 8.

Cheers,
Gyula


On Tue, Feb 15, 2022 at 1:06 PM Yang Wang <da...@gmail.com> wrote:

> I also lean towards not introducing the savepoint/checkpoint related fields
> to the job spec, especially at the very beginning of
> flink-kubernetes-operator.
>
>
> Best,
> Yang
>
> On Tue, Feb 15, 2022 at 7:02 PM Gyula Fóra <gy...@gmail.com> wrote:
>
> > Hi Peng Yuan!
> >
> > While I do agree that savepoint path is a very important production
> > configuration there are a lot of other things that come to my mind:
> >  - savepoint dir
> >  - checkpoint dir
> >  - checkpoint interval/timeout
> >  - high availability settings (provider/storagedir etc)
> >
> > just to name a few...
> >
> > While these are all production critical, they have nice clean Flink config
> > settings to go with them. If we start introducing these to the JobSpec we
> > only get confusion about priority order etc., and it is going to be hard to
> > change or remove them in the future. In any case we should validate that
> > these configs exist in cases where users use a stateful upgrade mode, for
> > example. This is something we need to add for sure.
> >
> > As for the other options you mentioned, like automatic savepoint generation
> > for instance, those deserve an independent discussion of their own I
> > believe :)
> >
> > Cheers,
> > Gyula
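
To illustrate the point in the quoted message about keeping these settings in
the Flink configuration rather than in the job spec, here is a minimal sketch
of a FlinkDeployment resource. The configuration keys are standard Flink
options. The resource layout (apiVersion, `flinkConfiguration`,
`job.upgradeMode`), the image tag, and all storage paths are assumptions made
for illustration and may not match the prototype's CRD.

```yaml
# Hypothetical FlinkDeployment; field names and paths are illustrative only.
apiVersion: flink.apache.org/v1alpha1
kind: FlinkDeployment
metadata:
  name: example-job
spec:
  image: flink:1.14.3
  flinkConfiguration:
    # Standard Flink options carry the production-critical settings.
    state.savepoints.dir: s3://example-bucket/savepoints
    state.checkpoints.dir: s3://example-bucket/checkpoints
    execution.checkpointing.interval: "60 s"
    execution.checkpointing.timeout: "10 min"
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://example-bucket/ha
  job:
    jarURI: local:///opt/flink/usrlib/example-job.jar
    # A stateful upgrade mode is the case where validating the keys above matters.
    upgradeMode: savepoint
```

With this shape there is a single source of truth for each setting, so no
priority order between job spec fields and Flink config has to be defined.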
> >
> > On Tue, Feb 15, 2022 at 11:23 AM K Fred <yu...@gmail.com> wrote:
> >
> > > Hi Matyas!
> > >
> > > Thanks for your reply!
> > > For scenarios 1 and 3, I couldn't agree more with the podTemplate
> > > solution; I missed this part.
> > > For savepoint-related configuration, I think it is very important to
> > > specify it in the JobSpec, because a savepoint is a very common setting
> > > when upgrading a job, and placing it in the JobSpec makes it obvious to
> > > the user. In addition, other advanced properties can be put into
> > > flinkConfiguration, customized by expert users.
> > > A set of savepoint configuration options could look as follows:
> > >
> > >  - fromSavepoint: savepoint to restart the job from
> > >  - autoSavepointSecond: automatically take a savepoint to the
> > >    `savepointsDir` every n seconds
> > >  - savepointsDir: directory in which to store automatically taken
> > >    savepoints
> > >  - savepointGeneration: update the savepoint generation in the job status
> > >    for a running job (should be defined in JobStatus)
> > >
> > >
> > > Best wishes,
> > > Peng Yuan.
> > >
> > > On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <matyas.orhidi@gmail.com>
> > > wrote:
> > >
> > > > Hi Peng,
> > > >
> > > > Thanks for your feedback. Regarding scenarios 1 and 3, the podTemplate
> > > > functionality in the operator could cover both. We also need to be
> > > > careful about introducing proxy parameters in the CRD spec. The
> > > > savepoint path, for example, is usually accompanied by a bunch of other
> > > > configurations, so users need to use configuration params anyway. What
> > > > do you think?
> > > >
> > > > Best,
> > > > Matyas
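
As a rough sketch of how a pod template could cover scenarios 1 and 3 from the
quoted messages: an init container downloads the JAR into a shared volume, and
the main container declares the extra metrics port. The container name
`flink-main-container` is the name Flink's native Kubernetes pod template
support uses to identify the main container. The volume name, the mount path,
and the way the operator would embed this template in the CR are assumptions.

```yaml
# Sketch of a pod template; volume name and paths are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: pod-template
spec:
  initContainers:
    - name: downloader
      image: curlimages/curl
      env:
        - name: JAR_URL
          value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
        - name: DEST_PATH
          value: /cache/flink-app.jar
      # Scenario 1: fetch the job JAR before the Flink container starts.
      command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
      volumeMounts:
        - name: jar-cache
          mountPath: /cache
  containers:
    # Flink's native pod template support identifies the main container by this name.
    - name: flink-main-container
      ports:
        # Scenario 3: expose the Prometheus reporter port.
        - name: prom
          containerPort: 9249
      volumeMounts:
        - name: jar-cache
          mountPath: /cache
  volumes:
    # Shared scratch volume so the main container can see the downloaded JAR.
    - name: jar-cache
      emptyDir: {}
```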
> > > >
> > > > On Tue, Feb 15, 2022 at 8:58 AM K Fred <yu...@gmail.com> wrote:
> > > >
> > > > > Hi Gyula!
> > > > >
> > > > > I have reviewed the prototype design of flink-kubernetes-operator you
> > > > > submitted, and I have the following questions:
> > > > >
> > > > > 1. Can support for pulling the Flink JAR via a sidecar be added to the
> > > > > JobSpec? For example:
> > > > >
> > > > > > initContainers:
> > > > > >       - name: downloader
> > > > > >         image: curlimages/curl
> > > > > >         env:
> > > > > >           - name: JAR_URL
> > > > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > > > >           - name: DEST_PATH
> > > > > >             value: /cache/flink-app.jar
> > > > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > > > >
> > > > > 2. Can we add a savepoint path property to the job specification?
> > > > > 3. Can we add an extra port to the JobManagerSpec and TaskManagerSpec to
> > > > > expose some service, such as Prometheus? The property could look like
> > > > > this:
> > > > >
> > > > > > extraPorts:
> > > > > >       - name: prom
> > > > > >         containerPort: 9249
> > > > >
> > > > >
> > > > >
> > > > > Best wishes,
> > > > > Peng Yuan
> > > > >
> > > > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gy...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hi Flink Devs!
> > > > > >
> > > > > > We would like to present to you the first prototype of the
> > > > > > flink-kubernetes-operator that was built based on the FLIP and the
> > > > > > discussion on this mail thread. We would also like to call out some
> > > > > > design decisions that we have made regarding architecture components
> > > > > > that were not explicitly mentioned in the FLIP document/thread and
> > > > > > give you the opportunity to raise any concerns here.
> > > > > >
> > > > > > You can find the initial prototype here:
> > > > > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > > > >
> > > > > > We will leave the PR open for 1-2 days before merging to let people
> > > > > > comment on it, but please be mindful that this is an initial
> > > > > > prototype with many rough edges. It is not intended to be a complete
> > > > > > implementation of the FLIP specs as that will take some more work
> > > > > > from all of us :)
> > > > > >
> > > > > >
> > > > > > *Prototype feature set:*
> > > > > > The prototype contains a basic working version of the
> > > > > > flink-kubernetes-operator that supports deployment and lifecycle
> > > > > > management of a stateful native Flink application. We have basic
> > > > > > support for stateful and stateless upgrades, UI ingress, pod
> > > > > > templates etc. Error handling at this point is largely missing.
> > > > > >
> > > > > >
> > > > > > *Features / design decisions that were not explicitly discussed in
> > > > > > this thread*
> > > > > >
> > > > > > *Basic Admission control using a Webhook*
> > > > > > Standard resource admission control in Kubernetes to validate and
> > > > > > potentially reject resources is done through Webhooks.
> > > > > > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > > > This is a necessary mechanism to give the user an upfront error when
> > > > > > an incorrect resource was submitted. In the Flink operator's case we
> > > > > > need to validate that the FlinkDeployment YAML actually makes sense
> > > > > > and does not contain erroneous config options that would inevitably
> > > > > > lead to deployment/job failures.
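
A minimal sketch of how such a validating webhook might be registered with the
Kubernetes API server. The `admissionregistration.k8s.io/v1` fields are the
standard Kubernetes ones. The webhook name, service name, namespace, and path
are assumptions for illustration and are not taken from the prototype.

```yaml
# Hypothetical registration of a validating webhook for FlinkDeployment resources.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: flink-operator-validation              # assumed name
webhooks:
  - name: validate.flinkdeployment.flink.apache.org   # assumed name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    # Reject the resource if the webhook cannot be reached.
    failurePolicy: Fail
    clientConfig:
      service:
        name: flink-operator-webhook-service   # assumed service
        namespace: default                     # assumed namespace
        path: /validate                        # assumed endpoint path
      # caBundle omitted; the API server needs it to trust the webhook's TLS certificate.
    rules:
      # Only FlinkDeployment creations and updates are sent for validation.
      - apiGroups: ["flink.apache.org"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["flinkdeployments"]
```

With a registration like this, an invalid FlinkDeployment is rejected at
`kubectl apply` time instead of failing later during deployment.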
> > > > > >
> > > > > > We have implemented a simple webhook that we can use for this type
> > > > > > of validation, as a separate Maven module
> > > > > > (flink-kubernetes-webhook). The webhook is an optional component and
> > > > > > can be enabled or disabled during deployment. To avoid pulling in
> > > > > > new external dependencies we have used the Flink Shaded Netty module
> > > > > > to build the simple REST endpoint required. If the community feels
> > > > > > that Netty adds unnecessary complexity to the webhook implementation
> > > > > > we are open to alternative backends such as Spring Boot, for
> > > > > > instance, which would practically eliminate all the boilerplate.
> > > > > >
> > > > > >
> > > > > > *Helm Chart for deployment*
> > > > > > Helm charts provide an industry standard way of managing Kubernetes
> > > > > > deployments. We have created a Helm chart prototype that can be used
> > > > > > to deploy the operator together with all required resources. The
> > > > > > Helm chart allows easy configuration for things like images,
> > > > > > namespaces etc. and flags to control specific parts of the
> > > > > > deployment such as RBAC or the webhook.
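
As a purely illustrative sketch of the kind of knobs described in the quoted
paragraph, a values file for such a chart could look like the following. Every
key name and value here is an assumption, not the actual schema of the
prototype's chart.

```yaml
# Hypothetical values.yaml; key names are assumptions, not the chart's real schema.
image:
  repository: example.org/flink-kubernetes-operator
  tag: latest
# Namespace the operator itself runs in.
operatorNamespace: flink-operator
# Namespaces in which the operator manages Flink resources.
watchNamespaces:
  - default
rbac:
  # Toggle creation of the service account, roles, and bindings.
  create: true
webhook:
  # Toggle deployment of the optional validating webhook.
  create: true
```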
> > > > > >
> > > > > > The Helm chart provided is intended to be a first version that
> > > > > > worked for us during development but we expect to have a lot of
> > > > > > iterations on it based on the feedback from the community.
> > > > > >
> > > > > > *Acknowledgment*
> > > > > > We would like to thank everyone who has provided support and
> > > > > > valuable feedback on this FLIP.
> > > > > > We would also like to thank Yang Wang & Alexis Sarda-Espinosa
> > > > > > specifically for making their operators open source and available to
> > > > > > us which had a big impact on the FLIP and the prototype.
> > > > > >
> > > > > > We are looking forward to continuing development on the operator
> > > > > > together with the broader community.
> > > > > > All work will be tracked using the ASF Jira from now on.
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <yu...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Gyula,
> > > > > > >
> > > > > > > Thanks!
> > > > > > > It's great to see the project getting started and I can't wait to
> > > > > > > see the PR and start contributing code.😄😄😄
> > > > > > >
> > > > > > > Best Wishes!
> > > > > > > Peng Yuan
> > > > > > >
> > > > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <gyula.fora@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Peng Yuan!
> > > > > > > >
> > > > > > > > The repo is already created:
> > > > > > > > https://github.com/apache/flink-kubernetes-operator
> > > > > > > >
> > > > > > > > We will open the PR with the initial prototype later today, stay
> > > > > > > > tuned in this thread! :)
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <yuanpengfred@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > > Has the flink-kubernetes-operator project been created on
> > > > > > > > > GitHub?
> > > > > > > > >
> > > > > > > > > Peng Yuan
> > > > > > > > >
> > > > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <gyula.fora@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > I agree with flink-kubernetes-operator as the repo name :)
> > > > > > > > > > Don't have any better idea
> > > > > > > > > >
> > > > > > > > > > Gyula
> > > > > > > > > >
> > > > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <thw@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the continued feedback and discussion. Looks like
> > > > > > > > > > > we are ready to start a VOTE, I will initiate it shortly.
> > > > > > > > > > >
> > > > > > > > > > > In parallel it would be good to find the repository name.
> > > > > > > > > > >
> > > > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > > > >
> > > > > > > > > > > I thought "flink-operator" could be a bit misleading since
> > > > > > > > > > > the term operator already has a meaning in Flink.
> > > > > > > > > > >
> > > > > > > > > > > I also considered "flink-k8s-operator" but that would be
> > > > > > > > > > > almost identical to existing operator implementations and
> > > > > > > > > > > could lead to confusion in the future.
> > > > > > > > > > >
> > > > > > > > > > > Thoughts?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Thomas
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra
> > > > > > > > > > > <gyula.fora@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Danny,
> > > > > > > > > > > > >
> > > > > > > > > > > > > So far we have been focusing our dev efforts on the
> > > > > > > > > > > > > initial native implementation with the team.
> > > > > > > > > > > > > If the discussion and vote goes well for this FLIP we
> > > > > > > > > > > > > are looking forward to contributing the initial version
> > > > > > > > > > > > > sometime next week (fingers crossed).
> > > > > > > > > > > > >
> > > > > > > > > > > > > At that point I think we can already start the dev work
> > > > > > > > > > > > > to support the standalone mode as well, especially if
> > > > > > > > > > > > > you can dedicate some effort to pushing that side.
> > > > > > > > > > > > > Working together on this sounds like a great idea and we
> > > > > > > > > > > > > should start as soon as possible! :)
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Gyula
> > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer
> > > > > > > > > > > > > <dannycranmer@apache.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I have been discussing this one with my team. We are
> > > > > > > > > > > > > > interested in the Standalone mode, and are willing to
> > > > > > > > > > > > > > contribute towards the implementation. Potentially we
> > > > > > > > > > > > > > can work together to support both modes in parallel?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra
> > > > > > > > > > > > > > <gyula.fora@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Versioning:
> > > > > > > > > > > > > > > Versioning will be independent from Flink and the
> > > > > > > > > > > > > > > operator will depend on a fixed Flink version (in
> > > > > > > > > > > > > > > every given operator version).
> > > > > > > > > > > > > > > This should be the exact same setup as with Stateful
> > > > > > > > > > > > > > > Functions (https://github.com/apache/flink-statefun).
> > > > > > > > > > > > > > > So independent release cycle but still within the
> > > > > > > > > > > > > > > Flink umbrella.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > > > > I think that's a very good point, as general
> > > > > > > > > > > > > > > exception handling for the different failure
> > > > > > > > > > > > > > > scenarios is a tricky problem. I think the exception
> > > > > > > > > > > > > > > classifiers and retry strategies could avoid a lot of
> > > > > > > > > > > > > > > manual intervention from the user. We will definitely
> > > > > > > > > > > > > > > need to add something like this. Once we have the
> > > > > > > > > > > > > > > repo created with the initial operator code we should
> > > > > > > > > > > > > > > open some tickets for this and put it on the short
> > > > > > > > > > > > > > > term roadmap!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > Gyula
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Great work on the FLIP, I am looking forward to
> > > this
> > > > > > one. I
> > > > > > > > > agree
> > > > > > > > > > > that
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I have general feedback around how we will
> handle
> > > job
> > > > > > > > > submission
> > > > > > > > > > > > > failure
> > > > > > > > > > > > > > > and retry. As discussed in the Rejected
> > > Alternatives
> > > > > > > section,
> > > > > > > > > we
> > > > > > > > > > > can
> > > > > > > > > > > > > use
> > > > > > > > > > > > > > > Java to handle job submission failures from the
> > > Flink
> > > > > > > client.
> > > > > > > > > It
> > > > > > > > > > > would
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > useful to have the ability to configure
> exception
> > > > > > > classifiers
> > > > > > > > > and
> > > > > > > > > > > retry
> > > > > > > > > > > > > > > strategy as part of operator configuration.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Given this will be in a separate Github
> > repository
> > > I
> > > > am
> > > > > > > > curious
> > > > > > > > > > how
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > versioning strategy will work in relation to
> the
> > > > Flink
> > > > > > > > version?
> > > > > > > > > > Do
> > > > > > > > > > > we
> > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > any other components with a similar setup I can
> > > look
> > > > > at?
> > > > > > > Will
> > > > > > > > > the
> > > > > > > > > > > > > > operator
> > > > > > > > > > > > > > > version track Flink or will it use its own
> > > versioning
> > > > > > > > strategy
> > > > > > > > > > > with a
> > > > > > > > > > > > > > Flink
> > > > > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you for the great feedback, Thomas has
> > > > updated
> > > > > > the
> > > > > > > > FLIP
> > > > > > > > > > > page
> > > > > > > > > > > > > > > > accordingly. If you are comfortable with the
> > > > > currently
> > > > > > > > > existing
> > > > > > > > > > > > > design
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > depth in the FLIP [1] I suggest moving
> forward
> > to
> > > > the
> > > > > > > > voting
> > > > > > > > > > > stage -
> > > > > > > > > > > > > > once
> > > > > > > > > > > > > > > > that reaches a positive conclusion it lets us
> > > > create
> > > > > > the
> > > > > > > > > > separate
> > > > > > > > > > > > > code
> > > > > > > > > > > > > > > > repository under the flink project for the
> > > > operator.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I encourage everyone to keep improving the
> > > details
> > > > in
> > > > > > the
> > > > > > > > > > > meantime,
> > > > > > > > > > > > > > > however
> > > > > > > > > > > > > > > > I believe given the existing design and the
> > > general
> > > > > > > > sentiment
> > > > > > > > > > on
> > > > > > > > > > > this
> > > > > > > > > > > > > > > > thread that the most efficient path from here
> > is
> > > > > > starting
> > > > > > > > the
> > > > > > > > > > > > > > > > implementation so that we can collectively
> > > iterate
> > > > > over
> > > > > > > it.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas
> Weise <
> > > > > > > > > thw@apache.org>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the feedback and please see
> > > responses
> > > > > > below
> > > > > > > > -->
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong
> > Song <
> > > > > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and
> > > > > everyone
> > > > > > > for
> > > > > > > > > the
> > > > > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > > > Deploying a Flink session cluster via
> > > kubectl &
> > > > > CR
> > > > > > > and
> > > > > > > > > then
> > > > > > > > > > > > > > > submitting
> > > > > > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > > > > to the cluster via Flink cli / REST is
> > > probably
> > > > > the
> > > > > > > > > > approach
> > > > > > > > > > > that
> > > > > > > > > > > > > > > > > requires
> > > > > > > > > > > > > > > > > > the least effort. However, I'd like to
> > point
> > > > out
> > > > > 2
> > > > > > > > > > > weaknesses.
> > > > > > > > > > > > > > > > > > 1. A lot of users use Flink in
> > > > perjob/application
> > > > > > > > modes.
> > > > > > > > > > For
> > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > users,
> > > > > > > > > > > > > > > > > > having to run the job in two steps
> (deploy
> > > the
> > > > > > > cluster,
> > > > > > > > > and
> > > > > > > > > > > > > submit
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > job)
> > > > > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > > > > 2. One of our motivations is being able
> to
> > > > manage
> > > > > > > Flink
> > > > > > > > > > > > > > applications'
> > > > > > > > > > > > > > > > > > lifecycles with kubectl. Submitting jobs
> > from
> > > > cli
> > > > > > > > sounds
> > > > > > > > > > not
> > > > > > > > > > > > > > aligned
> > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > > > > I think it's probably worth it to support
> > > > > > submitting
> > > > > > > > jobs
> > > > > > > > > > via
> > > > > > > > > > > > > > > kubectl &
> > > > > > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > > > > in the first version, both together with
> > > > > deploying
> > > > > > > the
> > > > > > > > > > > cluster
> > > > > > > > > > > > > like
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > perjob/application mode and after
> deploying
> > > the
> > > > > > > cluster
> > > > > > > > > > like
> > > > > > > > > > > in
> > > > > > > > > > > > > > > session
> > > > > > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The intention is to support application
> > > > management
> > > > > > > > through
> > > > > > > > > > > operator
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > CR,
> > > > > > > > > > > > > > > > > which means there won't be any 2 step
> > > submission
> > > > > > > process,
> > > > > > > > > > > which as
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > allude to would defeat the purpose of this
> > > > project.
> > > > > > The
> > > > > > > > CR
> > > > > > > > > > > example
> > > > > > > > > > > > > > > shows
> > > > > > > > > > > > > > > > > the application part. Please note that the
> > bare
> > > > > > cluster
> > > > > > > > > > > support is
> > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > *additional* feature for scenarios that
> > require
> > > > > > > external
> > > > > > > > > job
> > > > > > > > > > > > > > > management.
> > > > > > > > > > > > > > > > Is
> > > > > > > > > > > > > > > > > there anything on the FLIP page that
> creates
> > a
> > > > > > > different
> > > > > > > > > > > > > impression?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > > > Which Flink versions does the operator
> plan
> > > to
> > > > > > > support?
> > > > > > > > > > > > > > > > > > 1. Native K8s deployment was firstly
> > > introduced
> > > > > in
> > > > > > > > Flink
> > > > > > > > > > 1.10
> > > > > > > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink
> > 1.12
> > > > > > > > > > > > > > > > > > 3. The Pod template support was
> introduced
> > in
> > > > > Flink
> > > > > > > > 1.13
> > > > > > > > > > > > > > > > > > 4. There was some changes to the Flink
> > docker
> > > > > image
> > > > > > > > > > > entrypoint
> > > > > > > > > > > > > > script
> > > > > > > > > > > > > > > > in,
> > > > > > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Great, thanks for providing this. It is
> > > important
> > > > > for
> > > > > > > the
> > > > > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > > going forward also. We are targeting Flink
> > > 1.14.x
> > > > > > > > upwards.
> > > > > > > > > > > Before
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > operator is ready there will be another
> Flink
> > > > > > release.
> > > > > > > > > Let's
> > > > > > > > > > > see if
> > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > > What kind of API compatibility we can
> > commit
> > > > to?
> > > > > > It's
> > > > > > > > > > > probably
> > > > > > > > > > > > > fine
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > alpha / beta version APIs that allow
> > > > incompatible
> > > > > > > > future
> > > > > > > > > > > changes
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > first version. But eventually we would
> need
> > > to
> > > > > > > > guarantee
> > > > > > > > > > > > > backwards
> > > > > > > > > > > > > > > > > > compatibility, so that an early version
> CR
> > > can
> > > > > work
> > > > > > > > with
> > > > > > > > > a
> > > > > > > > > > > new
> > > > > > > > > > > > > > > version
> > > > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Another great point and please let me
> include
> > > > that
> > > > > on
> > > > > > > the
> > > > > > > > > > FLIP
> > > > > > > > > > > > > page.
> > > > > > > > > > > > > > > ;-)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think we should allow incompatible
> changes
> > > for
> > > > > the
> > > > > > > > first
> > > > > > > > > > one
> > > > > > > > > > > or
> > > > > > > > > > > > > two
> > > > > > > > > > > > > > > > > versions, similar to how other major
> features
> > > > have
> > > > > > > > evolved
> > > > > > > > > > > > > recently,
> > > > > > > > > > > > > > > such
> > > > > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Would be great to get broader feedback on
> > this
> > > > one.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas
> > Weise
> > > <
> > > > > > > > > > thw@apache.org
> > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone
> > > integration
> > > > > > > > > > > > > > > > > > > > Maybe we should make this more clear
> in
> > > the
> > > > > > FLIP
> > > > > > > > but
> > > > > > > > > we
> > > > > > > > > > > > > agreed
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > first version of the operator based
> on
> > > the
> > > > > > native
> > > > > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > > > > While this clearly does not cover all
> > > > > use-cases
> > > > > > > and
> > > > > > > > > > > > > > requirements,
> > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > > > > this would lead to a much smaller
> > initial
> > > > > > effort
> > > > > > > > and
> > > > > > > > > a
> > > > > > > > > > > nicer
> > > > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I'm also leaning towards the native
> > > > > integration,
> > > > > > as
> > > > > > > > > long
> > > > > > > > > > > as it
> > > > > > > > > > > > > > > > reduces
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > MVP effort. Ultimately the operator
> will
> > > need
> > > > > to
> > > > > > > also
> > > > > > > > > > > support
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > standalone mode. I would like to gain
> > more
> > > > > > > confidence
> > > > > > > > > > that
> > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > integration reduces the effort. While
> it
> > > cuts
> > > > > the
> > > > > > > > > effort
> > > > > > > > > > to
> > > > > > > > > > > > > > handle
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > TM
> > > > > > > > > > > > > > > > > > > pod creation, some mapping code from
> the
> > CR
> > > > to
> > > > > > the
> > > > > > > > > native
> > > > > > > > > > > > > > > integration
> > > > > > > > > > > > > > > > > > > client and config needs to be created.
> As
> > > > > > mentioned
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > > FLIP,
> > > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > > integration requires the Flink job
> > manager
> > > to
> > > > > > have
> > > > > > > > > access
> > > > > > > > > > > to
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > k8s
> > > > > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > create pods, which in some scenarios
> may
> > be
> > > > > seen
> > > > > > as
> > > > > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > > > Is the pod template in CR same
> with
> > > > what
> > > > > > > Flink
> > > > > > > > > has
> > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > > > > > > > Then I am afraid not the
> arbitrary
> > > > > > field(e.g.
> > > > > > > > > > > cpu/memory
> > > > > > > > > > > > > > > > > resources)
> > > > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Yes, pod template would look almost
> > > > identical.
> > > > > > > There
> > > > > > > > > are
> > > > > > > > > > a
> > > > > > > > > > > few
> > > > > > > > > > > > > > > > settings
> > > > > > > > > > > > > > > > > > > that the operator will control (and
> that
> > > may
> > > > > need
> > > > > > > to
> > > > > > > > be
> > > > > > > > > > > > > > > blacklisted),
> > > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > > in general we would not want to place
> > > > > > > restrictions. I
> > > > > > > > > > > think a
> > > > > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > > > > where a pod template is merged from
> > > multiple
> > > > > > layers
> > > > > > > > > would
> > > > > > > > > > > also
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > > >

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Yang Wang <da...@gmail.com>.
I also lean toward not introducing the savepoint/checkpoint-related fields
into the job spec, especially at the very beginning of
flink-kubernetes-operator.
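
For illustration, the settings Gyula lists in the quoted message below
already have standard Flink configuration keys, so they could be carried in
the plain Flink configuration instead of dedicated job spec fields. A minimal
sketch, assuming Flink 1.14 key names; the bucket paths and the
interval/timeout values are placeholders, not taken from the prototype:

```yaml
# Placeholder values (assumptions); only the key names are standard Flink options.
state.savepoints.dir: s3://my-bucket/savepoints
state.checkpoints.dir: s3://my-bucket/checkpoints
execution.checkpointing.interval: 60s
execution.checkpointing.timeout: 10min
# Kubernetes HA services factory plus the storage directory it needs.
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://my-bucket/ha
```

Whether such keys would sit under the flinkConfiguration section that Peng
Yuan mentions, and how the operator validates them, is still open in this
thread.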


Best,
Yang

On Tue, Feb 15, 2022 at 7:02 PM Gyula Fóra <gy...@gmail.com> wrote:

> Hi Peng Yuan!
>
> While I do agree that savepoint path is a very important production
> configuration there are a lot of other things that come to my mind:
>  - savepoint dir
>  - checkpoint dir
>  - checkpoint interval/timeout
>  - high availability settings (provider/storagedir etc)
>
> just to name a few...
>
> While these are all production critical, they have nice clean Flink config
> settings to go with them. If we start introducing these to the job spec, we
> only get confusion about priority order etc., and it is going to be hard to
> change or remove them in the future. In any case, we should validate that
> these configs exist in cases where users use a stateful upgrade mode, for
> example. This is something we need to add for sure.
>
> As for the other options you mentioned like automatic savepoint generation
> for instance, those deserve an independent discussion of their own I
> believe :)
>
> Cheers,
> Gyula
>
> On Tue, Feb 15, 2022 at 11:23 AM K Fred <yu...@gmail.com> wrote:
>
> > Hi Matyas!
> >
> > Thanks for your reply!
> > For scenarios 1 and 3, I couldn't agree more with the podTemplate
> > solution; I missed this part.
> > For savepoint-related configuration, I think it's very important that it
> > can be specified in JobSpec, because a savepoint is a very common
> > configuration for upgrading a job; if it is placed in JobSpec, it is
> > obvious to the user how to configure it. In addition, other advanced
> > properties can be put into flinkConfiguration and customized by expert
> > users.
> > A set of savepoint configuration options could look as follows:
> >
> > fromSavepoint: the savepoint the job restarts from.
> >
> > autoSavepointSecond: automatically take a savepoint to the
> > `savepointsDir` every n seconds.
> >
> > savepointsDir: the directory where automatically taken savepoints are
> > stored.
> >
> > savepointGeneration: update the savepoint generation in the job status
> > for a running job (should be defined in JobStatus).
> >
> >
> > Best wishes,
> > Peng Yuan.
> >
> > On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <ma...@gmail.com>
> > wrote:
> >
> > > Hi Peng,
> > >
> > > Thanks for your feedback. Regarding scenarios 1 and 3, the podTemplate
> > > functionality in the operator could cover both. We also need to be
> > > careful about introducing proxy parameters in the CRD spec. The
> > > savepoint path, for example, is usually accompanied by a bunch of other
> > > configurations, so users need to use configuration params anyway. What
> > > do you think?
> > >
> > > Best,
> > > Matyas
> > >
> > > On Tue, Feb 15, 2022 at 8:58 AM K Fred <yu...@gmail.com> wrote:
> > >
> > > > Hi Gyula!
> > > >
> > > > I have reviewed the prototype design of flink-kubernetes-operator you
> > > > submitted, and I have the following questions:
> > > >
> > > > 1. Can support for pulling the Flink JAR package via a sidecar be
> > > > added to the JobSpec? Something like this:
> > > >
> > > > > initContainers:
> > > > >       - name: downloader
> > > > >         image: curlimages/curl
> > > > >         env:
> > > > >           - name: JAR_URL
> > > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > > >           - name: DEST_PATH
> > > > >             value: /cache/flink-app.jar
> > > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > > >
> > > > 2. Can we add a savepoint path property to the job specification?
> > > > 3. Can we add an extra port to the JobManagerSpec and TaskManagerSpec
> > > > to expose some service, such as Prometheus? The property could look
> > > > like this:
> > > >
> > > > > extraPorts:
> > > > >       - name: prom
> > > > >         containerPort: 9249
> > > >
> > > >
> > > >
> > > > Best wishes,
> > > > Peng Yuan
> > > >
> > > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gy...@apache.org>
> wrote:
> > > >
> > > > > Hi Flink Devs!
> > > > >
> > > > > > We would like to present to you the first prototype of the
> > > > > > flink-kubernetes-operator that was built based on the FLIP and the
> > > > > > discussion on this mail thread. We would also like to call out some
> > > > > > design decisions that we have made regarding architecture components
> > > > > > that were not explicitly mentioned in the FLIP document/thread and
> > > > > > give you the opportunity to raise any concerns here.
> > > > >
> > > > > You can find the initial prototype here:
> > > > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > > >
> > > > > > We will leave the PR open for 1-2 days before merging to let people
> > > > > > comment on it, but please be mindful that this is an initial
> > > > > > prototype with many rough edges. It is not intended to be a complete
> > > > > > implementation of the FLIP specs as that will take some more work
> > > > > > from all of us :)
> > > > >
> > > > >
> > > > > > *Prototype feature set:*
> > > > > > The prototype contains a basic working version of the
> > > > > > flink-kubernetes-operator that supports deployment and lifecycle
> > > > > > management of a stateful native Flink application. We have basic
> > > > > > support for stateful and stateless upgrades, UI ingress, pod
> > > > > > templates etc. Error handling at this point is largely missing.
> > > > >
> > > > >
> > > > > > *Features / design decisions that were not explicitly discussed in
> > > > > > this thread*
> > > > > >
> > > > > > *Basic admission control using a webhook*
> > > > > > Standard resource admission control in Kubernetes to validate and
> > > > > > potentially reject resources is done through webhooks.
> > > > > >
> > > > > > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > > > This is a necessary mechanism to give the user an upfront error when
> > > > > > an incorrect resource is submitted. In the Flink operator's case we
> > > > > > need to validate that the FlinkDeployment YAML actually makes sense
> > > > > > and does not contain erroneous config options that would inevitably
> > > > > > lead to deployment/job failures.
> > > > >
> > > > > > We have implemented a simple webhook that we can use for this type
> > > > > > of validation, as a separate Maven module
> > > > > > (flink-kubernetes-webhook). The webhook is an optional component and
> > > > > > can be enabled or disabled during deployment. To avoid pulling in
> > > > > > new external dependencies we have used the Flink Shaded Netty module
> > > > > > to build the simple REST endpoint required. If the community feels
> > > > > > that Netty adds unnecessary complexity to the webhook implementation
> > > > > > we are open to alternative backends, such as Spring Boot, which
> > > > > > would practically eliminate all the boilerplate.
> > > > >
> > > > >
> > > > > > *Helm chart for deployment*
> > > > > > Helm charts provide an industry-standard way of managing Kubernetes
> > > > > > deployments. We have created a Helm chart prototype that can be used
> > > > > > to deploy the operator together with all required resources. The
> > > > > > Helm chart allows easy configuration of things like images,
> > > > > > namespaces etc. and provides flags to control specific parts of the
> > > > > > deployment, such as RBAC or the webhook.
> > > > >
> > > > > > The Helm chart provided is intended to be a first version that
> > > > > > worked for us during development, but we expect to have a lot of
> > > > > > iterations on it based on the feedback from the community.
> > > > >
> > > > > *Acknowledgment*
> > > > > > We would like to thank everyone who has provided support and
> > > > > > valuable feedback on this FLIP.
> > > > > > We would also like to thank Yang Wang & Alexis Sarda-Espinosa
> > > > > > specifically for making their operators open source and available to
> > > > > > us, which had a big impact on the FLIP and the prototype.
> > > > > >
> > > > > > We are looking forward to continuing development on the operator
> > > > > > together with the broader community.
> > > > > > All work will be tracked using the ASF Jira from now on.
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <yu...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi Gyula,
> > > > > >
> > > > > > Thanks!
> > > > > > > It's great to see the project getting started and I can't wait to
> > > > > > > see the PR and start contributing code.😄😄😄
> > > > > >
> > > > > > Best Wishes!
> > > > > > Peng Yuan
> > > > > >
> > > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <gyula.fora@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Peng Yuan!
> > > > > > >
> > > > > > > The repo is already created:
> > > > > > > https://github.com/apache/flink-kubernetes-operator
> > > > > > >
> > > > > > > > We will open the PR with the initial prototype later today, stay
> > > > > > > > tuned in this thread! :)
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <yuanpengfred@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > > Has the flink-kubernetes-operator project been created on
> > > > > > > > > GitHub?
> > > > > > > >
> > > > > > > > Peng Yuan
> > > > > > > >
> > > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <
> > gyula.fora@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > I agree with flink-kubernetes-operator as the repo name :)
> > > > > > > > > Don't have any better idea
> > > > > > > > >
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <
> thw@apache.org>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > > Thanks for the continued feedback and discussion. Looks
> > > > > > > > > > > like we are ready to start a VOTE, I will initiate it
> > > > > > > > > > > shortly.
> > > > > > > > > >
> > > > > > > > > > In parallel it would be good to find the repository name.
> > > > > > > > > >
> > > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > > >
> > > > > > > > > > > I thought "flink-operator" could be a bit misleading since
> > > > > > > > > > > the term operator already has a meaning in Flink.
> > > > > > > > > > >
> > > > > > > > > > > I also considered "flink-k8s-operator" but that would be
> > > > > > > > > > > almost identical to existing operator implementations and
> > > > > > > > > > > could lead to confusion in the future.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Thomas
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <
> > > > gyula.fora@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Danny,
> > > > > > > > > > >
> > > > > > > > > > > > So far we have been focusing our dev efforts on the
> > > > > > > > > > > > initial native implementation with the team.
> > > > > > > > > > > > If the discussion and vote go well for this FLIP we are
> > > > > > > > > > > > looking forward to contributing the initial version
> > > > > > > > > > > > sometime next week (fingers crossed).
> > > > > > > > > > >
> > > > > > > > > > > > At that point I think we can already start the dev work
> > > > > > > > > > > > to support the standalone mode as well, especially if you
> > > > > > > > > > > > can dedicate some effort to pushing that side.
> > > > > > > > > > > > Working together on this sounds like a great idea and we
> > > > > > > > > > > > should start as soon as possible! :)
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Gyula
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I have been discussing this one with my team. We are
> > > > > interested
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > > Standalone mode, and are willing to contribute
> towards
> > > the
> > > > > > > > > > implementation.
> > > > > > > > > > > > Potentially we can work together to support both
> modes
> > in
> > > > > > > parallel?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > > > > > gyula.fora@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Versioning:
> > > > > > > > > > > > > Versioning will be independent from Flink and the
> > > > operator
> > > > > > will
> > > > > > > > > > depend
> > > > > > > > > > > > on a
> > > > > > > > > > > > > fixed flink version (in every given operator
> > version).
> > > > > > > > > > > > > This should be the exact same setup as with
> Stateful
> > > > > > Functions
> > > > > > > (
> > > > > > > > > > > > > https://github.com/apache/flink-statefun). So
> > > > independent
> > > > > > > > release
> > > > > > > > > > cycle
> > > > > > > > > > > > > but
> > > > > > > > > > > > > still within the Flink umbrella.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > > I think that's a very good point, as general
> > exception
> > > > > > handling
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > > different failure scenarios is a tricky problem. I
> > > think
> > > > > the
> > > > > > > > > > exception
> > > > > > > > > > > > > classifiers and retry strategies could avoid a lot
> of
> > > > > manual
> > > > > > > > > > intervention
> > > > > > > > > > > > > from the user. We will definitely need to add
> > something
> > > > > like
> > > > > > > > this.
> > > > > > > > > > Once
> > > > > > > > > > > > we
> > > > > > > > > > > > > have the repo created with the initial operator
> code
> > we
> > > > > > should
> > > > > > > > open
> > > > > > > > > > some
> > > > > > > > > > > > > tickets for this and put it on the short term
> > roadmap!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Gyula
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Great work on the FLIP, I am looking forward to
> > this
> > > > > one. I
> > > > > > > > agree
> > > > > > > > > > that
> > > > > > > > > > > > we
> > > > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have general feedback around how we will handle
> > job
> > > > > > > > submission
> > > > > > > > > > > > failure
> > > > > > > > > > > > > > and retry. As discussed in the Rejected
> > Alternatives
> > > > > > section,
> > > > > > > > we
> > > > > > > > > > can
> > > > > > > > > > > > use
> > > > > > > > > > > > > > Java to handle job submission failures from the
> > Flink
> > > > > > client.
> > > > > > > > It
> > > > > > > > > > would
> > > > > > > > > > > > be
> > > > > > > > > > > > > > useful to have the ability to configure exception
> > > > > > classifiers
> > > > > > > > and
> > > > > > > > > > retry
> > > > > > > > > > > > > > strategy as part of operator configuration.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Given this will be in a separate Github
> repository
> > I
> > > am
> > > > > > > curious
> > > > > > > > > how
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > versioning strategy will work in relation to the
> > > Flink
> > > > > > > version?
> > > > > > > > > Do
> > > > > > > > > > we
> > > > > > > > > > > > > have
> > > > > > > > > > > > > > any other components with a similar setup I can
> > look
> > > > at?
> > > > > > Will
> > > > > > > > the
> > > > > > > > > > > > > operator
> > > > > > > > > > > > > > version track Flink or will it use its own
> > versioning
> > > > > > > strategy
> > > > > > > > > > with a
> > > > > > > > > > > > > Flink
> > > > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you for the great feedback, Thomas has
> > > updated
> > > > > the
> > > > > > > FLIP
> > > > > > > > > > page
> > > > > > > > > > > > > > > accordingly. If you are comfortable with the
> > > > currently
> > > > > > > > existing
> > > > > > > > > > > > design
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > depth in the FLIP [1] I suggest moving forward
> to
> > > the
> > > > > > > voting
> > > > > > > > > > stage -
> > > > > > > > > > > > > once
> > > > > > > > > > > > > > > that reaches a positive conclusion it lets us
> > > create
> > > > > the
> > > > > > > > > separate
> > > > > > > > > > > > code
> > > > > > > > > > > > > > > repository under the flink project for the
> > > operator.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I encourage everyone to keep improving the
> > details
> > > in
> > > > > the
> > > > > > > > > > meantime,
> > > > > > > > > > > > > > however
> > > > > > > > > > > > > > > I believe given the existing design and the
> > general
> > > > > > > sentiment
> > > > > > > > > on
> > > > > > > > > > this
> > > > > > > > > > > > > > > thread that the most efficient path from here
> is
> > > > > starting
> > > > > > > the
> > > > > > > > > > > > > > > implementation so that we can collectively
> > iterate
> > > > over
> > > > > > it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > > > > > > > thw@apache.org>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the feedback and please see
> > responses
> > > > > below
> > > > > > > -->
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong
> Song <
> > > > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and
> > > > everyone
> > > > > > for
> > > > > > > > the
> > > > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > > Deploying a Flink session cluster via
> > kubectl &
> > > > CR
> > > > > > and
> > > > > > > > then
> > > > > > > > > > > > > > submitting
> > > > > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > > > to the cluster via Flink cli / REST is
> > probably
> > > > the
> > > > > > > > > approach
> > > > > > > > > > that
> > > > > > > > > > > > > > > > requires
> > > > > > > > > > > > > > > > > the least effort. However, I'd like to
> point
> > > out
> > > > 2
> > > > > > > > > > weaknesses.
> > > > > > > > > > > > > > > > > 1. A lot of users use Flink in
> > > perjob/application
> > > > > > > modes.
> > > > > > > > > For
> > > > > > > > > > > > these
> > > > > > > > > > > > > > > users,
> > > > > > > > > > > > > > > > > having to run the job in two steps (deploy
> > the
> > > > > > cluster,
> > > > > > > > and
> > > > > > > > > > > > submit
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > job)
> > > > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > > > 2. One of our motivations is being able to
> > > manage
> > > > > > Flink
> > > > > > > > > > > > > applications'
> > > > > > > > > > > > > > > > > lifecycles with kubectl. Submitting jobs
> from
> > > cli
> > > > > > > sounds
> > > > > > > > > not
> > > > > > > > > > > > > aligned
> > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > > > I think it's probably worth it to support
> > > > > submitting
> > > > > > > jobs
> > > > > > > > > via
> > > > > > > > > > > > > > kubectl &
> > > > > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > > > in the first version, both together with
> > > > deploying
> > > > > > the
> > > > > > > > > > cluster
> > > > > > > > > > > > like
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > perjob/application mode and after deploying
> > the
> > > > > > cluster
> > > > > > > > > like
> > > > > > > > > > in
> > > > > > > > > > > > > > session
> > > > > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The intention is to support application
> > > management
> > > > > > > through
> > > > > > > > > > operator
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > CR,
> > > > > > > > > > > > > > > > which means there won't be any 2 step
> > submission
> > > > > > process,
> > > > > > > > > > which as
> > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > allude to would defeat the purpose of this
> > > project.
> > > > > The
> > > > > > > CR
> > > > > > > > > > example
> > > > > > > > > > > > > > shows
> > > > > > > > > > > > > > > > the application part. Please note that the
> bare
> > > > > cluster
> > > > > > > > > > support is
> > > > > > > > > > > > an
> > > > > > > > > > > > > > > > *additional* feature for scenarios that
> require
> > > > > > external
> > > > > > > > job
> > > > > > > > > > > > > > management.
> > > > > > > > > > > > > > > Is
> > > > > > > > > > > > > > > > there anything on the FLIP page that creates
> a
> > > > > > different
> > > > > > > > > > > > impression?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > > Which Flink versions does the operator plan
> > to
> > > > > > support?
> > > > > > > > > > > > > > > > > > 1. Native K8s deployment was first
> > introduced
> > > > in
> > > > > > > Flink
> > > > > > > > > 1.10
> > > > > > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink
> 1.12
> > > > > > > > > > > > > > > > > 3. The Pod template support was introduced
> in
> > > > Flink
> > > > > > > 1.13
> > > > > > > > > > > > > > > > > > 4. There were some changes to the Flink
> docker
> > > > image
> > > > > > > > > > entrypoint
> > > > > > > > > > > > > script
> > > > > > > > > > > > > > > in,
> > > > > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Great, thanks for providing this. It is
> > important
> > > > for
> > > > > > the
> > > > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > going forward also. We are targeting Flink
> > 1.14.x
> > > > > > > upwards.
> > > > > > > > > > Before
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > operator is ready there will be another Flink
> > > > > release.
> > > > > > > > Let's
> > > > > > > > > > see if
> > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > > What kind of API compatibility can we
> commit
> > > to?
> > > > > It's
> > > > > > > > > > probably
> > > > > > > > > > > > fine
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > alpha / beta version APIs that allow
> > > incompatible
> > > > > > > future
> > > > > > > > > > changes
> > > > > > > > > > > > > for
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > first version. But eventually we would need
> > to
> > > > > > > guarantee
> > > > > > > > > > > > backwards
> > > > > > > > > > > > > > > > > compatibility, so that an early version CR
> > can
> > > > work
> > > > > > > with
> > > > > > > > a
> > > > > > > > > > new
> > > > > > > > > > > > > > version
> > > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Another great point and please let me include
> > > that
> > > > on
> > > > > > the
> > > > > > > > > FLIP
> > > > > > > > > > > > page.
> > > > > > > > > > > > > > ;-)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think we should allow incompatible changes
> > for
> > > > the
> > > > > > > first
> > > > > > > > > one
> > > > > > > > > > or
> > > > > > > > > > > > two
> > > > > > > > > > > > > > > > versions, similar to how other major features
> > > have
> > > > > > > evolved
> > > > > > > > > > > > recently,
> > > > > > > > > > > > > > such
> > > > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Would be great to get broader feedback on
> this
> > > one.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas
> Weise
> > <
> > > > > > > > > thw@apache.org
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone
> > integration
> > > > > > > > > > > > > > > > > > > Maybe we should make this more clear in
> > the
> > > > > FLIP
> > > > > > > but
> > > > > > > > we
> > > > > > > > > > > > agreed
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > first version of the operator based on
> > the
> > > > > native
> > > > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > > > While this clearly does not cover all
> > > > use-cases
> > > > > > and
> > > > > > > > > > > > > requirements,
> > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > > > this would lead to a much smaller
> initial
> > > > > effort
> > > > > > > and
> > > > > > > > a
> > > > > > > > > > nicer
> > > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm also leaning towards the native
> > > > integration,
> > > > > as
> > > > > > > > long
> > > > > > > > > > as it
> > > > > > > > > > > > > > > reduces
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > MVP effort. Ultimately the operator will
> > need
> > > > to
> > > > > > also
> > > > > > > > > > support
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > standalone mode. I would like to gain
> more
> > > > > > confidence
> > > > > > > > > that
> > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > integration reduces the effort. While it
> > cuts
> > > > the
> > > > > > > > effort
> > > > > > > > > to
> > > > > > > > > > > > > handle
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > TM
> > > > > > > > > > > > > > > > > > pod creation, some mapping code from the
> CR
> > > to
> > > > > the
> > > > > > > > native
> > > > > > > > > > > > > > integration
> > > > > > > > > > > > > > > > > > client and config needs to be created. As
> > > > > mentioned
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > > FLIP,
> > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > integration requires the Flink job
> manager
> > to
> > > > > have
> > > > > > > > access
> > > > > > > > > > to
> > > > > > > > > > > > the
> > > > > > > > > > > > > > k8s
> > > > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > create pods, which in some scenarios may
> be
> > > > seen
> > > > > as
> > > > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > > Is the pod template in CR same with
> > > what
> > > > > > Flink
> > > > > > > > has
> > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > > > > > > Then I am afraid not the arbitrary
> > > > > field(e.g.
> > > > > > > > > > cpu/memory
> > > > > > > > > > > > > > > > resources)
> > > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Yes, pod template would look almost
> > > identical.
> > > > > > There
> > > > > > > > are
> > > > > > > > > a
> > > > > > > > > > few
> > > > > > > > > > > > > > > settings
> > > > > > > > > > > > > > > > > > that the operator will control (and that
> > may
> > > > need
> > > > > > to
> > > > > > > be
> > > > > > > > > > > > > > blacklisted),
> > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > in general we would not want to place
> > > > > > restrictions. I
> > > > > > > > > > think a
> > > > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > > > where a pod template is merged from
> > multiple
> > > > > layers
> > > > > > > > would
> > > > > > > > > > also
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by K Fred <yu...@gmail.com>.
Hi Gyula!

Alright! To make sure I understand: the savepoint path and other
savepoint-related configuration can be put into flinkConfiguration.
On the other hand, I think options such as automatic savepoint generation
should go into JobSpec along with the other job options, so that
flinkConfiguration only configures what is contained in the flink-conf.yaml
file.
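
As a rough sketch of that split (illustration only: the apiVersion, the
resource name, the bucket paths, and the job-level autoSavepointSecond field
are assumptions, not the prototype's confirmed schema; the keys under
flinkConfiguration are ordinary Flink options):

```yaml
apiVersion: flink.apache.org/v1alpha1    # assumed API group/version
kind: FlinkDeployment
metadata:
  name: example-job                      # hypothetical name
spec:
  flinkConfiguration:
    # Everything that could equally live in flink-conf.yaml stays here.
    state.savepoints.dir: s3://example-bucket/savepoints
    state.checkpoints.dir: s3://example-bucket/checkpoints
    execution.checkpointing.interval: "60s"
    high-availability.storageDir: s3://example-bucket/ha
  job:
    jarURI: local:///opt/flink/usrlib/example-job.jar   # hypothetical path
    # Hypothetical operator-level option with no flink-conf.yaml equivalent:
    # take a savepoint into state.savepoints.dir every n seconds.
    autoSavepointSecond: 3600
```

The point of the sketch is only that operator-driven behaviour sits under
job, while storage locations keep their existing Flink configuration keys.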

Best Wishes,
Peng Yuan

On Tue, Feb 15, 2022 at 7:02 PM Gyula Fóra <gy...@gmail.com> wrote:

> Hi Peng Yuan!
>
> While I do agree that savepoint path is a very important production
> configuration there are a lot of other things that come to my mind:
>  - savepoint dir
>  - checkpoint dir
>  - checkpoint interval/timeout
>  - high availability settings (provider/storagedir etc)
>
> just to name a few...
>
> While these are all production critical, they have nice clean Flink config
> settings to go with them. If we start introducing these to jobspec we only
> get confusion about priority order etc and it is going to be hard to change
> or remove them in the future. In any case we should validate that these
> configs exist in cases where users use a stateful upgrade mode for example.
> This is something we need to add for sure.
>
> As for the other options you mentioned like automatic savepoint generation
> for instance, those deserve an independent discussion of their own I
> believe :)
>
> Cheers,
> Gyula
>
> On Tue, Feb 15, 2022 at 11:23 AM K Fred <yu...@gmail.com> wrote:
>
> > Hi Matyas!
> >
> > Thanks for your reply!
> > For scenarios 1 and 3, I couldn't agree more with the podTemplate
> > solution; I missed this part.
> > For savepoint-related configuration, I think it is very important that it
> > can be specified in JobSpec, because the savepoint is a very common setting
> > when upgrading a job; if it is placed in JobSpec, it is obvious to the user
> > how to configure it. In addition, other advanced properties can be put
> > into flinkConfiguration, to be customized by expert users.
> > A set of savepoint configuration options could be as follows:
> >
> > > fromSavepoint: Job restart from
> > > autoSavepointSecond: Automatically take a savepoint to the
> > >   `savepointsDir` every n seconds.
> > > savepointsDir: Directory in which to store automatically taken
> > >   savepoints.
> > > savepointGeneration: Update the savepoint generation in the job status
> > >   for a running job (should be defined in JobStatus).
> >
> >
> > Best wishes,
> > Peng Yuan.
> >
> > On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <ma...@gmail.com>
> > wrote:
> >
> > > Hi Peng,
> > >
> > > Thanks for your feedback. Regarding scenarios 1 and 3, the podTemplate
> > > functionality in the operator could cover both. We also need to be
> > > careful about introducing proxy parameters in the CRD spec. The savepoint
> > > path, for example, is usually accompanied by a bunch of other
> > > configurations, so users need to use configuration params anyway. What do
> > > you think?
> > >
> > > Best,
> > > Matyas
> > >
> > > On Tue, Feb 15, 2022 at 8:58 AM K Fred <yu...@gmail.com> wrote:
> > >
> > > > Hi Gyula!
> > > >
> > > > I have reviewed the prototype design of flink-kubernetes-operator you
> > > > submitted, and I have the following questions:
> > > >
> > > > 1. Can support for pulling the Flink jar via a sidecar be added to the
> > > > JobSpec? Something like this:
> > > >
> > > > > initContainers:
> > > > >       - name: downloader
> > > > >         image: curlimages/curl
> > > > >         env:
> > > > >           - name: JAR_URL
> > > > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > > >           - name: DEST_PATH
> > > > >             value: /cache/flink-app.jar
> > > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > > >
> > > > 2. Can we add a savepoint path property to the job specification?
> > > > 3. Can we add an extra port to the JobManagerSpec and TaskManagerSpec
> > > > to expose some service, such as Prometheus? The property could look
> > > > like this:
> > > >
> > > > > extraPorts:
> > > > >       - name: prom
> > > > >         containerPort: 9249
> > > >
> > > >
> > > >
> > > > Best wishes,
> > > > Peng Yuan
> > > >
> > > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gy...@apache.org>
> wrote:
> > > >
> > > > > Hi Flink Devs!
> > > > >
> > > > > > We would like to present to you the first prototype of the
> > > > > > flink-kubernetes-operator that was built based on the FLIP and the
> > > > > > discussion on this mail thread. We would also like to call out some
> > > > > > design decisions that we have made regarding architecture
> > > > > > components that were not explicitly mentioned in the FLIP
> > > > > > document/thread and give you the opportunity to raise any concerns
> > > > > > here.
> > > > >
> > > > > You can find the initial prototype here:
> > > > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > > >
> > > > > > We will leave the PR open for 1-2 days before merging to let people
> > > > > > comment on it, but please be mindful that this is an initial
> > > > > > prototype with many rough edges. It is not intended to be a
> > > > > > complete implementation of the FLIP specs as that will take some
> > > > > > more work from all of us :)
> > > > >
> > > > >
> > > > > > *Prototype feature set:*
> > > > > > The prototype contains a basic working version of the
> > > > > > flink-kubernetes-operator that supports deployment and lifecycle
> > > > > > management of a stateful native flink application. We have basic
> > > > > > support for stateful and stateless upgrades, UI ingress, pod
> > > > > > templates etc. Error handling at this point is largely missing.
> > > > >
> > > > >
> > > > > > *Features / design decisions that were not explicitly discussed in
> > > > > > this thread*
> > > > >
> > > > > > *Basic Admission control using a Webhook*
> > > > > > Standard resource admission control in Kubernetes to validate and
> > > > > > potentially reject resources is done through Webhooks.
> > > > > >
> > > > > > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > > > This is a necessary mechanism to give the user an upfront error
> > > > > > when an incorrect resource was submitted. In the Flink operator's
> > > > > > case we need to validate that the FlinkDeployment yaml actually
> > > > > > makes sense and does not contain erroneous config options that
> > > > > > would inevitably lead to deployment/job failures.
> > > > >
> > > > > > We have implemented a simple webhook that we can use for this type
> > > > > > of validation, as a separate Maven module
> > > > > > (flink-kubernetes-webhook). The webhook is an optional component
> > > > > > and can be enabled or disabled during deployment. To avoid pulling
> > > > > > in new external dependencies we have used the Flink Shaded Netty
> > > > > > module to build the simple REST endpoint required. If the community
> > > > > > feels that Netty adds unnecessary complexity to the webhook
> > > > > > implementation we are open to alternative backends such as
> > > > > > Spring Boot, for instance, which would practically eliminate all
> > > > > > the boilerplate.
> > > > >
> > > > >
> > > > > > *Helm Chart for deployment*
> > > > > > Helm charts provide an industry standard way of managing kubernetes
> > > > > > deployments. We have created a helm chart prototype that can be
> > > > > > used to deploy the operator together with all required resources.
> > > > > > The helm chart allows easy configuration for things like images,
> > > > > > namespaces etc and flags to control specific parts of the
> > > > > > deployment such as RBAC or the webhook.
> > > > >
> > > > > > The helm chart provided is intended to be a first version that
> > > > > > worked for us during development but we expect to have a lot of
> > > > > > iterations on it based on the feedback from the community.
> > > > >
> > > > > *Acknowledgment*
> > > > > > We would like to thank everyone who has provided support and
> > > > > > valuable feedback on this FLIP.
> > > > > > We would also like to thank Yang Wang & Alexis Sarda-Espinosa
> > > > > > specifically for making their operators open source and available
> > > > > > to us which had a big impact on the FLIP and the prototype.
> > > > > >
> > > > > > We are looking forward to continuing development on the operator
> > > > > > together with the broader community.
> > > > > > All work will be tracked using the ASF Jira from now on.
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <yu...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi Gyula,
> > > > > >
> > > > > > Thanks!
> > > > > > > It's great to see the project getting started and I can't wait to
> > > > > > > see the PR and start contributing code.😄😄😄
> > > > > >
> > > > > > Best Wishes!
> > > > > > Peng Yuan
> > > > > >
> > > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <gyula.fora@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Peng Yuan!
> > > > > > >
> > > > > > > The repo is already created:
> > > > > > > https://github.com/apache/flink-kubernetes-operator
> > > > > > >
> > > > > > > > We will open the PR with the initial prototype later today,
> > > > > > > > stay tuned in this thread! :)
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <yuanpengfred@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > > Has the flink-kubernetes-operator project been created on
> > > > > > > > > GitHub?
> > > > > > > >
> > > > > > > > Peng Yuan
> > > > > > > >
> > > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <
> > gyula.fora@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > I agree with flink-kubernetes-operator as the repo name :)
> > > > > > > > > Don't have any better idea
> > > > > > > > >
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <
> thw@apache.org>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > > Thanks for the continued feedback and discussion. Looks
> > > > > > > > > > > like we are ready to start a VOTE, I will initiate it
> > > > > > > > > > > shortly.
> > > > > > > > > >
> > > > > > > > > > In parallel it would be good to find the repository name.
> > > > > > > > > >
> > > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > > >
> > > > > > > > > > > I thought "flink-operator" could be a bit misleading
> > > > > > > > > > > since the term operator already has a meaning in Flink.
> > > > > > > > > > >
> > > > > > > > > > > I also considered "flink-k8s-operator" but that would be
> > > > > > > > > > > almost identical to existing operator implementations and
> > > > > > > > > > > could lead to confusion in the future.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Thomas
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <
> > > > gyula.fora@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Danny,
> > > > > > > > > > >
> > > > > > > > > > > So far we have been focusing our dev efforts on the
> > initial
> > > > > > native
> > > > > > > > > > > implementation with the team.
> > > > > > > > > > > If the discussion and vote goes well for this FLIP we
> are
> > > > > looking
> > > > > > > > > forward
> > > > > > > > > > > to contributing the initial version sometime next week
> > > > (fingers
> > > > > > > > > crossed).
> > > > > > > > > > >
> > > > > > > > > > > At that point I think we can already start the dev work
> > to
> > > > > > support
> > > > > > > > the
> > > > > > > > > > > standalone mode as well, especially if you can dedicate
> > > some
> > > > > > effort
> > > > > > > > to
> > > > > > > > > > > pushing that side.
> > > > > > > > > > > Working together on this sounds like a great idea and
> we
> > > > should
> > > > > > > start
> > > > > > > > > as
> > > > > > > > > > > soon as possible! :)
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Gyula
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I have been discussing this one with my team. We are
> > > > > interested
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > > Standalone mode, and are willing to contribute
> towards
> > > the
> > > > > > > > > > implementation.
> > > > > > > > > > > > Potentially we can work together to support both
> modes
> > in
> > > > > > > parallel?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > > > > > gyula.fora@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Versioning:
> > > > > > > > > > > > > Versioning will be independent from Flink and the
> > > > operator
> > > > > > will
> > > > > > > > > > depend
> > > > > > > > > > > > on a
> > > > > > > > > > > > > fixed flink version (in every given operator
> > version).
> > > > > > > > > > > > > This should be the exact same setup as with
> Stateful
> > > > > > Functions
> > > > > > > (
> > > > > > > > > > > > > https://github.com/apache/flink-statefun). So
> > > > independent
> > > > > > > > release
> > > > > > > > > > cycle
> > > > > > > > > > > > > but
> > > > > > > > > > > > > still within the Flink umbrella.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > > I think that's a very good point, as general
> > exception
> > > > > > handling
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > > different failure scenarios is a tricky problem. I
> > > think
> > > > > the
> > > > > > > > > > exception
> > > > > > > > > > > > > classifiers and retry strategies could avoid a lot
> of
> > > > > manual
> > > > > > > > > > intervention
> > > > > > > > > > > > > from the user. We will definitely need to add
> > something
> > > > > like
> > > > > > > > this.
> > > > > > > > > > Once
> > > > > > > > > > > > we
> > > > > > > > > > > > > have the repo created with the initial operator
> code
> > we
> > > > > > should
> > > > > > > > open
> > > > > > > > > > some
> > > > > > > > > > > > > tickets for this and put it on the short term
> > roadmap!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Gyula
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Great work on the FLIP, I am looking forward to
> > this
> > > > > one. I
> > > > > > > > agree
> > > > > > > > > > that
> > > > > > > > > > > > we
> > > > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have general feedback around how we will handle
> > job
> > > > > > > > submission
> > > > > > > > > > > > failure
> > > > > > > > > > > > > > and retry. As discussed in the Rejected
> > Alternatives
> > > > > > section,
> > > > > > > > we
> > > > > > > > > > can
> > > > > > > > > > > > use
> > > > > > > > > > > > > > Java to handle job submission failures from the
> > Flink
> > > > > > client.
> > > > > > > > It
> > > > > > > > > > would
> > > > > > > > > > > > be
> > > > > > > > > > > > > > useful to have the ability to configure exception
> > > > > > classifiers
> > > > > > > > and
> > > > > > > > > > retry
> > > > > > > > > > > > > > strategy as part of operator configuration.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Given this will be in a separate Github
> repository
> > I
> > > am
> > > > > > > curious
> > > > > > > > > how
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > versioning strategy will work in relation to the
> > > Flink
> > > > > > > version?
> > > > > > > > > Do
> > > > > > > > > > we
> > > > > > > > > > > > > have
> > > > > > > > > > > > > > any other components with a similar setup I can
> > look
> > > > at?
> > > > > > Will
> > > > > > > > the
> > > > > > > > > > > > > operator
> > > > > > > > > > > > > > version track Flink or will it use its own
> > versioning
> > > > > > > strategy
> > > > > > > > > > with a
> > > > > > > > > > > > > Flink
> > > > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you for the great feedback, Thomas has
> > > updated
> > > > > the
> > > > > > > FLIP
> > > > > > > > > > page
> > > > > > > > > > > > > > > accordingly. If you are comfortable with the
> > > > currently
> > > > > > > > existing
> > > > > > > > > > > > design
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > depth in the FLIP [1] I suggest moving forward
> to
> > > the
> > > > > > > voting
> > > > > > > > > > stage -
> > > > > > > > > > > > > once
> > > > > > > > > > > > > > > that reaches a positive conclusion it lets us
> > > create
> > > > > the
> > > > > > > > > separate
> > > > > > > > > > > > code
> > > > > > > > > > > > > > > repository under the flink project for the
> > > operator.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I encourage everyone to keep improving the
> > details
> > > in
> > > > > the
> > > > > > > > > > meantime,
> > > > > > > > > > > > > > however
> > > > > > > > > > > > > > > I believe given the existing design and the
> > general
> > > > > > > sentiment
> > > > > > > > > on
> > > > > > > > > > this
> > > > > > > > > > > > > > > thread that the most efficient path from here
> is
> > > > > starting
> > > > > > > the
> > > > > > > > > > > > > > > implementation so that we can collectively
> > iterate
> > > > over
> > > > > > it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > > > > > > > thw@apache.org>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the feedback and please see
> > responses
> > > > > below
> > > > > > > -->
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong
> Song <
> > > > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and
> > > > everyone
> > > > > > for
> > > > > > > > the
> > > > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > > Deploying a Flink session cluster via
> > kubectl &
> > > > CR
> > > > > > and
> > > > > > > > then
> > > > > > > > > > > > > > submitting
> > > > > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > > > to the cluster via Flink cli / REST is
> > probably
> > > > the
> > > > > > > > > approach
> > > > > > > > > > that
> > > > > > > > > > > > > > > > requires
> > > > > > > > > > > > > > > > > the least effort. However, I'd like to
> point
> > > out
> > > > 2
> > > > > > > > > > weaknesses.
> > > > > > > > > > > > > > > > > 1. A lot of users use Flink in
> > > perjob/application
> > > > > > > modes.
> > > > > > > > > For
> > > > > > > > > > > > these
> > > > > > > > > > > > > > > users,
> > > > > > > > > > > > > > > > > having to run the job in two steps (deploy
> > the
> > > > > > cluster,
> > > > > > > > and
> > > > > > > > > > > > submit
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > job)
> > > > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > > > 2. One of our motivations is being able to
> > > manage
> > > > > > Flink
> > > > > > > > > > > > > applications'
> > > > > > > > > > > > > > > > > lifecycles with kubectl. Submitting jobs
> from
> > > cli
> > > > > > > sounds
> > > > > > > > > not
> > > > > > > > > > > > > aligned
> > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > > > I think it's probably worth it to support
> > > > > submitting
> > > > > > > jobs
> > > > > > > > > via
> > > > > > > > > > > > > > kubectl &
> > > > > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > > > in the first version, both together with
> > > > deploying
> > > > > > the
> > > > > > > > > > cluster
> > > > > > > > > > > > like
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > perjob/application mode and after deploying
> > the
> > > > > > cluster
> > > > > > > > > like
> > > > > > > > > > in
> > > > > > > > > > > > > > session
> > > > > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The intention is to support application
> > > management
> > > > > > > through
> > > > > > > > > > operator
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > CR,
> > > > > > > > > > > > > > > > which means there won't be any 2 step
> > submission
> > > > > > process,
> > > > > > > > > > which as
> > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > allude to would defeat the purpose of this
> > > project.
> > > > > The
> > > > > > > CR
> > > > > > > > > > example
> > > > > > > > > > > > > > shows
> > > > > > > > > > > > > > > > the application part. Please note that the
> bare
> > > > > cluster
> > > > > > > > > > support is
> > > > > > > > > > > > an
> > > > > > > > > > > > > > > > *additional* feature for scenarios that
> require
> > > > > > external
> > > > > > > > job
> > > > > > > > > > > > > > management.
> > > > > > > > > > > > > > > Is
> > > > > > > > > > > > > > > > there anything on the FLIP page that creates
> a
> > > > > > different
> > > > > > > > > > > > impression?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > > Which Flink versions does the operator plan
> > to
> > > > > > support?
> > > > > > > > > > > > > > > > > 1. Native K8s deployment was firstly
> > introduced
> > > > in
> > > > > > > Flink
> > > > > > > > > 1.10
> > > > > > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink
> 1.12
> > > > > > > > > > > > > > > > > 3. The Pod template support was introduced
> in
> > > > Flink
> > > > > > > 1.13
> > > > > > > > > > > > > > > > > > 4. There were some changes to the Flink
> docker
> > > > image
> > > > > > > > > > entrypoint
> > > > > > > > > > > > > script
> > > > > > > > > > > > > > > in,
> > > > > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Great, thanks for providing this. It is
> > important
> > > > for
> > > > > > the
> > > > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > going forward also. We are targeting Flink
> > 1.14.x
> > > > > > > upwards.
> > > > > > > > > > Before
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > operator is ready there will be another Flink
> > > > > release.
> > > > > > > > Let's
> > > > > > > > > > see if
> > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > What kind of API compatibility we can
> commit
> > > to?
> > > > > It's
> > > > > > > > > > probably
> > > > > > > > > > > > fine
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > alpha / beta version APIs that allow
> > > incompatible
> > > > > > > future
> > > > > > > > > > changes
> > > > > > > > > > > > > for
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > first version. But eventually we would need
> > to
> > > > > > > guarantee
> > > > > > > > > > > > backwards
> > > > > > > > > > > > > > > > > compatibility, so that an early version CR
> > can
> > > > work
> > > > > > > with
> > > > > > > > a
> > > > > > > > > > new
> > > > > > > > > > > > > > version
> > > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Another great point and please let me include
> > > that
> > > > on
> > > > > > the
> > > > > > > > > FLIP
> > > > > > > > > > > > page.
> > > > > > > > > > > > > > ;-)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think we should allow incompatible changes
> > for
> > > > the
> > > > > > > first
> > > > > > > > > one
> > > > > > > > > > or
> > > > > > > > > > > > two
> > > > > > > > > > > > > > > > versions, similar to how other major features
> > > have
> > > > > > > evolved
> > > > > > > > > > > > recently,
> > > > > > > > > > > > > > such
> > > > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Would be great to get broader feedback on
> this
> > > one.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas
> Weise
> > <
> > > > > > > > > thw@apache.org
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone
> > integration
> > > > > > > > > > > > > > > > > > > Maybe we should make this more clear in
> > the
> > > > > FLIP
> > > > > > > but
> > > > > > > > we
> > > > > > > > > > > > agreed
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > first version of the operator based on
> > the
> > > > > native
> > > > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > > > While this clearly does not cover all
> > > > use-cases
> > > > > > and
> > > > > > > > > > > > > requirements,
> > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > > > this would lead to a much smaller
> initial
> > > > > effort
> > > > > > > and
> > > > > > > > a
> > > > > > > > > > nicer
> > > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm also leaning towards the native
> > > > integration,
> > > > > as
> > > > > > > > long
> > > > > > > > > > as it
> > > > > > > > > > > > > > > reduces
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > MVP effort. Ultimately the operator will
> > need
> > > > to
> > > > > > also
> > > > > > > > > > support
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > standalone mode. I would like to gain
> more
> > > > > > confidence
> > > > > > > > > that
> > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > integration reduces the effort. While it
> > cuts
> > > > the
> > > > > > > > effort
> > > > > > > > > to
> > > > > > > > > > > > > handle
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > TM
> > > > > > > > > > > > > > > > > > pod creation, some mapping code from the
> CR
> > > to
> > > > > the
> > > > > > > > native
> > > > > > > > > > > > > > integration
> > > > > > > > > > > > > > > > > > client and config needs to be created. As
> > > > > mentioned
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > > FLIP,
> > > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > > integration requires the Flink job
> manager
> > to
> > > > > have
> > > > > > > > access
> > > > > > > > > > to
> > > > > > > > > > > > the
> > > > > > > > > > > > > > k8s
> > > > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > create pods, which in some scenarios may
> be
> > > > seen
> > > > > as
> > > > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > > Is the pod template in CR same with
> > > what
> > > > > > Flink
> > > > > > > > has
> > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > > > > > > Then I am afraid not the arbitrary
> > > > > field(e.g.
> > > > > > > > > > cpu/memory
> > > > > > > > > > > > > > > > resources)
> > > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Yes, pod template would look almost
> > > identical.
> > > > > > There
> > > > > > > > are
> > > > > > > > > a
> > > > > > > > > > few
> > > > > > > > > > > > > > > settings
> > > > > > > > > > > > > > > > > > that the operator will control (and that
> > may
> > > > need
> > > > > > to
> > > > > > > be
> > > > > > > > > > > > > > blacklisted),
> > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > in general we would not want to place
> > > > > > restrictions. I
> > > > > > > > > > think a
> > > > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > > > where a pod template is merged from
> > multiple
> > > > > layers
> > > > > > > > would
> > > > > > > > > > also
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Peng Yuan!

While I do agree that the savepoint path is a very important production
configuration, a lot of other settings come to mind:
 - savepoint dir
 - checkpoint dir
 - checkpoint interval/timeout
 - high availability settings (provider, storage dir, etc.)

just to name a few...

While these are all production critical, they have nice clean Flink config
settings to go with them. If we start introducing these to the JobSpec we
only get confusion about priority order etc., and it is going to be hard to
change or remove them in the future. In any case we should validate that
these configs exist in cases where users use a stateful upgrade mode, for
example. This is something we need to add for sure.
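
To make the validation idea concrete, here is a minimal sketch, in Python
for brevity rather than as operator code. The upgrade mode names and the
mapping from mode to required keys are assumptions made purely for
illustration; the keys themselves (state.savepoints.dir,
state.checkpoints.dir, high-availability, high-availability.storageDir) are
standard Flink configuration options.

```python
from typing import Dict, List

# Standard Flink configuration keys for the settings listed above.
SAVEPOINT_DIR = "state.savepoints.dir"
CHECKPOINT_DIR = "state.checkpoints.dir"
HA_MODE = "high-availability"
HA_STORAGE_DIR = "high-availability.storageDir"

# Hypothetical mapping: which keys each upgrade mode needs in order to work.
# The mode names are illustrative assumptions, not taken from the prototype.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "stateless": [],
    "savepoint": [SAVEPOINT_DIR],
    "last-state": [CHECKPOINT_DIR, HA_MODE, HA_STORAGE_DIR],
}


def validate_flink_configuration(
    upgrade_mode: str, flink_conf: Dict[str, str]
) -> List[str]:
    """Return readable errors; an empty list means the spec is accepted."""
    if upgrade_mode not in REQUIRED_KEYS:
        return [f"Unknown upgrade mode: {upgrade_mode}"]
    return [
        f"Upgrade mode '{upgrade_mode}' requires '{key}' in flinkConfiguration"
        for key in REQUIRED_KEYS[upgrade_mode]
        # Treat missing, None, and blank values the same way.
        if not (flink_conf.get(key) or "").strip()
    ]
```

With something like this the settings stay in flinkConfiguration, and the
operator (or the webhook) only rejects specs where a stateful upgrade mode
is requested without the configs it depends on.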

As for the other options you mentioned like automatic savepoint generation
for instance, those deserve an independent discussion of their own I
believe :)

Cheers,
Gyula

On Tue, Feb 15, 2022 at 11:23 AM K Fred <yu...@gmail.com> wrote:

> Hi Matyas!
>
> Thanks for your reply!
> For scenarios 1 and 3, I couldn't agree more with the podTemplate
> solution; I missed this part.
> For savepoint-related configuration, I think it's very important that it
> can be specified in the JobSpec, because a savepoint is a very common
> setting when upgrading a job, and placing it in the JobSpec makes it
> obvious to the user. In addition, other advanced properties can be put
> into flinkConfiguration, customized by expert users.
> The savepoint configuration I have in mind is as follows:
>
>  - fromSavepoint: Job restart from
>  - autoSavepointSecond: Automatically take a savepoint to the
>    `savepointsDir` every n seconds.
>  - savepointsDir: Savepoints dir where to store automatically taken
>    savepoints
>  - savepointGeneration: Update savepoint generation of job status for a
>    running job (should be defined in JobStatus)
>
>
> Best wishes,
> Peng Yuan.
>
> On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <ma...@gmail.com>
> wrote:
>
> > Hi Peng,
> >
> > Thanks for your feedback. Regarding scenarios 1 and 3, the podTemplate
> > functionality in the operator could cover both. We also need to be
> > careful about introducing proxy parameters in the CRD spec. The savepoint
> > path is usually accompanied by a bunch of other configurations, for
> > example, so users need to use configuration params anyway. What do you
> > think?
> >
> > Best,
> > Matyas
> >
> > On Tue, Feb 15, 2022 at 8:58 AM K Fred <yu...@gmail.com> wrote:
> >
> > > Hi Gyula!
> > >
> > > I have reviewed the prototype design of flink-kubernetes-operator you
> > > submitted, and I have the following questions:
> > >
> > > 1. Can support for pulling the Flink jar package from a sidecar be
> > > added to the JobSpec? Like this:
> > >
> > > > initContainers:
> > > >       - name: downloader
> > > >         image: curlimages/curl
> > > >         env:
> > > >           - name: JAR_URL
> > > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > > >           - name: DEST_PATH
> > > >             value: /cache/flink-app.jar
> > > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> > >
> > > 2. Can we add a savepoint path property to the job specification?
> > > 3. Can we add an extra port to the JobManagerSpec and TaskManagerSpec to
> > > expose some service, such as Prometheus? The property could look like
> > > this:
> > >
> > > > extraPorts:
> > > >       - name: prom
> > > >         containerPort: 9249
> > >
> > >
> > >
> > > Best wishes,
> > > Peng Yuan
> > >
> > > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gy...@apache.org> wrote:
> > >
> > > > Hi Flink Devs!
> > > >
> > > > We would like to present to you the first prototype of the
> > > > flink-kubernetes-operator that was built based on the FLIP and the
> > > > discussion on this mail thread. We would also like to call out some
> > > > design decisions that we have made regarding architecture components
> > > > that were not explicitly mentioned in the FLIP document/thread and
> > > > give you the opportunity to raise any concerns here.
> > > >
> > > > You can find the initial prototype here:
> > > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > > >
> > > > We will leave the PR open for 1-2 days before merging to let people
> > > > comment on it, but please be mindful that this is an initial prototype
> > > > with many rough edges. It is not intended to be a complete
> > > > implementation of the FLIP specs as that will take some more work from
> > > > all of us :)
> > > >
> > > >
> > > > *Prototype feature set:*
> > > > The prototype contains a basic working version of the
> > > > flink-kubernetes-operator that supports deployment and lifecycle
> > > > management of a stateful native flink application. We have basic
> > > > support for stateful and stateless upgrades, UI ingress, pod templates
> > > > etc. Error handling at this point is largely missing.
> > > >
> > > >
> > > > *Features / design decisions that were not explicitly discussed in
> > > > this thread*
> > > >
> > > > *Basic Admission control using a Webhook*
> > > > Standard resource admission control in Kubernetes to validate and
> > > > potentially reject resources is done through Webhooks.
> > > > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > > > This is a necessary mechanism to give the user an upfront error when
> > > > an incorrect resource was submitted. In the Flink operator's case we
> > > > need to validate that the FlinkDeployment yaml actually makes sense
> > > > and does not contain erroneous config options that would inevitably
> > > > lead to deployment/job failures.
> > > >
> > > > We have implemented a simple webhook that we can use for this type of
> > > > validation, as a separate maven module (flink-kubernetes-webhook).
> > > > The webhook is an optional component and can be enabled or disabled
> > > > during deployment. To avoid pulling in new external dependencies we
> > > > have used the Flink Shaded Netty module to build the simple rest
> > > > endpoint required. If the community feels that Netty adds unnecessary
> > > > complexity to the webhook implementation we are open to alternative
> > > > backends such as Springboot for instance which would practically
> > > > eliminate all the boilerplate.
> > > >
> > > >
> > > > *Helm Chart for deployment*
> > > > Helm charts provide an industry standard way of managing kubernetes
> > > > deployments. We have created a helm chart prototype that can be used
> > > > to deploy the operator together with all required resources. The helm
> > > > chart allows easy configuration for things like images, namespaces etc
> > > > and flags to control specific parts of the deployment such as RBAC or
> > > > the webhook.
> > > >
> > > > The helm chart provided is intended to be a first version that worked
> > > > for us during development but we expect to have a lot of iterations on
> > > > it based on the feedback from the community.
> > > >
> > > > *Acknowledgment*
> > > > We would like to thank everyone who has provided support and valuable
> > > > feedback on this FLIP.
> > > > We would also like to thank Yang Wang & Alexis Sarda-Espinosa
> > > > specifically for making their operators open source and available to
> > > > us which had a big impact on the FLIP and the prototype.
> > > >
> > > > We are looking forward to continuing development on the operator
> > > > together with the broader community.
> > > > All work will be tracked using the ASF Jira from now on.
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <yu...@gmail.com>
> wrote:
> > > >
> > > > > Hi Gyula,
> > > > >
> > > > > Thanks!
> > > > > > It's great to see the project getting started and I can't wait to
> > > > > > see the PR and start contributing code.😄😄😄
> > > > >
> > > > > Best Wishes!
> > > > > Peng Yuan
> > > > >
> > > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <gy...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Peng Yuan!
> > > > > >
> > > > > > The repo is already created:
> > > > > > https://github.com/apache/flink-kubernetes-operator
> > > > > >
> > > > > > > We will open the PR with the initial prototype later today, stay
> > > > > > > tuned in this thread! :)
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <yu...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > > Has the flink-kubernetes-operator project been created on
> > > > > > > > GitHub?
> > > > > > > Peng Yuan
> > > > > > >
> > > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <
> gyula.fora@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > I agree with flink-kubernetes-operator as the repo name :)
> > > > > > > > Don't have any better idea
> > > > > > > >
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <th...@apache.org>
> > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > > Thanks for the continued feedback and discussion. Looks like
> > > > > > > > > > we are ready to start a VOTE, I will initiate it shortly.
> > > > > > > > >
> > > > > > > > > In parallel it would be good to find the repository name.
> > > > > > > > >
> > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > >
> > > > > > > > > > I thought "flink-operator" could be a bit misleading since
> > > > > > > > > > the term operator already has a meaning in Flink.
> > > > > > > > >
> > > > > > > > > > I also considered "flink-k8s-operator" but that would be
> > > > > > > > > > almost identical to existing operator implementations and
> > > > > > > > > > could lead to confusion in the future.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Thomas
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <
> > > gyula.fora@gmail.com>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Danny,
> > > > > > > > > >
> > > > > > > > > > > So far we have been focusing our dev efforts on the initial
> > > > > > > > > > > native implementation with the team.
> > > > > > > > > > > If the discussion and vote goes well for this FLIP we are
> > > > > > > > > > > looking forward to contributing the initial version sometime
> > > > > > > > > > > next week (fingers crossed).
> > > > > > > > > >
> > > > > > > > > > > At that point I think we can already start the dev work to
> > > > > > > > > > > support the standalone mode as well, especially if you can
> > > > > > > > > > > dedicate some effort to pushing that side.
> > > > > > > > > > > Working together on this sounds like a great idea and we
> > > > > > > > > > > should start as soon as possible! :)
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Gyula
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > > > > > dannycranmer@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I have been discussing this one with my team. We are
> > > > interested
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > Standalone mode, and are willing to contribute towards
> > the
> > > > > > > > > implementation.
> > > > > > > > > > > Potentially we can work together to support both modes
> in
> > > > > > parallel?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > > > > gyula.fora@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Danny!
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > > >
> > > > > > > > > > > > Versioning:
> > > > > > > > > > > > Versioning will be independent from Flink and the
> > > operator
> > > > > will
> > > > > > > > > depend
> > > > > > > > > > > on a
> > > > > > > > > > > > fixed flink version (in every given operator
> version).
> > > > > > > > > > > > This should be the exact same setup as with Stateful
> > > > > Functions
> > > > > > (
> > > > > > > > > > > > https://github.com/apache/flink-statefun). So
> > > independent
> > > > > > > release
> > > > > > > > > cycle
> > > > > > > > > > > > but
> > > > > > > > > > > > still within the Flink umbrella.
> > > > > > > > > > > >
> > > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > > I think that's a very good point, as general
> exception
> > > > > handling
> > > > > > > for
> > > > > > > > > the
> > > > > > > > > > > > different failure scenarios is a tricky problem. I
> > think
> > > > the
> > > > > > > > > exception
> > > > > > > > > > > > classifiers and retry strategies could avoid a lot of
> > > > manual
> > > > > > > > > intervention
> > > > > > > > > > > > from the user. We will definitely need to add
> something
> > > > like
> > > > > > > this.
> > > > > > > > > Once
> > > > > > > > > > > we
> > > > > > > > > > > > have the repo created with the initial operator code
> we
> > > > > should
> > > > > > > open
> > > > > > > > > some
> > > > > > > > > > > > tickets for this and put it on the short term
> roadmap!
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Gyula
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hey team,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Great work on the FLIP, I am looking forward to
> this
> > > > one. I
> > > > > > > agree
> > > > > > > > > that
> > > > > > > > > > > we
> > > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I have general feedback around how we will handle
> job
> > > > > > > submission
> > > > > > > > > > > failure
> > > > > > > > > > > > > and retry. As discussed in the Rejected
> Alternatives
> > > > > section,
> > > > > > > we
> > > > > > > > > can
> > > > > > > > > > > use
> > > > > > > > > > > > > Java to handle job submission failures from the
> Flink
> > > > > client.
> > > > > > > It
> > > > > > > > > would
> > > > > > > > > > > be
> > > > > > > > > > > > > useful to have the ability to configure exception
> > > > > classifiers
> > > > > > > and
> > > > > > > > > retry
> > > > > > > > > > > > > strategy as part of operator configuration.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given this will be in a separate Github repository
> I
> > am
> > > > > > curious
> > > > > > > > how
> > > > > > > > > > > ther
> > > > > > > > > > > > > versioning strategy will work in relation to the
> > Flink
> > > > > > version?
> > > > > > > > Do
> > > > > > > > > we
> > > > > > > > > > > > have
> > > > > > > > > > > > > any other components with a similar setup I can
> look
> > > at?
> > > > > Will
> > > > > > > the
> > > > > > > > > > > > operator
> > > > > > > > > > > > > version track Flink or will it use its own
> versioning
> > > > > > strategy
> > > > > > > > > with a
> > > > > > > > > > > > Flink
> > > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you for the great feedback, Thomas has
> > updated
> > > > the
> > > > > > FLIP
> > > > > > > > > page
> > > > > > > > > > > > > > accordingly. If you are comfortable with the
> > > currently
> > > > > > > existing
> > > > > > > > > > > design
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > depth in the FLIP [1] I suggest moving forward to
> > the
> > > > > > voting
> > > > > > > > > stage -
> > > > > > > > > > > > once
> > > > > > > > > > > > > > that reaches a positive conclusion it lets us
> > create
> > > > the
> > > > > > > > separate
> > > > > > > > > > > code
> > > > > > > > > > > > > > repository under the flink project for the
> > operator.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I encourage everyone to keep improving the
> details
> > in
> > > > the
> > > > > > > > > meantime,
> > > > > > > > > > > > > however
> > > > > > > > > > > > > > I believe given the existing design and the
> general
> > > > > > sentiment
> > > > > > > > on
> > > > > > > > > this
> > > > > > > > > > > > > > thread that the most efficient path from here is
> > > > starting
> > > > > > the
> > > > > > > > > > > > > > implementation so that we can collectively
> iterate
> > > over
> > > > > it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > > > > > > thw@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > HI Xintong,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the feedback and please see
> responses
> > > > below
> > > > > > -->
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and
> > > everyone
> > > > > for
> > > > > > > the
> > > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > > Deploying a Flink session cluster via
> kubectl &
> > > CR
> > > > > and
> > > > > > > then
> > > > > > > > > > > > > submitting
> > > > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > > to the cluster via Flink cli / REST is
> probably
> > > the
> > > > > > > > approach
> > > > > > > > > that
> > > > > > > > > > > > > > > requires
> > > > > > > > > > > > > > > > the least effort. However, I'd like to point
> > out
> > > 2
> > > > > > > > > weaknesses.
> > > > > > > > > > > > > > > > 1. A lot of users use Flink in
> > perjob/application
> > > > > > modes.
> > > > > > > > For
> > > > > > > > > > > these
> > > > > > > > > > > > > > users,
> > > > > > > > > > > > > > > > having to run the job in two steps (deploy
> the
> > > > > cluster,
> > > > > > > and
> > > > > > > > > > > submit
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > job)
> > > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > > 2. One of our motivations is being able to
> > manage
> > > > > Flink
> > > > > > > > > > > > applications'
> > > > > > > > > > > > > > > > lifecycles with kubectl. Submitting jobs from
> > cli
> > > > > > sounds
> > > > > > > > not
> > > > > > > > > > > > aligned
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > > I think it's probably worth it to support
> > > > submitting
> > > > > > jobs
> > > > > > > > via
> > > > > > > > > > > > > kubectl &
> > > > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > > in the first version, both together with
> > > deploying
> > > > > the
> > > > > > > > > cluster
> > > > > > > > > > > like
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > perjob/application mode and after deploying
> the
> > > > > cluster
> > > > > > > > like
> > > > > > > > > in
> > > > > > > > > > > > > session
> > > > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The intention is to support application
> > management
> > > > > > through
> > > > > > > > > operator
> > > > > > > > > > > > and
> > > > > > > > > > > > > > CR,
> > > > > > > > > > > > > > > which means there won't be any 2 step
> submission
> > > > > process,
> > > > > > > > > which as
> > > > > > > > > > > > you
> > > > > > > > > > > > > > > allude to would defeat the purpose of this
> > project.
> > > > The
> > > > > > CR
> > > > > > > > > example
> > > > > > > > > > > > > shows
> > > > > > > > > > > > > > > the application part. Please note that the bare
> > > > cluster
> > > > > > > > > support is
> > > > > > > > > > > an
> > > > > > > > > > > > > > > *additional* feature for scenarios that require
> > > > > external
> > > > > > > job
> > > > > > > > > > > > > management.
> > > > > > > > > > > > > > Is
> > > > > > > > > > > > > > > there anything on the FLIP page that creates a
> > > > > different
> > > > > > > > > > > impression?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > > Which Flink versions does the operator plan
> to
> > > > > support?
> > > > > > > > > > > > > > > > 1. Native K8s deployment was firstly
> introduced
> > > in
> > > > > > Flink
> > > > > > > > 1.10
> > > > > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > > > > > 3. The Pod template support was introduced in
> > > Flink
> > > > > > 1.13
> > > > > > > > > > > > > > > > 4. There was some changes to the Flink docker
> > > image
> > > > > > > > > entrypoint
> > > > > > > > > > > > script
> > > > > > > > > > > > > > in,
> > > > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Great, thanks for providing this. It is
> important
> > > for
> > > > > the
> > > > > > > > > > > > compatibility
> > > > > > > > > > > > > > > going forward also. We are targeting Flink
> 1.14.x
> > > > > > upwards.
> > > > > > > > > Before
> > > > > > > > > > > the
> > > > > > > > > > > > > > > operator is ready there will be another Flink
> > > > release.
> > > > > > > Let's
> > > > > > > > > see if
> > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > What kind of API compatibility we can commit
> > to?
> > > > It's
> > > > > > > > > probably
> > > > > > > > > > > fine
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > alpha / beta version APIs that allow
> > incompatible
> > > > > > future
> > > > > > > > > changes
> > > > > > > > > > > > for
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > first version. But eventually we would need
> to
> > > > > > guarantee
> > > > > > > > > > > backwards
> > > > > > > > > > > > > > > > compatibility, so that an early version CR
> can
> > > work
> > > > > > with
> > > > > > > a
> > > > > > > > > new
> > > > > > > > > > > > > version
> > > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Another great point and please let me include
> > that
> > > on
> > > > > the
> > > > > > > > FLIP
> > > > > > > > > > > page.
> > > > > > > > > > > > > ;-)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think we should allow incompatible changes
> for
> > > the
> > > > > > first
> > > > > > > > one
> > > > > > > > > or
> > > > > > > > > > > two
> > > > > > > > > > > > > > > versions, similar to how other major features
> > have
> > > > > > evolved
> > > > > > > > > > > recently,
> > > > > > > > > > > > > such
> > > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Would be great to get broader feedback on this
> > one.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise
> <
> > > > > > > > thw@apache.org
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone
> integration
> > > > > > > > > > > > > > > > > > Maybe we should make this more clear in
> the
> > > > FLIP
> > > > > > but
> > > > > > > we
> > > > > > > > > > > agreed
> > > > > > > > > > > > to
> > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > first version of the operator based on
> the
> > > > native
> > > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > > While this clearly does not cover all
> > > use-cases
> > > > > and
> > > > > > > > > > > > requirements,
> > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > > this would lead to a much smaller initial
> > > > effort
> > > > > > and
> > > > > > > a
> > > > > > > > > nicer
> > > > > > > > > > > > > first
> > > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I'm also leaning towards the native
> > > integration,
> > > > as
> > > > > > > long
> > > > > > > > > as it
> > > > > > > > > > > > > > reduces
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > MVP effort. Ultimately the operator will
> need
> > > to
> > > > > also
> > > > > > > > > support
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > > standalone mode. I would like to gain more
> > > > > confidence
> > > > > > > > that
> > > > > > > > > > > native
> > > > > > > > > > > > > > > > > integration reduces the effort. While it
> cuts
> > > the
> > > > > > > effort
> > > > > > > > to
> > > > > > > > > > > > handle
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > TM
> > > > > > > > > > > > > > > > > pod creation, some mapping code from the CR
> > to
> > > > the
> > > > > > > native
> > > > > > > > > > > > > integration
> > > > > > > > > > > > > > > > > client and config needs to be created. As
> > > > mentioned
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > FLIP,
> > > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > > integration requires the Flink job manager
> to
> > > > have
> > > > > > > access
> > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > > > k8s
> > > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > create pods, which in some scenarios may be
> > > seen
> > > > as
> > > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > > > Is the pod template in CR same with
> > what
> > > > > Flink
> > > > > > > has
> > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > > > > > Then I am afraid not the arbitrary
> > > > field(e.g.
> > > > > > > > > cpu/memory
> > > > > > > > > > > > > > > resources)
> > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yes, pod template would look almost
> > identical.
> > > > > There
> > > > > > > are
> > > > > > > > a
> > > > > > > > > few
> > > > > > > > > > > > > > settings
> > > > > > > > > > > > > > > > > that the operator will control (and that
> may
> > > need
> > > > > to
> > > > > > be
> > > > > > > > > > > > > blacklisted),
> > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > in general we would not want to place
> > > > > restrictions. I
> > > > > > > > > think a
> > > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > > where a pod template is merged from
> multiple
> > > > layers
> > > > > > > would
> > > > > > > > > also
> > > > > > > > > > > be
> > > > > > > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by K Fred <yu...@gmail.com>.
Hi Matyas!

Thanks for your reply!
For scenarios 1 and 3, I couldn't agree more with the podTemplate solution;
I missed this part.
For savepoint-related configuration, I think it is very important that it
can be specified in JobSpec. Savepoints are a very common part of upgrading
a job, and placing them in JobSpec makes them obvious for the user to
configure. Other advanced properties can go into flinkConfiguration,
customized by expert users.
The savepoint configuration I have in mind is as follows:

> fromSavepoint: the savepoint the job restarts from.
>
> autoSavepointSecond: automatically take a savepoint to the `savepointsDir`
> every n seconds.
>
> savepointsDir: the directory where automatically taken savepoints are
> stored.
>
> savepointGeneration: the savepoint generation of a running job, updated in
> the job status (should be defined in JobStatus).
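
To make the proposal above more concrete, here is a minimal Java sketch of how
these fields could be grouped and sanity-checked. It is only an illustration
under my own assumptions: the class name `SavepointSettings`, the `validate`
and `isSavepointDue` helpers, and the validation rules are hypothetical and are
not part of the prototype or the FLIP. `savepointGeneration` is left out
because it belongs to the job status rather than the spec.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical savepoint block of a JobSpec; field names follow the list above. */
final class SavepointSettings {
    final String fromSavepoint;        // savepoint path to restore from, or null
    final Integer autoSavepointSecond; // interval in seconds, or null to disable
    final String savepointsDir;        // target directory for automatic savepoints

    SavepointSettings(String fromSavepoint, Integer autoSavepointSecond, String savepointsDir) {
        this.fromSavepoint = fromSavepoint;
        this.autoSavepointSecond = autoSavepointSecond;
        this.savepointsDir = savepointsDir;
    }

    /** Returns validation errors; an empty list means the settings are usable. */
    List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (autoSavepointSecond != null) {
            if (autoSavepointSecond <= 0) {
                errors.add("autoSavepointSecond must be positive");
            }
            if (savepointsDir == null || savepointsDir.isEmpty()) {
                errors.add("autoSavepointSecond requires savepointsDir");
            }
        }
        if (fromSavepoint != null && fromSavepoint.isEmpty()) {
            errors.add("fromSavepoint must not be empty when set");
        }
        return errors;
    }

    /** True when automatic savepoints are enabled and the interval has elapsed. */
    boolean isSavepointDue(long lastSavepointEpochSecond, long nowEpochSecond) {
        return autoSavepointSecond != null
                && nowEpochSecond - lastSavepointEpochSecond >= autoSavepointSecond;
    }
}
```

A check like `validate()` could run wherever the resource is validated, so that
an interval without a target directory is rejected before the job is deployed.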


Best wishes,
Peng Yuan.

On Tue, Feb 15, 2022 at 4:41 PM Őrhidi Mátyás <ma...@gmail.com>
wrote:

> Hi Peng,
>
> Thanks for your feedback. Regarding 1. and 3. scenarios, the podTemplate
> functionality in the operator could cover both. We also need to be careful
> about introducing proxy parameters in the CRD spec. The savepoint path is
> usually accompanied with a bunch of other configurations for example, so
> users need to use configuration params anyway. What do you think?
>
> Best,
> Matyas
>
> On Tue, Feb 15, 2022 at 8:58 AM K Fred <yu...@gmail.com> wrote:
>
> > Hi Gyula!
> >
> > I have reviewed the prototype design of flink-kubernetes-operator you
> > submitted, and I have the following questions:
> >
> > 1. Can the JobSpec support a Flink JAR that is pulled in by a sidecar
> > container? Something like this:
> >
> > > initContainers:
> > >       - name: downloader
> > >         image: curlimages/curl
> > >         env:
> > >           - name: JAR_URL
> > >             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> > >           - name: DEST_PATH
> > >             value: /cache/flink-app.jar
> > >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
> >
> > 2. Can we add a savepoint path property to the job specification?
> > 3. Can we add an extra port to the JobManagerSpec and TaskManagerSpec to
> > expose some service, such as Prometheus? The property could look like this:
> >
> > > extraPorts:
> > >       - name: prom
> > >         containerPort: 9249
> >
> >
> >
> > Best wishes,
> > Peng Yuan
> >
> > On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gy...@apache.org> wrote:
> >
> > > Hi Flink Devs!
> > >
> > > We would like to present to you the first prototype of the
> > > flink-kubernetes-operator that was built based on the FLIP and the
> > > discussion on this mail thread. We would also like to call out some
> > > design decisions that we have made regarding architecture components
> > > that were not explicitly mentioned in the FLIP document/thread and give
> > > you the opportunity to raise any concerns here.
> > >
> > > You can find the initial prototype here:
> > > https://github.com/apache/flink-kubernetes-operator/pull/1
> > >
> > > We will leave the PR open for 1-2 days before merging to let people
> > > comment on it, but please be mindful that this is an initial prototype
> > > with many rough edges. It is not intended to be a complete
> > > implementation of the FLIP specs as that will take some more work from
> > > all of us :)
> > >
> > >
> > > *Prototype feature set:*
> > > The prototype contains a basic working version of the
> > > flink-kubernetes-operator that supports deployment and lifecycle
> > > management of a stateful native flink application. We have basic
> > > support for stateful and stateless upgrades, UI ingress, pod templates
> > > etc. Error handling at this point is largely missing.
> > >
> > >
> > > *Features / design decisions that were not explicitly discussed in this
> > > thread*
> > >
> > > *Basic Admission control using a Webhook*
> > > Standard resource admission control in Kubernetes to validate and
> > > potentially reject resources is done through Webhooks.
> > >
> > > https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > >
> > > This is a necessary mechanism to give the user an upfront error when an
> > > incorrect resource was submitted. In the Flink operator's case we need
> > > to validate that the FlinkDeployment yaml actually makes sense and does
> > > not contain erroneous config options that would inevitably lead to
> > > deployment/job failures.
> > >
> > > We have implemented a simple webhook that we can use for this type of
> > > validation, as a separate maven module (flink-kubernetes-webhook). The
> > > webhook is an optional component and can be enabled or disabled during
> > > deployment. To avoid pulling in new external dependencies we have used
> > > the Flink Shaded Netty module to build the simple rest endpoint
> > > required. If the community feels that Netty adds unnecessary complexity
> > > to the webhook implementation we are open to alternative backends such
> > > as Spring Boot for instance which would practically eliminate all the
> > > boilerplate.
> > >
> > >
> > > *Helm Chart for deployment*
> > > Helm charts provide an industry standard way of managing kubernetes
> > > deployments. We have created a helm chart prototype that can be used to
> > > deploy the operator together with all required resources. The helm
> > > chart allows easy configuration for things like images, namespaces etc
> > > and flags to control specific parts of the deployment such as RBAC or
> > > the webhook.
> > >
> > > The helm chart provided is intended to be a first version that worked
> > > for us during development but we expect to have a lot of iterations on
> > > it based on the feedback from the community.
> > >
> > > *Acknowledgment*
> > > We would like to thank everyone who has provided support and valuable
> > > feedback on this FLIP.
> > > We would also like to thank Yang Wang & Alexis Sarda-Espinosa
> > > specifically for making their operators open source and available to us
> > > which had a big impact on the FLIP and the prototype.
> > >
> > > We are looking forward to continuing development on the operator
> > > together with the broader community.
> > > All work will be tracked using the ASF Jira from now on.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Mon, Feb 14, 2022 at 9:21 AM K Fred <yu...@gmail.com> wrote:
> > >
> > > > Hi Gyula,
> > > >
> > > > Thanks!
> > > > > It's great to see the project getting started and I can't wait to
> > > > > see the PR and start contributing code.😄😄😄
> > > >
> > > > Best Wishes!
> > > > Peng Yuan
> > > >
> > > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <gy...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Peng Yuan!
> > > > >
> > > > > The repo is already created:
> > > > > https://github.com/apache/flink-kubernetes-operator
> > > > >
> > > > > > We will open the PR with the initial prototype later today, stay
> > > > > > tuned in this thread! :)
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <yu...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > Has the project of flink-kubernetes-operator been created in
> > github?
> > > > > >
> > > > > > Peng Yuan
> > > > > >
> > > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <gy...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > I agree with flink-kubernetes-operator as the repo name :)
> > > > > > > Don't have any better idea
> > > > > > >
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <th...@apache.org>
> > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > > Thanks for the continued feedback and discussion. Looks like we are
> > > > > > > > > ready to start a VOTE, I will initiate it shortly.
> > > > > > > > >
> > > > > > > > > In parallel it would be good to find the repository name.
> > > > > > > > >
> > > > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > > > >
> > > > > > > > > I thought "flink-operator" could be a bit misleading since the term
> > > > > > > > > operator already has a meaning in Flink.
> > > > > > > > >
> > > > > > > > > I also considered "flink-k8s-operator" but that would be almost
> > > > > > > > > identical to existing operator implementations and could lead to
> > > > > > > > > confusion in the future.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Thomas
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <
> > gyula.fora@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Danny,
> > > > > > > > >
> > > > > > > > > So far we have been focusing our dev efforts on the initial
> > > > native
> > > > > > > > > implementation with the team.
> > > > > > > > > If the discussion and vote goes well for this FLIP we are
> > > looking
> > > > > > > forward
> > > > > > > > > to contributing the initial version sometime next week
> > (fingers
> > > > > > > crossed).
> > > > > > > > >
> > > > > > > > > At that point I think we can already start the dev work to
> > > > support
> > > > > > the
> > > > > > > > > standalone mode as well, especially if you can dedicate
> some
> > > > effort
> > > > > > to
> > > > > > > > > pushing that side.
> > > > > > > > > Working together on this sounds like a great idea and we
> > should
> > > > > start
> > > > > > > as
> > > > > > > > > soon as possible! :)
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > > > > dannycranmer@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > I have been discussing this one with my team. We are
> > > interested
> > > > > in
> > > > > > > the
> > > > > > > > > > Standalone mode, and are willing to contribute towards
> the
> > > > > > > > implementation.
> > > > > > > > > > Potentially we can work together to support both modes in
> > > > > parallel?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > > > gyula.fora@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Danny!
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > > >
> > > > > > > > > > > Versioning:
> > > > > > > > > > > Versioning will be independent from Flink and the
> > operator
> > > > will
> > > > > > > > depend
> > > > > > > > > > on a
> > > > > > > > > > > fixed flink version (in every given operator version).
> > > > > > > > > > > This should be the exact same setup as with Stateful
> > > > Functions
> > > > > (
> > > > > > > > > > > https://github.com/apache/flink-statefun). So
> > independent
> > > > > > release
> > > > > > > > cycle
> > > > > > > > > > > but
> > > > > > > > > > > still within the Flink umbrella.
> > > > > > > > > > >
> > > > > > > > > > > Deployment error handling:
> > > > > > > > > > > I think that's a very good point, as general exception
> > > > handling
> > > > > > for
> > > > > > > > the
> > > > > > > > > > > different failure scenarios is a tricky problem. I
> think
> > > the
> > > > > > > > exception
> > > > > > > > > > > classifiers and retry strategies could avoid a lot of
> > > manual
> > > > > > > > intervention
> > > > > > > > > > > from the user. We will definitely need to add something
> > > like
> > > > > > this.
> > > > > > > > Once
> > > > > > > > > > we
> > > > > > > > > > > have the repo created with the initial operator code we
> > > > should
> > > > > > open
> > > > > > > > some
> > > > > > > > > > > tickets for this and put it on the short term roadmap!
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Gyula
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > > > > dannycranmer@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey team,
> > > > > > > > > > > >
> > > > > > > > > > > > Great work on the FLIP, I am looking forward to this
> > > one. I
> > > > > > agree
> > > > > > > > that
> > > > > > > > > > we
> > > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > > >
> > > > > > > > > > > > I have general feedback around how we will handle job
> > > > > > submission
> > > > > > > > > > failure
> > > > > > > > > > > > and retry. As discussed in the Rejected Alternatives
> > > > section,
> > > > > > we
> > > > > > > > can
> > > > > > > > > > use
> > > > > > > > > > > > Java to handle job submission failures from the Flink
> > > > client.
> > > > > > It
> > > > > > > > would
> > > > > > > > > > be
> > > > > > > > > > > > useful to have the ability to configure exception
> > > > classifiers
> > > > > > and
> > > > > > > > retry
> > > > > > > > > > > > strategy as part of operator configuration.
> > > > > > > > > > > >
> > > > > > > > > > > > Given this will be in a separate Github repository I
> am
> > > > > curious
> > > > > > > how
> > > > > > > > > > ther
> > > > > > > > > > > > versioning strategy will work in relation to the
> Flink
> > > > > version?
> > > > > > > Do
> > > > > > > > we
> > > > > > > > > > > have
> > > > > > > > > > > > any other components with a similar setup I can look
> > at?
> > > > Will
> > > > > > the
> > > > > > > > > > > operator
> > > > > > > > > > > > version track Flink or will it use its own versioning
> > > > > strategy
> > > > > > > > with a
> > > > > > > > > > > Flink
> > > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi team,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you for the great feedback, Thomas has
> updated
> > > the
> > > > > FLIP
> > > > > > > > page
> > > > > > > > > > > > > accordingly. If you are comfortable with the
> > currently
> > > > > > existing
> > > > > > > > > > design
> > > > > > > > > > > > and
> > > > > > > > > > > > > depth in the FLIP [1] I suggest moving forward to
> the
> > > > > voting
> > > > > > > > stage -
> > > > > > > > > > > once
> > > > > > > > > > > > > that reaches a positive conclusion it lets us
> create
> > > the
> > > > > > > separate
> > > > > > > > > > code
> > > > > > > > > > > > > repository under the flink project for the
> operator.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I encourage everyone to keep improving the details
> in
> > > the
> > > > > > > > meantime,
> > > > > > > > > > > > however
> > > > > > > > > > > > > I believe given the existing design and the general
> > > > > sentiment
> > > > > > > on
> > > > > > > > this
> > > > > > > > > > > > > thread that the most efficient path from here is
> > > starting
> > > > > the
> > > > > > > > > > > > > implementation so that we can collectively iterate
> > over
> > > > it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > > > > > thw@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the feedback and please see responses
> > > below
> > > > > -->
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and
> > everyone
> > > > for
> > > > > > the
> > > > > > > > > > > > discussion.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > > Deploying a Flink session cluster via kubectl &
> > CR
> > > > and
> > > > > > then
> > > > > > > > > > > > submitting
> > > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > > to the cluster via Flink cli / REST is probably
> > the
> > > > > > > approach
> > > > > > > > that
> > > > > > > > > > > > > > requires
> > > > > > > > > > > > > > > the least effort. However, I'd like to point
> out
> > 2
> > > > > > > > weaknesses.
> > > > > > > > > > > > > > > 1. A lot of users use Flink in
> perjob/application
> > > > > modes.
> > > > > > > For
> > > > > > > > > > these
> > > > > > > > > > > > > users,
> > > > > > > > > > > > > > > having to run the job in two steps (deploy the
> > > > cluster,
> > > > > > and
> > > > > > > > > > submit
> > > > > > > > > > > > the
> > > > > > > > > > > > > > job)
> > > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > > 2. One of our motivations is being able to
> manage
> > > > Flink
> > > > > > > > > > > applications'
> > > > > > > > > > > > > > > lifecycles with kubectl. Submitting jobs from
> cli
> > > > > sounds
> > > > > > > not
> > > > > > > > > > > aligned
> > > > > > > > > > > > > with
> > > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > > I think it's probably worth it to support
> > > submitting
> > > > > jobs
> > > > > > > via
> > > > > > > > > > > > kubectl &
> > > > > > > > > > > > > > CR
> > > > > > > > > > > > > > > in the first version, both together with
> > deploying
> > > > the
> > > > > > > > cluster
> > > > > > > > > > like
> > > > > > > > > > > > in
> > > > > > > > > > > > > > > perjob/application mode and after deploying the
> > > > cluster
> > > > > > > like
> > > > > > > > in
> > > > > > > > > > > > session
> > > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The intention is to support application
> management
> > > > > through
> > > > > > > > operator
> > > > > > > > > > > and
> > > > > > > > > > > > > CR,
> > > > > > > > > > > > > > which means there won't be any 2 step submission
> > > > process,
> > > > > > > > which as
> > > > > > > > > > > you
> > > > > > > > > > > > > > allude to would defeat the purpose of this
> project.
> > > The
> > > > > CR
> > > > > > > > example
> > > > > > > > > > > > shows
> > > > > > > > > > > > > > the application part. Please note that the bare
> > > cluster
> > > > > > > > support is
> > > > > > > > > > an
> > > > > > > > > > > > > > *additional* feature for scenarios that require
> > > > external
> > > > > > job
> > > > > > > > > > > > management.
> > > > > > > > > > > > > Is
> > > > > > > > > > > > > > there anything on the FLIP page that creates a
> > > > different
> > > > > > > > > > impression?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > > Which Flink versions does the operator plan to
> > > > support?
> > > > > > > > > > > > > > > > > 1. Native K8s deployment was first introduced
> > in
> > > > > Flink
> > > > > > > 1.10
> > > > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > > > > 3. The Pod template support was introduced in
> > Flink
> > > > > 1.13
> > > > > > > > > > > > > > > > > 4. There were some changes to the Flink docker
> > image
> > > > > > > > entrypoint
> > > > > > > > > > > script
> > > > > > > > > > > > > in,
> > > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Great, thanks for providing this. It is important
> > for
> > > > the
> > > > > > > > > > > compatibility
> > > > > > > > > > > > > > going forward also. We are targeting Flink 1.14.x
> > > > > upwards.
> > > > > > > > Before
> > > > > > > > > > the
> > > > > > > > > > > > > > operator is ready there will be another Flink
> > > release.
> > > > > > Let's
> > > > > > > > see if
> > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > > What kind of API compatibility can we commit
> to?
> > > It's
> > > > > > > > probably
> > > > > > > > > > fine
> > > > > > > > > > > > to
> > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > alpha / beta version APIs that allow
> incompatible
> > > > > future
> > > > > > > > changes
> > > > > > > > > > > for
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > first version. But eventually we would need to
> > > > > guarantee
> > > > > > > > > > backwards
> > > > > > > > > > > > > > > compatibility, so that an early version CR can
> > work
> > > > > with
> > > > > > a
> > > > > > > > new
> > > > > > > > > > > > version
> > > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Another great point and please let me include
> that
> > on
> > > > the
> > > > > > > FLIP
> > > > > > > > > > page.
> > > > > > > > > > > > ;-)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think we should allow incompatible changes for
> > the
> > > > > first
> > > > > > > one
> > > > > > > > or
> > > > > > > > > > two
> > > > > > > > > > > > > > versions, similar to how other major features
> have
> > > > > evolved
> > > > > > > > > > recently,
> > > > > > > > > > > > such
> > > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Would be great to get broader feedback on this
> one.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <
> > > > > > > thw@apache.org
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > > > > > Maybe we should make this more clear in the
> > > FLIP
> > > > > but
> > > > > > we
> > > > > > > > > > agreed
> > > > > > > > > > > to
> > > > > > > > > > > > > do
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > first version of the operator based on the
> > > native
> > > > > > > > > > integration.
> > > > > > > > > > > > > > > > > While this clearly does not cover all
> > use-cases
> > > > and
> > > > > > > > > > > requirements,
> > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > > > this would lead to a much smaller initial
> > > effort
> > > > > and
> > > > > > a
> > > > > > > > nicer
> > > > > > > > > > > > first
> > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'm also leaning towards the native
> > integration,
> > > as
> > > > > > long
> > > > > > > > as it
> > > > > > > > > > > > > reduces
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > MVP effort. Ultimately the operator will need
> > to
> > > > also
> > > > > > > > support
> > > > > > > > > > the
> > > > > > > > > > > > > > > > standalone mode. I would like to gain more
> > > > confidence
> > > > > > > that
> > > > > > > > > > native
> > > > > > > > > > > > > > > > integration reduces the effort. While it cuts
> > the
> > > > > > effort
> > > > > > > to
> > > > > > > > > > > handle
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > TM
> > > > > > > > > > > > > > > > pod creation, some mapping code from the CR
> to
> > > the
> > > > > > native
> > > > > > > > > > > > integration
> > > > > > > > > > > > > > > > client and config needs to be created. As
> > > mentioned
> > > > > in
> > > > > > > the
> > > > > > > > > > FLIP,
> > > > > > > > > > > > > native
> > > > > > > > > > > > > > > > integration requires the Flink job manager to
> > > have
> > > > > > access
> > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > > > k8s
> > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > create pods, which in some scenarios may be
> > seen
> > > as
> > > > > > > > > > unfavorable.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > > Is the pod template in CR same with
> what
> > > > Flink
> > > > > > has
> > > > > > > > > > already
> > > > > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > > > > Then I am afraid not the arbitrary
> > > field(e.g.
> > > > > > > > cpu/memory
> > > > > > > > > > > > > > resources)
> > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, pod template would look almost
> identical.
> > > > There
> > > > > > are
> > > > > > > a
> > > > > > > > few
> > > > > > > > > > > > > settings
> > > > > > > > > > > > > > > > that the operator will control (and that may
> > need
> > > > to
> > > > > be
> > > > > > > > > > > > blacklisted),
> > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > in general we would not want to place
> > > > restrictions. I
> > > > > > > > think a
> > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > where a pod template is merged from multiple
> > > layers
> > > > > > would
> > > > > > > > also
> > > > > > > > > > be
> > > > > > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Őrhidi Mátyás <ma...@gmail.com>.
Hi Peng,

Thanks for your feedback. Regarding scenarios 1 and 3, the podTemplate
functionality in the operator could cover both. We also need to be careful
about introducing proxy parameters in the CRD spec. For example, the savepoint
path is usually accompanied by a number of other configuration options, so
users need to set configuration params anyway. What do you think?
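
A rough sketch of how scenarios 1 and 3 could be expressed through the
podTemplate. The apiVersion, the FlinkDeployment field names, the resource
name and the savepoint location are assumptions for illustration only and are
not settled API; the init container and the port are taken from your examples:

```yaml
# Hypothetical FlinkDeployment; apiVersion and top-level field names are assumptions.
apiVersion: flink.apache.org/v1alpha1
kind: FlinkDeployment
metadata:
  name: wordcount-example            # hypothetical name
spec:
  flinkConfiguration:
    # Savepoint path passed as a regular Flink option rather than a CRD field.
    # The location is a placeholder.
    execution.savepoint.path: "s3://<bucket>/<savepoint-dir>"
  podTemplate:
    apiVersion: v1
    kind: Pod
    spec:
      initContainers:
        # Scenario 1: download the job JAR into a shared volume.
        - name: downloader
          image: curlimages/curl
          env:
            - name: JAR_URL
              value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
            - name: DEST_PATH
              value: /cache/flink-app.jar
          command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
          volumeMounts:
            - name: jar-cache
              mountPath: /cache
      containers:
        # Flink's pod template support expects the main container under this name.
        - name: flink-main-container
          ports:
            # Scenario 3: extra port, e.g. for Prometheus scraping.
            - name: prom
              containerPort: 9249
          volumeMounts:
            - name: jar-cache
              mountPath: /cache
      volumes:
        - name: jar-cache
          emptyDir: {}
```

With something along these lines, neither case would need a dedicated field
in the JobSpec, JobManagerSpec or TaskManagerSpec.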

Best,
Matyas

On Tue, Feb 15, 2022 at 8:58 AM K Fred <yu...@gmail.com> wrote:

> Hi Gyula!
>
> I have reviewed the prototype design of flink-kubernetes-operator you
> submitted, and I have the following questions:
>
> 1. Can the JobSpec support a Flink job JAR that is pulled by a sidecar
> container? Something like this:
>
> > initContainers:
> >       - name: downloader
> >         image: curlimages/curl
> >         env:
> >           - name: JAR_URL
> >             value:
> >
> https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
> >           - name: DEST_PATH
> >             value: /cache/flink-app.jar
> >         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']
>
> 2. Can we add a savepoint path property to the job specification?
> 3. Can we add an extra port to the JobManagerSpec and TaskManagerSpec to
> expose a service such as Prometheus? The property could look like this:
>
> > extraPorts:
> >       - name: prom
> >         containerPort: 9249
>
>
>
> Best wishes,
> Peng Yuan
>
> On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gy...@apache.org> wrote:
>
> > Hi Flink Devs!
> >
> > We would like to present to you the first prototype of the
> > flink-kubernetes-operator that was built based on the FLIP and the
> > discussion on this mail thread. We would also like to call out some
> design
> > decisions that we have made regarding architecture components that were
> not
> > explicitly mentioned in the FLIP document/thread and give you the
> > opportunity to raise any concerns here.
> >
> > You can find the initial prototype here:
> > https://github.com/apache/flink-kubernetes-operator/pull/1
> >
> > We will leave the PR open for 1-2 days before merging to let people
> comment
> > on it, but please be mindful that this is an initial prototype with many
> > rough edges. It is not intended to be a complete implementation of the
> FLIP
> > specs as that will take some more work from all of us :)
> >
> >
> > *Prototype feature set:*
> > The prototype contains a basic working version of
> > the flink-kubernetes-operator that supports deployment and lifecycle
> > management of a stateful native flink application. We have basic support
> > for stateful and stateless upgrades, UI ingress, pod templates etc. Error
> > handling at this point is largely missing.
> >
> >
> > *Features / design decisions that were not explicitly discussed in this
> > thread*
> >
> > *Basic Admission control using a Webhook*
> > Standard resource admission control in Kubernetes to validate and
> > potentially reject resources is done through Webhooks.
> >
> >
> https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> > This is a necessary mechanism to give the user an upfront error when an
> > incorrect resource was submitted. In the Flink operator's case we need to
> > validate that the FlinkDeployment yaml actually makes sense and does not
> > contain erroneous config options that would inevitably lead to
> > deployment/job failures.
> >
> > We have implemented a simple webhook that we can use for this type of
> > validation, as a separate maven module (flink-kubernetes-webhook). The
> > webhook is an optional component and can be enabled or disabled during
> > deployment. To avoid pulling in new external dependencies we have used
> the
> > Flink Shaded Netty module to build the simple rest endpoint required. If
> > the community feels that Netty adds unnecessary complexity to the webhook
> > implementation we are open to alternative backends, such as Spring Boot,
> > which would practically eliminate all the boilerplate.
> >
> >
> > *Helm Chart for deployment*
> > Helm charts provide an industry standard way of managing Kubernetes
> > deployments. We have created a helm chart prototype
> > that can be used to deploy the operator together with all required
> > resources. The helm chart allows easy configuration for things like
> images,
> > namespaces etc and flags to control specific parts of the deployment such
> > as RBAC or the webhook.
> >
> > The helm chart provided is intended to be a first version that worked for
> > us during development but we expect to have a lot of iterations on it
> based
> > on the feedback from the community.
> >
> > *Acknowledgment*
> > We would like to thank everyone who has provided support and valuable
> > feedback on this FLIP.
> > We would also like to thank Yang Wang & Alexis Sarda-Espinosa
> specifically
> > for making their operators open source and available to us which had a
> big
> > impact on the FLIP and the prototype.
> >
> > We are looking forward to continuing development on the operator together
> > with the broader community.
> > All work will be tracked using the ASF Jira from now on.
> >
> > Cheers,
> > Gyula
> >
> > On Mon, Feb 14, 2022 at 9:21 AM K Fred <yu...@gmail.com> wrote:
> >
> > > Hi Gyula,
> > >
> > > Thanks!
> > > It's great to see the project getting started and I can't wait to see
> the
> > > PR and start contributing code.😄😄😄
> > >
> > > Best Wishes!
> > > Peng Yuan
> > >
> > > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <gy...@gmail.com>
> wrote:
> > >
> > > > Hi Peng Yuan!
> > > >
> > > > The repo is already created:
> > > > https://github.com/apache/flink-kubernetes-operator
> > > >
> > > > We will open the PR with the initial prototype later today, stay
> tuned
> > in
> > > > this thread! :)
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <yu...@gmail.com>
> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Has the project of flink-kubernetes-operator been created in
> github?
> > > > >
> > > > > Peng Yuan
> > > > >
> > > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <gy...@gmail.com>
> > > wrote:
> > > > >
> > > > > > I agree with flink-kubernetes-operator as the repo name :)
> > > > > > Don't have any better idea
> > > > > >
> > > > > > Gyula
> > > > > >
> > > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <th...@apache.org>
> > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Thanks for the continued feedback and discussion. Looks like we
> > are
> > > > > > > ready to start a VOTE, I will initiate it shortly.
> > > > > > >
> > > > > > > In parallel it would be good to find the repository name.
> > > > > > >
> > > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > > >
> > > > > > > I thought "flink-operator" could be a bit misleading since the
> > term
> > > > > > > operator already has a meaning in Flink.
> > > > > > >
> > > > > > > I also considered "flink-k8s-operator" but that would be almost
> > > > > > > identical to existing operator implementations and could lead
> to
> > > > > > > confusion in the future.
> > > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Thomas
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <
> gyula.fora@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Danny,
> > > > > > > >
> > > > > > > > So far we have been focusing our dev efforts on the initial
> > > native
> > > > > > > > implementation with the team.
> > > > > > > > If the discussion and vote goes well for this FLIP we are
> > looking
> > > > > > forward
> > > > > > > > to contributing the initial version sometime next week
> (fingers
> > > > > > crossed).
> > > > > > > >
> > > > > > > > At that point I think we can already start the dev work to
> > > support
> > > > > the
> > > > > > > > standalone mode as well, especially if you can dedicate some
> > > effort
> > > > > to
> > > > > > > > pushing that side.
> > > > > > > > Working together on this sounds like a great idea and we
> should
> > > > start
> > > > > > as
> > > > > > > > soon as possible! :)
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > > > dannycranmer@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I have been discussing this one with my team. We are
> > interested
> > > > in
> > > > > > the
> > > > > > > > > Standalone mode, and are willing to contribute towards the
> > > > > > > implementation.
> > > > > > > > > Potentially we can work together to support both modes in
> > > > parallel?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> > > gyula.fora@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Danny!
> > > > > > > > > >
> > > > > > > > > > Thanks for the feedback :)
> > > > > > > > > >
> > > > > > > > > > Versioning:
> > > > > > > > > > Versioning will be independent from Flink and the
> operator
> > > will
> > > > > > > depend
> > > > > > > > > on a
> > > > > > > > > > fixed flink version (in every given operator version).
> > > > > > > > > > This should be the exact same setup as with Stateful
> > > Functions
> > > > (
> > > > > > > > > > https://github.com/apache/flink-statefun). So
> independent
> > > > > release
> > > > > > > cycle
> > > > > > > > > > but
> > > > > > > > > > still within the Flink umbrella.
> > > > > > > > > >
> > > > > > > > > > Deployment error handling:
> > > > > > > > > > I think that's a very good point, as general exception
> > > handling
> > > > > for
> > > > > > > the
> > > > > > > > > > different failure scenarios is a tricky problem. I think
> > the
> > > > > > > exception
> > > > > > > > > > classifiers and retry strategies could avoid a lot of
> > manual
> > > > > > > intervention
> > > > > > > > > > from the user. We will definitely need to add something
> > like
> > > > > this.
> > > > > > > Once
> > > > > > > > > we
> > > > > > > > > > have the repo created with the initial operator code we
> > > should
> > > > > open
> > > > > > > some
> > > > > > > > > > tickets for this and put it on the short term roadmap!
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Gyula
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > > > dannycranmer@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey team,
> > > > > > > > > > >
> > > > > > > > > > > Great work on the FLIP, I am looking forward to this
> > one. I
> > > > > agree
> > > > > > > that
> > > > > > > > > we
> > > > > > > > > > > can move forward to the voting stage.
> > > > > > > > > > >
> > > > > > > > > > > I have general feedback around how we will handle job
> > > > > submission
> > > > > > > > > failure
> > > > > > > > > > > and retry. As discussed in the Rejected Alternatives
> > > section,
> > > > > we
> > > > > > > can
> > > > > > > > > use
> > > > > > > > > > > Java to handle job submission failures from the Flink
> > > client.
> > > > > It
> > > > > > > would
> > > > > > > > > be
> > > > > > > > > > > useful to have the ability to configure exception
> > > classifiers
> > > > > and
> > > > > > > retry
> > > > > > > > > > > strategy as part of operator configuration.
> > > > > > > > > > >
> > > > > > > > > > > Given this will be in a separate Github repository I am
> > > > curious
> > > > > > how
> > > > > > > > > > > the
> > > > > > > > > > > versioning strategy will work in relation to the Flink
> > > > version?
> > > > > > Do
> > > > > > > we
> > > > > > > > > > have
> > > > > > > > > > > any other components with a similar setup I can look
> at?
> > > Will
> > > > > the
> > > > > > > > > > operator
> > > > > > > > > > > version track Flink or will it use its own versioning
> > > > strategy
> > > > > > > with a
> > > > > > > > > > Flink
> > > > > > > > > > > version support matrix, or similar?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > > > balassi.marton@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi team,
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for the great feedback, Thomas has updated
> > the
> > > > FLIP
> > > > > > > page
> > > > > > > > > > > > accordingly. If you are comfortable with the
> currently
> > > > > existing
> > > > > > > > > design
> > > > > > > > > > > and
> > > > > > > > > > > > depth in the FLIP [1] I suggest moving forward to the
> > > > voting
> > > > > > > stage -
> > > > > > > > > > once
> > > > > > > > > > > > that reaches a positive conclusion it lets us create
> > the
> > > > > > separate
> > > > > > > > > code
> > > > > > > > > > > > repository under the flink project for the operator.
> > > > > > > > > > > >
> > > > > > > > > > > > I encourage everyone to keep improving the details in
> > the
> > > > > > > meantime,
> > > > > > > > > > > however
> > > > > > > > > > > > I believe given the existing design and the general
> > > > sentiment
> > > > > > on
> > > > > > > this
> > > > > > > > > > > > thread that the most efficient path from here is
> > starting
> > > > the
> > > > > > > > > > > > implementation so that we can collectively iterate
> over
> > > it.
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > > > > thw@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback and please see responses
> > below
> > > > -->
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > > > > > tonysong820@gmail.com
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and
> everyone
> > > for
> > > > > the
> > > > > > > > > > > discussion.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > > Deploying a Flink session cluster via kubectl &
> CR
> > > and
> > > > > then
> > > > > > > > > > > submitting
> > > > > > > > > > > > > jobs
> > > > > > > > > > > > > > to the cluster via Flink cli / REST is probably
> the
> > > > > > approach
> > > > > > > that
> > > > > > > > > > > > > requires
> > > > > > > > > > > > > > the least effort. However, I'd like to point out
> 2
> > > > > > > weaknesses.
> > > > > > > > > > > > > > 1. A lot of users use Flink in perjob/application
> > > > modes.
> > > > > > For
> > > > > > > > > these
> > > > > > > > > > > > users,
> > > > > > > > > > > > > > having to run the job in two steps (deploy the
> > > cluster,
> > > > > and
> > > > > > > > > submit
> > > > > > > > > > > the
> > > > > > > > > > > > > job)
> > > > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > > > 2. One of our motivations is being able to manage
> > > Flink
> > > > > > > > > > applications'
> > > > > > > > > > > > > > lifecycles with kubectl. Submitting jobs from cli
> > > > sounds
> > > > > > not
> > > > > > > > > > aligned
> > > > > > > > > > > > with
> > > > > > > > > > > > > > this motivation.
> > > > > > > > > > > > > > I think it's probably worth it to support
> > submitting
> > > > jobs
> > > > > > via
> > > > > > > > > > > kubectl &
> > > > > > > > > > > > > CR
> > > > > > > > > > > > > > in the first version, both together with
> deploying
> > > the
> > > > > > > cluster
> > > > > > > > > like
> > > > > > > > > > > in
> > > > > > > > > > > > > > perjob/application mode and after deploying the
> > > cluster
> > > > > > like
> > > > > > > in
> > > > > > > > > > > session
> > > > > > > > > > > > > > mode.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > The intention is to support application management
> > > > through
> > > > > > > operator
> > > > > > > > > > and
> > > > > > > > > > > > CR,
> > > > > > > > > > > > > which means there won't be any 2 step submission
> > > process,
> > > > > > > which as
> > > > > > > > > > you
> > > > > > > > > > > > > allude to would defeat the purpose of this project.
> > The
> > > > CR
> > > > > > > example
> > > > > > > > > > > shows
> > > > > > > > > > > > > the application part. Please note that the bare
> > cluster
> > > > > > > support is
> > > > > > > > > an
> > > > > > > > > > > > > *additional* feature for scenarios that require
> > > external
> > > > > job
> > > > > > > > > > > management.
> > > > > > > > > > > > Is
> > > > > > > > > > > > > there anything on the FLIP page that creates a
> > > different
> > > > > > > > > impression?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > > Which Flink versions does the operator plan to
> > > support?
> > > > > > > > > > > > > > > > 1. Native K8s deployment was first introduced
> in
> > > > Flink
> > > > > > 1.10
> > > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > > > 3. The Pod template support was introduced in
> Flink
> > > > 1.13
> > > > > > > > > > > > > > > > 4. There were some changes to the Flink docker
> image
> > > > > > > entrypoint
> > > > > > > > > > script
> > > > > > > > > > > > in,
> > > > > > > > > > > > > > IIRC, Flink 1.13
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Great, thanks for providing this. It is important
> for
> > > the
> > > > > > > > > > compatibility
> > > > > > > > > > > > > going forward also. We are targeting Flink 1.14.x
> > > > upwards.
> > > > > > > Before
> > > > > > > > > the
> > > > > > > > > > > > > operator is ready there will be another Flink
> > release.
> > > > > Let's
> > > > > > > see if
> > > > > > > > > > > > anyone
> > > > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > > > > What kind of API compatibility can we commit to?
> > It's
> > > > > > > probably
> > > > > > > > > fine
> > > > > > > > > > > to
> > > > > > > > > > > > > have
> > > > > > > > > > > > > > alpha / beta version APIs that allow incompatible
> > > > future
> > > > > > > changes
> > > > > > > > > > for
> > > > > > > > > > > > the
> > > > > > > > > > > > > > first version. But eventually we would need to
> > > > guarantee
> > > > > > > > > backwards
> > > > > > > > > > > > > > compatibility, so that an early version CR can
> work
> > > > with
> > > > > a
> > > > > > > new
> > > > > > > > > > > version
> > > > > > > > > > > > > > operator.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Another great point and please let me include that
> on
> > > the
> > > > > > FLIP
> > > > > > > > > page.
> > > > > > > > > > > ;-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think we should allow incompatible changes for
> the
> > > > first
> > > > > > one
> > > > > > > or
> > > > > > > > > two
> > > > > > > > > > > > > versions, similar to how other major features have
> > > > evolved
> > > > > > > > > recently,
> > > > > > > > > > > such
> > > > > > > > > > > > > as FLIP-27.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Thomas
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise
> > > > > > > > > > > > > > <thw@apache.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > > > > Maybe we should make this more clear in the FLIP
> > > > > > > > > > > > > > > > but we agreed to do the first version of the
> > > > > > > > > > > > > > > > operator based on the native integration.
> > > > > > > > > > > > > > > > While this clearly does not cover all use-cases
> > > > > > > > > > > > > > > > and requirements, it seems this would lead to a
> > > > > > > > > > > > > > > > much smaller initial effort and a nicer first
> > > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'm also leaning towards the native integration,
> > > > > > > > > > > > > > > as long as it reduces the MVP effort. Ultimately
> > > > > > > > > > > > > > > the operator will need to also support the
> > > > > > > > > > > > > > > standalone mode. I would like to gain more
> > > > > > > > > > > > > > > confidence that native integration reduces the
> > > > > > > > > > > > > > > effort. While it cuts the effort to handle the TM
> > > > > > > > > > > > > > > pod creation, some mapping code from the CR to the
> > > > > > > > > > > > > > > native integration client and config needs to be
> > > > > > > > > > > > > > > created. As mentioned in the FLIP, native
> > > > > > > > > > > > > > > integration requires the Flink job manager to have
> > > > > > > > > > > > > > > access to the k8s API to create pods, which in
> > > > > > > > > > > > > > > some scenarios may be seen as unfavorable.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > > Is the pod template in CR same with what Flink
> > > > > > > > > > > > > > > > > > has already supported[4]?
> > > > > > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > > > > > > > > > > > > > > > > cpu/memory resources) could take effect.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, pod template would look almost identical.
> > > > > > > > > > > > > > > There are a few settings that the operator will
> > > > > > > > > > > > > > > control (and that may need to be blacklisted), but
> > > > > > > > > > > > > > > in general we would not want to place
> > > > > > > > > > > > > > > restrictions. I think a mechanism where a pod
> > > > > > > > > > > > > > > template is merged from multiple layers would also
> > > > > > > > > > > > > > > be interesting to make this more flexible.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by K Fred <yu...@gmail.com>.
Hi Gyula!

I have reviewed the prototype design of flink-kubernetes-operator you
submitted, and I have the following questions:

1. Can the JobSpec support pulling the Flink JAR package through a
sidecar? Something like this:

> initContainers:
>       - name: downloader
>         image: curlimages/curl
>         env:
>           - name: JAR_URL
>             value: https://repo1.maven.org/maven2/org/apache/flink/flink-examples-streaming_2.12/1.14.3/flink-examples-streaming_2.12-1.14.3-WordCount.jar
>           - name: DEST_PATH
>             value: /cache/flink-app.jar
>         command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']

2. Can we add a savepoint path property to the job specification?
3. Can we add extra ports to the JobManagerSpec and TaskManagerSpec to
expose services such as Prometheus? The property could look like this:

> extraPorts:
>       - name: prom
>         containerPort: 9249



Best wishes,
Peng Yuan

On Tue, Feb 15, 2022 at 12:23 AM Gyula Fóra <gy...@apache.org> wrote:

> Hi Flink Devs!
>
> We would like to present to you the first prototype of the
> flink-kubernetes-operator that was built based on the FLIP and the
> discussion on this mail thread. We would also like to call out some design
> decisions that we have made regarding architecture components that were not
> explicitly mentioned in the FLIP document/thread and give you the
> opportunity to raise any concerns here.
>
> You can find the initial prototype here:
> https://github.com/apache/flink-kubernetes-operator/pull/1
>
> We will leave the PR open for 1-2 days before merging to let people comment
> on it, but please be mindful that this is an initial prototype with many
> rough edges. It is not intended to be a complete implementation of the FLIP
> specs as that will take some more work from all of us :)
>
>
> *Prototype feature set:*
> The prototype contains a basic working version of the
> flink-kubernetes-operator that supports deployment and lifecycle
> management of a stateful native Flink application. We have basic support
> for stateful and stateless upgrades, UI ingress, pod templates etc. Error
> handling at this point is largely missing.
>
>
> *Features / design decisions that were not explicitly discussed in this
> thread*
>
> *Basic Admission control using a Webhook*
> Standard resource admission control in Kubernetes to validate and
> potentially reject resources is done through Webhooks.
>
> https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
> This is a necessary mechanism to give the user an upfront error when an
> incorrect resource was submitted. In the Flink operator's case we need to
> validate that the FlinkDeployment yaml actually makes sense and does not
> contain erroneous config options that would inevitably lead to
> deployment/job failures.
>
> We have implemented a simple webhook that we can use for this type of
> validation, as a separate Maven module (flink-kubernetes-webhook). The
> webhook is an optional component and can be enabled or disabled during
> deployment. To avoid pulling in new external dependencies we have used the
> Flink Shaded Netty module to build the simple REST endpoint required. If
> the community feels that Netty adds unnecessary complexity to the webhook
> implementation, we are open to alternative backends, such as Spring Boot,
> which would practically eliminate all the boilerplate.
>
>
> *Helm Chart for deployment*
> Helm charts provide an industry standard way of managing Kubernetes
> deployments. We have created a Helm chart prototype that can be used to
> deploy the operator together with all required resources. The Helm chart
> allows easy configuration for things like images, namespaces etc. and
> flags to control specific parts of the deployment such as RBAC or the
> webhook.
>
> The helm chart provided is intended to be a first version that worked for
> us during development but we expect to have a lot of iterations on it based
> on the feedback from the community.
>
> *Acknowledgment*
> We would like to thank everyone who has provided support and valuable
> feedback on this FLIP.
> We would also like to thank Yang Wang & Alexis Sarda-Espinosa specifically
> for making their operators open source and available to us which had a big
> impact on the FLIP and the prototype.
>
> We are looking forward to continuing development on the operator together
> with the broader community.
> All work will be tracked using the ASF Jira from now on.
>
> Cheers,
> Gyula
>
> On Mon, Feb 14, 2022 at 9:21 AM K Fred <yu...@gmail.com> wrote:
>
> > Hi Gyula,
> >
> > Thanks!
> > It's great to see the project getting started and I can't wait to see the
> > PR and start contributing code.😄😄😄
> >
> > Best Wishes!
> > Peng Yuan
> >
> > On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > Hi Peng Yuan!
> > >
> > > The repo is already created:
> > > https://github.com/apache/flink-kubernetes-operator
> > >
> > > We will open the PR with the initial prototype later today, stay tuned
> > > in this thread! :)
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Mon, Feb 14, 2022 at 9:09 AM K Fred <yu...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Has the project of flink-kubernetes-operator been created in github?
> > > >
> > > > Peng Yuan
> > > >
> > > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <gy...@gmail.com> wrote:
> > > >
> > > > > I agree with flink-kubernetes-operator as the repo name :)
> > > > > Don't have any better idea
> > > > >
> > > > > Gyula
> > > > >
> > > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <th...@apache.org> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Thanks for the continued feedback and discussion. Looks like we are
> > > > > > ready to start a VOTE, I will initiate it shortly.
> > > > > >
> > > > > > In parallel it would be good to find the repository name.
> > > > > >
> > > > > > My suggestion would be: flink-kubernetes-operator
> > > > > >
> > > > > > I thought "flink-operator" could be a bit misleading since the term
> > > > > > operator already has a meaning in Flink.
> > > > > >
> > > > > > I also considered "flink-k8s-operator" but that would be almost
> > > > > > identical to existing operator implementations and could lead to
> > > > > > confusion in the future.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > Thanks,
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Danny,
> > > > > > >
> > > > > > > So far we have been focusing our dev efforts on the initial native
> > > > > > > implementation with the team.
> > > > > > > If the discussion and vote goes well for this FLIP we are looking
> > > > > > > forward to contributing the initial version sometime next week
> > > > > > > (fingers crossed).
> > > > > > >
> > > > > > > At that point I think we can already start the dev work to support
> > > > > > > the standalone mode as well, especially if you can dedicate some
> > > > > > > effort to pushing that side.
> > > > > > > Working together on this sounds like a great idea and we should
> > > > > > > start as soon as possible! :)
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <dannycranmer@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I have been discussing this one with my team. We are interested in
> > > > > > > > the Standalone mode, and are willing to contribute towards the
> > > > > > > > implementation.
> > > > > > > > Potentially we can work together to support both modes in parallel?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gyula.fora@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Danny!
> > > > > > > > >
> > > > > > > > > Thanks for the feedback :)
> > > > > > > > >
> > > > > > > > > Versioning:
> > > > > > > > > Versioning will be independent from Flink and the operator will
> > > > > > > > > depend on a fixed Flink version (in every given operator version).
> > > > > > > > > This should be the exact same setup as with Stateful Functions
> > > > > > > > > (https://github.com/apache/flink-statefun). So independent release
> > > > > > > > > cycle but still within the Flink umbrella.
> > > > > > > > >
> > > > > > > > > Deployment error handling:
> > > > > > > > > I think that's a very good point, as general exception handling
> > > > > > > > > for the different failure scenarios is a tricky problem. I think
> > > > > > > > > the exception classifiers and retry strategies could avoid a lot
> > > > > > > > > of manual intervention from the user. We will definitely need to
> > > > > > > > > add something like this. Once we have the repo created with the
> > > > > > > > > initial operator code we should open some tickets for this and
> > > > > > > > > put it on the short term roadmap!
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gyula
> > > > > > > > >
> > > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <dannycranmer@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey team,
> > > > > > > > > >
> > > > > > > > > > Great work on the FLIP, I am looking forward to this one. I
> > > > > > > > > > agree that we can move forward to the voting stage.
> > > > > > > > > >
> > > > > > > > > > I have general feedback around how we will handle job
> > > > > > > > > > submission failure and retry. As discussed in the Rejected
> > > > > > > > > > Alternatives section, we can use Java to handle job submission
> > > > > > > > > > failures from the Flink client. It would be useful to have the
> > > > > > > > > > ability to configure exception classifiers and retry strategy
> > > > > > > > > > as part of operator configuration.
> > > > > > > > > >
> > > > > > > > > > Given this will be in a separate GitHub repository I am curious
> > > > > > > > > > how the versioning strategy will work in relation to the Flink
> > > > > > > > > > version? Do we have any other components with a similar setup I
> > > > > > > > > > can look at? Will the operator version track Flink or will it
> > > > > > > > > > use its own versioning strategy with a Flink version support
> > > > > > > > > > matrix, or similar?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi
> > > > > > > > > > <balassi.marton@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi team,
> > > > > > > > > > >
> > > > > > > > > > > Thank you for the great feedback, Thomas has updated the FLIP
> > > > > > > > > > > page accordingly. If you are comfortable with the currently
> > > > > > > > > > > existing design and depth in the FLIP [1] I suggest moving
> > > > > > > > > > > forward to the voting stage - once that reaches a positive
> > > > > > > > > > > conclusion it lets us create the separate code repository
> > > > > > > > > > > under the flink project for the operator.
> > > > > > > > > > >
> > > > > > > > > > > I encourage everyone to keep improving the details in the
> > > > > > > > > > > meantime, however I believe given the existing design and the
> > > > > > > > > > > general sentiment on this thread that the most efficient path
> > > > > > > > > > > from here is starting the implementation so that we can
> > > > > > > > > > > collectively iterate over it.
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <thw@apache.org>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback and please see responses below -->
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song
> > > > > > > > > > > > <tonysong820@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the
> > > > > > > > > > > > > discussion.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > > Deploying a Flink session cluster via kubectl & CR and
> > > > > > > > > > > > > then submitting jobs to the cluster via Flink cli / REST
> > > > > > > > > > > > > is probably the approach that requires the least effort.
> > > > > > > > > > > > > However, I'd like to point out 2 weaknesses.
> > > > > > > > > > > > > 1. A lot of users use Flink in perjob/application modes.
> > > > > > > > > > > > > For these users, having to run the job in two steps
> > > > > > > > > > > > > (deploy the cluster, and submit the job) is not that
> > > > > > > > > > > > > convenient.
> > > > > > > > > > > > > 2. One of our motivations is being able to manage Flink
> > > > > > > > > > > > > applications' lifecycles with kubectl. Submitting jobs
> > > > > > > > > > > > > from cli sounds not aligned with this motivation.
> > > > > > > > > > > > > I think it's probably worth it to support submitting jobs
> > > > > > > > > > > > > via kubectl & CR in the first version, both together with
> > > > > > > > > > > > > deploying the cluster like in perjob/application mode and
> > > > > > > > > > > > > after deploying the cluster like in session mode.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > The intention is to support application management through
> > > > > > > > > > > > operator and CR, which means there won't be any 2 step
> > > > > > > > > > > > submission process, which as you allude to would defeat the
> > > > > > > > > > > > purpose of this project. The CR example shows the
> > > > > > > > > > > > application part. Please note that the bare cluster support
> > > > > > > > > > > > is an *additional* feature for scenarios that require
> > > > > > > > > > > > external job management. Is there anything on the FLIP page
> > > > > > > > > > > > that creates a different impression?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > > > > > > > > > 4. There were some changes to the Flink docker image
> > > > > > > > > > > > > entrypoint script in, IIRC, Flink 1.13
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Great, thanks for providing this. It is important for the
> > > > > > > > > > > > compatibility going forward also. We are targeting Flink
> > > > > > > > > > > > 1.14.x upwards. Before the operator is ready there will be
> > > > > > > > > > > > another Flink release. Let's see if anyone is interested in
> > > > > > > > > > > > earlier versions?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > What kind of API compatibility we can commit to? It's
> > > > > > > > > > > > > probably fine to have alpha / beta version APIs that allow
> > > > > > > > > > > > > incompatible future changes for the first version. But
> > > > > > > > > > > > > eventually we would need to guarantee backwards
> > > > > > > > > > > > > compatibility, so that an early version CR can work with
> > > > > > > > > > > > > a new version operator.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Another great point and please let me include that on the
> > > > > > > > > > > > FLIP page. ;-)
> > > > > > > > > > > >
> > > > > > > > > > > > I think we should allow incompatible changes for the first
> > > > > > > > > > > > one or two versions, similar to how other major features
> > > > > > > > > > > > have evolved recently, such as FLIP-27.
> > > > > > > > > > > >
> > > > > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise
> > > > > > > > > > > > > <thw@apache.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > > > Maybe we should make this more clear in the FLIP
> > > > > > > > > > > > > > > but we agreed to do the first version of the
> > > > > > > > > > > > > > > operator based on the native integration.
> > > > > > > > > > > > > > > While this clearly does not cover all use-cases
> > > > > > > > > > > > > > > and requirements, it seems this would lead to a
> > > > > > > > > > > > > > > much smaller initial effort and a nicer first
> > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm also leaning towards the native integration,
> > > > > > > > > > > > > > as long as it reduces the MVP effort. Ultimately
> > > > > > > > > > > > > > the operator will need to also support the
> > > > > > > > > > > > > > standalone mode. I would like to gain more
> > > > > > > > > > > > > > confidence that native integration reduces the
> > > > > > > > > > > > > > effort. While it cuts the effort to handle the TM
> > > > > > > > > > > > > > pod creation, some mapping code from the CR to the
> > > > > > > > > > > > > > native integration client and config needs to be
> > > > > > > > > > > > > > created. As mentioned in the FLIP, native
> > > > > > > > > > > > > > integration requires the Flink job manager to have
> > > > > > > > > > > > > > access to the k8s API to create pods, which in
> > > > > > > > > > > > > > some scenarios may be seen as unfavorable.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > > Is the pod template in CR same with what Flink
> > > > > > > > > > > > > > > > > has already supported[4]?
> > > > > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > > > > > > > > > > > > > > > cpu/memory resources) could take effect.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, pod template would look almost identical.
> > > > > > > > > > > > > > There are a few settings that the operator will
> > > > > > > > > > > > > > control (and that may need to be blacklisted), but
> > > > > > > > > > > > > > in general we would not want to place
> > > > > > > > > > > > > > restrictions. I think a mechanism where a pod
> > > > > > > > > > > > > > template is merged from multiple layers would also
> > > > > > > > > > > > > > be interesting to make this more flexible.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > Thomas
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@apache.org>.
Hi Flink Devs!

We would like to present to you the first prototype of the
flink-kubernetes-operator that was built based on the FLIP and the
discussion on this mail thread. We would also like to call out some design
decisions that we have made regarding architecture components that were not
explicitly mentioned in the FLIP document/thread and give you the
opportunity to raise any concerns here.

You can find the initial prototype here:
https://github.com/apache/flink-kubernetes-operator/pull/1

We will leave the PR open for 1-2 days before merging to let people comment
on it, but please be mindful that this is an initial prototype with many
rough edges. It is not intended to be a complete implementation of the FLIP
specs as that will take some more work from all of us :)


*Prototype feature set:*
The prototype contains a basic working version of the
flink-kubernetes-operator that supports deployment and lifecycle
management of a stateful native Flink application. We have basic support
for stateful and stateless upgrades, UI ingress, pod templates etc. Error
handling at this point is largely missing.


*Features / design decisions that were not explicitly discussed in this
thread*

*Basic Admission control using a Webhook*
Standard resource admission control in Kubernetes to validate and
potentially reject resources is done through Webhooks.
https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
This is a necessary mechanism to give the user an upfront error when an
incorrect resource was submitted. In the Flink operator's case we need to
validate that the FlinkDeployment yaml actually makes sense and does not
contain erroneous config options that would inevitably lead to
deployment/job failures.
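
To make the validation idea concrete, here is a minimal sketch of what such a
check could look like. It is not taken from the prototype (whose webhook is a
Java module built on Netty); Python is used only for brevity. The spec fields
checked here (image, job.parallelism) are hypothetical and chosen purely for
illustration. The response envelope (apiVersion admission.k8s.io/v1, kind
AdmissionReview, response.uid / allowed / status) follows the Kubernetes
admission webhook documentation linked above.

```python
from typing import Any, Dict, List


def validate_spec(spec: Dict[str, Any]) -> List[str]:
    """Return the problems found in a FlinkDeployment-like spec.

    The field names below are assumptions for illustration only; they are
    not the prototype's actual CRD schema.
    """
    errors: List[str] = []
    if not spec.get("image"):
        errors.append("spec.image must be set")
    job = spec.get("job")
    if job is not None:
        parallelism = job.get("parallelism", 1)
        if not isinstance(parallelism, int) or parallelism < 1:
            errors.append("spec.job.parallelism must be a positive integer")
    return errors


def review(admission_review: Dict[str, Any]) -> Dict[str, Any]:
    """Build the AdmissionReview response that allows or rejects a resource."""
    request = admission_review["request"]
    errors = validate_spec(request["object"].get("spec", {}))
    # The response must echo the request uid; "allowed" carries the verdict.
    response: Dict[str, Any] = {"uid": request["uid"], "allowed": not errors}
    if errors:
        # The message is what the user sees as the upfront error.
        response["status"] = {"code": 400, "message": "; ".join(errors)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

With a check like this in place, a resource that fails validation is rejected
at submission time with the collected messages, instead of failing later
during deployment.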

We have implemented a simple webhook that we can use for this type of
validation, as a separate Maven module (flink-kubernetes-webhook). The
webhook is an optional component and can be enabled or disabled during
deployment. To avoid pulling in new external dependencies we have used the
Flink Shaded Netty module to build the simple REST endpoint required. If
the community feels that Netty adds unnecessary complexity to the webhook
implementation, we are open to alternative backends, such as Spring Boot,
which would practically eliminate all the boilerplate.


*Helm Chart for deployment*
Helm charts provide an industry standard way of managing Kubernetes
deployments. We have created a Helm chart prototype that can be used to
deploy the operator together with all required resources. The Helm chart
allows easy configuration for things like images, namespaces etc. and
flags to control specific parts of the deployment such as RBAC or the
webhook.

The helm chart provided is intended to be a first version that worked for
us during development but we expect to have a lot of iterations on it based
on the feedback from the community.

*Acknowledgment*
We would like to thank everyone who has provided support and valuable
feedback on this FLIP.
We would also like to thank Yang Wang & Alexis Sarda-Espinosa specifically
for making their operators open source and available to us which had a big
impact on the FLIP and the prototype.

We are looking forward to continuing development on the operator together
with the broader community.
All work will be tracked using the ASF Jira from now on.

Cheers,
Gyula

On Mon, Feb 14, 2022 at 9:21 AM K Fred <yu...@gmail.com> wrote:

> Hi Gyula,
>
> Thanks!
> It's great to see the project getting started and I can't wait to see the
> PR and start contributing code.😄😄😄
>
> Best Wishes!
> Peng Yuan
>
> On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <gy...@gmail.com> wrote:
>
> > Hi Peng Yuan!
> >
> > The repo is already created:
> > https://github.com/apache/flink-kubernetes-operator
> >
> > We will open the PR with the initial prototype later today, stay tuned in
> > this thread! :)
> >
> > Cheers,
> > Gyula
> >
> > On Mon, Feb 14, 2022 at 9:09 AM K Fred <yu...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > Has the project of flink-kubernetes-operator been created in github?
> > >
> > > Peng Yuan
> > >
> > > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <gy...@gmail.com> wrote:
> > >
> > > > I agree with flink-kubernetes-operator as the repo name :)
> > > > Don't have any better idea
> > > >
> > > > Gyula
> > > >
> > > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <th...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thanks for the continued feedback and discussion. Looks like we are
> > > > > ready to start a VOTE, I will initiate it shortly.
> > > > >
> > > > > In parallel it would be good to find the repository name.
> > > > >
> > > > > My suggestion would be: flink-kubernetes-operator
> > > > >
> > > > > I thought "flink-operator" could be a bit misleading since the term
> > > > > operator already has a meaning in Flink.
> > > > >
> > > > > I also considered "flink-k8s-operator" but that would be almost
> > > > > identical to existing operator implementations and could lead to
> > > > > confusion in the future.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Danny,
> > > > > >
> > > > > > So far we have been focusing our dev efforts on the initial native
> > > > > > implementation with the team.
> > > > > > If the discussion and vote goes well for this FLIP we are looking
> > > > > > forward to contributing the initial version sometime next week
> > > > > > (fingers crossed).
> > > > > >
> > > > > > At that point I think we can already start the dev work to support
> > > > > > the standalone mode as well, especially if you can dedicate some
> > > > > > effort to pushing that side.
> > > > > > Working together on this sounds like a great idea and we should
> > > > > > start as soon as possible! :)
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > > dannycranmer@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > I have been discussing this one with my team. We are interested
> > in
> > > > the
> > > > > > > Standalone mode, and are willing to contribute towards the
> > > > > implementation.
> > > > > > > Potentially we can work together to support both modes in
> > parallel?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <
> gyula.fora@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Danny!
> > > > > > > >
> > > > > > > > Thanks for the feedback :)
> > > > > > > >
> > > > > > > > Versioning:
> > > > > > > > Versioning will be independent from Flink and the operator
> will
> > > > > depend
> > > > > > > on a
> > > > > > > > fixed flink version (in every given operator version).
> > > > > > > > This should be the exact same setup as with Stateful
> Functions
> > (
> > > > > > > > https://github.com/apache/flink-statefun). So independent
> > > release
> > > > > cycle
> > > > > > > > but
> > > > > > > > still within the Flink umbrella.
> > > > > > > >
> > > > > > > > Deployment error handling:
> > > > > > > > I think that's a very good point, as general exception
> handling
> > > for
> > > > > the
> > > > > > > > different failure scenarios is a tricky problem. I think the
> > > > > exception
> > > > > > > > classifiers and retry strategies could avoid a lot of manual
> > > > > intervention
> > > > > > > > from the user. We will definitely need to add something like
> > > this.
> > > > > Once
> > > > > > > we
> > > > > > > > have the repo created with the initial operator code we
> should
> > > open
> > > > > some
> > > > > > > > tickets for this and put it on the short term roadmap!
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > > dannycranmer@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey team,
> > > > > > > > >
> > > > > > > > > Great work on the FLIP, I am looking forward to this one. I
> > > agree
> > > > > that
> > > > > > > we
> > > > > > > > > can move forward to the voting stage.
> > > > > > > > >
> > > > > > > > > I have general feedback around how we will handle job
> > > submission
> > > > > > > failure
> > > > > > > > > and retry. As discussed in the Rejected Alternatives
> section,
> > > we
> > > > > can
> > > > > > > use
> > > > > > > > > Java to handle job submission failures from the Flink
> client.
> > > It
> > > > > would
> > > > > > > be
> > > > > > > > > useful to have the ability to configure exception
> classifiers
> > > and
> > > > > retry
> > > > > > > > > strategy as part of operator configuration.
> > > > > > > > >
> > > > > > > > > > Given this will be in a separate GitHub repository, I am curious how
> > > > > > > > > > the versioning strategy will work in relation to the Flink version.
> > > > > > > > > > Do we have any other components with a similar setup I can look at?
> > > > > > > > > > Will the operator version track Flink, or will it use its own
> > > > > > > > > > versioning strategy with a Flink version support matrix, or similar?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > > balassi.marton@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi team,
> > > > > > > > > >
> > > > > > > > > > Thank you for the great feedback, Thomas has updated the
> > FLIP
> > > > > page
> > > > > > > > > > accordingly. If you are comfortable with the currently
> > > existing
> > > > > > > design
> > > > > > > > > and
> > > > > > > > > > depth in the FLIP [1] I suggest moving forward to the
> > voting
> > > > > stage -
> > > > > > > > once
> > > > > > > > > > that reaches a positive conclusion it lets us create the
> > > > separate
> > > > > > > code
> > > > > > > > > > repository under the flink project for the operator.
> > > > > > > > > >
> > > > > > > > > > I encourage everyone to keep improving the details in the
> > > > > meantime,
> > > > > > > > > however
> > > > > > > > > > I believe given the existing design and the general
> > sentiment
> > > > on
> > > > > this
> > > > > > > > > > thread that the most efficient path from here is starting
> > the
> > > > > > > > > > implementation so that we can collectively iterate over
> it.
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > > >
> > > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > > thw@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > > Hi Xintong,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback and please see responses below
> > -->
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > > > tonysong820@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks Thomas for drafting this FLIP, and everyone
> for
> > > the
> > > > > > > > > discussion.
> > > > > > > > > > > >
> > > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > > >
> > > > > > > > > > > > ## Job Submission
> > > > > > > > > > > > Deploying a Flink session cluster via kubectl & CR
> and
> > > then
> > > > > > > > > submitting
> > > > > > > > > > > jobs
> > > > > > > > > > > > to the cluster via Flink cli / REST is probably the
> > > > approach
> > > > > that
> > > > > > > > > > > requires
> > > > > > > > > > > > the least effort. However, I'd like to point out 2
> > > > > weaknesses.
> > > > > > > > > > > > 1. A lot of users use Flink in perjob/application
> > modes.
> > > > For
> > > > > > > these
> > > > > > > > > > users,
> > > > > > > > > > > > having to run the job in two steps (deploy the
> cluster,
> > > and
> > > > > > > submit
> > > > > > > > > the
> > > > > > > > > > > job)
> > > > > > > > > > > > is not that convenient.
> > > > > > > > > > > > 2. One of our motivations is being able to manage
> Flink
> > > > > > > > applications'
> > > > > > > > > > > > lifecycles with kubectl. Submitting jobs from cli
> > sounds
> > > > not
> > > > > > > > aligned
> > > > > > > > > > with
> > > > > > > > > > > > this motivation.
> > > > > > > > > > > > I think it's probably worth it to support submitting
> > jobs
> > > > via
> > > > > > > > > kubectl &
> > > > > > > > > > > CR
> > > > > > > > > > > > in the first version, both together with deploying
> the
> > > > > cluster
> > > > > > > like
> > > > > > > > > in
> > > > > > > > > > > > perjob/application mode and after deploying the
> cluster
> > > > like
> > > > > in
> > > > > > > > > session
> > > > > > > > > > > > mode.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The intention is to support application management
> > through
> > > > > operator
> > > > > > > > and
> > > > > > > > > > CR,
> > > > > > > > > > > which means there won't be any 2 step submission
> process,
> > > > > which as
> > > > > > > > you
> > > > > > > > > > > allude to would defeat the purpose of this project. The
> > CR
> > > > > example
> > > > > > > > > shows
> > > > > > > > > > > the application part. Please note that the bare cluster
> > > > > support is
> > > > > > > an
> > > > > > > > > > > *additional* feature for scenarios that require
> external
> > > job
> > > > > > > > > management.
> > > > > > > > > > Is
> > > > > > > > > > > there anything on the FLIP page that creates a
> different
> > > > > > > impression?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ## Versioning
> > > > > > > > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > > > > > > > > > 4. There were some changes to the Flink docker image entrypoint
> > > > > > > > > > > > >    script in, IIRC, Flink 1.13
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Great, thanks for providing this. It is important for
> the
> > > > > > > > compatibility
> > > > > > > > > > > going forward also. We are targeting Flink 1.14.x
> > upwards.
> > > > > Before
> > > > > > > the
> > > > > > > > > > > operator is ready there will be another Flink release.
> > > Let's
> > > > > see if
> > > > > > > > > > anyone
> > > > > > > > > > > is interested in earlier versions?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > > What kind of API compatibility can we commit to? It's probably fine
> > > > > > > > > > > > > to have alpha / beta version APIs that allow incompatible future
> > > > > > > > > > > > > changes for the first version. But eventually we would need to
> > > > > > > > > > > > > guarantee backwards compatibility, so that an early version CR can
> > > > > > > > > > > > > work with a new version operator.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Another great point and please let me include that on
> the
> > > > FLIP
> > > > > > > page.
> > > > > > > > > ;-)
> > > > > > > > > > >
> > > > > > > > > > > I think we should allow incompatible changes for the
> > first
> > > > one
> > > > > or
> > > > > > > two
> > > > > > > > > > > versions, similar to how other major features have
> > evolved
> > > > > > > recently,
> > > > > > > > > such
> > > > > > > > > > > as FLIP-27.
> > > > > > > > > > >
> > > > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Thomas
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you~
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <
> > > > thw@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > > Maybe we should make this more clear in the FLIP
> > but
> > > we
> > > > > > > agreed
> > > > > > > > to
> > > > > > > > > > do
> > > > > > > > > > > > the
> > > > > > > > > > > > > > first version of the operator based on the native
> > > > > > > integration.
> > > > > > > > > > > > > > While this clearly does not cover all use-cases
> and
> > > > > > > > requirements,
> > > > > > > > > > it
> > > > > > > > > > > > > seems
> > > > > > > > > > > > > > this would lead to a much smaller initial effort
> > and
> > > a
> > > > > nicer
> > > > > > > > > first
> > > > > > > > > > > > > version.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm also leaning towards the native integration, as
> > > long
> > > > > as it
> > > > > > > > > > reduces
> > > > > > > > > > > > the
> > > > > > > > > > > > > MVP effort. Ultimately the operator will need to
> also
> > > > > support
> > > > > > > the
> > > > > > > > > > > > > standalone mode. I would like to gain more
> confidence
> > > > that
> > > > > > > native
> > > > > > > > > > > > > integration reduces the effort. While it cuts the
> > > effort
> > > > to
> > > > > > > > handle
> > > > > > > > > > the
> > > > > > > > > > > TM
> > > > > > > > > > > > > pod creation, some mapping code from the CR to the
> > > native
> > > > > > > > > integration
> > > > > > > > > > > > > client and config needs to be created. As mentioned
> > in
> > > > the
> > > > > > > FLIP,
> > > > > > > > > > native
> > > > > > > > > > > > > integration requires the Flink job manager to have
> > > access
> > > > > to
> > > > > > > the
> > > > > > > > > k8s
> > > > > > > > > > > API
> > > > > > > > > > > > to
> > > > > > > > > > > > > create pods, which in some scenarios may be seen as
> > > > > > > unfavorable.
> > > > > > > > > > > > >
> > > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > > Is the pod template in CR same with what
> Flink
> > > has
> > > > > > > already
> > > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > > > cpu/memory
> > > > > > > > > > > resources)
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, pod template would look almost identical.
> There
> > > are
> > > > a
> > > > > few
> > > > > > > > > > settings
> > > > > > > > > > > > > that the operator will control (and that may need
> to
> > be
> > > > > > > > > blacklisted),
> > > > > > > > > > > but
> > > > > > > > > > > > > in general we would not want to place
> restrictions. I
> > > > > think a
> > > > > > > > > > mechanism
> > > > > > > > > > > > > where a pod template is merged from multiple layers
> > > would
> > > > > also
> > > > > > > be
> > > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Thomas
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>
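
The configurable exception classifiers and retry strategy raised in the
quoted discussion could look roughly like the sketch below. It is only an
illustration under assumptions: SubmissionRetry, RETRYABLE and callWithRetry
are hypothetical names that do not come from the FLIP or the operator code,
and treating IOException and TimeoutException as retryable is an arbitrary
choice made for the example.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names exist in Flink or the operator. */
final class SubmissionRetry {

    /** Example classifier: only transient-looking failures are retried (assumption). */
    static final Predicate<Throwable> RETRYABLE =
            t -> t instanceof IOException || t instanceof TimeoutException;

    /**
     * Calls the action up to maxAttempts times. Failures the classifier rejects
     * are rethrown immediately; retryable ones are retried after a backoff that
     * doubles on every attempt.
     */
    static <T> T callWithRetry(
            Callable<T> action,
            Predicate<Throwable> retryable,
            int maxAttempts,
            Duration initialBackoff) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Duration backoff = initialBackoff;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Give up on non-retryable failures or when attempts are exhausted.
                if (!retryable.test(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoff.toMillis());
                backoff = backoff.multipliedBy(2);
            }
        }
    }
}
```

In such a design the classifier and the attempt/backoff values would be the
parts exposed through operator configuration, so that users could tune them
without code changes.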

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by K Fred <yu...@gmail.com>.
Hi Gyula,

Thanks!
It's great to see the project getting started and I can't wait to see the
PR and start contributing code.😄😄😄

Best Wishes!
Peng Yuan

On Mon, Feb 14, 2022 at 4:14 PM Gyula Fóra <gy...@gmail.com> wrote:

> Hi Peng Yuan!
>
> The repo is already created:
> https://github.com/apache/flink-kubernetes-operator
>
> We will open the PR with the initial prototype later today, stay tuned in
> this thread! :)
>
> Cheers,
> Gyula
>
> On Mon, Feb 14, 2022 at 9:09 AM K Fred <yu...@gmail.com> wrote:
>
> > Hi All,
> >
> > Has the flink-kubernetes-operator project been created on GitHub?
> >
> > Peng Yuan
> >
> > On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > I agree with flink-kubernetes-operator as the repo name :)
> > > Don't have any better idea
> > >
> > > Gyula
> > >
> > > On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <th...@apache.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > Thanks for the continued feedback and discussion. Looks like we are
> > > > ready to start a VOTE, I will initiate it shortly.
> > > >
> > > > In parallel it would be good to find the repository name.
> > > >
> > > > My suggestion would be: flink-kubernetes-operator
> > > >
> > > > I thought "flink-operator" could be a bit misleading since the term
> > > > operator already has a meaning in Flink.
> > > >
> > > > I also considered "flink-k8s-operator" but that would be almost
> > > > identical to existing operator implementations and could lead to
> > > > confusion in the future.
> > > >
> > > > Thoughts?
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > >
> > > > On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com>
> > wrote:
> > > > >
> > > > > Hi Danny,
> > > > >
> > > > > So far we have been focusing our dev efforts on the initial native
> > > > > implementation with the team.
> > > > > If the discussion and vote goes well for this FLIP we are looking
> > > forward
> > > > > to contributing the initial version sometime next week (fingers
> > > crossed).
> > > > >
> > > > > At that point I think we can already start the dev work to support
> > the
> > > > > standalone mode as well, especially if you can dedicate some effort
> > to
> > > > > pushing that side.
> > > > > Working together on this sounds like a great idea and we should
> start
> > > as
> > > > > soon as possible! :)
> > > > >
> > > > > Cheers,
> > > > > Gyula
> > > > >
> > > > > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <
> > dannycranmer@apache.org>
> > > > > wrote:
> > > > >
> > > > > > I have been discussing this one with my team. We are interested
> in
> > > the
> > > > > > Standalone mode, and are willing to contribute towards the
> > > > implementation.
> > > > > > Potentially we can work together to support both modes in
> parallel?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gy...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi Danny!
> > > > > > >
> > > > > > > Thanks for the feedback :)
> > > > > > >
> > > > > > > Versioning:
> > > > > > > Versioning will be independent from Flink and the operator will
> > > > depend
> > > > > > on a
> > > > > > > fixed flink version (in every given operator version).
> > > > > > > This should be the exact same setup as with Stateful Functions
> (
> > > > > > > https://github.com/apache/flink-statefun). So independent
> > release
> > > > cycle
> > > > > > > but
> > > > > > > still within the Flink umbrella.
> > > > > > >
> > > > > > > Deployment error handling:
> > > > > > > I think that's a very good point, as general exception handling
> > for
> > > > the
> > > > > > > different failure scenarios is a tricky problem. I think the
> > > > exception
> > > > > > > classifiers and retry strategies could avoid a lot of manual
> > > > intervention
> > > > > > > from the user. We will definitely need to add something like
> > this.
> > > > Once
> > > > > > we
> > > > > > > have the repo created with the initial operator code we should
> > open
> > > > some
> > > > > > > tickets for this and put it on the short term roadmap!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <
> > > > dannycranmer@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey team,
> > > > > > > >
> > > > > > > > Great work on the FLIP, I am looking forward to this one. I
> > agree
> > > > that
> > > > > > we
> > > > > > > > can move forward to the voting stage.
> > > > > > > >
> > > > > > > > I have general feedback around how we will handle job
> > submission
> > > > > > failure
> > > > > > > > and retry. As discussed in the Rejected Alternatives section,
> > we
> > > > can
> > > > > > use
> > > > > > > > Java to handle job submission failures from the Flink client.
> > It
> > > > would
> > > > > > be
> > > > > > > > useful to have the ability to configure exception classifiers
> > and
> > > > retry
> > > > > > > > strategy as part of operator configuration.
> > > > > > > >
> > > > > > > > > Given this will be in a separate GitHub repository, I am curious how
> > > > > > > > > the versioning strategy will work in relation to the Flink version.
> > > > > > > > > Do we have any other components with a similar setup I can look at?
> > > > > > > > > Will the operator version track Flink, or will it use its own
> > > > > > > > > versioning strategy with a Flink version support matrix, or similar?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > > > > > balassi.marton@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi team,
> > > > > > > > >
> > > > > > > > > Thank you for the great feedback, Thomas has updated the
> FLIP
> > > > page
> > > > > > > > > accordingly. If you are comfortable with the currently
> > existing
> > > > > > design
> > > > > > > > and
> > > > > > > > > depth in the FLIP [1] I suggest moving forward to the
> voting
> > > > stage -
> > > > > > > once
> > > > > > > > > that reaches a positive conclusion it lets us create the
> > > separate
> > > > > > code
> > > > > > > > > repository under the flink project for the operator.
> > > > > > > > >
> > > > > > > > > I encourage everyone to keep improving the details in the
> > > > meantime,
> > > > > > > > however
> > > > > > > > > I believe given the existing design and the general
> sentiment
> > > on
> > > > this
> > > > > > > > > thread that the most efficient path from here is starting
> the
> > > > > > > > > implementation so that we can collectively iterate over it.
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > > > > >
> > > > > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <
> > thw@apache.org>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > > Hi Xintong,
> > > > > > > > > >
> > > > > > > > > > Thanks for the feedback and please see responses below
> -->
> > > > > > > > > >
> > > > > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > > > > > tonysong820@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks Thomas for drafting this FLIP, and everyone for
> > the
> > > > > > > > discussion.
> > > > > > > > > > >
> > > > > > > > > > > I also have a few questions and comments.
> > > > > > > > > > >
> > > > > > > > > > > ## Job Submission
> > > > > > > > > > > Deploying a Flink session cluster via kubectl & CR and
> > then
> > > > > > > > submitting
> > > > > > > > > > jobs
> > > > > > > > > > > to the cluster via Flink cli / REST is probably the
> > > approach
> > > > that
> > > > > > > > > > requires
> > > > > > > > > > > the least effort. However, I'd like to point out 2
> > > > weaknesses.
> > > > > > > > > > > 1. A lot of users use Flink in perjob/application
> modes.
> > > For
> > > > > > these
> > > > > > > > > users,
> > > > > > > > > > > having to run the job in two steps (deploy the cluster,
> > and
> > > > > > submit
> > > > > > > > the
> > > > > > > > > > job)
> > > > > > > > > > > is not that convenient.
> > > > > > > > > > > 2. One of our motivations is being able to manage Flink
> > > > > > > applications'
> > > > > > > > > > > lifecycles with kubectl. Submitting jobs from cli
> sounds
> > > not
> > > > > > > aligned
> > > > > > > > > with
> > > > > > > > > > > this motivation.
> > > > > > > > > > > I think it's probably worth it to support submitting
> jobs
> > > via
> > > > > > > > kubectl &
> > > > > > > > > > CR
> > > > > > > > > > > in the first version, both together with deploying the
> > > > cluster
> > > > > > like
> > > > > > > > in
> > > > > > > > > > > perjob/application mode and after deploying the cluster
> > > like
> > > > in
> > > > > > > > session
> > > > > > > > > > > mode.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > The intention is to support application management
> through
> > > > operator
> > > > > > > and
> > > > > > > > > CR,
> > > > > > > > > > which means there won't be any 2 step submission process,
> > > > which as
> > > > > > > you
> > > > > > > > > > allude to would defeat the purpose of this project. The
> CR
> > > > example
> > > > > > > > shows
> > > > > > > > > > the application part. Please note that the bare cluster
> > > > support is
> > > > > > an
> > > > > > > > > > *additional* feature for scenarios that require external
> > job
> > > > > > > > management.
> > > > > > > > > Is
> > > > > > > > > > there anything on the FLIP page that creates a different
> > > > > > impression?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ## Versioning
> > > > > > > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > > > > > > > > 4. There were some changes to the Flink docker image entrypoint
> > > > > > > > > > > >    script in, IIRC, Flink 1.13
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Great, thanks for providing this. It is important for the
> > > > > > > compatibility
> > > > > > > > > > going forward also. We are targeting Flink 1.14.x
> upwards.
> > > > Before
> > > > > > the
> > > > > > > > > > operator is ready there will be another Flink release.
> > Let's
> > > > see if
> > > > > > > > > anyone
> > > > > > > > > > is interested in earlier versions?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ## Compatibility
> > > > > > > > > > > > What kind of API compatibility can we commit to? It's probably fine
> > > > > > > > > > > > to have alpha / beta version APIs that allow incompatible future
> > > > > > > > > > > > changes for the first version. But eventually we would need to
> > > > > > > > > > > > guarantee backwards compatibility, so that an early version CR can
> > > > > > > > > > > > work with a new version operator.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Another great point and please let me include that on the
> > > FLIP
> > > > > > page.
> > > > > > > > ;-)
> > > > > > > > > >
> > > > > > > > > > I think we should allow incompatible changes for the
> first
> > > one
> > > > or
> > > > > > two
> > > > > > > > > > versions, similar to how other major features have
> evolved
> > > > > > recently,
> > > > > > > > such
> > > > > > > > > > as FLIP-27.
> > > > > > > > > >
> > > > > > > > > > Would be great to get broader feedback on this one.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Thomas
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <
> > > thw@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback!
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > > > > Maybe we should make this more clear in the FLIP
> but
> > we
> > > > > > agreed
> > > > > > > to
> > > > > > > > > do
> > > > > > > > > > > the
> > > > > > > > > > > > > first version of the operator based on the native
> > > > > > integration.
> > > > > > > > > > > > > While this clearly does not cover all use-cases and
> > > > > > > requirements,
> > > > > > > > > it
> > > > > > > > > > > > seems
> > > > > > > > > > > > > this would lead to a much smaller initial effort
> and
> > a
> > > > nicer
> > > > > > > > first
> > > > > > > > > > > > version.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I'm also leaning towards the native integration, as
> > long
> > > > as it
> > > > > > > > > reduces
> > > > > > > > > > > the
> > > > > > > > > > > > MVP effort. Ultimately the operator will need to also
> > > > support
> > > > > > the
> > > > > > > > > > > > standalone mode. I would like to gain more confidence
> > > that
> > > > > > native
> > > > > > > > > > > > integration reduces the effort. While it cuts the
> > effort
> > > to
> > > > > > > handle
> > > > > > > > > the
> > > > > > > > > > TM
> > > > > > > > > > > > pod creation, some mapping code from the CR to the
> > native
> > > > > > > > integration
> > > > > > > > > > > > client and config needs to be created. As mentioned
> in
> > > the
> > > > > > FLIP,
> > > > > > > > > native
> > > > > > > > > > > > integration requires the Flink job manager to have
> > access
> > > > to
> > > > > > the
> > > > > > > > k8s
> > > > > > > > > > API
> > > > > > > > > > > to
> > > > > > > > > > > > create pods, which in some scenarios may be seen as
> > > > > > unfavorable.
> > > > > > > > > > > >
> > > > > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > > > > Is the pod template in CR same with what Flink
> > has
> > > > > > already
> > > > > > > > > > > > > supported[4]?
> > > > > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> > > > cpu/memory
> > > > > > > > > > resources)
> > > > > > > > > > > > > could
> > > > > > > > > > > > > > > take effect.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, pod template would look almost identical. There
> > are
> > > a
> > > > few
> > > > > > > > > settings
> > > > > > > > > > > > that the operator will control (and that may need to
> be
> > > > > > > > blacklisted),
> > > > > > > > > > but
> > > > > > > > > > > > in general we would not want to place restrictions. I
> > > > think a
> > > > > > > > > mechanism
> > > > > > > > > > > > where a pod template is merged from multiple layers
> > would
> > > > also
> > > > > > be
> > > > > > > > > > > > interesting to make this more flexible.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Thomas
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
>
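
The idea from the quoted discussion of merging a pod template from multiple
layers can be sketched as a deep merge over nested maps. This is a minimal
illustration under assumptions, not the operator's implementation:
PodTemplateLayers and merge are hypothetical names, templates are modelled as
plain Map<String, Object> trees, and lists (for example containers) are
replaced wholesale rather than merged element by element.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not part of Flink or the Kubernetes operator. */
final class PodTemplateLayers {

    /**
     * Returns a new map containing base overlaid with override. Nested maps
     * are merged recursively; any other value in override replaces the value
     * from base. Neither input map is modified.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> override) {
        Map<String, Object> result = new LinkedHashMap<>(base);
        for (Map.Entry<String, Object> entry : override.entrySet()) {
            Object existing = result.get(entry.getKey());
            Object incoming = entry.getValue();
            if (existing instanceof Map && incoming instanceof Map) {
                // Both layers define a nested object: merge field by field.
                result.put(
                        entry.getKey(),
                        merge((Map<String, Object>) existing, (Map<String, Object>) incoming));
            } else {
                // Scalars, lists, and type mismatches: the later layer wins.
                result.put(entry.getKey(), incoming);
            }
        }
        return result;
    }
}
```

Layers would be applied in order of increasing precedence, for example a
cluster-wide default first and the template from the CR after it. Settings
that the operator controls itself could be applied as a final layer so that
user templates cannot override them.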

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Peng Yuan!

The repo is already created:
https://github.com/apache/flink-kubernetes-operator

We will open the PR with the initial prototype later today, stay tuned in
this thread! :)

Cheers,
Gyula

On Mon, Feb 14, 2022 at 9:09 AM K Fred <yu...@gmail.com> wrote:

> Hi All,
>
> Has the flink-kubernetes-operator project been created on GitHub?
>
> Peng Yuan
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by K Fred <yu...@gmail.com>.
Hi All,

Has the flink-kubernetes-operator project been created on GitHub?

Peng Yuan

On Wed, Feb 9, 2022 at 1:23 AM Gyula Fóra <gy...@gmail.com> wrote:

> I agree with flink-kubernetes-operator as the repo name :)
> I don't have a better idea.
>
> Gyula
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
I agree with flink-kubernetes-operator as the repo name :)
I don't have a better idea.

Gyula

On Sat, Feb 5, 2022 at 2:41 AM Thomas Weise <th...@apache.org> wrote:

> Hi,
>
> Thanks for the continued feedback and discussion. Looks like we are
> ready to start a VOTE, I will initiate it shortly.
>
> In parallel it would be good to find the repository name.
>
> My suggestion would be: flink-kubernetes-operator
>
> I thought "flink-operator" could be a bit misleading since the term
> operator already has a meaning in Flink.
>
> I also considered "flink-k8s-operator" but that would be almost
> identical to existing operator implementations and could lead to
> confusion in the future.
>
> Thoughts?
>
> Thanks,
> Thomas
>
>
>
> On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > Hi Danny,
> >
> > So far we have been focusing our dev efforts on the initial native
> > implementation with the team.
> > If the discussion and vote go well for this FLIP we are looking forward
> > to contributing the initial version sometime next week (fingers crossed).
> >
> > At that point I think we can already start the dev work to support the
> > standalone mode as well, especially if you can dedicate some effort to
> > pushing that side.
> > Working together on this sounds like a great idea and we should start as
> > soon as possible! :)
> >
> > Cheers,
> > Gyula
> >
> > On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <da...@apache.org>
> > wrote:
> >
> > > I have been discussing this one with my team. We are interested in the
> > > Standalone mode, and are willing to contribute towards the implementation.
> > > Potentially we can work together to support both modes in parallel?
> > >
> > > Thanks,
> > >
> > > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gy...@gmail.com>
> > > wrote:
> > >
> > > > Hi Danny!
> > > >
> > > > Thanks for the feedback :)
> > > >
> > > > Versioning:
> > > > Versioning will be independent from Flink and the operator will depend
> > > > on a fixed flink version (in every given operator version).
> > > > This should be the exact same setup as with Stateful Functions
> > > > (https://github.com/apache/flink-statefun). So independent release cycle
> > > > but still within the Flink umbrella.
> > > >
> > > > Deployment error handling:
> > > > I think that's a very good point, as general exception handling for the
> > > > different failure scenarios is a tricky problem. I think the exception
> > > > classifiers and retry strategies could avoid a lot of manual intervention
> > > > from the user. We will definitely need to add something like this. Once we
> > > > have the repo created with the initial operator code we should open some
> > > > tickets for this and put it on the short term roadmap!
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <dannycranmer@apache.org>
> > > > wrote:
> > > >
> > > > > Hey team,
> > > > >
> > > > > Great work on the FLIP, I am looking forward to this one. I agree that we
> > > > > can move forward to the voting stage.
> > > > >
> > > > > I have general feedback around how we will handle job submission failure
> > > > > and retry. As discussed in the Rejected Alternatives section, we can use
> > > > > Java to handle job submission failures from the Flink client. It would be
> > > > > useful to have the ability to configure exception classifiers and retry
> > > > > strategy as part of operator configuration.
> > > > >
> > > > > Given this will be in a separate GitHub repository I am curious how the
> > > > > versioning strategy will work in relation to the Flink version? Do we have
> > > > > any other components with a similar setup I can look at? Will the operator
> > > > > version track Flink or will it use its own versioning strategy with a Flink
> > > > > version support matrix, or similar?
> > > > >
> > > > > Thanks,
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <balassi.marton@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi team,
> > > > > >
> > > > > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > > > > accordingly. If you are comfortable with the currently existing design and
> > > > > > depth in the FLIP [1] I suggest moving forward to the voting stage - once
> > > > > > that reaches a positive conclusion it lets us create the separate code
> > > > > > repository under the flink project for the operator.
> > > > > >
> > > > > > I encourage everyone to keep improving the details in the meantime, however
> > > > > > I believe given the existing design and the general sentiment on this
> > > > > > thread that the most efficient path from here is starting the
> > > > > > implementation so that we can collectively iterate over it.
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > > >
> > > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <th...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Xintong,
> > > > > > >
> > > > > > > Thanks for the feedback and please see responses below -->
> > > > > > >
> > > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <tonysong820@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the discussion.
> > > > > > > >
> > > > > > > > I also have a few questions and comments.
> > > > > > > >
> > > > > > > > ## Job Submission
> > > > > > > > Deploying a Flink session cluster via kubectl & CR and then submitting jobs
> > > > > > > > to the cluster via Flink cli / REST is probably the approach that requires
> > > > > > > > the least effort. However, I'd like to point out 2 weaknesses.
> > > > > > > > 1. A lot of users use Flink in perjob/application modes. For these users,
> > > > > > > > having to run the job in two steps (deploy the cluster, and submit the job)
> > > > > > > > is not that convenient.
> > > > > > > > 2. One of our motivations is being able to manage Flink applications'
> > > > > > > > lifecycles with kubectl. Submitting jobs from cli sounds not aligned with
> > > > > > > > this motivation.
> > > > > > > > I think it's probably worth it to support submitting jobs via kubectl & CR
> > > > > > > > in the first version, both together with deploying the cluster like in
> > > > > > > > perjob/application mode and after deploying the cluster like in session
> > > > > > > > mode.
> > > > > > > >
> > > > > > >
> > > > > > > The intention is to support application management through operator and CR,
> > > > > > > which means there won't be any 2 step submission process, which as you
> > > > > > > allude to would defeat the purpose of this project. The CR example shows
> > > > > > > the application part. Please note that the bare cluster support is an
> > > > > > > *additional* feature for scenarios that require external job management. Is
> > > > > > > there anything on the FLIP page that creates a different impression?
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > ## Versioning
> > > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > > 3. Pod template support was introduced in Flink 1.13
> > > > > > > > > 4. There were some changes to the Flink Docker image entrypoint
> > > > > > > > > script in, IIRC, Flink 1.13
> > > > > > > >
> > > > > > >
> > > > > > > Great, thanks for providing this. It is important for the
> > > > compatibility
> > > > > > > going forward also. We are targeting Flink 1.14.x upwards.
> Before
> > > the
> > > > > > > operator is ready there will be another Flink release. Let's
> see if
> > > > > > anyone
> > > > > > > is interested in earlier versions?
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > ## Compatibility
> > > > > > > > > What kind of API compatibility can we commit to? It's probably fine to
> > > > > > > > > have alpha / beta version APIs that allow incompatible future changes
> > > > > > > > > for the first version. But eventually we would need to guarantee
> > > > > > > > > backwards compatibility, so that a CR written for an early version
> > > > > > > > > still works with a newer version of the operator.
> > > > > > > >
> > > > > > >
> > > > > > > Another great point and please let me include that on the FLIP
> > > page.
> > > > > ;-)
> > > > > > >
> > > > > > > I think we should allow incompatible changes for the first one
> or
> > > two
> > > > > > > versions, similar to how other major features have evolved
> > > recently,
> > > > > such
> > > > > > > as FLIP-27.
> > > > > > >
> > > > > > > Would be great to get broader feedback on this one.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Thomas
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <thw@apache.org
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback!
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > > Maybe we should make this more clear in the FLIP but we
> > > agreed
> > > > to
> > > > > > do
> > > > > > > > the
> > > > > > > > > > first version of the operator based on the native
> > > integration.
> > > > > > > > > > While this clearly does not cover all use-cases and
> > > > requirements,
> > > > > > it
> > > > > > > > > seems
> > > > > > > > > > this would lead to a much smaller initial effort and a
> nicer
> > > > > first
> > > > > > > > > version.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm also leaning towards the native integration, as long
> as it
> > > > > > reduces
> > > > > > > > the
> > > > > > > > > MVP effort. Ultimately the operator will need to also
> support
> > > the
> > > > > > > > > standalone mode. I would like to gain more confidence that
> > > native
> > > > > > > > > integration reduces the effort. While it cuts the effort to
> > > > handle
> > > > > > the
> > > > > > > TM
> > > > > > > > > pod creation, some mapping code from the CR to the native
> > > > > integration
> > > > > > > > > client and config needs to be created. As mentioned in the
> > > FLIP,
> > > > > > native
> > > > > > > > > integration requires the Flink job manager to have access
> to
> > > the
> > > > > k8s
> > > > > > > API
> > > > > > > > to
> > > > > > > > > create pods, which in some scenarios may be seen as
> > > unfavorable.
> > > > > > > > >
> > > > > > > > >  > > > # Pod Template
> > > > > > > > > > > > Is the pod template in CR same with what Flink has
> > > already
> > > > > > > > > > supported[4]?
> > > > > > > > > > > > Then I am afraid not the arbitrary field(e.g.
> cpu/memory
> > > > > > > resources)
> > > > > > > > > > could
> > > > > > > > > > > > take effect.
> > > > > > > > >
> > > > > > > > > Yes, pod template would look almost identical. There are a
> few
> > > > > > settings
> > > > > > > > > that the operator will control (and that may need to be
> > > > > blacklisted),
> > > > > > > but
> > > > > > > > > in general we would not want to place restrictions. I
> think a
> > > > > > mechanism
> > > > > > > > > where a pod template is merged from multiple layers would
> also
> > > be
> > > > > > > > > interesting to make this more flexible.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Thomas
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Thomas Weise <th...@apache.org>.
Hi,

Thanks for the continued feedback and discussion. It looks like we are
ready to start a VOTE; I will initiate it shortly.

In parallel, it would be good to settle on the repository name.

My suggestion would be: flink-kubernetes-operator

I thought "flink-operator" could be a bit misleading since the term
operator already has a meaning in Flink.

I also considered "flink-k8s-operator" but that would be almost
identical to the names of existing operator implementations and could
lead to confusion in the future.

Thoughts?

Thanks,
Thomas



On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <gy...@gmail.com> wrote:
>
> Hi Danny,
>
> So far we have been focusing our dev efforts on the initial native
> implementation with the team.
> If the discussion and vote go well for this FLIP we are looking forward
> to contributing the initial version sometime next week (fingers crossed).
>
> At that point I think we can already start the dev work to support the
> standalone mode as well, especially if you can dedicate some effort to
> pushing that side.
> Working together on this sounds like a great idea and we should start as
> soon as possible! :)
>
> Cheers,
> Gyula
>
> On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <da...@apache.org>
> wrote:
>
> > I have been discussing this one with my team. We are interested in the
> > Standalone mode, and are willing to contribute towards the implementation.
> > Potentially we can work together to support both modes in parallel?
> >
> > Thanks,
> >
> > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > Hi Danny!
> > >
> > > Thanks for the feedback :)
> > >
> > > Versioning:
> > > Versioning will be independent from Flink and the operator will depend
> > on a
> > > fixed flink version (in every given operator version).
> > > This should be the exact same setup as with Stateful Functions (
> > > https://github.com/apache/flink-statefun). So independent release cycle
> > > but
> > > still within the Flink umbrella.
> > >
> > > Deployment error handling:
> > > I think that's a very good point, as general exception handling for the
> > > different failure scenarios is a tricky problem. I think the exception
> > > classifiers and retry strategies could avoid a lot of manual intervention
> > > from the user. We will definitely need to add something like this. Once
> > we
> > > have the repo created with the initial operator code we should open some
> > > tickets for this and put it on the short term roadmap!
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <da...@apache.org>
> > > wrote:
> > >
> > > > Hey team,
> > > >
> > > > Great work on the FLIP, I am looking forward to this one. I agree that
> > we
> > > > can move forward to the voting stage.
> > > >
> > > > I have general feedback around how we will handle job submission
> > failure
> > > > and retry. As discussed in the Rejected Alternatives section, we can
> > use
> > > > Java to handle job submission failures from the Flink client. It would
> > be
> > > > useful to have the ability to configure exception classifiers and retry
> > > > strategy as part of operator configuration.
> > > >
> > > > > Given this will be in a separate GitHub repository I am curious how the
> > > > versioning strategy will work in relation to the Flink version? Do we
> > > have
> > > > any other components with a similar setup I can look at? Will the
> > > operator
> > > > version track Flink or will it use its own versioning strategy with a
> > > Flink
> > > > version support matrix, or similar?
> > > >
> > > > Thanks,
> > > >
> > > >
> > > >
> > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> > balassi.marton@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi team,
> > > > >
> > > > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > > > accordingly. If you are comfortable with the currently existing
> > design
> > > > and
> > > > > depth in the FLIP [1] I suggest moving forward to the voting stage -
> > > once
> > > > > that reaches a positive conclusion it lets us create the separate
> > code
> > > > > repository under the flink project for the operator.
> > > > >
> > > > > I encourage everyone to keep improving the details in the meantime,
> > > > however
> > > > > I believe given the existing design and the general sentiment on this
> > > > > thread that the most efficient path from here is starting the
> > > > > implementation so that we can collectively iterate over it.
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > >
> > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <th...@apache.org>
> > wrote:
> > > > >
> > > > > > > Hi Xintong,
> > > > > >
> > > > > > Thanks for the feedback and please see responses below -->
> > > > > >
> > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> > tonysong820@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the
> > > > discussion.
> > > > > > >
> > > > > > > I also have a few questions and comments.
> > > > > > >
> > > > > > > ## Job Submission
> > > > > > > Deploying a Flink session cluster via kubectl & CR and then
> > > > submitting
> > > > > > jobs
> > > > > > > to the cluster via Flink cli / REST is probably the approach that
> > > > > > requires
> > > > > > > the least effort. However, I'd like to point out 2 weaknesses.
> > > > > > > 1. A lot of users use Flink in perjob/application modes. For
> > these
> > > > > users,
> > > > > > > having to run the job in two steps (deploy the cluster, and
> > submit
> > > > the
> > > > > > job)
> > > > > > > is not that convenient.
> > > > > > > 2. One of our motivations is being able to manage Flink
> > > applications'
> > > > > > > lifecycles with kubectl. Submitting jobs from cli sounds not
> > > aligned
> > > > > with
> > > > > > > this motivation.
> > > > > > > I think it's probably worth it to support submitting jobs via
> > > > kubectl &
> > > > > > CR
> > > > > > > in the first version, both together with deploying the cluster
> > like
> > > > in
> > > > > > > perjob/application mode and after deploying the cluster like in
> > > > session
> > > > > > > mode.
> > > > > > >
> > > > > >
> > > > > > The intention is to support application management through operator
> > > and
> > > > > CR,
> > > > > > which means there won't be any 2 step submission process, which as
> > > you
> > > > > > allude to would defeat the purpose of this project. The CR example
> > > > shows
> > > > > > the application part. Please note that the bare cluster support is
> > an
> > > > > > *additional* feature for scenarios that require external job
> > > > management.
> > > > > Is
> > > > > > there anything on the FLIP page that creates a different
> > impression?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > ## Versioning
> > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > > 3. Pod template support was introduced in Flink 1.13
> > > > > > > > 4. There were some changes to the Flink Docker image entrypoint
> > > > > > > > script in, IIRC, Flink 1.13
> > > > > > >
> > > > > >
> > > > > > Great, thanks for providing this. It is important for the
> > > compatibility
> > > > > > going forward also. We are targeting Flink 1.14.x upwards. Before
> > the
> > > > > > operator is ready there will be another Flink release. Let's see if
> > > > > anyone
> > > > > > is interested in earlier versions?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > ## Compatibility
> > > > > > > > What kind of API compatibility can we commit to? It's probably fine to
> > > > > > > > have alpha / beta version APIs that allow incompatible future changes
> > > > > > > > for the first version. But eventually we would need to guarantee
> > > > > > > > backwards compatibility, so that a CR written for an early version
> > > > > > > > still works with a newer version of the operator.
> > > > > > >
> > > > > >
> > > > > > Another great point and please let me include that on the FLIP
> > page.
> > > > ;-)
> > > > > >
> > > > > > I think we should allow incompatible changes for the first one or
> > two
> > > > > > versions, similar to how other major features have evolved
> > recently,
> > > > such
> > > > > > as FLIP-27.
> > > > > >
> > > > > > Would be great to get broader feedback on this one.
> > > > > >
> > > > > > Cheers,
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <th...@apache.org>
> > > wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback!
> > > > > > > >
> > > > > > > > >
> > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > Maybe we should make this more clear in the FLIP but we
> > agreed
> > > to
> > > > > do
> > > > > > > the
> > > > > > > > > first version of the operator based on the native
> > integration.
> > > > > > > > > While this clearly does not cover all use-cases and
> > > requirements,
> > > > > it
> > > > > > > > seems
> > > > > > > > > this would lead to a much smaller initial effort and a nicer
> > > > first
> > > > > > > > version.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm also leaning towards the native integration, as long as it
> > > > > reduces
> > > > > > > the
> > > > > > > > MVP effort. Ultimately the operator will need to also support
> > the
> > > > > > > > standalone mode. I would like to gain more confidence that
> > native
> > > > > > > > integration reduces the effort. While it cuts the effort to
> > > handle
> > > > > the
> > > > > > TM
> > > > > > > > pod creation, some mapping code from the CR to the native
> > > > integration
> > > > > > > > client and config needs to be created. As mentioned in the
> > FLIP,
> > > > > native
> > > > > > > > integration requires the Flink job manager to have access to
> > the
> > > > k8s
> > > > > > API
> > > > > > > to
> > > > > > > > create pods, which in some scenarios may be seen as
> > unfavorable.
> > > > > > > >
> > > > > > > >  > > > # Pod Template
> > > > > > > > > > > Is the pod template in CR same with what Flink has
> > already
> > > > > > > > > supported[4]?
> > > > > > > > > > > Then I am afraid not the arbitrary field(e.g. cpu/memory
> > > > > > resources)
> > > > > > > > > could
> > > > > > > > > > > take effect.
> > > > > > > >
> > > > > > > > Yes, pod template would look almost identical. There are a few
> > > > > settings
> > > > > > > > that the operator will control (and that may need to be
> > > > blacklisted),
> > > > > > but
> > > > > > > > in general we would not want to place restrictions. I think a
> > > > > mechanism
> > > > > > > > where a pod template is merged from multiple layers would also
> > be
> > > > > > > > interesting to make this more flexible.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Thomas
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Danny,

So far we have been focusing our dev efforts on the initial native
implementation with the team.
If the discussion and vote go well for this FLIP we are looking forward
to contributing the initial version sometime next week (fingers crossed).

At that point I think we can already start the dev work to support the
standalone mode as well, especially if you can dedicate some effort to
pushing that side.
Working together on this sounds like a great idea and we should start as
soon as possible! :)

Cheers,
Gyula

On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <da...@apache.org>
wrote:

> I have been discussing this one with my team. We are interested in the
> Standalone mode, and are willing to contribute towards the implementation.
> Potentially we can work together to support both modes in parallel?
>
> Thanks,
>
> On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gy...@gmail.com> wrote:
>
> > Hi Danny!
> >
> > Thanks for the feedback :)
> >
> > Versioning:
> > Versioning will be independent from Flink and the operator will depend
> on a
> > fixed flink version (in every given operator version).
> > This should be the exact same setup as with Stateful Functions (
> > https://github.com/apache/flink-statefun). So independent release cycle
> > but
> > still within the Flink umbrella.
> >
> > Deployment error handling:
> > I think that's a very good point, as general exception handling for the
> > different failure scenarios is a tricky problem. I think the exception
> > classifiers and retry strategies could avoid a lot of manual intervention
> > from the user. We will definitely need to add something like this. Once
> we
> > have the repo created with the initial operator code we should open some
> > tickets for this and put it on the short term roadmap!
> >
> > Cheers,
> > Gyula
> >
> > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <da...@apache.org>
> > wrote:
> >
> > > Hey team,
> > >
> > > Great work on the FLIP, I am looking forward to this one. I agree that
> we
> > > can move forward to the voting stage.
> > >
> > > I have general feedback around how we will handle job submission
> failure
> > > and retry. As discussed in the Rejected Alternatives section, we can
> use
> > > Java to handle job submission failures from the Flink client. It would
> be
> > > useful to have the ability to configure exception classifiers and retry
> > > strategy as part of operator configuration.
> > >
> > > > Given this will be in a separate GitHub repository I am curious how the
> > > versioning strategy will work in relation to the Flink version? Do we
> > have
> > > any other components with a similar setup I can look at? Will the
> > operator
> > > version track Flink or will it use its own versioning strategy with a
> > Flink
> > > version support matrix, or similar?
> > >
> > > Thanks,
> > >
> > >
> > >
> > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <
> balassi.marton@gmail.com>
> > > wrote:
> > >
> > > > Hi team,
> > > >
> > > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > > accordingly. If you are comfortable with the currently existing
> design
> > > and
> > > > depth in the FLIP [1] I suggest moving forward to the voting stage -
> > once
> > > > that reaches a positive conclusion it lets us create the separate
> code
> > > > repository under the flink project for the operator.
> > > >
> > > > I encourage everyone to keep improving the details in the meantime,
> > > however
> > > > I believe given the existing design and the general sentiment on this
> > > > thread that the most efficient path from here is starting the
> > > > implementation so that we can collectively iterate over it.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > >
> > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <th...@apache.org>
> wrote:
> > > >
> > > > > > Hi Xintong,
> > > > >
> > > > > Thanks for the feedback and please see responses below -->
> > > > >
> > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <
> tonysong820@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Thanks Thomas for drafting this FLIP, and everyone for the
> > > discussion.
> > > > > >
> > > > > > I also have a few questions and comments.
> > > > > >
> > > > > > ## Job Submission
> > > > > > Deploying a Flink session cluster via kubectl & CR and then
> > > submitting
> > > > > jobs
> > > > > > to the cluster via Flink cli / REST is probably the approach that
> > > > > requires
> > > > > > the least effort. However, I'd like to point out 2 weaknesses.
> > > > > > 1. A lot of users use Flink in perjob/application modes. For
> these
> > > > users,
> > > > > > having to run the job in two steps (deploy the cluster, and
> submit
> > > the
> > > > > job)
> > > > > > is not that convenient.
> > > > > > 2. One of our motivations is being able to manage Flink
> > applications'
> > > > > > lifecycles with kubectl. Submitting jobs from cli sounds not
> > aligned
> > > > with
> > > > > > this motivation.
> > > > > > I think it's probably worth it to support submitting jobs via
> > > kubectl &
> > > > > CR
> > > > > > in the first version, both together with deploying the cluster
> like
> > > in
> > > > > > perjob/application mode and after deploying the cluster like in
> > > session
> > > > > > mode.
> > > > > >
> > > > >
> > > > > The intention is to support application management through operator
> > and
> > > > CR,
> > > > > which means there won't be any 2 step submission process, which as
> > you
> > > > > allude to would defeat the purpose of this project. The CR example
> > > shows
> > > > > the application part. Please note that the bare cluster support is
> an
> > > > > *additional* feature for scenarios that require external job
> > > management.
> > > > Is
> > > > > there anything on the FLIP page that creates a different
> impression?
> > > > >
> > > > >
> > > > > >
> > > > > > ## Versioning
> > > > > > Which Flink versions does the operator plan to support?
> > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > 3. Pod template support was introduced in Flink 1.13
> > > > > > > 4. There were some changes to the Flink Docker image entrypoint
> > > > > > > script in, IIRC, Flink 1.13
> > > > > >
> > > > >
> > > > > Great, thanks for providing this. It is important for the
> > compatibility
> > > > > going forward also. We are targeting Flink 1.14.x upwards. Before
> the
> > > > > operator is ready there will be another Flink release. Let's see if
> > > > anyone
> > > > > is interested in earlier versions?
> > > > >
> > > > >
> > > > > >
> > > > > > ## Compatibility
> > > > > > > What kind of API compatibility can we commit to? It's probably fine to
> > > > > > > have alpha / beta version APIs that allow incompatible future changes
> > > > > > > for the first version. But eventually we would need to guarantee
> > > > > > > backwards compatibility, so that a CR written for an early version
> > > > > > > still works with a newer version of the operator.
> > > > > >
> > > > >
> > > > > Another great point and please let me include that on the FLIP
> page.
> > > ;-)
> > > > >
> > > > > I think we should allow incompatible changes for the first one or
> two
> > > > > versions, similar to how other major features have evolved
> recently,
> > > such
> > > > > as FLIP-27.
> > > > >
> > > > > Would be great to get broader feedback on this one.
> > > > >
> > > > > Cheers,
> > > > > Thomas
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <th...@apache.org>
> > wrote:
> > > > > >
> > > > > > > Thanks for the feedback!
> > > > > > >
> > > > > > > >
> > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > Maybe we should make this more clear in the FLIP but we
> agreed
> > to
> > > > do
> > > > > > the
> > > > > > > > first version of the operator based on the native
> integration.
> > > > > > > > While this clearly does not cover all use-cases and
> > requirements,
> > > > it
> > > > > > > seems
> > > > > > > > this would lead to a much smaller initial effort and a nicer
> > > first
> > > > > > > version.
> > > > > > > >
> > > > > > >
> > > > > > > I'm also leaning towards the native integration, as long as it
> > > > reduces
> > > > > > the
> > > > > > > MVP effort. Ultimately the operator will need to also support
> the
> > > > > > > standalone mode. I would like to gain more confidence that
> native
> > > > > > > integration reduces the effort. While it cuts the effort to
> > handle
> > > > the
> > > > > TM
> > > > > > > pod creation, some mapping code from the CR to the native
> > > integration
> > > > > > > client and config needs to be created. As mentioned in the
> FLIP,
> > > > native
> > > > > > > integration requires the Flink job manager to have access to
> the
> > > k8s
> > > > > API
> > > > > > to
> > > > > > > create pods, which in some scenarios may be seen as
> unfavorable.
> > > > > > >
> > > > > > >  > > > # Pod Template
> > > > > > > > > > Is the pod template in CR same with what Flink has
> already
> > > > > > > > supported[4]?
> > > > > > > > > > Then I am afraid not the arbitrary field(e.g. cpu/memory
> > > > > resources)
> > > > > > > > could
> > > > > > > > > > take effect.
> > > > > > >
> > > > > > > Yes, pod template would look almost identical. There are a few
> > > > settings
> > > > > > > that the operator will control (and that may need to be
> > > blacklisted),
> > > > > but
> > > > > > > in general we would not want to place restrictions. I think a
> > > > mechanism
> > > > > > > where a pod template is merged from multiple layers would also
> be
> > > > > > > interesting to make this more flexible.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Thomas
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Danny Cranmer <da...@apache.org>.
I have been discussing this one with my team. We are interested in the
Standalone mode, and are willing to contribute towards the implementation.
Potentially we can work together to support both modes in parallel?

Thanks,

On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <gy...@gmail.com> wrote:

> Hi Danny!
>
> Thanks for the feedback :)
>
> Versioning:
> Versioning will be independent from Flink and the operator will depend on a
> fixed flink version (in every given operator version).
> This should be the exact same setup as with Stateful Functions (
> https://github.com/apache/flink-statefun). So independent release cycle
> but
> still within the Flink umbrella.
>
> Deployment error handling:
> I think that's a very good point, as general exception handling for the
> different failure scenarios is a tricky problem. I think the exception
> classifiers and retry strategies could avoid a lot of manual intervention
> from the user. We will definitely need to add something like this. Once we
> have the repo created with the initial operator code we should open some
> tickets for this and put it on the short term roadmap!
>
> Cheers,
> Gyula
>
> On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <da...@apache.org>
> wrote:
>
> > Hey team,
> >
> > Great work on the FLIP, I am looking forward to this one. I agree that we
> > can move forward to the voting stage.
> >
> > I have general feedback around how we will handle job submission failure
> > and retry. As discussed in the Rejected Alternatives section, we can use
> > Java to handle job submission failures from the Flink client. It would be
> > useful to have the ability to configure exception classifiers and retry
> > strategy as part of operator configuration.
> >
> > Given this will be in a separate GitHub repository I am curious how the
> > versioning strategy will work in relation to the Flink version? Do we
> have
> > any other components with a similar setup I can look at? Will the
> operator
> > version track Flink or will it use its own versioning strategy with a
> Flink
> > version support matrix, or similar?
> >
> > Thanks,
> >
> >
> >
> > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <ba...@gmail.com>
> > wrote:
> >
> > > Hi team,
> > >
> > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > accordingly. If you are comfortable with the currently existing design
> > and
> > > depth in the FLIP [1] I suggest moving forward to the voting stage -
> once
> > > that reaches a positive conclusion it lets us create the separate code
> > > repository under the flink project for the operator.
> > >
> > > I encourage everyone to keep improving the details in the meantime,
> > however
> > > I believe given the existing design and the general sentiment on this
> > > thread that the most efficient path from here is starting the
> > > implementation so that we can collectively iterate over it.
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > >
> > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <th...@apache.org> wrote:
> > >
> > > > > Hi Xintong,
> > > >
> > > > Thanks for the feedback and please see responses below -->
> > > >
> > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <tonysong820@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Thanks Thomas for drafting this FLIP, and everyone for the
> > discussion.
> > > > >
> > > > > I also have a few questions and comments.
> > > > >
> > > > > ## Job Submission
> > > > > Deploying a Flink session cluster via kubectl & CR and then
> > submitting
> > > > jobs
> > > > > to the cluster via Flink cli / REST is probably the approach that
> > > > requires
> > > > > the least effort. However, I'd like to point out 2 weaknesses.
> > > > > 1. A lot of users use Flink in perjob/application modes. For these
> > > users,
> > > > > having to run the job in two steps (deploy the cluster, and submit
> > the
> > > > job)
> > > > > is not that convenient.
> > > > > 2. One of our motivations is being able to manage Flink
> applications'
> > > > > lifecycles with kubectl. Submitting jobs from cli sounds not
> aligned
> > > with
> > > > > this motivation.
> > > > > I think it's probably worth it to support submitting jobs via
> > kubectl &
> > > > CR
> > > > > in the first version, both together with deploying the cluster like
> > in
> > > > > perjob/application mode and after deploying the cluster like in
> > session
> > > > > mode.
> > > > >
> > > >
> > > > The intention is to support application management through operator
> and
> > > CR,
> > > > which means there won't be any 2 step submission process, which as
> you
> > > > allude to would defeat the purpose of this project. The CR example
> > shows
> > > > the application part. Please note that the bare cluster support is an
> > > > *additional* feature for scenarios that require external job
> > management.
> > > Is
> > > > there anything on the FLIP page that creates a different impression?
> > > >
> > > >
> > > > >
> > > > > ## Versioning
> > > > > Which Flink versions does the operator plan to support?
> > > > > 1. Native K8s deployment was firstly introduced in Flink 1.10
> > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > 4. There was some changes to the Flink docker image entrypoint
> script
> > > in,
> > > > > IIRC, Flink 1.13
> > > > >
> > > >
> > > > Great, thanks for providing this. It is important for the
> compatibility
> > > > going forward also. We are targeting Flink 1.14.x upwards. Before the
> > > > operator is ready there will be another Flink release. Let's see if
> > > anyone
> > > > is interested in earlier versions?
> > > >
> > > >
> > > > >
> > > > > ## Compatibility
> > > > > What kind of API compatibility we can commit to? It's probably fine
> > to
> > > > have
> > > > > alpha / beta version APIs that allow incompatible future changes
> for
> > > the
> > > > > first version. But eventually we would need to guarantee backwards
> > > > > compatibility, so that an early version CR can work with a new
> > version
> > > > > operator.
> > > > >
> > > >
> > > > Another great point and please let me include that on the FLIP page.
> > ;-)
> > > >
> > > > I think we should allow incompatible changes for the first one or two
> > > > versions, similar to how other major features have evolved recently,
> > such
> > > > as FLIP-27.
> > > >
> > > > Would be great to get broader feedback on this one.
> > > >
> > > > Cheers,
> > > > Thomas
> > > >
> > > >
> > > >
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <th...@apache.org>
> wrote:
> > > > >
> > > > > > Thanks for the feedback!
> > > > > >
> > > > > > >
> > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > Maybe we should make this more clear in the FLIP but we agreed
> to
> > > do
> > > > > the
> > > > > > > first version of the operator based on the native integration.
> > > > > > > While this clearly does not cover all use-cases and
> requirements,
> > > it
> > > > > > seems
> > > > > > > this would lead to a much smaller initial effort and a nicer
> > first
> > > > > > version.
> > > > > > >
> > > > > >
> > > > > > I'm also leaning towards the native integration, as long as it
> > > reduces
> > > > > the
> > > > > > MVP effort. Ultimately the operator will need to also support the
> > > > > > standalone mode. I would like to gain more confidence that native
> > > > > > integration reduces the effort. While it cuts the effort to
> handle
> > > the
> > > > TM
> > > > > > pod creation, some mapping code from the CR to the native
> > integration
> > > > > > client and config needs to be created. As mentioned in the FLIP,
> > > native
> > > > > > integration requires the Flink job manager to have access to the
> > k8s
> > > > API
> > > > > to
> > > > > > create pods, which in some scenarios may be seen as unfavorable.
> > > > > >
> > > > > >  > > > # Pod Template
> > > > > > > > > Is the pod template in CR same with what Flink has already
> > > > > > > supported[4]?
> > > > > > > > > Then I am afraid not the arbitrary field(e.g. cpu/memory
> > > > resources)
> > > > > > > could
> > > > > > > > > take effect.
> > > > > >
> > > > > > Yes, pod template would look almost identical. There are a few
> > > settings
> > > > > > that the operator will control (and that may need to be
> > blacklisted),
> > > > but
> > > > > > in general we would not want to place restrictions. I think a
> > > mechanism
> > > > > > where a pod template is merged from multiple layers would also be
> > > > > > interesting to make this more flexible.
> > > > > >
> > > > > > Cheers,
> > > > > > Thomas
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Gyula Fóra <gy...@gmail.com>.
Hi Danny!

Thanks for the feedback :)

Versioning:
Versioning will be independent of Flink, and each operator version will
depend on a fixed Flink version.
This should be exactly the same setup as with Stateful Functions (
https://github.com/apache/flink-statefun): an independent release cycle, but
still under the Flink umbrella.

Deployment error handling:
I think that's a very good point, as general exception handling for the
different failure scenarios is a tricky problem. I think the exception
classifiers and retry strategies could avoid a lot of manual intervention
from the user. We will definitely need to add something like this. Once we
have the repo created with the initial operator code, we should open some
tickets for this and put it on the short-term roadmap!

Cheers,
Gyula

On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <da...@apache.org>
wrote:

> Hey team,
>
> Great work on the FLIP, I am looking forward to this one. I agree that we
> can move forward to the voting stage.
>
> I have general feedback around how we will handle job submission failure
> and retry. As discussed in the Rejected Alternatives section, we can use
> Java to handle job submission failures from the Flink client. It would be
> useful to have the ability to configure exception classifiers and retry
> strategy as part of operator configuration.
>
> Given this will be in a separate GitHub repository, I am curious how the
> versioning strategy will work in relation to the Flink version? Do we have
> any other components with a similar setup I can look at? Will the operator
> version track Flink or will it use its own versioning strategy with a Flink
> version support matrix, or similar?
>
> Thanks,

Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Posted by Danny Cranmer <da...@apache.org>.
Hey team,

Great work on the FLIP, I am looking forward to this one. I agree that we
can move forward to the voting stage.

I have general feedback around how we will handle job submission failure
and retry. As discussed in the Rejected Alternatives section, we can use
Java to handle job submission failures from the Flink client. It would be
useful to have the ability to configure exception classifiers and retry
strategy as part of operator configuration.
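
To make the suggestion a little more concrete, here is a minimal sketch of
what a pluggable exception classifier plus a bounded retry loop might look
like. Everything in it is an assumption for illustration only: the names
SubmissionRetrySketch, FailureKind, classify and submitWithRetry are
hypothetical and are not part of Flink or of the FLIP, and the rule "I/O
errors are retryable, everything else is fatal" is just a placeholder for
whatever the operator configuration would eventually express.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Function;

/** Hypothetical sketch; none of these names come from Flink or the FLIP. */
final class SubmissionRetrySketch {

    /** How a failed submission attempt should be treated. */
    enum FailureKind { RETRYABLE, FATAL }

    /** Placeholder classifier: an IOException anywhere in the cause chain is retryable. */
    static FailureKind classify(Throwable t) {
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            if (cause instanceof IOException) {
                return FailureKind.RETRYABLE;
            }
        }
        return FailureKind.FATAL;
    }

    /**
     * Runs the action and retries it with a fixed delay while the classifier
     * reports RETRYABLE, giving up after maxAttempts attempts in total.
     */
    static <T> T submitWithRetry(
            Callable<T> action,
            Function<Throwable, FailureKind> classifier,
            int maxAttempts,
            Duration delay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Fatal failures and the last attempt surface the original exception.
                if (classifier.apply(e) == FailureKind.FATAL || attempt == maxAttempts) {
                    throw e;
                }
                Thread.sleep(delay.toMillis());
            }
        }
        throw new IllegalStateException("unreachable");
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated submission that fails twice with a transient error.
        String result = submitWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("REST endpoint not ready");
            }
            return "job-submitted";
        }, SubmissionRetrySketch::classify, 5, Duration.ofMillis(100));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Passing the classifier in as a function is what would make it configurable:
the operator could choose a different classifier, attempt limit or delay per
deployment without touching the retry loop itself.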

Given this will be in a separate GitHub repository, I am curious how the
versioning strategy will work in relation to the Flink version? Do we have
any other components with a similar setup I can look at? Will the operator
version track Flink or will it use its own versioning strategy with a Flink
version support matrix, or similar?

Thanks,



On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <ba...@gmail.com>
wrote:

> Hi team,
>
> Thank you for the great feedback, Thomas has updated the FLIP page
> accordingly. If you are comfortable with the currently existing design and
> depth in the FLIP [1] I suggest moving forward to the voting stage - once
> that reaches a positive conclusion it lets us create the separate code
> repository under the flink project for the operator.
>
> I encourage everyone to keep improving the details in the meantime, however
> I believe given the existing design and the general sentiment on this
> thread that the most efficient path from here is starting the
> implementation so that we can collectively iterate over it.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator