You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Chad Dombrova <ch...@gmail.com> on 2019/10/19 18:54:43 UTC

Beam on Flink with "active" k8s support

Hi all,
I've been following the Jira issue on Flink "active" k8s support
(autoscaling based on task resource requirements, IIUC) and there has been
a lot of activity there lately.  There are two design docs [2][3] from
different teams and it seems like some good collaboration is going on to
reconcile the differences and get the feature implemented.

At the Beam Summit I saw the great presentations by Thomas and Micah on the
work that Lyft has been doing to run Flink and Beam on k8s.  I'm curious if
these various features and approaches can be made to work with each other,
and if so, what it would take to do so.

thanks,
-chad

[1] https://issues.apache.org/jira/browse/FLINK-9953
[2]
https://docs.google.com/document/d/1-jNzqGF6NfZuwVaFICoFQ5HFFXzF5NVIagUZByFMfBY/edit
[3]
https://docs.google.com/document/d/1Or1hmDXABk1Q3ToITMi0HoZyHLs2nZ6gk7E0y2sEe28/edit#heading=h.ubqno6n4vwrl

Re: Beam on Flink with "active" k8s support

Posted by Thomas Weise <th...@apache.org>.

Hi Chad,

Thanks for bringing up this discussion. I think for most of it the Flink
list would be the better suited place, but given that there are more and
more Beam users interested to deploy on Beam on Flink on k8s, I will leave
the placement to you ;-)

As for the differences between FlinkK8sOperator and the proposals that you
linked, the motivation section in [2] covers that a bit. With
FlinkK8sOperator the deployment indeed looks more or less like "any other
k8s deployment". Within the Lyft infrastructure it is also automated
through the CI and there is a bit of tooling for development so it is
usually not necessary to work with kubectl. Flink cluster deployment
specifics and other details are encapsulated in the operator. This was a
deliberate choice based on our experience of previously running Flink in a
way that exposes too many deployment details to the user. FlinkK8sOperator
instead provides the control plane that should allow users to focus on
their application (vs. how it is being deployed). An important
consideration are upgrades, so an "application" is really a sequence of
jobs. Please see state machine:

https://github.com/lyft/flinkk8soperator/blob/master/docs/state_machine.md

Since I'm not interested to work with flink command line to manage
production environments, making Flink on k8s look like Flink on Yarn or any
other Flink would be a non-goal.

The more interesting part is the active resource management; I suspect you
were thinking of that? Currently all TM pods have the same size and they
are allocated upfront. Rescaling a Flink job requires the update management
FlinkK8sOperator provides. The job needs to be stopped with a savepoint and
restarted with changed parallelism (because that's how Flink works at its
core). That would not change, even if the job was to (actively) manage the
k8s resources. Dynamic scaling with the operator would mean that a
parallelism configuration change would be triggered automatically.

A consequence of resources being allocated by Flink on demand would be that
different pod sizes would become possible, based on the exact shape of
topology. I think there is a proposal to support such resource allocation
in Flink in the future. Note that this is not necessarily an advantage
though. If the resource requests cannot be filled fast enough, you have
more downtime. That's why with the operator, we only stop the old job when
the new cluster is already deployed and ready to run the replace job.

HTH,
Thomas

On Sat, Oct 19, 2019 at 11:55 AM Chad Dombrova <ch...@gmail.com> wrote:

> Hi all,
> I've been following the Jira issue on Flink "active" k8s support
> (autoscaling based on task resource requirements, IIUC) and there has been
> a lot of activity there lately.  There are two design docs [2][3] from
> different teams and it seems like some good collaboration is going on to
> reconcile the differences and get the feature implemented.
>
> At the Beam Summit I saw the great presentations by Thomas and Micah on
> the work that Lyft has been doing to run Flink and Beam on k8s.  I'm
> curious if these various features and approaches can be made to work with
> each other, and if so, what it would take to do so.
>
> thanks,
> -chad
>
> [1] https://issues.apache.org/jira/browse/FLINK-9953
> [2]
> https://docs.google.com/document/d/1-jNzqGF6NfZuwVaFICoFQ5HFFXzF5NVIagUZByFMfBY/edit
> [3]
> https://docs.google.com/document/d/1Or1hmDXABk1Q3ToITMi0HoZyHLs2nZ6gk7E0y2sEe28/edit#heading=h.ubqno6n4vwrl
>
>