Posted to dev@yunikorn.apache.org by Chenya Zhang <ch...@gmail.com> on 2022/01/05 02:35:26 UTC

YuniKorn for Streaming Use Cases

Hey folks,

We have some new streaming use cases with Apache Flink that could
potentially leverage YuniKorn for resource scheduling.

The initial implementation uses K8s namespaces for resource quota
management. We are investigating what strong benefits there could be in
switching to YuniKorn for streaming cases with long-running services. For
example, job queueing, job ordering, resource reservation, user groups, etc.
all seem to be more desirable for batch use cases.

Any thoughts or suggestions?

Thanks,
Chenya

Re: YuniKorn for Streaming Use Cases

Posted by Chenya Zhang <ch...@gmail.com>.

** Corrections: Apache YuniKorn meetup :) **

On Wed, Jan 5, 2022 at 4:56 PM Chenya Zhang <ch...@gmail.com>
wrote:

> Hi Weiwei, thanks for sharing your past experience! This is a helpful
> discussion.
>
> We should set up some dedicated discussions and topic threads for
> "Streaming with Apache YuniKorn". I know a lot of folks from the industry
> would be interested. This would be a great opportunity to expand YuniKorn's
> footprint to more use case scenarios.
>
> In our next Apache Flink meetup, I could help to invite some speakers
> (please feel free to recommend any) and organize a roundtable for
> streaming-specific discussions so folks could share their experience/needs
> to identify any gaps for future improvement together.
>
> Please let me know what you think. +devs
>
> Best,
> Chenya

Re: YuniKorn for Streaming Use Cases

Posted by Chenya Zhang <ch...@gmail.com>.

Hi Weiwei, thanks for sharing your past experience! This is a helpful
discussion.

We should set up some dedicated discussions and topic threads for
"Streaming with Apache YuniKorn". I know a lot of folks from the industry
would be interested. This would be a great opportunity to expand YuniKorn's
footprint to more use case scenarios.

In our next Apache Flink meetup, I could help to invite some speakers
(please feel free to recommend any) and organize a roundtable for
streaming-specific discussions so folks could share their experience/needs
to identify any gaps for future improvement together.

Please let me know what you think. +devs

Best,
Chenya



Re: YuniKorn for Streaming Use Cases

Posted by Weiwei Yang <ww...@apache.org>.

hi Chenya

> As we know, streaming applications are long-running and need to secure all
> requested resources before starting to run. In most cases, they do not have
> a strong need to be queued, ordered, or preempted to wait to obtain or give
> back their resource.

You are right if we assume pure streaming cases, all long-running jobs, and
a cluster with sufficient resources for all jobs. Maybe it is fair to say it
is not a day 1 challenge.
However, in my past experience, that assumption does not always hold, and
over time it will not. When we operated large-scale Flink jobs, the major
issues we dealt with were resource utilization, resource contention, hot
spots, isolation, etc. We used to have tens of queues per cluster, shared by
many users. Jobs had different priorities, and high-priority jobs could make
room by preempting lower-priority ones. We had a customized node-score
system to distribute pods more efficiently. As you can see, resource queues,
app-sorting, node-sorting, and preemption all play a role here. Central job
management and scheduling latency/throughput are also important.
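
To make the "high-priority jobs make room by preempting lower-priority ones"
idea concrete, here is a minimal sketch. It is not YuniKorn's preemption
algorithm; the Job class, the pick_victims helper, the job names, and the
"larger number means higher priority" convention are all assumptions made
up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # assumed convention: larger value = more important
    cpu: int       # cores requested, all-or-nothing


def pick_victims(running, incoming, free_cpu):
    """Return the running jobs to preempt so `incoming` fits, or None.

    Only jobs with strictly lower priority than `incoming` are candidates,
    and the lowest-priority candidates are preempted first.
    """
    needed = incoming.cpu - free_cpu
    if needed <= 0:
        return []  # fits without preempting anything
    candidates = sorted(
        (j for j in running if j.priority < incoming.priority),
        key=lambda j: j.priority,
    )
    victims = []
    for job in candidates:
        victims.append(job)
        needed -= job.cpu
        if needed <= 0:
            return victims
    return None  # preempting every lower-priority job is still not enough


running = [Job("etl-low", 1, 4), Job("report-mid", 5, 4), Job("fraud-high", 9, 8)]
incoming = Job("alerts-high", 8, 6)
victims = pick_victims(running, incoming, free_cpu=0)
print([j.name for j in victims])  # ['etl-low', 'report-mid']
```

A real scheduler would also weigh queue guarantees, node placement, and the
cost of restarting a long-running streaming job before choosing victims.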

Running on K8s and in the cloud brings more challenges. One thing that is
both challenging and interesting is how to do auto-scaling more efficiently.
Sometimes we need a strategy to warm up resources in the cloud in order to
fit new jobs with low latency. Most likely the scheduler can give some hints
for that. This will be a fun part to explore too. With all that being said,
I do think a customized scheduler (instead of the pod-level scheduler,
default-k8s-scheduler) will be necessary.

Re: YuniKorn for Streaming Use Cases

Posted by Chenya Zhang <ch...@gmail.com>.

Hi Weiwei

Thanks for sharing. I checked the video and for Alibaba's use case, they
have a mixed cluster for streaming and batch applications running with
Apache Flink. Our use case is different. We only use Apache Flink for
stream processing in physical clusters separate from Spark for batch
processing.

As we know, streaming applications are long-running and need to secure all
requested resources before starting to run. In most cases, they do not have
a strong need to be queued, ordered, or preempted to wait to obtain or give
back their resource.

I'm gathering more streaming use case requirements that cannot be
satisfied by K8s namespaces for resource quota management, along with other
advanced scheduling needs. Will keep this thread updated.
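
One requirement of this kind can be sketched as the difference between a
hard quota and a queue. The toy model below is an illustration only, not
the K8s or YuniKorn implementation: NamespaceQuota and SchedulerQueue are
hypothetical classes, and the model assumes a single CPU dimension and FIFO
ordering. It shows a request over the limit being rejected in one case and
held as pending in the other.

```python
from collections import deque


class NamespaceQuota:
    """Hard cap: a request that would exceed the quota is rejected outright."""

    def __init__(self, cpu_limit):
        self.cpu_limit = cpu_limit
        self.used = 0

    def submit(self, name, cpu):
        if self.used + cpu > self.cpu_limit:
            return "rejected"
        self.used += cpu
        return "running"


class SchedulerQueue:
    """Same cap, but requests over the limit wait in FIFO order."""

    def __init__(self, cpu_limit):
        self.cpu_limit = cpu_limit
        self.used = 0
        self.pending = deque()

    def submit(self, name, cpu):
        # Keep FIFO order: never jump ahead of jobs that are already waiting.
        if self.pending or self.used + cpu > self.cpu_limit:
            self.pending.append((name, cpu))
            return "pending"
        self.used += cpu
        return "running"

    def release(self, cpu):
        """Free capacity, then start waiting jobs in order while they fit."""
        self.used -= cpu
        started = []
        while self.pending and self.used + self.pending[0][1] <= self.cpu_limit:
            name, need = self.pending.popleft()
            self.used += need
            started.append(name)
        return started


quota = NamespaceQuota(cpu_limit=10)
print(quota.submit("job-a", 8), quota.submit("job-b", 4))

queue = SchedulerQueue(cpu_limit=10)
print(queue.submit("job-a", 8), queue.submit("job-b", 4))
print(queue.release(8))  # job-a finishes, so job-b can start
```

With the hard cap the submitter has to retry job-b; with the queue the
scheduler keeps it and starts it once capacity is released.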

Meanwhile, happy to hear more thoughts from you!

Best,
Chenya

Re: YuniKorn for Streaming Use Cases

Posted by Weiwei Yang <ww...@apache.org>.

Hi Chenya

The use case is similar, and YK will play a big role there. Lots of features
are relevant, such as queues, job ordering, user/group ACLs, preemption,
over-subscription, performance, etc.
Some of the basic functionality is available in YK; some more needs to be
built.
Please take a look at the slides from the Alibaba Flink team; they have
shared how they use YK to address their use cases.
This was presented at ApacheCon:
https://www.youtube.com/watch?v=4hghJCuZk5M
