Posted to dev@flink.apache.org by "Sharma, Sanskriti, Rakesh" <sa...@bu.edu> on 2022/08/23 15:44:04 UTC

Energy/performance research questions

Hi everyone,


We are a team of researchers at Boston University investigating the energy and performance behavior of open-source stream processing platforms. We have started looking into Flink and wanted to reach out to the community to see whether anyone has tried to optimize the underlying OS/VM/container to improve energy efficiency or performance.


Some of the specific aspects we would like to explore include the following: What Linux kernel configurations are used? Has any OS tuning been done? What workloads are used to evaluate performance/efficiency, both for tuning and more generally to evaluate the impact of changes to either the software or hardware? What is considered a baseline network setup, with respect to both hardware and software? Has anyone investigated which cpufreq governor policy is used (https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt)?
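
For anyone who wants to check the cpufreq governor question on their own hosts, here is a minimal sketch that reads the active governor per CPU. It assumes the standard Linux cpufreq sysfs layout (/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor); the helper name read_governors is hypothetical, and inside a VM or container these files may simply be absent, in which case the result is empty.

```python
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq in sysfs.

    Assumes the standard Linux layout; returns an empty dict when the
    files do not exist (non-Linux hosts, many VMs and containers).
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


if __name__ == "__main__":
    for cpu, governor in read_governors().items():
        print(f"{cpu}: {governor}")
```

Recording this alongside each benchmark run would at least make the governor a known variable rather than an implicit one.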


It would be especially helpful to hear from people running Flink in production or offering it as a service.

Thank you!

Sana


Re: [DISCUSS] Re: Energy/performance research questions

Posted by Piotr Nowojski <pn...@apache.org>.
Hi Sana,

> Something our group is considering is how the OS network stack affects a
> distributed application like Flink.

To the best of my knowledge, in a healthy cluster, such low-level overheads
are very rarely a problem for Flink. There are much more prominent
overheads in other places. First and foremost, state accesses (especially
with RocksDB) are usually the dominant performance factor. If that's not
the case, the user code in Flink jobs tends to be quite compute-intensive
per record. However, even if we assume very lightweight jobs (no-ops) with
negligible state backend usage (or none), the nature of stream processing,
where you handle each record individually, one by one, creates overheads
in the JVM, like many virtual calls per record. In my experience, that
would overshadow any changes in the OS-level network stack.

If you want to do some more research in this direction, you can probably
validate any changes quite easily against the network stack benchmarks
[1][2] from the benchmark repository that I mentioned above. They run
Flink's network stack in isolation (record serialisation, handover to
Netty, network transfer, handover back to Flink, record deserialisation)
without actually running a Flink job, so if they don't show a measurable
performance difference, it's almost impossible for that change to show up
in the real world. However, since this benchmark does not run the whole of
Flink, even large performance improvements (like 30%) are in most cases
watered down in the real world (for example, down to a couple of %).
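
As a rough illustration of that watering-down effect, here is a minimal sketch using Amdahl's law. The 10% share of end-to-end time attributed to the network stack is an assumed figure chosen for the example, not a measurement from Flink, and the function name is hypothetical.

```python
def end_to_end_speedup(component_fraction, component_speedup):
    """Amdahl's law: overall speedup when only one component gets faster.

    component_fraction: share of total time spent in the component (0..1).
    component_speedup:  how much faster that component became (1.3 = 30%).
    """
    if not 0.0 <= component_fraction <= 1.0:
        raise ValueError("component_fraction must be within [0, 1]")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    remaining = 1.0 - component_fraction
    return 1.0 / (remaining + component_fraction / component_speedup)


if __name__ == "__main__":
    # Assumption: the network stack is 10% of end-to-end time, and the
    # isolated benchmark shows it getting 30% faster.
    overall = end_to_end_speedup(0.10, 1.30)
    print(f"end-to-end gain: {(overall - 1.0) * 100:.1f}%")
```

Under those assumed numbers the end-to-end gain comes out at about 2.4%, which is in line with the "couple of %" mentioned above.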

Best,
Piotrek

[1]
https://github.com/apache/flink-benchmarks/blob/master/src/test/java/org/apache/flink/benchmark/StreamNetworkThroughputBenchmarkExecutor.java
[2]
http://codespeed.dak8s.net:8000/timeline/?ben=networkThroughput.1000,100ms&env=2

On Tue, 6 Sep 2022 at 21:50, Sharma, Sanskriti, Rakesh <sa...@bu.edu> wrote:

> Hi Piotr,
>
> Thank you so much for responding. We will look into those benchmarks.
>
> Something our group is considering is how the OS network stack affects a
> distributed application like Flink.
>
> Thank you!
> Sana

[DISCUSS] Re: Energy/performance research questions

Posted by "Sharma, Sanskriti, Rakesh" <sa...@bu.edu>.
Hi Piotr,

Thank you so much for responding. We will look into those benchmarks.

Something our group is considering is how the OS network stack affects a distributed application like Flink.

Thank you!
Sana

________________________________
From: Piotr Nowojski <pn...@apache.org>
Sent: Friday, August 26, 2022 4:16 AM
To: dev <de...@flink.apache.org>
Subject: Re: Energy/performance research questions

Hi Sana,

I don't have much to offer. I haven't heard of anyone doing any work
directly towards energy efficiency per se, but indirectly, yes. I have seen
companies optimising the performance of their workloads, with the ultimate
goal of assigning fewer resources to a cluster in order to save on a
limited electricity budget in their data centers.

From the Open Source perspective, we are trying to optimize Apache Flink,
fix performance bottlenecks and fight against performance regressions. To
this end we primarily rely on our set of micro benchmarks [1] and
occasional cluster-level macro benchmarks, either with some artificial
jobs or, for example, the TPC-DS benchmark suite.

> What Linux kernel configurations are used? Has any OS tuning been done?
> if anyone has tried to optimize the underlying OS/VM/container to achieve
> these outcomes.

I don't remember those topics coming up in discussions around performance.
My best guess is that the teams managing the hardware or containers are
very far away from the teams that actually touch Apache Flink in any way.
Often, for example, teams using Apache Flink have no guarantees or
knowledge about the environment. My other guess is that there is more
low-hanging fruit to pick first before touching those lower layers. But I
might be wrong and would be happy to learn something :)

Do you maybe have some suggestions? What things would you expect us to try
out in the future?

Best,
Piotrek

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847


Re: Energy/performance research questions

Posted by Piotr Nowojski <pn...@apache.org>.
Hi Sana,

I don't have much to offer. I haven't heard of anyone doing any work
directly towards energy efficiency per se, but indirectly, yes. I have seen
companies optimising the performance of their workloads, with the ultimate
goal of assigning fewer resources to a cluster in order to save on a
limited electricity budget in their data centers.

From the Open Source perspective, we are trying to optimize Apache Flink,
fix performance bottlenecks and fight against performance regressions. To
this end we primarily rely on our set of micro benchmarks [1] and
occasional cluster-level macro benchmarks, either with some artificial
jobs or, for example, the TPC-DS benchmark suite.

> What Linux kernel configurations are used? Has any OS tuning been done?
> if anyone has tried to optimize the underlying OS/VM/container to achieve
> these outcomes.

I don't remember those topics coming up in discussions around performance.
My best guess is that the teams managing the hardware or containers are
very far away from the teams that actually touch Apache Flink in any way.
Often, for example, teams using Apache Flink have no guarantees or
knowledge about the environment. My other guess is that there is more
low-hanging fruit to pick first before touching those lower layers. But I
might be wrong and would be happy to learn something :)

Do you maybe have some suggestions? What things would you expect us to try
out in the future?

Best,
Piotrek

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115511847
