Posted to user@beam.apache.org by "Vergilio, Thalita" <T....@leedsbeckett.ac.uk> on 2018/10/11 20:36:10 UTC

The return of the Beam vs. Flink SDK experiments!

Hello Beamers,


You'll be pleased to know that I've made some progress in my Beam/Flink SDK comparison exercise (maybe now she'll go away...). However, it would be great if those of you who are more familiar with Beam and Flink could cast your eyes over my new code snippets and let me know if you can spot any massive discrepancies.


Following the advice from some of you at the Beam Summit last week (special thanks to Reuven for taking the time to look into this issue with me), I have made the following changes:

  1.  Used a KeyedDeserializationSchema implementation in my Flink example, instead of the JSONDeserializationSchema that eagerly deserialised the whole object before it was needed (a sketch of what I mean follows this list).
  2.  Simplified my data model so the records read from Kafka are non-keyed and deserialise from byte[] to primitives.
  3.  Removed all JSON serialisation and deserialisation from this example.
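
To illustrate item 1, here is a minimal sketch of the kind of KeyedDeserializationSchema I mean. The payload format (a UTF-8 encoded number) and the class name are just for illustration; the actual implementation is in the gists below.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

public class EnergyReadingDeserializationSchema implements KeyedDeserializationSchema<Double> {

    @Override
    public Double deserialize(byte[] messageKey, byte[] message,
                              String topic, int partition, long offset) throws IOException {
        // Only decode the value bytes into a primitive; nothing is deserialised eagerly.
        return Double.valueOf(new String(message, StandardCharsets.UTF_8));
    }

    @Override
    public boolean isEndOfStream(Double nextElement) {
        return false; // the stream is unbounded
    }

    @Override
    public TypeInformation<Double> getProducedType() {
        return BasicTypeInfo.DOUBLE_TYPE_INFO;
    }
}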

Both snippets now run much faster, and the performance is equivalent. This is much closer to what I was originally expecting, but the question remains: am I as close to a like-for-like comparison as I can get? Here are the code snippets, amended:

https://gist.github.com/tvergilio/fbb2e855e3d32de223d171d91fd1ec1e

https://gist.github.com/tvergilio/453638bd7cdd28a808ee103775b1fae5


And here is a recap of the experiment (if you care to know the details):

My aim is to write two pipelines which are equivalent, one using Beam, and the other using the Flink SDK. I will then run them in the same Flink cluster to measure the assumed cost of using an abstraction such as Beam to ensure portability of the code (as opposed to using the Flink SDK directly). I am using containerised Flink running on cloud-based nodes.


Procedure:

Two measurements are emitted to two different Kafka topics simultaneously: the total energy consumption of a server room, and the energy consumption of the IT equipment in that room. This data is emitted for 5 minutes at a time, at different velocities (every minute, every second, every millisecond, etc), and the performance of the task managers in terms of container CPU, memory and network usage is measured.
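
For reference, the data generation is conceptually along these lines. The topic names, broker address, interval and reading values below are placeholders for illustration, not my actual setup.

import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EnergyReadingProducer {

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        long intervalMillis = 1000L;                        // e.g. one reading per second
        long runForMillis = TimeUnit.MINUTES.toMillis(5);   // emit for 5 minutes

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long start = System.currentTimeMillis();
            while (System.currentTimeMillis() - start < runForMillis) {
                // Total room energy and IT equipment energy, emitted simultaneously.
                producer.send(new ProducerRecord<>("total-energy", readTotalEnergy()));
                producer.send(new ProducerRecord<>("it-energy", readItEnergy()));
                Thread.sleep(intervalMillis);
            }
        }
    }

    private static String readTotalEnergy() {
        return String.valueOf(100 + Math.random() * 20);    // placeholder reading
    }

    private static String readItEnergy() {
        return String.valueOf(60 + Math.random() * 10);     // placeholder reading
    }
}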


The processing pipeline calculates the Power Usage Effectiveness (PUE) of the room. This is simply the total energy consumed divided by the IT energy consumed. The PUE is calculated for a given window of time, so there must be some notion of per-window state (or else all the records of a window must have arrived before the entire calculation is executed). I have used the exact same calculation for the PUE for both cases, which can be found here:


https://gist.github.com/tvergilio/1c78ea337e6795de01f06cafdfa4cf84


and here:


https://gist.github.com/tvergilio/2ed7d4541bc0de14325f82f8aa538d43
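
To make the calculation itself concrete: within a window, the PUE is simply the sum of the total energy readings divided by the sum of the IT energy readings. Here is a minimal sketch of that whole-window calculation (class and method names are illustrative; the real code is in the two gists above):

import java.util.List;

public final class PueCalculator {

    private PueCalculator() {
    }

    /**
     * Power Usage Effectiveness = total facility energy / IT equipment energy,
     * computed over one window's worth of readings.
     */
    public static double calculatePue(List<Double> totalEnergyReadings,
                                      List<Double> itEnergyReadings) {
        double totalEnergy = totalEnergyReadings.stream().mapToDouble(Double::doubleValue).sum();
        double itEnergy = itEnergyReadings.stream().mapToDouble(Double::doubleValue).sum();
        return totalEnergy / itEnergy;
    }
}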


Now, theoretically, the PUE calculation could be improved by aggregating the real energy readings as they arrive, doing the same for the IT readings, emitting both aggregates when the watermark passes the end of the window, and then having a separate step calculate the PUE by dividing one aggregate by the other. My last question is: do you think this refinement is necessary, or is the "whole window at once" approach good enough? Would you expect the difference to be significant?
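
To make that alternative concrete, here is a rough Beam sketch of what I mean. The one-minute window, the constant key, and the transform names are illustrative only, not taken from my actual pipeline, and it assumes each window receives readings from both topics.

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;

public class IncrementalPue {

    private static final TupleTag<Double> TOTAL_TAG = new TupleTag<>();
    private static final TupleTag<Double> IT_TAG = new TupleTag<>();

    /** Sums each stream incrementally per window, then divides the two per-window sums. */
    public static PCollection<Double> pue(PCollection<Double> totalEnergy,
                                          PCollection<Double> itEnergy) {
        PCollection<KV<String, Double>> totalPerWindow = totalEnergy
                .apply("WindowTotal", Window.into(FixedWindows.of(Duration.standardMinutes(1))))
                .apply("SumTotal", Sum.doublesGlobally().withoutDefaults())
                .apply("KeyTotal", WithKeys.of("pue"));

        PCollection<KV<String, Double>> itPerWindow = itEnergy
                .apply("WindowIt", Window.into(FixedWindows.of(Duration.standardMinutes(1))))
                .apply("SumIt", Sum.doublesGlobally().withoutDefaults())
                .apply("KeyIt", WithKeys.of("pue"));

        return KeyedPCollectionTuple.of(TOTAL_TAG, totalPerWindow)
                .and(IT_TAG, itPerWindow)
                .apply("JoinSums", CoGroupByKey.create())
                .apply("Divide", MapElements.via(
                        new SimpleFunction<KV<String, CoGbkResult>, Double>() {
                            @Override
                            public Double apply(KV<String, CoGbkResult> joined) {
                                // Assumes both sums are present for the window.
                                double total = joined.getValue().getOnly(TOTAL_TAG);
                                double it = joined.getValue().getOnly(IT_TAG);
                                return total / it;
                            }
                        }));
    }
}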


Thank you very much for taking the time to read this VERY long e-mail. Any suggestions/opinions/feedback are much appreciated.


All the best,


Thalita

