Posted to dev@metron.apache.org by Ali Nazemian <al...@gmail.com> on 2019/05/11 04:42:23 UTC

Re: [DISCUSS] Real-time processing engine: Storm, Spark, Flink or Cloud Native

Oops. Turns out I totally missed this email.

Thanks, Mike, for your reply. Spark's native support for Kubernetes was
added very recently and is not yet at a stage where it can provide all of
the aforementioned features. There is no doubt that Spark is a powerful
tool and has been widely used for similar use cases over the last few
years. However, when we look at the features Spark provides and try to map
them to Metron's high-level architecture, it is hard to see Spark adding
much value to this architecture on the event-processing side (no doubt
about the batch side of it, though). When we compare it with more
lightweight frameworks for event-driven data processing pipelines and
cloud-native architectures, you can see that everything Spark targets on
the real-time side can be covered natively by the architecture itself
(without help from the framework). Things like fault tolerance,
reliability, back pressure, at-least-once delivery guarantees, etc. can all
be provided very easily. The only difference is that you get full support
for Kubernetes features out of the box, instead of waiting for the
technology to evolve and maybe, in two years, reach the point where you can
truly have self-healing, change isolation, auto-scalability, etc. with
Spark, whereas you can have them all right now just by looking at this
problem from a different angle.
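
To make that concrete, here is a minimal, purely illustrative sketch (the
topic name and the parseAndForward step are hypothetical) of a stateless
consumer where at-least-once delivery comes from committing offsets only
after processing, and where scaling, restarts, and self-healing are left to
the platform (e.g. Kubernetes) rather than to the streaming framework:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class StatelessParserWorker {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");
            props.put("group.id", "parser-workers");   // scale = more members in the group
            props.put("enable.auto.commit", "false");  // commit only after processing
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("raw-events"));  // hypothetical input topic
                while (true) {
                    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : batch) {
                        parseAndForward(record.value());    // hypothetical parse/enrich step
                    }
                    consumer.commitSync();                  // at-least-once delivery
                }
            }
        }

        private static void parseAndForward(String rawEvent) {
            // placeholder for the real parsing / enrichment logic
        }
    }

Run one of these per pod, let a Deployment plus a horizontal autoscaler
decide how many replicas are alive, and the consumer group rebalances
partitions as pods come and go.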

Cheers,
Ali

On Fri, Apr 12, 2019 at 3:54 AM Michael Miklavcic <
michael.miklavcic@gmail.com> wrote:

> Hi Ali,
>
> Thank you for taking the time to share your experiences with us. I've been
> thinking about this a while now and wanted to take some time for reflection
> before responding. I need to kick out a proper dev list DISCUSS thread on
> this, but if you've seen a couple of the recent refactoring PRs, you are
> right that we've been looking to decouple ourselves from Storm and open up
> the possibility of onboarding another processing engine. The most obvious
> candidate here, imho, is Apache Spark. Getting right to the meat of your
> discussion points, I don't think this is an either/or proposition between
> Kubernetes and Spark. I see this as an AND proposition. The reality is that
> Spark offers quite a bit from a job scaling, redundancy, and efficiency
> perspective, not to mention the capabilities it provides purely as a data
> transformation and processing engine. The real roadmap, at
> least in my mind, would be for us to onboard Spark and then leverage
> Kubernetes at some point to enable some of the features that you describe -
> vertical and horizontal elasticity, in particular. In addition to that,
> Helm could provide some compelling features for managing that container
> application deployment story. Expect a discussion from me very soon about
> more specific ideas as to what I think our integration with Spark can and
> should look like in the near future with Metron. We have nearly completed
> decoupling our core infrastructure from Storm at this point, which opens us
> up to a number of possibilities going forward.
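>
> Just to make the direction a bit more concrete, here is a minimal, purely
> illustrative sketch (hypothetical topic name, and no claim about what the
> eventual Metron integration will look like) of a Spark Structured
> Streaming job in Java that reads telemetry from Kafka; the same job could
> later be submitted to a Kubernetes-backed cluster manager:
>
>     import org.apache.spark.sql.Dataset;
>     import org.apache.spark.sql.Row;
>     import org.apache.spark.sql.SparkSession;
>     import org.apache.spark.sql.streaming.StreamingQuery;
>
>     public class TelemetryStreamSketch {
>         public static void main(String[] args) throws Exception {
>             SparkSession spark = SparkSession.builder()
>                 .appName("telemetry-stream-sketch")
>                 .getOrCreate();
>
>             // Read raw telemetry from a hypothetical Kafka topic.
>             Dataset<Row> raw = spark.readStream()
>                 .format("kafka")
>                 .option("kafka.bootstrap.servers", "kafka:9092")
>                 .option("subscribe", "raw-events")
>                 .load();
>
>             // Parsing/enrichment would happen here; for now just decode the value.
>             Dataset<Row> events = raw.selectExpr("CAST(value AS STRING) AS event");
>
>             // Console sink for illustration only; a real job would index the events.
>             StreamingQuery query = events.writeStream()
>                 .format("console")
>                 .start();
>             query.awaitTermination();
>         }
>     }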
>
> Best,
> Mike Miklavcic
>
>
> On Thu, Apr 4, 2019 at 1:35 AM Ali Nazemian <al...@gmail.com> wrote:
>
> > Hi All,
> >
> > As far as I understand, there is a plan to change the real-time engine
> > of Metron due to some issues that users and developers have been facing
> > with it. I would like to describe some of the critical issues that
> > customers have been facing, to help the development team decide what the
> > best approach could be for the future of Metron. Based on our experience
> > with Metron, there are two issues that cause a lot of problems on both
> > the technology and the business side:
> >
> > - Infrastructure cost
> > - Operational complexity
> >
> > We have had a lot of trouble minimizing infrastructure cost. We have
> > also spent significant time tuning the infrastructure to bring that cost
> > down. However, regardless of what we did, we were not able to manage our
> > cost properly. The main reason is that the rate of log ingestion
> > fluctuates heavily: we were receiving 4k eps on a sensor during peak
> > time and less than 1 eps off-peak (e.g. during the night). The problem
> > is that you want an environment that can easily *scale up* and *scale
> > down* based on your ingestion traffic. Not to mention that there have
> > been situations where we could not even predict the ingestion rate, such
> > as a cyber attack in which the source devices generate a flood of logs.
> > A DDoS, for example, is one scenario that produces a huge volume of
> > logs.
> >
> > When it comes to operational complexity, we have had a lot of trouble
> > managing sensors and tuning different parameters based on the traffic we
> > receive. We have also had many failures for various reasons, and we
> > spent a fair amount of time writing scripts that simulate a
> > *self-healing* capability at a very basic level. In production we need
> > to be able to respond to different situations very quickly: for example,
> > if a service is down, bring it back up automatically, or if a new sensor
> > is onboarded, make sure it poses no risk to other services. We also had
> > many discussions about the processes and automated tests we could put in
> > place to make sure nothing goes wrong. However, that forced us to build
> > yet more platforms to test things from different angles, which increased
> > our cost even further. We did not have the ability to provision a
> > short-lived environment once a PR was submitted; we really missed being
> > able to *provision an environment very quickly*. We also needed a way to
> > isolate different sensors and different use cases entirely, not only at
> > the parser topology but also at the enrichment and indexing topologies.
> > We needed a good mechanism for *change isolation*.
> >
> > I understand that the requirements for running an application in the
> > cloud differ from those for running it on-premise. However, most of them
> > are much the same when it comes to running Metron in production.
> >
> > We recently delivered a data processing pipeline project using a more
> > cloud-native architecture, and we found that the concerns were very
> > similar and that Kubernetes made them easy to manage: native support for
> > scaling up and down, self-healing, the ability to provision a
> > short-lived environment very quickly, and change isolation via canary
> > and blue-green deployments. Of course, following the twelve-factor
> > principles was important for keeping those concerns under control. For
> > this we used Spring Cloud Stream to build an event-driven data
> > processing pipeline, along with some complementary frameworks from
> > Pivotal (a rough sketch of what such a processor looks like follows
> > below). What comes to my mind is: if other users' experiences of running
> > Metron in production are similar to ours and they have had the same sort
> > of concerns, could migrating from Storm to an event-driven pipeline give
> > all users a better experience running Metron in production? Of course, I
> > have not seen other users' challenges first-hand, so I cannot answer
> > that; it is just an idea.
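> >
> > Purely as an illustration (the class name, function name, and topic
> > bindings below are hypothetical, not taken from our actual project), a
> > processor in Spring Cloud Stream's functional style is little more than
> > a plain Java function; delivery, partitioning, and scaling are handled
> > by the binder and the platform underneath:
> >
> >     import java.util.function.Function;
> >     import org.springframework.boot.SpringApplication;
> >     import org.springframework.boot.autoconfigure.SpringBootApplication;
> >     import org.springframework.context.annotation.Bean;
> >
> >     @SpringBootApplication
> >     public class NormalizerApplication {
> >
> >         public static void main(String[] args) {
> >             SpringApplication.run(NormalizerApplication.class, args);
> >         }
> >
> >         // Bound to input/output destinations (e.g. Kafka topics) through
> >         // configuration such as
> >         // spring.cloud.stream.bindings.normalize-in-0.destination=raw-events
> >         @Bean
> >         public Function<String, String> normalize() {
> >             return rawEvent -> rawEvent.trim();  // placeholder transformation
> >         }
> >     }
> >
> > Each instance of this is a stateless container that Kubernetes can
> > replicate, replace, or scale down freely.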
> >
> > There is no doubt that we could get all of these features with Spark as
> > well in the future, but that requires more time to build the
> > integration, and some of these capabilities are not going to be
> > available very soon. It is just a thought, but the Metron architecture
> > is already event-driven at several stages and stateless by nature, which
> > makes it a good fit for an event-driven pipeline deployed on containers.
> >
> > Cheers,
> > Ali
> >
>


-- 
A.Nazemian