Posted to user@flink.apache.org by Peter Groesbeck <pe...@gmail.com> on 2019/05/20 22:30:20 UTC

Flink vs KStreams

Hi folks,

I'm hoping to get some deeper clarification on which framework, Flink or
KStreams, to use in a given scenario. I've read over the following blog
article which I think sets a great baseline understanding of the
differences between those frameworks but I would like to get some outside
opinions:
https://www.confluent.io/blog/apache-flink-apache-kafka-streams-comparison-guideline-users/

My understanding of this article is that KStreams works well as an embedded
library in a microservice, API layer, or as a standalone application at a
company with centralized deployment infrastructure, such as a shared
Kubernetes cluster.

In this case, there is discussion around deploying KStreams as a standalone
application stack backed by EC2 or ECS, and whether or not Flink is better
suited to serve as the data transformation layer. We already do run Flink
applications on EMR.

The point against Flink is that it requires a cluster whereas KStreams does
not, and can be deployed on ECS or EC2. We do not have a centralized
deployment team, and will still have to maintain either the CNF
Stack/AutoScaling Group or EMR Cluster ourselves.

What are some of the advantages of using Flink over KStreams standalone?

The job management UI is one that comes to mind, and another is some of
the more advanced API options such as CEP. But I would really love to hear
the opinions of people who are familiar with both. In what scenarios would
you choose one over the other? Is it advisable or even preferable to
attempt to deploy KStreams as its own stack and avoid the complexity of
maintaining a cluster?

Thanks,
Peter

Re: Flink vs KStreams

Posted by orips <or...@gmail.com>.

Elias Levy wrote
> Flink:
> 
> Pros:
> * Intra-job traffic flows directly between workers.
> * More mature.
> * Higher-level constructs: SQL, CEP, etc.

How is SQL a Pro in Flink? Kafka Streams has KSQL which is at least as good
as Flink's SQL.







Re: Flink vs KStreams

Posted by Elias Levy <fe...@gmail.com>.

My 2c:

KStreams:

Pros:
* Streaming as a library:  No need to submit your job to a cluster.  Easy
to scale up/down the job by adding or removing workers.
* Streaming durability: State is durably stored in Kafka topics in a
streaming fashion. Durability is amortized across the job's lifetime.
* No dependencies other than Kafka:  If you are already a Kafka shop, then
there are no additional dependencies.

Cons:
* Intra-job traffic flows through Kafka:  This increases the workload
of your Kafka cluster.
* State is stored in Kafka:  This increases the workload of your Kafka
cluster, particularly as the state topics are compacted.
* Tight Kafka coupling:  This is a drawback if you are not a Kafka shop or
some other message broker would serve you better.
* Parallelism is limited by the number of topic partitions: This is a side
effect of the last two issues.
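
To make the partition limit concrete, here is a minimal sketch in plain
Python. It is a toy model, not the Kafka Streams API: the function names
and the round-robin assignment are assumptions chosen for illustration
(the real assignor is more sophisticated). The only property it relies on
is that there is one unit of work per input partition.

```python
# Toy model, NOT the Kafka Streams API: one task per input partition,
# with tasks spread round-robin across workers (an assumed policy).

def assign_tasks(num_partitions, num_workers):
    """Return a dict mapping each worker index to the partitions it owns."""
    assignment = {w: [] for w in range(num_workers)}
    for partition in range(num_partitions):
        assignment[partition % num_workers].append(partition)
    return assignment


def idle_workers(assignment):
    """Workers that received no partition and therefore do no work."""
    return [w for w, parts in assignment.items() if not parts]


if __name__ == "__main__":
    # 4 partitions, 6 workers: adding workers beyond 4 does not help.
    assignment = assign_tasks(num_partitions=4, num_workers=6)
    print(assignment)
    print("idle workers:", idle_workers(assignment))
```

In this model, scaling out past the partition count only adds idle
workers, so the topic's partition count has to be sized with the
heaviest processing step in mind.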

Flink:

Pros:
* Intra-job traffic flows directly between workers.
* More mature.
* Higher-level constructs: SQL, CEP, etc.

Cons:
* Requires a cluster to submit a job to:  You can't just have some jar and
run it.  You need either a standalone or YARN cluster to submit the job
to.  This makes initial deployment and job deployment complicated.  There
is some work to alleviate this (e.g. embedding the job with Flink in a
container), but it's still nowhere near as easy as KStreams.
* Checkpointed durability:  State is durably stored in a distributed
filesystem in a checkpoint fashion.  If you have a lot of state,
checkpointing will intermittently drop the performance of the job and
create very spiky network traffic.
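
The difference between streaming durability and checkpointed durability
can be sketched with a toy traffic model in plain Python. Everything here
is an assumption made for illustration: the function names, the byte
counts, and the simplification that every checkpoint ships the full state
(incremental checkpoints, where available, would soften the spikes).

```python
# Toy traffic model; all numbers are made up for illustration only.

def changelog_traffic(ticks, update_bytes):
    """Streaming durability: each tick ships that tick's state updates."""
    return [update_bytes] * ticks


def checkpoint_traffic(ticks, state_bytes, interval):
    """Checkpointed durability: a full snapshot is shipped every
    `interval` ticks and nothing is shipped in between."""
    return [state_bytes if (t + 1) % interval == 0 else 0
            for t in range(ticks)]


if __name__ == "__main__":
    changelog = changelog_traffic(ticks=10, update_bytes=5)
    checkpoint = checkpoint_traffic(ticks=10, state_bytes=25, interval=5)
    # Same total bytes in this example, very different peaks.
    print("changelog: ", changelog, "peak", max(changelog))
    print("checkpoint:", checkpoint, "peak", max(checkpoint))
```

With these made-up numbers both approaches move the same total volume,
but the checkpointed one concentrates it into bursts, which is the
spikiness described above.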

Re: Flink vs KStreams

Posted by Timothy Victor <vi...@gmail.com>.

This is probably a very subjective question, but nevertheless here are my
reasons for choosing Flink over KStreams or even Spark.

a) KStreams couples you tightly to Kafka, and I personally don't want my
stream processing engine to be married to my message bus.   There are other
(even better) alternatives to Kafka depending on your requirements, so I
thought it best to keep these two separate.

b) Fault tolerance on KStreams relies on intermediate Kafka topics.  The
increased network and I/O for messages to/from Kafka for bookkeeping was too
high for my use cases.   Local state and asynchronous barrier snapshotting
are key reasons why Flink does better in this area.   You can't just bolt
that on to a framework that didn't get those concepts right from the
beginning.

c) Lack of fine-grained control over operator parallelism.  A graph may
have different computation hot spots where it may be worth fanning out to
process over multiple nodes in parallel.  In KStreams as well as Spark you
can't really do that.  In KStreams the parallelism is tightly bound to the
number of partitions.  This means that your entire processing graph needs
to align well with the way you've partitioned your data in order to
maximize throughput.  You can't always do that (or may not even want to).

d)  The API lends itself well to describing DAGs.   I found Spark and
KStreams not so intuitive in this regard.

HTH

Tim
