Posted to users@kafka.apache.org by Navneeth Krishnan <re...@gmail.com> on 2019/11/08 07:39:36 UTC

Flink vs Kafka streams

Hello All,

I have a streaming job running in production that processes over 2
billion events per day and does some heavy processing on each event. We
have been facing some challenges managing Flink in production, such as
scaling in and out and restarting the job with a savepoint. Flink provides
a lot of features, which made it an obvious choice at the time, but with
all the operational overhead we are now wondering whether we should keep
using Flink for our stream processing requirements or switch to Kafka
Streams.

We currently deploy Flink on ECS. Bringing up a new cluster for every
stream job is too expensive, but on the flip side running multiple jobs on
the same cluster is difficult, since there is no way to say that one job
must run on a dedicated server while another can run on a shared instance.
Also, taking a savepoint, cancelling, and submitting a new job results in
some downtime. The most critical issue is that there is no shared state
across all tasks, i.e. a global state. We sort of achieve this today with
an external Redis cache, but that incurs cost as well.
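
To give an idea, the workaround looks roughly like the sketch below:
every parallel task opens its own Redis connection and does a lookup per
event. This is only a hypothetical illustration; the class, host, and key
names are made up, and Jedis is just one possible client.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import redis.clients.jedis.Jedis;

// Hypothetical sketch of the Redis workaround: each parallel task keeps
// its own connection and performs a network lookup for every event.
public class EnrichWithSharedState extends RichMapFunction<String, String> {
    private transient Jedis redis;

    @Override
    public void open(Configuration parameters) {
        redis = new Jedis("redis-host", 6379);  // placeholder host/port
    }

    @Override
    public String map(String event) {
        String shared = redis.get("global-state");  // placeholder key, one hop per event
        return event + "|" + shared;
    }

    @Override
    public void close() {
        if (redis != null) {
            redis.close();
        }
    }
}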

If we move to Kafka Streams, deployment becomes much easier: each new
stream job is a microservice that can scale independently. With global
state it is much easier to share state without an external cache. The
disadvantage is that we have to rely on topic partitions for parallelism.
Although this initially sounds easier, it could become a bottleneck when
we need to scale much higher.
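
For example, a GlobalKTable replicates the full contents of a backing
topic to every instance, so each task can do local lookups instead of
calling out to Redis. A rough sketch, with placeholder topic names and
serdes, and the KafkaStreams startup omitted:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class GlobalStateSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Every instance materializes a full local copy of this topic.
        GlobalKTable<String, String> globalState = builder.globalTable(
            "global-state-topic", Consumed.with(Serdes.String(), Serdes.String()));

        KStream<String, String> events = builder.stream(
            "events", Consumed.with(Serdes.String(), Serdes.String()));

        // Local join against the global table instead of a per-event cache call.
        events.join(globalState,
                (eventKey, eventValue) -> eventKey,               // key into the global table
                (eventValue, stateValue) -> eventValue + "|" + stateValue)
              .to("enriched-events");

        // Print the topology; KafkaStreams config and start() omitted for brevity.
        System.out.println(builder.build().describe());
    }
}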

Do you have any suggestions on this? We need to decide which way to move
forward, and any input would be a great help.

Thanks

Re: Flink vs Kafka streams

Posted by Praveen <pr...@gmail.com>.
I have not found relying on partitions for parallelism to be a
disadvantage. At Flurry, we have several pipelines using both the
lower-level Kafka API (for legacy reasons) and Kafka Streams + Kafka
Connect. They process over 10B events per day at around 200k rps. We also
use the same system to send over 10M notifications per day, just to give
you an example of a non-deterministic traffic pattern.
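
The scaling knobs are fairly simple: the partition count of the input
topics caps the number of stream tasks, and you spread those tasks across
instances and threads. A minimal configuration sketch; the application id,
bootstrap servers, thread count, and topic names are placeholders, and the
real topology is elided:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ScalingSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-pipeline");  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Threads per instance; total parallelism = instances * threads,
        // capped by the partition count of the input topics.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events").to("processed-events");  // real topology elided

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}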

- Praveen

Re: Flink vs Kafka streams

Posted by Navneeth Krishnan <re...@gmail.com>.
Thanks Peter. Even with ECS we have autoscaling enabled, but the issue is
that during autoscaling we need to stop the job and restart it with a new
parallelism, which causes downtime.

Thanks

Re: Flink vs Kafka streams

Posted by Peter Groesbeck <pe...@gmail.com>.
We use EMR instead of ECS, but if that's an option for your team, you can configure auto scaling rules in your CloudFormation template so that your task/job load dynamically controls cluster sizing.

Sent from my iPhone
