Posted to dev@samza.apache.org by Yi Pan <ni...@gmail.com> on 2015/10/02 22:54:37 UTC

Samza and KStreams (KIP-28): LinkedIn's POV

Hi, all Samza-lovers,

The question of how Kafka KStream (KIP-28) relates to Samza has come up a
couple of times recently, so we wanted to clarify where we stand at
LinkedIn in this discussion.

Samza has historically had a symbiotic relationship with Kafka and will
continue to work very well with Kafka.  Earlier in the year, we had an
in-depth discussion exploring an even deeper integration with Kafka. After
hitting multiple practical issues (e.g. Apache rules) and technical issues,
we had to give up on that idea. As fallout from that discussion, the Kafka
community is adding some basic event processing capabilities directly into
Kafka core. The basic callback/push-style programming model is by itself a
great addition to the Kafka API set.

However, at LinkedIn we continue to be firmly committed to Samza as our
stream processing framework. Although KStream is a nice addition to the
Kafka stack, our goals for Samza are broader. There are some key technical
differences that make Samza the right strategy for us.

1. Support for non-Kafka systems:

At LinkedIn, a large percentage of our streaming jobs use Databus as an
input source. For any such non-Kafka source, although the CopyCat
connector framework gives a common model for pulling data out of a source
and pushing it into Kafka, it introduces yet another piece of
infrastructure that we have to operate and manage. Also, for companies
that are already on AWS, Google Compute, Azure, etc., asking them to deploy
and operate Kafka on those platforms instead of using natively supported
services like Kinesis or Google Cloud Pub/Sub can be a non-starter. With
more acquisitions at LinkedIn that use AWS, we are also facing this first
hand. The Samza community has a healthy set of system producers/consumers
in the works (Kinesis, ActiveMQ, ElasticSearch, HDFS, etc.).

2. We run Samza as a Stream Processing Service at LinkedIn. This is
fundamentally different from KStream.

This is similar to AWS Lambda, Google Cloud Dataflow, Azure Stream
Insight, and similar services. The service makes it much easier for
developers to get their stream processing jobs up and running in production
by helping with the most common problems, such as monitoring, dashboards,
auto-scaling, and rolling upgrades.

The truth is that if the stream processing application is stateless, then
some of these common problems are not as involved and can be solved even on
regular IaaS platforms such as EC2. Arguably, stateless applications can
be built easily on top of the native APIs of the input source, such as
Kafka or Kinesis.

The place where Samza shines is stateful stream processing applications.
When a Samza application uses local RocksDB-based state, the application
needs special care during rolling upgrades, addition/removal of
containers/machines, temporary machine failures, and capacity management.
We have already done some really good work in Samza 0.10 to make sure that
we don't reseed the state from Kafka (i.e. the host-affinity feature, which
allows a container to reuse its local state). Before this feature existed,
we had multiple production issues caused by network saturation during state
reseeding from Kafka. The problems with stateful applications are similar
to the problems encountered when building NoSQL databases and other data
systems.
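
As a rough illustration of why reusing local state matters, here is a
minimal sketch of how much of a changelog a restarted container would have
to replay. It is a simplified model, not Samza's restore code: the function
name restore_plan and the offset values are hypothetical, and it assumes
the local store records the last changelog offset it has applied.

```python
def restore_plan(local_offset, changelog_end_offset):
    """Return the changelog offsets a restarted container must replay.

    local_offset: last changelog offset already applied to the on-disk
                  store, or None if the container landed on a host that
                  has no local copy of the state (hypothetical model).
    changelog_end_offset: newest offset currently in the changelog.
    """
    # No local state: start from the beginning of the changelog.
    # Local state present: resume right after the last applied offset.
    start = 0 if local_offset is None else local_offset + 1
    return range(start, changelog_end_offset + 1)


# Container moved to a new host: the whole changelog crosses the network.
cold = restore_plan(None, 999_999)

# Container restarted on the same host: only the recent tail is replayed.
warm = restore_plan(999_899, 999_999)

print(len(cold), len(warm))
```

Under these assumptions the cold restart replays every message in the
changelog, while the warm restart replays only what was written after the
last applied offset, which is the difference that shows up as network load.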

There are surely some scenarios where customers don't want any YARN
dependency and want to run their stream processing application on a
dedicated set of nodes. This is where KStream clearly has an advantage
over current Samza. Prior to KStream, we had a patch in Samza that solved
the same problem (SAMZA-516). We expect to finish this patch soon and
formally support standalone Samza.

3. Operators for Stream Processing and SQL:

At LinkedIn, there is a growing need to iterate on Samza jobs faster in the
loop of implementing, deploying, revising the code, and deploying again. A
key bottleneck that slows down this iteration is the implementation of a
Samza job. It has long been recognized in the Samza community that there is
a strong need for high-level language support to shorten this iterative
process. Since last year, we have identified SQL as the user-facing
high-level language, completed the high-level design, and started
prototyping it in Samza. The prototype starts with a set of physical
operators that are crucial to the correctness of stream processing,
namely the window operator, aggregation, and join. KStream adopts some of
these core ideas. However, our vision for Samza's SQL support goes beyond
what's covered in KStream. We want Samza's SQL support to be as easy as
Google Dataflow and Azure Stream Analytics, in which a user can upload a
query statement and the system will parse the query, translate it into a
distributed execution plan, allocate the containers and stream resources in
a cluster, and deploy it. To support this grand vision, our effort in
building the SQL operator API, the query planner, and the optimizers is
vastly different from what KStream covers, which is only a single-node
programming interface.
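
To make the window and aggregation operators concrete, here is a minimal
sketch of a tumbling-window count, roughly what a windowed GROUP BY-style
query would compute. It is not the Samza operator API or KStream code: the
function name, the event tuples, and the 60-second window are hypothetical,
and it ignores late or out-of-order events.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms):
    """Count events per (window_start, key) in fixed, non-overlapping windows.

    events: iterable of (timestamp_ms, key) pairs (hypothetical shape).
    window_ms: window length in milliseconds.
    """
    counts = defaultdict(int)
    for timestamp_ms, key in events:
        # Align the timestamp to the start of the window it falls into.
        window_start = timestamp_ms - (timestamp_ms % window_ms)
        counts[(window_start, key)] += 1
    return dict(counts)


events = [
    (1_000, "page_a"),
    (1_500, "page_a"),
    (2_100, "page_b"),
    (61_000, "page_a"),
]
result = tumbling_window_counts(events, window_ms=60_000)
print(result)
```

A single-node library call like this is the easy part; the planning,
container allocation, and deployment described above are what would turn
the same query into a distributed job.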

Independent of these strategic differences, another big factor for us is
that Samza is an established and mature system that we have successfully
operationalized and that has been running in production for a few years.



Thanks!

-Yi Pan
Samza @ LinkedIn

Re: Samza and KStreams (KIP-28): LinkedIn's POV

Posted by Roger Hoover <ro...@gmail.com>.

Great.  Thanks, Yi.

On Mon, Oct 5, 2015 at 10:25 AM, Yi Pan <ni...@gmail.com> wrote:

> Hi, Roger,
>
>
> On Sat, Oct 3, 2015 at 11:13 AM, Roger Hoover <ro...@gmail.com>
> wrote:
>
> > As previously discussed, the biggest request I
> > have is being able to run Samza without YARN, under something like
> > Kubernetes instead.
> >
> >
> Totally. We will be actively working on standalone Samza after the
> upcoming 0.10 release.
>
>
> > Also, I'm curious.  What's the current state of the Samza SQL physical
> > operators?  Are they used in production yet?  Is there a doc on how to
> use
> > them?
> >
> >
> The current physical operator code now lives in the samza-sql branch.
> There are still two big pending check-ins in review right now: one to
> stabilize the operator APIs, the other to implement the window operator.
> We are planning to finish the prototype in Q4.
>
> Regards,
>
> -Yi
>

Re: Samza and KStreams (KIP-28): LinkedIn's POV

Posted by Yi Pan <ni...@gmail.com>.

Hi, Roger,


On Sat, Oct 3, 2015 at 11:13 AM, Roger Hoover <ro...@gmail.com>
wrote:

> As previously discussed, the biggest request I
> have is being able to run Samza without YARN, under something like
> Kubernetes instead.
>
>
Totally. We will be actively working on standalone Samza after the
upcoming 0.10 release.


> Also, I'm curious.  What's the current state of the Samza SQL physical
> operators?  Are they used in production yet?  Is there a doc on how to use
> them?
>
>
The current physical operator code now lives in the samza-sql branch.
There are still two big pending check-ins in review right now: one to
stabilize the operator APIs, the other to implement the window operator.
We are planning to finish the prototype in Q4.

Regards,

-Yi

Re: Samza and KStreams (KIP-28): LinkedIn's POV

Posted by Yi Pan <ni...@gmail.com>.

Hi, Renato,

Thanks a lot! Please feel free to distribute this message if you
encounter this question again. :)

Looking forward to your Kinesis and ActiveMQ patches!

Cheers!

-Yi

On Sun, Oct 4, 2015 at 9:45 PM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hi Yi,
>
> Thanks a lot for sharing this POV. It makes a lot more sense to me now
> what both KStreams and Samza are targeting.
>
>
> Best,
>
> Renato M.
>

Re: Samza and KStreams (KIP-28): LinkedIn's POV

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hi Yi,

Thanks a lot for sharing this POV. It makes a lot more sense to me now
what both KStreams and Samza are targeting.


Best,

Renato M.


Re: Samza and KStreams (KIP-28): LinkedIn's POV

Posted by Roger Hoover <ro...@gmail.com>.

Hi Yi,

Thank you for sharing this update and perspective.  I tend to agree that
for simple, stateless cases, things could be easier and hopefully KStreams
may help with that.  I also appreciate a lot of features that Samza already
supports for operations.  As previously discussed, the biggest request I
have is being able to run Samza without YARN, under something like
Kubernetes instead.

Also, I'm curious.  What's the current state of the Samza SQL physical
operators?  Are they used in production yet?  Is there a doc on how to use
them?

Thanks,

Roger


