Posted to users@kafka.apache.org by Jay Kreps <ja...@gmail.com> on 2013/08/23 17:39:33 UTC

Samza -- A YARN stream processing framework for Kafka

Hey guys,

This may be relevant to people on this list. A few of us at LinkedIn have
been working on Samza, a stream processing framework built on YARN. We just
added this as an Apache Incubator project. We would love to get people's
feedback (and help!). Here are the docs:

http://samza.incubator.apache.org

If anyone has any questions I'm happy to discuss what we are up to. Our
mailing list is here:

http://samza.incubator.apache.org/community/mailing-lists.html

-Jay

Re: Samza -- A YARN stream processing framework for Kafka

Posted by Xavier Stevens <xa...@gaikai.com>.
I can't answer the rest, but the catchy name is from Gregor Samsa, a
character from Kafka's novel The Metamorphosis.

https://en.wikipedia.org/wiki/Gregor_Samsa#Gregor_Samsa


-Xavier


On Tue, Aug 27, 2013 at 6:51 AM, Jonathan Hodges <ho...@gmail.com> wrote:

> First off, I want to say this is awesome!  It has been great to see all the
> great YARN offerings being released lately.  I noticed Hadoop 2.x was
> recently voted beta so very exciting!
>
> Like many, we use Storm for near-real-time processing of our Kafka-based
> streams.  In addition, we send this data to Hadoop for offline analysis.
> Consolidating these three environments into one is a win by itself.  I also
> really like the fault tolerance and security features.  Are you guys using
> Samza in production yet at LinkedIn, or is it still in development?
>
> The local state approach is very interesting.  Are you guys using Databus
> for the feed of changes from the external stores?  Is something like
> Voldemort integrated locally for the key/value store?  Can you maintain
> multiple tables locally for stream processing?
>
> Since we are using Storm, do any latency comparisons exist?  Since Samza
> makes the fault tolerance/durability tradeoff to persist to disk on every
> hop between StreamTasks, it would seem to take a hit here.  That said we
> use Trident a good bit, so many of our topologies are already slowed by
> remote calls to Cassandra.
>
> I know it is fairly new, but were any comparisons against Spark Streaming
> considered?  They take a similar tack of maintaining state locally as
> opposed to external stores, but I believe they are limited to what can fit
> in memory.
>
> Finally, where did the catchy name, Samza, come from?
>
> Thanks!
> Jonathan

Re: Samza -- A YARN stream processing framework for Kafka

Posted by Jonathan Hodges <ho...@gmail.com>.
First off, I want to say this is awesome!  It has been great to see all the
great YARN offerings being released lately.  I noticed Hadoop 2.x was
recently voted beta so very exciting!

Like many, we use Storm for near-real-time processing of our Kafka-based
streams.  In addition, we send this data to Hadoop for offline analysis.
Consolidating these three environments into one is a win by itself.  I also
really like the fault tolerance and security features.  Are you guys using
Samza in production yet at LinkedIn, or is it still in development?

The local state approach is very interesting.  Are you guys using Databus
for the feed of changes from the external stores?  Is something like
Voldemort integrated locally for the key/value store?  Can you maintain
multiple tables locally for stream processing?

Since we are using Storm, do any latency comparisons exist?  Since Samza
makes the fault tolerance/durability tradeoff to persist to disk on every
hop between StreamTasks, it would seem to take a hit here.  That said we
use Trident a good bit, so many of our topologies are already slowed by
remote calls to Cassandra.

I know it is fairly new, but were any comparisons against Spark Streaming
considered?  They take a similar tack of maintaining state locally as
opposed to external stores, but I believe they are limited to what can fit
in memory.

Finally, where did the catchy name, Samza, come from?

Thanks!
Jonathan


