Posted to dev@samza.apache.org by Jakob Homan <jg...@gmail.com> on 2013/10/17 20:08:13 UTC

Special Bay Area HUG: Tajo and Samza

Hey everybody-
   Join us at LinkedIn on Nov. 5 for a special HUG dedicated to two awesome
new Incubator projects: Tajo, a low-latency SQL query engine atop YARN,
and Samza.

http://www.meetup.com/hadoop/events/146077932/

-Jakob

Re: Special Bay Area HUG: Tajo and Samza

Posted by Chris Curtin <cu...@gmail.com>.
Hi Jay,

Do they record these meetups?

Thanks,

Chris


On Thu, Oct 17, 2013 at 5:03 PM, Jay Kreps <ja...@gmail.com> wrote:

> FYI.
>
> ---------- Forwarded message ----------
> From: Jakob Homan <jg...@gmail.com>
> Date: Thu, Oct 17, 2013 at 11:08 AM
> Subject: Special Bay Area HUG: Tajo and Samza
> To: dev@samza.incubator.apache.org
>
>
> Hey everybody-
>    Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
> awesome Incubator projects, Tajo, a low-latency SQL query engine atop YARN
> and Samza.
>
> http://www.meetup.com/hadoop/events/146077932/
>
> -Jakob
>

Fwd: Special Bay Area HUG: Tajo and Samza

Posted by Jay Kreps <ja...@gmail.com>.
FYI.

---------- Forwarded message ----------
From: Jakob Homan <jg...@gmail.com>
Date: Thu, Oct 17, 2013 at 11:08 AM
Subject: Special Bay Area HUG: Tajo and Samza
To: dev@samza.incubator.apache.org


Hey everybody-
   Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
awesome Incubator projects, Tajo, a low-latency SQL query engine atop YARN
and Samza.

http://www.meetup.com/hadoop/events/146077932/

-Jakob

Re: Special Bay Area HUG: Tajo and Samza

Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Guys,

Very cool. :) When we have something up, I'll try and follow up here to
let you know.

In the meantime, you might have a look at Google's Dapper paper:

  http://research.google.com/pubs/pub36356.html


And the Google Wide Profiling paper:

  http://research.google.com/pubs/pub36575.html

Both are great resources for folks building things in this area.

Cheers,
Chris

On 10/20/13 7:15 AM, "Philip Reynolds" <ph...@gmail.com> wrote:

>Just to chime in, I'd be very interested in the monitoring blog post too.
>We're doing a kafka implementation for a robust data pipeline. Initially,
>samza does look interesting for monitoring use-cases.
>
>On Sun, Oct 20, 2013 at 2:53 PM, Garry Turkington <
>g.turkington@improvedigital.com> wrote:
>
>> Hi Chris,
>>
>> Thanks for all this, makes sense.  Be interested to hear where things go
>> with the locality optimizations. I'm just looking at deploying our first
>> Kafka cluster to change how we do data distribution and that's not
>>going to
>>  initially be collocated with the Hadoop cluster.  Samza's tight Kafka
>> integration is one of the things that has drawn me to it so I'm looking
>> forward (!) to seeing what sort of performance/latency I get from the
>> remote/smaller Kafka setup.
>>
>> Looking forward to the blog post on the monitoring jobs written in
>>Samza.
>>  We're in the earlier stages of a common service framework so have the
>> luxury of building on the experiences of others who learned this stuff
>>the
>> hard way. :)
>>
>> Regards
>> Garry
>>
>> -----Original Message-----
>> From: Chris Riccomini [mailto:criccomini@linkedin.com]
>> Sent: 18 October 2013 19:01
>> To: dev@samza.incubator.apache.org
>> Subject: Re: Special Bay Area HUG: Tajo and Samza
>>
>> Hey Gary,
>>
>> Thanks!
>>
>> Locality: A few things to note here.
>>
>> 1. We run one broker per host, as you suggest (18 nodes = 18 brokers).
>> 2. Samza does not explicitly try to do any co-location right now. Any
>> locality that we get is purely luck.
>> 3. YARN allows you to make resource requests for a specific host/rack.
>> This is the feature we would like to use to provide better locality.
>>
>> We haven't done any meaningful evaluation of the locality we're getting
>> (or would get) right now, though.
>>
>> Operations: Yes, we have a pretty cool set of Samza jobs that Jakob
>>wrote
>> to do some metrics/monitoring stuff. He can probably talk more about it
>> than I can. We're planning on putting up a blog post in the near future
>> about it.
>>
>> More broadly, we have a pretty well defined service container at
>>LinkedIn.
>> These services are called via RPC. Every time an RPC request is made,
>>the
>> service logs out information about the request: who sent the request,
>>what
>> method was called, how long it took to process, etc etc. In addition, we
>> also have all WARN/ERROR log events flowing through Kafka as well (via
>> Kafka's Log4j appender). There is a brief mention of this in:
>>
>>   http://sites.computer.org/debull/A12june/pipeline.pdf
>>
>> As you can imagine, there are a ton of things you can do with this
>>data. :)
>>
>> Cheers,
>> Chris
>>
>> On 10/18/13 4:44 AM, "Garry Turkington"
>><g....@improvedigital.com>
>> wrote:
>>
>> >Hi Chris,
>> >
>> >Nice presentation -- 2 questions:
>> >
>> >1. I had wondered about the references to Kafka broker colocation I'd
>> >seen around the place.  So for example in the 18-node sized cluster you
>> >mention you'd have 18 Kafka brokers running there, 1 per host?  Do you
>> >actually get any sort of data locality benefits from this, is there a
>> >way to ensure that the Samza container on host x is processing the
>> >partitions of each topic on the collocated Kafka broker?  Or am I
>>missing
>> the intent?
>> >
>> >2. Interested at your mention of using something like Samza for
>> >processing of monitoring and metric type data, it's something we've
>> >been talking about internally.  Anything been published on what you are
>> >doing in that space?
>> >
>> >Thanks!
>> >Garry
>> >
>> >-----Original Message-----
>> >From: Chris Riccomini [mailto:criccomini@linkedin.com]
>> >Sent: 17 October 2013 21:54
>> >To: dev@samza.incubator.apache.org
>> >Subject: Re: Special Bay Area HUG: Tajo and Samza
>> >
>> >Hey Guys,
>> >
>> >On a related note, my talk from the YARN meet up at LinkedIn is now
>> >online:
>> >
>> >  https://www.youtube.com/watch?v=7YBmUKjzg7c
>> >
>> >If you're not too familiar with Samza, this is a great place to start.
>> >
>> >Also, feedback welcome on presentation content, style, etc.
>> >
>> >Cheers,
>> >Chris
>> >
>> >On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>> >
>> >>Hey everybody-
>> >>   Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>> >>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
>> >>YARN and Samza.
>> >>
>> >>http://www.meetup.com/hadoop/events/146077932/
>> >>
>> >>-Jakob
>> >
>> >


Re: Special Bay Area HUG: Tajo and Samza

Posted by Philip Reynolds <ph...@gmail.com>.
Just to chime in, I'd be very interested in the monitoring blog post too.
We're doing a Kafka implementation for a robust data pipeline. Initially,
Samza does look interesting for monitoring use-cases.

On Sun, Oct 20, 2013 at 2:53 PM, Garry Turkington <
g.turkington@improvedigital.com> wrote:

> Hi Chris,
>
> Thanks for all this, makes sense.  Be interested to hear where things go
> with the locality optimizations. I'm just looking at deploying our first
> Kafka cluster to change how we do data distribution and that's not going to
>  initially be collocated with the Hadoop cluster.  Samza's tight Kafka
> integration is one of the things that has drawn me to it so I'm looking
> forward (!) to seeing what sort of performance/latency I get from the
> remote/smaller Kafka setup.
>
> Looking forward to the blog post on the monitoring jobs written in Samza.
>  We're in the earlier stages of a common service framework so have the
> luxury of building on the experiences of others who learned this stuff the
> hard way. :)
>
> Regards
> Garry
>
> -----Original Message-----
> From: Chris Riccomini [mailto:criccomini@linkedin.com]
> Sent: 18 October 2013 19:01
> To: dev@samza.incubator.apache.org
> Subject: Re: Special Bay Area HUG: Tajo and Samza
>
> Hey Gary,
>
> Thanks!
>
> Locality: A few things to note here.
>
> 1. We run one broker per host, as you suggest (18 nodes = 18 brokers).
> 2. Samza does not explicitly try to do any co-location right now. Any
> locality that we get is purely luck.
> 3. YARN allows you to make resource requests for a specific host/rack.
> This is the feature we would like to use to provide better locality.
>
> We haven't done any meaningful evaluation of the locality we're getting
> (or would get) right now, though.
>
> Operations: Yes, we have a pretty cool set of Samza jobs that Jakob wrote
> to do some metrics/monitoring stuff. He can probably talk more about it
> than I can. We're planning on putting up a blog post in the near future
> about it.
>
> More broadly, we have a pretty well defined service container at LinkedIn.
> These services are called via RPC. Every time an RPC request is made, the
> service logs out information about the request: who sent the request, what
> method was called, how long it took to process, etc etc. In addition, we
> also have all WARN/ERROR log events flowing through Kafka as well (via
> Kafka's Log4j appender). There is a brief mention of this in:
>
>   http://sites.computer.org/debull/A12june/pipeline.pdf
>
> As you can imagine, there are a ton of things you can do with this data. :)
>
> Cheers,
> Chris
>
> On 10/18/13 4:44 AM, "Garry Turkington" <g....@improvedigital.com>
> wrote:
>
> >Hi Chris,
> >
> >Nice presentation -- 2 questions:
> >
> >1. I had wondered about the references to Kafka broker colocation I'd
> >seen around the place.  So for example in the 18-node sized cluster you
> >mention you'd have 18 Kafka brokers running there, 1 per host?  Do you
> >actually get any sort of data locality benefits from this, is there a
> >way to ensure that the Samza container on host x is processing the
> >partitions of each topic on the collocated Kafka broker?  Or am I missing
> the intent?
> >
> >2. Interested at your mention of using something like Samza for
> >processing of monitoring and metric type data, it's something we've
> >been talking about internally.  Anything been published on what you are
> >doing in that space?
> >
> >Thanks!
> >Garry
> >
> >-----Original Message-----
> >From: Chris Riccomini [mailto:criccomini@linkedin.com]
> >Sent: 17 October 2013 21:54
> >To: dev@samza.incubator.apache.org
> >Subject: Re: Special Bay Area HUG: Tajo and Samza
> >
> >Hey Guys,
> >
> >On a related note, my talk from the YARN meet up at LinkedIn is now
> >online:
> >
> >  https://www.youtube.com/watch?v=7YBmUKjzg7c
> >
> >If you're not too familiar with Samza, this is a great place to start.
> >
> >Also, feedback welcome on presentation content, style, etc.
> >
> >Cheers,
> >Chris
> >
> >On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
> >
> >>Hey everybody-
> >>   Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
> >>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
> >>YARN and Samza.
> >>
> >>http://www.meetup.com/hadoop/events/146077932/
> >>
> >>-Jakob
> >
> >

RE: Special Bay Area HUG: Tajo and Samza

Posted by Garry Turkington <g....@improvedigital.com>.
Hi Chris,

Thanks for all this, makes sense.  I'd be interested to hear where things go with the locality optimizations. I'm just looking at deploying our first Kafka cluster to change how we do data distribution, and that won't initially be collocated with the Hadoop cluster.  Samza's tight Kafka integration is one of the things that has drawn me to it, so I'm looking forward (!) to seeing what sort of performance/latency I get from the remote/smaller Kafka setup.

Looking forward to the blog post on the monitoring jobs written in Samza.  We're in the earlier stages of a common service framework so have the luxury of building on the experiences of others who learned this stuff the hard way. :)

Regards
Garry

-----Original Message-----
From: Chris Riccomini [mailto:criccomini@linkedin.com] 
Sent: 18 October 2013 19:01
To: dev@samza.incubator.apache.org
Subject: Re: Special Bay Area HUG: Tajo and Samza

Hey Gary,

Thanks!

Locality: A few things to note here.

1. We run one broker per host, as you suggest (18 nodes = 18 brokers).
2. Samza does not explicitly try to do any co-location right now. Any locality that we get is purely luck.
3. YARN allows you to make resource requests for a specific host/rack.
This is the feature we would like to use to provide better locality.

We haven't done any meaningful evaluation of the locality we're getting (or would get) right now, though.

Operations: Yes, we have a pretty cool set of Samza jobs that Jakob wrote to do some metrics/monitoring stuff. He can probably talk more about it than I can. We're planning on putting up a blog post in the near future about it.

More broadly, we have a pretty well defined service container at LinkedIn.
These services are called via RPC. Every time an RPC request is made, the service logs out information about the request: who sent the request, what method was called, how long it took to process, etc etc. In addition, we also have all WARN/ERROR log events flowing through Kafka as well (via Kafka's Log4j appender). There is a brief mention of this in:

  http://sites.computer.org/debull/A12june/pipeline.pdf

As you can imagine, there are a ton of things you can do with this data. :)

Cheers,
Chris

On 10/18/13 4:44 AM, "Garry Turkington" <g....@improvedigital.com>
wrote:

>Hi Chris,
>
>Nice presentation -- 2 questions:
>
>1. I had wondered about the references to Kafka broker colocation I'd 
>seen around the place.  So for example in the 18-node sized cluster you 
>mention you'd have 18 Kafka brokers running there, 1 per host?  Do you 
>actually get any sort of data locality benefits from this, is there a 
>way to ensure that the Samza container on host x is processing the 
>partitions of each topic on the collocated Kafka broker?  Or am I missing the intent?
>
>2. Interested at your mention of using something like Samza for 
>processing of monitoring and metric type data, it's something we've 
>been talking about internally.  Anything been published on what you are 
>doing in that space?
>
>Thanks!
>Garry
>
>-----Original Message-----
>From: Chris Riccomini [mailto:criccomini@linkedin.com]
>Sent: 17 October 2013 21:54
>To: dev@samza.incubator.apache.org
>Subject: Re: Special Bay Area HUG: Tajo and Samza
>
>Hey Guys,
>
>On a related note, my talk from the YARN meet up at LinkedIn is now
>online:
>
>  https://www.youtube.com/watch?v=7YBmUKjzg7c
>
>If you're not too familiar with Samza, this is a great place to start.
>
>Also, feedback welcome on presentation content, style, etc.
>
>Cheers,
>Chris
>
>On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>
>>Hey everybody-
>>   Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new 
>>awesome Incubator projects, Tajo, a low-latency SQL query engine atop 
>>YARN and Samza.
>>
>>http://www.meetup.com/hadoop/events/146077932/
>>
>>-Jakob
>
>

Re: Special Bay Area HUG: Tajo and Samza

Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Garry,

Thanks!

Locality: A few things to note here.

1. We run one broker per host, as you suggest (18 nodes = 18 brokers).
2. Samza does not explicitly try to do any co-location right now. Any
locality that we get is purely luck.
3. YARN allows you to make resource requests for a specific host/rack.
This is the feature we would like to use to provide better locality.

We haven't done any meaningful evaluation of the locality we're getting
(or would get) right now, though.
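
A minimal sketch of how a per-container host preference could be derived
before making a host-specific resource request. This is an illustration only,
not how Samza behaves (as noted above, it does no co-location today). The
function name, the input mappings, and the host names are all hypothetical;
the partition-to-broker mapping is assumed to be available from somewhere
else.

```python
from collections import Counter


def preferred_hosts(partition_leaders, container_partitions):
    """Pick, for each container, the host that serves most of its partitions.

    partition_leaders: dict of partition id -> broker host (hypothetical input)
    container_partitions: dict of container id -> list of partition ids
    """
    preferences = {}
    for container, partitions in container_partitions.items():
        counts = Counter(partition_leaders[p] for p in partitions)
        if not counts:
            # A container with no partitions has no locality preference.
            continue
        # Highest count wins; ties are broken by host name so the result
        # is deterministic.
        preferences[container] = min(
            counts, key=lambda host: (-counts[host], host)
        )
    return preferences


# Hypothetical example data: four partitions spread over three brokers.
leaders = {0: "host-a", 1: "host-b", 2: "host-a", 3: "host-c"}
assignment = {"c0": [0, 2], "c1": [1, 3]}
print(preferred_hosts(leaders, assignment))
```

The resulting host names are the kind of value one might pass along with a
host/rack-specific request; whether the scheduler can honor it is a separate
question.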

Operations: Yes, we have a pretty cool set of Samza jobs that Jakob wrote
to do some metrics/monitoring stuff. He can probably talk more about it
than I can. We're planning on putting up a blog post in the near future
about it.

More broadly, we have a pretty well-defined service container at LinkedIn.
These services are called via RPC. Every time an RPC request is made, the
service logs information about the request: who sent the request, what
method was called, how long it took to process, etc. In addition, we
also have all WARN/ERROR log events flowing through Kafka (via
Kafka's Log4j appender). There is a brief mention of this in:

  http://sites.computer.org/debull/A12june/pipeline.pdf

As you can imagine, there are a ton of things you can do with this data. :)
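
As one small illustration of what can be done with request events like these,
here is a sketch that computes per-method call counts and mean latency. The
event shape (the "caller", "method", and "duration_ms" field names) is an
assumption made for the example, not the actual LinkedIn schema, and the
events are plain dicts rather than messages read from Kafka.

```python
from collections import defaultdict


def summarize(events):
    """Return per-method call count and mean latency in milliseconds.

    Each event is assumed to be a dict with hypothetical keys
    "caller", "method", and "duration_ms".
    """
    totals = defaultdict(lambda: [0, 0.0])  # method -> [count, total ms]
    for event in events:
        entry = totals[event["method"]]
        entry[0] += 1
        entry[1] += event["duration_ms"]
    return {
        method: {"count": count, "mean_ms": total / count}
        for method, (count, total) in totals.items()
    }


# Hypothetical request events.
sample = [
    {"caller": "web", "method": "getProfile", "duration_ms": 10},
    {"caller": "api", "method": "getProfile", "duration_ms": 30},
    {"caller": "web", "method": "search", "duration_ms": 5},
]
print(summarize(sample))
```

A streaming job would keep the same running totals per method and emit them
periodically instead of returning them once at the end.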

Cheers,
Chris

On 10/18/13 4:44 AM, "Garry Turkington" <g....@improvedigital.com>
wrote:

>Hi Chris,
>
>Nice presentation -- 2 questions:
>
>1. I had wondered about the references to Kafka broker colocation I'd
>seen around the place.  So for example in the 18-node sized cluster you
>mention you'd have 18 Kafka brokers running there, 1 per host?  Do you
>actually get any sort of data locality benefits from this, is there a way
>to ensure that the Samza container on host x is processing the partitions
>of each topic on the collocated Kafka broker?  Or am I missing the intent?
>
>2. Interested at your mention of using something like Samza for
>processing of monitoring and metric type data, it's something we've been
>talking about internally.  Anything been published on what you are doing
>in that space?
>
>Thanks!
>Garry
>
>-----Original Message-----
>From: Chris Riccomini [mailto:criccomini@linkedin.com]
>Sent: 17 October 2013 21:54
>To: dev@samza.incubator.apache.org
>Subject: Re: Special Bay Area HUG: Tajo and Samza
>
>Hey Guys,
>
>On a related note, my talk from the YARN meet up at LinkedIn is now
>online:
>
>  https://www.youtube.com/watch?v=7YBmUKjzg7c
>
>If you're not too familiar with Samza, this is a great place to start.
>
>Also, feedback welcome on presentation content, style, etc.
>
>Cheers,
>Chris
>
>On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>
>>Hey everybody-
>>   Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
>>YARN and Samza.
>>
>>http://www.meetup.com/hadoop/events/146077932/
>>
>>-Jakob
>
>


RE: Special Bay Area HUG: Tajo and Samza

Posted by Garry Turkington <g....@improvedigital.com>.
Hi Chris,

Nice presentation -- 2 questions:

1. I had wondered about the references to Kafka broker colocation I'd seen around the place.  So for example, in the 18-node cluster you mention, you'd have 18 Kafka brokers running there, 1 per host?  Do you actually get any sort of data locality benefits from this?  Is there a way to ensure that the Samza container on host x is processing the partitions of each topic on the collocated Kafka broker?  Or am I missing the intent?

2. Interested in your mention of using something like Samza for processing monitoring and metric type data; it's something we've been talking about internally.  Has anything been published on what you are doing in that space?

Thanks!
Garry

-----Original Message-----
From: Chris Riccomini [mailto:criccomini@linkedin.com] 
Sent: 17 October 2013 21:54
To: dev@samza.incubator.apache.org
Subject: Re: Special Bay Area HUG: Tajo and Samza

Hey Guys,

On a related note, my talk from the YARN meet up at LinkedIn is now online:

  https://www.youtube.com/watch?v=7YBmUKjzg7c

If you're not too familiar with Samza, this is a great place to start.

Also, feedback welcome on presentation content, style, etc.

Cheers,
Chris

On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:

>Hey everybody-
>   Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new 
>awesome Incubator projects, Tajo, a low-latency SQL query engine atop 
>YARN and Samza.
>
>http://www.meetup.com/hadoop/events/146077932/
>
>-Jakob



Re: Special Bay Area HUG: Tajo and Samza

Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Guys,

Also, updated the Wiki with upcoming talks and meetups:

  https://wiki.apache.org/samza/PapersAndTalks

October 22, 2013
Samza: Real-time Stream Processing at LinkedIn
London HUG http://www.meetup.com/hadoop-users-group-uk/events/140308872/

November 5, 2013
Samza: Real-time Stream Processing at LinkedIn
HUG at LinkedIn http://www.meetup.com/hadoop/events/146077932/

November 13, 2013
Samza: Real-time Stream Processing at LinkedIn
QCon SF 2013 
http://qconsf.com/presentation/samza-real-time-stream-processing-linkedin


Cheers,
Chris

On 10/17/13 1:54 PM, "Chris Riccomini" <cr...@linkedin.com> wrote:

>Hey Guys,
>
>On a related note, my talk from the YARN meet up at LinkedIn is now
>online:
>
>  https://www.youtube.com/watch?v=7YBmUKjzg7c
>
>If you're not too familiar with Samza, this is a great place to start.
>
>Also, feedback welcome on presentation content, style, etc.
>
>Cheers,
>Chris
>
>On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>
>>Hey everybody-
>>   Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
>>YARN
>>and Samza.
>>
>>http://www.meetup.com/hadoop/events/146077932/
>>
>>-Jakob
>


Re: Special Bay Area HUG: Tajo and Samza

Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Guys,

On a related note, my talk from the YARN meet up at LinkedIn is now online:

  https://www.youtube.com/watch?v=7YBmUKjzg7c

If you're not too familiar with Samza, this is a great place to start.

Also, feedback welcome on presentation content, style, etc.

Cheers,
Chris

On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:

>Hey everybody-
>   Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>awesome Incubator projects, Tajo, a low-latency SQL query engine atop YARN
>and Samza.
>
>http://www.meetup.com/hadoop/events/146077932/
>
>-Jakob