Posted to dev@samza.apache.org by Jakob Homan <jg...@gmail.com> on 2013/10/17 20:08:13 UTC
Special Bay Area HUG: Tajo and Samza
Hey everybody-
Join us at LinkedIn Nov. 5 for a special HUG dedicated to two awesome
new Incubator projects: Tajo, a low-latency SQL query engine atop YARN,
and Samza.
http://www.meetup.com/hadoop/events/146077932/
-Jakob
Re: Special Bay Area HUG: Tajo and Samza
Posted by Chris Curtin <cu...@gmail.com>.
Hi Jay,
Do they record these meetups?
Thanks,
Chris
On Thu, Oct 17, 2013 at 5:03 PM, Jay Kreps <ja...@gmail.com> wrote:
> FYI.
>
> ---------- Forwarded message ----------
> From: Jakob Homan <jg...@gmail.com>
> Date: Thu, Oct 17, 2013 at 11:08 AM
> Subject: Special Bay Area HUG: Tajo and Samza
> To: dev@samza.incubator.apache.org
>
>
> Hey everybody-
> Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
> awesome Incubator projects, Tajo, a low-latency SQL query engine atop YARN
> and Samza.
>
> http://www.meetup.com/hadoop/events/146077932/
>
> -Jakob
>
Fwd: Special Bay Area HUG: Tajo and Samza
Posted by Jay Kreps <ja...@gmail.com>.
FYI.
---------- Forwarded message ----------
From: Jakob Homan <jg...@gmail.com>
Date: Thu, Oct 17, 2013 at 11:08 AM
Subject: Special Bay Area HUG: Tajo and Samza
To: dev@samza.incubator.apache.org
Hey everybody-
Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
awesome Incubator projects, Tajo, a low-latency SQL query engine atop YARN
and Samza.
http://www.meetup.com/hadoop/events/146077932/
-Jakob
Re: Special Bay Area HUG: Tajo and Samza
Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Guys,
Very cool. :) When we have something up, I'll try and follow up here to
let you know.
In the meantime, you might have a look at Google's Dapper paper:
http://research.google.com/pubs/pub36356.html
And the Google Wide Profiling paper:
http://research.google.com/pubs/pub36575.html
Both are great resources for folks building things in this area.
Cheers,
Chris
On 10/20/13 7:15 AM, "Philip Reynolds" <ph...@gmail.com> wrote:
>Just to chime in, I'd be very interested in the monitoring blog post too.
>We're doing a kafka implementation for a robust data pipeline. Initially,
>samza does look interesting for monitoring use-cases.
>
>On Sun, Oct 20, 2013 at 2:53 PM, Garry Turkington <
>g.turkington@improvedigital.com> wrote:
>
>> Hi Chris,
>>
>> Thanks for all this, makes sense. Be interested to hear where things go
>> with the locality optimizations. I'm just looking at deploying our first
>> Kafka cluster to change how we do data distribution and that's not
>>going to
>> initially be collocated with the Hadoop cluster. Samza's tight Kafka
>> integration is one of the things that has drawn me to it so I'm looking
>> forward (!) to seeing what sort of performance/latency I get from the
>> remote/smaller Kafka setup.
>>
>> Looking forward to the blog post on the monitoring jobs written in
>>Samza.
>> We're in the earlier stages of a common service framework so have the
>> luxury of building on the experiences of others who learned this stuff
>>the
>> hard way. :)
>>
>> Regards
>> Garry
>>
>> -----Original Message-----
>> From: Chris Riccomini [mailto:criccomini@linkedin.com]
>> Sent: 18 October 2013 19:01
>> To: dev@samza.incubator.apache.org
>> Subject: Re: Special Bay Area HUG: Tajo and Samza
>>
>> Hey Garry,
>>
>> Thanks!
>>
>> Locality: A few things to note here.
>>
>> 1. We run one broker per host, as you suggest (18 nodes = 18 brokers).
>> 2. Samza does not explicitly try to do any co-location right now. Any
>> locality that we get is purely luck.
>> 3. YARN allows you to make resource requests for a specific host/rack.
>> This is the feature we would like to use to provide better locality.
>>
>> We haven't done any meaningful evaluation of the locality we're getting
>> (or would get) right now, though.
>>
>> Operations: Yes, we have a pretty cool set of Samza jobs that Jakob
>>wrote
>> to do some metrics/monitoring stuff. He can probably talk more about it
>> than I can. We're planning on putting up a blog post in the near future
>> about it.
>>
>> More broadly, we have a pretty well defined service container at
>>LinkedIn.
>> These services are called via RPC. Every time an RPC request is made,
>>the
>> service logs out information about the request: who sent the request,
>>what
>> method was called, how long it took to process, etc etc. In addition, we
>> also have all WARN/ERROR log events flowing through Kafka as well (via
>> Kafka's Log4j appender). There is a brief mention of this in:
>>
>> http://sites.computer.org/debull/A12june/pipeline.pdf
>>
>> As you can imagine, there are a ton of things you can do with this
>>data. :)
>>
>> Cheers,
>> Chris
>>
>> On 10/18/13 4:44 AM, "Garry Turkington"
>><g....@improvedigital.com>
>> wrote:
>>
>> >Hi Chris,
>> >
>> >Nice presentation -- 2 questions:
>> >
>> >1. I had wondered about the references to Kafka broker colocation I'd
>> >seen around the place. So for example in the 18-node sized cluster you
>> >mention you'd have 18 Kafka brokers running there, 1 per host? Do you
>> >actually get any sort of data locality benefits from this, is there a
>> >way to ensure that the Samza container on host x is processing the
>> >partitions of each topic on the collocated Kafka broker? Or am I
>>missing
>> the intent?
>> >
>> >2. Interested at your mention of using something like Samza for
>> >processing of monitoring and metric type data, it's something we've
>> >been talking about internally. Anything been published on what you are
>> >doing in that space?
>> >
>> >Thanks!
>> >Garry
>> >
>> >-----Original Message-----
>> >From: Chris Riccomini [mailto:criccomini@linkedin.com]
>> >Sent: 17 October 2013 21:54
>> >To: dev@samza.incubator.apache.org
>> >Subject: Re: Special Bay Area HUG: Tajo and Samza
>> >
>> >Hey Guys,
>> >
>> >On a related note, my talk from the YARN meet up at LinkedIn is now
>> >online:
>> >
>> > https://www.youtube.com/watch?v=7YBmUKjzg7c
>> >
>> >If you're not too familiar with Samza, this is a great place to start.
>> >
>> >Also, feedback welcome on presentation content, style, etc.
>> >
>> >Cheers,
>> >Chris
>> >
>> >On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>> >
>> >>Hey everybody-
>> >> Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>> >>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
>> >>YARN and Samza.
>> >>
>> >>http://www.meetup.com/hadoop/events/146077932/
>> >>
>> >>-Jakob
Re: Special Bay Area HUG: Tajo and Samza
Posted by Philip Reynolds <ph...@gmail.com>.
Just to chime in, I'd be very interested in the monitoring blog post too.
We're doing a kafka implementation for a robust data pipeline. Initially,
samza does look interesting for monitoring use-cases.
On Sun, Oct 20, 2013 at 2:53 PM, Garry Turkington <
g.turkington@improvedigital.com> wrote:
> Hi Chris,
>
> Thanks for all this, makes sense. Be interested to hear where things go
> with the locality optimizations. I'm just looking at deploying our first
> Kafka cluster to change how we do data distribution and that's not going to
> initially be collocated with the Hadoop cluster. Samza's tight Kafka
> integration is one of the things that has drawn me to it so I'm looking
> forward (!) to seeing what sort of performance/latency I get from the
> remote/smaller Kafka setup.
>
> Looking forward to the blog post on the monitoring jobs written in Samza.
> We're in the earlier stages of a common service framework so have the
> luxury of building on the experiences of others who learned this stuff the
> hard way. :)
>
> Regards
> Garry
>
> -----Original Message-----
> From: Chris Riccomini [mailto:criccomini@linkedin.com]
> Sent: 18 October 2013 19:01
> To: dev@samza.incubator.apache.org
> Subject: Re: Special Bay Area HUG: Tajo and Samza
>
> Hey Garry,
>
> Thanks!
>
> Locality: A few things to note here.
>
> 1. We run one broker per host, as you suggest (18 nodes = 18 brokers).
> 2. Samza does not explicitly try to do any co-location right now. Any
> locality that we get is purely luck.
> 3. YARN allows you to make resource requests for a specific host/rack.
> This is the feature we would like to use to provide better locality.
>
> We haven't done any meaningful evaluation of the locality we're getting
> (or would get) right now, though.
>
> Operations: Yes, we have a pretty cool set of Samza jobs that Jakob wrote
> to do some metrics/monitoring stuff. He can probably talk more about it
> than I can. We're planning on putting up a blog post in the near future
> about it.
>
> More broadly, we have a pretty well defined service container at LinkedIn.
> These services are called via RPC. Every time an RPC request is made, the
> service logs out information about the request: who sent the request, what
> method was called, how long it took to process, etc etc. In addition, we
> also have all WARN/ERROR log events flowing through Kafka as well (via
> Kafka's Log4j appender). There is a brief mention of this in:
>
> http://sites.computer.org/debull/A12june/pipeline.pdf
>
> As you can imagine, there are a ton of things you can do with this data. :)
>
> Cheers,
> Chris
>
> On 10/18/13 4:44 AM, "Garry Turkington" <g....@improvedigital.com>
> wrote:
>
> >Hi Chris,
> >
> >Nice presentation -- 2 questions:
> >
> >1. I had wondered about the references to Kafka broker colocation I'd
> >seen around the place. So for example in the 18-node sized cluster you
> >mention you'd have 18 Kafka brokers running there, 1 per host? Do you
> >actually get any sort of data locality benefits from this, is there a
> >way to ensure that the Samza container on host x is processing the
> >partitions of each topic on the collocated Kafka broker? Or am I missing
> the intent?
> >
> >2. Interested at your mention of using something like Samza for
> >processing of monitoring and metric type data, it's something we've
> >been talking about internally. Anything been published on what you are
> >doing in that space?
> >
> >Thanks!
> >Garry
> >
> >-----Original Message-----
> >From: Chris Riccomini [mailto:criccomini@linkedin.com]
> >Sent: 17 October 2013 21:54
> >To: dev@samza.incubator.apache.org
> >Subject: Re: Special Bay Area HUG: Tajo and Samza
> >
> >Hey Guys,
> >
> >On a related note, my talk from the YARN meet up at LinkedIn is now
> >online:
> >
> > https://www.youtube.com/watch?v=7YBmUKjzg7c
> >
> >If you're not too familiar with Samza, this is a great place to start.
> >
> >Also, feedback welcome on presentation content, style, etc.
> >
> >Cheers,
> >Chris
> >
> >On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
> >
> >>Hey everybody-
> >> Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
> >>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
> >>YARN and Samza.
> >>
> >>http://www.meetup.com/hadoop/events/146077932/
> >>
> >>-Jakob
RE: Special Bay Area HUG: Tajo and Samza
Posted by Garry Turkington <g....@improvedigital.com>.
Hi Chris,
Thanks for all this, makes sense. Be interested to hear where things go with the locality optimizations. I'm just looking at deploying our first Kafka cluster to change how we do data distribution and that's not going to initially be collocated with the Hadoop cluster. Samza's tight Kafka integration is one of the things that has drawn me to it so I'm looking forward (!) to seeing what sort of performance/latency I get from the remote/smaller Kafka setup.
Looking forward to the blog post on the monitoring jobs written in Samza. We're in the earlier stages of a common service framework so have the luxury of building on the experiences of others who learned this stuff the hard way. :)
Regards
Garry
-----Original Message-----
From: Chris Riccomini [mailto:criccomini@linkedin.com]
Sent: 18 October 2013 19:01
To: dev@samza.incubator.apache.org
Subject: Re: Special Bay Area HUG: Tajo and Samza
Hey Garry,
Thanks!
Locality: A few things to note here.
1. We run one broker per host, as you suggest (18 nodes = 18 brokers).
2. Samza does not explicitly try to do any co-location right now. Any locality that we get is purely luck.
3. YARN allows you to make resource requests for a specific host/rack.
This is the feature we would like to use to provide better locality.
We haven't done any meaningful evaluation of the locality we're getting (or would get) right now, though.
Operations: Yes, we have a pretty cool set of Samza jobs that Jakob wrote to do some metrics/monitoring stuff. He can probably talk more about it than I can. We're planning on putting up a blog post in the near future about it.
More broadly, we have a pretty well defined service container at LinkedIn.
These services are called via RPC. Every time an RPC request is made, the service logs out information about the request: who sent the request, what method was called, how long it took to process, etc etc. In addition, we also have all WARN/ERROR log events flowing through Kafka as well (via Kafka's Log4j appender). There is a brief mention of this in:
http://sites.computer.org/debull/A12june/pipeline.pdf
As you can imagine, there are a ton of things you can do with this data. :)
Cheers,
Chris
On 10/18/13 4:44 AM, "Garry Turkington" <g....@improvedigital.com>
wrote:
>Hi Chris,
>
>Nice presentation -- 2 questions:
>
>1. I had wondered about the references to Kafka broker colocation I'd
>seen around the place. So for example in the 18-node sized cluster you
>mention you'd have 18 Kafka brokers running there, 1 per host? Do you
>actually get any sort of data locality benefits from this, is there a
>way to ensure that the Samza container on host x is processing the
>partitions of each topic on the collocated Kafka broker? Or am I missing the intent?
>
>2. Interested at your mention of using something like Samza for
>processing of monitoring and metric type data, it's something we've
>been talking about internally. Anything been published on what you are
>doing in that space?
>
>Thanks!
>Garry
>
>-----Original Message-----
>From: Chris Riccomini [mailto:criccomini@linkedin.com]
>Sent: 17 October 2013 21:54
>To: dev@samza.incubator.apache.org
>Subject: Re: Special Bay Area HUG: Tajo and Samza
>
>Hey Guys,
>
>On a related note, my talk from the YARN meet up at LinkedIn is now
>online:
>
> https://www.youtube.com/watch?v=7YBmUKjzg7c
>
>If you're not too familiar with Samza, this is a great place to start.
>
>Also, feedback welcome on presentation content, style, etc.
>
>Cheers,
>Chris
>
>On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>
>>Hey everybody-
>> Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
>>YARN and Samza.
>>
>>http://www.meetup.com/hadoop/events/146077932/
>>
>>-Jakob
Re: Special Bay Area HUG: Tajo and Samza
Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Garry,
Thanks!
Locality: A few things to note here.
1. We run one broker per host, as you suggest (18 nodes = 18 brokers).
2. Samza does not explicitly try to do any co-location right now. Any
locality that we get is purely luck.
3. YARN allows you to make resource requests for a specific host/rack.
This is the feature we would like to use to provide better locality.
We haven't done any meaningful evaluation of the locality we're getting
(or would get) right now, though.
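
One way to put a number on the locality discussed above is to compare the host of the broker leading each partition with the host of the container consuming it. What follows is only a minimal sketch under stated assumptions: the host names, the `partition_leaders` and `container_assignments` mappings, and the `locality_ratio` helper are all hypothetical, and nothing here calls Samza, Kafka, or YARN APIs.

```python
from typing import Dict


def locality_ratio(partition_leaders: Dict[int, str],
                   container_assignments: Dict[int, str]) -> float:
    """Return the fraction of partitions consumed on the same host as their broker."""
    if not partition_leaders:
        return 0.0
    local = sum(
        1
        for partition, broker_host in partition_leaders.items()
        if container_assignments.get(partition) == broker_host
    )
    return local / len(partition_leaders)


# Hypothetical 4-partition topic: partition -> host of the broker leading it.
partition_leaders = {0: "host-a", 1: "host-b", 2: "host-c", 3: "host-a"}
# Hypothetical placement: partition -> host of the container consuming it.
container_assignments = {0: "host-a", 1: "host-c", 2: "host-c", 3: "host-b"}

ratio = locality_ratio(partition_leaders, container_assignments)
print(f"{ratio:.0%} of partitions are read from a local broker")
```

With no explicit placement, this ratio is whatever chance happens to give; host-specific YARN resource requests would be one way to try to raise it.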
Operations: Yes, we have a pretty cool set of Samza jobs that Jakob wrote
to do some metrics/monitoring stuff. He can probably talk more about it
than I can. We're planning on putting up a blog post in the near future
about it.
More broadly, we have a pretty well defined service container at LinkedIn.
These services are called via RPC. Every time an RPC request is made, the
service logs information about the request: who sent the request, what
method was called, how long it took to process, etc. In addition, we
also have all WARN/ERROR log events flowing through Kafka (via
Kafka's Log4j appender). There is a brief mention of this in:
http://sites.computer.org/debull/A12june/pipeline.pdf
As you can imagine, there are a ton of things you can do with this data. :)
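
As one small illustration of what can be done with request logs like these, the sketch below aggregates call counts and average latency per method. The event fields (`caller`, `method`, `duration_ms`) and the sample values are assumptions made up for illustration, not the actual schema; in a real job the events would arrive from a Kafka topic rather than an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize_requests(events: Iterable[dict]) -> Dict[str, Tuple[int, float]]:
    """Map each method name to (call count, average duration in ms)."""
    totals = defaultdict(lambda: [0, 0.0])  # method -> [count, summed duration]
    for event in events:
        entry = totals[event["method"]]
        entry[0] += 1
        entry[1] += event["duration_ms"]
    return {
        method: (count, total / count)
        for method, (count, total) in totals.items()
    }


# Hypothetical RPC request log events.
events = [
    {"caller": "web-frontend", "method": "getProfile", "duration_ms": 12.0},
    {"caller": "web-frontend", "method": "getProfile", "duration_ms": 18.0},
    {"caller": "search", "method": "getConnections", "duration_ms": 40.0},
]

summary = summarize_requests(events)
for method, (count, avg_ms) in sorted(summary.items()):
    print(f"{method}: {count} calls, {avg_ms:.1f} ms average")
```

The same grouping could just as well be keyed on `caller` to see which services generate the most traffic.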
Cheers,
Chris
On 10/18/13 4:44 AM, "Garry Turkington" <g....@improvedigital.com>
wrote:
>Hi Chris,
>
>Nice presentation -- 2 questions:
>
>1. I had wondered about the references to Kafka broker colocation I'd
>seen around the place. So for example in the 18-node sized cluster you
>mention you'd have 18 Kafka brokers running there, 1 per host? Do you
>actually get any sort of data locality benefits from this, is there a way
>to ensure that the Samza container on host x is processing the partitions
>of each topic on the collocated Kafka broker? Or am I missing the intent?
>
>2. Interested at your mention of using something like Samza for
>processing of monitoring and metric type data, it's something we've been
>talking about internally. Anything been published on what you are doing
>in that space?
>
>Thanks!
>Garry
>
>-----Original Message-----
>From: Chris Riccomini [mailto:criccomini@linkedin.com]
>Sent: 17 October 2013 21:54
>To: dev@samza.incubator.apache.org
>Subject: Re: Special Bay Area HUG: Tajo and Samza
>
>Hey Guys,
>
>On a related note, my talk from the YARN meet up at LinkedIn is now
>online:
>
> https://www.youtube.com/watch?v=7YBmUKjzg7c
>
>If you're not too familiar with Samza, this is a great place to start.
>
>Also, feedback welcome on presentation content, style, etc.
>
>Cheers,
>Chris
>
>On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>
>>Hey everybody-
>> Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
>>YARN and Samza.
>>
>>http://www.meetup.com/hadoop/events/146077932/
>>
>>-Jakob
RE: Special Bay Area HUG: Tajo and Samza
Posted by Garry Turkington <g....@improvedigital.com>.
Hi Chris,
Nice presentation -- 2 questions:
1. I had wondered about the references to Kafka broker colocation I'd seen around the place. So, for example, in the 18-node cluster you mention, you'd have 18 Kafka brokers running there, 1 per host? Do you actually get any sort of data locality benefit from this? Is there a way to ensure that the Samza container on host x is processing the partitions of each topic on the collocated Kafka broker? Or am I missing the intent?
2. Interested in your mention of using something like Samza for processing monitoring and metric type data; it's something we've been talking about internally. Has anything been published on what you are doing in that space?
Thanks!
Garry
-----Original Message-----
From: Chris Riccomini [mailto:criccomini@linkedin.com]
Sent: 17 October 2013 21:54
To: dev@samza.incubator.apache.org
Subject: Re: Special Bay Area HUG: Tajo and Samza
Hey Guys,
On a related note, my talk from the YARN meet up at LinkedIn is now online:
https://www.youtube.com/watch?v=7YBmUKjzg7c
If you're not too familiar with Samza, this is a great place to start.
Also, feedback welcome on presentation content, style, etc.
Cheers,
Chris
On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>Hey everybody-
> Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
>YARN and Samza.
>
>http://www.meetup.com/hadoop/events/146077932/
>
>-Jakob
Re: Special Bay Area HUG: Tajo and Samza
Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Guys,
Also, updated the Wiki with upcoming talks and meetups:
https://wiki.apache.org/samza/PapersAndTalks
October 22, 2013
Samza: Real-time Stream Processing at LinkedIn
London HUG http://www.meetup.com/hadoop-users-group-uk/events/140308872/
November 5, 2013
Samza: Real-time Stream Processing at LinkedIn
HUG at LinkedIn http://www.meetup.com/hadoop/events/146077932/
November 13, 2013
Samza: Real-time Stream Processing at LinkedIn
QCon SF 2013
http://qconsf.com/presentation/samza-real-time-stream-processing-linkedin
Cheers,
Chris
On 10/17/13 1:54 PM, "Chris Riccomini" <cr...@linkedin.com> wrote:
>Hey Guys,
>
>On a related note, my talk from the YARN meet up at LinkedIn is now
>online:
>
> https://www.youtube.com/watch?v=7YBmUKjzg7c
>
>If you're not too familiar with Samza, this is a great place to start.
>
>Also, feedback welcome on presentation content, style, etc.
>
>Cheers,
>Chris
>
>On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>
>>Hey everybody-
>> Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>>awesome Incubator projects, Tajo, a low-latency SQL query engine atop
>>YARN
>>and Samza.
>>
>>http://www.meetup.com/hadoop/events/146077932/
>>
>>-Jakob
>
Re: Special Bay Area HUG: Tajo and Samza
Posted by Chris Riccomini <cr...@linkedin.com>.
Hey Guys,
On a related note, my talk from the YARN meet up at LinkedIn is now online:
https://www.youtube.com/watch?v=7YBmUKjzg7c
If you're not too familiar with Samza, this is a great place to start.
Also, feedback welcome on presentation content, style, etc.
Cheers,
Chris
On 10/17/13 11:08 AM, "Jakob Homan" <jg...@gmail.com> wrote:
>Hey everybody-
> Join us at LinkedIn Nov. 5 for a special HUG dedicated to two new
>awesome Incubator projects, Tajo, a low-latency SQL query engine atop YARN
>and Samza.
>
>http://www.meetup.com/hadoop/events/146077932/
>
>-Jakob