Posted to dev@samza.apache.org by Oshoma Momoh <os...@pcglab.com> on 2014/04/24 21:18:52 UTC

Getting started with Samza on Amazon EC2

Hi all,

I am setting up a Samza cluster for the first time, and am now at the point
of deploying on EC2.  Hopefully this is the correct place to ask a few
newbie questions. I'm impressed and excited by what I've seen so far, eager
to get going with a real deployment.

1. Does anyone have good or bad experiences to report in running Samza atop
Ubuntu 14.04 LTS? (Versus 12.04.)

2. Any best practices to recommend in terms of setup on EC2? E.g. instance
types to use, EBS volumes versus non-EBS, and so on.  I've found several
threads with conflicting opinions on all of this. Our current plan is...
(a) Use EBS volumes, separating Zookeeper from Kafka.
(b) Start with three m3.large instances and upgrade later as needed,
since our initial data volume will be low.
(c) Kafka + Zookeeper + Yarn Node Manager on two worker nodes, and Kafka +
Zookeeper + Yarn Resource Manager on the third node.
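
The plan in (a) to (c) can be written down as a small Python sketch, which
makes it easy to see what each host would run. This is only an illustration:
the host names node1 to node3 and the helper names are made up, and the
majority arithmetic is the general ZooKeeper quorum rule rather than anything
specific to Samza.

```python
# Planned layout from (c); the host names are hypothetical placeholders.
PLAN = {
    "node1": {"kafka", "zookeeper", "yarn-nodemanager"},
    "node2": {"kafka", "zookeeper", "yarn-nodemanager"},
    "node3": {"kafka", "zookeeper", "yarn-resourcemanager"},
}


def hosts_running(plan, service):
    """Return the sorted host names that run the given service."""
    return sorted(host for host, services in plan.items() if service in services)


def zk_tolerated_failures(ensemble_size):
    """A ZooKeeper ensemble stays available while a strict majority is up."""
    majority = ensemble_size // 2 + 1
    return ensemble_size - majority


if __name__ == "__main__":
    for service in ("kafka", "zookeeper",
                    "yarn-nodemanager", "yarn-resourcemanager"):
        print(service, hosts_running(PLAN, service))
    zk_hosts = hosts_running(PLAN, "zookeeper")
    print("ZooKeeper failures tolerated:", zk_tolerated_failures(len(zk_hosts)))
```

With this layout only two hosts run a Node Manager, and the three-node
ZooKeeper ensemble can lose one server and still keep a majority.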

Regards,

osh

Oshoma Momoh
http://pcglab.com

Re: Getting started with Samza on Amazon EC2

Posted by Oshoma Momoh <os...@pcglab.com>.
Thanks Darion and Garry, this is helpful.

I have read that Zookeeper is very latency-sensitive.

I'll definitely try YARN NM on all 3 hosts.

I'd be happy to contribute our findings to a FAQ or wiki page. One so far
is that YARN is the most complicated bit within this setup process, since
there is scant documentation on how to set up YARN without dragging in the
rest of Hadoop.

By the way I did come across an excellent presentation by Philip O'Toole of
Loggly (video <https://www.youtube.com/watch?v=LpNbjXFPyZ0>,
slides <http://www.slideshare.net/AmazonWebServices/infrastructure-at-scale-apache-kafka-twitter-storm-elastic-search-arc303-aws-reinvent-2013>)
that discusses how they use Kafka and Storm on EC2. No Samza. O'Toole
mentions using EBS volumes for Kafka and says they create daily volume
snapshots for disaster recovery purposes. I haven't found any mention of
disaster recovery for Kafka or Samza and I wondered if that even makes
sense given the replication/partition approach.
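
A daily snapshot job of the kind described in the talk might look roughly
like the Python sketch below. It assumes a boto3-style EC2 client, i.e. an
object offering create_snapshot(VolumeId=..., Description=...), is passed in.
The function name and the description text are made up for illustration, and
none of this is taken from the Loggly presentation itself.

```python
import datetime


def snapshot_volumes(ec2_client, volume_ids, today=None):
    """Request one EBS snapshot per volume and return the snapshot IDs.

    ec2_client is assumed to behave like a boto3 EC2 client; it is passed
    in so that the function has no hard dependency on AWS credentials.
    """
    today = today or datetime.date.today()
    snapshot_ids = []
    for volume_id in volume_ids:
        response = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description=f"daily kafka snapshot {volume_id} {today.isoformat()}",
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids
```

With boto3 installed the client would come from boto3.client("ec2"). Whether
such snapshots add much on top of Kafka's own replication is exactly the open
question above.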



On Thu, Apr 24, 2014 at 11:15 PM, darion <ch...@meilishuo.com> wrote:

> Samza runs on the JVM, so Ubuntu is probably OK.
>
> I haven't used Samza, but Spark and Storm work well on EC2,
> and both seem similar.

Re: Getting started with Samza on Amazon EC2

Posted by darion <ch...@meilishuo.com>.
Samza runs on the JVM, so Ubuntu is probably OK.

I haven't used Samza, but Spark and Storm work well on EC2,
and both seem similar.



RE: Getting started with Samza on Amazon EC2

Posted by Garry Turkington <g....@improvedigital.com>.
Hi Osh,

I've not run Samza on EC2 myself but have had numerous other workloads there.

I'm not surprised you find conflicting advice on these topics; hardware selection is a bit of a dark art and on EC2 even more so. For every recommended configuration that works for one person you'll find somebody for whom the exact same config almost destroyed their business. :)

If at all possible I'd suggest standing up the config you mentioned and trying it on as realistic a sample of data as you'll see in production. Particularly in terms of instance types and numbers this is the only data point that will actually be guaranteed to be valuable.

Your implicit assumption that ZK is likely to be the most sensitive to EC2 weirdness is almost certainly true. It may be worth going through the Zookeeper wiki and mailing list archives for any relevant best practice. Likely not a huge concern if your initial data rates are also low (you only mentioned volumes) but ZK can get into a pretty unhappy state if it starts seeing spikes in latency to the storage or between nodes in the ensemble.

One thing I would consider is to run the YARN NM on all 3 hosts -- the YARN RM is relatively lightly used, so by keeping its host free of an NM you are effectively limiting yourself to only 2 nodes for actual stream task processing.
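
To put rough numbers on that, here is a back-of-the-envelope Python sketch. The 7.5 GiB figure is the published memory of an m3.large; the 2 GiB held back per host for Kafka, ZooKeeper and the OS, and the 1 GiB container size, are assumptions picked purely for illustration, not measured values.

```python
M3_LARGE_MEMORY_MB = 7680   # 7.5 GiB, the published m3.large memory
RESERVED_MB = 2048          # assumed headroom for Kafka, ZooKeeper and the OS
CONTAINER_MB = 1024         # assumed size of one YARN container


def container_slots(node_managers, memory_mb=M3_LARGE_MEMORY_MB,
                    reserved_mb=RESERVED_MB, container_mb=CONTAINER_MB):
    """Number of containers that fit across all hosts running a NodeManager."""
    per_host = max(memory_mb - reserved_mb, 0) // container_mb
    return node_managers * per_host


if __name__ == "__main__":
    print("NM on 2 hosts:", container_slots(2))
    print("NM on 3 hosts:", container_slots(3))
```

Under these assumptions the third NM adds half again as many container slots. In practice the host that also runs the RM would have somewhat less to spare, so the real gain is likely a little smaller.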

Please feed back any experiences you have with Samza on EC2 as I suspect this will become a FAQ entry at some point once we have more experience. There's a desire to more directly support EC2 as a work scheduler but that's purely speculative at this point.

Good luck!
Garry
