Posted to dev@zookeeper.apache.org by Mahadev Konar <ma...@yahoo-inc.com> on 2009/11/07 03:40:01 UTC

ApacheCon 2009 Meetup talk.

Hi all,
  I gave a brief overview of ZooKeeper and BookKeeper at the ApacheCon
meetup this week.  The talk is uploaded at

http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations

In case you guys are interested.

Thanks
mahadev


Re: ZK on EC2

Posted by Ted Dunning <te...@gmail.com>.
Several of our search engines use pretty large heaps (12-24GB).  That means
that if they *ever* do a full collection, disaster ensues because it can
take so long.

That means that we have to use concurrent collectors as much as possible and
make sure that the concurrent collectors get all the ephemeral garbage.  One
server, for instance, uses the following java options:

      -verbose:gc
      -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
      -XX:+PrintTenuringDistribution

These options give us lots of detail about what is happening in the
collections.  Most importantly, we need to know that the tenuring
distribution never has any significant tail of objects that survive long
enough to be promoted into the old generation, where they would eventually
force a full collection.  This is pretty safe in general because our
servers either create objects to respond to a single request or create
cached items that survive essentially forever.
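
With those flags, each young collection logs a tenuring table that looks
roughly like this (the byte counts here are made up for illustration):

      Desired survivor size 5242880 bytes, new threshold 6 (max 6)
      - age   1:    1048576 bytes,    1048576 total
      - age   2:     262144 bytes,    1310720 total

What we watch for is the byte counts dying off quickly with age; a fat
tail here means objects are about to be tenured.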

      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC

Concurrent collectors are critical.  We use the HBase recommendations here.

      -XX:MaxTenuringThreshold=6 -XX:SurvivorRatio=6

Max tenuring threshold is related to what we saw in the tenuring
distribution.  We very rarely see any object survive 4 collections, so a
threshold of 6 means an object would have to survive two more collections
than we ever observe in order to become tenured.  The survivor ratio is
related to this and is set based on recommendations for non-stop, low
latency servers; SurvivorRatio=6 makes eden six times the size of each
survivor space, so each survivor space is 1/8 of the young generation.

      -XX:CMSInitiatingOccupancyFraction=60
      -XX:+UseCMSInitiatingOccupancyOnly

CMS collections can be triggered in a couple of ways.  We limit it to a
single trigger (old generation occupancy crossing 60%) to make the world
simpler.  Again, this is taken from outside recommendations from the HBase
guys and other commenters on the web.

      -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC

I doubt that these are critical.  Parallel remark shortens the remark
pause, which is always nice, and disabling explicit GC avoids any
possibility of some library calling System.gc() and triggering a huge full
collection.

      -XX:ParallelGCThreads=8

If the parallel GC needs horsepower, I want it to get it.

      -Xdebug

Very rarely useful, but a royal pain if not enabled when you do need it.  I
don't know if it has a performance impact (I think not).

      -Xms8000m -Xmx8000m

Setting the minimum heap equal to the maximum avoids heap resizing and
helps avoid full GCs during the early life of the server.
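
For reference, here is a sketch of all of those options assembled into one
launch command (the jar and main class are placeholders, not our actual
server):

      java -verbose:gc \
           -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution \
           -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
           -XX:MaxTenuringThreshold=6 -XX:SurvivorRatio=6 \
           -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly \
           -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
           -XX:ParallelGCThreads=8 \
           -Xdebug -Xms8000m -Xmx8000m \
           -cp search-server.jar com.example.SearchServer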


On Tue, Nov 10, 2009 at 11:27 AM, Patrick Hunt <ph...@apache.org> wrote:

> Can you elaborate on "gc tuning" - you are using the incremental collector?
>
> Patrick
>
>
> Ted Dunning wrote:
>
>> The server side is a fairly standard (but old) config:
>>
>> tickTime=2000
>> dataDir=/home/zookeeper/
>> clientPort=2181
>> initLimit=5
>> syncLimit=2
>>
>> Most of our clients now use 5 seconds as the timeout, but I think that we
>> went to longer timeouts in the past.  Without digging in to determine the
>> truth of the matter, my guess is that we needed the longer timeouts before
>> we tuned the GC parameters and that after tuning GC, we were able to return
>> to a more reasonable timeout.  In retrospect, I think that we blamed EC2 for
>> some of our own GC misconfiguration.
>>
>> I would not use our configuration here as canonical since we didn't apply a
>> whole lot of brainpower to this problem.
>>
>> On Tue, Nov 10, 2009 at 9:29 AM, Patrick Hunt <ph...@apache.org> wrote:
>>
>>> Ted, could you provide your configuration information for the cluster (incl
>>> the client timeout you use), if you're willing I'd be happy to put this up
>>> on the wiki for others interested in running in EC2.
>>>
>>>
>>
>>
>>


-- 
Ted Dunning, CTO
DeepDyve

Re: ZK on EC2

Posted by Patrick Hunt <ph...@apache.org>.
Can you elaborate on "gc tuning" - you are using the incremental collector?

Patrick

Ted Dunning wrote:
> The server side is a fairly standard (but old) config:
> 
> tickTime=2000
> dataDir=/home/zookeeper/
> clientPort=2181
> initLimit=5
> syncLimit=2
> 
> Most of our clients now use 5 seconds as the timeout, but I think that we
> went to longer timeouts in the past.  Without digging in to determine the
> truth of the matter, my guess is that we needed the longer timeouts before
> we tuned the GC parameters and that after tuning GC, we were able to return
> to a more reasonable timeout.  In retrospect, I think that we blamed EC2 for
> some of our own GC misconfiguration.
> 
> I would not use our configuration here as canonical since we didn't apply a
> whole lot of brainpower to this problem.
> 
> On Tue, Nov 10, 2009 at 9:29 AM, Patrick Hunt <ph...@apache.org> wrote:
> 
>> Ted, could you provide your configuration information for the cluster (incl
>> the client timeout you use), if you're willing I'd be happy to put this up
>> on the wiki for others interested in running in EC2.
>>
> 
> 
> 

Re: ZK on EC2

Posted by Ted Dunning <te...@gmail.com>.
The server side is a fairly standard (but old) config:

tickTime=2000
dataDir=/home/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
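
(The server.N lines listing the ensemble members are omitted above; in a
three-node ensemble they would look something like this, with the hostnames
being placeholders:

server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

initLimit and syncLimit are counted in ticks and only matter when there is
an ensemble to sync.)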

Most of our clients now use 5 seconds as the timeout, but I think that we
went to longer timeouts in the past.  Without digging in to determine the
truth of the matter, my guess is that we needed the longer timeouts before
we tuned the GC parameters and that after tuning GC, we were able to return
to a more reasonable timeout.  In retrospect, I think that we blamed EC2 for
some of our own GC misconfiguration.

I would not use our configuration here as canonical since we didn't apply a
whole lot of brainpower to this problem.
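
If it helps anyone wiring up a client: the timeout is just the
sessionTimeout argument, in milliseconds, to the ZooKeeper constructor.
Here is a minimal sketch (the host list is an example, not our real
ensemble):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class Connect {
    public static void main(String[] args) throws Exception {
        // 5000 ms session timeout, matching what our clients use now.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        // Disconnect/expiration events show up here.
                        System.out.println("event: " + event);
                    }
                });
        // ... do work with zk ...
        zk.close();
    }
}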

On Tue, Nov 10, 2009 at 9:29 AM, Patrick Hunt <ph...@apache.org> wrote:

> Ted, could you provide your configuration information for the cluster (incl
> the client timeout you use), if you're willing I'd be happy to put this up
> on the wiki for others interested in running in EC2.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: ZK on EC2

Posted by Patrick Hunt <ph...@apache.org>.
Ok, good. Based on the comparison of perf numbers, and Ted's experience 
with large instances on ec2 running zk, this makes sense to me. A large 
is about half (very roughly) the horsepower of what I was using for my 
tests. Ted mentioned that he hasn't seen issues on ec2 running with 
large instances and that correlates to these numbers (again, this is all 
rough back of the envelope type stuff but good enough imo).

Anyone have a small instance they could run the same cpu/disk/network
tests on? I'd be interested to see how that stacks up.

Ted, could you provide your configuration information for the cluster 
(incl the client timeout you use), if you're willing I'd be happy to put 
this up on the wiki for others interested in running in EC2.

Thanks!

Patrick

Ted Dunning wrote:
> I only have one large instance live.  My impression from previous
> experience is that between-host bandwidth is generally about what you saw.
> We have been able to sustain 20-30MB/s into EC2 to a single node, which
> should be harder than moving data between nodes.  I have heard rumors that
> others were able to get double what I got for incoming transfer.
> 
> On Mon, Nov 9, 2009 at 9:47 PM, Patrick Hunt <ph...@apache.org> wrote:
> 
>> Could you test networking - scping data between hosts? (I was seeing
>> 64.1MB/s for a 512mb file - the one created by dd, random data)
>>
> 
> 
> 

Re: ZK on EC2

Posted by Ted Dunning <te...@gmail.com>.
I only have one large instance live.  My impression from previous
experience is that between-host bandwidth is generally about what you saw.
We have been able to sustain 20-30MB/s into EC2 to a single node, which
should be harder than moving data between nodes.  I have heard rumors that
others were able to get double what I got for incoming transfer.

On Mon, Nov 9, 2009 at 9:47 PM, Patrick Hunt <ph...@apache.org> wrote:

> Could you test networking - scping data between hosts? (I was seeing
> 64.1MB/s for a 512mb file - the one created by dd, random data)
>



-- 
Ted Dunning, CTO
DeepDyve

Re: ZK on EC2

Posted by Patrick Hunt <ph...@apache.org>.
Interesting, so comparing a large (4 cores and "high" I/O performance)
ec2 instance (the first number on each line below) vs the host I used in
the latency test (the second number on each line):

ebs cache        817 vs 11532 MB/s ~  7% (ec2 7% as performant)
ebs bufread       53 vs    88 MB/s ~ 60%
native cache     829 vs 11532 MB/s ~  7%
native bufread    80 vs    88 MB/s ~ 90%

dd 512m      106s  vs 74s  ~ 43% longer for ec2 large

md5sum 512m  2.13s vs 1.5s ~ 42% longer

Good thing we don't rely on disk cache. ;-) Raw processing power looks 
about half. Could you test networking - scping data between hosts? (I 
was seeing 64.1MB/s for a 512mb file - the one created by dd, random data)
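
For anyone repeating the test, it amounts to creating the file with dd and
scping it to a second host; the hostname below is a placeholder:

      dd if=/dev/urandom bs=512000 count=1050 of=/tmp/memtest
      scp /tmp/memtest other-host:/tmp/memtest

scp prints the effective transfer rate as it runs.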

Small anyone?

Patrick

Ted Dunning wrote:
> /dev/sdp is an EBS volume.  /dev/sdb is a native volume.
> 
> This is a large instance.
> 
> root@domU-<mumble>:~# hdparm -tT /dev/sdp
> 
> /dev/sdp:
>  Timing cached reads:   1634 MB in  2.00 seconds = 817.30 MB/sec
>  Timing buffered disk reads:  160 MB in  3.00 seconds =  53.27 MB/sec
> root@domU-<mumble>:~# hdparm -tT /dev/sdb
> 
> /dev/sdb:
>  Timing cached reads:   1658 MB in  2.00 seconds = 829.44 MB/sec
>  Timing buffered disk reads:  242 MB in  3.00 seconds =  80.56 MB/sec
> root@domU-<mumble>:~# time dd if=/dev/urandom bs=512000 of=/tmp/memtest
> count=1050
> 1050+0 records in
> 1050+0 records out
> 537600000 bytes (538 MB) copied, 106.525 s, 5.0 MB/s
> 
> real    1m46.517s
> user    0m0.000s
> sys    1m46.127s
> root@domU-<mumble>:~# time md5sum /tmp/memtest; time md5sum /tmp/memtest;
> time md5sum /tmp/memtest
> f79304f68ce04011ca0aebfbd548134a  /tmp/memtest
> 
> real    0m2.234s
> user    0m1.613s
> sys    0m0.590s
> f79304f68ce04011ca0aebfbd548134a  /tmp/memtest
> 
> real    0m2.136s
> user    0m1.560s
> sys    0m0.584s
> f79304f68ce04011ca0aebfbd548134a  /tmp/memtest
> 
> real    0m2.123s
> user    0m1.640s
> sys    0m0.481s
> root@domU-<mumble>:~#
> 
> 
> On Mon, Nov 9, 2009 at 4:54 PM, Patrick Hunt <ph...@apache.org> wrote:
> 
>> I'm really interested to know how ec2 compares wrt disk and network
>> performance to what I've documented here under the "hardware" section:
>> http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview#Hardware
>>
>> Is it possible for someone to compare the network and disk performance
>> (scp, dd, md5sum, etc...) that I document in the wiki page on say, EC2
>> small/large nodes? I'd do it myself but I've not used ec2. If anyone could
>> try these and report I'd appreciate it.
>>
>> Patrick
>>
>>
>> Ted Dunning wrote:
>>
>>> Worked pretty well for me.  We did extend all of our timeouts.  The biggest
>>> worry for us was timeouts on the client side.  The ZK server side was no
>>> problem in that respect.
>>>
>>> On Mon, Nov 9, 2009 at 4:20 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
>>>
>>>> Has anyone deployed ZK on EC2? What's the experience there? Are there more
>>>> timeouts, leader re-election, etc? Thanks,
>>>>
>>>> Jun
>>>> IBM Almaden Research Center
>>>> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>>>>
>>>> junrao@almaden.ibm.com
>>>>
>>>
>>>
>>>
>>>
> 
> 

Re: ZK on EC2

Posted by Ted Dunning <te...@gmail.com>.
/dev/sdp is an EBS volume.  /dev/sdb is a native volume.

This is a large instance.

root@domU-<mumble>:~# hdparm -tT /dev/sdp

/dev/sdp:
 Timing cached reads:   1634 MB in  2.00 seconds = 817.30 MB/sec
 Timing buffered disk reads:  160 MB in  3.00 seconds =  53.27 MB/sec
root@domU-<mumble>:~# hdparm -tT /dev/sdb

/dev/sdb:
 Timing cached reads:   1658 MB in  2.00 seconds = 829.44 MB/sec
 Timing buffered disk reads:  242 MB in  3.00 seconds =  80.56 MB/sec
root@domU-<mumble>:~# time dd if=/dev/urandom bs=512000 of=/tmp/memtest
count=1050
1050+0 records in
1050+0 records out
537600000 bytes (538 MB) copied, 106.525 s, 5.0 MB/s

real    1m46.517s
user    0m0.000s
sys    1m46.127s
root@domU-<mumble>:~# time md5sum /tmp/memtest; time md5sum /tmp/memtest;
time md5sum /tmp/memtest
f79304f68ce04011ca0aebfbd548134a  /tmp/memtest

real    0m2.234s
user    0m1.613s
sys    0m0.590s
f79304f68ce04011ca0aebfbd548134a  /tmp/memtest

real    0m2.136s
user    0m1.560s
sys    0m0.584s
f79304f68ce04011ca0aebfbd548134a  /tmp/memtest

real    0m2.123s
user    0m1.640s
sys    0m0.481s
root@domU-<mumble>:~#


On Mon, Nov 9, 2009 at 4:54 PM, Patrick Hunt <ph...@apache.org> wrote:

> I'm really interested to know how ec2 compares wrt disk and network
> performance to what I've documented here under the "hardware" section:
> http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview#Hardware
>
> Is it possible for someone to compare the network and disk performance
> (scp, dd, md5sum, etc...) that I document in the wiki page on say, EC2
> small/large nodes? I'd do it myself but I've not used ec2. If anyone could
> try these and report I'd appreciate it.
>
> Patrick
>
>
> Ted Dunning wrote:
>
>> Worked pretty well for me.  We did extend all of our timeouts.  The biggest
>> worry for us was timeouts on the client side.  The ZK server side was no
>> problem in that respect.
>>
>> On Mon, Nov 9, 2009 at 4:20 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
>>
>>> Has anyone deployed ZK on EC2? What's the experience there? Are there more
>>> timeouts, leader re-election, etc? Thanks,
>>>
>>> Jun
>>> IBM Almaden Research Center
>>> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>>>
>>> junrao@almaden.ibm.com
>>>
>>
>>
>>
>>
>>


-- 
Ted Dunning, CTO
DeepDyve

Re: ZK on EC2

Posted by Patrick Hunt <ph...@apache.org>.
I'm really interested to know how ec2 compares wrt disk and network 
performance to what I've documented here under the "hardware" section:
http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview#Hardware

Is it possible for someone to compare the network and disk performance 
(scp, dd, md5sum, etc...) that I document in the wiki page on say, EC2 
small/large nodes? I'd do it myself but I've not used ec2. If anyone 
could try these and report I'd appreciate it.

Patrick

Ted Dunning wrote:
> Worked pretty well for me.  We did extend all of our timeouts.  The biggest
> worry for us was timeouts on the client side.  The ZK server side was no
> problem in that respect.
> 
> On Mon, Nov 9, 2009 at 4:20 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
> 
>> Has anyone deployed ZK on EC2? What's the experience there? Are there more
>> timeouts, leader re-election, etc? Thanks,
>>
>> Jun
>> IBM Almaden Research Center
>> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>>
>> junrao@almaden.ibm.com
> 
> 
> 
> 

Re: ZK on EC2

Posted by Ted Dunning <te...@gmail.com>.
10-30s at different times.  Not sure what the final numbers were.

On Mon, Nov 9, 2009 at 4:39 PM, Jun Rao <ju...@almaden.ibm.com> wrote:

> Thanks, Ted.
>
> How long did you set the client timeout?
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
>
>
> Ted Dunning <te...@gmail.com> wrote on 11/09/2009 04:24:16 PM:
>
> > Worked pretty well for me.  We did extend all of our timeouts.  The biggest
> > worry for us was timeouts on the client side.  The ZK server side was no
> > problem in that respect.
> >
> > On Mon, Nov 9, 2009 at 4:20 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
> >
> > > Has anyone deployed ZK on EC2? What's the experience there? Are there more
> > > timeouts, leader re-election, etc? Thanks,
> > >
> > > Jun
> > > IBM Almaden Research Center
> > > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> > >
> > > junrao@almaden.ibm.com
> >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
>



-- 
Ted Dunning, CTO
DeepDyve

Re: ZK on EC2

Posted by Jun Rao <ju...@almaden.ibm.com>.
Thanks, Ted.

How long did you set the client timeout?

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com


Ted Dunning <te...@gmail.com> wrote on 11/09/2009 04:24:16 PM:

> Worked pretty well for me.  We did extend all of our timeouts.  The biggest
> worry for us was timeouts on the client side.  The ZK server side was no
> problem in that respect.
>
> On Mon, Nov 9, 2009 at 4:20 PM, Jun Rao <ju...@almaden.ibm.com> wrote:
>
> > Has anyone deployed ZK on EC2? What's the experience there? Are there more
> > timeouts, leader re-election, etc? Thanks,
> >
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > junrao@almaden.ibm.com
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve

Re: ZK on EC2

Posted by Ted Dunning <te...@gmail.com>.
Worked pretty well for me.  We did extend all of our timeouts.  The biggest
worry for us was timeouts on the client side.  The ZK server side was no
problem in that respect.

On Mon, Nov 9, 2009 at 4:20 PM, Jun Rao <ju...@almaden.ibm.com> wrote:

> Has anyone deployed ZK on EC2? What's the experience there? Are there more
> timeouts, leader re-election, etc? Thanks,
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com




-- 
Ted Dunning, CTO
DeepDyve

ZK on EC2

Posted by Jun Rao <ju...@almaden.ibm.com>.
Has anyone deployed ZK on EC2? What's the experience there? Are there more
timeouts, leader re-election, etc? Thanks,

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com