Posted to dev@accumulo.apache.org by Aaron Cordova <aa...@cordovas.org> on 2012/02/24 20:35:42 UTC

a model for accumulo write scaling performance

In my experience with Accumulo on EC2, I've seen about an 85% increase in aggregate write rate each time the size of the cluster is doubled. I've tried to capture that behavior in a model to help myself understand it.

The model I came up with is the following:


where 
	w: aggregate write rate (writes per second)
	m: number of machines
	k: standalone single server performance (in my experience about 30k writes per second on average)

the units of k and w are writes per second

for those of you without the ability to see graphics in email, the model is:
	
	w = m * pow(0.85, log(m, 2)) * k

First of all, my algebra may be rusty, so it may be possible to simplify the model ... second, does the model make sense? 
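
On the simplification question: since a^(log2 m) = m^(log2 a), the model as written collapses to a single power law, w = k * m^(1 + log2(0.85)), which is roughly k * m^0.766. One thing worth checking against the measurements: as written, each doubling multiplies w by 2 * 0.85 = 1.7, so 0.85 acts as a per-doubling efficiency. If a literal 85% increase per doubling is meant, the factor would be 1.85 instead, giving an exponent of log2(1.85), roughly 0.89. A minimal Python sketch of both forms follows; the function names and the `efficiency` parameter are illustrative assumptions, not anything from Accumulo.

```python
import math

def write_rate(m, k=30_000, efficiency=0.85):
    """Model as posted: w = m * efficiency**log2(m) * k.

    m: number of machines, k: single-server writes per second.
    """
    return m * efficiency ** math.log2(m) * k

def write_rate_simplified(m, k=30_000, efficiency=0.85):
    """Same model as one power law: w = k * m**(1 + log2(efficiency))."""
    return k * m ** (1 + math.log2(efficiency))

if __name__ == "__main__":
    # Print the predicted aggregate rate for a few cluster sizes.
    for m in (1, 2, 4, 8, 16, 32, 64):
        print(f"{m:3d} machines: {write_rate(m):12,.0f} writes/s")
```

With k = 30k, the model predicts about 51,000 writes per second for 2 machines, 86,700 for 4, and 147,390 for 8.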

Re: a model for accumulo write scaling performance

Posted by Keith Turner <ke...@deenlo.com>.
What are the characteristics of the data you are writing?  Does each client
generate data that spreads across the cluster?

What version of Accumulo are you using?  1.5 has two write-ahead log (walog)
improvements that should help as a cluster grows: group commit, and writing
to the two logs in parallel.  In 1.4, when a batch of data comes in from a
client, the walog is locked and the data is then written to the two logs
serially.
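
To make the difference concrete, here is a toy arithmetic sketch in Python. It is not Accumulo code: the function names and the latencies and batch counts in the usage note are hypothetical, and it only illustrates why serial log writes cost the sum of the two latencies, parallel writes cost only the slower one, and group commit reduces the number of syncs.

```python
def serial_log_time(t_log_a, t_log_b):
    """Batch is written to one log, then the other: the latencies add up."""
    return t_log_a + t_log_b

def parallel_log_time(t_log_a, t_log_b):
    """Both log writes are issued together: the batch waits for the slower one."""
    return max(t_log_a, t_log_b)

def syncs_needed(num_batches, group_size=1):
    """Syncs required when up to group_size waiting batches share one sync."""
    return -(-num_batches // group_size)  # ceiling division
```

For example, with assumed log latencies of 4 ms and 6 ms, a batch waits 10 ms when the writes are serial and 6 ms when they are parallel; 100 batches grouped up to 8 at a time need 13 syncs rather than 100.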

On Fri, Feb 24, 2012 at 2:35 PM, Aaron Cordova <aa...@cordovas.org> wrote:

> In my experience with Accumulo on EC2, I've seen about an 85% increase in
> aggregate write rate each time the size of the cluster is doubled. I've
> tried to capture that behavior in a model to help myself understand it.
>
> The model I came up with is the following:
>
> where
> w: aggregate write rate (writes per second)
> m: number of machines
> k: standalone single server performance (in my experience about 30k writes
> per second on average)
>
> the units of k and w are writes per second
>
> for those of you without the ability to see graphics in email, the model
> is:
>  w = m * pow(0.85, log(m, 2)) * k
>
> First of all, my algebra may be rusty, so it may be possible to simplify
> the model ... second, does the model make sense?
>

RE: a model for accumulo write scaling performance

Posted by Dave Marion <dl...@comcast.net>.
That may be a good metric for your workload on EC2 virtualized hardware at
different scales, and it could be useful for regression testing different
versions of Hadoop + Accumulo. Workload and hardware differences could
certainly lead to a different model.

 

From: Aaron Cordova [mailto:aaron@cordovas.org] 
Sent: Friday, February 24, 2012 2:36 PM
To: accumulo-dev@incubator.apache.org
Subject: a model for accumulo write scaling performance

In my experience with Accumulo on EC2, I've seen about an 85% increase in
aggregate write rate each time the size of the cluster is doubled. I've
tried to capture that behavior in a model to help myself understand it.

The model I came up with is the following:

where
	w: aggregate write rate (writes per second)
	m: number of machines
	k: standalone single server performance (in my experience about 30k writes per second on average)

the units of k and w are writes per second

for those of you without the ability to see graphics in email, the model is:

	w = m * pow(0.85, log(m, 2)) * k

First of all, my algebra may be rusty, so it may be possible to simplify the
model ... second, does the model make sense?


Re: a model for accumulo write scaling performance

Posted by Clint Green <cl...@gmail.com>.
Which EC2 instance types are you using for this?

Are you seeing bottlenecks in the network on this scale-out?

How many nodes have you used to demonstrate this behavior?

On Fri, Feb 24, 2012 at 2:35 PM, Aaron Cordova <aa...@cordovas.org> wrote:

> In my experience with Accumulo on EC2, I've seen about an 85% increase in
> aggregate write rate each time the size of the cluster is doubled. I've
> tried to capture that behavior in a model to help myself understand it.
>
> The model I came up with is the following:
>
> where
> w: aggregate write rate (writes per second)
> m: number of machines
> k: standalone single server performance (in my experience about 30k writes
> per second on average)
>
> the units of k and w are writes per second
>
> for those of you without the ability to see graphics in email, the model
> is:
>  w = m * pow(0.85, log(m, 2)) * k
>
> First of all, my algebra may be rusty, so it may be possible to simplify
> the model ... second, does the model make sense?
>