
Posted to user@accumulo.apache.org by Arshak Navruzyan <ar...@gmail.com> on 2014/01/05 18:44:03 UTC

deployment architecture

Is there a document that describes best practices for Accumulo deployments?

In particular:

1.  Should you run Accumulo on HDFS data nodes and name nodes? (Is enabling
HDFS short-circuit local reads a good idea?)
2.  If so, do you disable map/reduce on nodes that run Accumulo tservers?
3.  Is auto-splitting (by size) done in the real world, or do most real apps
have pre-set split points?
4.  Do you let Accumulo decide when to flush and compact, or do people write
these into their apps (based on their knowledge of app behavior)?

I know the generic answer is "it all depends on your app/workload" but if
anyone wants to still describe their environment it would be helpful.

Thanks.

Re: deployment architecture

Posted by Arshak Navruzyan <ar...@gmail.com>.

Josh,

Thanks.  This is helpful.  One additional question: are the drops I am
seeing in the number of ingest entries/s inevitable once compaction kicks
in?  Or is this a side effect of my tiny 2-node test environment?  (In
other words, if I had hundreds of tservers, the compaction activity of a
handful of nodes wouldn't impact the overall ingest rate so severely.)

This, btw, is the result of the BatchWriter creating 50-byte random entries.

Arshak

[image: Inline image 1]
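
The cluster-size intuition in the question above can be put into a back-of-envelope model. This is only a sketch under loud assumptions: ingest is spread evenly across tservers, each tserver slows down independently while it compacts, and the client is not throttled by its slowest server (in practice a BatchWriter can be held back by one slow tserver, so real dips may be worse). The function names and the 50% slowdown figure are hypothetical, not measurements from this thread.

```python
def aggregate_ingest_rate(num_tservers, per_server_rate, compacting, slowdown):
    """Cluster-wide entries/s when `compacting` servers run at (1 - slowdown) speed.

    Assumes evenly spread ingest and independent servers (a simplification).
    """
    if not 0 <= compacting <= num_tservers:
        raise ValueError("compacting must be between 0 and num_tservers")
    if not 0.0 <= slowdown <= 1.0:
        raise ValueError("slowdown must be a fraction between 0 and 1")
    healthy = num_tservers - compacting
    slowed = compacting * per_server_rate * (1.0 - slowdown)
    return healthy * per_server_rate + slowed


def relative_dip(num_tservers, compacting, slowdown):
    """Fraction of total ingest lost while `compacting` servers are slowed."""
    return compacting * slowdown / num_tservers


# Hypothetical comparison: one busy server out of 2 versus five out of 200.
for servers, busy in ((2, 1), (200, 5)):
    dip = relative_dip(servers, busy, slowdown=0.5)
    print("{} tservers, {} compacting: {:.2%} dip".format(servers, busy, dip))
```

Under these assumptions the dip scales with the fraction of tservers that are compacting at once, which is why a 2-node test cluster would show it far more visibly than a large one.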


Re: deployment architecture

Posted by Arshak Navruzyan <ar...@gmail.com>.

BTW, I also found a good description of a "typical cluster" in the upcoming
O'Reilly Accumulo book: http://shop.oreilly.com/product/0636920032304.do





Re: deployment architecture

Posted by Josh Elser <jo...@gmail.com>.

On 1/5/14, 12:44 PM, Arshak Navruzyan wrote:
> Is there a document that describes best practices for Accumulo deployments?

I'm guessing the Accumulo user-manual[1] covers some of this, but I'm 
not positive.

> In particular:
>
> 1.  Should you run Accumulo on HD data nodes and name nodes? (Is
> enabling HDFS short-circuit local reads a good idea?)

Datanodes and tasktrackers/nodemanagers, yes. I wouldn't run it on the 
Namenode though.

> 2.  If so do you disable map/reduce for nodes that run Accumulo tservers?

Running M/R on the same nodes should be fine as long as you're deliberate
about resource allocation (make sure there are still physical resources
left for Accumulo), but be careful if you're running a heavy M/R load.
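
One way to make that resource check concrete is a simple per-node memory budget. The sketch below is illustrative only: every figure (64 GiB of RAM, the heap sizes, the 2 GiB per M/R task) is a hypothetical placeholder, and the function names are made up for this example. It only covers memory; CPU and disk I/O need the same kind of accounting.

```python
GIB = 1024 ** 3


def leftover_for_mapreduce(node_ram, os_reserve, datanode_heap,
                           tserver_heap, tserver_native_map):
    """Return the bytes left for M/R tasks after HDFS and Accumulo are budgeted."""
    reserved = os_reserve + datanode_heap + tserver_heap + tserver_native_map
    leftover = node_ram - reserved
    if leftover < 0:
        raise ValueError("node is over-committed before any M/R task runs")
    return leftover


def max_task_slots(leftover, per_task_heap):
    """How many M/R task JVMs of the given heap size fit in the leftover memory."""
    return leftover // per_task_heap


# Hypothetical node: substitute your own measurements.
leftover = leftover_for_mapreduce(
    node_ram=64 * GIB,
    os_reserve=4 * GIB,
    datanode_heap=2 * GIB,
    tserver_heap=8 * GIB,
    tserver_native_map=4 * GIB,  # in-memory map held outside the JVM heap
)
print(leftover // GIB, "GiB left for M/R;",
      max_task_slots(leftover, 2 * GIB), "task slots at 2 GiB each")
```

Capping the tasktracker/nodemanager at the computed number of slots is one way to keep a heavy M/R load from starving the tserver.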

> 3.  Is auto-splitting (by size) done in the real world or do most real
> apps have pre-set split points?

Adding some split points is probably always a good idea. Make sure each
tabletserver has at least a few tablets for your table. After that, you
can increase the split threshold (default is 1GB) for that table so you
get a good distribution of tablets/tservers for the amount of data you're
storing (100-200 tablets is a good target). The splits themselves
obviously depend on your data, though.
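
As a rough illustration of the two knobs above, here is a sketch that generates evenly spaced split points and a split threshold, then prints the matching Accumulo shell commands (`addsplits` and `config -s table.split.threshold=...`). It assumes row keys start with uniformly distributed hex digits, which will not be true for most real data; the table name `mytable`, the data size, and the helper names are hypothetical.

```python
def even_hex_splits(num_tablets, width=2):
    """Return num_tablets - 1 evenly spaced hex split points of `width` digits."""
    if num_tablets < 2:
        return []
    space = 16 ** width
    if num_tablets > space:
        raise ValueError("not enough distinct prefixes for that many tablets")
    return [format(i * space // num_tablets, "0{}x".format(width))
            for i in range(1, num_tablets)]


def split_threshold_bytes(total_bytes, target_tablets):
    """Rough threshold so the table settles near target_tablets tablets."""
    if target_tablets < 1:
        raise ValueError("target_tablets must be positive")
    return -(-total_bytes // target_tablets)  # ceiling division


def shell_commands(table, splits, threshold_bytes):
    """Build Accumulo shell commands; a bare number is read as bytes."""
    cmds = []
    if splits:
        cmds.append("addsplits -t {} {}".format(table, " ".join(splits)))
    cmds.append("config -t {} -s table.split.threshold={}".format(
        table, threshold_bytes))
    return cmds


# Hypothetical setup: 2 tservers with 4 initial tablets each, about 1 TiB of
# data expected, aiming at the upper end of the 100-200 tablet target.
splits = even_hex_splits(num_tablets=2 * 4)
threshold = split_threshold_bytes(total_bytes=1024 ** 4, target_tablets=200)
for cmd in shell_commands("mytable", splits, threshold):
    print(cmd)
```

Because a tablet splits once it grows past the threshold, the resulting tablet count is only approximate; treat the computed value as a starting point to adjust as the table grows.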

> 4.  Do you let Accumulo decide when to flush and compact or do people
> write these into their apps (based on their knowledge of app behavior)

Unless you have retention policies that strictly require data to be
physically removed from disk (as opposed to merely not being visible
through Accumulo's API), I can't think of a reason that you would have to
automate flush/compact. If you're doing data age-off (e.g. keeping N
months of data and rolling off the oldest day of data each day), it's
probably not a bad idea to run a range compaction on that old day to
clean it up before your users are hitting your system full swing.
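
A nightly age-off job along those lines might look like the sketch below, which works out the row range for the day that just fell out of retention and prints an Accumulo shell `compact` command for it. It assumes row keys begin with a `yyyyMMdd` date and that the old day's entries have already been deleted or are dropped by an age-off filter, since compaction only rewrites files. The table name `events`, the 90-day retention, and the function names are hypothetical.

```python
from datetime import date, timedelta


def expired_day_range(today, retention_days, fmt="%Y%m%d"):
    """Return (begin_row, end_row) spanning the day that just aged off.

    Rows prefixed with the expired date sort between the two values.
    """
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    expired = today - timedelta(days=retention_days)
    next_day = expired + timedelta(days=1)
    return expired.strftime(fmt), next_day.strftime(fmt)


def compact_command(table, begin_row, end_row, wait=True):
    """Build a shell command that compacts the tablets overlapping the range."""
    cmd = "compact -t {} -b {} -e {}".format(table, begin_row, end_row)
    return cmd + " -w" if wait else cmd


# Hypothetical nightly run, scheduled before users start hitting the system.
begin, end = expired_day_range(date.today(), retention_days=90)
print(compact_command("events", begin, end))
```

Compaction works on whole tablets, so any tablet that overlaps the range is rewritten even if it also holds newer rows; split points aligned on day boundaries keep that extra work small.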

> I know the generic answer is "it all depends on your app/workload" but
> if anyone wants to still describe their environment it would be helpful.
>
> Thanks.

[1] http://accumulo.apache.org/1.5/accumulo_user_manual.html