Posted to user@accumulo.apache.org by Yamini Joshi <ya...@gmail.com> on 2016/11/10 16:27:25 UTC

HDFS Replication of data

Hello all

Does HDFS replication improve the performance of queries on Accumulo, or is
it transparent to the Accumulo system? If it does improve performance
through some notion of load balancing, is there a read-only or write-only
copy of the data on HDFS for Accumulo?

Best regards,
Yamini Joshi

Re: HDFS Replication of data

Posted by Josh Elser <jo...@gmail.com>.
An increased number of replicas is unlikely to improve read performance
(unless the number of replicas approaches the number of datanodes, which
is infeasible except for very, very small instances).
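
To put a rough number on that, here is a minimal Python sketch. It assumes
(as a simplification, not as a description of the real HDFS placement
policy, which is rack-aware) that each block's replicas sit on distinct
datanodes chosen uniformly at random, and that the Tablet is hosted on a
node unrelated to where its files were written. The function name is
hypothetical.

```python
from fractions import Fraction


def local_replica_probability(replicas: int, datanodes: int) -> Fraction:
    """Chance that one given node holds a replica of one given block.

    Assumption: replicas are placed on distinct datanodes picked uniformly
    at random, so the answer is simply replicas / datanodes (capped at 1).
    """
    if replicas <= 0 or datanodes <= 0:
        raise ValueError("replicas and datanodes must be positive")
    return Fraction(min(replicas, datanodes), datanodes)


if __name__ == "__main__":
    # Compare a large cluster with a very small one at several replica counts.
    for datanodes in (100, 5):
        for replicas in (3, 5):
            p = local_replica_probability(replicas, datanodes)
            print(f"{replicas} replicas on {datanodes} datanodes: "
                  f"P(local copy) = {float(p):.2f}")
```

Under this model, going from 3 to 5 replicas on a 100-node cluster barely
changes the odds of a local read, while on a 5-node cluster it makes every
read local, which is the "very small instances" case.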

Given Accumulo's lax policy of Tablet placement with respect to HDFS block
location, the only benefit of additional replicas is rack-local or
node-local network communication instead of cross-rack communication. How
much that matters depends heavily on the network bandwidth between the
nodes and racks in your system.

Accumulo tries to keep Tablets assigned to the same TabletServer under 
the assumption that there should be a local copy of all blocks for the 
files a Tablet references. However, once a TabletServer dies or the HDFS 
balancer is run, there's likely zero HDFS block locality until the files 
for the Tablet are compacted.


Re: HDFS Replication of data

Posted by Christopher <ct...@apache.org>.
HDFS replication is transparent to Accumulo (though the number of replicas
is configurable in Accumulo on a per-table basis). Its primary purpose is
failure tolerance, but it *may* have an impact on read performance. I'm not
certain how significant that is, though.

There are no separate read-only and write-only copies of data on HDFS. HDFS
replication is at the block level, and files are updated by appending new
blocks to the files. All blocks are readable, and only new blocks are
written.
