Posted to user@hbase.apache.org by stack <st...@duboce.net> on 2010/01/04 22:50:18 UTC

Re: hadoop dfs.replication parameter and hbase/performance for random/scanner access

On Mon, Jan 4, 2010 at 9:39 AM, TuX RaceR <tu...@gmail.com> wrote:

> ...
> I am trying to gather information on how to improve the performance of
> the two access modes.
>
> I would expect that mode a) performance does not really depend on the
> number of replicas in HDFS
> but that mode b) speed depends on the number of replicas in HDFS. It has
> been said previously that random read accesses are limited by the
> performance of the disks.
> Can I artificially boost standard disks by adding more replicas to improve
> random reads?
>
>
The amount of replication should have no effect on either access mode.
Whether scanning or random-accessing, only one of the N replicas is
accessed.  We'll only go to the other replicas if there is trouble accessing
the first.

So, more replicas will not change the performance profile.
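
A minimal sketch of the read behaviour described above, assuming a simple
"try the first replica, fall back only on failure" model. The Replica class
and read_block function are hypothetical illustrations, not HDFS or HBase
APIs.

```python
# Illustrative model only: "read one replica, fall back on failure".
# Replica and read_block are hypothetical names, not HDFS or HBase APIs.
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    host: str
    healthy: bool = True
    reads: int = 0  # number of reads this replica has actually served

    def read(self, block_id: str) -> str:
        if not self.healthy:
            raise IOError(f"cannot read {block_id} from {self.host}")
        self.reads += 1
        return f"{block_id}@{self.host}"


def read_block(block_id: str, replicas: List[Replica]) -> str:
    """Try replicas in order; move to the next one only if a read fails."""
    last_error = None
    for replica in replicas:
        try:
            return replica.read(block_id)
        except IOError as err:
            last_error = err  # trouble with this replica, try the next
    raise IOError(f"all replicas failed for {block_id}") from last_error


replicas = [Replica("dn1"), Replica("dn2"), Replica("dn3")]
for _ in range(5):
    read_block("blk_1", replicas)

# Only the first replica served the reads; the other two were never touched.
print([r.reads for r in replicas])
```

In this model the extra replicas add availability, not read throughput: the
second and third copies are touched only after the first one fails.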

What do you need to improve?  Are both scans and random-reads slow for you?
You've seen the performance page up on the wiki (I'm sure you have).
Nothing there helps?

St.Ack

Re: hadoop dfs.replication parameter and hbase/performance for random/scanner access

Posted by TuX RaceR <tu...@gmail.com>.

Jean-Daniel Cryans wrote:
> With HBase you have to consider that the region servers have regions
> that have blocks that can be located on many Datanodes, most of the
> time the local one. HBase doesn't serve the same data from more than 1
> region server, instead it applies horizontal partitioning
> automatically on your table.
>   
OK, that's clear now. Thanks, Jean-Daniel, for your answer. This list is
really cool, as committers are really available to answer users' questions ;)

Re: hadoop dfs.replication parameter and hbase/performance for random/scanner access

Posted by Jean-Daniel Cryans <jd...@apache.org>.

> I am not sure if HBase or Hadoop is responsible for choosing the location of
> the replicas. Having more replicas may not avoid the disk random-read
> limitations, but it should probably avoid network latency?
> If I have a web application with N clients accessing HBase, and one of
> those clients has to get the value for a key, it should be faster to access
> it if the value for that key is stored on that node (as we avoid a network
> call)? But you are right, it does not seem I can get around the disk
> random-read performance limitations.

The Namenode chooses the replica locations, always starting with the
local Datanode if one exists. It will be faster for HBase to fetch a
block from a local Datanode, that is true. If the client runs on the
same machine as the RegionServer and the Datanode that holds the block
containing your key, you will probably save some more trips, but it's
not what you want to do (you don't want the client competing with the DB).
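
To illustrate the "local Datanode first" rule, here is a simplified sketch.
The function choose_replica_hosts and the host names are hypothetical; the
real Namenode placement logic also takes racks and node load into account,
which this sketch ignores.

```python
# Simplified model of "place the first replica on the local Datanode if one
# exists". choose_replica_hosts is a hypothetical helper, not a Hadoop API;
# real placement also considers racks and load, which are ignored here.
from typing import List


def choose_replica_hosts(writer_host: str,
                         datanodes: List[str],
                         replication: int) -> List[str]:
    targets: List[str] = []
    # First choice: the Datanode running on the writer's own machine.
    if writer_host in datanodes:
        targets.append(writer_host)
    # Remaining replicas go to other Datanodes (list order, for determinism).
    for host in datanodes:
        if len(targets) >= replication:
            break
        if host not in targets:
            targets.append(host)
    return targets[:replication]


datanodes = ["dn1", "dn2", "dn3", "dn4"]
# A RegionServer co-located with dn2 gets a local first replica.
print(choose_replica_hosts("dn2", datanodes, 3))
# A writer with no local Datanode gets only remote replicas.
print(choose_replica_hosts("webapp1", datanodes, 2))
```

Under this model a RegionServer that wrote a block usually reads it back from
its own machine, whatever the replication factor is.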

>
> Unfortunately I am not in a position to really benchmark my application as I
> currently can't run it on a true cluster (using a cluster of virtual
> machines would lead to obviously wrong results ;). At this stage I am just
> trying to understand how hbase/hadoop works to avoid big mistakes in the
> design of the architecture. My application currently runs in production on a
> postgresql database: I replicate it over several nodes and read access
> performs better when I have more replicas because each node connects to a
> local database.

With HBase you have to consider that the region servers have regions
that have blocks that can be located on many Datanodes, most of the
time the local one. HBase doesn't serve the same data from more than one
region server; instead, it applies horizontal partitioning
automatically on your table.
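
A rough sketch of what horizontal partitioning means for a lookup, assuming
a table split into three regions by start key. The split points, server
names, and the region_server_for function are made up for illustration, and
Python string comparison stands in for HBase's byte-wise row key ordering.

```python
# Hypothetical example: each row key falls into exactly one region, and each
# region is served by exactly one region server. Split points and server
# names are invented; string comparison stands in for byte-wise ordering.
from bisect import bisect_right

# Sorted region start keys; "" is the start key of the first region.
REGION_START_KEYS = ["", "g", "p"]
# REGION_SERVERS[i] is the single server currently serving region i.
REGION_SERVERS = ["rs1", "rs2", "rs3"]


def region_server_for(row_key: str) -> str:
    """Return the one region server responsible for row_key."""
    # The owning region is the last one whose start key is <= row_key.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


for key in ["apple", "g", "monkey", "zebra"]:
    print(key, "->", region_server_for(key))
```

Every read for a given key lands on the same region server, so adding HDFS
replicas does not spread those reads over more servers.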

>
> Thanks
> TuX
>
>

Re: hadoop dfs.replication parameter and hbase/performance for random/scanner access

Posted by TuX RaceR <tu...@gmail.com>.

Thanks a lot, St.Ack, for the time you spend answering user questions and
for developing this nice piece of software (HBase).

stack wrote:
> The amount of replication should have no effect on either access mode.
> Whether scanning or random-accessing, only one of the N replicas is
> accessed.  We'll only go to the other replicas if there is trouble accessing
> the first.
> So, more replicas will not change the performance profile.
>   
I am not sure if HBase or Hadoop is responsible for choosing the
location of the replicas. Having more replicas may not avoid the disk
random-read limitations, but it should probably avoid network latency?
If I have a web application with N clients accessing HBase, and one of
those clients has to get the value for a key, it should be faster to
access it if the value for that key is stored on that node (as we avoid
a network call)? But you are right, it does not seem I can get around the
disk random-read performance limitations.
> What do you need to improve?  Are both scans and random-reads slow for you?
>   You've seen the performance page up on the wiki (I'm sure you have).
>   
Unfortunately I am not in a position to really benchmark my application 
as I currently can't run it on a true cluster (using a cluster of 
virtual machines would lead to obviously wrong results ;). At this stage 
I am just trying to understand how hbase/hadoop works to avoid big 
mistakes in the design of the architecture. My application currently 
runs in production on a postgresql database: I replicate it over several 
nodes and read access performs better when I have more replicas because 
each node connects to a local database.

Thanks
TuX