Posted to user@hbase.apache.org by Felix Sprick <fs...@gmail.com> on 2011/05/17 14:44:43 UTC

M/R: Data-local vs Rack-local

Hi,

We have a setup with 4 regionservers and a replication factor of 3. We are
running MapReduce tasks using HBase as data source and sink. When running
MapReduce tasks over data stored on the 4 nodes we noticed that in the
statistics of a successfully completed job, the majority of the maps are
"rack-local" and not "data-local". In this particular case we had 48 maps
where 19 of them were data-local and 29 rack-local. I would have expected to
have the majority of them "data-local" as the data should be available on 3
out of 4 nodes due to the replication. Is this a configuration issue, or am I
thinking about this the wrong way?

thanks,
Felix

Re: M/R: Data-local vs Rack-local

Posted by Joey Echeverria <jo...@cloudera.com>.
When running MapReduce jobs against HBase, a map task is considered data-local only if it is scheduled on the region server serving the region it reads from. You have three replicas of the data at the HDFS level, but not at the HBase level: each region is served by a single region server at a time.
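
To make the distinction concrete, here is a minimal sketch in Python of how a map task would be labelled under that rule. It is a simplified model, not Hadoop's scheduler code: the `classify` function, the host names, and the task list are all made up for illustration, and the only assumption is that each input split advertises exactly one location, the region server hosting its region.

```python
from collections import Counter

def classify(task_host, split_host, rack_of):
    """Label a map task roughly the way the job counters do (simplified model)."""
    if task_host == split_host:
        return "data-local"
    if rack_of[task_host] == rack_of[split_host]:
        return "rack-local"
    return "off-rack"

# Hypothetical 4-node, single-rack cluster; host names are invented.
rack_of = {"node1": "rack1", "node2": "rack1", "node3": "rack1", "node4": "rack1"}

# (host the task ran on, region server hosting the region it reads) pairs.
# The split location is the one region server, not the three HDFS replica holders.
tasks = [
    ("node1", "node1"),
    ("node2", "node1"),
    ("node3", "node4"),
    ("node4", "node4"),
]

counts = Counter(classify(task, split, rack_of) for task, split in tasks)
print(counts["data-local"], counts["rack-local"])
```

Under this model, a task placed on a uniformly random node of the 4 would match its single split location about 1 time in 4 (roughly 12 of 48 maps), whereas it would match about 3 times in 4 if any of the three HDFS replica holders counted as local. A real scheduler is not random, so treat these figures only as a rough intuition for why rack-local maps can dominate.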

-Joey
