Posted to user@hbase.apache.org by Bryan Keller <br...@gmail.com> on 2012/08/02 08:31:30 UTC

Poor data locality of MR job

I have an 8 node cluster and a table that is pretty well balanced, with on average 36 regions/node. When I run a mapreduce job on the cluster against this table, the data locality of the mappers is poor, e.g. 100 rack-local mappers and only 188 data-local mappers. I would expect nearly all of the mappers to be data local. DNS appears to be fine, i.e. the hostname in the splits is the same as the hostnames in the task attempts.

The performance of the rack local mappers is poor and causes overall scan performance to suffer.

The table isn't new, and from what I understand, HDFS replication will eventually keep region data blocks local to the regionserver. Are there other reasons for data locality to be poor, and is there any way to fix it?


Re: Poor data locality of MR job

Posted by Alex Baranau <al...@gmail.com>.
My reply may be unrelated to the metric that shows "data local tasks" and
is more about data block locality (as J-D mentioned, that is a separate
thing and is not reflected in the mapper metrics), but it might also be
helpful for understanding the performance of feeding data from HBase into
the mappers with regard to data block locality.

> The regionservers have gone down on occassion
> but have been up for a while (weeks)

RS failures can break data (block) locality.

Given the way HDFS writes data (let's first assume no failures happened),
there will likely be one full replica of the blocks of a Region's HFiles
on the RS that the Region is assigned to, because the RS writes the first
replica of every block locally. The other replicas (if replication > 1)
are distributed randomly over the cluster, so there will likely be no
other server that holds a full replica of that specific Region's HFiles.

Now, imagine that this RS fails. After the Region is reassigned to a
different server, that server no longer serves only local data: there is
no full replica of the Region's data on it. And there's not much you can
do about it right away, since the remaining block replicas are distributed
randomly over the cluster.
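
If you want to check this for a particular region, here's a rough sketch
using the plain HDFS API (not an official tool; the region directory and
the hostname are placeholders you'd fill in from your own cluster and
HBase root dir layout). It counts how many of the region's HFile blocks
have a replica on the host currently serving the region:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RegionLocalityCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Placeholder region dir (under the HBase root dir) and the RS host serving it.
    Path regionDir = new Path("/hbase/mytable/<region-encoded-name>");
    String rsHost = "node3.example.com";

    long total = 0, local = 0;
    for (FileStatus family : fs.listStatus(regionDir)) {
      if (!family.isDir()) continue;                  // skip .regioninfo etc.
      for (FileStatus hfile : fs.listStatus(family.getPath())) {
        for (BlockLocation block :
             fs.getFileBlockLocations(hfile, 0, hfile.getLen())) {
          total++;
          for (String host : block.getHosts()) {
            if (host.startsWith(rsHost)) { local++; break; }
          }
        }
      }
    }
    System.out.println("blocks with a local replica: " + local + "/" + total);
  }
}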

A periodic major compaction, which rewrites the old files, helps to
recover block locality (until the next RS failure), since the newly
created files again get a full local replica. But note that it will not
rewrite all files: e.g. old (already major-compacted some time ago) HFiles
of regions that have not changed since the last major compaction will
remain as they are.
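
If you want to kick this off yourself, you can run major_compact 'mytable'
from the shell or use the Java client; here's a minimal sketch, with
"mytable" as a placeholder table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TriggerMajorCompaction {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Asynchronous request; the region servers rewrite the store files (and
    // the first replica of each new block lands locally) as they compact.
    admin.majorCompact("mytable");
    admin.close();
  }
}

Keep in mind a major compaction rewrites whole store files, so it generates
a fair amount of I/O; better to run it during a quiet period.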

I believe someone is currently working on making the replica placement
(a replica balancer) smarter. Hoping to see that work land soon :)

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

On Thu, Aug 2, 2012 at 5:56 PM, Bryan Keller <br...@gmail.com> wrote:

> I see what you mean about block locality, that is at the regionserver
> level, transparent to the MR job. This doesn't happen only to the final
> mappers, some of the early mappers are rack local. The table is reasonably
> well distributed across the nodes but not perfectly (that is a question I
> have in a different thread).
>
> Performance of the rack local mappers vs data local is roughly 2x slower,
> so the performance hit is significant.
>
> On Aug 2, 2012, at 11:37 AM, Jean-Daniel Cryans <jd...@apache.org>
> wrote:
>
> > On Wed, Aug 1, 2012 at 11:31 PM, Bryan Keller <br...@gmail.com> wrote:
> >> I have an 8 node cluster and a table that is pretty well balanced with
> on average 36 regions/node. When I run a mapreduce job on the cluster
> against this table, the data locality of the mappers is poor, e.g 100 rack
> local mappers and only 188 data local mappers. I would expect nearly all of
> the mappers to be data local. DNS appears to be fine, i.e. the hostname in
> the splits is the same as the hostnames in the task attempts.
> >
> > Thanks for looking at this already, it's the first thing that came in
> > mind when looking at the title.
> >
> >> The table isn't new and from what I understand, HDFS replication will
> eventually keep region data blocks local to the regionserver. Are there
> other reasons for data locality to be poor and any way to fix it?
> >
> > Block locality doesn't play a role here, TableInputFormat publishes
> > where the region is but then where the data belonging to that region
> > is is another matter that's not taken into account. In any case, even
> > if the region was on node A and all the data happened to be on node B,
> > C, and D, your mapper would still only talk to A since that's where
> > the region is.
> >
> > So only 1 region server serves a region meaning that there's only 1
> > node where you can send the map to in order to have data locality. I'm
> > ready to guess that the maps that aren't local are launched
> > towards the end of the job, because you might have slower maps and/or
> > not a perfect balance of regions per region server.
> >
> > For example, let's say one node is full of data-local maps but it
> > still has 2-3 more to process while other nodes have availability. The
> > JT has a locality timeout for each map so that if one node is just too
> > busy it will fall back to rack-local nodes instead. In this example
> > those 2-3 maps might get sent elsewhere.
> >
> > There are ways to tune this depending on which scheduler you are
> > using, but it will mostly involve waiting more for each task to make
> > sure they can get to the right node.
> >
> > At your scale it sounds more to me like over-optimizations. How big of
> > a hit are you taking?
> >
> > J-D
>
>


-- 
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

Re: Poor data locality of MR job

Posted by Bryan Keller <br...@gmail.com>.
I see what you mean about block locality; that is at the regionserver level and transparent to the MR job. This doesn't happen only to the final mappers, some of the early mappers are rack local. The table is reasonably well distributed across the nodes but not perfectly (that is a question I have in a different thread).

The rack-local mappers are roughly 2x slower than the data-local ones, so the performance hit is significant.

On Aug 2, 2012, at 11:37 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> On Wed, Aug 1, 2012 at 11:31 PM, Bryan Keller <br...@gmail.com> wrote:
>> I have an 8 node cluster and a table that is pretty well balanced with on average 36 regions/node. When I run a mapreduce job on the cluster against this table, the data locality of the mappers is poor, e.g 100 rack local mappers and only 188 data local mappers. I would expect nearly all of the mappers to be data local. DNS appears to be fine, i.e. the hostname in the splits is the same as the hostnames in the task attempts.
> 
> Thanks for looking at this already, it's the first thing that came in
> mind when looking at the title.
> 
>> The table isn't new and from what I understand, HDFS replication will eventually keep region data blocks local to the regionserver. Are there other reasons for data locality to be poor and any way to fix it?
> 
> Block locality doesn't play a role here, TableInputFormat publishes
> where the region is but then where the data belonging to that region
> is is another matter that's not taken into account. In any case, even
> if the region was on node A and all the data happened to be on node B,
> C, and D, your mapper would still only talk to A since that's where
> the region is.
> 
> So only 1 region server serves a region meaning that there's only 1
> node where you can send the map to in order to have data locality. I'm
> ready to guess that the maps that aren't local are launched
> towards the end of the job, because you might have slower maps and/or
> not a perfect balance of regions per region server.
> 
> For example, let's say one node is full of data-local maps but it
> still has 2-3 more to process while other nodes have availability. The
> JT has a locality timeout for each map so that if one node is just too
> busy it will fall back to rack-local nodes instead. In this example
> those 2-3 maps might get sent elsewhere.
> 
> There are ways to tune this depending on which scheduler you are
> using, but it will mostly involve waiting more for each task to make
> sure they can get to the right node.
> 
> At your scale it sounds more to me like over-optimizations. How big of
> a hit are you taking?
> 
> J-D


Re: Poor data locality of MR job

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Wed, Aug 1, 2012 at 11:31 PM, Bryan Keller <br...@gmail.com> wrote:
> I have an 8 node cluster and a table that is pretty well balanced with on average 36 regions/node. When I run a mapreduce job on the cluster against this table, the data locality of the mappers is poor, e.g 100 rack local mappers and only 188 data local mappers. I would expect nearly all of the mappers to be data local. DNS appears to be fine, i.e. the hostname in the splits is the same as the hostnames in the task attempts.

Thanks for looking at this already; it's the first thing that came to
mind when looking at the title.

> The table isn't new and from what I understand, HDFS replication will eventually keep region data blocks local to the regionserver. Are there other reasons for data locality to be poor and any way to fix it?

Block locality doesn't play a role here: TableInputFormat publishes
where the region is, but where the data belonging to that region actually
lives is another matter that's not taken into account. In any case, even
if the region were on node A and all the data happened to be on nodes B,
C, and D, your mapper would still only talk to A since that's where
the region is.
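
If you want to see the locations TableInputFormat will publish, you can
list each region and its hosting server; here's a quick sketch (API names
vary a bit between HBase versions, and "mytable" is a placeholder):

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HTable;

public class PrintRegionLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    // One entry per region: region -> server currently hosting it.
    for (Map.Entry<HRegionInfo, ServerName> e :
         table.getRegionLocations().entrySet()) {
      System.out.println(e.getKey().getRegionNameAsString()
          + " -> " + e.getValue().getHostname());
    }
    table.close();
  }
}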

So only one region server serves a region, meaning that there's only one
node where you can send the map to in order to have data locality. I'm
ready to guess that the maps that aren't local are launched
towards the end of the job, because you might have slower maps and/or
not a perfect balance of regions per region server.

For example, let's say one node is full of data-local maps but it
still has 2-3 more to process while other nodes have availability. The
JT has a locality timeout for each map so that if one node is just too
busy it will fall back to rack-local nodes instead. In this example
those 2-3 maps might get sent elsewhere.

There are ways to tune this depending on which scheduler you are
using, but it will mostly involve waiting longer for each task to make
sure it can get to the right node.
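
For example, with the Fair Scheduler on Hadoop 1.x the node-locality wait
is a JobTracker-side setting; the exact property name depends on your
Hadoop version and scheduler (check its docs), but it typically looks
something like this in mapred-site.xml:

<property>
  <!-- How long (ms) to hold a map hoping for a node-local slot before
       falling back to a rack-local one. -->
  <name>mapred.fairscheduler.locality.delay</name>
  <value>30000</value>
</property>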

At your scale it sounds more to me like over-optimization. How big of
a hit are you taking?

J-D

Re: Poor data locality of MR job

Posted by Bryan Keller <br...@gmail.com>.
I presplit the table. The regionservers have gone down on occasion but have been up for a while (weeks). How could that result in having no regions on one node?


On Aug 1, 2012, at 11:39 PM, Adrien Mogenet <ad...@gmail.com> wrote:

> Did you pre split your table or did you let balancer assign regions to
> regionservers for you ?
> 
> Did your regionserver(s) fail ?
> 
> On Thu, Aug 2, 2012 at 8:31 AM, Bryan Keller <br...@gmail.com> wrote:
> 
>> I have an 8 node cluster and a table that is pretty well balanced with on
>> average 36 regions/node. When I run a mapreduce job on the cluster against
>> this table, the data locality of the mappers is poor, e.g 100 rack local
>> mappers and only 188 data local mappers. I would expect nearly all of the
>> mappers to be data local. DNS appears to be fine, i.e. the hostname in the
>> splits is the same as the hostnames in the task attempts.
>> 
>> The performance of the rack local mappers is poor and causes overall scan
>> performance to suffer.
>> 
>> The table isn't new and from what I understand, HDFS replication will
>> eventually keep region data blocks local to the regionserver. Are there
>> other reasons for data locality to be poor and any way to fix it?
>> 
>> 
> 
> 
> -- 
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me


Re: Poor data locality of MR job

Posted by Adrien Mogenet <ad...@gmail.com>.
Did you pre-split your table, or did you let the balancer assign regions
to the regionservers for you?

Did your regionserver(s) fail?

On Thu, Aug 2, 2012 at 8:31 AM, Bryan Keller <br...@gmail.com> wrote:

> I have an 8 node cluster and a table that is pretty well balanced with on
> average 36 regions/node. When I run a mapreduce job on the cluster against
> this table, the data locality of the mappers is poor, e.g 100 rack local
> mappers and only 188 data local mappers. I would expect nearly all of the
> mappers to be data local. DNS appears to be fine, i.e. the hostname in the
> splits is the same as the hostnames in the task attempts.
>
> The performance of the rack local mappers is poor and causes overall scan
> performance to suffer.
>
> The table isn't new and from what I understand, HDFS replication will
> eventually keep region data blocks local to the regionserver. Are there
> other reasons for data locality to be poor and any way to fix it?
>
>


-- 
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me