Posted to user@hbase.apache.org by Jean-Daniel Cryans <jd...@apache.org> on 2010/01/12 22:45:03 UTC

Re: how to load big files into Hbase without crashing?

Michael,

This question should be addressed to the hbase-user mailing list, as it
is strictly about HBase's usage of MapReduce; the framework itself
doesn't have any knowledge of how the region servers are configured. I
CC'd it.

Uploading into an empty table is always a problem, as you saw, since
there's no load distribution. I would recommend writing directly into
HFiles instead, as documented here:
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
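
For illustration, a driver for that HFile path might look roughly like
the sketch below. This is only a sketch, not the exact recipe from the
package doc linked above: the class names, the column family/qualifier
("cf"/"val"), and the single-reducer setup are assumptions, and for a
real 15GB load you would want several reducers with a total-order
partition so rows still reach each HFile writer in sorted order.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {  // hypothetical class name

  // Hypothetical mapper: assumes tab-separated "rowkey<TAB>value" lines
  // and writes into a made-up column family "cf", qualifier "val".
  public static class TextToKeyValueMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
          Bytes.toBytes("val"), Bytes.toBytes(parts[1]));
      context.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();  // 0.20-era constructor
    Job job = new Job(conf, "bulk load into HFiles");
    job.setJarByClass(BulkLoadDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(TextToKeyValueMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    // Writes HFiles under the output dir instead of calling region servers.
    job.setOutputFormatClass(HFileOutputFormat.class);
    // Default single identity reducer: the shuffle delivers rows in sorted
    // order, which the HFile writer requires. Slow for 15GB, but simple.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once the job finishes, the generated HFiles still have to be handed to
HBase; the package doc above describes that loading step for 0.20.2.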

Other useful information for us: hbase/hadoop versions, hardware used,
optimizations used to do the insert, configuration files.

Thx,

J-D

On Tue, Jan 12, 2010 at 1:35 PM, Clements, Michael
<Mi...@disney.com> wrote:
> This leads to one quick & easy question: how does one reduce the number
> of map tasks for a job? My goal is to limit the # of Map tasks so they
> don't overwhelm the HBase region servers.
>
> The Docs point in several directions.
>
> There's a method job.setNumReduceTasks(), but no setNumMapTasks().
>
> There is a job Configuration setting setNumMapTasks(), but it's
> deprecated and the docs say it can only increase, not reduce, the number
> of tasks.
>
> There's InputFormat and its subclasses, which do the actual file splits,
> but there is no single method to simply set the number of splits. One
> would have to write a custom subclass that measures the total size of all
> input files, divides by the desired # of mappers, and splits it all up.
>
> The last option is not trivial but it is doable. Before I jump in I
> figured I'd ask if there is an easier way.
>
> Thanks
>
> -----Original Message-----
> From:
> mapreduce-user-return-267-Michael.Clements=disney.com@hadoop.apache.org
> [mailto:mapreduce-user-return-267-Michael.Clements=disney.com@hadoop.apache.org] On Behalf Of Clements, Michael
> Sent: Tuesday, January 12, 2010 10:53 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: how to load big files into Hbase without crashing?
>
> I have a 15-node Hadoop cluster that works for most jobs. But every
> time I upload large data files into HBase, the job fails.
>
> I surmise that this file (15GB in size) is big enough that it spawns so
> many tasks (about 55 at once) that they swamp the region server processes.
>
> Each cluster node is also an HBase region server, so there is a minimum
> of about 4 tasks per region server. But when the table is small, there
> are few regions, so each region server is serving many more tasks. For
> example, if the table starts out empty there is a single region, so a
> single region server has to handle calls from all 55 tasks. It can't
> handle this; the tasks give up and the job fails.
>
> This is just conjecture on my part. Does it sound reasonable?
>
> If so, what methods are there to prevent this? Limiting the number of
> tasks for the upload job is one obvious solution, but what is a good
> limit? The more general question is, how many map tasks can a typical
> region server support?
>
> Limiting the number of tasks is tedious and error-prone, as it requires
> somebody to look at the HBase table, see how many regions it has, on
> which servers, and manually configure the job accordingly. If the job is
> big enough, then the number of regions will grow during the job and the
> initial task counts won't be ideal anymore.
>
> Ideally, the Hadoop framework would be smart enough to look at how many
> regions & region servers exist and dynamically allocate a reasonable
> number of tasks.
>
> Does the community have any knowledge or techniques to handle this?
>
> Thanks
>
> Michael Clements
> Solutions Architect
> michael.clements@disney.com
> 206 664-4374 office
> 360 317 5051 mobile
>
>
>

RE: how to load big files into Hbase without crashing?

Posted by "Clements, Michael" <Mi...@disney.com>.
After some investigation I find that this feature (a cap on the maximum
number of tasks in a job) is upcoming. It looks like Kevin and Matei wrote it:
http://issues.apache.org/jira/browse/MAPREDUCE-698

We'll create a Fair Scheduler pool used exclusively for uploading data
to HBase. In this pool we'll cap the max tasks at 1 per server. Since
each server is an HBase region server, this works out to 1 task per
region server, which in our tests gives the best performance for tasks
adding rows to HBase.

Currently (version 0.20) there is no pool setting that can cap the max
tasks. The above issue adds this new config setting. It's scheduled for
version 0.22.
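
For what it's worth, once that setting exists the cap is expected to
live in the fair scheduler allocation file as a per-pool limit. A sketch
of what we have in mind (the pool name is ours, and the maxMaps/maxReduces
element names are the ones proposed in MAPREDUCE-698, so check the fair
scheduler docs for whatever version ships it):

<?xml version="1.0"?>
<allocations>
  <!-- Pool used only for jobs that write into HBase. -->
  <pool name="hbase-upload">
    <!-- One concurrent map per node, i.e. per region server, on our
         15-node cluster. -->
    <maxMaps>15</maxMaps>
    <maxReduces>15</maxReduces>
  </pool>
</allocations>

Jobs would then be submitted into that pool through whatever jobconf
property mapred.fairscheduler.poolnameproperty points at on the cluster.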

Incidentally, from the JIRA comments it looks like there was a lot of
discussion and churn on this issue over a long time. I'm pleased to see
a good solution came out of it.

- Mike


RE: how to load big files into Hbase without crashing?

Posted by "Clements, Michael" <Mi...@disney.com>.
It's true that my particular case is specific to HBase. But there is also
a more general question of how to set the # of mappers for a particular
job. There may be reasons other than HBase to do this. For example, a
job may need to be a singleton per machine due to resources it uses,
statics, etc.

In my case it's reading directly from HDFS and writing to HBase. So
whatever solution can limit the number of mappers is likely applicable
to any map-reduce job. That is, the solution is not necessarily
specific to HBase.

If Hadoop supported an easy way to set the number of mappers for a
particular job (as it already does with reducers), it would solve *all*
variants of this problem, not just mine, which happens to be related to
HBase.
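
One partial workaround with the new API is to raise the minimum split
size, so FileInputFormat produces fewer, larger splits and therefore
fewer map tasks. A rough sketch (the class name and the 1GB figure are
only examples; the right value depends on the input size and the number
of region servers):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {  // hypothetical class name
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "upload with fewer mappers");
    // Fewer, larger splits mean fewer concurrent map tasks hitting HBase.
    // With roughly 15GB of input, a 1GB minimum split size gives on the
    // order of 15 map tasks instead of ~55.
    FileInputFormat.setMinInputSplitSize(job, 1024L * 1024 * 1024);
    // Equivalent raw property in 0.20: mapred.min.split.size
    // ... mapper, input/output paths and the rest of the job setup as usual.
  }
}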

-Mike
