Posted to common-user@hadoop.apache.org by Chris Anderson <jc...@grabb.it> on 2008/06/26 03:59:53 UTC

process limits for streaming jar

Hi there,

I'm running some streaming jobs on ec2 (ruby parsing scripts) and in
my most recent test I managed to spike the load on my large instances
to 25 or so. As a result, I lost communication with one instance. I
think I took down sshd. Whoops.

My question is, has anyone got strategies for managing resources used
by the processes spawned by streaming jar? Ideally I'd like to run my
ruby scripts under nice.
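
To make that concrete, the kind of wrapper I have in mind would be
something like this (untested sketch - "mapper.rb" stands in for my real
script, and the wrapper name is made up):

  #!/bin/sh
  # nice_mapper.sh - hypothetical wrapper around the real streaming mapper.
  # Cap virtual memory at ~1 GB (inherited by children, enforced per
  # process), then run the ruby script at reduced CPU priority.
  ulimit -v 1048576
  exec nice -n 10 ruby mapper.rb "$@"

which I'd ship with -file and pass to the streaming jar as -mapper
nice_mapper.sh.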

I can hack something together with wrappers, but I'm thinking there
might be a configuration option to handle this within Streaming jar.
Thanks for any suggestions!

-- 
Chris Anderson
http://jchris.mfdz.com

Re: process limits for streaming jar

Posted by Rick Cox <ri...@gmail.com>.
On Fri, Jun 27, 2008 at 08:57, Chris Anderson <jc...@grabb.it> wrote:

> The problem is that when there are a large number of map tasks to
> complete, Hadoop doesn't seem to obey the map.tasks.maximum. Instead,
> it is spawning 8 map tasks per tasktracker (even when I change the
> mapred.tasktracker.map.tasks.maximum in hadoop-site.xml to 2, on the
> master). The cluster was booted with the setting at 8. Do I need to
> change hadoop-site.xml on all the slaves, and restart the task
> trackers, in order to make the limit apply? That seems unlikely - I'd
> really like to manage this parameter on a per-job level.
>

Yes, mapred.tasktracker.map.tasks.maximum is configured per
tasktracker on startup. It can't be configured per job because it's
not a job-scope parameter (if there are multiple concurrent jobs, they
have to share the task limit).
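
Concretely (just a sketch - pick whatever value suits your nodes), each
slave's hadoop-site.xml needs something like:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>

followed by a tasktracker restart (e.g. bin/stop-mapred.sh then
bin/start-mapred.sh from the master, if you're using the stock scripts).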

rick

Re: process limits for streaming jar

Posted by Chris Anderson <jc...@grabb.it>.
Having experimented some more, I've found that the simple solution is
to limit the resource usage by limiting the # of map tasks and the
memory they are allowed to consume.

I'm specifying the constraints on the command line like this:

-jobconf mapred.tasktracker.map.tasks.maximum=2 -jobconf mapred.child.ulimit=1048576
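
(For reference, the full streaming invocation looks roughly like this -
the jar path, input/output paths, and script names are placeholders for
my real ones:)

  hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -input /user/chris/input \
    -output /user/chris/output \
    -mapper mapper.rb \
    -file mapper.rb \
    -jobconf mapred.tasktracker.map.tasks.maximum=2 \
    -jobconf mapred.child.ulimit=1048576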

The configuration parameters seem to take effect; in the job.xml
available from the web console, I see these lines:

mapred.child.ulimit	1048576
mapred.tasktracker.map.tasks.maximum	2

The problem is that when there are a large number of map tasks to
complete, Hadoop doesn't seem to obey the map.tasks.maximum. Instead,
it is spawning 8 map tasks per tasktracker (even when I change the
mapred.tasktracker.map.tasks.maximum in hadoop-site.xml to 2, on the
master). The cluster was booted with the setting at 8. Do I need to
change hadoop-site.xml on all the slaves, and restart the task
trackers, in order to make the limit apply? That seems unlikely - I'd
really like to manage this parameter on a per-job level.

Thanks for any input!

Chris

-- 
Chris Anderson
http://jchris.mfdz.com

Re: process limits for streaming jar

Posted by Vinod KV <vi...@yahoo-inc.com>.
Allen Wittenauer wrote:
> This is essentially what we're doing via torque (and therefore hod).
>

As we move away from HOD (and thus Torque) to the Hadoop resource
manager (HADOOP-3421) and scheduler (HADOOP-3412) interfaces, we will
need to move this resource management functionality into Hadoop.

> As I said in HADOOP-3280, I'm still convinced that having Hadoop set
> these types of restrictions directly is the wrong approach. It is almost
> always going to be better to use the OS controls that are specific to the
> installation. Enabling OS-specific features means that, at a maximum,
> hadoop should likely be calling a script rather than doing ulimits or
> having the equivalent of #ifdef code everywhere. [After all, what if I want
> to use Solaris-specific features like projects or privileges?]
>
> But I suspect most of the tunables that people will care about can
> likely be managed at the OS level before hadoop is even involved.

Please see HADOOP-3581 (Prevent memory intensive user tasks from taking
down nodes). This issue is aimed at a general solution for putting
aggregate memory limits on tasks and any subprocesses they might launch.
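
Just to make the scope concrete (a rough, Linux-only illustration, not
the design proposed in HADOOP-3581): any such limit has to be enforced
over a task's whole process tree, i.e. something along the lines of

  #!/bin/sh
  # Illustration only: report the total virtual memory (KB) used by a
  # process and all of its descendants, using GNU ps on Linux.
  vmem_tree() {
    total=$(ps -o vsz= -p "$1" 2>/dev/null || echo 0)
    for child in $(ps -o pid= --ppid "$1" 2>/dev/null); do
      total=$((total + $(vmem_tree "$child")))
    done
    echo "$total"
  }
  vmem_tree "$1"

run periodically against each task's pid, with the whole tree killed
when the total crosses the configured limit.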

Your comments don't seem to be in line with what I proposed on that
JIRA issue. Having Hadoop just call a script, rather than doing ulimits
itself, does look nice, but with ulimits alone Hadoop cannot have
complete control over what the tasks do. For example, as stated on the
JIRA, runaway tasks that fork themselves repeatedly could wreak havoc,
disturbing the normal functioning of the Hadoop daemons and potentially
bringing down the nodes themselves.

The current solution, HADOOP-3280, doesn't prevent this: it only limits
the memory usable by a single process, not by its subprocesses. Nor will
ulimits via limits.conf suffice - as far as I can tell, they only limit
the virtual memory usable per process, per user.
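
(Roughly speaking, an entry like the one below is all limits.conf can
express - a per-process address-space cap, in KB, for processes owned
by a given user; it says nothing about the aggregate across a forking
task's process tree:)

  # /etc/security/limits.conf (illustration)
  hadoop    hard    as    1048576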

Not to sound like I'm imitating Torque too much, but even Torque
resorts to controlling the process tree itself rather than depending
directly on OS-specific tools - it does this with code specific to each
platform.

We need consensus on all of this, and I might be missing something too.
Can you please comment on HADOOP-3581?

Thanks,
-Vinod

Re: process limits for streaming jar

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 6/26/08 10:14 AM, "Joydeep Sen Sarma" <js...@facebook.com> wrote:
> However - in our environment - we spawn all streaming tasks through a
> wrapper program (with the wrapper being defaulted across all users). We
> can control resource usage from this wrapper (and also do some level of
> job control).

    This is essentially what we're doing via torque (and therefore hod).

> This does require some minor code changes, and we would be happy to
> contribute them - but the approach doesn't fit well with the approach in
> HADOOP-2765, which passes each of these resource limits as a hadoop
> config param.

    As I said in HADOOP-3280, I'm still convinced that having Hadoop set
these types of restrictions directly is the wrong approach.  It is almost
always going to be better to use the OS controls that are specific to the
installation.  Enabling OS-specific features means that, at a maximum,
hadoop should likely be calling a script rather than doing ulimits or
having the equivalent of #ifdef code everywhere. [After all, what if I want
to use Solaris-specific features like projects or privileges?]

    But I suspect most of the tunables that people will care about can
likely be managed at the OS level before hadoop is even involved.
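
    Purely to illustrate the shape of it (the script name and the idea
that hadoop would exec it around each task are hypothetical):

  #!/bin/sh
  # task-launch.sh - hypothetical site-local hook. Hadoop would hand it the
  # real task command line, and each installation applies whatever OS
  # controls it cares about before exec'ing the task.
  ulimit -v 2097152        # e.g. a Linux vmem cap, in KB
  # (a Solaris site might instead place the task in a resource-capped project)
  exec nice -n 10 "$@"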


RE: process limits for streaming jar

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Memory limits were handled in this jira:
https://issues.apache.org/jira/browse/HADOOP-2765

However - in our environment - we spawn all streaming tasks through a
wrapper program (with the wrapper being defaulted across all users). We
can control resource usage from this wrapper (and also do some level of
job control).

This does require some minor code changes, and we would be happy to
contribute them - but the approach doesn't fit well with the approach in
HADOOP-2765, which passes each of these resource limits as a hadoop
config param.

-----Original Message-----
From: jchris@gmail.com [mailto:jchris@gmail.com] On Behalf Of Chris Anderson
Sent: Wednesday, June 25, 2008 7:00 PM
To: core-user@hadoop.apache.org
Subject: process limits for streaming jar

Hi there,

I'm running some streaming jobs on ec2 (ruby parsing scripts) and in
my most recent test I managed to spike the load on my large instances
to 25 or so. As a result, I lost communication with one instance. I
think I took down sshd. Whoops.

My question is, has anyone got strategies for managing resources used
by the processes spawned by streaming jar? Ideally I'd like to run my
ruby scripts under nice.

I can hack something together with wrappers, but I'm thinking there
might be a configuration option to handle this within Streaming jar.
Thanks for any suggestions!

-- 
Chris Anderson
http://jchris.mfdz.com