Posted to common-user@hadoop.apache.org by Olga Natkovich <ol...@yahoo-inc.com> on 2008/01/09 00:55:27 UTC

question about file glob in hadoop 0.15

Hi,
 
According to 0.15 documentation, FileSystem::globPaths supports {ab,cd}
matching. However, when I tried to use it with pattern
/data/mydata/{data1,data2} I got no results even though I could find the
individual files.
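For reference, a minimal sketch of the call being described, as I read the
0.15 FileSystem API; the paths below are examples only and FileSystem.get()
is assumed to point at the DFS holding them:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GlobCheck {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            // The {a,b} alternation documented for 0.15; paths are made up.
            Path[] matches = fs.globPaths(new Path("/data/mydata/{data1,data2}"));
            for (Path p : matches) {
                System.out.println(p);
            }
        }
    }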
 
Any ideas?
 
Thanks,
 
Olga

Re: Question on running simultaneous jobs

Posted by Khalil Honsali <k....@gmail.com>.
I noticed it is not possible to run two jobs simultaneously as the same
user; the second job gets stuck at map 0% reduce 0%


On 10/01/2008, Jeff Hammerbacher <je...@gmail.com> wrote:
>
> it's a stopgap and doesn't seem to be working well for y!:
> https://issues.apache.org/jira/browse/HADOOP-2510
>
> On Jan 9, 2008 5:30 PM, Ted Dunning <td...@veoh.com> wrote:
>
> >
> > What is the status of Hadoop on Demand?  Is it ready for prime time?
> >
> >
> > On 1/9/08 4:58 PM, "Aaron Kimball" <ak...@cs.washington.edu> wrote:
> >
> > > I will add to the discussion that the ability to have multiple tasks
> of
> > > equal priority all making progress simultaneously is important in
> > > academic environments. There are a number of undergraduate programs
> > > which are starting to use Hadoop in code labs for students.
> > >
> > > Multiple students should be able to submit jobs and if one student's
> > > poorly-written task is grinding up a lot of cycles on a shared
> cluster,
> > > other students still need to be able to test their code in the
> meantime;
> > > ideally, they would not need to enter a lengthy job queue. ... I'd say
> > > that this actually applies to development clusters in general, where
> > > individual task performance is less important than the ability of
> > > multiple developers to test code concurrently.
> > >
> > > - Aaron
> > >
> > >
> > >
> > > Joydeep Sen Sarma wrote:
> > >>> that can run(per job) at any given time.
> > >>
> > >> not possible afaik - but i will be happy to hear otherwise.
> > >>
> > >> priorities are a good substitute though. there's no point needlessly
> > >> restricting concurrency if there's nothing else to run. if there is
> > something
> > >> else more important to run - then in most cases, assigning a higher
> > priority
> > >> to that other thing would make the right thing happen.
> > >>
> > >> except with long running tasks (usually reducers) that cannot be
> > preempted.
> > >> (Hadoop does not seem to use OS process priorities at all. I wonder
> if
> > >> process priorities can be used as a substitute for pre-emption.)
> > >>
> > >> HOD is another solution that you might want to look into - my
> > understanding
> > >> is that with HOD u can restrict the number of machines used by a job.
> > >>
> > >> ________________________________
> > >>
> > >> From: Xavier Stevens [mailto:Xavier.Stevens@fox.com]
> > >> Sent: Wed 1/9/2008 2:57 PM
> > >> To: hadoop-user@lucene.apache.org
> > >> Subject: RE: Question on running simultaneous jobs
> > >>
> > >>
> > >>
> > >> This doesn't work to solve this issue because it sets the total
> number
> > >> of map/reduce tasks. When setting the total number of map tasks I get
> > an
> > >> ArrayOutOfBoundsException within Hadoop; I believe because of the
> input
> > >> dataset size (around 90 million lines).
> > >>
> > >> I think it is important to make a distinction between setting total
> > >> number of map/reduce tasks and the number that can run(per job) at
> any
> > >> given time.  I would like only to restrict the later, while allowing
> > >> Hadoop to divide the data into chunks as it sees fit.
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Ted Dunning [mailto:tdunning@veoh.com]
> > >> Sent: Wednesday, January 09, 2008 1:50 PM
> > >> To: hadoop-user@lucene.apache.org
> > >> Subject: Re: Question on running simultaneous jobs
> > >>
> > >>
> > >> You may need to upgrade, but 15.1 does just fine with multiple jobs
> in
> > >> the cluster.  Use conf.setNumMapTasks(int) and
> > >> conf.setNumReduceTasks(int).
> > >>
> > >>
> > >> On 1/9/08 11:25 AM, "Xavier Stevens" <Xa...@fox.com> wrote:
> > >>
> > >>> Does Hadoop support running simultaneous jobs?  If so, what
> parameters
> > >>
> > >>> do I need to set in my job configuration?  We basically want to give
> a
> > >>
> > >>> job that takes a really long time, half of the total resources of
> the
> > >>> cluster so other jobs don't queue up behind it.
> > >>>
> > >>> I am using Hadoop 0.14.2 currently.  I tried setting
> > >>> mapred.tasktracker.tasks.maximum to be half of the maximum specified
> > >>> in mapred-default.xml.  This shows the change in the web
> > >>> administration page for the job, but it has no effect on the actual
> > >>> numbers of tasks running.
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Xavier
> > >>>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> >
> >
>



-- 
---------------------------------------------------------
A blessed and generous month
May you be well every year
---------------------------------------------------------
Honsali Khalil − 本査理 カリル
Academic>Japan>NIT>Grad. Sc. Eng.>Dept. CS>Matsuo&Tsumura Lab.
http://www.matlab.nitech.ac.jp/~k-hon/
+81 (zero-)eight-zero 5134 8119
k.honsali@ezweb.ne.jp (instant reply mail)

Re: Question on running simultaneous jobs

Posted by Jeff Hammerbacher <je...@gmail.com>.
it's a stopgap and doesn't seem to be working well for y!:
https://issues.apache.org/jira/browse/HADOOP-2510

On Jan 9, 2008 5:30 PM, Ted Dunning <td...@veoh.com> wrote:

>
> What is the status of Hadoop on Demand?  Is it ready for prime time?
>
>
> On 1/9/08 4:58 PM, "Aaron Kimball" <ak...@cs.washington.edu> wrote:
>
> > I will add to the discussion that the ability to have multiple tasks of
> > equal priority all making progress simultaneously is important in
> > academic environments. There are a number of undergraduate programs
> > which are starting to use Hadoop in code labs for students.
> >
> > Multiple students should be able to submit jobs and if one student's
> > poorly-written task is grinding up a lot of cycles on a shared cluster,
> > other students still need to be able to test their code in the meantime;
> > ideally, they would not need to enter a lengthy job queue. ... I'd say
> > that this actually applies to development clusters in general, where
> > individual task performance is less important than the ability of
> > multiple developers to test code concurrently.
> >
> > - Aaron
> >
> >
> >
> > Joydeep Sen Sarma wrote:
> >>> that can run(per job) at any given time.
> >>
> >> not possible afaik - but i will be happy to hear otherwise.
> >>
> >> priorities are a good substitute though. there's no point needlessly
> >> restricting concurrency if there's nothing else to run. if there is
> something
> >> else more important to run - then in most cases, assigning a higher
> priority
> >> to that other thing would make the right thing happen.
> >>
> >> except with long running tasks (usually reducers) that cannot be
> preempted.
> >> (Hadoop does not seem to use OS process priorities at all. I wonder if
> >> process priorities can be used as a substitute for pre-emption.)
> >>
> >> HOD is another solution that you might want to look into - my
> understanding
> >> is that with HOD u can restrict the number of machines used by a job.
> >>
> >> ________________________________
> >>
> >> From: Xavier Stevens [mailto:Xavier.Stevens@fox.com]
> >> Sent: Wed 1/9/2008 2:57 PM
> >> To: hadoop-user@lucene.apache.org
> >> Subject: RE: Question on running simultaneous jobs
> >>
> >>
> >>
> >> This doesn't work to solve this issue because it sets the total number
> >> of map/reduce tasks. When setting the total number of map tasks I get
> an
> >> ArrayOutOfBoundsException within Hadoop; I believe because of the input
> >> dataset size (around 90 million lines).
> >>
> >> I think it is important to make a distinction between setting total
> >> number of map/reduce tasks and the number that can run(per job) at any
> >> given time.  I would like only to restrict the later, while allowing
> >> Hadoop to divide the data into chunks as it sees fit.
> >>
> >>
> >> -----Original Message-----
> >> From: Ted Dunning [mailto:tdunning@veoh.com]
> >> Sent: Wednesday, January 09, 2008 1:50 PM
> >> To: hadoop-user@lucene.apache.org
> >> Subject: Re: Question on running simultaneous jobs
> >>
> >>
> >> You may need to upgrade, but 15.1 does just fine with multiple jobs in
> >> the cluster.  Use conf.setNumMapTasks(int) and
> >> conf.setNumReduceTasks(int).
> >>
> >>
> >> On 1/9/08 11:25 AM, "Xavier Stevens" <Xa...@fox.com> wrote:
> >>
> >>> Does Hadoop support running simultaneous jobs?  If so, what parameters
> >>
> >>> do I need to set in my job configuration?  We basically want to give a
> >>
> >>> job that takes a really long time, half of the total resources of the
> >>> cluster so other jobs don't queue up behind it.
> >>>
> >>> I am using Hadoop 0.14.2 currently.  I tried setting
> >>> mapred.tasktracker.tasks.maximum to be half of the maximum specified
> >>> in mapred-default.xml.  This shows the change in the web
> >>> administration page for the job, but it has no effect on the actual
> >>> numbers of tasks running.
> >>>
> >>> Thanks,
> >>>
> >>> Xavier
> >>>
> >>
> >>
> >>
> >>
> >>
> >>
>
>

RE: Question on running simultaneous jobs

Posted by Joydeep Sen Sarma <js...@facebook.com>.
perhaps the java process can have a signal handler to checkpoint in-memory data and then suspend itself?
 
what I mean is: it could finish any intermediate sort/merge runs and then suspend. the next time it restarts it would, hopefully, not need much of the memory from its data segment and could start afresh.
 
but this is speculating that most of the memory consumption is sort buffering .. 
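
A purely speculative sketch of that idea, assuming sun.misc.Signal is usable
and that SIGTSTP can be trapped in the task JVM; checkpointSortBuffers() is a
made-up hook, not a Hadoop API:

    import java.lang.management.ManagementFactory;

    import sun.misc.Signal;
    import sun.misc.SignalHandler;

    public class SuspendableTask {
        // Placeholder for spilling any in-memory sort/merge runs to disk.
        static void checkpointSortBuffers() { }

        public static void main(String[] args) throws Exception {
            Signal.handle(new Signal("TSTP"), new SignalHandler() {
                public void handle(Signal sig) {
                    checkpointSortBuffers();
                    try {
                        // No portable "suspend myself" in the JVM, so shell out;
                        // relies on the usual "pid@host" form of the runtime name.
                        String pid = ManagementFactory.getRuntimeMXBean()
                                .getName().split("@")[0];
                        Runtime.getRuntime().exec(new String[] {"kill", "-STOP", pid});
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            Thread.sleep(Long.MAX_VALUE);  // stand-in for real task work
        }
    }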

________________________________

From: Doug Cutting [mailto:cutting@apache.org]
Sent: Thu 1/10/2008 2:39 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Question on running simultaneous jobs



Joydeep Sen Sarma wrote:
> being paged out is sad - but the worst case is still no worse than killing the job (where all the data has to be *recomputed* back into memory on restart - not just swapped in from disk)

In my experience, once a large process is paged out it is almost always
faster to restart it than to wait for it to get paged back in with
random disk accesses.  If there were a way to explicitly write out a
process's working set, and then restore it later, using sequential disk
accesses, that might be effective.  Virtualization systems support such
operations, so perhaps tasktrackers should start a Xen instance per task?

Doug



Re: Question on running simultaneous jobs

Posted by Doug Cutting <cu...@apache.org>.
Joydeep Sen Sarma wrote:
> being paged out is sad - but the worst case is still no worse than killing the job (where all the data has to be *recomputed* back into memory on restart - not just swapped in from disk)

In my experience, once a large process is paged out it is almost always 
faster to restart it than to wait for it to get paged back in with 
random disk accesses.  If there were a way to explicitly write out a 
process's working set, and then restore it later, using sequential disk 
accesses, that might be effective.  Virtualization systems support such 
operations, so perhaps tasktrackers should start a Xen instance per task?

Doug

RE: Question on running simultaneous jobs

Posted by Joydeep Sen Sarma <js...@facebook.com>.
being paged out is sad - but the worst case is still no worse than killing the job (where all the data has to be *recomputed* back into memory on restart - not just swapped in from disk)
 
the best and average cases are likely way better ..
 
(disk capacity seems to be no issue at all - but perhaps we are blessed to be in this state).

________________________________

From: Doug Cutting [mailto:cutting@apache.org]
Sent: Thu 1/10/2008 2:24 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Question on running simultaneous jobs



Joydeep Sen Sarma wrote:
> can we suspend jobs (just unix suspend) instead of killing them?

We could, but they'd still consume RAM and disk.  The RAM might
eventually get paged out, but relying on that is probably a bad idea.
So, this could work for tasks that don't use much memory and whose
intermediate data is small, but that's frequently not the case.

Doug



Re: Question on running simultaneous jobs

Posted by Doug Cutting <cu...@apache.org>.
Joydeep Sen Sarma wrote:
> can we suspend jobs (just unix suspend) instead of killing them?

We could, but they'd still consume RAM and disk.  The RAM might 
eventually get paged out, but relying on that is probably a bad idea. 
So, this could work for tasks that don't use much memory and whose 
intermediate data is small, but that's frequently not the case.

Doug

RE: Question on running simultaneous jobs

Posted by Joydeep Sen Sarma <js...@facebook.com>.
can we suspend jobs (just unix suspend) instead of killing them?
 
if we can - perhaps we don't even have to bother delaying the use of additional slots beyond the limit.

________________________________

From: Doug Cutting [mailto:cutting@apache.org]
Sent: Thu 1/10/2008 11:21 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Question on running simultaneous jobs



Runping Qi wrote:
> An improvement over Doug's proposal is to make the limit soft in the
> following sense:
>
> 1. A job is entitled to run up to the limit number of tasks.
> 2. If there are free slots and no other job waits for their entitled
> slots, a job can run more tasks than the limit.
> 3. When a job runs more tasks than its limit, and a new job comes, we
> may do one of the two:
>       a) kill some of the tasks to make room for the new job.
>       b) all the running tasks run to complete. Any freed up slot will
> be assigned to the new job.

I think this would be a good second phase, as it will be trickier to
implement.

Jobs that disable speculative execution may not like having tasks killed
(although they must in general still be tolerant of it) so we might only
permit jobs with speculative execution enabled to exceed their limit.

Also there should be a delay before a job is permitted to run over its
limit, in order to give other jobs an opportunity to launch.  For
example, if a user is submitting a series of jobs, each consuming the
output of the previous, then we wouldn't want an already running job to
immediately consume all the free slots when one job completes, since
another job will soon be started that is more deserving of these slots.
  Perhaps, when portions of the cluster are idle, jobs should gradually
be permitted to exceed their limit.  Then, if new jobs are launched,
tasks should only gradually be killed, first giving them the opportunity
to finish normally.  Some tuning will probably be required to get this
right.

Ideally the limit would be dynamic, perhaps something like max(10,
#slots/#jobs), so jobs would only be queued when there are fewer than 10
slots/job.  But a static limit would still be a significant improvement
and easier to implement in the first version.

Doug



Re: Question on running simultaneous jobs

Posted by Doug Cutting <cu...@apache.org>.
Runping Qi wrote:
> An improvement over Doug's proposal is to make the limit soft in the
> following sense:
> 
> 1. A job is entitled to run up to the limit number of tasks.
> 2. If there are free slots and no other job waits for their entitled
> slots, a job can run more tasks than the limit.
> 3. When a job runs more tasks than its limit, and a new job comes, we
> may do one of the two:
> 	a) kill some of the tasks to make room for the new job.
> 	b) all the running tasks run to complete. Any freed up slot will
> be assigned to the new job.

I think this would be a good second phase, as it will be trickier to 
implement.

Jobs that disable speculative execution may not like having tasks killed 
(although they must in general still be tolerant of it) so we might only 
permit jobs with speculative execution enabled to exceed their limit.

Also there should be a delay before a job is permitted to run over its 
limit, in order to give other jobs an opportunity to launch.  For 
example, if a user is submitting a series of jobs, each consuming the 
output of the previous, then we wouldn't want an already running job to 
immediately consume all the free slots when one job completes, since 
another job will soon be started that is more deserving of these slots. 
  Perhaps, when portions of the cluster are idle, jobs should gradually 
be permitted to exceed their limit.  Then, if new jobs are launched, 
tasks should only gradually be killed, first giving them the opportunity 
to finish normally.  Some tuning will probably be required to get this 
right.

Ideally the limit would be dynamic, perhaps something like max(10, 
#slots/#jobs), so jobs would only be queued when there are fewer than 10 
slots/job.  But a static limit would still be a significant improvement 
and easier to implement in the first version.
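
A minimal sketch of that dynamic limit, with the slot and job counts passed
in as plain integers (the names are illustrative only):

    public class DynamicLimit {
        // max(10, #slots/#jobs) as suggested above, guarding against zero jobs.
        static int limit(int totalSlots, int activeJobs) {
            if (activeJobs == 0) {
                return totalSlots;
            }
            return Math.max(10, totalSlots / activeJobs);
        }

        public static void main(String[] args) {
            // With 100 slots: 1 job gets 100, 4 jobs get 25 each,
            // 20 jobs hit the floor of 10.
            System.out.println(limit(100, 1) + " " + limit(100, 4) + " " + limit(100, 20));
        }
    }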

Doug

RE: Question on running simultaneous jobs

Posted by Runping Qi <ru...@yahoo-inc.com>.
An improvement over Doug's proposal is to make the limit soft in the
following sense:

1. A job is entitled to run up to its limit of tasks.
2. If there are free slots and no other job is waiting for its entitled
slots, a job can run more tasks than the limit.
3. When a job runs more tasks than its limit and a new job comes, we
may do one of two things:
	a) kill some of the tasks to make room for the new job.
	b) let all the running tasks run to completion; any freed-up slot will
be assigned to the new job.

Runping
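
A small sketch of the soft limit described above, with the bookkeeping
reduced to plain integers; none of these names correspond to actual
JobTracker code, and rule 3 (kill vs. drain) is left out:

    public class SoftLimitCheck {
        // Rule 1: a job under its entitlement may always take a free slot.
        // Rule 2: over its entitlement, it may only borrow idle capacity
        //         that no other job is waiting for.
        static boolean mayLaunchTask(int runningTasks, int entitlement,
                                     int freeSlots, boolean othersWaiting) {
            if (freeSlots == 0) {
                return false;
            }
            if (runningTasks < entitlement) {
                return true;
            }
            return !othersWaiting;
        }

        public static void main(String[] args) {
            System.out.println(mayLaunchTask(30, 25, 5, false)); // over limit, cluster idle: true
            System.out.println(mayLaunchTask(30, 25, 5, true));  // over limit, others waiting: false
        }
    }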


> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:jssarma@facebook.com] 
> Sent: Thursday, January 10, 2008 9:57 AM
> To: hadoop-user@lucene.apache.org; hadoop-user@lucene.apache.org
> Subject: RE: Question on running simultaneous jobs
> 
> this may be simple - but is this the right solution? (and i 
> have the same concern about hod)
> 
> if the cluster is unused - why restrict parallelism? if 
> someone's willing to wake up at 4am to beat the crowd - they 
> would just absolutely hate this.
> 
> 
> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Thu 1/10/2008 9:50 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Question on running simultaneous jobs
>  
> Aaron Kimball wrote:
> > Multiple students should be able to submit jobs and if one 
> student's 
> > poorly-written task is grinding up a lot of cycles on a shared 
> > cluster, other students still need to be able to test their code in 
> > the meantime;
> 
> I think a simple approach to address this is to limit the 
> number of tasks from a job that are permitted to execute 
> simultaneously.  If, for example, you have a cluster of 50 
> dual-core nodes, with 100 map task slots and 100 reduce task 
> slots, and the configured limit is 25 simultaneous tasks/job, 
> then four or more jobs will be able to run at a time.  This 
> will permit faster jobs to pass slower jobs.  This approach 
> also avoids some problems we've seen with HOD, where nodes 
> are underutilized during the tail of jobs, and with input locality.
> 
> The JobTracker already handles simultaneously executing jobs, 
> so the primary change required is just to task allocation, 
> and thus should not prove intractable.
> 
> I've added a Jira issue for this:
> 
>    https://issues.apache.org/jira/browse/HADOOP-2573
> 
> Please add further comments there.
> 
> Doug
> 
> 

Re: Question on running simultaneous jobs

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
On Thu, Jan 10, 2008 at 10:26:46AM -0800, Doug Cutting wrote:
>Joydeep Sen Sarma wrote:
>>if the cluster is unused - why restrict parallelism? if someone's willing 
>>to wake up at 4am to beat the crowd - they would just absolutely hate this.
>
>[It would be better to make your comments in Jira. ]
>
>But if someone starts a long-running job at night that uses the entire 
>cluster then they could monopolize the cluster into the day.  If 
>speculative execution is enabled, then some tasks could be killed to 
>make room for other jobs that are started in the morning, but that's not 
>always possible.  And, if it's not, pickling a job's state and swapping 
>it to HDFS would be expensive.
>

I'd like to throw *job priority* into this festering pool...

At least changing the job-priority (done by the cluster-admin) should result 
in a change in number of max_slots... thoughts?

Arun

PS: Yes, I do wish this was in jira - I'll add a comment there.

>Note also that a task-limiting cluster will still run faster at 
>night.  If you've got 50 nodes with up to 200 tasks running at a time, 
>then tasks will run faster when only 50 are running.  The network is 
>also a primary bottleneck, and it will be less congested when fewer jobs 
>are running, and disk contention will be lower too.  So night owls would 
>still have significant advantages.
>
>It's not intended as a perfect solution, but rather a substantial 
>improvement for many users that's not too hard to implement.
>
>Doug

Re: Question on running simultaneous jobs

Posted by Doug Cutting <cu...@apache.org>.
Joydeep Sen Sarma wrote:
> if the cluster is unused - why restrict parallelism? if someone's willing to wake up at 4am to beat the crowd - they would just absolutely hate this.

[It would be better to make your comments in Jira. ]

But if someone starts a long-running job at night that uses the entire 
cluster then they could monopolize the cluster into the day.  If 
speculative execution is enabled, then some tasks could be killed to 
make room for other jobs that are started in the morning, but that's not 
always possible.  And, if it's not, pickling a job's state and swapping 
it to HDFS would be expensive.

Note also that a task-limiting cluster will still run faster at 
night.  If you've got 50 nodes with up to 200 tasks running at a time, 
then tasks will run faster when only 50 are running.  The network is 
also a primary bottleneck, and it will be less congested when fewer jobs 
are running, and disk contention will be lower too.  So night owls would 
still have significant advantages.

It's not intended as a perfect solution, but rather a substantial 
improvement for many users that's not too hard to implement.

Doug

Re: Question on running simultaneous jobs

Posted by Ted Dunning <td...@veoh.com>.
Presumably the limit could be made dynamic.  The limit could be
max(static_limit, number of cores in cluster / # active jobs)


On 1/10/08 9:56 AM, "Joydeep Sen Sarma" <js...@facebook.com> wrote:

> this may be simple - but is this the right solution? (and i have the same
> concern about hod)
> 
> if the cluster is unused - why restrict parallelism? if someone's willing to
> wake up at 4am to beat the crowd - they would just absolutely hate this.
> 
> 
> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Thu 1/10/2008 9:50 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Question on running simultaneous jobs
>  
> Aaron Kimball wrote:
>> Multiple students should be able to submit jobs and if one student's
>> poorly-written task is grinding up a lot of cycles on a shared cluster,
>> other students still need to be able to test their code in the meantime;
> 
> I think a simple approach to address this is to limit the number of
> tasks from a job that are permitted to execute simultaneously.  If, for
> example, you have a cluster of 50 dual-core nodes, with 100 map task
> slots and 100 reduce task slots, and the configured limit is 25
> simultaneous tasks/job, then four or more jobs will be able to run at a
> time.  This will permit faster jobs to pass slower jobs.  This approach
> also avoids some problems we've seen with HOD, where nodes are
> underutilized during the tail of jobs, and with input locality.
> 
> The JobTracker already handles simultaneously executing jobs, so the
> primary change required is just to task allocation, and thus should not
> prove intractable.
> 
> I've added a Jira issue for this:
> 
>    https://issues.apache.org/jira/browse/HADOOP-2573
> 
> Please add further comments there.
> 
> Doug
> 


RE: Question on running simultaneous jobs

Posted by Joydeep Sen Sarma <js...@facebook.com>.
this may be simple - but is this the right solution? (and i have the same concern about hod)

if the cluster is unused - why restrict parallelism? if someone's willing to wake up at 4am to beat the crowd - they would just absolutely hate this.


-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Thu 1/10/2008 9:50 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Question on running simultaneous jobs
 
Aaron Kimball wrote:
> Multiple students should be able to submit jobs and if one student's 
> poorly-written task is grinding up a lot of cycles on a shared cluster, 
> other students still need to be able to test their code in the meantime; 

I think a simple approach to address this is to limit the number of 
tasks from a job that are permitted to execute simultaneously.  If, for 
example, you have a cluster of 50 dual-core nodes, with 100 map task 
slots and 100 reduce task slots, and the configured limit is 25 
simultaneous tasks/job, then four or more jobs will be able to run at a 
time.  This will permit faster jobs to pass slower jobs.  This approach 
also avoids some problems we've seen with HOD, where nodes are 
underutilized during the tail of jobs, and with input locality.

The JobTracker already handles simultaneously executing jobs, so the 
primary change required is just to task allocation, and thus should not 
prove intractable.

I've added a Jira issue for this:

   https://issues.apache.org/jira/browse/HADOOP-2573

Please add further comments there.

Doug


Re: Question on running simultaneous jobs

Posted by Doug Cutting <cu...@apache.org>.
Aaron Kimball wrote:
> Multiple students should be able to submit jobs and if one student's 
> poorly-written task is grinding up a lot of cycles on a shared cluster, 
> other students still need to be able to test their code in the meantime; 

I think a simple approach to address this is to limit the number of 
tasks from a job that are permitted to execute simultaneously.  If, for 
example, you have a cluster of 50 dual-core nodes, with 100 map task 
slots and 100 reduce task slots, and the configured limit is 25 
simultaneous tasks/job, then four or more jobs will be able to run at a 
time.  This will permit faster jobs to pass slower jobs.  This approach 
also avoids some problems we've seen with HOD, where nodes are 
underutilized during the tail of jobs, and with input locality.
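
A toy model of that allocation rule; the Job class and slot accounting below
are invented for illustration and are not the JobTracker's actual data
structures:

    import java.util.Arrays;
    import java.util.List;

    public class CappedAllocation {
        static class Job {
            int running, pending;
            Job(int pending) { this.pending = pending; }
        }

        // Give a free slot to the first job still under the configured cap.
        static Job assign(List<Job> jobs, int capPerJob) {
            for (Job j : jobs) {
                if (j.pending > 0 && j.running < capPerJob) {
                    j.running++;
                    j.pending--;
                    return j;
                }
            }
            return null;  // every job is capped or out of work
        }

        public static void main(String[] args) {
            // 50 dual-core nodes => 100 map slots; cap of 25 tasks per job.
            int mapSlots = 100, cap = 25;
            List<Job> jobs = Arrays.asList(new Job(500), new Job(500),
                    new Job(500), new Job(500), new Job(500));
            int assigned = 0, jobsRunning = 0;
            for (int slot = 0; slot < mapSlots; slot++) {
                Job j = assign(jobs, cap);
                if (j == null) break;
                assigned++;
                if (j.running == 1) jobsRunning++;
            }
            // Prints "100 tasks across 4 jobs": with a cap of 25 on 100 slots,
            // four jobs run at once, matching the example above.
            System.out.println(assigned + " tasks across " + jobsRunning + " jobs");
        }
    }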

The JobTracker already handles simultaneously executing jobs, so the 
primary change required is just to task allocation, and thus should not 
prove intractable.

I've added a Jira issue for this:

   https://issues.apache.org/jira/browse/HADOOP-2573

Please add further comments there.

Doug

Re: Question on running simultaneous jobs

Posted by Ted Dunning <td...@veoh.com>.
What is the status of Hadoop on Demand?  Is it ready for prime time?


On 1/9/08 4:58 PM, "Aaron Kimball" <ak...@cs.washington.edu> wrote:

> I will add to the discussion that the ability to have multiple tasks of
> equal priority all making progress simultaneously is important in
> academic environments. There are a number of undergraduate programs
> which are starting to use Hadoop in code labs for students.
> 
> Multiple students should be able to submit jobs and if one student's
> poorly-written task is grinding up a lot of cycles on a shared cluster,
> other students still need to be able to test their code in the meantime;
> ideally, they would not need to enter a lengthy job queue. ... I'd say
> that this actually applies to development clusters in general, where
> individual task performance is less important than the ability of
> multiple developers to test code concurrently.
> 
> - Aaron
> 
> 
> 
> Joydeep Sen Sarma wrote:
>>> that can run(per job) at any given time.
>>  
>> not possible afaik - but i will be happy to hear otherwise.
>>  
>> priorities are a good substitute though. there's no point needlessly
>> restricting concurrency if there's nothing else to run. if there is something
>> else more important to run - then in most cases, assigning a higher priority
>> to that other thing would make the right thing happen.
>>  
>> except with long running tasks (usually reducers) that cannot be preempted.
>> (Hadoop does not seem to use OS process priorities at all. I wonder if
>> process priorities can be used as a substitute for pre-emption.)
>>  
>> HOD is another solution that you might want to look into - my understanding
>> is that with HOD u can restrict the number of machines used by a job.
>>  
>> ________________________________
>> 
>> From: Xavier Stevens [mailto:Xavier.Stevens@fox.com]
>> Sent: Wed 1/9/2008 2:57 PM
>> To: hadoop-user@lucene.apache.org
>> Subject: RE: Question on running simultaneous jobs
>> 
>> 
>> 
>> This doesn't work to solve this issue because it sets the total number
>> of map/reduce tasks. When setting the total number of map tasks I get an
>> ArrayOutOfBoundsException within Hadoop; I believe because of the input
>> dataset size (around 90 million lines).
>> 
>> I think it is important to make a distinction between setting total
>> number of map/reduce tasks and the number that can run(per job) at any
>> given time.  I would like only to restrict the later, while allowing
>> Hadoop to divide the data into chunks as it sees fit.
>> 
>> 
>> -----Original Message-----
>> From: Ted Dunning [mailto:tdunning@veoh.com]
>> Sent: Wednesday, January 09, 2008 1:50 PM
>> To: hadoop-user@lucene.apache.org
>> Subject: Re: Question on running simultaneous jobs
>> 
>> 
>> You may need to upgrade, but 15.1 does just fine with multiple jobs in
>> the cluster.  Use conf.setNumMapTasks(int) and
>> conf.setNumReduceTasks(int).
>> 
>> 
>> On 1/9/08 11:25 AM, "Xavier Stevens" <Xa...@fox.com> wrote:
>> 
>>> Does Hadoop support running simultaneous jobs?  If so, what parameters
>> 
>>> do I need to set in my job configuration?  We basically want to give a
>> 
>>> job that takes a really long time, half of the total resources of the
>>> cluster so other jobs don't queue up behind it.
>>> 
>>> I am using Hadoop 0.14.2 currently.  I tried setting
>>> mapred.tasktracker.tasks.maximum to be half of the maximum specified
>>> in mapred-default.xml.  This shows the change in the web
>>> administration page for the job, but it has no effect on the actual
>>> numbers of tasks running.
>>> 
>>> Thanks,
>>> 
>>> Xavier
>>> 
>> 
>> 
>> 
>> 
>> 
>> 


Re: Question on running simultaneous jobs

Posted by Aaron Kimball <ak...@cs.washington.edu>.
I will add to the discussion that the ability to have multiple tasks of 
equal priority all making progress simultaneously is important in 
academic environments. There are a number of undergraduate programs 
which are starting to use Hadoop in code labs for students.

Multiple students should be able to submit jobs and if one student's 
poorly-written task is grinding up a lot of cycles on a shared cluster, 
other students still need to be able to test their code in the meantime; 
ideally, they would not need to enter a lengthy job queue. ... I'd say 
that this actually applies to development clusters in general, where 
individual task performance is less important than the ability of 
multiple developers to test code concurrently.

- Aaron



Joydeep Sen Sarma wrote:
>> that can run(per job) at any given time.  
>  
> not possible afaik - but i will be happy to hear otherwise.
>  
> priorities are a good substitute though. there's no point needlessly restricting concurrency if there's nothing else to run. if there is something else more important to run - then in most cases, assigning a higher priority to that other thing would make the right thing happen.
>  
> except with long running tasks (usually reducers) that cannot be preempted. (Hadoop does not seem to use OS process priorities at all. I wonder if process priorities can be used as a substitute for pre-emption.)
>  
> HOD is another solution that you might want to look into - my understanding is that with HOD u can restrict the number of machines used by a job.
>  
> ________________________________
> 
> From: Xavier Stevens [mailto:Xavier.Stevens@fox.com]
> Sent: Wed 1/9/2008 2:57 PM
> To: hadoop-user@lucene.apache.org
> Subject: RE: Question on running simultaneous jobs
> 
> 
> 
> This doesn't work to solve this issue because it sets the total number
> of map/reduce tasks. When setting the total number of map tasks I get an
> ArrayOutOfBoundsException within Hadoop; I believe because of the input
> dataset size (around 90 million lines).
> 
> I think it is important to make a distinction between setting total
> number of map/reduce tasks and the number that can run(per job) at any
> given time.  I would like only to restrict the later, while allowing
> Hadoop to divide the data into chunks as it sees fit.
> 
> 
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Wednesday, January 09, 2008 1:50 PM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Question on running simultaneous jobs
> 
> 
> You may need to upgrade, but 15.1 does just fine with multiple jobs in
> the cluster.  Use conf.setNumMapTasks(int) and
> conf.setNumReduceTasks(int).
> 
> 
> On 1/9/08 11:25 AM, "Xavier Stevens" <Xa...@fox.com> wrote:
> 
>> Does Hadoop support running simultaneous jobs?  If so, what parameters
> 
>> do I need to set in my job configuration?  We basically want to give a
> 
>> job that takes a really long time, half of the total resources of the
>> cluster so other jobs don't queue up behind it.
>>
>> I am using Hadoop 0.14.2 currently.  I tried setting
>> mapred.tasktracker.tasks.maximum to be half of the maximum specified
>> in mapred-default.xml.  This shows the change in the web
>> administration page for the job, but it has no effect on the actual
>> numbers of tasks running.
>>
>> Thanks,
>>
>> Xavier
>>
> 
> 
> 
> 
> 
> 

RE: Question on running simultaneous jobs

Posted by Joydeep Sen Sarma <js...@facebook.com>.
> that can run(per job) at any given time.  
 
not possible afaik - but i will be happy to hear otherwise.
 
priorities are a good substitute though. there's no point needlessly restricting concurrency if there's nothing else to run. if there is something else more important to run - then in most cases, assigning a higher priority to that other thing would make the right thing happen.
 
except with long running tasks (usually reducers) that cannot be preempted. (Hadoop does not seem to use OS process priorities at all. I wonder if process priorities can be used as a substitute for pre-emption.)
 
HOD is another solution that you might want to look into - my understanding is that with HOD you can restrict the number of machines used by a job.
 
________________________________

From: Xavier Stevens [mailto:Xavier.Stevens@fox.com]
Sent: Wed 1/9/2008 2:57 PM
To: hadoop-user@lucene.apache.org
Subject: RE: Question on running simultaneous jobs



This doesn't work to solve this issue because it sets the total number
of map/reduce tasks. When setting the total number of map tasks I get an
ArrayOutOfBoundsException within Hadoop; I believe because of the input
dataset size (around 90 million lines).

I think it is important to make a distinction between setting total
number of map/reduce tasks and the number that can run(per job) at any
given time.  I would like only to restrict the later, while allowing
Hadoop to divide the data into chunks as it sees fit.


-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com]
Sent: Wednesday, January 09, 2008 1:50 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Question on running simultaneous jobs


You may need to upgrade, but 15.1 does just fine with multiple jobs in
the cluster.  Use conf.setNumMapTasks(int) and
conf.setNumReduceTasks(int).


On 1/9/08 11:25 AM, "Xavier Stevens" <Xa...@fox.com> wrote:

> Does Hadoop support running simultaneous jobs?  If so, what parameters

> do I need to set in my job configuration?  We basically want to give a

> job that takes a really long time, half of the total resources of the
> cluster so other jobs don't queue up behind it.
>
> I am using Hadoop 0.14.2 currently.  I tried setting
> mapred.tasktracker.tasks.maximum to be half of the maximum specified
> in mapred-default.xml.  This shows the change in the web
> administration page for the job, but it has no effect on the actual
> numbers of tasks running.
>
> Thanks,
>
> Xavier
>






RE: Question on running simultaneous jobs

Posted by Xavier Stevens <Xa...@fox.com>.
This doesn't solve the issue because it sets the total number
of map/reduce tasks. When setting the total number of map tasks I get an
ArrayOutOfBoundsException within Hadoop; I believe this is because of the
input dataset size (around 90 million lines).

I think it is important to make a distinction between setting the total
number of map/reduce tasks and the number that can run (per job) at any
given time.  I would like to restrict only the latter, while allowing
Hadoop to divide the data into chunks as it sees fit.


-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Wednesday, January 09, 2008 1:50 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Question on running simultaneous jobs


You may need to upgrade, but 15.1 does just fine with multiple jobs in
the cluster.  Use conf.setNumMapTasks(int) and
conf.setNumReduceTasks(int).


On 1/9/08 11:25 AM, "Xavier Stevens" <Xa...@fox.com> wrote:

> Does Hadoop support running simultaneous jobs?  If so, what parameters

> do I need to set in my job configuration?  We basically want to give a

> job that takes a really long time, half of the total resources of the 
> cluster so other jobs don't queue up behind it.
> 
> I am using Hadoop 0.14.2 currently.  I tried setting 
> mapred.tasktracker.tasks.maximum to be half of the maximum specified 
> in mapred-default.xml.  This shows the change in the web 
> administration page for the job, but it has no effect on the actual 
> numbers of tasks running.
> 
> Thanks,
> 
> Xavier
> 




Re: Question on running simultaneous jobs

Posted by Michael Bieniosek <mi...@powerset.com>.
Hadoop-0.14 introduced job priorities
(https://issues.apache.org/jira/browse/HADOOP-1433); you might be able to
get somewhere with this.

Another possibility is to create two mapreduce clusters on top of the  
same dfs cluster.

The mapred.tasktracker.tasks.maximum doesn't do what you think -- it  
actually controls the number of tasks that run simultaneously on a  
tasktracker machine.
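
For what it's worth, a hedged sketch of using the HADOOP-1433 priority from
a job's configuration; the mapred.job.priority property name and the
VERY_HIGH..VERY_LOW values are my reading and should be checked against your
release:

    import org.apache.hadoop.mapred.JobConf;

    public class PriorityExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(PriorityExample.class);
            // Assumed property name/values from HADOOP-1433; verify for 0.14/0.15.
            conf.set("mapred.job.priority", "HIGH");
            // ... set mapper, reducer, input/output paths, then submit as usual ...
        }
    }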

-Michael

On Jan 9, 2008, at 11:25 AM, Xavier Stevens wrote:

> Does Hadoop support running simultaneous jobs?  If so, what parameters
> do I need to set in my job configuration?  We basically want to give a
> job that takes a really long time, half of the total resources of the
> cluster so other jobs don't queue up behind it.
>
> I am using Hadoop 0.14.2 currently.  I tried setting
> mapred.tasktracker.tasks.maximum to be half of the maximum  
> specified in
> mapred-default.xml.  This shows the change in the web administration
> page for the job, but it has no effect on the actual numbers of tasks
> running.
>
> Thanks,
>
> Xavier
>


Re: Question on running simultaneous jobs

Posted by Ted Dunning <td...@veoh.com>.
You may need to upgrade, but 15.1 does just fine with multiple jobs in the
cluster.  Use conf.setNumMapTasks(int) and conf.setNumReduceTasks(int).
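
A minimal sketch of those calls on the old mapred API; the class name and
task counts are placeholders, and note that the map count is only a hint to
the framework while the reduce count is taken literally:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class JobSizingExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JobSizingExample.class);
            conf.setNumMapTasks(200);    // a hint; the InputFormat's splits decide the real count
            conf.setNumReduceTasks(50);  // honored exactly
            // Mapper/reducer classes and input/output paths would need to be
            // set before this submit would actually do anything useful.
            JobClient.runJob(conf);
        }
    }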


On 1/9/08 11:25 AM, "Xavier Stevens" <Xa...@fox.com> wrote:

> Does Hadoop support running simultaneous jobs?  If so, what parameters
> do I need to set in my job configuration?  We basically want to give a
> job that takes a really long time, half of the total resources of the
> cluster so other jobs don't queue up behind it.
> 
> I am using Hadoop 0.14.2 currently.  I tried setting
> mapred.tasktracker.tasks.maximum to be half of the maximum specified in
> mapred-default.xml.  This shows the change in the web administration
> page for the job, but it has no effect on the actual numbers of tasks
> running.
> 
> Thanks,
> 
> Xavier
> 


Question on running simultaneous jobs

Posted by Xavier Stevens <Xa...@fox.com>.
Does Hadoop support running simultaneous jobs?  If so, what parameters
do I need to set in my job configuration?  We basically want to give a
job that takes a really long time half of the total resources of the
cluster, so other jobs don't queue up behind it.

I am using Hadoop 0.14.2 currently.  I tried setting
mapred.tasktracker.tasks.maximum to be half of the maximum specified in
mapred-default.xml.  This shows the change in the web administration
page for the job, but it has no effect on the actual numbers of tasks
running.

Thanks,

Xavier


RE: question about file glob in hadoop 0.15

Posted by Hairong Kuang <ha...@yahoo-inc.com>.
Hi Olga,

Yes, there is a bug in the code that causes the error you described.
The seven {}-related unit test cases do not cover this most common case.
Sigh... I filed a jira at
https://issues.apache.org/jira/browse/HADOOP-2562. Hopefully it will get
into the 0.15.3 release, which is still being debated.

Hairong

-----Original Message-----
From: Olga Natkovich [mailto:olgan@yahoo-inc.com] 
Sent: Tuesday, January 08, 2008 3:55 PM
To: hadoop-user@lucene.apache.org
Subject: question about file glob in hadoop 0.15

Hi,
 
According to 0.15 documentation, FileSystem::globPaths supports {ab,cd}
matching. However, when I tried to use it with pattern
/data/mydata/{data1,data2} I got no results even though I could find the
individual files.
 
Any ideas?
 
Thanks,
 
Olga