Posted to common-user@hadoop.apache.org by Kris Jirapinyo <kj...@biz360.com> on 2009/02/13 23:29:43 UTC

Running Map and Reduce Sequentially

Is there a way to tell Hadoop not to run Map and Reduce concurrently?  I'm
running into a problem where I set the JVM heap to -Xmx768m, and it seems
like 2 mappers and 2 reducers are running on each machine that has only
1.7GB of RAM, so it complains about not being able to allocate memory
(which makes sense, since 4 x 768MB > 1.7GB).  If it would just finish the
Map phase and then start on Reduce, there would only be 2 JVMs running on
any one machine at a given time, which might avoid this out-of-memory error.

Re: Running Map and Reduce Sequentially

Posted by Matei Zaharia <ma...@cloudera.com>.
Do your mappers really need 768 MB? You can set the heap size differently
for them than for the reducers: pass a different value of
mapred.child.java.opts for the reduce phase than for the map phase (by
setting it in the JobConf in your driver program, or using
-D mapred.child.java.opts=whatever if you use bin/hadoop).
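For example, a submission along these lines would cap the child task heap
per job (the jar name, driver class, and paths are placeholders for your
own job):

```shell
# Per-job override of the child JVM options; everything except the
# property name below is a placeholder.
bin/hadoop jar my-job.jar com.example.MyDriver \
  -D mapred.child.java.opts=-Xmx512m \
  /input/path /output/path
```

Note that the generic -D options are only picked up if your driver goes
through ToolRunner/GenericOptionsParser.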


Re: Running Map and Reduce Sequentially

Posted by Amandeep Khurana <am...@gmail.com>.
Yes, the number of output files = the number of reducers. There is no
downside to having a 50GB file; that really isn't too much data. Of course,
multiple reducers would be much faster, but since you want a sequential run,
a single reducer is the only option I am aware of.

You could also consider lowering the memory allocated to the JVMs so that
all 4 tasks can run. I don't know if you want to do that or not.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
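The other direction, if the per-task heaps can't shrink, is to reduce how
many tasks each node runs at once. In Hadoop of this vintage those limits
are tasktracker settings in mapred-site.xml (cluster-wide, not per-job, and
a tasktracker restart is needed; the values here are just an illustration):

```xml
<!-- mapred-site.xml on each tasktracker: cap concurrent tasks per node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
```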



Re: Running Map and Reduce Sequentially

Posted by Kris Jirapinyo <kr...@biz360.com>.
Thanks for the recommendation; I haven't really looked into how the
combiner might be able to help.  Now, are there any downsides to having one
50GB file as output?  If I understand correctly, the number of reducers you
set for your job is the number of output files you will get.


Re: Running Map and Reduce Sequentially

Posted by Amandeep Khurana <am...@gmail.com>.
What you can probably do is have the combine function do some reducing
before the single reducer starts off. That might help.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
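In a real job the combiner is hooked up in the driver with
conf.setCombinerClass(...). To show what it buys you, here is a plain-Java
sketch (no Hadoop dependency; the class and names are illustrative) of the
local pre-aggregation a word-count-style combiner performs on map output
before it ever reaches the reducer:

```java
import java.util.HashMap;
import java.util.Map;

public class CombinerSketch {
    // Locally sum counts per key, as a word-count combiner would, so the
    // reducer receives one record per key instead of one per occurrence.
    public static Map<String, Integer> combine(String[] mapOutput) {
        Map<String, Integer> partial = new HashMap<>();
        for (String key : mapOutput) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String[] emitted = {"cat", "dog", "cat", "cat", "dog"};
        // 5 map records collapse into 2 combined records for the reducer
        System.out.println(combine(emitted));
    }
}
```

The fewer records the single reducer has to pull and merge, the less
pressure on its heap.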



Re: Running Map and Reduce Sequentially

Posted by Kris Jirapinyo <kr...@biz360.com>.
I can't afford to have only one reducer, as my dataset is huge; right now
it is 50GB, so the output.collect() in the reducer will surely run out of
Java heap space.


Re: Running Map and Reduce Sequentially

Posted by Amandeep Khurana <am...@gmail.com>.
Have only one instance of the reduce task; it will run once your map tasks
are completed. You can set this in your job conf using
conf.setNumReduceTasks(1)


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
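The reducer count can also be overridden per run from the command line
(the jar and driver class here are placeholders, and the -D option assumes
the driver uses ToolRunner):

```shell
# Force a single reduce task for this run only
bin/hadoop jar my-job.jar com.example.MyDriver \
  -D mapred.reduce.tasks=1 \
  /input/path /output/path
```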



Re: Running Map and Reduce Sequentially

Posted by Kris Jirapinyo <kr...@biz360.com>.
What do you mean by having only 1 reducer?


Re: Running Map and Reduce Sequentially

Posted by Rasit OZDAS <ra...@gmail.com>.
Kris,
This is what happens when you have only 1 reducer,
if that doesn't have any side effects for you.

Rasit




-- 
M. Raşit ÖZDAŞ