Posted to common-user@hadoop.apache.org by mete <ef...@gmail.com> on 2012/04/22 09:20:32 UTC

hadoop.tmp.dir with multiple disks

Hello folks,

I have a job that processes text files from HDFS on the local fs (a temp
directory) and then copies them back to HDFS.
I added another drive to each server to get better I/O performance, but as
far as I can tell hadoop.tmp.dir does not benefit from multiple disks, even
if I set up two folders on different disks (dfs.data.dir works fine). As a
result the disk holding the temp folder is highly utilized, while the other
one sits mostly idle.
Does anyone have an idea what to do? (I am using CDH3u3.)

Thanks in advance
Mete

Re: hadoop.tmp.dir with multiple disks

Posted by mete <ef...@gmail.com>.
Harsh, thanks for the heads up, that seemed to do the trick.

Jay, I am building local files from the input, compressing them on the
local drive, and then copying them back to HDFS.
So in my case it really is about I/O to the local fs.
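The benefit mete saw can also be mimicked at the application level: spread scratch files across the available local directories round-robin, the same way mapred.local.dir does for task-local data. A minimal Python sketch of the idea (the scratch/ paths and class name are placeholders, not any Hadoop API):

```python
import itertools
import os
import tempfile

class RoundRobinScratch:
    """Spread scratch files over several local directories, round-robin,
    mimicking how mapred.local.dir spreads task-local data over disks."""

    def __init__(self, dirs):
        for d in dirs:
            os.makedirs(d, exist_ok=True)
        self._cycle = itertools.cycle(dirs)

    def create_tmp(self, prefix="part-"):
        # Each call lands on the next directory in the cycle, so concurrent
        # writers split their I/O across spindles instead of hammering one.
        target = next(self._cycle)
        fd, path = tempfile.mkstemp(prefix=prefix, dir=target)
        os.close(fd)
        return path

# In production these would be mount points on separate physical disks.
alloc = RoundRobinScratch(["scratch/disk1", "scratch/disk2"])
```

Successive create_tmp() calls then alternate between the two directories, which is exactly the load-spreading behavior being asked about.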

On Sun, Apr 22, 2012 at 5:44 PM, Edward Capriolo <ed...@gmail.com>wrote:

> Since each Hadoop task is isolated from the others, having more tmp
> directories lets you isolate disk bandwidth as well. By listing the
> disks you give more firepower to the shuffle-sort and merge processes.
>
> Edward
>

Re: hadoop.tmp.dir with multiple disks

Posted by Edward Capriolo <ed...@gmail.com>.
Since each Hadoop task is isolated from the others, having more tmp
directories lets you isolate disk bandwidth as well. By listing the
disks you give more firepower to the shuffle-sort and merge processes.

Edward

On Sun, Apr 22, 2012 at 10:02 AM, Jay Vyas <ja...@gmail.com> wrote:
> I don't understand why multiple disks would be particularly beneficial for
> a Map/Reduce job... wouldn't a map/reduce job be I/O *as well as CPU*
> bound? I would think that simply reading and parsing large files would
> still demand dedicated CPU cycles.

Re: hadoop.tmp.dir with multiple disks

Posted by Jay Vyas <ja...@gmail.com>.
I don't understand why multiple disks would be particularly beneficial for
a Map/Reduce job... wouldn't a map/reduce job be I/O *as well as CPU*
bound? I would think that simply reading and parsing large files would
still demand dedicated CPU cycles.

On Sun, Apr 22, 2012 at 3:14 AM, Harsh J <ha...@cloudera.com> wrote:

> You can use mapred.local.dir for this purpose. It accepts a list of
> directories tasks may use, just like dfs.data.dir uses multiple disks
> for block writes/reads.



-- 
Jay Vyas
MMSB/UCHC

Re: hadoop.tmp.dir with multiple disks

Posted by Harsh J <ha...@cloudera.com>.
You can use mapred.local.dir for this purpose. It accepts a list of
directories tasks may use, just like dfs.data.dir uses multiple disks
for block writes/reads.
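For the record, on CDH3 that would look something like the snippet below in mapred-site.xml on each TaskTracker; the mount points are placeholders for the actual disks:

```xml
<!-- mapred-site.xml: spread task scratch space across both disks -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```

Tasks then pick their working directories from this comma-separated list, analogous to how dfs.data.dir lists one directory per disk for block storage.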




-- 
Harsh J