Posted to common-user@hadoop.apache.org by Meng Mao <me...@gmail.com> on 2011/10/05 07:32:17 UTC

ways to expand hadoop.tmp.dir capacity?

Currently, we've got defined:
  <property>
     <name>hadoop.tmp.dir</name>
     <value>/hadoop/hadoop-metadata/cache/</value>
  </property>

In our experiments with SOLR, the intermediate files are so large that they
tend to blow out disk space and fail (and annoyingly leave behind their huge
failed attempts). We've had issues with it in the past, but we're having
real problems with SOLR if we can't comfortably get more space out of
hadoop.tmp.dir somehow.

1) It seems we never set *mapred.system.dir* to anything special, so it's
defaulting to ${hadoop.tmp.dir}/mapred/system.
Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir
has ${user.name} in it, which ours doesn't.

1b) The doc says mapred.system.dir is "the in-HDFS path to shared MapReduce
system files." To me, that means there's must be 1 single path for
mapred.system.dir, which sort of forces hadoop.tmp.dir to be 1 path.
Otherwise, one might imagine that you could specify multiple paths to store
hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct
interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if
there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir?

2) IIRC, there's a -D switch for supplying config name/value pairs into
individual jobs. Does such a switch exist? Googling for single letters is
fruitless. If we had a path on our workers with more space (in our case,
another hard disk), could we simply pass that path in as hadoop.tmp.dir for
our SOLR jobs? Without incurring any consistency issues on future jobs that
might use the SOLR output on HDFS?
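
For reference, the generic -D switch does exist for drivers that go through
ToolRunner/GenericOptionsParser. A minimal sketch, with a made-up jar, driver
class, and paths; note that it only sets job-level properties, not server-side
settings like hadoop.tmp.dir:

  # Sketch only; jar name, driver class, and paths are placeholders.
  # Generic options must come before the program's own arguments.
  hadoop jar solr-indexer.jar com.example.SolrIndexDriver \
      -D mapred.reduce.tasks=24 \
      /input/docs /output/index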

Re: ways to expand hadoop.tmp.dir capacity?

Posted by Meng Mao <me...@gmail.com>.
OK. One more related question --
is there any mechanism in place to remove failed task attempt directories
from the TaskTracker's jobcache?

It seems like for us, the only way to get rid of them is manually.
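
A minimal manual-cleanup sketch, assuming the pre-security 0.20-style layout
where attempt data lands under ${mapred.local.dir}/taskTracker/jobcache; the
exact path and the two-day retention window below are assumptions, so check
them against an actual TaskTracker before running anything:

  # Run on each TaskTracker node; the local dir is assumed to be
  # ${hadoop.tmp.dir}/mapred/local, per the config quoted earlier.
  find /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache \
      -mindepth 1 -maxdepth 1 -type d -name 'job_*' -mtime +2 \
      -exec rm -rf {} +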

On Wed, Oct 26, 2011 at 3:07 AM, Harsh J <ha...@cloudera.com> wrote:

> Meng,
>
> You should have no issue at all, with respect to the late addition.

Re: ways to expand hadoop.tmp.dir capacity?

Posted by Harsh J <ha...@cloudera.com>.
Meng,

You should have no issue at all, with respect to the late addition.

On Wednesday, October 26, 2011, Meng Mao <me...@gmail.com> wrote:
> If we do that rolling restart scenario, will we have a completely quiet
> migration? That is, if no jobs are running during the rolling restart of
> TaskTrackers, then we will end up with expanded capacity with no risk of
> data inconsistency in the cache paths?
-- 
Harsh J

Re: ways to expand hadoop.tmp.dir capacity?

Posted by Meng Mao <me...@gmail.com>.
If we do that rolling restart scenario, will we have a completely quiet
migration? That is, if no jobs are running during the rolling restart of
TaskTrackers, then we will end up with expanded capacity with no risk of
data inconsistency in the cache paths?

Our data nodes already use multiple disks for storage. It was an early lack
of foresight that brought us to the present day, where mapred.local.dir isn't
"distributed."

That said, one of our problems is that the SOLR index files we're building
are just plain huge. Even with expanded disk capacity, I think we'd still run
into disk space issues. Is this something that's been generally reported for
SOLR hadoop jobs?

On Mon, Oct 10, 2011 at 10:08 PM, Harsh J <ha...@cloudera.com> wrote:

> Meng,
>
> Yes, configure the mapred-site.xml (mapred.local.dir) to add the
> property and roll-restart your TaskTrackers. If you'd like to expand
> your DataNode to multiple disks as well (helps HDFS I/O greatly), do
> the same with hdfs-site.xml (dfs.data.dir) and perform the same
> rolling restart of DataNodes.
>
> Ensure that for each service, the directories you create are owned by
> the same user as the one running the process. This will help avoid
> permission nightmares.

Re: ways to expand hadoop.tmp.dir capacity?

Posted by Harsh J <ha...@cloudera.com>.
Meng,

Yes, configure the mapred-site.xml (mapred.local.dir) to add the
property and roll-restart your TaskTrackers. If you'd like to expand
your DataNode to multiple disks as well (helps HDFS I/O greatly), do
the same with hdfs-site.xml (dfs.data.dir) and perform the same
rolling restart of DataNodes.

Ensure that for each service, the directories you create are owned by
the same user as the one running the process. This will help avoid
permission nightmares.
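
A rough per-node sketch of the above (the mount point, ownership, and install
path are assumptions; the daemon script names are the 0.20/1.x ones):

  <!-- mapred-site.xml: add the new disk alongside the existing local dir -->
  <property>
    <name>mapred.local.dir</name>
    <value>/hadoop/hadoop-metadata/cache/mapred/local,/disk2/mapred/local</value>
  </property>

and then, one worker at a time:

  mkdir -p /disk2/mapred/local
  chown -R hadoop:hadoop /disk2/mapred/local   # same user that runs the TaskTracker
  $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
  $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

The analogous dfs.data.dir change plus "hadoop-daemon.sh stop/start datanode"
covers the DataNode side in the same way.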

On Tue, Oct 11, 2011 at 3:58 AM, Meng Mao <me...@gmail.com> wrote:
> So the only way we can expand to multiple mapred.local.dir paths is to
> config our site.xml and to restart the DataNode?

-- 
Harsh J

Re: ways to expand hadoop.tmp.dir capacity?

Posted by Meng Mao <me...@gmail.com>.
So the only way we can expand to multiple mapred.local.dir paths is to
config our site.xml and to restart the DataNode?


Re: ways to expand hadoop.tmp.dir capacity?

Posted by Marcos Luis Ortiz Valmaseda <ma...@googlemail.com>.
Here is an excellent explanation of how to install Apache Hadoop manually, and
Lars explains it very well.

http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/

Regards

-- 
Marcos Luis Ortíz Valmaseda
 Linux Infrastructure Engineer
 Linux User # 418229
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186

Re: ways to expand hadoop.tmp.dir capacity?

Posted by Harsh J <ha...@cloudera.com>.
Hello Meng,

On Wed, Oct 5, 2011 at 11:02 AM, Meng Mao <me...@gmail.com> wrote:
> Currently, we've got defined:
>  <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/hadoop/hadoop-metadata/cache/</value>
>  </property>
>
> In our experiments with SOLR, the intermediate files are so large that they
> tend to blow out disk space and fail (and annoyingly leave behind their huge
> failed attempts). We've had issues with it in the past, but we're having
> real problems with SOLR if we can't comfortably get more space out of
> hadoop.tmp.dir somehow.
>
> 1) It seems we never set *mapred.system.dir* to anything special, so it's
> defaulting to ${hadoop.tmp.dir}/mapred/system.
> Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir
> has ${user.name} in it, which ours doesn't.

The {mapred.system.dir} is an HDFS location, and you shouldn't really
need to worry about it much here.

> 1b) The doc says mapred.system.dir is "the in-HDFS path to shared MapReduce
> system files." To me, that means there must be a single path for
> mapred.system.dir, which sort of forces hadoop.tmp.dir to be 1 path.
> Otherwise, one might imagine that you could specify multiple paths to store
> hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct
> interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if
> there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir?

{hadoop.tmp.dir} is indeed reused for {mapred.system.dir} (which lives
on HDFS, hence the confusion), but yes, there should be just one
mapred.system.dir.

Also, the config {hadoop.tmp.dir} doesn't support > 1 path. What you
need here is a proper {mapred.local.dir} configuration.

> 2) IIRC, there's a -D switch for supplying config name/value pairs into
> individual jobs. Does such a switch exist? Googling for single letters is
> fruitless. If we had a path on our workers with more space (in our case,
> another hard disk), could we simply pass that path in as hadoop.tmp.dir for
> our SOLR jobs? Without incurring any consistency issues on future jobs that
> might use the SOLR output on HDFS?

Only a few parameters of a job are user-configurable. Stuff like
hadoop.tmp.dir and mapred.local.dir are not overridable by user-set
parameters, as they are server-side (static) configurations.

> Given that the default value is ${hadoop.tmp.dir}/mapred/local, would the
> expanded capacity we're looking for be as easily accomplished as by defining
> mapred.local.dir to span multiple disks? Setting aside the issue of temp
> files so big that they could still fill a whole disk.

1. You can set mapred.local.dir independent of hadoop.tmp.dir
2. mapred.local.dir can have comma separated values in it, spanning
multiple disks
3. Intermediate outputs may spread across these disks but shall not
consume > 1 disk at a time. So if your largest configured disk is 500
GB while the total set of them may be 2 TB, then your intermediate
output size can't really exceed 500 GB, because only one disk is
consumed by one task -- the multiple disks are for better I/O
parallelism between tasks.

Know that hadoop.tmp.dir is a convenience property, for quickly
starting up dev clusters and such. For a proper configuration, you
need to remove dependency on it (almost nothing uses hadoop.tmp.dir on
the server side, once the right properties are configured - ex:
dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc.)
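
As a rough illustration of that last point (the mount points below are
assumptions), once properties like these are set explicitly, the server
daemons no longer fall back to hadoop.tmp.dir:

  <!-- hdfs-site.xml -->
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/dfs/data,/disk2/dfs/data</value>
  </property>

  <!-- core-site.xml -->
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/disk1/dfs/namesecondary</value>
  </property>

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>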

-- 
Harsh J

Re: ways to expand hadoop.tmp.dir capacity?

Posted by Meng Mao <me...@gmail.com>.
I just read this:

MapReduce performance can also be improved by distributing the temporary
data generated by MapReduce tasks across multiple disks on each machine:

  <property>
    <name>mapred.local.dir</name>
    <value>/d1/mapred/local,/d2/mapred/local,/d3/mapred/local,/d4/mapred/local</value>
    <final>true</final>
  </property>

Given that the default value is ${hadoop.tmp.dir}/mapred/local, would the
expanded capacity we're looking for be as easily accomplished as by defining
mapred.local.dir to span multiple disks? Setting aside the issue of temp
files so big that they could still fill a whole disk.
