Posted to common-user@hadoop.apache.org by S B <so...@yahoo.com> on 2009/09/23 18:32:15 UTC

Hadoop On Cluster

Hello Everyone!
 
Maybe a stupid question.
 
I want to install and configure Hadoop on a cluster for our users with SGE support. 
 
We have a scratch space that is NFS-mounted across each of the compute nodes for every user on our system. Ideally, I would like to set this up so that every user runs their Hadoop jobs within their own scratch space. Is this possible? If so, can you point me to any relevant docs?
 
Does this defeat Hadoop's purpose/design of distributed resources and computing?
 
Thanks!
-SB

Re: Hadoop On Cluster

Posted by Ted Dunning <te...@gmail.com>.
To amplify this point, don't try reading from MySQL with a whole bunch of
map tasks either.

It is very impressive how quickly Hadoop can take down a database.
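Concretely, the knob that matters is the number of map tasks, because each
map task opens its own connection to the database. Below is a minimal
sketch against the 0.20 mapred API; the MySQL driver, connection URL,
credentials, and "users" table are all invented for illustration. The
important line is setNumMapTasks(), which keeps the job from opening one
JDBC connection per split across the whole cluster at once:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class GentleDbImport {

      // One row of the hypothetical "users" table.
      public static class UserRow implements Writable, DBWritable {
        String name;
        public void readFields(ResultSet rs) throws SQLException { name = rs.getString("name"); }
        public void write(PreparedStatement ps) throws SQLException { ps.setString(1, name); }
        public void readFields(DataInput in) throws IOException { name = Text.readString(in); }
        public void write(DataOutput out) throws IOException { Text.writeString(out, name); }
        public String toString() { return name; }
      }

      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(GentleDbImport.class);

        // Made-up driver, URL, and credentials; substitute your own.
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/mydb", "user", "secret");
        DBInputFormat.setInput(job, UserRow.class, "users", null, "id", "name");

        // A handful of map tasks, not hundreds: DBInputFormat splits the
        // table by the configured map count, so this is what keeps a
        // MapReduce job from flattening the database.
        job.setNumMapTasks(4);

        job.setMapperClass(IdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(UserRow.class);
        JobClient.runJob(job);
      }
    }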

On Wed, Sep 23, 2009 at 12:51 PM, Jeff Hammerbacher <ha...@cloudera.com> wrote:

> NFS mounts can be quite flaky at scale




-- 
Ted Dunning, CTO
DeepDyve

Re: Hadoop On Cluster

Posted by Brian Vargas <br...@ardvaark.net>.
Jeff,

Thanks for the advice.  I'll keep that in mind as we grow.  At present,
the cluster is only six machines, and for these small (2K-ish) scripts,
it has worked flawlessly, with the caveats mentioned.  When it starts
failing, or if I need to move beyond these very-small-sizes, I'll
certainly use the Distributed Cache.

Brian

Jeff Hammerbacher wrote:
> Hey Brian,
> 
> Having tried and failed to use NFS to store shared resources for a large
> Hadoop cluster, I feel the need to say: you may want to reconsider that
> strategy as your cluster grows. NFS mounts can be quite flaky at scale, as
> Ted mentions. As Allen mentions, the Distributed Cache is intended to allow
> access to shared resources on the cluster; see
> http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html#DistributedCache
> for more information.
> 
> Later,
> Jeff
> 
> On Wed, Sep 23, 2009 at 10:19 AM, Allen Wittenauer <awittenauer@linkedin.com> wrote:
> 
>>
>>
>> On 9/23/09 10:09 AM, "Brian Vargas" <br...@ardvaark.net> wrote:
>>
>>> Although it can be quite useful to store small shared resources on an
>>> NFS mount.  For example, I find it easier to store various scripts
>>> called by a streaming job on NFS rather than distributing them from the
>>> command-line.
>>>
>>> Of course, then you have to be sure they don't change out from under the
>>> running jobs.  Tradeoffs.  :-)
>> You should probably look into distributed cache archives.  This eliminates
>> the NFS bottleneck, avoids the 'magically changing file' problem, and allows
>> you to use different versions with different job submissions such that you
>> can test changes on the fly without having to redeploy.
>>
>>
> 


Re: Hadoop On Cluster

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey Brian,

Having tried and failed to use NFS to store shared resources for a large
Hadoop cluster, I feel the need to say: you may want to reconsider that
strategy as your cluster grows. NFS mounts can be quite flaky at scale, as
Ted mentions. As Allen mentions, the Distributed Cache is intended to allow
access to shared resources on the cluster; see
http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html#DistributedCache
for more information.
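
As a rough sketch of what that looks like with the 0.20 API (the file name
and its contents here are invented): the job submitter registers an HDFS
file with DistributedCache.addCacheFile(new URI("/shared/lookup.dat"), job),
and each task then reads the node-local copy its TaskTracker already
fetched, instead of hitting a shared mount:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private String lookup;

      // Called once per task attempt: load the shared resource from the
      // local-disk copy the TaskTracker fetched from HDFS.
      public void configure(JobConf job) {
        try {
          Path[] cached = DistributedCache.getLocalCacheFiles(job);
          BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
          lookup = in.readLine();
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("failed to read cached file", e);
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Toy use of the cached data: tag every record with it.
        output.collect(new Text(lookup), value);
      }
    }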

Later,
Jeff

On Wed, Sep 23, 2009 at 10:19 AM, Allen Wittenauer <awittenauer@linkedin.com> wrote:

>
>
>
> On 9/23/09 10:09 AM, "Brian Vargas" <br...@ardvaark.net> wrote:
>
> > Although it can be quite useful to store small shared resources on an
> > NFS mount.  For example, I find it easier to store various scripts
> > called by a streaming job on NFS rather than distributing them from the
> > command-line.
> >
> > Of course, then you have to be sure they don't change out from under the
> > running jobs.  Tradeoffs.  :-)
>
> You should probably look into distributed cache archives.  This eliminates
> the NFS bottleneck, avoids the 'magically changing file' problem, and allows
> you to use different versions with different job submissions such that you
> can test changes on the fly without having to redeploy.
>
>

Re: Hadoop On Cluster

Posted by Allen Wittenauer <aw...@linkedin.com>.


On 9/23/09 10:09 AM, "Brian Vargas" <br...@ardvaark.net> wrote:

> Although it can be quite useful to store small shared resources on an
> NFS mount.  For example, I find it easier to store various scripts
> called by a streaming job on NFS rather than distributing them from the
> command-line.
>
> Of course, then you have to be sure they don't change out from under the
> running jobs.  Tradeoffs.  :-)

You should probably look into distributed cache archives.  This eliminates
the NFS bottleneck, avoids the 'magically changing file' problem, and allows
you to use different versions with different job submissions such that you
can test changes on the fly without having to redeploy.
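
From the Java side, the archive variant is roughly this (a sketch against
the 0.20 API; the archive path, version numbers, and script names are all
invented). Streaming jobs get the same mechanism through the -file,
-cacheFile, and -cacheArchive command-line options:

    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class VersionedScripts {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(VersionedScripts.class);

        // An archive of the job's scripts, previously uploaded to HDFS.
        // Each TaskTracker unpacks it on local disk; the "#scripts"
        // fragment names the symlink that tasks see in their working
        // directory, so a script can be invoked as ./scripts/mapper.sh.
        DistributedCache.addCacheArchive(
            new URI("/apps/streaming/scripts-v7.tgz#scripts"), job);
        DistributedCache.createSymlink(job);

        // Submitting another job with ".../scripts-v8.tgz#scripts" tests
        // a new version on the fly: no redeploy, no NFS, and no file
        // changing out from under a running job.
      }
    }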


Re: Hadoop On Cluster

Posted by Brian Vargas <br...@ardvaark.net>.
Although it can be quite useful to store small shared resources on an
NFS mount.  For example, I find it easier to store various scripts
called by a streaming job on NFS rather than distributing them from the
command-line.

Of course, then you have to be sure they don't change out from under the
running jobs.  Tradeoffs.  :-)

Brian

Ted Dunning wrote:
> Yes.  This defeats the point of Hadoop.
> 
> Hadoop has its own scratch space on each machine.  In order to get parallel
> speedup, it is important to avoid shared resources like NFS.
> 
> On Wed, Sep 23, 2009 at 9:32 AM, S B <so...@yahoo.com> wrote:
> 
>> We have a scratch space that is NFS-mounted across each of the compute
>> nodes for every user on our system. Ideally, I would like to set this up so
>> that every user runs their Hadoop jobs within their own scratch space. Is
>> this possible? If so, can you point me to any relevant docs?
>>
>> Does this defeat Hadoop's purpose/design of distributed resources and
>> computing?
>>
> 
> 
> 


Re: Hadoop On Cluster

Posted by Ted Dunning <te...@gmail.com>.
Yes.  This defeats the point of Hadoop.

Hadoop has its own scratch space on each machine.  In order to get parallel
speedup, it is important to avoid shared resources like NFS.
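
That per-node scratch space is controlled by the mapred.local.dir property
(and dfs.data.dir for HDFS block storage), which should list local disks,
comma-separated, never an NFS path. A trivial way to check what a cluster
is configured with, as a sketch:

    import org.apache.hadoop.mapred.JobConf;

    public class ShowScratchDirs {
      public static void main(String[] args) {
        // mapred.local.dir holds map output spills, distributed-cache
        // copies, and other task-local data; if it points at NFS, every
        // "local" write crosses the network.
        JobConf conf = new JobConf();
        System.out.println(conf.get("mapred.local.dir"));
      }
    }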

On Wed, Sep 23, 2009 at 9:32 AM, S B <so...@yahoo.com> wrote:

>
> We have a scratch space that is NFS-mounted across each of the compute
> nodes for every user on our system. Ideally, I would like to set this up so
> that every user runs their Hadoop jobs within their own scratch space. Is
> this possible? If so, can you point me to any relevant docs?
>
> Does this defeat Hadoop's purpose/design of distributed resources and
> computing?
>



-- 
Ted Dunning, CTO
DeepDyve