Posted to common-user@hadoop.apache.org by praveenesh kumar <pr...@gmail.com> on 2011/09/21 15:02:09 UTC

Can we run job on some datanodes ?

Is there any way to run a particular job in Hadoop on a subset of the
datanodes?

My problem is that I don't want to use all the nodes to run some jobs.
I am trying to plot job completion time vs. number of nodes for a
particular job. One way to do this is to remove datanodes and then see
how long the job takes.

Just out of curiosity, I want to know whether there is any other way to
do this without removing datanodes. I am afraid that if I remove
datanodes, I could lose some data blocks that reside on those machines,
since I have some files with replication = 1.

Thanks,
Praveenesh

Re: Can we run job on some datanodes ?

Posted by Robert Evans <ev...@yahoo-inc.com>.
Praveen,

If you are doing performance measurements, be aware that having more
datanodes than tasktrackers will impact performance as well (I don't
know for sure how); it will not be the same as running on a cluster
with fewer nodes overall. Also, if you do shut off datanodes as well as
tasktrackers, you will need to give the cluster a while for
re-replication to finish before you collect your performance numbers.
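
If the replication = 1 files are the worry, you could also raise their
replication before taking nodes away and then watch the re-replication
drain. A rough sketch (the path and target factor are placeholders):

  hadoop fs -setrep -w 3 /user/praveenesh/data   # -w blocks until the new factor is met
  hadoop fsck / | grep -i 'under-replicated'     # or poll fsck until this count reaches 0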

--Bobby Evans


On 9/21/11 8:27 AM, "Harsh J" <ha...@cloudera.com> wrote:

Praveenesh,

Absolutely right. Just stop them individually :)

On Wed, Sep 21, 2011 at 6:53 PM, praveenesh kumar <pr...@gmail.com> wrote:
> Oh wow, I didn't know that.
> Actually, in my setup the datanodes and tasktrackers run on the same
> machines. I mentioned datanodes because if I delete those machines
> from the slaves list, chances are the data will also be lost, so I
> don't want to do that.
> But now I guess that by stopping tasktrackers individually, I can
> decrease the strength of my cluster by reducing the number of nodes
> that run a tasktracker, right? That way I won't lose my data either,
> right?
>
>
>
> On Wed, Sep 21, 2011 at 6:39 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Praveenesh,
>>
>> TaskTrackers run your jobs' tasks for you, not DataNodes directly. So
>> you can statically control load on nodes by removing TaskTrackers
>> from your cluster.
>>
>> I.e., if you run "service hadoop-0.20-tasktracker stop" or
>> "hadoop-daemon.sh stop tasktracker" on the specific nodes, jobs won't
>> run there anymore.
>>
>> Is this what you're looking for?
>>
>> (There are also ways to achieve the exclusion dynamically, e.g. by
>> writing a scheduler, but it is hard to say more without knowing what
>> you need specifically, and why you require it.)
>>
>> On Wed, Sep 21, 2011 at 6:32 PM, praveenesh kumar <pr...@gmail.com>
>> wrote:
>> > Is there any way to run a particular job in Hadoop on a subset of
>> > the datanodes?
>> >
>> > My problem is that I don't want to use all the nodes to run some
>> > jobs. I am trying to plot job completion time vs. number of nodes
>> > for a particular job. One way to do this is to remove datanodes and
>> > then see how long the job takes.
>> >
>> > Just out of curiosity, I want to know whether there is any other
>> > way to do this without removing datanodes. I am afraid that if I
>> > remove datanodes, I could lose some data blocks that reside on
>> > those machines, since I have some files with replication = 1.
>> >
>> > Thanks,
>> > Praveenesh
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>



--
Harsh J


Re: Can we run job on some datanodes ?

Posted by Harsh J <ha...@cloudera.com>.
Praveenesh,

Absolutely right. Just stop them individually :)

On Wed, Sep 21, 2011 at 6:53 PM, praveenesh kumar <pr...@gmail.com> wrote:
> Oh wow, I didn't know that.
> Actually, in my setup the datanodes and tasktrackers run on the same
> machines. I mentioned datanodes because if I delete those machines
> from the slaves list, chances are the data will also be lost, so I
> don't want to do that.
> But now I guess that by stopping tasktrackers individually, I can
> decrease the strength of my cluster by reducing the number of nodes
> that run a tasktracker, right? That way I won't lose my data either,
> right?
>
>
>
> On Wed, Sep 21, 2011 at 6:39 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Praveenesh,
>>
>> TaskTrackers run your jobs' tasks for you, not DataNodes directly. So
>> you can statically control load on nodes by removing TaskTrackers
>> from your cluster.
>>
>> I.e., if you run "service hadoop-0.20-tasktracker stop" or
>> "hadoop-daemon.sh stop tasktracker" on the specific nodes, jobs won't
>> run there anymore.
>>
>> Is this what you're looking for?
>>
>> (There are also ways to achieve the exclusion dynamically, e.g. by
>> writing a scheduler, but it is hard to say more without knowing what
>> you need specifically, and why you require it.)
>>
>> On Wed, Sep 21, 2011 at 6:32 PM, praveenesh kumar <pr...@gmail.com>
>> wrote:
>> > Is there any way to run a particular job in Hadoop on a subset of
>> > the datanodes?
>> >
>> > My problem is that I don't want to use all the nodes to run some
>> > jobs. I am trying to plot job completion time vs. number of nodes
>> > for a particular job. One way to do this is to remove datanodes and
>> > then see how long the job takes.
>> >
>> > Just out of curiosity, I want to know whether there is any other
>> > way to do this without removing datanodes. I am afraid that if I
>> > remove datanodes, I could lose some data blocks that reside on
>> > those machines, since I have some files with replication = 1.
>> >
>> > Thanks,
>> > Praveenesh
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>



-- 
Harsh J

Re: Can we run job on some datanodes ?

Posted by praveenesh kumar <pr...@gmail.com>.
Oh wow, I didn't know that.
Actually, in my setup the datanodes and tasktrackers run on the same
machines. I mentioned datanodes because if I delete those machines from
the slaves list, chances are the data will also be lost, so I don't
want to do that.
But now I guess that by stopping tasktrackers individually, I can
decrease the strength of my cluster by reducing the number of nodes
that run a tasktracker, right? That way I won't lose my data either,
right?
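
For the graph, I am planning to just time the same job after each
change, something like this (the examples jar name is from my 0.20
install and may differ, and the paths are placeholders):

  # after stopping tasktrackers on some of the nodes:
  time hadoop jar hadoop-0.20.2-examples.jar wordcount /data/in /data/out-8nodes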



On Wed, Sep 21, 2011 at 6:39 PM, Harsh J <ha...@cloudera.com> wrote:

> Praveenesh,
>
> TaskTrackers run your jobs' tasks for you, not DataNodes directly. So
> you can statically control load on nodes by removing TaskTrackers
> from your cluster.
>
> I.e., if you run "service hadoop-0.20-tasktracker stop" or
> "hadoop-daemon.sh stop tasktracker" on the specific nodes, jobs won't
> run there anymore.
>
> Is this what you're looking for?
>
> (There are also ways to achieve the exclusion dynamically, e.g. by
> writing a scheduler, but it is hard to say more without knowing what
> you need specifically, and why you require it.)
>
> On Wed, Sep 21, 2011 at 6:32 PM, praveenesh kumar <pr...@gmail.com>
> wrote:
> > Is there any way to run a particular job in Hadoop on a subset of
> > the datanodes?
> >
> > My problem is that I don't want to use all the nodes to run some
> > jobs. I am trying to plot job completion time vs. number of nodes
> > for a particular job. One way to do this is to remove datanodes and
> > then see how long the job takes.
> >
> > Just out of curiosity, I want to know whether there is any other way
> > to do this without removing datanodes. I am afraid that if I remove
> > datanodes, I could lose some data blocks that reside on those
> > machines, since I have some files with replication = 1.
> >
> > Thanks,
> > Praveenesh
> >
>
>
>
> --
> Harsh J
>

Re: Can we run job on some datanodes ?

Posted by Harsh J <ha...@cloudera.com>.
Praveenesh,

TaskTrackers run your jobs' tasks for you, not DataNodes directly. So
you can statically control load on nodes by removing TaskTrackers from
your cluster.

I.e., if you run "service hadoop-0.20-tasktracker stop" or
"hadoop-daemon.sh stop tasktracker" on the specific nodes, jobs won't
run there anymore.
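
For example, to take several nodes out in one go (the hostnames are
placeholders; this assumes passwordless ssh and that hadoop-daemon.sh
is on each node's PATH):

  for h in slave05 slave06 slave07; do
    ssh "$h" "hadoop-daemon.sh stop tasktracker"
  done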

Is this what you're looking for?

(There are also ways to achieve the exclusion dynamically, e.g. by
writing a scheduler, but it is hard to say more without knowing what
you need specifically, and why you require it.)
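
A static middle ground, if your version supports it, is the
JobTracker's exclude file. A sketch, with an example path:

  <!-- in mapred-site.xml -->
  <property>
    <name>mapred.hosts.exclude</name>
    <value>/home/hadoop/conf/mapred.exclude</value>
  </property>

List the hostnames to exclude, one per line, in that file; the
JobTracker reads it at startup, and newer versions can reload it with
"hadoop mradmin -refreshNodes".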

On Wed, Sep 21, 2011 at 6:32 PM, praveenesh kumar <pr...@gmail.com> wrote:
> Is there any way to run a particular job in Hadoop on a subset of the
> datanodes?
>
> My problem is that I don't want to use all the nodes to run some jobs.
> I am trying to plot job completion time vs. number of nodes for a
> particular job. One way to do this is to remove datanodes and then see
> how long the job takes.
>
> Just out of curiosity, I want to know whether there is any other way
> to do this without removing datanodes. I am afraid that if I remove
> datanodes, I could lose some data blocks that reside on those
> machines, since I have some files with replication = 1.
>
> Thanks,
> Praveenesh
>



-- 
Harsh J