Posted to common-user@hadoop.apache.org by Andrey Pankov <ap...@iponweb.net> on 2008/03/13 09:42:23 UTC

Separate data-nodes from worker-nodes

Hi,

Is it possible to configure a Hadoop cluster so that the data nodes and 
the worker nodes are separate sets of machines? For example, nodes 1, 2 
and 3 would store data in HDFS, while nodes 3, 4 and 5 run the map-reduce 
jobs and read their input from HDFS.

If it's possible, what would the impact on performance be? Any suggestions?

Thanks in advance,

--- Andrey Pankov

Re: Separate data-nodes from worker-nodes

Posted by Doug Cutting <cu...@apache.org>.

Andrey Pankov wrote:
> It's a little bit expensive to have big cluster running for a long 
> period, especially if you use EC2. So, as possible solution, we can 
> start additional nodes and include them into cluster before running job, 
> and then, after finishing, kill unused nodes.

As Ted has indicated, that should work.  It won't be as fast as if you 
keep the entire cluster running the whole time, but it will be much cheaper.

An alternative is to store your persistent data in S3.  Then you can 
shut down your cluster altogether when you're not computing.  Your 
startup time each day will be slower, since reading from S3 is slower 
than reading from HDFS, so this may or may not be practical for you.

Doug

Re: Separate data-nodes from worker-nodes

Posted by Andrey Pankov <ap...@iponweb.net>.

Ted,

Actually, the main idea was to keep a cluster running persistently. It 
should store some data in HDFS and from time to time (say, daily) run some 
jobs on the stored data plus newly injected data. The results should be 
stored in HDFS, so that the next day we run the same jobs on the existing 
and fresh data. It's a little bit expensive to keep a big cluster running 
for a long period, especially if you use EC2. So, as a possible solution, 
we can start additional nodes and add them to the cluster before running a 
job, and then, after it finishes, kill the unused nodes.


---
Andrey Pankov

Re: Separate data-nodes from worker-nodes

Posted by Ted Dunning <td...@veoh.com>.

It is very possible (even easy).

The data nodes run the datanode process.  The task nodes run the task
tracker.  If the data nodes don't have a task tracker running, then they
won't do any computation.
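
As a rough illustration of the point above, the following Python sketch maps each host to the daemons it would run, given a list of data nodes and a list of task nodes. The host names, the `daemons_for` and `start_commands` helpers, and the `bin/hadoop-daemon.sh start <daemon>` command strings are assumptions for illustration (the message itself names only the datanode process and the task tracker), so check them against your Hadoop version. The script only prints commands; it does not run anything.

```python
def daemons_for(data_nodes, task_nodes):
    """Return {host: [daemon, ...]} for the given role assignment."""
    plan = {}
    for host in data_nodes:
        plan.setdefault(host, []).append("datanode")
    for host in task_nodes:
        plan.setdefault(host, []).append("tasktracker")
    return plan


def start_commands(plan):
    """Build one start command per (host, daemon) pair; nothing is executed."""
    return [
        f"{host}: bin/hadoop-daemon.sh start {daemon}"
        for host in sorted(plan)
        for daemon in plan[host]
    ]


if __name__ == "__main__":
    # Role split from the original question; node3 appears in both lists,
    # so it ends up running both daemons.
    plan = daemons_for(["node1", "node2", "node3"], ["node3", "node4", "node5"])
    for line in start_commands(plan):
        print(line)
```

A host listed only as a data node gets no task tracker, so under this plan it stores blocks but does no computation.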


Re: Separate data-nodes from worker-nodes

Posted by Andrey Pankov <ap...@iponweb.net>.

Thanks, Ted!

I also thought it was not a good idea to separate them. I was just 
wondering whether it is possible at all. Thanks!


---
Andrey Pankov

Re: Separate data-nodes from worker-nodes

Posted by Ted Dunning <td...@veoh.com>.

It is quite possible to do this.

It is also a bad idea.

One of the great things about map-reduce architectures is that data is near
the computation so that you don't have to wait for the network.  If you
separate data and computation, you impose additional load on the cluster.

What this will do to your throughput is an open question and it depends a
lot on your programs.
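
To make the locality argument concrete, here is a minimal Python sketch that counts how many blocks have a replica on a host that also runs a task tracker, so that a map task could read them without crossing the network. The block names, the replica placement and the `local_fraction` helper are hypothetical; real scheduling also depends on free task slots and rack layout, which this ignores.

```python
def local_fraction(block_replicas, task_hosts):
    """Fraction of blocks with at least one replica on a task-tracker host."""
    if not block_replicas:
        return 0.0
    task_hosts = set(task_hosts)
    local = sum(1 for hosts in block_replicas.values() if task_hosts & set(hosts))
    return local / len(block_replicas)


# Hypothetical placement: three blocks stored on data nodes 1-3.
blocks = {
    "blk_a": ["node1", "node2"],
    "blk_b": ["node2", "node3"],
    "blk_c": ["node1", "node2"],
}

# Every data node also runs a task tracker: all blocks can be read locally.
print(local_fraction(blocks, ["node1", "node2", "node3"]))  # 1.0
# Split from the question: only node3 overlaps, so one block in three is local.
print(local_fraction(blocks, ["node3", "node4", "node5"]))
# Fully separate: every block must be fetched over the network.
print(local_fraction(blocks, ["node4", "node5"]))  # 0.0
```

The lower the fraction, the more input has to travel over the network before a map task can start working on it.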
