Posted to hdfs-user@hadoop.apache.org by "Indranil Majumder (imajumde)" <im...@cisco.com> on 2013/12/14 11:33:34 UTC

Hadoop setup doubts

I started with Hadoop a few days ago, and I have a few doubts about the setup:


1.       For the NameNode I format the name directory; is it recommended to do the same for the DataNode directories too?

2.       How does log aggregation work?

3.       Does the resource manager run on every node (both name and data) or can it run on a separate node?

4.       What is the purpose of the webproxy? Is it really required?

5.       Is there any documentation on how to decide which scheduler type to use, based on certain parameters?

6.       What is the recommended way of pushing data into a Hadoop cluster and submitting MapReduce jobs? I.e., should we use another client node, and if so, is there any client daemon to run on it?

7.       For the following nodes in clustered mode

A.      NameNode

B.      Secondary NameNode

C.      DataNode (2)

D.      Resource Manager

E.       WebProxy

F.       History Server (MapReduce)
I want to write a PID monitor. Does anybody have the list of processes that would run on this cluster when fully operational? [Maybe the output of ps -ef | grep "somekeyword" will do.]

Thanks & Regards,
Indranil

Re: Hadoop setup doubts

Posted by Adam Kawa <ka...@gmail.com>.
Hi,

> 2.       How does log aggregation work?
>
http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/
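In short, aggregation is switched on with a yarn-site.xml property; the HDFS path below is just the common default, not something specific to your cluster:

```xml
<!-- yarn-site.xml -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <!-- HDFS directory where NodeManagers upload logs of finished applications -->
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/tmp/logs</value>
</property>
```

With this on, container logs move from local disks into HDFS when an application finishes, and you can fetch them with `yarn logs -applicationId <appId>`.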

> 4.       What is the purpose of the webproxy? Is it really required?
>
http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html


> 5.       Is there any documentation on how to decide which scheduler type
> based on certain parameters?
>
I am not sure if I fully understand the question.
You can use only one scheduler at a time. At run time, you can
decide which pool or queue your job should be submitted to, if you use
the Fair or Capacity Scheduler.
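Concretely, the choice is a single yarn-site.xml property. Shown here with the Fair Scheduler; for the Capacity Scheduler, substitute org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:

```xml
<!-- yarn-site.xml: only one scheduler implementation can be active at a time -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```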

> 6.       What is the recommended way of pushing  data into Hadoop cluster
> & submitting  mapred jobs, i.e should we use another client  node, if so is
> there any client daemon to run on it ?
>
> ---- Do you have experiance with UNIX, if so hadoop commands are similer
> to UNIX commands. Ex. below command works fine for me.
>
> hdfs dfs -copyFromLocal <localfiledir> <hdfs file directory>
>
Usually, we push data to the cluster and submit MapReduce jobs from machines
called "edge nodes". In Hadoop, an edge node is a machine where the Hadoop
client libraries are installed (plus Pig, Hive, Sqoop, etc., if you want to use
them), but no Hadoop daemon is running.
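As a rough sketch of that workflow from an edge node (the paths, the helper name, and the DRY_RUN guard are my own illustrations, not something from this thread):

```shell
#!/bin/sh
# Minimal edge-node ingest helper (sketch). Assumes the hadoop client is
# on PATH; DRY_RUN=1 prints the command instead of running it, which is
# handy when no cluster is reachable.
copy_to_hdfs() {
    src="$1"
    dest="$2"
    if [ ! -e "$src" ]; then
        echo "error: $src not found" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: hdfs dfs -copyFromLocal $src $dest"
    else
        hdfs dfs -copyFromLocal "$src" "$dest"
    fi
}

# Typical use on a real edge node:
#   copy_to_hdfs /data/events.log /user/indranil/events/
#   hadoop jar my-job.jar MyDriver /user/indranil/events /user/indranil/out
```

Job submission works the same way: `hadoop jar` contacts the ResourceManager over the network, so no daemon needs to run on the edge node itself.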

Hope this helps a bit!


> On Sat, Dec 14, 2013 at 4:03 PM, Indranil Majumder (imajumde) <
> imajumde@cisco.com> wrote:
>
>>  I stared with Hadoop few days ago, I do have few doubts on the setup,
>>
>>
>>
>> 1.       For name node I do format the name directory, is it recommended
>> to do the same for the data node directories too.
>>
>> 2.       How does log aggregation work?
>>
>> 3.       Does resource manager run on every node (both Name and Data) or
>> it can run as a separate node?
>>
>> 4.       What is the purpose of the webproxy? Is it really required?
>>
>> 5.       Is there any documentation on how to decide which scheduler
>> type based on certain parameters?
>>
>> 6.       What is the recommended way of pushing  data into Hadoop
>> cluster & submitting  mapred jobs, i.e should we use another client  node,
>> if so is there any client daemon to run on it ?
>>
>> 7.       For the following nodes in clustered mode
>>
>> A.      NameNode
>>
>> B.      Secondary NameNode
>>
>> C.      DataNode (2)
>>
>> D.      Resource Manager
>>
>> E.       WebProxy
>>
>> F.       History Server( Map Reduce )
>>
>> I want to write a PID monitor. Does anybody has the list of processes
>> that would run on this clusters when fully operational [may be output of ps
>> –ef | grep “somekeyword” will do]
>>
>>
>>
>> Thanks & Regards,
>>
>> Indranil
>>
>
>

Re: Hadoop setup doubts

Posted by Manjunath Hegde <he...@gmail.com>.
Please find my answers inline. I too started with Hadoop only a few days back, so I
may be wrong.

1.       For name node I do format the name directory, is it recommended to
do the same for the data node directories too.
----- No, we do not format the DataNode directories; only the NameNode's metadata directory is formatted.

2.       How does log aggregation work?

3.       Does resource manager run on every node (both Name and Data) or it
can run as a separate node?

------- Only on the node you have specified. It will usually run on a single node.

4.       What is the purpose of the webproxy? Is it really required?

5.       Is there any documentation on how to decide which scheduler type
based on certain parameters?

6.       What is the recommended way of pushing  data into Hadoop cluster &
submitting  mapred jobs, i.e should we use another client  node, if so is
there any client daemon to run on it ?

---- Do you have experience with UNIX? If so, the Hadoop commands are similar to
UNIX commands. E.g., the command below works fine for me.

hdfs dfs -copyFromLocal <localfiledir> <hdfs file directory>


7.       For the following nodes in clustered mode

A.      NameNode

B.      Secondary NameNode

C.      DataNode (2)

D.      Resource Manager

E.       WebProxy

F.       History Server (MapReduce)

I want to write a PID monitor. Does anybody have the list of processes that
would run on this cluster when fully operational? [Maybe the output of ps -ef
| grep "somekeyword" will do.]


--- Just use jps if you only need to monitor processes. It really depends on
your requirements.
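A simple sketch of such a monitor built on `jps` (the helper name is mine; the daemon names in the comment are the JVM process names `jps` typically shows on a Hadoop 2.x cluster — verify against your build):

```shell
#!/bin/sh
# Sketch: check that expected Hadoop daemons appear in captured `jps` output.
# Typical Hadoop 2.x process names, split by node role:
#   NameNode, SecondaryNameNode, DataNode, ResourceManager,
#   NodeManager, WebAppProxyServer, JobHistoryServer
check_daemons() {
    jps_out="$1"    # first argument: captured `jps` output
    shift           # remaining arguments: expected process names
    missing=0
    for proc in "$@"; do
        if ! printf '%s\n' "$jps_out" | grep -q "$proc"; then
            echo "missing: $proc"
            missing=1
        fi
    done
    return $missing
}

# On a live master node you would call, e.g.:
#   check_daemons "$(jps)" NameNode ResourceManager JobHistoryServer
```

Exit status 0 means every expected daemon was found, so the function drops straight into a cron job or init-script health check.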



Thanks & Regards,
Indranil


On Sat, Dec 14, 2013 at 4:03 PM, Indranil Majumder (imajumde) <
imajumde@cisco.com> wrote:

>  I stared with Hadoop few days ago, I do have few doubts on the setup,
>
>
>
> 1.       For name node I do format the name directory, is it recommended
> to do the same for the data node directories too.
>
> 2.       How does log aggregation work?
>
> 3.       Does resource manager run on every node (both Name and Data) or
> it can run as a separate node?
>
> 4.       What is the purpose of the webproxy? Is it really required?
>
> 5.       Is there any documentation on how to decide which scheduler type
> based on certain parameters?
>
> 6.       What is the recommended way of pushing  data into Hadoop cluster
> & submitting  mapred jobs, i.e should we use another client  node, if so is
> there any client daemon to run on it ?
>
> 7.       For the following nodes in clustered mode
>
> A.      NameNode
>
> B.      Secondary NameNode
>
> C.      DataNode (2)
>
> D.      Resource Manager
>
> E.       WebProxy
>
> F.       History Server( Map Reduce )
>
> I want to write a PID monitor. Does anybody has the list of processes that
> would run on this clusters when fully operational [may be output of ps –ef
> | grep “somekeyword” will do]
>
>
>
> Thanks & Regards,
>
> Indranil
>
