Posted to common-user@hadoop.apache.org by "Arthur.hk.chan@gmail.com" <ar...@gmail.com> on 2014/08/04 13:08:18 UTC

Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Hi,

I have set up a Hadoop 2.4.1 HA cluster using the Quorum Journal Manager and am verifying automatic failover. After I killed the NameNode process on the active node, it did not fail over to the standby node.

Please advise
Regards
Arthur


2014-08-04 18:54:40,453 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
java.net.ConnectException: Call From standbynode  to  activenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
	at org.apache.hadoop.ipc.Client.call(Client.java:1414)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy16.rollEditLog(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:139)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:271)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:313)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
	at org.apache.hadoop.ipc.Client.call(Client.java:1381)
	... 11 more
2014-08-04 18:55:03,458 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:06,683 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:16,643 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:19,530 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54610 Call#17 Retry#5: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:20,756 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
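
For reference, a quick way to confirm which NameNode the failover controller has actually made active after the kill is the haadmin command below. This is only a sketch: nn1 and nn2 stand in for whatever NameNode IDs are configured in dfs.ha.namenodes.<nameservice> on this cluster.

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

The surviving NameNode should report "active" once failover has completed; the killed one will fail with a connection error.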
 






Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager and JobHistoryServer do not auto-failover to Standby Node

Posted by Akira AJISAKA <aj...@oss.nttdata.co.jp>.
You need additional settings to make the ResourceManager fail over automatically.

http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html

The JobHistoryServer does not have an automatic failover feature.

Regards,
Akira
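
A minimal sketch of the extra yarn-site.xml properties the linked ResourceManagerHA page describes (rm1/rm2, the host names, the cluster id and the ZooKeeper quorum below are illustrative placeholders, not values taken from this thread):

   <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
   </property>
   <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-cluster</value>
   </property>
   <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
   </property>
   <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1.example.com</value>
   </property>
   <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2.example.com</value>
   </property>
   <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1:2181,zk2:2181,zk3:2181</value>
   </property>

With these in place the standby ResourceManager can be brought up on the second master and ZooKeeper-based leader election handles the failover between them.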

(2014/08/05 20:15), Arthur.hk.chan@gmail.com wrote:
> Hi
>
> I have set up Hadoop 2.4.1 with HDFS High Availability using the
> Quorum Journal Manager.
>
> I am verifying automatic failover: I manually used the “kill -9” command to
> stop all running Hadoop services on the active node (NN-1), and I can see
> that the standby node (NN-2) has now become ACTIVE, which is good.
> However, the “ResourceManager” service cannot be found on NN-2. Please
> advise how to make the ResourceManager and JobHistoryServer fail over
> automatically. Am I missing some important setup, or some settings in
> hdfs-site.xml or core-site.xml?
>
> Please help!
>
> Regards
> Arthur
>
>
>
>
> BEFORE TESTING:
> NN-1:
> jps
> 9564 NameNode
> 10176 JobHistoryServer
> 21215 Jps
> 17636 QuorumPeerMain
> 20838 NodeManager
> 9678 DataNode
> 9933 JournalNode
> 10085 DFSZKFailoverController
> 20724 ResourceManager
>
> NN-2 (Standby Name node)
> jps
> 14064 Jps
> 32046 NameNode
> 13765 NodeManager
> 32126 DataNode
> 32271 DFSZKFailoverController
>
>
>
> AFTER
> NN-1
> jps
> 17636 QuorumPeerMain
> 21508 Jps
>
> NN-2
> jps
> 32046 NameNode
> 13765 NodeManager
> 32126 DataNode
> 32271 DFSZKFailoverController
> 14165 Jps
>
>
>


Re: Hadoop 2.4.1 How to clear usercache

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
I restarted the cluster, and the usercache files were all gone automatically. No longer an issue. Thanks.

 
On 20 Aug, 2014, at 7:05 pm, Arthur.hk.chan@gmail.com <ar...@gmail.com> wrote:

> Hi, 
> 
> I use Hadoop 2.4.1; in my cluster, Non DFS Used is 2.09 TB.
> 
> I found that these files are all under tmp/nm-local-dir/usercache.
> 
> Is there any Hadoop command to remove these unused user cache files under tmp/nm-local-dir/usercache?
> 
> Regards
> Arthur
> 
> 
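
As far as I know there is no dedicated Hadoop command that clears tmp/nm-local-dir/usercache; the NodeManager prunes its local resource cache on its own schedule, controlled by the yarn-site.xml knobs below (a sketch only, with illustrative values; the defaults are 10 GB and 10 minutes):

   <property>
    <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
    <value>10240</value>
   </property>
   <property>
    <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
    <value>600000</value>
   </property>

The other common approach is to stop the NodeManager and delete the contents of yarn.nodemanager.local-dirs by hand, which is roughly what the restart above achieved.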



Hadoop 2.4.1 How to clear usercache

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi, 

I use Hadoop 2.4.1; in my cluster, Non DFS Used is 2.09 TB.

I found that these files are all under tmp/nm-local-dir/usercache.

Is there any Hadoop command to remove these unused user cache files under tmp/nm-local-dir/usercache?

Regards
Arthur



Re: Hadoop 2.4.1 Snappy Smoke Test failed

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Thanks for your reply. However, I don't think this is a 32-bit issue, because my Hadoop build is 64-bit (I compiled it from source). I suspect my way of installing Snappy is wrong.

Arthur
On 19 Aug, 2014, at 11:53 pm, Andre Kelpe <ak...@concurrentinc.com> wrote:

> Could this be caused by the fact that hadoop no longer ships with 64bit libs? https://issues.apache.org/jira/browse/HADOOP-9911
> 
> - André
> 
> 
> On Tue, Aug 19, 2014 at 5:40 PM, Arthur.hk.chan@gmail.com <ar...@gmail.com> wrote:
> Hi,
> 
> I am trying Snappy in Hadoop 2.4.1, here are my steps: 
> 
> (CentOS 64-bit)
> 1)
> yum install snappy snappy-devel
> 
> 2)
> added the following 
> (core-site.xml)
>    <property>
>     <name>io.compression.codecs</name>
>     <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
>    </property>
> 
> 3) 
> mapred-site.xml
>    <property>
>     <name>mapreduce.admin.map.child.java.opts</name>
>     <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
>     <final>true</final>
>    </property>
>    <property>
>     <name>mapreduce.admin.reduce.child.java.opts</name>
>     <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
>     <final>true</final>
>    </property>
> 
> 4) smoke test
> bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar  teragen 100000 /tmp/teragenout
> 
> I got the following warning, actually there is no any test file created in hdfs:
> 
> 14/08/19 22:50:10 WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings.
> 14/08/19 22:50:10 WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.reduce.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the reduce JVM env using mapreduce.admin.user.env config settings.
> 
> Can anyone please advise how to install and enable SNAPPY in Hadoop 2.4.1? or what would be wrong? or is my new change in mapred-site.xml incorrect?
> 
> Regards
> Arthur
> 
> 
> 
> 
> 
> 
> 
> 
> -- 
> André Kelpe
> andre@concurrentinc.com
> http://concurrentinc.com
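
One quick check that is independent of how Snappy was installed is to ask the Hadoop build itself whether it can load the native libraries. A sketch, run from the Hadoop install directory:

bin/hadoop checknative -a

If the snappy line reports false, the native hadoop library was either built without Snappy support or cannot find libsnappy.so at runtime.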


Re: Hadoop 2.4.1 Snappy Smoke Test failed

Posted by Andre Kelpe <ak...@concurrentinc.com>.
Could this be caused by the fact that hadoop no longer ships with 64bit
libs? https://issues.apache.org/jira/browse/HADOOP-9911

- André


On Tue, Aug 19, 2014 at 5:40 PM, Arthur.hk.chan@gmail.com <
arthur.hk.chan@gmail.com> wrote:

> Hi,
>
> I am trying Snappy in Hadoop 2.4.1, here are my steps:
>
> (CentOS 64-bit)
> 1)
> yum install snappy snappy-devel
>
> 2)
> added the following
> (core-site.xml)
>    <property>
>     <name>io.compression.codecs</name>
>
> <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
>    </property>
>
> 3)
> mapred-site.xml
>    <property>
>     <name>mapreduce.admin.map.child.java.opts</name>
>     <value>-server -XX:NewRatio=8
> -Djava.library.path=/usr/lib/hadoop/lib/native/
> -Djava.net.preferIPv4Stack=true</value>
>     <final>true</final>
>    </property>
>    <property>
>     <name>mapreduce.admin.reduce.child.java.opts</name>
>     <value>-server -XX:NewRatio=8
> -Djava.library.path=/usr/lib/hadoop/lib/native/
> -Djava.net.preferIPv4Stack=true</value>
>     <final>true</final>
>    </property>
>
> 4) smoke test
> bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar  teragen
> 100000 /tmp/teragenout
>
> I got the following warning, actually there is no any test file created in
> hdfs:
>
> 14/08/19 22:50:10 WARN mapred.YARNRunner: Usage of -Djava.library.path in
> mapreduce.admin.map.child.java.opts can cause programs to no longer
> function if hadoop native libraries are used. These values should be set as
> part of the LD_LIBRARY_PATH in the map JVM env using
> mapreduce.admin.user.env config settings.
> 14/08/19 22:50:10 WARN mapred.YARNRunner: Usage of -Djava.library.path in
> mapreduce.admin.reduce.child.java.opts can cause programs to no longer
> function if hadoop native libraries are used. These values should be set as
> part of the LD_LIBRARY_PATH in the reduce JVM env using
> mapreduce.admin.user.env config settings.
>
> Can anyone please advise how to install and enable SNAPPY in Hadoop 2.4.1?
> or what would be wrong? or is my new change in mapred-site.xml incorrect?
>
> Regards
> Arthur
>
>
>
>
>
>


-- 
André Kelpe
andre@concurrentinc.com
http://concurrentinc.com
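
If the missing 64-bit native libraries do turn out to be the problem, the usual fix is to rebuild them from source with Snappy support, roughly along these lines (a sketch; see BUILDING.txt in the source tree for the authoritative options):

mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy

and then copy the resulting lib/native contents from the generated dist directory into $HADOOP_HOME/lib/native on each node.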


Hadoop 2.4.1 Snappy Smoke Test failed

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi,

I am trying Snappy in Hadoop 2.4.1, here are my steps: 

(CentOS 64-bit)
1)
yum install snappy snappy-devel

2)
added the following 
(core-site.xml)
   <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
   </property>

3) 
mapred-site.xml
   <property>
    <name>mapreduce.admin.map.child.java.opts</name>
    <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
    <final>true</final>
   </property>
   <property>
    <name>mapreduce.admin.reduce.child.java.opts</name>
    <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
    <final>true</final>
   </property>

4) smoke test
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar  teragen 100000 /tmp/teragenout

I got the following warning, and actually no test file was created in HDFS:

14/08/19 22:50:10 WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings.
14/08/19 22:50:10 WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.reduce.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the reduce JVM env using mapreduce.admin.user.env config settings.

Can anyone please advise how to install and enable Snappy in Hadoop 2.4.1? What might be wrong? Is my change in mapred-site.xml incorrect?

Regards
Arthur
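
Regarding the YARNRunner warning itself, a sketch of the direction it points to: drop -Djava.library.path from the two admin opts above and put the native library directory on LD_LIBRARY_PATH via mapreduce.admin.user.env instead (the /usr/lib/hadoop/lib/native path is simply the one used above; adjust it to wherever the 64-bit libhadoop.so and libsnappy.so actually live):

   <property>
    <name>mapreduce.admin.user.env</name>
    <value>LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native</value>
   </property>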






Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

Posted by Xuan Gong <xg...@hortonworks.com>.
Hey, Arthur:

   Could you show me the error message for rm2. please ?


Thanks

Xuan Gong


On Mon, Aug 11, 2014 at 10:17 PM, Arthur.hk.chan@gmail.com <
arthur.hk.chan@gmail.com> wrote:

> Hi,
>
> Thank you very much!
>
> At the moment, if I run ./sbin/start-yarn.sh on rm1, the STANDBY ResourceManager
> on rm2 is not started as well.  Please advise what could be wrong?
> Thanks
>
> Regards
> Arthur
>
>
>
>
> On 12 Aug, 2014, at 1:13 pm, Xuan Gong <xg...@hortonworks.com> wrote:
>
> Some questions:
> Q1) I need to start yarn on EACH master separately; is this normal? Is there
> a way to just run ./sbin/start-yarn.sh on rm1 and have the
> STANDBY ResourceManager on rm2 started as well?
>
> No, you need to start each RM separately.
>
> Q2) How can I get alerts (e.g. by email) if the ACTIVE ResourceManager is
> down in an auto-failover environment? Or how do you monitor the status of
> the ACTIVE/STANDBY ResourceManagers?
>
> Interesting question. But one of the design goals of auto-failover is that
> RM downtime is invisible to end users. End users can submit
> applications normally even if a failover happens.
>
> We can monitor the status of the RMs by using the command line (as you did
> previously) or from the web UI / web service
> (rm_address:portnumber/cluster/cluster), which shows the current status.
>
> Thanks
>
> Xuan Gong
>
>
> On Mon, Aug 11, 2014 at 5:12 PM, Arthur.hk.chan@gmail.com <
> arthur.hk.chan@gmail.com> wrote:
>
>> Hi,
>>
>> it is a multiple-node cluster, two master nodes (rm1 and rm2), below is
>> my yarn-site.xml.
>>
>> At the moment, the ResourceManager HA works if:
>>
>> 1) at rm1, run ./sbin/start-yarn.sh
>>
>> yarn rmadmin -getServiceState rm1
>> active
>>
>> yarn rmadmin -getServiceState rm2
>> 14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server: rm1/
>> 192.168.1.1:23142. Already tried 0 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>> MILLISECONDS)
>> Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on
>> connection exception: java.net.ConnectException: Connection refused; For
>> more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
>>
>>
>> 2) at rm2, run ./sbin/start-yarn.sh
>>
>> yarn rmadmin -getServiceState rm1
>> standby
>>
>>
>> Some questions:
>> Q1) I need to start yarn on EACH master separately; is this normal? Is
>> there a way to just run ./sbin/start-yarn.sh on rm1 and have the
>> STANDBY ResourceManager on rm2 started as well?
>>
>> Q2) How can I get alerts (e.g. by email) if the ACTIVE ResourceManager is
>> down in an auto-failover environment? Or how do you monitor the status of
>> the ACTIVE/STANDBY ResourceManagers?
>>
>>
>> Regards
>> Arthur
>>
>>
>> <?xml version="1.0"?>
>> <configuration>
>>
>> <!-- Site specific YARN configuration properties -->
>>
>>    <property>
>>       <name>yarn.nodemanager.aux-services</name>
>>       <value>mapreduce_shuffle</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.resourcemanager.address</name>
>>       <value>192.168.1.1:8032</value>
>>    </property>
>>
>>    <property>
>>        <name>yarn.resourcemanager.resource-tracker.address</name>
>>        <value>192.168.1.1:8031</value>
>>    </property>
>>
>>    <property>
>>        <name>yarn.resourcemanager.admin.address</name>
>>        <value>192.168.1.1:8033</value>
>>    </property>
>>
>>    <property>
>>        <name>yarn.resourcemanager.scheduler.address</name>
>>        <value>192.168.1.1:8030</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.nodemanager.loacl-dirs</name>
>>        <value>/edh/hadoop_data/mapred/nodemanager</value>
>>        <final>true</final>
>>    </property>
>>
>>    <property>
>>        <name>yarn.web-proxy.address</name>
>>        <value>192.168.1.1:8888</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
>>       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
>>    </property>
>>
>>
>>
>>
>>    <property>
>>       <name>yarn.nodemanager.resource.memory-mb</name>
>>       <value>18432</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.scheduler.minimum-allocation-mb</name>
>>       <value>9216</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.scheduler.maximum-allocation-mb</name>
>>       <value>18432</value>
>>    </property>
>>
>>
>>
>>   <property>
>>     <name>yarn.resourcemanager.connect.retry-interval.ms</name>
>>     <value>2000</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.cluster-id</name>
>>     <value>cluster_rm</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.rm-ids</name>
>>     <value>rm1,rm2</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.hostname.rm1</name>
>>     <value>192.168.1.1</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.hostname.rm2</name>
>>     <value>192.168.1.2</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.scheduler.class</name>
>>
>> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.recovery.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.store.class</name>
>>
>> <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
>>   </property>
>>   <property>
>>       <name>yarn.resourcemanager.zk-address</name>
>>       <value>rm1:2181,m135:2181,m137:2181</value>
>>   </property>
>>   <property>
>>
>> <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
>>     <value>5000</value>
>>   </property>
>>
>>   <!-- RM1 configs -->
>>   <property>
>>     <name>yarn.resourcemanager.address.rm1</name>
>>     <value>192.168.1.1:23140</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.scheduler.address.rm1</name>
>>     <value>192.168.1.1:23130</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.https.address.rm1</name>
>>     <value>192.168.1.1:23189</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.address.rm1</name>
>>     <value>192.168.1.1:23188</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
>>     <value>192.168.1.1:23125</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.admin.address.rm1</name>
>>     <value>192.168.1.1:23142</value>
>>   </property>
>>
>>
>>   <!-- RM2 configs -->
>>   <property>
>>     <name>yarn.resourcemanager.address.rm2</name>
>>     <value>192.168.1.2:23140</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.scheduler.address.rm2</name>
>>     <value>192.168.1.2:23130</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.https.address.rm2</name>
>>     <value>192.168.1.2:23189</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.address.rm2</name>
>>     <value>192.168.1.2:23188</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
>>     <value>192.168.1.2:23125</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.admin.address.rm2</name>
>>     <value>192.168.1.2:23142</value>
>>   </property>
>>
>>   <property>
>>     <name>yarn.nodemanager.remote-app-log-dir</name>
>>     <value>/edh/hadoop_logs/hadoop/</value>
>>   </property>
>>
>> </configuration>
>>
>>
>>
>> On 12 Aug, 2014, at 1:49 am, Xuan Gong <xg...@hortonworks.com> wrote:
>>
>> Hey, Arthur:
>>
>>     Did you use single node cluster or multiple nodes cluster? Could you
>> share your configuration file (yarn-site.xml) ? This looks like a
>> configuration issue.
>>
>> Thanks
>>
>> Xuan Gong
>>
>>
>> On Mon, Aug 11, 2014 at 9:45 AM, Arthur.hk.chan@gmail.com <
>> arthur.hk.chan@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> If I have TWO nodes for ResourceManager HA, what should be the correct
>>> steps and commands to start and stop the ResourceManagers in a ResourceManager
>>> HA cluster?
>>> Unlike ./sbin/start-dfs.sh (which can start all NameNodes from one NameNode), it
>>> seems that ./sbin/start-yarn.sh can only start YARN on one node at a
>>> time.
>>>
>>> Regards
>>> Arthur
>>>
>>>
>>>
>>
>
>
>
>
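
Putting the two answers above together: each ResourceManager has to be started on its own master, and the HA state can be polled with rmadmin. A sketch only; rm1 and rm2 are the rm-ids from the yarn-site.xml above, and the mail command and address are placeholders for whatever alerting mechanism is available:

# on rm1
./sbin/start-yarn.sh

# on rm2, start only the standby ResourceManager
./sbin/yarn-daemon.sh start resourcemanager

# cron-able check that alerts when no ResourceManager reports active
active=0
for id in rm1 rm2; do
  state=$(yarn rmadmin -getServiceState "$id" 2>/dev/null)
  echo "$id: ${state:-unreachable}"
  [ "$state" = "active" ] && active=1
done
if [ "$active" -eq 0 ]; then
  echo "no active ResourceManager" | mail -s "YARN RM alert" admin@example.com
fi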

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

Posted by Xuan Gong <xg...@hortonworks.com>.
Hey, Arthur:

   Could you show me the error message for rm2. please ?


Thanks

Xuan Gong


On Mon, Aug 11, 2014 at 10:17 PM, Arthur.hk.chan@gmail.com <
arthur.hk.chan@gmail.com> wrote:

> Hi,
>
> Thank y very much!
>
> At the moment if I run ./sbin/start-yarn.sh in rm1, the standby STANDBY ResourceManager
> in rm2 is not started accordingly.  Please advise what would be wrong?
> Thanks
>
> Regards
> Arthur
>
>
>
>
> On 12 Aug, 2014, at 1:13 pm, Xuan Gong <xg...@hortonworks.com> wrote:
>
> Some questions:
> Q1)  I need start yarn in EACH master separately, is this normal? Is there
> a way that I just run ./sbin/start-yarn.sh in rm1 and get the
> STANDBY ResourceManager in rm2 started as well?
>
> No, need to start multiple RMs separately.
>
> Q2) How to get alerts (e.g. by email) if the ACTIVE ResourceManager is
> down in an auto-failover env? or how do you monitor the status of
> ACTIVE/STANDBY ResourceManager?
>
> Interesting question. But one of the design for auto-failover is that the
> down-time of RM is invisible to end users. The end users can submit
> applications normally even if the failover happens.
>
> We can monitor the status of RMs by using the command-line (you did
> previously) or from webUI/webService
> (rm_address:portnumber/cluster/cluster). We can get the current status from
> there.
>
> Thanks
>
> Xuan Gong
>
>
> On Mon, Aug 11, 2014 at 5:12 PM, Arthur.hk.chan@gmail.com <
> arthur.hk.chan@gmail.com> wrote:
>
>> Hi,
>>
>> it is a multiple-node cluster, two master nodes (rm1 and rm2), below is
>> my yarn-site.xml.
>>
>> At the moment, the ResourceManager HA works if:
>>
>> 1) at rm1, run ./sbin/start-yarn.sh
>>
>> yarn rmadmin -getServiceState rm1
>> active
>>
>> yarn rmadmin -getServiceState rm2
>> 14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server: rm1/
>> 192.168.1.1:23142. Already tried 0 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>> MILLISECONDS)
>> Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on
>> connection exception: java.net.ConnectException: Connection refused; For
>> more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
>>
>>
>> 2) at rm2, run ./sbin/start-yarn.sh
>>
>> yarn rmadmin -getServiceState rm1
>> standby
>>
>>
>> Some questions:
>> Q1)  I need start yarn in EACH master separately, is this normal? Is
>> there a way that I just run ./sbin/start-yarn.sh in rm1 and get the
>> STANDBY ResourceManager in rm2 started as well?
>>
>> Q2) How to get alerts (e.g. by email) if the ACTIVE ResourceManager is
>> down in an auto-failover env? or how do you monitor the status of
>> ACTIVE/STANDBY ResourceManager?
>>
>>
>> Regards
>> Arthur
>>
>>
>> <?xml version="1.0"?>
>> <configuration>
>>
>> <!-- Site specific YARN configuration properties -->
>>
>>    <property>
>>       <name>yarn.nodemanager.aux-services</name>
>>       <value>mapreduce_shuffle</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.resourcemanager.address</name>
>>       <value>192.168.1.1:8032</value>
>>    </property>
>>
>>    <property>
>>        <name>yarn.resourcemanager.resource-tracker.address</name>
>>        <value>192.168.1.1:8031</value>
>>    </property>
>>
>>    <property>
>>        <name>yarn.resourcemanager.admin.address</name>
>>        <value>192.168.1.1:8033</value>
>>    </property>
>>
>>    <property>
>>        <name>yarn.resourcemanager.scheduler.address</name>
>>        <value>192.168.1.1:8030</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.nodemanager.loacl-dirs</name>
>>        <value>/edh/hadoop_data/mapred/nodemanager</value>
>>        <final>true</final>
>>    </property>
>>
>>    <property>
>>        <name>yarn.web-proxy.address</name>
>>        <value>192.168.1.1:8888</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
>>       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
>>    </property>
>>
>>
>>
>>
>>    <property>
>>       <name>yarn.nodemanager.resource.memory-mb</name>
>>       <value>18432</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.scheduler.minimum-allocation-mb</name>
>>       <value>9216</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.scheduler.maximum-allocation-mb</name>
>>       <value>18432</value>
>>    </property>
>>
>>
>>
>>   <property>
>>     <name>yarn.resourcemanager.connect.retry-interval.ms</name>
>>     <value>2000</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.cluster-id</name>
>>     <value>cluster_rm</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.rm-ids</name>
>>     <value>rm1,rm2</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.hostname.rm1</name>
>>     <value>192.168.1.1</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.hostname.rm2</name>
>>     <value>192.168.1.2</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.scheduler.class</name>
>>
>> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.recovery.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.store.class</name>
>>
>> <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
>>   </property>
>>   <property>
>>       <name>yarn.resourcemanager.zk-address</name>
>>       <value>rm1:2181,m135:2181,m137:2181</value>
>>   </property>
>>   <property>
>>
>> <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
>>     <value>5000</value>
>>   </property>
>>
>>   <!-- RM1 configs -->
>>   <property>
>>     <name>yarn.resourcemanager.address.rm1</name>
>>     <value>192.168.1.1:23140</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.scheduler.address.rm1</name>
>>     <value>192.168.1.1:23130</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.https.address.rm1</name>
>>     <value>192.168.1.1:23189</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.address.rm1</name>
>>     <value>192.168.1.1:23188</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
>>     <value>192.168.1.1:23125</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.admin.address.rm1</name>
>>     <value>192.168.1.1:23142</value>
>>   </property>
>>
>>
>>   <!-- RM2 configs -->
>>   <property>
>>     <name>yarn.resourcemanager.address.rm2</name>
>>     <value>192.168.1.2:23140</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.scheduler.address.rm2</name>
>>     <value>192.168.1.2:23130</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.https.address.rm2</name>
>>     <value>192.168.1.2:23189</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.address.rm2</name>
>>     <value>192.168.1.2:23188</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
>>     <value>192.168.1.2:23125</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.admin.address.rm2</name>
>>     <value>192.168.1.2:23142</value>
>>   </property>
>>
>>   <property>
>>     <name>yarn.nodemanager.remote-app-log-dir</name>
>>     <value>/edh/hadoop_logs/hadoop/</value>
>>   </property>
>>
>> </configuration>
>>
>>
>>
>> On 12 Aug, 2014, at 1:49 am, Xuan Gong <xg...@hortonworks.com> wrote:
>>
>> Hey, Arthur:
>>
>>     Did you use single node cluster or multiple nodes cluster? Could you
>> share your configuration file (yarn-site.xml) ? This looks like a
>> configuration issue.
>>
>> Thanks
>>
>> Xuan Gong
>>
>>
>> On Mon, Aug 11, 2014 at 9:45 AM, Arthur.hk.chan@gmail.com <
>> arthur.hk.chan@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> If I have TWO nodes for ResourceManager HA, what should be the correct
>>> steps and commands to start and stop ResourceManager in a ResourceManager
>>> HA cluster ?
>>> Unlike ./sbin/start-dfs.sh (which can start all NNs from a NN), it
>>> seems that  ./sbin/start-yarn.sh can only start YARN in a node at a
>>> time.
>>>
>>> Regards
>>> Arthur
>>>
>>>
>>>
>>
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>
>
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

Posted by Xuan Gong <xg...@hortonworks.com>.
Hey, Arthur:

   Could you show me the error message for rm2. please ?


Thanks

Xuan Gong


On Mon, Aug 11, 2014 at 10:17 PM, Arthur.hk.chan@gmail.com <
arthur.hk.chan@gmail.com> wrote:

> Hi,
>
> Thank y very much!
>
> At the moment if I run ./sbin/start-yarn.sh in rm1, the standby STANDBY ResourceManager
> in rm2 is not started accordingly.  Please advise what would be wrong?
> Thanks
>
> Regards
> Arthur
>
>
>
>
> On 12 Aug, 2014, at 1:13 pm, Xuan Gong <xg...@hortonworks.com> wrote:
>
> Some questions:
> Q1)  I need start yarn in EACH master separately, is this normal? Is there
> a way that I just run ./sbin/start-yarn.sh in rm1 and get the
> STANDBY ResourceManager in rm2 started as well?
>
> No, need to start multiple RMs separately.
>
> Q2) How to get alerts (e.g. by email) if the ACTIVE ResourceManager is
> down in an auto-failover env? or how do you monitor the status of
> ACTIVE/STANDBY ResourceManager?
>
> Interesting question. But one of the design for auto-failover is that the
> down-time of RM is invisible to end users. The end users can submit
> applications normally even if the failover happens.
>
> We can monitor the status of RMs by using the command-line (you did
> previously) or from webUI/webService
> (rm_address:portnumber/cluster/cluster). We can get the current status from
> there.
>
> Thanks
>
> Xuan Gong
>
>
> On Mon, Aug 11, 2014 at 5:12 PM, Arthur.hk.chan@gmail.com <
> arthur.hk.chan@gmail.com> wrote:
>
>> Hi,
>>
>> it is a multiple-node cluster, two master nodes (rm1 and rm2), below is
>> my yarn-site.xml.
>>
>> At the moment, the ResourceManager HA works if:
>>
>> 1) at rm1, run ./sbin/start-yarn.sh
>>
>> yarn rmadmin -getServiceState rm1
>> active
>>
>> yarn rmadmin -getServiceState rm2
>> 14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server: rm1/
>> 192.168.1.1:23142. Already tried 0 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>> MILLISECONDS)
>> Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on
>> connection exception: java.net.ConnectException: Connection refused; For
>> more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
>>
>>
>> 2) at rm2, run ./sbin/start-yarn.sh
>>
>> yarn rmadmin -getServiceState rm1
>> standby
>>
>>
>> Some questions:
>> Q1)  I need start yarn in EACH master separately, is this normal? Is
>> there a way that I just run ./sbin/start-yarn.sh in rm1 and get the
>> STANDBY ResourceManager in rm2 started as well?
>>
>> Q2) How to get alerts (e.g. by email) if the ACTIVE ResourceManager is
>> down in an auto-failover env? or how do you monitor the status of
>> ACTIVE/STANDBY ResourceManager?
>>
>>
>> Regards
>> Arthur
>>
>>
>> <?xml version="1.0"?>
>> <configuration>
>>
>> <!-- Site specific YARN configuration properties -->
>>
>>    <property>
>>       <name>yarn.nodemanager.aux-services</name>
>>       <value>mapreduce_shuffle</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.resourcemanager.address</name>
>>       <value>192.168.1.1:8032</value>
>>    </property>
>>
>>    <property>
>>        <name>yarn.resourcemanager.resource-tracker.address</name>
>>        <value>192.168.1.1:8031</value>
>>    </property>
>>
>>    <property>
>>        <name>yarn.resourcemanager.admin.address</name>
>>        <value>192.168.1.1:8033</value>
>>    </property>
>>
>>    <property>
>>        <name>yarn.resourcemanager.scheduler.address</name>
>>        <value>192.168.1.1:8030</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.nodemanager.local-dirs</name>
>>        <value>/edh/hadoop_data/mapred/nodemanager</value>
>>        <final>true</final>
>>    </property>
>>
>>    <property>
>>        <name>yarn.web-proxy.address</name>
>>        <value>192.168.1.1:8888</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
>>       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
>>    </property>
>>
>>
>>
>>
>>    <property>
>>       <name>yarn.nodemanager.resource.memory-mb</name>
>>       <value>18432</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.scheduler.minimum-allocation-mb</name>
>>       <value>9216</value>
>>    </property>
>>
>>    <property>
>>       <name>yarn.scheduler.maximum-allocation-mb</name>
>>       <value>18432</value>
>>    </property>
>>
>>
>>
>>   <property>
>>     <name>yarn.resourcemanager.connect.retry-interval.ms</name>
>>     <value>2000</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.cluster-id</name>
>>     <value>cluster_rm</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.ha.rm-ids</name>
>>     <value>rm1,rm2</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.hostname.rm1</name>
>>     <value>192.168.1.1</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.hostname.rm2</name>
>>     <value>192.168.1.2</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.scheduler.class</name>
>>
>> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.recovery.enabled</name>
>>     <value>true</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.store.class</name>
>>
>> <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
>>   </property>
>>   <property>
>>       <name>yarn.resourcemanager.zk-address</name>
>>       <value>rm1:2181,m135:2181,m137:2181</value>
>>   </property>
>>   <property>
>>
>> <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
>>     <value>5000</value>
>>   </property>
>>
>>   <!-- RM1 configs -->
>>   <property>
>>     <name>yarn.resourcemanager.address.rm1</name>
>>     <value>192.168.1.1:23140</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.scheduler.address.rm1</name>
>>     <value>192.168.1.1:23130</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.https.address.rm1</name>
>>     <value>192.168.1.1:23189</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.address.rm1</name>
>>     <value>192.168.1.1:23188</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
>>     <value>192.168.1.1:23125</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.admin.address.rm1</name>
>>     <value>192.168.1.1:23142</value>
>>   </property>
>>
>>
>>   <!-- RM2 configs -->
>>   <property>
>>     <name>yarn.resourcemanager.address.rm2</name>
>>     <value>192.168.1.2:23140</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.scheduler.address.rm2</name>
>>     <value>192.168.1.2:23130</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.https.address.rm2</name>
>>     <value>192.168.1.2:23189</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.webapp.address.rm2</name>
>>     <value>192.168.1.2:23188</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
>>     <value>192.168.1.2:23125</value>
>>   </property>
>>   <property>
>>     <name>yarn.resourcemanager.admin.address.rm2</name>
>>     <value>192.168.1.2:23142</value>
>>   </property>
>>
>>   <property>
>>     <name>yarn.nodemanager.remote-app-log-dir</name>
>>     <value>/edh/hadoop_logs/hadoop/</value>
>>   </property>
>>
>> </configuration>
>>
>>
>>
>> On 12 Aug, 2014, at 1:49 am, Xuan Gong <xg...@hortonworks.com> wrote:
>>
>> Hey, Arthur:
>>
>>     Did you use a single-node cluster or a multi-node cluster? Could you
>> share your configuration file (yarn-site.xml)? This looks like a
>> configuration issue.
>>
>> Thanks
>>
>> Xuan Gong
>>
>>
>> On Mon, Aug 11, 2014 at 9:45 AM, Arthur.hk.chan@gmail.com <
>> arthur.hk.chan@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> If I have TWO nodes for ResourceManager HA, what should be the correct
>>> steps and commands to start and stop ResourceManager in a ResourceManager
>>> HA cluster?
>>> Unlike ./sbin/start-dfs.sh (which can start all NNs from a NN), it
>>> seems that ./sbin/start-yarn.sh can only start YARN on one node at a
>>> time.
>>>
>>> Regards
>>> Arthur
>>>
>>>
>>>
>>
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>
>
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Hadoop 2.4.1 Snappy Smoke Test failed

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi,

I am trying Snappy in Hadoop 2.4.1, here are my steps: 

(CentOS 64-bit)
1)
yum install snappy snappy-devel

2)
added the following 
(core-site.xml)
   <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
   </property>

3) 
mapred-site.xml
   <property>
    <name>mapreduce.admin.map.child.java.opts</name>
    <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
    <final>true</final>
   </property>
   <property>
    <name>mapreduce.admin.reduce.child.java.opts</name>
    <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
    <final>true</final>
   </property>

4) smoke test
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar  teragen 100000 /tmp/teragenout

I got the following warning, and in fact no test file was created in HDFS:

14/08/19 22:50:10 WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.map.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the map JVM env using mapreduce.admin.user.env config settings.
14/08/19 22:50:10 WARN mapred.YARNRunner: Usage of -Djava.library.path in mapreduce.admin.reduce.child.java.opts can cause programs to no longer function if hadoop native libraries are used. These values should be set as part of the LD_LIBRARY_PATH in the reduce JVM env using mapreduce.admin.user.env config settings.

Can anyone please advise how to install and enable Snappy in Hadoop 2.4.1, what might be wrong, or whether my change in mapred-site.xml is incorrect?

Regards
Arthur
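
The warning above is the key hint: it asks for the native library path to be supplied through mapreduce.admin.user.env rather than -Djava.library.path in the child JVM opts. A minimal sketch of that change in mapred-site.xml, assuming the native libraries (including libsnappy) really do live under /usr/lib/hadoop/lib/native as in the opts above:

   <property>
    <name>mapreduce.admin.user.env</name>
    <value>LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native</value>
   </property>

If the Hadoop build ships the checknative command, "hadoop checknative -a" is a quick way to confirm that the snappy library is actually picked up before re-running the teragen smoke test.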






Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi,

Thank you very much!

At the moment, if I run ./sbin/start-yarn.sh in rm1, the STANDBY ResourceManager in rm2 is not started accordingly. Please advise what could be wrong? Thanks

Regards
Arthur
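
Since automatic failover here relies on the embedded ZooKeeper-based elector, one extra sanity check is whether the RMs have registered under the leader-election znode. A rough sketch against the ZooKeeper quorum from the yarn-site.xml below, assuming the default zk-base-path /yarn-leader-election and the usual elector znode names (both are assumptions, not taken from this thread):

# from any host with the ZooKeeper CLI
zkCli.sh -server rm1:2181
ls /yarn-leader-election/cluster_rm
# an active RM normally shows ActiveBreadCrumb and ActiveStandbyElectorLock here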




On 12 Aug, 2014, at 1:13 pm, Xuan Gong <xg...@hortonworks.com> wrote:

> Some questions:
> Q1) I need to start YARN on EACH master separately; is this normal? Is there a way to just run ./sbin/start-yarn.sh in rm1 and have the STANDBY ResourceManager in rm2 started as well?
> 
> No, you need to start each RM separately.
> 
> Q2) How can I get alerts (e.g. by email) if the ACTIVE ResourceManager is down in an auto-failover environment? And how do you monitor the status of the ACTIVE/STANDBY ResourceManager?
> 
> Interesting question. One of the design goals of auto-failover is that the downtime of the RM is invisible to end users: they can submit applications normally even while a failover happens.
> 
> We can monitor the status of the RMs with the command line (as you did previously) or from the web UI/web service (rm_address:portnumber/cluster/cluster), which shows the current state.
> 
> Thanks
> 
> Xuan Gong
> 
> 
> On Mon, Aug 11, 2014 at 5:12 PM, Arthur.hk.chan@gmail.com <ar...@gmail.com> wrote:
> Hi,
> 
> It is a multi-node cluster with two master nodes (rm1 and rm2); below is my yarn-site.xml.
> 
> At the moment, the ResourceManager HA works if:
> 
> 1) at rm1, run ./sbin/start-yarn.sh
> 
> yarn rmadmin -getServiceState rm1
> active
> 
> yarn rmadmin -getServiceState rm2
> 14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server: rm1/192.168.1.1:23142. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> 
> 
> 2) at rm2, run ./sbin/start-yarn.sh
> 
> yarn rmadmin -getServiceState rm1
> standby
> 
> 
> Some questions:
> Q1) I need to start YARN on EACH master separately; is this normal? Is there a way to just run ./sbin/start-yarn.sh in rm1 and have the STANDBY ResourceManager in rm2 started as well?
> 
> Q2) How can I get alerts (e.g. by email) if the ACTIVE ResourceManager is down in an auto-failover environment? And how do you monitor the status of the ACTIVE/STANDBY ResourceManager?
> 
> 
> Regards
> Arthur
> 
> 
> <?xml version="1.0"?>
> <configuration>
> 
> <!-- Site specific YARN configuration properties -->
> 
>    <property>
>       <name>yarn.nodemanager.aux-services</name>
>       <value>mapreduce_shuffle</value>
>    </property>
> 
>    <property>
>       <name>yarn.resourcemanager.address</name>
>       <value>192.168.1.1:8032</value>
>    </property>
> 
>    <property>
>        <name>yarn.resourcemanager.resource-tracker.address</name>
>        <value>192.168.1.1:8031</value>
>    </property>
> 
>    <property>
>        <name>yarn.resourcemanager.admin.address</name>
>        <value>192.168.1.1:8033</value>
>    </property>
> 
>    <property>
>        <name>yarn.resourcemanager.scheduler.address</name>
>        <value>192.168.1.1:8030</value>
>    </property>
> 
>    <property>
>       <name>yarn.nodemanager.local-dirs</name>
>        <value>/edh/hadoop_data/mapred/nodemanager</value>
>        <final>true</final>
>    </property>
> 
>    <property>
>        <name>yarn.web-proxy.address</name>
>        <value>192.168.1.1:8888</value>
>    </property>
> 
>    <property>
>       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
>       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
>    </property>
> 
> 
> 
> 
>    <property>
>       <name>yarn.nodemanager.resource.memory-mb</name>
>       <value>18432</value>
>    </property>
> 
>    <property>
>       <name>yarn.scheduler.minimum-allocation-mb</name>
>       <value>9216</value>
>    </property>
> 
>    <property>
>       <name>yarn.scheduler.maximum-allocation-mb</name>
>       <value>18432</value>
>    </property>
> 
> 
> 
>   <property>
>     <name>yarn.resourcemanager.connect.retry-interval.ms</name>
>     <value>2000</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.cluster-id</name>
>     <value>cluster_rm</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.rm-ids</name>
>     <value>rm1,rm2</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm1</name>
>     <value>192.168.1.1</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm2</name>
>     <value>192.168.1.2</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.recovery.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.store.class</name>
>     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
>   </property>
>   <property>
>       <name>yarn.resourcemanager.zk-address</name>
>       <value>rm1:2181,m135:2181,m137:2181</value>
>   </property>
>   <property>
>     <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
>     <value>5000</value>
>   </property>
> 
>   <!-- RM1 configs -->
>   <property>
>     <name>yarn.resourcemanager.address.rm1</name>
>     <value>192.168.1.1:23140</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address.rm1</name>
>     <value>192.168.1.1:23130</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.https.address.rm1</name>
>     <value>192.168.1.1:23189</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.address.rm1</name>
>     <value>192.168.1.1:23188</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
>     <value>192.168.1.1:23125</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.admin.address.rm1</name>
>     <value>192.168.1.1:23142</value>
>   </property>
> 
> 
>   <!-- RM2 configs -->
>   <property>
>     <name>yarn.resourcemanager.address.rm2</name>
>     <value>192.168.1.2:23140</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address.rm2</name>
>     <value>192.168.1.2:23130</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.https.address.rm2</name>
>     <value>192.168.1.2:23189</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.address.rm2</name>
>     <value>192.168.1.2:23188</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
>     <value>192.168.1.2:23125</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.admin.address.rm2</name>
>     <value>192.168.1.2:23142</value>
>   </property>
> 
>   <property>
>     <name>yarn.nodemanager.remote-app-log-dir</name>
>     <value>/edh/hadoop_logs/hadoop/</value>
>   </property>
> 
> </configuration>
> 
> 
> 
> On 12 Aug, 2014, at 1:49 am, Xuan Gong <xg...@hortonworks.com> wrote:
> 
>> Hey, Arthur:
>> 
>>     Did you use a single-node cluster or a multi-node cluster? Could you share your configuration file (yarn-site.xml)? This looks like a configuration issue.
>> 
>> Thanks
>> 
>> Xuan Gong
>> 
>> 
>> On Mon, Aug 11, 2014 at 9:45 AM, Arthur.hk.chan@gmail.com <ar...@gmail.com> wrote:
>> Hi,
>> 
>> If I have TWO nodes for ResourceManager HA, what should be the correct steps and commands to start and stop the ResourceManager in a ResourceManager HA cluster?
>> Unlike ./sbin/start-dfs.sh (which can start all NNs from a NN), it seems that ./sbin/start-yarn.sh can only start YARN on one node at a time.
>> 
>> Regards
>> Arthur
>> 
>> 
> 
> 
> 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.



Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

Posted by Xuan Gong <xg...@hortonworks.com>.
Some questions:
Q1)  I need to start YARN on EACH master separately; is this normal? Is there
a way to just run ./sbin/start-yarn.sh on rm1 and have the
STANDBY ResourceManager on rm2 started as well?

No, you need to start each RM separately (a rough sketch follows).
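
As a rough sketch of one way to do this (not an official procedure): the commands
below assume the stock Hadoop 2.4.1 sbin/ scripts on both masters and the rm1/rm2
ids from the yarn-site.xml quoted below; adjust paths to your install.

# On rm1: start the ResourceManager plus the NodeManagers listed in the slaves file
./sbin/start-yarn.sh

# On rm2: start only the second (standby) ResourceManager
./sbin/yarn-daemon.sh start resourcemanager

# Verify which RM is ACTIVE and which is STANDBY
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2

# To stop: on rm2
./sbin/yarn-daemon.sh stop resourcemanager
# and then on rm1
./sbin/stop-yarn.sh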

Q2) How can I get alerts (e.g. by email) if the ACTIVE ResourceManager goes down
in an auto-failover environment? Or how do you monitor the status of the
ACTIVE/STANDBY ResourceManagers?

Interesting question. One of the design goals of auto-failover is that RM
downtime is invisible to end users: they can keep submitting applications
normally even while a failover happens.

We can monitor the status of the RMs from the command line (as you did
above) or from the web UI / web service
(rm_address:portnumber/cluster/cluster), which reports the current state.
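
As one possible way to turn that state check into an email alert (a hedged
sketch, not something shipped with Hadoop): the script below relies only on
the yarn rmadmin -getServiceState command shown above; the mail command, the
alert address, and the rm1/rm2 ids are assumptions to adapt to your
environment, and it is meant to be run periodically from cron.

#!/bin/bash
# Hypothetical watchdog: mails an alert if neither RM reports "active".
# Assumes a configured Hadoop client on this host, the rm1/rm2 ids from
# yarn-site.xml, and a working mail command.
ALERT_TO="ops@example.com"   # assumption: replace with your alert address
active=""
for id in rm1 rm2; do
  # getServiceState prints "active" or "standby"; suppress retry noise
  state=$(yarn rmadmin -getServiceState "$id" 2>/dev/null)
  [ "$state" = "active" ] && active="$id"
done
if [ -z "$active" ]; then
  echo "No ACTIVE ResourceManager found at $(date)" \
    | mail -s "YARN RM failover alert" "$ALERT_TO"
fi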

Thanks

Xuan Gong


On Mon, Aug 11, 2014 at 5:12 PM, Arthur.hk.chan@gmail.com <
arthur.hk.chan@gmail.com> wrote:

> Hi,
>
> it is a multiple-node cluster, two master nodes (rm1 and rm2), below is my
> yarn-site.xml.
>
> At the moment, the ResourceManager HA works if:
>
> 1) at rm1, run ./sbin/start-yarn.sh
>
> yarn rmadmin -getServiceState rm1
> active
>
> yarn rmadmin -getServiceState rm2
> 14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server: rm1/
> 192.168.1.1:23142. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
> MILLISECONDS)
> Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on
> connection exception: java.net.ConnectException: Connection refused; For
> more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
>
>
> 2) at rm2, run ./sbin/start-yarn.sh
>
> yarn rmadmin -getServiceState rm1
> standby
>
>
> Some questions:
> Q1)  I need start yarn in EACH master separately, is this normal? Is there
> a way that I just run ./sbin/start-yarn.sh in rm1 and get the
> STANDBY ResourceManager in rm2 started as well?
>
> Q2) How to get alerts (e.g. by email) if the ACTIVE ResourceManager is
> down in an auto-failover env? or how do you monitor the status of
> ACTIVE/STANDBY ResourceManager?
>
>
> Regards
> Arthur
>
>
> <?xml version="1.0"?>
> <configuration>
>
> <!-- Site specific YARN configuration properties -->
>
>    <property>
>       <name>yarn.nodemanager.aux-services</name>
>       <value>mapreduce_shuffle</value>
>    </property>
>
>    <property>
>       <name>yarn.resourcemanager.address</name>
>       <value>192.168.1.1:8032</value>
>    </property>
>
>    <property>
>        <name>yarn.resourcemanager.resource-tracker.address</name>
>        <value>192.168.1.1:8031</value>
>    </property>
>
>    <property>
>        <name>yarn.resourcemanager.admin.address</name>
>        <value>192.168.1.1:8033</value>
>    </property>
>
>    <property>
>        <name>yarn.resourcemanager.scheduler.address</name>
>        <value>192.168.1.1:8030</value>
>    </property>
>
>    <property>
>       <name>yarn.nodemanager.loacl-dirs</name>
>        <value>/edh/hadoop_data/mapred/nodemanager</value>
>        <final>true</final>
>    </property>
>
>    <property>
>        <name>yarn.web-proxy.address</name>
>        <value>192.168.1.1:8888</value>
>    </property>
>
>    <property>
>       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
>       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
>    </property>
>
>
>
>
>    <property>
>       <name>yarn.nodemanager.resource.memory-mb</name>
>       <value>18432</value>
>    </property>
>
>    <property>
>       <name>yarn.scheduler.minimum-allocation-mb</name>
>       <value>9216</value>
>    </property>
>
>    <property>
>       <name>yarn.scheduler.maximum-allocation-mb</name>
>       <value>18432</value>
>    </property>
>
>
>
>   <property>
>     <name>yarn.resourcemanager.connect.retry-interval.ms</name>
>     <value>2000</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.cluster-id</name>
>     <value>cluster_rm</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.ha.rm-ids</name>
>     <value>rm1,rm2</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm1</name>
>     <value>192.168.1.1</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.hostname.rm2</name>
>     <value>192.168.1.2</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.class</name>
>
> <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.recovery.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.store.class</name>
>
> <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
>   </property>
>   <property>
>       <name>yarn.resourcemanager.zk-address</name>
>       <value>rm1:2181,m135:2181,m137:2181</value>
>   </property>
>   <property>
>
> <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
>     <value>5000</value>
>   </property>
>
>   <!-- RM1 configs -->
>   <property>
>     <name>yarn.resourcemanager.address.rm1</name>
>     <value>192.168.1.1:23140</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address.rm1</name>
>     <value>192.168.1.1:23130</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.https.address.rm1</name>
>     <value>192.168.1.1:23189</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.address.rm1</name>
>     <value>192.168.1.1:23188</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
>     <value>192.168.1.1:23125</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.admin.address.rm1</name>
>     <value>192.168.1.1:23142</value>
>   </property>
>
>
>   <!-- RM2 configs -->
>   <property>
>     <name>yarn.resourcemanager.address.rm2</name>
>     <value>192.168.1.2:23140</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address.rm2</name>
>     <value>192.168.1.2:23130</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.https.address.rm2</name>
>     <value>192.168.1.2:23189</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.address.rm2</name>
>     <value>192.168.1.2:23188</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
>     <value>192.168.1.2:23125</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.admin.address.rm2</name>
>     <value>192.168.1.2:23142</value>
>   </property>
>
>   <property>
>     <name>yarn.nodemanager.remote-app-log-dir</name>
>     <value>/edh/hadoop_logs/hadoop/</value>
>   </property>
>
> </configuration>
>
>
>
> On 12 Aug, 2014, at 1:49 am, Xuan Gong <xg...@hortonworks.com> wrote:
>
> Hey, Arthur:
>
>     Did you use single node cluster or multiple nodes cluster? Could you
> share your configuration file (yarn-site.xml) ? This looks like a
> configuration issue.
>
> Thanks
>
> Xuan Gong
>
>
> On Mon, Aug 11, 2014 at 9:45 AM, Arthur.hk.chan@gmail.com <
> arthur.hk.chan@gmail.com> wrote:
>
>> Hi,
>>
>> If I have TWO nodes for ResourceManager HA, what should be the correct
>> steps and commands to start and stop ResourceManager in a ResourceManager
>> HA cluster ?
>> Unlike ./sbin/start-dfs.sh (which can start all NNs from a NN), it seems
>> that  ./sbin/start-yarn.sh can only start YARN in a node at a time.
>>
>> Regards
>> Arthur
>>
>>
>>
>


Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi,

it is a multi-node cluster with two master nodes (rm1 and rm2); my yarn-site.xml is below.

At the moment, ResourceManager HA only works if I do the following:

1) at rm1, run ./sbin/start-yarn.sh

yarn rmadmin -getServiceState rm1
active

yarn rmadmin -getServiceState rm2
14/08/12 07:47:59 INFO ipc.Client: Retrying connect to server: rm1/192.168.1.1:23142. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
Operation failed: Call From rm2/192.168.1.2 to rm2:23142 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused


2) at rm2, run ./sbin/start-yarn.sh

yarn rmadmin -getServiceState rm1
standby


Some questions:
Q1)  I need to start YARN on EACH master separately; is this normal? Is there a way to just run ./sbin/start-yarn.sh on rm1 and have the STANDBY ResourceManager on rm2 started as well?

Q2) How can I get alerts (e.g. by email) if the ACTIVE ResourceManager goes down in an auto-failover environment? Or how do you monitor the status of the ACTIVE/STANDBY ResourceManagers?


Regards
Arthur


<?xml version="1.0"?>
<configuration>

<!-- Site specific YARN configuration properties -->

   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>

   <property>
      <name>yarn.resourcemanager.address</name>
      <value>192.168.1.1:8032</value>
   </property>

   <property>
       <name>yarn.resourcemanager.resource-tracker.address</name>
       <value>192.168.1.1:8031</value>
   </property>

   <property>
       <name>yarn.resourcemanager.admin.address</name>
       <value>192.168.1.1:8033</value>
   </property>

   <property>
       <name>yarn.resourcemanager.scheduler.address</name>
       <value>192.168.1.1:8030</value>
   </property>

   <property>
      <name>yarn.nodemanager.loacl-dirs</name>
       <value>/edh/hadoop_data/mapred/nodemanager</value>
       <final>true</final>
   </property>

   <property>
       <name>yarn.web-proxy.address</name>
       <value>192.168.1.1:8888</value>
   </property>

   <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>




   <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>18432</value>
   </property>

   <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>9216</value>
   </property>

   <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>18432</value>
   </property>



  <property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>2000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>cluster_rm</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>192.168.1.1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>192.168.1.2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property>
      <name>yarn.resourcemanager.zk-address</name>
      <value>rm1:2181,m135:2181,m137:2181</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
    <value>5000</value>
  </property>

  <!-- RM1 configs -->
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>192.168.1.1:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>192.168.1.1:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>192.168.1.1:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>192.168.1.1:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>192.168.1.1:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>192.168.1.1:23142</value>
  </property>


  <!-- RM2 configs -->
  <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>192.168.1.2:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>192.168.1.2:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm2</name>
    <value>192.168.1.2:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>192.168.1.2:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>192.168.1.2:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>192.168.1.2:23142</value>
  </property>

  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/edh/hadoop_logs/hadoop/</value>
  </property>

</configuration>



On 12 Aug, 2014, at 1:49 am, Xuan Gong <xg...@hortonworks.com> wrote:

> Hey, Arthur:
> 
>     Did you use single node cluster or multiple nodes cluster? Could you share your configuration file (yarn-site.xml) ? This looks like a configuration issue. 
> 
> Thanks
> 
> Xuan Gong
> 
> 
> On Mon, Aug 11, 2014 at 9:45 AM, Arthur.hk.chan@gmail.com <ar...@gmail.com> wrote:
> Hi,
> 
> If I have TWO nodes for ResourceManager HA, what should be the correct steps and commands to start and stop ResourceManager in a ResourceManager HA cluster ?
> Unlike ./sbin/start-dfs.sh (which can start all NNs from a NN), it seems that  ./sbin/start-yarn.sh can only start YARN in a node at a time.
> 
> Regards
> Arthur
> 
> 


Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

Posted by Xuan Gong <xg...@hortonworks.com>.
Hey, Arthur:

    Did you use a single-node cluster or a multi-node cluster? Could you
share your configuration file (yarn-site.xml)? This looks like a
configuration issue.

Thanks

Xuan Gong


On Mon, Aug 11, 2014 at 9:45 AM, Arthur.hk.chan@gmail.com <
arthur.hk.chan@gmail.com> wrote:

> Hi,
>
> If I have TWO nodes for ResourceManager HA, what should be the correct
> steps and commands to start and stop ResourceManager in a ResourceManager
> HA cluster ?
> Unlike ./sbin/start-dfs.sh (which can start all NNs from a NN), it seems
> that  ./sbin/start-yarn.sh can only start YARN in a node at a time.
>
> Regards
> Arthur
>
>
> On 11 Aug, 2014, at 11:04 pm, Arthur.hk.chan@gmail.com <
> arthur.hk.chan@gmail.com> wrote:
>
> Hi
>
> I am running Hadoop 2.4.1 with YARN HA enabled (two name nodes, NM1 and
> NM2). When verifying ResourceManager failover, I use “kill -9” to terminate
> the ResourceManager in name node 1 (NM1), if I run the the test job, it
> seems that the failover of ResourceManager keeps trying NM1 and NM2
> non-stop.
>
> Does anyone have the idea what would be wrong about this?  Thanks
>
> Regards
> Arthur
>
>
>
> bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar
> pi  5 1010000000
> Number of Maps  = 5
> Samples per Map = 1010000000
> Wrote input for Map #0
> Wrote input for Map #1
> Wrote input for Map #2
> Wrote input for Map #3
> Wrote input for Map #4
> Starting Job
> 14/08/11 22:35:23 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm2
> 14/08/11 22:35:24 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm1
> 14/08/11 22:35:25 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm2
> 14/08/11 22:35:28 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm1
> 14/08/11 22:35:30 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm2
> 14/08/11 22:35:32 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm1
> 14/08/11 22:35:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm2
> 14/08/11 22:35:37 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm1
> 14/08/11 22:35:39 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm2
> 14/08/11 22:35:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing
> over to nm1
> ….
>
>
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi,

If I have TWO nodes for ResourceManager HA, what should be the correct steps and commands to start and stop the ResourceManager in a ResourceManager HA cluster?
Unlike ./sbin/start-dfs.sh (which can start all NNs from one NN), it seems that ./sbin/start-yarn.sh can only start YARN on one node at a time.

Regards
Arthur


On 11 Aug, 2014, at 11:04 pm, Arthur.hk.chan@gmail.com <ar...@gmail.com> wrote:

> Hi 
> 
> I am running Hadoop 2.4.1 with YARN HA enabled (two name nodes, NM1 and NM2). When verifying ResourceManager failover, I use “kill -9” to terminate the ResourceManager in name node 1 (NM1), if I run the the test job, it seems that the failover of ResourceManager keeps trying NM1 and NM2 non-stop. 
> 
> Does anyone have the idea what would be wrong about this?  Thanks
> 
> Regards
> Arthur
> 
> 
> 
> bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi  5 1010000000
> Number of Maps  = 5
> Samples per Map = 1010000000
> Wrote input for Map #0
> Wrote input for Map #1
> Wrote input for Map #2
> Wrote input for Map #3
> Wrote input for Map #4
> Starting Job
> 14/08/11 22:35:23 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
> 14/08/11 22:35:24 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
> 14/08/11 22:35:25 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
> 14/08/11 22:35:28 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
> 14/08/11 22:35:30 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
> 14/08/11 22:35:32 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
> 14/08/11 22:35:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
> 14/08/11 22:35:37 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
> 14/08/11 22:35:39 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
> 14/08/11 22:35:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
> ….


Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi 

I am running Hadoop 2.4.1 with YARN HA enabled (two name nodes, NM1 and NM2). When verifying ResourceManager failover, I used “kill -9” to terminate the ResourceManager on name node 1 (NM1). If I then run the test job, it seems that the ResourceManager failover keeps retrying NM1 and NM2 non-stop. 

Does anyone have an idea what could be wrong here?  Thanks

Regards
Arthur



bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi  5 1010000000
Number of Maps  = 5
Samples per Map = 1010000000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Starting Job
14/08/11 22:35:23 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
14/08/11 22:35:24 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
14/08/11 22:35:25 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
14/08/11 22:35:28 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
14/08/11 22:35:30 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
14/08/11 22:35:32 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
14/08/11 22:35:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
14/08/11 22:35:37 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
14/08/11 22:35:39 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm2
14/08/11 22:35:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to nm1
….
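
When the client bounces between nm1 and nm2 like this, a quick first check is whether either ResourceManager actually reports itself as active; a sketch, assuming nm1 and nm2 are the configured rm-ids and that both daemons were started:

# is the ResourceManager process even running on each master?
jps | grep -i ResourceManager

# what state does each ResourceManager report?
yarn rmadmin -getServiceState nm1
yarn rmadmin -getServiceState nm2

If neither reports "active" (for example the killed one is simply down and the other never took over), the HA and automatic-failover settings in yarn-site.xml are the next thing to look at, which is what the rest of the thread gets into.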

Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager and JobHistoryServer do not auto-failover to Standby Node

Posted by Akira AJISAKA <aj...@oss.nttdata.co.jp>.
You need additional settings to make the ResourceManager fail over automatically.

http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html

The JobHistoryServer does not have an automatic failover feature.

Regards,
Akira

(2014/08/05 20:15), Arthur.hk.chan@gmail.com wrote:
> Hi
>
> I have set up the Hadoop 2.4.1 with HDFS High Availability using the
> Quorum Journal Manager.
>
> I am verifying Automatic Failover: I manually used “kill -9” command to
> disable all running Hadoop services in active node (NN-1), I can find
> that the Standby node (NN-2) now becomes ACTIVE now which is good,
> however, the “ResourceManager” service cannot be found in NN-2, please
> advise how to make ResourceManager and JobHistoryServer auto-failover?
> or do I miss some important setup? missing some settings in
> hdfs-site.xml or core-site.xml?
>
> Please help!
>
> Regards
> Arthur
>
>
>
>
> BEFORE TESTING:
> NN-1:
> jps
> 9564 NameNode
> 10176 JobHistoryServer
> 21215 Jps
> 17636 QuorumPeerMain
> 20838 NodeManager
> 9678 DataNode
> 9933 JournalNode
> 10085 DFSZKFailoverController
> 20724 ResourceManager
>
> NN-2 (Standby Name node)
> jps
> 14064 Jps
> 32046 NameNode
> 13765 NodeManager
> 32126 DataNode
> 32271 DFSZKFailoverController
>
>
>
> AFTER
> NN-1
> dips
> 17636 QuorumPeerMain
> 21508 Jps
>
> NN-2
> jps
> 32046 NameNode
> 13765 NodeManager
> 32126 DataNode
> 32271 DFSZKFailoverController
> 14165 Jps
>
>
>
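
To make the pointer above concrete: the ResourceManagerHA page boils down to roughly a handful of yarn-site.xml properties, all of which already appear in the configuration posted earlier in this thread. A sketch of how the resulting setup can be exercised from the shell (rm1/rm2 are the rm-ids from that configuration; the ZooKeeper quorum is left as a placeholder):

# properties the ResourceManagerHA page calls for:
#   yarn.resourcemanager.ha.enabled = true
#   yarn.resourcemanager.ha.automatic-failover.enabled = true
#   yarn.resourcemanager.cluster-id = cluster_rm
#   yarn.resourcemanager.ha.rm-ids = rm1,rm2
#   yarn.resourcemanager.hostname.rm1 / yarn.resourcemanager.hostname.rm2
#   yarn.resourcemanager.zk-address = <zookeeper quorum>

# with both ResourceManagers running, check their states
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2

# a manual transition is also possible; with automatic failover enabled it is
# normally refused unless forced, so it is mainly useful for diagnosis
yarn rmadmin -transitionToActive rm2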


Hadoop 2.4.1 Verifying Automatic Failover Failed: ResourceManager and JobHistoryServer do not auto-failover to Standby Node

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi 

I have set up Hadoop 2.4.1 with HDFS High Availability using the Quorum Journal Manager.

I am verifying Automatic Failover: I manually used the “kill -9” command to stop all running Hadoop services on the active node (NN-1). I can see that the Standby node (NN-2) now becomes ACTIVE, which is good; however, the “ResourceManager” service cannot be found on NN-2. Please advise how to make the ResourceManager and JobHistoryServer auto-failover. Am I missing some important setup, or some settings in hdfs-site.xml or core-site.xml?

Please help!

Regards
Arthur




BEFORE TESTING:
NN-1:
jps
9564 NameNode
10176 JobHistoryServer
21215 Jps
17636 QuorumPeerMain
20838 NodeManager
9678 DataNode
9933 JournalNode
10085 DFSZKFailoverController
20724 ResourceManager

NN-2 (Standby Name node)
jps
14064 Jps
32046 NameNode
13765 NodeManager
32126 DataNode
32271 DFSZKFailoverController



AFTER
NN-1
jps
17636 QuorumPeerMain
21508 Jps

NN-2
jps
32046 NameNode
13765 NodeManager
32126 DataNode
32271 DFSZKFailoverController
14165 Jps
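
Two things follow from the jps listings above, and a short sketch of the fix (NN-1/NN-2 and HADOOP_HOME are the names used in this thread): the standby ResourceManager has to be running on NN-2 before the failover test, and the JobHistoryServer has no failover at all, so it simply has to be (re)started wherever it should live.

# on NN-2, before killing anything on NN-1: bring up the second ResourceManager
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager

# the JobHistoryServer is a single, non-HA daemon; start it manually on the node
# that should host it (for example on NN-2 after NN-1 has gone down)
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

# confirm that ResourceManager and JobHistoryServer now show up
jps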




RE: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Posted by Brahma Reddy Battula <br...@huawei.com>.
ZKFC LOG:

By default, it will be under HADOOP_HOME/logs/hadoop_******zkfc.log

The same can be confirmed by using the following commands (to get the log location):

jinfo 7370 | grep -i hadoop.log.dir

ps -eaf | grep -i DFSZKFailoverController | grep -i hadoop.log.dir

WEB CONSOLE:

The default port for the NameNode web console is 50070. You can check the value of "dfs.namenode.http-address" in hdfs-site.xml.

Default values can be checked from the following link:

http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml





Thanks & Regards

Brahma Reddy Battula





________________________________
From: Arthur.hk.chan@gmail.com [arthur.hk.chan@gmail.com]
Sent: Monday, August 04, 2014 6:07 PM
To: user@hadoop.apache.org
Cc: Arthur.hk.chan@gmail.com
Subject: Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Hi,

Thanks for your reply.
It was about StandBy Namenode not promoted to Active.
Can you please advise what the path of ZKFC logs?

"Similar to Namenode status web page, a Cluster Web Console is added in federation to monitor the federated cluster at http://<any_nn_host:port>/dfsclusterhealth.jsp. Any Namenode in the cluster can be used to access this web page”
What is the default port for the cluster console? I tried 8088 but no luck.

Please advise.

Regards
Arthur




On 4 Aug, 2014, at 7:22 pm, Brahma Reddy Battula <br...@huawei.com>> wrote:

HI,


Do you mean the Active Namenode which was killed did not transition to STANDBY?

>>> Here the Namenode will not restart as standby if you kill it; you need to start it again manually.

      Automatic failover means that when the Active goes down, the Standby node will transition to Active automatically; it does not restart the killed process, it only makes the node that was standby become Active.

Please refer the following doc for same ..( Section : Verifying automatic failover)

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html

OR

 Do you mean the Standby Namenode did not transition to ACTIVE?

>>>> Please check the ZKFC logs; this mostly will not show up in the logs you pasted.


Thanks & Regards



Brahma Reddy Battula



________________________________
From: Arthur.hk.chan@gmail.com<ma...@gmail.com> [arthur.hk.chan@gmail.com<ma...@gmail.com>]
Sent: Monday, August 04, 2014 4:38 PM
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Cc: Arthur.hk.chan@gmail.com
Subject: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Hi,

I have setup Hadoop 2.4.1 HA Cluster using Quorum Journal, I am verifying automatic failover, after killing the process of namenode from Active one, the name node was not failover to standby node,

Please advise
Regards
Arthur


2014-08-04 18:54:40,453 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
java.net.ConnectException: Call From standbynode  to  activenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy16.rollEditLog(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:139)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:271)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:313)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 11 more
2014-08-04 18:55:03,458 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:06,683 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:16,643 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:19,530 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54610 Call#17 Retry#5: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:20,756 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby


RE: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Posted by Brahma Reddy Battula <br...@huawei.com>.
ZKFC LOG:

By Default , it will be under HADOOP_HOME/logs/hadoop_******zkfc.log

Same can be confirmed by using the following commands(to get the log location)

jinfo 7370 | grep -i hadoop.log.dir

ps -eaf | grep -i DFSZKFailoverController | grep -i hadoop.log.dir

WEB Console :

And Default port for NameNode web console is 50070. you can check value of "dfs.namenode.http-address" in hdfs-site.xml..

Default values, you can check from the following link..

http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml





Thanks & Regards

Brahma Reddy Battula





________________________________
From: Arthur.hk.chan@gmail.com [arthur.hk.chan@gmail.com]
Sent: Monday, August 04, 2014 6:07 PM
To: user@hadoop.apache.org
Cc: Arthur.hk.chan@gmail.com
Subject: Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Hi,

Thanks for your reply.
It was about StandBy Namenode not promoted to Active.
Can you please advise what the path of ZKFC logs?

"Similar to Namenode status web page, a Cluster Web Console is added in federation to monitor the federated cluster at http://<any_nn_host:port>/dfsclusterhealth.jsp. Any Namenode in the cluster can be used to access this web page”
What is the default port for the cluster console? I tried 8088 but no luck.

Please advise.

Regards
Arthur




On 4 Aug, 2014, at 7:22 pm, Brahma Reddy Battula <br...@huawei.com> wrote:

HI,


Do you mean the Active NameNode that was killed did not transition to STANDBY?

>>> The killed NameNode will not come back as standby on its own; you need to start it again manually.

      Automatic failover means that when the Active goes down, the Standby node transitions to Active automatically; it does not restart the killed process and bring it back as Active (or as standby).

Please refer to the following doc for the same (Section: Verifying automatic failover):

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html

OR

Do you mean the Standby NameNode did not transition to ACTIVE?

>>>> Please check the ZKFC logs; this mostly cannot be determined from the logs you pasted.


Thanks & Regards



Brahma Reddy Battula



________________________________
From: Arthur.hk.chan@gmail.com [arthur.hk.chan@gmail.com]
Sent: Monday, August 04, 2014 4:38 PM
To: user@hadoop.apache.org
Cc: Arthur.hk.chan@gmail.com
Subject: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Hi,

I have setup Hadoop 2.4.1 HA Cluster using Quorum Journal, I am verifying automatic failover, after killing the process of namenode from Active one, the name node was not failover to standby node,

Please advise
Regards
Arthur


2014-08-04 18:54:40,453 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
java.net.ConnectException: Call From standbynode  to  activenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy16.rollEditLog(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:139)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:271)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:313)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 11 more
2014-08-04 18:55:03,458 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:06,683 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:16,643 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:19,530 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54610 Call#17 Retry#5: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:20,756 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby


Re: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi,

Thanks for your reply.
It was about the standby NameNode not being promoted to Active.
Can you please advise the path of the ZKFC logs?

"Similar to Namenode status web page, a Cluster Web Console is added in federation to monitor the federated cluster at http://<any_nn_host:port>/dfsclusterhealth.jsp. Any Namenode in the cluster can be used to access this web page"
What is the default port for the cluster console? I tried 8088 but no luck.

Please advise.

Regards
Arthur




On 4 Aug, 2014, at 7:22 pm, Brahma Reddy Battula <br...@huawei.com> wrote:

> HI,
> 
> 
> Do you mean the Active NameNode that was killed did not transition to STANDBY?
> 
> >>> The killed NameNode will not come back as standby on its own; you need to start it again manually.
> 
>       Automatic failover means that when the Active goes down, the Standby node transitions to Active automatically; it does not restart the killed process and bring it back as Active (or as standby).
> 
> Please refer to the following doc for the same (Section: Verifying automatic failover):
> 
> http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
> 
> OR
> 
> Do you mean the Standby NameNode did not transition to ACTIVE?
> 
> >>>> Please check the ZKFC logs; this mostly cannot be determined from the logs you pasted.
> 
> 
> Thanks & Regards
>  
> Brahma Reddy Battula
>  
> 
> From: Arthur.hk.chan@gmail.com [arthur.hk.chan@gmail.com]
> Sent: Monday, August 04, 2014 4:38 PM
> To: user@hadoop.apache.org
> Cc: Arthur.hk.chan@gmail.com
> Subject: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN
> 
> Hi,
> 
> I have setup Hadoop 2.4.1 HA Cluster using Quorum Journal, I am verifying automatic failover, after killing the process of namenode from Active one, the name node was not failover to standby node, 
> 
> Please advise
> Regards
> Arthur
> 
> 
> 2014-08-04 18:54:40,453 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
> java.net.ConnectException: Call From standbynode  to  activenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
> at org.apache.hadoop.ipc.Client.call(Client.java:1414)
> at org.apache.hadoop.ipc.Client.call(Client.java:1363)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy16.rollEditLog(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:139)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:271)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:313)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
> Caused by: java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
> at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
> at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
> at org.apache.hadoop.ipc.Client.call(Client.java:1381)
> ... 11 more
> 2014-08-04 18:55:03,458 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
> 2014-08-04 18:55:06,683 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
> 2014-08-04 18:55:16,643 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
> 2014-08-04 18:55:19,530 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54610 Call#17 Retry#5: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
> 2014-08-04 18:55:20,756 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby


RE: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Posted by Brahma Reddy Battula <br...@huawei.com>.
HI,


Do you mean the Active NameNode that was killed did not transition to STANDBY?

>>> The killed NameNode will not come back as standby on its own; you need to start it again manually.

      Automatic failover means that when the Active goes down, the Standby node transitions to Active automatically; it does not restart the killed process and bring it back as Active (or as standby).

Please refer to the following doc for the same (Section: Verifying automatic failover):

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html

OR

Do you mean the Standby NameNode did not transition to ACTIVE?

>>>> Please check the ZKFC logs; this mostly cannot be determined from the logs you pasted.
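
For reference, a rough way to verify automatic failover from the command line is sketched below (nn1 and nn2 are just placeholders for whatever NameNode ids you configured under dfs.ha.namenodes.<nameservice>):

# check which NameNode is currently active
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# on the active node, find the NameNode process and kill it to simulate a crash
jps | grep -w NameNode
kill -9 <namenode-pid>

# after a few seconds the surviving NameNode should report "active"
hdfs haadmin -getServiceState nn2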



Thanks & Regards



Brahma Reddy Battula




________________________________
From: Arthur.hk.chan@gmail.com [arthur.hk.chan@gmail.com]
Sent: Monday, August 04, 2014 4:38 PM
To: user@hadoop.apache.org
Cc: Arthur.hk.chan@gmail.com
Subject: Hadoop 2.4.1 Verifying Automatic Failover Failed: Unable to trigger a roll of the active NN

Hi,

I have setup Hadoop 2.4.1 HA Cluster using Quorum Journal, I am verifying automatic failover, after killing the process of namenode from Active one, the name node was not failover to standby node,

Please advise
Regards
Arthur


2014-08-04 18:54:40,453 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
java.net.ConnectException: Call From standbynode  to  activenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy16.rollEditLog(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:139)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:271)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:313)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 11 more
2014-08-04 18:55:03,458 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:06,683 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54571 Call#17 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:16,643 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#1: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:19,530 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from activenode:54610 Call#17 Retry#5: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-08-04 18:55:20,756 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from activenode:54602 Call#0 Retry#3: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby






