Posted to reviews@ambari.apache.org by Weiwei Yang <ch...@hotmail.com> on 2017/01/03 06:09:44 UTC

Re: Review Request 55009: HDFS Service check fails if previous active NN is down

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55009/
-----------------------------------------------------------

(Updated Jan. 3, 2017, 6:09 a.m.)


Review request for Ambari, Alejandro Fernandez and Di Li.


Bugs: AMBARI-19289
    https://issues.apache.org/jira/browse/AMBARI-19289


Repository: ambari


Description
-------

On an HA cluster, I manually stopped the active NN; the standby took over and became active, and HDFS was healthy. However, the HDFS service check failed.


Diffs
-----

  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/service_check.py 737ae04 
  ambari-server/src/test/python/stacks/2.0.6/HDFS/test_service_check.py bbc1b3a 

Diff: https://reviews.apache.org/r/55009/diff/


Testing (updated)
-------

HA cluster

1. Both NN started

Service check succeeded

2. Shut down active

Service check succeeded

3. Shut down standby

Service check succeeded

4. Both NN started and manually put NN in safemode

Service check failed and the stack trace looked like:

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/service_check.py", line 138, in <module>
    HdfsServiceCheck().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
  ...
  resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT --data-binary @/etc/passwd 'http://eked2.fyre.ibm.com:50070/webhdfs/v1/tmp/id10ac8da7_date010217?op=CREATE&user.name=hdfs&overwrite=True'' returned status_code=403. 
{
  "RemoteException": {
    "exception": "RemoteException", 
    "javaClassName": "org.apache.hadoop.ipc.RemoteException", 
    "message": "Cannot create file/tmp/id10ac8da7_date010217. Name node is in safe mode.\nIt was turned on manually. Use \"hdfs dfsadmin -safemode leave\" to turn safe mode off.\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode
...

5. Shut down active and standby

Service check failed

Non-HA cluster

1. NN started

Service check succeeded

2. NN started and manually put NN in safemode

Service check failed and the stack trace looked like #4 of the HA setup

3. NN stopped

Service check failed
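
The HA and non-HA results above can be summarised as a small table-driven check. This is only an illustrative sketch, not code from the patch: the function name `expected_service_check`, the helper `check_matrix`, and the case tuples are hypothetical, and the rule it encodes (pass only when at least one NN is up and HDFS is not in safemode) is inferred from the results listed above.

```python
def expected_service_check(namenodes_up, safemode):
    """Expected outcome inferred from the test results above: the check
    passes only if at least one NameNode is up and safemode is off."""
    return namenodes_up >= 1 and not safemode


# Each case: (description, NameNodes up, safemode on, observed result)
HA_CASES = [
    ("both NN started", 2, False, True),
    ("active shut down", 1, False, True),
    ("standby shut down", 1, False, True),
    ("both NN started, safemode on", 2, True, False),
    ("active and standby shut down", 0, False, False),
]

NON_HA_CASES = [
    ("NN started", 1, False, True),
    ("NN started, safemode on", 1, True, False),
    ("NN stopped", 0, False, False),
]


def check_matrix(cases):
    """Return descriptions of cases whose observed result differs from
    the inferred rule (an empty list means the rule fits every case)."""
    return [desc for desc, up, safemode, observed in cases
            if expected_service_check(up, safemode) != observed]
```

Calling `check_matrix(HA_CASES)` and `check_matrix(NON_HA_CASES)` lists any scenario that does not follow the inferred rule.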


Thanks,

Weiwei Yang


Re: Review Request 55009: HDFS Service check fails if previous active NN is down

Posted by Weiwei Yang <ch...@hotmail.com>.

> On Jan. 9, 2017, 7:30 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/service_check.py, line 40
> > <https://reviews.apache.org/r/55009/diff/1/?file=1591530#file1591530line40>
> >
> >     Please also include in HDFS 3.0.0.3.0 for now until we know for sure. It's ok to comment it out and add a note as to why it was commented.
> >     
> >     Fix it and Ship it
> 
> Weiwei Yang wrote:
>     Hi Alejandro
>     
>     I uploaded a new patch. I commented out the code and added a note. I've also made the changes in both the 2.1.0.2.0 and 3.0.0.3.0 packages. Please let me know if this looks good now. Thank you.

Hi Alejandro

Does the latest patch look good to you? Please let me know.


- Weiwei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55009/#review160941
-----------------------------------------------------------




Re: Review Request 55009: HDFS Service check fails if previous active NN is down

Posted by Weiwei Yang <ch...@hotmail.com>.

> On Jan. 9, 2017, 7:30 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/service_check.py, line 40
> > <https://reviews.apache.org/r/55009/diff/1/?file=1591530#file1591530line40>
> >
> >     Please also include in HDFS 3.0.0.3.0 for now until we know for sure. It's ok to comment it out and add a note as to why it was commented.
> >     
> >     Fix it and Ship it

Hi Alejandro

I uploaded a new patch. I commented out the code and added a note. I've also made the changes in both the 2.1.0.2.0 and 3.0.0.3.0 packages. Please let me know if this looks good now. Thank you.


- Weiwei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55009/#review160941
-----------------------------------------------------------




Re: Review Request 55009: HDFS Service check fails if previous active NN is down

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55009/#review160941
-----------------------------------------------------------


Fix it, then Ship it!





ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/service_check.py (line 40)
<https://reviews.apache.org/r/55009/#comment232178>

    Please also include in HDFS 3.0.0.3.0 for now until we know for sure. It's ok to comment it out and add a note as to why it was commented.
    
    Fix it and Ship it


- Alejandro Fernandez




Re: Review Request 55009: HDFS Service check fails if previous active NN is down

Posted by Weiwei Yang <ch...@hotmail.com>.

> On Jan. 5, 2017, 7:26 p.m., Alejandro Fernandez wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/service_check.py, line 40
> > <https://reviews.apache.org/r/55009/diff/1/?file=1591530#file1591530line40>
> >
> >     Do these changes also need to be made for HDFS 3.0.0.3.0 ?

Hi Alejandro

Yes. HDFS-8277 is still open, and there is some debate about maintaining the consistency of transactions. As a result, HDFS 3.0 doesn't have a fix that lets the command

```
hdfs dfsadmin -safemode get
```

work when only one NameNode is up. Until this is resolved, it is inappropriate to use this command on an HA cluster to check safemode status, so I propose removing it.
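
To illustrate why safemode is still caught without that command: in the test results, the WebHDFS write itself fails with status 403 and a `RemoteException` whose message names safe mode. The following is a minimal sketch of recognising that response, not the code in the patch. The helper names `split_curl_output` and `is_safemode_failure` are hypothetical, and the sketch assumes the raw output of `curl -w '%{http_code}'`, where the three-digit status code is appended after the response body.

```python
import json


def split_curl_output(raw):
    """Split raw `curl -w '%{http_code}'` output into (body, status_code).

    Assumption: the last three characters are the HTTP status code that
    curl writes after the response body.
    """
    body, code = raw[:-3], raw[-3:]
    if not code.isdigit():
        raise ValueError("no trailing HTTP status code in curl output")
    return body, int(code)


def is_safemode_failure(status_code, body):
    """Return True if a WebHDFS response looks like a safe-mode rejection."""
    if status_code != 403:
        return False
    try:
        message = json.loads(body)["RemoteException"]["message"]
    except (ValueError, KeyError, TypeError):
        # Body is not JSON, or not shaped like a RemoteException.
        return False
    return "safe mode" in str(message).lower()
```

A caller could use `is_safemode_failure` to report a clearer error than the raw stack trace when the write is rejected because the NameNode is in safe mode.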


- Weiwei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55009/#review160618
-----------------------------------------------------------




Re: Review Request 55009: HDFS Service check fails if previous active NN is down

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55009/#review160618
-----------------------------------------------------------




ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/service_check.py (line 40)
<https://reviews.apache.org/r/55009/#comment231756>

    Do these changes also need to be made for HDFS 3.0.0.3.0 ?


- Alejandro Fernandez




Re: Review Request 55009: HDFS Service check fails if previous active NN is down

Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55009/#review161285
-----------------------------------------------------------


Ship it!




Ship It!

- Alejandro Fernandez




Re: Review Request 55009: HDFS Service check fails if previous active NN is down

Posted by Weiwei Yang <ch...@hotmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55009/
-----------------------------------------------------------

(Updated Jan. 10, 2017, 2:25 a.m.)


Review request for Ambari, Alejandro Fernandez and Di Li.


Changes
-------

Commented out the safemode check code and added comments explaining why it is commented out.


Bugs: AMBARI-19289
    https://issues.apache.org/jira/browse/AMBARI-19289


Repository: ambari


Description
-------

On an HA cluster, I manually stopped the active NN; the standby took over and became active, and HDFS was healthy. However, the HDFS service check failed.


Diffs (updated)
-----

  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/service_check.py 47fc646 
  ambari-server/src/main/resources/common-services/HDFS/3.0.0.3.0/package/scripts/service_check.py 981f002 
  ambari-server/src/test/python/stacks/2.0.6/HDFS/test_service_check.py bbc1b3a 

Diff: https://reviews.apache.org/r/55009/diff/


Testing
-------

HA cluster

1. Both NN started

Service check succeeded

2. Shut down active

Service check succeeded

3. Shut down standby

Service check succeeded

4. Both NN started and manually put NN in safemode

Service check failed and the stack trace looked like:

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/service_check.py", line 138, in <module>
    HdfsServiceCheck().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
  ...
  resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT --data-binary @/etc/passwd 'http://eked2.fyre.ibm.com:50070/webhdfs/v1/tmp/id10ac8da7_date010217?op=CREATE&user.name=hdfs&overwrite=True'' returned status_code=403. 
{
  "RemoteException": {
    "exception": "RemoteException", 
    "javaClassName": "org.apache.hadoop.ipc.RemoteException", 
    "message": "Cannot create file/tmp/id10ac8da7_date010217. Name node is in safe mode.\nIt was turned on manually. Use \"hdfs dfsadmin -safemode leave\" to turn safe mode off.\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode
...

5. Shut down active and standby

Service check failed

Non-HA cluster

1. NN started

Service check succeeded

2. NN started and manually put NN in safemode

Service check failed and the stack trace looked like #4 of the HA setup

3. NN stopped

Service check failed


Thanks,

Weiwei Yang