Posted to yarn-issues@hadoop.apache.org by "zhihai xu (JIRA)" <ji...@apache.org> on 2014/11/18 02:57:34 UTC

[jira] [Created] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

zhihai xu created YARN-2873:
-------------------------------

             Summary: improve LevelDB error handling for missing files DBException to avoid NM start failure.
                 Key: YARN-2873
                 URL: https://issues.apache.org/jira/browse/YARN-2873
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
    Affects Versions: 2.5.0
            Reporter: zhihai xu
            Assignee: zhihai xu


Improve LevelDB error handling for missing-file DBExceptions so that the NM does not fail to start.
We saw the following three LevelDB exceptions, each of which causes the NM to fail to start.
DBException 1 in ShuffleHandler:
{code}
INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state STARTED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/000005.sst
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/000005.sst
	at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/000005.sst
	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
	at org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
	at org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
	at org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	... 10 more
{code}

DBException 2 in NMLeveldbStateStoreService:
{code}
Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000005.sst
	at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000005.sst
	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:842)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:195)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
{code}

DBException 3 in NMLeveldbStateStoreService:
{code}
INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService failed in state INITED; cause: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/MANIFEST-000004: No such file or directory
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/MANIFEST-000004: No such file or directory
	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:842)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:195)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
{code}

DBExceptions 1 and 2 occur because the sorted table file 000005.sst was accidentally deleted.
DBException 3 occurs because the MANIFEST file (MANIFEST-000004) was accidentally deleted.

It would be better to handle these errors than to let the NM fail to start with a DBException.
For these DBExceptions, if we delete the LevelDB text file CURRENT from the affected database directory, the NM recovers successfully and starts.
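
The handling proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: the class LevelDbOpenRecovery, the StoreOpener interface, and both method names are hypothetical, and StoreOpener stands in for the real open call (JniDBFactory.open in the traces above) so that the sketch needs only the JDK. It matches the two error messages seen in the logs, deletes CURRENT, and retries the open once. Deleting CURRENT presumably makes LevelDB treat the directory as having no database, so previously stored recovery state would likely be lost; that trade-off is assumed acceptable here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; none of these names come from Hadoop or leveldbjni.
class LevelDbOpenRecovery {

    /** Stand-in for the real database open call (an assumption for this sketch). */
    @FunctionalInterface
    interface StoreOpener<T> {
        T open(Path dbDir) throws IOException;
    }

    /** Returns true for the two "missing file" messages shown in the logs above. */
    public static boolean isMissingFileError(Throwable t) {
        String msg = t.getMessage();
        if (msg == null) {
            return false;
        }
        boolean missingSst = msg.contains("Corruption:") && msg.contains("missing files");
        boolean missingManifest = msg.contains("IO error:")
                && msg.contains("MANIFEST-")
                && msg.contains("No such file or directory");
        return missingSst || missingManifest;
    }

    /**
     * Tries to open the store. On a missing-file error, deletes the CURRENT
     * file in dbDir and retries exactly once. Any other error, or a failure
     * of the retry, is propagated to the caller unchanged.
     */
    public static <T> T openWithRecovery(Path dbDir, StoreOpener<T> opener)
            throws IOException {
        try {
            return opener.open(dbDir);
        } catch (IOException | RuntimeException e) {
            if (!isMissingFileError(e)) {
                throw e;
            }
            // Assumption: losing the old recovery state is preferable to the
            // NM failing to start.
            Files.deleteIfExists(dbDir.resolve("CURRENT"));
            return opener.open(dbDir);
        }
    }
}
```

A caller would pass the database directory and a lambda wrapping the real open call. Matching on message text is fragile, so a real fix might also log the original exception before retrying.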


