Posted to common-user@hadoop.apache.org by Johannes Zillmann <jz...@googlemail.com> on 2010/06/14 19:15:27 UTC

Task process exit with nonzero status of 1 - deleting userlogs helps

Hi,

I have a 4-node cluster running hadoop-0.20.2. Now I suddenly ran into a situation where every task scheduled on 2 of the 4 nodes fails.
It seems like the child JVM crashes. There are no child logs under logs/userlogs. The tasktracker gives this:

2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201006091425_0049_m_-946174604 spawned.
2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201006091425_0049_m_003179_0 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)


At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job created logs/userlogs again and no error occurred anymore on this host.
The permissions of userlogs and userlogsOLD are exactly the same. userlogsOLD contains about 378M in 132747 files. When I copy the content of userlogsOLD back into userlogs, the tasks of the corresponding node start failing again.
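Counts along these lines (a minimal sketch, assuming GNU coreutils) give those numbers:

  du -sh logs/userlogsOLD                  # total size
  find logs/userlogsOLD -type f | wc -l    # files, recursively
  ls logs/userlogsOLD | wc -l              # entries at the first level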

Some questions:
- this seems to me like a problem with too many files in one folder - any thoughts on this?
- is the content of logs/userlogs cleaned up by hadoop regularly?
- the logs/stdout files of the tasks don't exist, and the logs/out files of the tasktracker don't have any specific message (other than the one posted above) - is there any log file left where an error message could be found?


best regards
Johannes

Re: Task process exit with nonzero status of 1 - deleting userlogs helps

Posted by Johannes Zillmann <jz...@googlemail.com>.
Seems like this is something to do with folder restrictions.
I tried:
  cp -r logs/userlogsOLD/* logs/userlogs/
and got
  cp: cannot create directory `logs/userlogs/attempt_201006091425_0049_m_003169_0': Too many links  
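"Too many links" is EMLINK: on ext3 a directory can hold at most 32000 hard links (two for the directory itself plus one per subdirectory, so roughly 31998 child directories), and mkdir fails with exactly this message once the cap is reached. A minimal check, assuming GNU stat:

  stat -c %h logs/userlogs    # hard-link count = 2 + number of subdirectories
  # a value of 32000 means no further subdirectory can be created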

Johannes

On Jun 16, 2010, at 9:30 AM, Manhee Jo wrote:

> Hi,
> 
> I've also encountered the same nonzero status of 1 error before.
> What did you set mapred.child.ulimit and mapred.child.java.opts to?
> mapred.child.ulimit must be greater than the -Xmx passed to the JavaVM,
> else the VM might not start. That's what the MR tutorial says.
> By setting a bigger ulimit, I could solve the problem.
> Hope this helps.
> 
> 
> Regards,
> Manhee
> 
> ----- Original Message ----- From: "Edward Capriolo" <ed...@gmail.com>
> To: <co...@hadoop.apache.org>
> Sent: Tuesday, June 15, 2010 2:47 AM
> Subject: Re: Task process exit with nonzero status of 1 - deleting userlogs helps
> 
> 
>> On Mon, Jun 14, 2010 at 1:15 PM, Johannes Zillmann <jzillmann@googlemail.com
>>> wrote:
>> 
>>> Hi,
>>> 
>>> I have a 4-node cluster running hadoop-0.20.2. Now I suddenly ran into
>>> a situation where every task scheduled on 2 of the 4 nodes fails.
>>> It seems like the child JVM crashes. There are no child logs under
>>> logs/userlogs. The tasktracker gives this:
>>> 
>>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In
>>> JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
>>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM
>>> Runner jvm_201006091425_0049_m_-946174604 spawned.
>>> 2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>>> jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
>>> 2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner:
>>> attempt_201006091425_0049_m_003179_0 Child Error
>>> java.io.IOException: Task process exit with nonzero status of 1.
>>>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>> 
>>> 
>>> At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job
>>> created logs/userlogs again and no error occurred anymore on this host.
>>> The permissions of userlogs and userlogsOLD are exactly the same.
>>> userlogsOLD contains about 378M in 132747 files. When I copy the content of
>>> userlogsOLD back into userlogs, the tasks of the corresponding node start
>>> failing again.
>>> 
>>> Some questions:
>>> - this seems to me like a problem with too many files in one folder - any
>>> thoughts on this?
>>> - is the content of logs/userlogs cleaned up by hadoop regularly?
>>> - the logs/stdout files of the tasks don't exist, and the logs/out files of
>>> the tasktracker don't have any specific message (other than the one posted
>>> above) - is there any log file left where an error message could be found?
>>> 
>>> 
>>> best regards
>>> Johannes
>> 
>> 
>> Most file systems have an upper limit on the number of files/subfolders in a
>> folder. You have probably hit the EXT3 limit. If you launch lots and lots of
>> jobs you can hit the limit before any cleanup happens.
>> 
>> You can experiment with cleanup and other filesystems. The following log
>> related issue might be relevant.
>> 
>> https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614
>> 
>> Regards,
>> Edward
> 
> 


Re: Task process exit with nonzero status of 1 - deleting userlogs helps

Posted by Manhee Jo <jo...@nttdocomo.com>.
Hi,

I've also encountered the same nonzero status of 1 error before.
What did you set mapred.child.ulimit and mapred.child.java.opts to?
mapred.child.ulimit must be greater than the -Xmx passed to the JavaVM,
else the VM might not start. That's what the MR tutorial says.
By setting a bigger ulimit, I could solve the problem.
Hope this helps.
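For reference, a quick way to compare the two settings (a minimal sketch: it assumes both properties are set, that each property's <value> sits on the line after its <name> in mapred-site.xml, and that the path below matches your installation):

  CONF=/path/to/hadoop/conf/mapred-site.xml   # adjust to your installation
  ulimit_kb=$(grep -A1 mapred.child.ulimit "$CONF" | sed -n 's:.*<value>\([0-9]*\)</value>.*:\1:p')
  xmx_mb=$(grep -A1 mapred.child.java.opts "$CONF" | sed -n 's:.*-Xmx\([0-9]*\)m.*:\1:p')
  # mapred.child.ulimit is in KB; it must cover the heap plus JVM overhead
  [ "$ulimit_kb" -gt $(( xmx_mb * 1024 )) ] && echo "ulimit ok" || echo "ulimit too small"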


Regards,
Manhee

----- Original Message ----- 
From: "Edward Capriolo" <ed...@gmail.com>
To: <co...@hadoop.apache.org>
Sent: Tuesday, June 15, 2010 2:47 AM
Subject: Re: Task process exit with nonzero status of 1 - deleting userlogs helps


> On Mon, Jun 14, 2010 at 1:15 PM, Johannes Zillmann 
> <jzillmann@googlemail.com
>> wrote:
>
>> Hi,
>>
>> I have a 4-node cluster running hadoop-0.20.2. Now I suddenly ran into
>> a situation where every task scheduled on 2 of the 4 nodes fails.
>> It seems like the child JVM crashes. There are no child logs under
>> logs/userlogs. The tasktracker gives this:
>>
>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In
>> JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM
>> Runner jvm_201006091425_0049_m_-946174604 spawned.
>> 2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
>> 2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner:
>> attempt_201006091425_0049_m_003179_0 Child Error
>> java.io.IOException: Task process exit with nonzero status of 1.
>>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>
>>
>> At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job
>> created logs/userlogs again and no error occurred anymore on this host.
>> The permissions of userlogs and userlogsOLD are exactly the same.
>> userlogsOLD contains about 378M in 132747 files. When I copy the content of
>> userlogsOLD back into userlogs, the tasks of the corresponding node start
>> failing again.
>>
>> Some questions:
>> - this seems to me like a problem with too many files in one folder - any
>> thoughts on this?
>> - is the content of logs/userlogs cleaned up by hadoop regularly?
>> - the logs/stdout files of the tasks don't exist, and the logs/out files of
>> the tasktracker don't have any specific message (other than the one posted
>> above) - is there any log file left where an error message could be found?
>>
>>
>> best regards
>> Johannes
>
>
> Most file systems have an upper limit on the number of files/subfolders in a
> folder. You have probably hit the EXT3 limit. If you launch lots and lots
> of jobs you can hit the limit before any cleanup happens.
>
> You can experiment with cleanup and other filesystems. The following log
> related issue might be relevant.
>
> https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614
>
> Regards,
> Edward
> 



Re: Hadoop and IP on InfiniBand (IPoIB)

Posted by Russell Brown <ru...@oracle.com>.
FYI, Allen Wittenauer,

I'm using Linux, not Solaris, but I'll pay attention to your comment 
about Solaris if I install Solaris on the cluster.  Thanks again for 
your helpful comments.

Russ

On 06/15/10 11:10 AM, Allen Wittenauer wrote:
> On Jun 15, 2010, at 7:40 AM, Russell Brown wrote:
>
>   
>> Thanks, Allen, for responding.
>>
>> So, if I understand you correctly, the dfs.datanode.dns.interface and mapred.tasktracker.dns.interface options may be used to define inbound connections only?
>>     
>
> Correct.  The daemons will bind to those interfaces and use those names as their 'official' inbound connection.
>
>   
>> Concerning the OS configuration, my /etc/hosts files assign unique host names to the ethernet and IB interfaces.  However, even if I specify the IB host names in the masters and slaves files, communication still occurs via ethernet, not via IB.
>>     
>
> BTW, are you doing this on Solaris or Linux?
>
> Solaris is notorious for not honoring inbound and outbound interfaces. [In other words, just because the packet came in on bge0, that is no guarantee that the reply will go out on bge0 if another route is available.  Particularly frustrating with NFS and SunCluster.]
>
>   
>> Your recommendation would therefore be to define IB instead of ethernet as the default network interface connection, right?
>>     
>
> Yup.  Or at least give it a lower cost in the routing table.


-- 

------------------------------------------------------------
Russell A. Brown                |  Oracle
russ.brown@oracle.com           |  UMPK14-260
(650) 786-3011 (office)         |  14 Network Circle
(650) 786-3453 (fax)            |  Menlo Park, CA 94025
------------------------------------------------------------



Re: Hadoop and IP on InfiniBand (IPoIB)

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jun 15, 2010, at 7:40 AM, Russell Brown wrote:

> Thanks, Allen, for responding.
> 
> So, if I understand you correctly, the dfs.datanode.dns.interface and mapred.tasktracker.dns.interface options may be used to define inbound connections only?

Correct.  The daemons will bind to those interfaces and use those names as their 'official' inbound connection.

> Concerning the OS configuration, my /etc/hosts files assign unique host names to the ethernet and IB interfaces.  However, even if I specify the IB host names in the masters and slaves files, communication still occurs via ethernet, not via IB.

BTW, are you doing this on Solaris or Linux?

Solaris is notorious for not honoring inbound and outbound interfaces. [In other words, just because the packet came in on bge0, that is no guarantee that the reply will go out on bge0 if another route is available.  Particularly frustrating with NFS and SunCluster.]

> Your recommendation would therefore be to define IB instead of ethernet as the default network interface connection, right?

Yup.  Or at least give it a lower cost in the routing table.
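A sketch of the lower-cost route on Linux (the interface names and the subnet are illustrative; adjust to your fabric):

  # prefer IPoIB for cluster traffic by giving ib0 the cheaper metric
  ip route add 10.0.0.0/24 dev ib0 metric 10
  ip route add 10.0.0.0/24 dev eth0 metric 100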

Re: Hadoop and IP on InfiniBand (IPoIB)

Posted by Russell Brown <ru...@oracle.com>.
Thanks, Allen, for responding.

So, if I understand you correctly, the dfs.datanode.dns.interface and 
mapred.tasktracker.dns.interface options may be used to define inbound 
connections only?

Concerning the OS configuration, my /etc/hosts files assign unique host 
names to the ethernet and IB interfaces.  However, even if I specify the 
IB host names in the masters and slaves files, communication still 
occurs via ethernet, not via IB.

Your recommendation would therefore be to define IB instead of ethernet 
as the default network interface connection, right?

Thanks,

Russ


On 06/14/10 12:32 PM, Allen Wittenauer wrote:
> On Jun 14, 2010, at 10:57 AM, Russell Brown wrote:
>
>   
>> I'm a new user of Hadoop.  I have a Linux cluster with both gigabit ethernet and InfiniBand communications interfaces.  Could someone please tell me how to switch IP communication from ethernet (the default) to InfiniBand?  Thanks.
>>     
>
>
> Hadoop will bind inbound connections via the interface settings in the various hadoop configuration files.  Outbound connections are unbound and based solely on OS configuration.  I filed a jira to fix this, but it is obviously low priority since few people run multi-nic boxes.  Best bet is to down the ethernet and up the IB, changing routing, etc, as necessary.


-- 

------------------------------------------------------------
Russell A. Brown                |  Oracle
russ.brown@oracle.com           |  UMPK14-260
(650) 786-3011 (office)         |  14 Network Circle
(650) 786-3453 (fax)            |  Menlo Park, CA 94025
------------------------------------------------------------



Re: Hadoop and IP on InfiniBand (IPoIB)

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jun 14, 2010, at 10:57 AM, Russell Brown wrote:

> I'm a new user of Hadoop.  I have a Linux cluster with both gigabit ethernet and InfiniBand communications interfaces.  Could someone please tell me how to switch IP communication from ethernet (the default) to InfiniBand?  Thanks.


Hadoop will bind inbound connections via the interface settings in the various hadoop configuration files.  Outbound connections are unbound and based solely on OS configuration.  I filed a jira to fix this, but it is obviously low priority since few people run multi-nic boxes.  Best bet is to down the ethernet and up the IB, changing routing, etc, as necessary.
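Concretely, something along these lines on each node (a minimal sketch; the interface names are illustrative, and make sure you have console access before taking eth0 down):

  ifdown eth0      # stop carrying traffic on the gigabit interface
  ifup ib0         # bring up the IPoIB interface
  ip route show    # verify the cluster subnet now routes via ib0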

Hadoop and IP on InfiniBand (IPoIB)

Posted by Russell Brown <ru...@oracle.com>.
I'm a new user of Hadoop.  I have a Linux cluster with both gigabit 
ethernet and InfiniBand communications interfaces.  Could someone please 
tell me how to switch IP communication from ethernet (the default) to 
InfiniBand?  Thanks.

-- 

------------------------------------------------------------
Russell A. Brown                |  Oracle
russ.brown@oracle.com           |  UMPK14-260
(650) 786-3011 (office)         |  14 Network Circle
(650) 786-3453 (fax)            |  Menlo Park, CA 94025
------------------------------------------------------------



Re: Task process exit with nonzero status of 1 - deleting userlogs helps

Posted by Amareshwari Sri Ramadasu <am...@yahoo-inc.com>.
The issue is fixed in branch 0.21 through http://issues.apache.org/jira/browse/MAPREDUCE-927.
Now the attempt directories are moved inside the job directory, so the userlogs directory will contain only job directories.
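Illustratively (the directory names below are assumptions based on the attempt IDs earlier in this thread):

  # 0.20.x layout: one directory per attempt, all at the top level
  logs/userlogs/attempt_201006091425_0049_m_003179_0/
  # 0.21 layout: attempts nested under their job's directory
  logs/userlogs/job_201006091425_0049/attempt_201006091425_0049_m_003179_0/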

Thanks
Amareshwari
On 6/16/10 12:47 PM, "Johannes Zillmann" <jz...@googlemail.com> wrote:

Hi Edward,

I copied the userlogs folder which caused the error.
Two things speak against the too-many-files theory:
a) I can add new files to this folder (touch userlogsOLD/a, etc.)
b) the sysctl fs.file-max shows 817874, whereas the file count on the first level of userlogsOLD is 31999 and the recursive count is 107400.

Any thoughts?
Johannes


On Jun 14, 2010, at 7:47 PM, Edward Capriolo wrote:

> On Mon, Jun 14, 2010 at 1:15 PM, Johannes Zillmann <jzillmann@googlemail.com
>> wrote:
>
>> Hi,
>>
>> I have a 4-node cluster running hadoop-0.20.2. Now I suddenly ran into
>> a situation where every task scheduled on 2 of the 4 nodes fails.
>> It seems like the child JVM crashes. There are no child logs under
>> logs/userlogs. The tasktracker gives this:
>>
>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In
>> JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM
>> Runner jvm_201006091425_0049_m_-946174604 spawned.
>> 2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
>> 2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner:
>> attempt_201006091425_0049_m_003179_0 Child Error
>> java.io.IOException: Task process exit with nonzero status of 1.
>>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>
>>
>> At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job
>> created logs/userlogs again and no error occurred anymore on this host.
>> The permissions of userlogs and userlogsOLD are exactly the same.
>> userlogsOLD contains about 378M in 132747 files. When I copy the content of
>> userlogsOLD back into userlogs, the tasks of the corresponding node start
>> failing again.
>>
>> Some questions:
>> - this seems to me like a problem with too many files in one folder - any
>> thoughts on this?
>> - is the content of logs/userlogs cleaned up by hadoop regularly?
>> - the logs/stdout files of the tasks don't exist, and the logs/out files of
>> the tasktracker don't have any specific message (other than the one posted
>> above) - is there any log file left where an error message could be found?
>>
>>
>> best regards
>> Johannes
>
>
> Most file systems have an upper limit on the number of files/subfolders in a
> folder. You have probably hit the EXT3 limit. If you launch lots and lots of
> jobs you can hit the limit before any cleanup happens.
>
> You can experiment with cleanup and other filesystems. The following log
> related issue might be relevant.
>
> https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614
>
> Regards,
> Edward



Re: Task process exit with nonzero status of 1 - deleting userlogs helps

Posted by Johannes Zillmann <jz...@googlemail.com>.
Hi Edward,

I copied the userlogs folder which caused the error.
Two things speak against the too-many-files theory:
a) I can add new files to this folder (touch userlogsOLD/a, etc.)
b) the sysctl fs.file-max shows 817874, whereas the file count on the first level of userlogsOLD is 31999 and the recursive count is 107400.
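For what it's worth, both observations are compatible with the link-count cap rather than an open-file limit: fs.file-max caps open file handles system-wide, not directory entries, and ext3's 32000-hard-link cap counts subdirectories (each child's '..' entry adds a link to the parent) but not plain files. A minimal check, assuming GNU stat:

  stat -c %h logs/userlogsOLD     # 2 + subdirectory count; 32000 is the ext3 cap
  touch logs/userlogsOLD/a        # succeeds: plain files add no link
  mkdir logs/userlogsOLD/probe    # fails with "Too many links" at the cap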

Any thoughts?
Johannes


On Jun 14, 2010, at 7:47 PM, Edward Capriolo wrote:

> On Mon, Jun 14, 2010 at 1:15 PM, Johannes Zillmann <jzillmann@googlemail.com
>> wrote:
> 
>> Hi,
>> 
>> I have a 4-node cluster running hadoop-0.20.2. Now I suddenly ran into
>> a situation where every task scheduled on 2 of the 4 nodes fails.
>> It seems like the child JVM crashes. There are no child logs under
>> logs/userlogs. The tasktracker gives this:
>> 
>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In
>> JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM
>> Runner jvm_201006091425_0049_m_-946174604 spawned.
>> 2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
>> 2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner:
>> attempt_201006091425_0049_m_003179_0 Child Error
>> java.io.IOException: Task process exit with nonzero status of 1.
>>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>> 
>> 
>> At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job
>> created logs/userlogs again and no error occurred anymore on this host.
>> The permissions of userlogs and userlogsOLD are exactly the same.
>> userlogsOLD contains about 378M in 132747 files. When I copy the content of
>> userlogsOLD back into userlogs, the tasks of the corresponding node start
>> failing again.
>> 
>> Some questions:
>> - this seems to me like a problem with too many files in one folder - any
>> thoughts on this?
>> - is the content of logs/userlogs cleaned up by hadoop regularly?
>> - the logs/stdout files of the tasks don't exist, and the logs/out files of
>> the tasktracker don't have any specific message (other than the one posted
>> above) - is there any log file left where an error message could be found?
>> 
>> 
>> best regards
>> Johannes
> 
> 
> Most file systems have an upper limit on the number of files/subfolders in a
> folder. You have probably hit the EXT3 limit. If you launch lots and lots of
> jobs you can hit the limit before any cleanup happens.
> 
> You can experiment with cleanup and other filesystems. The following log
> related issue might be relevant.
> 
> https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614
> 
> Regards,
> Edward


Re: Task process exit with nonzero status of 1 - deleting userlogs helps

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Jun 14, 2010 at 1:15 PM, Johannes Zillmann <jzillmann@googlemail.com
> wrote:

> Hi,
>
> I have a 4-node cluster running hadoop-0.20.2. Now I suddenly ran into
> a situation where every task scheduled on 2 of the 4 nodes fails.
> It seems like the child JVM crashes. There are no child logs under
> logs/userlogs. The tasktracker gives this:
>
> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In
> JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM
> Runner jvm_201006091425_0049_m_-946174604 spawned.
> 2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM :
> jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
> 2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner:
> attempt_201006091425_0049_m_003179_0 Child Error
> java.io.IOException: Task process exit with nonzero status of 1.
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>
>
> At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job
> created logs/userlogs again and no error occurred anymore on this host.
> The permissions of userlogs and userlogsOLD are exactly the same.
> userlogsOLD contains about 378M in 132747 files. When I copy the content of
> userlogsOLD back into userlogs, the tasks of the corresponding node start
> failing again.
>
> Some questions:
> - this seems to me like a problem with too many files in one folder - any
> thoughts on this?
> - is the content of logs/userlogs cleaned up by hadoop regularly?
> - the logs/stdout files of the tasks don't exist, and the logs/out files of
> the tasktracker don't have any specific message (other than the one posted
> above) - is there any log file left where an error message could be found?
>
>
> best regards
> Johannes


Most file systems have an upper limit on the number of files/subfolders in a
folder. You have probably hit the EXT3 limit. If you launch lots and lots of
jobs you can hit the limit before any cleanup happens.

You can experiment with cleanup and other filesystems. The following log
related issue might be relevant.

https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614
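Until then, a periodic purge keeps the directory below the cap. A minimal sketch (the path, the attempt_* layout, and the one-day retention window are assumptions; dry-run first, and never remove the logs of still-running jobs):

  find /path/to/hadoop/logs/userlogs -maxdepth 1 -type d \
       -name 'attempt_*' -mtime +1 -print      # dry run: list candidates
  # once verified, swap -print for: -exec rm -rf {} +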

Regards,
Edward