Posted to common-user@hadoop.apache.org by Greg Langmead <gl...@sdl.com> on 2010/11/16 23:50:17 UTC

Problem identifying cause of a failed job

Newbie alert.

I have a Pig script I tested on small data and am now running it on a larger
data set (85GB). My cluster is two machines right now, each with 16 cores
and 32G of ram. I configured Hadoop to have 15 tasktrackers on each of these
nodes. One of them is the namenode, one is the secondary name node. I'm
using Pig 0.7.0 and Hadoop 0.20.2 with Java 1.6.0_18 on Linux Fedora Core
12, 64-bit.

My Pig job starts, and eventually a reduce task fails. I'd like to find out
why. Here's what I know:

The webUI lists the failed reduce tasks and indicates this error:

java.io.IOException: Task process exit with nonzero status of 134.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

The userlog userlogs/attempt_201011151350_0001_r_000063_0/stdout says this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff74158463c, pid=27109, tid=140699912791824
#
# JRE version: 6.0_18-b07
# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode linux-amd64 )
[thread 140699484784400 also had an error]
# Problematic frame:
# V  [libjvm.so+0x62263c]
#
# An error report file with more information is saved as:
# /tmp/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_201011151350_0001/attempt_201011151350_0001_r_000063_0/work/hs_err_pid27109.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

My mapred-site.xml already includes this:

<property>
<name>keep.failed.task.files</name>
<value>true</value>
</property>

So I was hoping that the file hs_err_pid27109.log would exist but it
doesn't. I was sure to check the /tmp dir on both tasktrackers. In fact
there is no dir  

  jobcache/job_201011151350_0001/attempt_201011151350_0001_r_000063_0

only

  jobcache/job_201011151350_0001/attempt_201011151350_0001_r_000063_0.cleanup

I'd like to find the source of the segfault. Can anyone point me in the
right direction?
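
One thing I'm considering trying, though I haven't tested it yet, is pinning
the JVM crash log to a fixed location via mapred.child.java.opts, so that the
hs_err file survives even if the attempt's working directory is cleaned up.
Something along these lines (the heap size and the path are just examples,
not what I actually have configured):

<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m -XX:ErrorFile=/tmp/hadoop-crash-logs/hs_err_pid%p.log</value>
</property>

Does that sound like a reasonable approach?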

Of course let me know if you need more information!

Greg Langmead | Senior Research Scientist | SDL Language Weaver | (t) +1 310 437 7300
  
SDL PLC confidential, all rights reserved.
If you are not the intended recipient of this mail SDL requests and requires that you delete it without acting upon or copying any of its contents, and we further request that you advise us.
SDL PLC is a public limited company registered in England and Wales.  Registered number: 02675207.
Registered address: Globe House, Clivemont Road, Maidenhead, Berkshire SL6 7DY, UK.

Re: Problem identifying cause of a failed job

Posted by Kiss Tibor <ki...@gmail.com>.
Hi!

I had a similar issue myself with just a simple distcp from s3n to HDFS,
copying a small file (< 10 MB).
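
The command is basically just this (bucket and path names are placeholders
here, not my real ones):

  hadoop distcp s3n://my-bucket/path/to/file /tmp/copied-file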

If I start an m1.small instance it works; if I start an m1.large I always get
this error in the tasktracker. See the attached log file.

Unfortunately, I already have the latest JDK version here.

Tibor

On Wed, Nov 17, 2010 at 6:44 PM, Matt Pouttu-Clarke <
Matt.Pouttu-Clarke@icrossing.com> wrote:

> We were getting SIGSEGV and fixed it by upgrading the JVM.  We are using
> 1.6.0_21 currently.

Re: AWS Hadoop 20.2 AMIs

Posted by Chris Schneider <Sc...@TransPac.com>.
Hi Mike,

At 10:41 AM -0800 11/17/10, Gangl, Michael E (388K) wrote:
>I've been running into an issue today.
>
>I'm trying to procure 5 c1.xlarge instances on Amazon EC2. I was able to use the 453820947548/bixolabs-hadoop-0.20.2-i386 AMI for my previous m1.large instances, so I figured I could use the c1.xlarge instances with the x86_64 versions.

Note that the 453820947548/bixolabs-hadoop-0.20.2-i386 AMI (ami-48b54021) was designed for m1.small and c1.medium instances. We put together a different public AMI (ami-827c8beb) for m1.large instances.

>When I start these with the src/contrib/ec2/bin scripts, however, the master starts but then I'm unable to connect to the xlarge instances. I can still use the m1.large instances, but these are too slow for me, so I'd like to use the bigger machine. Has anyone else been having problems, today or in the past, with getting an AMI to work on the xlarge instances?

At 11:26 AM -0800 11/17/10, Gangl, Michael E (388K) wrote:
>FYI, I commented out the kernel version in the hadoop-ec2-env.sh script for the c1.xlarge if statements (at the bottom).
>
>Before it was using aki-427d952b
>
>Now it's using aki-b51cf9dc
>
>And I'm able to connect. Turns out the problem was a hang during boot. This should probably be changed in future releases of the ec2 scripts (if it's not changed already :) )

Hadoop 0.20.2's src/contrib/ec2/bin/hadoop-ec2-env.sh.template actually specifies $AMI_KERNEL=aki-800e5f2 for c1.xlarge:

http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.2/src/contrib/ec2/bin/hadoop-ec2-env.sh.template

I haven't tried launching any c1.xlarge instances myself, though.

FYI,

- Chris
-- 
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------

Re: AWS Hadoop 20.2 AMIs

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Mike,

Do you have time to submit a patch? You could probably create a JIRA issue here [1] and then attach a diff of your update...
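
A patch is usually just an svn diff taken from a checkout of the Hadoop
source, something along the lines of:

  svn diff src/contrib/ec2/bin/hadoop-ec2-env.sh.template > HADOOP-XXXX.patch

where XXXX is the number of the issue you create.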

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/HADOOP



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: AWS Hadoop 20.2 AMIs

Posted by "Gangl, Michael E (388K)" <Mi...@jpl.nasa.gov>.
FYI, I commented out the kernel version in the hadoop-ec2-env.sh script for the c1.xlarge if statements (at the bottom).

Before it was using aki-427d952b

Now it's using aki-b51cf9dc

And I'm able to connect. Turns out the problem was a hang during boot. This should probably be changed in future releases of the ec2 scripts (if it's not changed already :) )
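
Roughly, the change looks like this (a sketch only -- the exact variable names
and layout in hadoop-ec2-env.sh may differ slightly):

if [ "$INSTANCE_TYPE" == "c1.xlarge" ]; then
  # AMI_KERNEL=aki-427d952b  # commented out; the instance then boots with EC2's default kernel
fi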

-Mike




AWS Hadoop 20.2 AMIs

Posted by "Gangl, Michael E (388K)" <Mi...@jpl.nasa.gov>.
I've been running into an issue today.

I'm trying to procure 5 c1.xlarge instances on Amazon EC2. I was able to use the 453820947548/bixolabs-hadoop-0.20.2-i386 AMI for my previous m1.large instances, so I figured I could use the c1.xlarge instances with the x86_64 versions.

When I start these with the src/contrib/ec2/bin scripts, however, the master starts but then I'm unable to connect to the xlarge instances. I can still use the m1.large instances, but these are too slow for me, so I'd like to use the bigger machine. Has anyone else been having problems, today or in the past, with getting an AMI to work on the xlarge instances?

Thanks,

Mike

Re: Problem identifying cause of a failed job

Posted by Matt Pouttu-Clarke <Ma...@icrossing.com>.
We were getting SIGSEGV and fixed it by upgrading the JVM.  We are  
using 1.6.0_21 currently.
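
For what it's worth, the upgrade itself is typically just a matter of
installing the newer JDK on every node and pointing JAVA_HOME at it in
conf/hadoop-env.sh, along these lines (the install path will vary):

  export JAVA_HOME=/usr/java/jdk1.6.0_21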





iCrossing Privileged and Confidential Information
This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information of iCrossing. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.