Posted to mapreduce-user@hadoop.apache.org by Trevor <tr...@scurrilous.com> on 2012/07/17 23:24:22 UTC

MRv2 jobs fail when run with more than one slave

Hi all,

I recently upgraded from CDH4b2 (0.23.1) to CDH4 (2.0.0). Now for some
strange reason, my MRv2 jobs (TeraGen, specifically) fail if I run with
more than one slave. For every slave except the one running the Application
Master, I get the following failed tasks and warnings repeatedly:

12/07/13 14:21:55 INFO mapreduce.Job: Running job: job_1342207265272_0001
12/07/13 14:22:17 INFO mapreduce.Job: Job job_1342207265272_0001 running in uber mode : false
12/07/13 14:22:17 INFO mapreduce.Job:  map 0% reduce 0%
12/07/13 14:22:46 INFO mapreduce.Job:  map 1% reduce 0%
12/07/13 14:22:52 INFO mapreduce.Job:  map 2% reduce 0%
12/07/13 14:22:55 INFO mapreduce.Job:  map 3% reduce 0%
12/07/13 14:22:58 INFO mapreduce.Job:  map 4% reduce 0%
12/07/13 14:23:04 INFO mapreduce.Job:  map 5% reduce 0%
12/07/13 14:23:07 INFO mapreduce.Job:  map 6% reduce 0%
12/07/13 14:23:07 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000004_0, Status : FAILED
12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stdout
12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stderr
12/07/13 14:23:08 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000003_0, Status : FAILED
12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000003_0&filter=stdout
...
12/07/13 14:25:12 INFO mapreduce.Job:  map 25% reduce 0%
12/07/13 14:25:12 INFO mapreduce.Job: Job job_1342207265272_0001 failed with state FAILED due to:
...
                Failed map tasks=19
                Launched map tasks=31

The HTTP 400 error appears to be generated by the ShuffleHandler, which is
configured to run on port 8080 of the slaves, and doesn't understand that
URL. What I've been able to piece together so far is that /tasklog is
handled by the TaskLogServlet, which is part of the TaskTracker. However,
isn't this an MRv1 class that shouldn't even be running in my
configuration? Also, the TaskTracker appears to run on port 50060, so I
don't know where port 8080 is coming from.
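
For context, here's roughly how the shuffle handler gets wired up as a
NodeManager auxiliary service in yarn-site.xml. This is a sketch based on
the 2.0.x defaults as I understand them, not a copy of my exact config;
the 8080 value is shown only to illustrate where that port comes from:

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>mapreduce.shuffle.port</name>
    <value>8080</value>
  </property>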

Though it could be a red herring, this warning seems related to the job
failing, even though the job does make progress on the slave running the
AM. The NodeManager logs on the AM and non-AM slaves look fairly similar,
and I don't see any errors in the non-AM logs.

Another strange data point: These failures occur running the slaves on ARM
systems. Running the slaves on x86 with the same configuration works. I'm
using the same tarball on both, which means that the native-hadoop library
isn't loaded on ARM. The master/client is the same x86 system in both
scenarios. All nodes are running Ubuntu 12.04.
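
Since it's the same tarball on both architectures, a quick sanity check is
to look at the bundled native library directly (a sketch; the exact path
depends on where the tarball is unpacked):

  $ file $HADOOP_HOME/lib/native/libhadoop.so*
  # expect an x86/x86-64 ELF object, which is why the native-hadoop
  # library can't load on the ARM slaves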

Thanks for any guidance,
Trevor

Re: MRv2 jobs fail when run with more than one slave

Posted by Arun C Murthy <ac...@hortonworks.com>.
Look at the NodeManager logs on perfgb0n0, find the entries for container_1342570404456_0001_*, and check for errors.
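
Something like this should surface them (a sketch assuming default daemon
log locations; adjust the path for your layout):

  $ grep container_1342570404456_0001 $YARN_LOG_DIR/yarn-*-nodemanager-*.log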

Arun

On Jul 17, 2012, at 5:33 PM, Trevor wrote:

> [quoted message trimmed]

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: MRv2 jobs fail when run with more than one slave

Posted by Trevor <tr...@scurrilous.com>.
Actually, the HTTP 400 is a red herring, and not the core issue. I added
"-D mapreduce.client.output.filter=ALL" to the command line, and fetching
the task output fails even for successful tasks:

12/07/17 19:15:55 INFO mapreduce.Job: Task Id : attempt_1342570404456_0001_m_000006_1, Status : SUCCEEDED
12/07/17 19:15:55 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342570404456_0001_m_000006_1&filter=stdout
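
For completeness, the filter was just passed as a generic option on the
job command line. A sketch of the invocation (the jar name, row count, and
output path here are placeholders, not my exact arguments):

  $ hadoop jar hadoop-mapreduce-examples-2.0.0-alpha.jar teragen \
      -D mapreduce.client.output.filter=ALL \
      10000000 /user/trevor/teragen-out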

Having a better idea what to search for, I found that it's a recently fixed
bug: https://issues.apache.org/jira/browse/MAPREDUCE-3889

So the real question is how can I debug the failing tasks on the non-AM
slave(s)? Although I see failure on the client:

12/07/17 19:14:35 INFO mapreduce.Job: Task Id : attempt_1342570404456_0001_m_000002_0, Status : FAILED

I see what appears to be success on the slave:

2012-07-17 19:13:47,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1342570404456_0001_01_000002 succeeded
2012-07-17 19:13:47,477 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1342570404456_0001_01_000002 transitioned from RUNNING to EXITED_WITH_SUCCESS

Suggestions of where to look next?
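
For anyone following along: the per-container stdout/stderr should still
be on the slave's local disk even when fetching it over HTTP fails. A
sketch, assuming the yarn.nodemanager.log-dirs default of
${yarn.log.dir}/userlogs:

  $ ls $YARN_LOG_DIR/userlogs/application_1342570404456_0001/container_1342570404456_0001_01_000002/
  $ cat $YARN_LOG_DIR/userlogs/application_1342570404456_0001/container_1342570404456_0001_01_000002/stderr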

Thanks,
Trevor

On Tue, Jul 17, 2012 at 6:33 PM, Trevor <tr...@scurrilous.com> wrote:

> [quoted messages trimmed]

Re: MRv2 jobs fail when run with more than one slave

Posted by Trevor <tr...@scurrilous.com>.
Arun, I just verified that I get the same error with 2.0.0-alpha (official
tarball) and 2.0.1-alpha (built from svn).
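
For reproducibility, the svn build was nothing exotic. A sketch of the
standard build at the time (the branch path is my best guess; the mvn
invocation is the stock one from BUILDING.txt):

  $ svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.0.1-alpha hadoop-2.0.1-alpha
  $ cd hadoop-2.0.1-alpha
  $ mvn package -Pdist -DskipTests -Dtar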

Karthik, thanks for forwarding.

Thanks,
Trevor

On Tue, Jul 17, 2012 at 6:18 PM, Karthik Kambatla <ka...@cloudera.com> wrote:

> [quoted messages trimmed]

Re: MRv2 jobs fail when run with more than one slave

Posted by Karthik Kambatla <ka...@cloudera.com>.
Forwarding your email to the cdh-user group.

Thanks
Karthik

On Tue, Jul 17, 2012 at 2:24 PM, Trevor <tr...@scurrilous.com> wrote:

> [quoted message trimmed]

Re: MRv2 jobs fail when run with more than one slave

Posted by Arun C Murthy <ac...@hortonworks.com>.
Trevor,

 It's hard for folks here to help you with CDH patchsets (it's their call on what they include). Can you please try with vanilla Apache hadoop-2.0.0-alpha, and I'll try to help out?
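
Grabbing the release is roughly this (archive URL from memory; any Apache
mirror works):

  $ wget http://archive.apache.org/dist/hadoop/common/hadoop-2.0.0-alpha/hadoop-2.0.0-alpha.tar.gz
  $ tar xzf hadoop-2.0.0-alpha.tar.gz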

thanks,
Arun

On Jul 17, 2012, at 2:24 PM, Trevor wrote:

> [quoted message trimmed]

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/