You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Garth Goodson (JIRA)" <ji...@apache.org> on 2012/08/29 21:10:07 UTC

[jira] [Commented] (MAPREDUCE-4442) Accessing hadoop counters from a job is unreliable in yarn during AM process cleanup window

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444332#comment-13444332 ] 

Garth Goodson commented on MAPREDUCE-4442:
------------------------------------------

I think I'm also running into this on both 2.0.0-alpha and 0.23.3.  I am executing the job remotely and polling the job status.  As soon as the job completes (successfully), I get the IOException at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315).  After job submission (before completion) checking the job status works fine.

                
> Accessing hadoop counters from a job is unreliable in yarn during AM process cleanup  window
> --------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4442
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4442
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>         Attachments: am_logs_counter_failure.html, rsrc_mgr_logs_counter_failed.txt
>
>
> We found this issue during our tests moving from MapReduceV1 to MapReduceV2. A few of our applications access job counters multiple times:
> a) After submission of job, while job is execution (works fine)
> b) Right after job complete notification is received (works fine)
> c) Few seconds after job complete notification (fails most of the time).
> The error snippet is as follows:
> {code}
> 2012-07-12 19:12:29,039 WARN  [Client] Unexpected error reading responses on connection Thread[IPC Client (1252749669) connection to sjc1-ciq-ibm-grid07.carrieriq.com/10.202.50.187:47944 from hadoop,5,main]
> java.lang.NullPointerException
> 	at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852)
> 	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781)
> 2012-07-12 19:12:29,044 INFO  [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> 2012-07-12 19:12:29,132 INFO  [ClientServiceDelegate] Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> 2012-07-12 19:12:29,216 ERROR [UserGroupInformation] PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException
> 2012-07-12 19:12:29,216 WARN  [BaseOutputStageJob] getJobCounters: Unable to retrieve counters. null
> java.io.IOException
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315)
> 	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:335)
> 	at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:470)
> 	at org.apache.hadoop.mapreduce.Job$8.run(Job.java:719)
> 	at org.apache.hadoop.mapreduce.Job$8.run(Job.java:716)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> 	at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:716)
> 	at org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396)
> {code}
> The connection to 10.202.50.187:47944 is actually the connection to AM; appears that we are connecting to AM to get the counters for the successful job and not yet to the history server.
>  
> I'll attach the logs for AM and resource mgr separately, however no unusual activity is seen in those.
> This makes me suspect that we have a race condition in the code trying to access job counters when AM is finishing up and the job hasn't moved to history server yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira