You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2008/09/27 19:13:46 UTC

[jira] Created: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Spasm of JobClient failures on successful jobs every once in a while
--------------------------------------------------------------------

                 Key: HADOOP-4296
                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.17.1
            Reporter: Joydeep Sen Sarma
            Priority: Critical


At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:

08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
java.lang.NullPointerException
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Priority: Blocker  (was: Critical)

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635904#action_12635904 ] 

dhruba borthakur commented on HADOOP-4296:
------------------------------------------

Hi Devaraj, if every job status request scans the dfs file (if it cannot find the jobid in memory), it could result in bad behaviour, especially from the viewpoint of the JT thread that will do the scan. 

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Status: Patch Available  (was: Open)

Incorporated review comments from Vinod.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch, 4296_jt_delayretire4.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640631#action_12640631 ] 

Joydeep Sen Sarma commented on HADOOP-4296:
-------------------------------------------

quick comment: if the job is complete - but we don't know the status (ideally this should never happen, but if it does ..) - it's better to return a non-zero error status. Saying success when it ain't so is pretty scary (for workflow pipelines that depend on correct exit status)

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635169#action_12635169 ] 

Joydeep Sen Sarma commented on HADOOP-4296:
-------------------------------------------

it almost seems like there is some kind of race condition between the jobs being completed and disappearing from the tracker?


        try {
          if (running.isComplete()) {
            break;
          }
          running = jc.getJob(jobId);
993:          String report =
            (" map " + StringUtils.formatPercent(running.mapProgress(), 0)+
             " reduce " +
             StringUtils.formatPercent(running.reduceProgress(), 0));

is it possible that the running.isComplete() is stale? I looked at the isComplete call a little bit more:

  static long MAX_JOBPROFILE_AGE = 1000 * 2;

    synchronized void ensureFreshStatus() throws IOException {
      if (System.currentTimeMillis() - statustime > MAX_JOBPROFILE_AGE) {


ouch - so what might be happening is the isComplete() returns false based on stale status. But when we request the job from the jobtracker - the job is gone (very high volume of jobs from the same user)?

Is this a plausible theory? If so - the fix would be simply be to reduce the MAX_JOBPROFILE_AGE to match the polling loop sleep time in jobclient.



> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635651#action_12635651 ] 

dhruba borthakur commented on HADOOP-4296:
------------------------------------------

If this use-case is verified, a simple solution could be to delay retiring successfully completed jobs by a fixed threshold (say 5 minutes). This will ensure that clients that were polling every 2 seconds would be able to fetch the completed status from the job tracker.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640632#action_12640632 ] 

dhruba borthakur commented on HADOOP-4296:
------------------------------------------

Hi Vinod, thanks for the review.

> You should check for the condition before the job is even removed from JT.
I do not understand this. Can you pl explain?

>Further, are you sure we want to throw an exception in the job-client? 
No exceptions are being thrown in the JobClient. The pre-existing code catches those exceptions and retries (I have not changed any of these).

I am pretty sure that this is not a "feature". It fixes a bug.




> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635667#action_12635667 ] 

Devaraj Das commented on HADOOP-4296:
-------------------------------------

BTW, a slightly different approach would be to use the jobstatus store. Have a look at mapred.job.tracker.persist.jobstatus.* config items in hadoop-default.xml. The idea here is that job statuses are stored to the dfs and is available even after the jobs are purged.  

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640625#action_12640625 ] 

Vinod K V commented on HADOOP-4296:
-----------------------------------

Dhruba,

Sorry for not looking at the entire patch earlier. The place where you are using MIN_TIME_BEFORE_RETIRE in finalizeJob to prevent retiring very recently finished jobs seems wrong to me. You should check for the condition before the job is even removed from JT.

Further, are you sure we want to throw an exception in the job-client? An exception would throw a non-zero error code to the client; I'd rather log the message and just break out of the loop, thus returning a zero exit code, as we know for sure that the job is indeed complete.

Having said that, I think introducing MIN_TIME_BEFORE_RETIRE here, in this blocker, doesn't seem to be very right. It is a feature. The later fix, patching the client, is the correct and sufficient fix for the problem that Joydeep initially reported.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Status: Open  (was: Patch Available)

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635466#action_12635466 ] 

Joydeep Sen Sarma commented on HADOOP-4296:
-------------------------------------------

well - that's the only reasonable explanation i can think of. the volume of tasks is pretty incredible in the night when this happens (i was up last night watching it :-)).

the other very strange thing i saw last night was that i had a job running (and i had the job id and progress log) - but the job was not visible in the tracker for a long time. It eventually became visible - but that was pretty disconcerting (is the jsp using a different api?). 

Anyway - should we just close this loophole (or at least try it out internally) ? I don't see a downside and we can verify if this makes things better.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641259#action_12641259 ] 

Hadoop QA commented on HADOOP-4296:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12392454/4296_jt_delayretire4.patch
  against trunk revision 706463.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3502/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3502/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3502/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3502/console

This message is automatically generated.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch, 4296_jt_delayretire4.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636533#action_12636533 ] 

dhruba borthakur commented on HADOOP-4296:
------------------------------------------

Hi Devaraj, From my understanding, "persisting job status" is optional. For those clusters on which job-status is not persisted the attached patch may be valuable in its own right. 




> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640098#action_12640098 ] 

Devaraj Das commented on HADOOP-4296:
-------------------------------------

+1

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640252#action_12640252 ] 

Devaraj Das commented on HADOOP-4296:
-------------------------------------

What Vinod means to say is that the JobClient could catch the exception if it happens. Instead of printing the raw exception, an error msg can be displayed like "job not found" or something.. This is what he meant I believe.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642306#action_12642306 ] 

Hudson commented on HADOOP-4296:
--------------------------------

Integrated in Hadoop-trunk #640 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/640/])
    

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch, 4296_jt_delayretire4.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635468#action_12635468 ] 

Joydeep Sen Sarma commented on HADOOP-4296:
-------------------------------------------

also - i don't know how the tracker purges jobs from memory. if that itself is batch oriented - then that may explain a large number of deletions at the same time.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636539#action_12636539 ] 

Vinod K V commented on HADOOP-4296:
-----------------------------------

JobClient breaks out of the polling loop when RunningJob.isComplete() returns true. During the polling of the job the client has started, as RunningJob.getJob(jobId) returning a null also essentially implies that the job is completed (and gone too), JobClient could just break out of polling in this case.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638240#action_12638240 ] 

Joydeep Sen Sarma commented on HADOOP-4296:
-------------------------------------------

here's a related/new observation (maybe this deserves a separate jira):

- job is long complete and gone from tracker (hours). i checked the history file to confirm.
- jobclient is hung polling for completion

we are probably running with the 5 minute wait-before-purge on jobclient patch for this. i did a jstack on the jobclient:

"main" prio=10 tid=0x08059c00 nid=0x7374 waiting on condition [0xf7fdb000..0xf7fdc298]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:992)
        at com.facebook.hive.common.columnSetLoader.run(columnSetLoader.java:545)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

i am quite confused how this can happen. anyway - i will leave it running in this state so u can take a look.


> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640135#action_12640135 ] 

Vinod K V commented on HADOOP-4296:
-----------------------------------

-1. I think that in this patch itself you should also make JobClient skip out of the loop when it detects the job as complete, because that is the correct fix irrespective of what the polling interval of JobClient is. Currently note that we are not sync'ing client's polling interval with MIN_TIME_BEFORE_RETIRE that you added, so if later, client's polling interval becomes configurable/increases beyond MIN_TIME_BEFORE_RETIRE, this problem surfaces again.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Attachment: 4296_jt_delayretire.patch

Introduced a new parameter mapred.jobtracker.mintime.before.retirejob that specifies the time (in milliseconds) before a completed job can be retired.


> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640242#action_12640242 ] 

dhruba borthakur commented on HADOOP-4296:
------------------------------------------

>make JobClient skip out of the loop when it detects the job as complete

I do not understand Vinod's comment, can you pl explain in greater detail? The client is trying to determine if the job is complete. if the job is complete, of course the client exits. But the problem if that the JobClient was unable to fetch the status of the job from the JT

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Fix Version/s: 0.19.0
           Status: Patch Available  (was: Open)

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Status: Patch Available  (was: Open)

Incorporated review comments. Fixed findbugs warnings.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Attachment: 4296_jt_delayretire3.patch

The JobClient now elegantly handles the case when it fails to retrieve job status from the JobTracker. 
Fixed findbugs warnings.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637009#action_12637009 ] 

Devaraj Das commented on HADOOP-4296:
-------------------------------------

Dhruba, the dfs scan happens when the job is submitted (copying of jar files, etc.). Look at the public JobInProgress constructor and this gets called from the submitJob RPC. But I haven't thought about how this can be handled better...
Vinod's suggestion looks fine to me especially if you don't care about the final status of the completed job. In addition, if you really care about getting the status of completed jobs, you have the JobStatus store that I was talking about earlier, or, increase the number of jobs kept in memory and, maybe, tweak the retirejob interval. The former can be used to get the status of jobs that were completed a while back while the latter would be for clients that are polling very actively. Your approach would essentially be doing the latter, right?

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635942#action_12635942 ] 

Devaraj Das commented on HADOOP-4296:
-------------------------------------

Hi Dhruba, that's a fair point. The RPC handler thread would be blocked during the scan. I am worried about the submitJob API as well. That does a bunch of dfs operations too. We probably should do something about that as well (later on).
However, since we already do dfs scans per job inline in the RPC handler (and AFAIK there is no noticeable impact), I'd like to see the impact of these additional dfs lookups on your system... Would that be a lot of work at your end?

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch, 4296_jt_delayretire4.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637015#action_12637015 ] 

Joydeep Sen Sarma commented on HADOOP-4296:
-------------------------------------------

we definitely care about the status of completed jobs (and i think most installations would - given that at least some of the uses are always programmatic invocations that check return status).

does the jobstatus store need to scan dfs even when the job status is available in memory? (falling back to persistent store only when the data is missing in memory would seem like a good strategy). another question is whether job counters are available from the persisted job status?

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Attachment: 4296_jt_delayretire4.patch

Incorporated review comments from Vinod.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch, 4296_jt_delayretire4.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur reassigned HADOOP-4296:
----------------------------------------

    Assignee: dhruba borthakur

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635457#action_12635457 ] 

dhruba borthakur commented on HADOOP-4296:
------------------------------------------

So, your assumption is that the job finished successfully and got purged from the job queue (because of high volume of job completions from the same user) within 2 seconds? 

The current value of mapred.jobtracker.completeuserjobs.maximum is set to 20.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635647#action_12635647 ] 

Devaraj Das commented on HADOOP-4296:
-------------------------------------

One thing to verify here is whether you see this message in the JobTracker logs - "Retired job with id:" around the time you see the NPE in the client.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640462#action_12640462 ] 

Hadoop QA commented on HADOOP-4296:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12392272/4296_jt_delayretire3.patch
  against trunk revision 705430.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3481/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3481/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3481/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3481/console

This message is automatically generated.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Status: Open  (was: Patch Available)

Incorporated review comments from Vinod.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch, 4296_jt_delayretire4.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636413#action_12636413 ] 

dhruba borthakur commented on HADOOP-4296:
------------------------------------------

>However, since we already do dfs scans per job inline in the RPC handler (and AFAIK there is no noticeable impact)

can you pl explain about this already existing dfs scan that that JT does? When does it do it?

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4296:
-------------------------------------

    Attachment: 4296_jt_delayretire2.patch

Incorporated review comments from Devaraj. Merged patch with latest trunk.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640156#action_12640156 ] 

Hadoop QA commented on HADOOP-4296:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12392228/4296_jt_delayretire2.patch
  against trunk revision 705155.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3474/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3474/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3474/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3474/console

This message is automatically generated.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641279#action_12641279 ] 

Vinod K V commented on HADOOP-4296:
-----------------------------------

+1 for the patch.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 4296_jt_delayretire3.patch, 4296_jt_delayretire4.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637017#action_12637017 ] 

Vinod K V commented on HADOOP-4296:
-----------------------------------

bq. (falling back to persistent store only when the data is missing in memory would seem like a good strategy). 
That's what it does.

bq. another question is whether job counters are available from the persisted job status?
Yes.

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
> 	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
> 	at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.