Posted to common-dev@hadoop.apache.org by "Amar Kamat (JIRA)" <ji...@apache.org> on 2009/03/04 14:35:56 UTC

[jira] Created: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
----------------------------------------------------------------------------------------------

                 Key: HADOOP-5394
                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
            Reporter: Amar Kamat
            Priority: Critical


This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.
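
To make the failure mode concrete, here is a minimal sketch of how a lost restart-count update could lead to a reused attempt id. The id format, the `ATTEMPTS_PER_RESTART` constant, and all class and method names are hypothetical illustrations, not the actual JobTracker code; the sketch only assumes that the restart count is folded into the attempt number.

```java
import java.util.HashSet;
import java.util.Set;

public class AttemptIdCollision {
    // Hypothetical gap between the attempt numbers of successive restarts.
    static final int ATTEMPTS_PER_RESTART = 1000;

    // Hypothetical id scheme: the restart count offsets the attempt number.
    static String attemptId(String taskId, int restartCount, int attempt) {
        return "attempt_" + taskId + "_" + (restartCount * ATTEMPTS_PER_RESTART + attempt);
    }

    /** Returns true if the second restart reissues an id used after the first. */
    static boolean secondRestartCollides(boolean countFlushed) {
        Set<String> issued = new HashSet<>();
        int persisted = 0; // restart count as stored on disk

        // First restart: the count becomes 1 and one attempt is scheduled.
        int firstRun = persisted + 1;
        issued.add(attemptId("m_000000", firstRun, 0));
        if (countFlushed) {
            persisted = firstRun; // the new count reached the file
        }

        // Second restart: the count is derived from whatever is on disk.
        int secondRun = persisted + 1;
        return !issued.add(attemptId("m_000000", secondRun, 0));
    }

    public static void main(String[] args) {
        System.out.println("flushed:     collision=" + secondRestartCollides(true));
        System.out.println("not flushed: collision=" + secondRestartCollides(false));
    }
}
```

When the count is not flushed, both restarts compute the same count and therefore the same attempt id for the same task.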

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Attachment: HADOOP-5394-v1.9.1.patch

Attaching a patch that fixes the testcase failures.

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated HADOOP-5394:
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.20.0
           Status: Resolved  (was: Patch Available)

I just committed this. Thanks Amar!

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-5394-v1.10.1.patch, HADOOP-5394-v1.10.patch, HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Commented: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682658#action_12682658 ] 

Devaraj Das commented on HADOOP-5394:
-------------------------------------

Some comments:
1) The file jobtracker.info (the restart count file) must always exist in the system directory, and the value in the file should be 0 (indicating the JT has started fresh). RecoveryManager.getRestartCount should be changed accordingly. The update to the info file should be like:
{code}
   // Recover from an interrupted update, if any.
   if (infoFile.exists()) {
       delete(infoFile.recover);
   } else {
       rename(infoFile.recover, infoFile);
   }
   // Bump the count by writing to the recover file first.
   count = readInfoFile();
   write (count + 1) to infoFile.recover;
   delete(infoFile);
   rename(infoFile.recover, infoFile);
{code}

2) Add checks for info file in the testcase.
3) The restart count need not be logged in the JobHistory file.
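
The update sequence in point 1 could be sketched in Java roughly as follows. This is only an illustration under assumptions: it uses `java.nio.file` on a local filesystem instead of Hadoop's `FileSystem` API, and the class name, method names, and the `.recover` suffix are hypothetical rather than taken from the patch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class RestartCountFile {
    /** Creates the info file with count 0 for a fresh start (hypothetical helper). */
    static void initFresh(Path infoFile) throws IOException {
        Files.write(infoFile, "0".getBytes(StandardCharsets.UTF_8));
    }

    /** Repairs an interrupted update, bumps the count, and returns the new value. */
    static int incrementRestartCount(Path infoFile) throws IOException {
        Path recoverFile = infoFile.resolveSibling(infoFile.getFileName() + ".recover");

        if (Files.exists(infoFile)) {
            // The info file is intact; a leftover recover file is discarded.
            Files.deleteIfExists(recoverFile);
        } else if (Files.exists(recoverFile)) {
            // A crash hit after the delete but before the rename: finish the rename.
            Files.move(recoverFile, infoFile);
        } else {
            throw new NoSuchFileException(infoFile.toString());
        }

        String text = new String(Files.readAllBytes(infoFile), StandardCharsets.UTF_8).trim();
        int count = Integer.parseInt(text);

        // Write the new value aside, then swap it into place.
        Files.write(recoverFile, Integer.toString(count + 1).getBytes(StandardCharsets.UTF_8));
        Files.delete(infoFile);
        Files.move(recoverFile, infoFile);
        return count + 1;
    }

    public static void main(String[] args) throws IOException {
        Path info = Files.createTempDirectory("jt-system").resolve("jobtracker.info");
        initFresh(info);
        System.out.println(incrementRestartCount(info));
        System.out.println(incrementRestartCount(info));
    }
}
```

The point of the two-file scheme is that at every step at least one of the two files holds a complete count, so a crash in the middle of an update can be repaired on the next start.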

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Commented: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689024#action_12689024 ] 

Devaraj Das commented on HADOOP-5394:
-------------------------------------

It would be good to always invoke getRestartCount() on JobTracker startup. Also, the code in init() that creates the restart count file can be moved there, so that the file is created when it doesn't exist. JobTracker recovery should be disabled for the current run when the file doesn't exist (even if the configuration enables recovery).
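
A rough sketch of the startup decision described above, for illustration only. It uses `java.nio.file` rather than Hadoop's `FileSystem` API, covers only the existence check and the recovery decision (not the count update), and the class, method, and field names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StartupRecoveryPolicy {
    /** Outcome of the startup check. */
    static final class Decision {
        final int restartCount;
        final boolean recoveryEnabled;

        Decision(int restartCount, boolean recoveryEnabled) {
            this.restartCount = restartCount;
            this.recoveryEnabled = recoveryEnabled;
        }
    }

    static Decision onStartup(Path infoFile, boolean recoveryConfigured) throws IOException {
        if (!Files.exists(infoFile)) {
            // No count file: treat this run as a fresh start, create the file
            // with 0, and skip recovery regardless of the configuration.
            Files.write(infoFile, "0".getBytes(StandardCharsets.UTF_8));
            return new Decision(0, false);
        }
        // The file exists: read the count and honor the configured setting.
        String text = new String(Files.readAllBytes(infoFile), StandardCharsets.UTF_8).trim();
        return new Decision(Integer.parseInt(text), recoveryConfigured);
    }

    public static void main(String[] args) throws IOException {
        Path info = Files.createTempDirectory("jt-system").resolve("jobtracker.info");
        Decision first = onStartup(info, true);
        Decision second = onStartup(info, true);
        System.out.println(first.recoveryEnabled + " " + second.recoveryEnabled);
    }
}
```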

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Commented: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678751#action_12678751 ] 

Amar Kamat commented on HADOOP-5394:
------------------------------------

bq. This can happen when the jobtracker gets restarted more than once
This can happen when the jobtracker is restarted frequently within a very small time window.

bq. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state
Actually, the job will be in an inconsistent state. The JobTracker itself should be fine.

Note that this can happen on a small cluster where the rate of updates per job is low.

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Attachment: HADOOP-5394-v1.9.1.patch

Attaching a patch incorporating Devaraj's offline comments. Result of test-patch:
{code}
[exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 18 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

Ant test passed on my box.

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Attachment:     (was: HADOOP-5394-v1.9.1.patch)

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Commented: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694935#action_12694935 ] 

Amar Kamat commented on HADOOP-5394:
------------------------------------

{{TestSocketFactory}} and {{TestCapacityScheduler}} failed. {{TestCapacityScheduler}} fails on trunk, while {{TestSocketFactory}} fails because of the patch. Investigating.

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Commented: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694780#action_12694780 ] 

Hadoop QA commented on HADOOP-5394:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12404329/HADOOP-5394-v1.9.1.patch
  against trunk revision 760783.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 18 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/91/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/91/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/91/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/91/console

This message is automatically generated.

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Attachment: HADOOP-5394-v1.10.patch

Attaching a patch that fixes the failure of TestSocketFactory. Result of test-patch:
{code}
[exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 18 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.10.patch, HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Status: Open  (was: Patch Available)

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Attachment: HADOOP-5394-v1.5.patch

Attaching a patch incorporating Devaraj's comments. Result of test-patch:
{code}
[exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

Note that this patch requires HADOOP-5521.

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Attachment: HADOOP-5394-v1.10.1.patch

Attaching a patch removing some unnecessary code/diffs and generating it for trunk.

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.10.1.patch, HADOOP-5394-v1.10.patch, HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Commented: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679949#action_12679949 ] 

Devaraj Das commented on HADOOP-5394:
-------------------------------------

I suggest we move to a model where the restart count is based on the number of times the JobTracker has been restarted, rather than associating the count with a per-job restart (as it is today). The restart-count read/update could be the first thing that the JT does as soon as it starts up.

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.

[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Attachment: HADOOP-5394-v1.2.patch

Attaching a patch that logs the jobtracker restart count in a file named _jobtracker.info_ under the system directory. Result of test-patch:
{code}
 [exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.


[jira] Commented: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696642#action_12696642 ] 

Devaraj Das commented on HADOOP-5394:
-------------------------------------

Patch looks fine to me. Please check whether the patch applies to 0.20, and if not, submit one for 0.20.

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.10.patch, HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.


[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Status: Patch Available  (was: Open)

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.10.1.patch, HADOOP-5394-v1.10.patch, HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.


[jira] Updated: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-5394:
-------------------------------

    Status: Patch Available  (was: Open)

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.


[jira] Commented: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695342#action_12695342 ] 

Amar Kamat commented on HADOOP-5394:
------------------------------------

TestSocketFactory tests whether clients can connect to the server through a custom socket factory. It works as follows:
# Define a socket factory that uses (_port_ - 10) instead of _port_.
# Start the server.
# Configure a client conf to use this socket factory implementation and set the server URL to _hostname:port+10_.
# At the client, the socket factory subtracts 10 from the port and is thus able to connect to the server.

This doesn't work with the current patch because the JobTracker tries to create a file on the DataNode using the socket factory, but the DataNode info passed to the JobTracker is correct (i.e. no +10 is applied), so the factory's -10 produces the wrong port. The DataNode information can't be changed, as it is obtained from the NameNode. Hence this patch starts the JobTracker with the correct conf and not the modified conf. The JobTracker-to-NameNode connection need not be checked, since the DFSClient-to-NameNode connection is already checked and, for the NameNode, the JobTracker is just another client.
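
The port-shifting trick described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the mechanism, not the test's actual code: the class name {{OffsetSocketFactory}}, the helper {{canConnect}}, and the loopback server are assumptions made for the example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import javax.net.SocketFactory;

// Hypothetical stand-in for the test's factory: always dials (port - 10).
public class OffsetSocketFactory extends SocketFactory {
    public static final int OFFSET = 10;

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        return new Socket(host, port - OFFSET);
    }

    @Override
    public Socket createSocket(String host, int port,
            InetAddress localHost, int localPort) throws IOException {
        return new Socket(host, port - OFFSET, localHost, localPort);
    }

    @Override
    public Socket createSocket(InetAddress host, int port) throws IOException {
        return new Socket(host, port - OFFSET);
    }

    @Override
    public Socket createSocket(InetAddress address, int port,
            InetAddress localAddress, int localPort) throws IOException {
        return new Socket(address, port - OFFSET, localAddress, localPort);
    }

    /** Returns true if a connection through the factory succeeds. */
    public static boolean canConnect(SocketFactory factory, String host, int port) {
        try (Socket s = factory.createSocket(host, port)) {
            return s.isConnected();
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            int realPort = server.getLocalPort();
            String host = loopback.getHostAddress();
            SocketFactory factory = new OffsetSocketFactory();

            // Address advertised as port+10: the factory undoes the shift.
            System.out.println("shifted address:   "
                    + canConnect(factory, host, realPort + OFFSET));

            // Unshifted (correct) address, as the NameNode reports for a
            // DataNode: the factory dials realPort - 10, which normally has
            // no listener.
            System.out.println("unshifted address: "
                    + canConnect(factory, host, realPort));
        }
    }
}
```

The second call mirrors the failure in the comment: an address that was never shifted by +10 gets shifted by -10 anyway, so the connection goes to the wrong port.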

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>         Attachments: HADOOP-5394-v1.10.patch, HADOOP-5394-v1.2.patch, HADOOP-5394-v1.5.patch, HADOOP-5394-v1.9.1.patch
>
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.


[jira] Assigned: (HADOOP-5394) JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat reassigned HADOOP-5394:
----------------------------------

    Assignee: Amar Kamat

> JobTracker might schedule 2 attempts of the same task with the same attempt id across restarts
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5394
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5394
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>
> This can happen when the jobtracker gets restarted more than once. In such cases, the jobtracker depends on the jobhistory file for the next restart count. If the new restart-count is not flushed to the file then there is a fair chance that upon next restart, the jobtracker might schedule a new attempt with an existing id. This can cause problems not only with the side-effect files but also can cause the jobtracker to be in an inconsistent state.
