You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Francesco Salbaroli (JIRA)" <ji...@apache.org> on 2008/11/04 12:24:46 UTC

[jira] Created: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Fault tolerant Hadoop Job Tracker
---------------------------------

                 Key: HADOOP-4586
                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
             Project: Hadoop Core
          Issue Type: New Feature
          Components: mapred
    Affects Versions: 0.18.0
         Environment: High availability enterprise system
            Reporter: Francesco Salbaroli


The Hadoop framework has been designed, in an eort to enhance perfor-
mances, with a single JobTracker (master node). It's responsibilities varies
from managing job submission process, compute the input splits, schedule
the tasks to the slave nodes (TaskTrackers) and monitor their health.
In some environments, like the IBM and Google's Internet-scale com-
puting initiative, there is the need for high-availability, and performances
becomes a secondary issue. In this environments, having a system with
a Single Point of Failure (such as Hadoop's single JobTracker) is a major
concern.
My proposal is to provide a redundant version of Hadoop by adding
support for multiple replicated JobTrackers. This design can be approached
in many dierent ways. 

In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0

I wrote an overview of the problem and some approaches to solve it.

I post this to the community to gather feedback on the best way to proceed in my work.

Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645466#action_12645466 ] 

Francesco Salbaroli commented on HADOOP-4586:
---------------------------------------------

Thank you for the comment Steve.

1) You're right, but I don't have exact details about the infrastructure on which the Hadoop cluster will run (probably here at IBM will be possible to run Hadoop on top of IBM BlueCloud cloud computing infrastructure). So in an environment with fault-detection and fast, automatic VM provisioning, cold standby can be an option. This point need further investigation and I hope to receive feedback again on this.

2) In Hot-Standby I supposed a very small number of replicas of the JobTracker (between 2 and 4), so the high network traffic shouldn't be a major concern. But this can be verified only after extensive test sessions.

3) Yes, DNS update seems to be the best option.

4) I wasn't aware of the limitation on multicast of Amazon EC2. Two possible solutions: define statically the JobTracker nodes or maintain a list of nodes on DFS or shared cache

5) I thought about a heartbeat mechanism.

6) Network partitioning is an issue. Ignoring HDFS, using an election protocol will produce separate smaller fully functional cluster (that is an unwanted feature) but, it should be able to detect multiple running masters and re-run the election to reach another stable state.

7)This will be part of my M.Sc. but I produced this document only to propose my ideas to the community and gather feedback. And, yes, this is an early draft.

8) I will take a look at it

So, thank you again for the interest you show in my work and I hope to hear from you and other community members soon.

Francesco



> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Salbaroli updated HADOOP-4586:
----------------------------------------

    Affects Version/s: 0.18.1
                       0.18.2
        Fix Version/s: 0.18.3

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.18.3
>
>         Attachments: FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659093#action_12659093 ] 

Steve Loughran commented on HADOOP-4586:
----------------------------------------

openoffice says the PPT file is password encrypted. Is that right? Could you upload a PDF?

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.18.3
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Salbaroli updated HADOOP-4586:
----------------------------------------

    Attachment: Enhancing the Hadoop MapReduce framework by adding fault.ppt

This are the charts from a very introductory presentation I have held at the IBM Innovation Centre to give an update about progress of my work.




> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.18.3
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662412#action_12662412 ] 

dhruba borthakur commented on HADOOP-4586:
------------------------------------------

>the network partition problem should not be an issue (only one partition at a time can access the HDFS). 

I think the problem still exists. Suppose a network partition occurs between a master and the remainder of the nodes. A new master is elected allright, and the new master will now own the metadata state. The new master will start to update JobTracker metadata strored in HDFS beucae he thinks he is the sole owner of this metadata. At this time, how can you be guaranteed that the old master is not continuing to update the same metadata?


> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Salbaroli updated HADOOP-4586:
----------------------------------------

    Remaining Estimate: 960h  (was: 2016h)
     Original Estimate: 960h  (was: 2016h)

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.18.3
>
>         Attachments: FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>   Original Estimate: 960h
>  Remaining Estimate: 960h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Work started: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HADOOP-4586 started by Francesco Salbaroli.

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662458#action_12662458 ] 

Doug Cutting commented on HADOOP-4586:
--------------------------------------

> Zookeeper might not work well for maintaining JobTracker state (or for that matter, Namenode persistent state) because these processes have lots of metadata to store.

That's the key concern.  Zookeeper's in-memory datastructures would probably take much more space than those in the namenode and/or jobtracker do today.  Other than that, Zookeeper seems ideally suited to these tasks.  Perhaps if Zookeeper were to support namespace partitioning and rebalancing (hard problems) then it could be used to store such data.  It would certainly vastly simplify many things.

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655608#action_12655608 ] 

Francesco Salbaroli commented on HADOOP-4586:
---------------------------------------------

I will release a preliminary test version of Fault tolerant Hadoop before 17th Dec.

Features will include:
-JGroups 2.6.7 toolkit for reliable multicast communication that is based on a highly configurable protocol stack to adapt to different environments (I will post documentation about it).
-Completely wraps around the Hadoop sourcecode to minimize modifications in the source tree.
-Dynamic JobTracker address resolution using HDFS as a support.

Enhancement in future versions:
-Higher level of abstraction
-Better exception handling

I'll post the sourcecode at the beginning of the next week (hopefully).

Can I be added to the developers?

Best regards,
Francesco


> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649067#action_12649067 ] 

Francesco Salbaroli commented on HADOOP-4586:
---------------------------------------------

What the community think about using the JGroups reliable multicast system for communicating and status monitoring between master and slaves?

It has 2 major benefits:
1) Implements reliable multicast communications
2) Abstracts from the protocol used (can exploit benefits of Multicast UDP, where available, or using TCP where Multicast is forbidden i.e. Amazon EC2) 

Regards,
Francesco

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Bo Shi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719982#action_12719982 ] 

Bo Shi commented on HADOOP-4586:
--------------------------------

Hi,

I am in China until June 22nd and will only have intermittent access
to email until then.

Thanks,
Bo

-- 
Bo Shi
207-469-8264


> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, HADOOP-4586v0.3.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Bo Shi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662048#action_12662048 ] 

Bo Shi commented on HADOOP-4586:
--------------------------------

> In my opinion, it is better to avoid using active copies due to the high
> complexity of the coordination protocol and, instead, using a master-slave
> model with soft-state shared between copies through a distributed cache
> mechanism or saved on HDFS.

Please forgive me if I'm being naive here (I see that I'm a bit late to the show), but wouldn't using Zookeeper to persist jobtracker state effectively mask this complexity?

Has anyone explored refactoring the job tracker to use Zookeeper instead of engineering a new master/slave replication system?

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Suresh Srinivas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662054#action_12662054 ] 

Suresh Srinivas commented on HADOOP-4586:
-----------------------------------------

Sorry for the late comments:
For a master/slave HA solution, two main problems are:
1. Mechanism that determines a master in a cluster during startup and failover. Handling loss of quorum, split-brain and fencing in case of split-brain. It also requires comprehensive management tools for configuring, managing and monitoring cluster.
2. Sharing state information between master and slave, so that a slave node can take over as master.

Currenly the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add more information on how the design handles these issues and some notes on how administrator uses this functionality to manage the cluster.

Also analysis of the impact of job tracker performance due to the introduction of this feature needs to be done.

> Has anyone explored refactoring the job tracker to use Zookeeper instead of engineering a new master/slave replication system?
Storing the jobtracker state on Zookeeper may not be a viable option, given that ZooKeeper is intended for storing small amount of data (KB) and JobTracker has lot more data than that to persist.

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719980#action_12719980 ] 

Sharad Agarwal commented on HADOOP-4586:
----------------------------------------

If I understand correctly, the current patch doesn't share the state between master and slaves. It relies on HADOOP-3245 for keeping the state. I assume this to work the state has to be kept on HDFS instead of local filesystem. In case a new master is elected, the jobtracker is started using the state from HDFS, right?
Also, reading the master info from HDFS at frequent interval from each node may not scale well. I think Zookeeper would be better suited in the case where we are just doing master election and keeping watch on the master changes.

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, HADOOP-4586v0.3.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662058#action_12662058 ] 

dhruba borthakur commented on HADOOP-4586:
------------------------------------------

Zookeeper might not work well for maintaining JobTracker state (or for that matter, Namenode persistent state) because these processes have lots of metadata to store. 

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662309#action_12662309 ] 

Francesco Salbaroli commented on HADOOP-4586:
---------------------------------------------

>Sorry for the late comments:
>For a master/slave HA solution, two main problems are:
>1. Mechanism that determines a master in a cluster during startup and failover.

The JGroups library (whose manual can be found here: [http://www.jgroups.org/javagroupsnew/docs/manual/pdf/manual.pdf] ) handles automatically the election of a group coordinator. The node elected group coordinator is also the master of the cluster. In case of a failure a new group coordinator (and, consequentially, a new cluster master) will be elected.

>Handling loss of quorum, 

The shared state resides entirely on HDFS (see issues HADOOP-1876 and HADOOP-3245) so, until now, there is no shared soft-state between nodes. However the facilities for managing a shared state are present and can be used in a future update.

>split-brain and fencing in case of split-brain.

The JGroups library tries to automatically handle network partitions and merging, but given that:

*There is no shared soft-state
*There is only one access point in the whole Hadoop cluster to the HDFS (the NameNode)

the network partition problem should not be an issue (only one partition at a time can access the HDFS). In future versions a more elegant way of dealing with network partitions should be added.

> It also >requires comprehensive management tools for configuring, managing and monitoring cluster.

I am now adding JMX support. After the initial testing phase I will post result and an updated version.

>2. Sharing state information between master and slave, so that a slave node can take over as master.
>Currenly the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing >of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add >more information on how the design handles these issues and some notes on how administrator uses this functionality to manage the cluster.

I hope I have given an answer to your question. If you need more, feel free to contact me.

>Also analysis of the impact of job tracker performance due to the introduction of this feature needs to be done.

I am about to begin the testing phase, results will follow

Regards,
Francesco

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Salbaroli updated HADOOP-4586:
----------------------------------------

    Affects Version/s:     (was: 0.18.2)
                           (was: 0.18.1)
                           (was: 0.18.0)

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645235#action_12645235 ] 

Steve Loughran commented on HADOOP-4586:
----------------------------------------

This is an interesting project which will provide much thesis work, especially from the testing and proof of correctness perspectives.

-There are some implicit assumptions about the ability of the infrastructure to provision hardware, namely that Cold Standby is inappropriate. If a virtual machine can be provisioned and brought up live within a minute, Cold Standby is surprisingly viable, and, on pay-as-you-go infrastructure, cost-effective.

-The statement that forwarding all state changes to all slaves -Hot Standby- is best needs to be qualified with estimated load values and the impact of the events on the network. Is there a cluster size or map/reduce job lifetime in which the state traffic will become an issue, or is it just load on the nodes.  

-How do you intend to implement failover without notifying the task trackers? DNS update?  

-I would like to see some coverage of the election protocol, in particular, how to coordinate such an election over an infrastructure which denies multicast IP (e.g. Amazon EC2).

-Determining the liveness of the JobTracker is going to be hard. Using Lamport's definitions, it is only live if it is capable of performing work within bounded time, so the true way to determine health is to submit work to the system. Early failures: IPC deadlock, host outage etc, may be detectable early, but some failure modes may be hard to detect. Some of the ongoing work in HADOOP-3628 can act as a starting point, but it is inadequate if you really want "HA". 

-Ignoring HDFS availability, What is going to happen when the farm partitions and both partitions have slaves and a set of task trackers? Who will be in charge?

-I would have expected some citations for an MSc project; presumably this is an early draft.

-Take a look at Anubis; this is how we implement partition awareness/HA, though it currently uses Multicast to bootstrap, so will not work on EC2 without adding a new discovery mechanism (simpleDB?)
 http://wiki.smartfrog.org/wiki/display/sf/Anubis


> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644949#action_12644949 ] 

Francesco Salbaroli commented on HADOOP-4586:
---------------------------------------------

Thanks for your quick response Amar.Just a quick question, is it possible now to have multiple instances of JobTracker running simultaneously on different machines with transparent failover to a redundant copy?

It is a problem reported by the IBM HiPODS team working on the academic initiative

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Salbaroli updated HADOOP-4586:
----------------------------------------

    Remaining Estimate:     (was: 960h)
     Original Estimate:     (was: 960h)

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.18.3
>
>         Attachments: FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Bernd Fondermann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655674#action_12655674 ] 

Bernd Fondermann commented on HADOOP-4586:
------------------------------------------

Francesco,

about how to contribute, please see 
  
  http://wiki.apache.org/hadoop/HowToContribute

Have fun,

Bernd



> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Salbaroli updated HADOOP-4586:
----------------------------------------

    Attachment: jgroups-all.jar
                HADOOP-4586-0.1.patch

This is a very preliminary (and tested only locally) release of Fault tolerant Hadoop.

The hadoop source tree is only slightly modified in the org.apache.hadoop.mapred.TaskTracker class.

The package containing the fault tolerance wrapper is org.apache.hadoop.mapred.faulttolerant.

The JGroups library (jgroups-all.jar) must be copied in the lib/ folder.

To run the FT version of Hadoop:
1) Configure hadoop-site.xml to match the environment
2) Format the HDFS filesystem ($HADOOP_HOME/bin/namenode -format)
3) Run the HDFS daemons (to run locally $HADOOP_HOME/bin/start-dfs.sh)
4) Run one or more instances of FTJobTracker ($HADOOP_HOME/bin/hadoop org.apache.hadoop.mapred.faulttolerant.FTJobTracker)
5) Run one or more instance of FTTaskTracker ($HADOOP_HOME/bin/hadoop org.apache.hadoop.mapred.faulttolerant.FTTaskTracker)

Regards,
	Francesco Salbaroli



> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644948#action_12644948 ] 

Amar Kamat commented on HADOOP-4586:
------------------------------------

Francesco, HADOOP-3245 allows jobtracker to (re)start. If the history files are maintained on dfs then the jobtracker can start on any other machine. The only issue that prevents us from achieving complete fault tolerance is HADOOP-4016. Once that gets fixed, we can _port_ the jobtracker to any machine and _continue_ from there. Hence the jobtracker is fault tolerant but the switch is not seamless and porting requires manual intervention.

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659151#action_12659151 ] 

Otis Gospodnetic commented on HADOOP-4586:
------------------------------------------

Steve, just open it read-only, it works that way.


> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.18.3
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Bo Shi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719981#action_12719981 ] 

Bo Shi commented on HADOOP-4586:
--------------------------------

Hi,

I am in China until June 22nd and will only have intermittent access
to email until then.

Thanks,
Bo

-- 
Bo Shi
207-469-8264


> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, HADOOP-4586v0.3.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Salbaroli updated HADOOP-4586:
----------------------------------------

    Attachment: FaultTolerantHadoop.pdf

Copy of the document with proposals

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Salbaroli updated HADOOP-4586:
----------------------------------------

    Attachment: HADOOP-4586v0.3.patch

Some bugfixes in release 0.3 to work in a distributed environment.

This version has been tested in a distributed VMware Infrastructure 3 environment (using RHEL 5.2 as a guest OS).


> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, HADOOP-4586v0.3.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Salbaroli reassigned HADOOP-4586:
-------------------------------------------

    Assignee: Francesco Salbaroli

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Francesco Salbaroli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662309#action_12662309 ] 

fsalbaroli edited comment on HADOOP-4586 at 1/9/09 2:31 AM:
---------------------------------------------------------------------

>Sorry for the late comments:
>For a master/slave HA solution, two main problems are:
>1. Mechanism that determines a master in a cluster during startup and failover.

The JGroups library (whose manual can be found here: [http://www.jgroups.org/javagroupsnew/docs/manual/pdf/manual.pdf] ) handles automatically the election of a group coordinator. The node elected group coordinator is also the master of the cluster. In case of a failure a new group coordinator (and, consequentially, a new cluster master) will be elected.

>Handling loss of quorum, 

The shared state resides entirely on HDFS (see issues HADOOP-1876 and HADOOP-3245) so, until now, there is no shared soft-state between nodes. However the facilities for managing a shared state are present and can be used in a future update.

>split-brain and fencing in case of split-brain.

The JGroups library tries to automatically handle network partitions and merging, but given that:

* There is no shared soft-state
* There is only one access point in the whole Hadoop cluster to the HDFS (the NameNode)

the network partition problem should not be an issue (only one partition at a time can access the HDFS). In future versions a more elegant way of dealing with network partitions should be added.

> It also >requires comprehensive management tools for configuring, managing and monitoring cluster.

I am now adding JMX support. After the initial testing phase I will post result and an updated version.

>2. Sharing state information between master and slave, so that a slave node can take over as master.
>Currenly the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing >of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add >more information on how the design handles these issues and some notes on how administrator uses this functionality to manage the cluster.

I hope I have given an answer to your question. If you need more, feel free to contact me.

>Also analysis of the impact of job tracker performance due to the introduction of this feature needs to be done.

I am about to begin the testing phase, results will follow

Regards,
Francesco

      was (Author: fsalbaroli):
    >Sorry for the late comments:
>For a master/slave HA solution, two main problems are:
>1. Mechanism that determines a master in a cluster during startup and failover.

The JGroups library (whose manual can be found here: [http://www.jgroups.org/javagroupsnew/docs/manual/pdf/manual.pdf] ) handles automatically the election of a group coordinator. The node elected group coordinator is also the master of the cluster. In case of a failure a new group coordinator (and, consequentially, a new cluster master) will be elected.

>Handling loss of quorum, 

The shared state resides entirely on HDFS (see issues HADOOP-1876 and HADOOP-3245) so, until now, there is no shared soft-state between nodes. However the facilities for managing a shared state are present and can be used in a future update.

>split-brain and fencing in case of split-brain.

The JGroups library tries to automatically handle network partitions and merging, but given that:

*There is no shared soft-state
*There is only one access point in the whole Hadoop cluster to the HDFS (the NameNode)

the network partition problem should not be an issue (only one partition at a time can access the HDFS). In future versions a more elegant way of dealing with network partitions should be added.

> It also >requires comprehensive management tools for configuring, managing and monitoring cluster.

I am now adding JMX support. After the initial testing phase I will post result and an updated version.

>2. Sharing state information between master and slave, so that a slave node can take over as master.
>Currenly the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing >of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add >more information on how the design handles these issues and some notes on how administrator uses this functionality to manage the cluster.

I hope I have given an answer to your question. If you need more, feel free to contact me.

>Also analysis of the impact of job tracker performance due to the introduction of this feature needs to be done.

I am about to begin the testing phase, results will follow

Regards,
Francesco
  
> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4586) Fault tolerant Hadoop Job Tracker

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-4586:
--------------------------------

    Fix Version/s:     (was: 0.18.3)
                   0.21.0

Francesco, given this is a new feature, it can be integrated into the next major release (which is 0.21).  It can't be integrated into a sustaining release.

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>             Fix For: 0.21.0
>
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an eort to enhance perfor-
> mances, with a single JobTracker (master node). It's responsibilities varies
> from managing job submission process, compute the input splits, schedule
> the tasks to the slave nodes (TaskTrackers) and monitor their health.
> In some environments, like the IBM and Google's Internet-scale com-
> puting initiative, there is the need for high-availability, and performances
> becomes a secondary issue. In this environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many dierent ways. 
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.