Posted to mapreduce-issues@hadoop.apache.org by "Hari A V (JIRA)" <ji...@apache.org> on 2011/01/19 11:14:53 UTC

[jira] Commented: (MAPREDUCE-225) Fault tolerant Hadoop Job Tracker

    [ https://issues.apache.org/jira/browse/MAPREDUCE-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983639#action_12983639 ] 

Hari A V commented on MAPREDUCE-225:
------------------------------------

Hi,

In my team, we have also been analysing how to provide HA for the Job Tracker. Our approach is quite similar to Francesco's.

The complete HA solution can be divided into three aspects:

1. Sharing of job related state between Master and Slave job trackers

	This can be achieved with issues HADOOP-1876 and HADOOP-3245. 

2. Failure Detection and Master Election
	
	We prefer ZooKeeper for this. We had quite bad experiences with JGroups in some of our previous projects, including deadlocks and network traffic overhead (the latest version of JGroups may be more stable), and we were forced to replace it. We consider ZooKeeper the best solution available for leader election. We have seen it used successfully in similar situations in the "Katta" project and in some of our internal projects.

3. How to notify JobClients and TaskTrackers about the new Master on failure
	One option would be DNS, as mentioned.
	Another option is to provide a list of job tracker IPs to JobClients and TaskTrackers. They can silently retry on all available IPs in case of failure. On the server side, slave job trackers will not accept any service requests. This way we can avoid split-brain and network partition scenarios. A ZooKeeper cluster inherently avoids split-brain issues in leader election.
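
To make points 2 and 3 a bit more concrete, here is a rough, self-contained Java sketch. It is only an illustration under assumptions, not Hadoop or ZooKeeper code: the class JobTrackerFailoverSketch, the Tracker type, the method names and the addresses are all hypothetical. The election rule is modelled on the usual ZooKeeper leader-election recipe (the live candidate with the lowest sequence number wins), but it is simulated with a plain map instead of a real ZooKeeper client.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: none of these types exist in Hadoop or ZooKeeper.
final class JobTrackerFailoverSketch {

    /** Hypothetical stand-in for one JobTracker process. */
    static final class Tracker {
        final String address;
        boolean alive = true;
        boolean master = false;

        Tracker(String address) {
            this.address = address;
        }

        /** Only a live master serves requests; slave (standby) trackers refuse them. */
        String submit(String jobName) {
            if (!alive) {
                throw new IllegalStateException(address + " is down");
            }
            if (!master) {
                throw new IllegalStateException(address + " is a standby");
            }
            return jobName + " accepted by " + address;
        }
    }

    /**
     * Simulated election: the map key plays the role of the sequence number
     * that an ephemeral sequential znode would carry (an assumption of this
     * sketch). The live candidate with the lowest key becomes master and
     * every other candidate is demoted.
     */
    static Tracker elect(Map<Integer, Tracker> candidatesBySequence) {
        Tracker winner = null;
        int lowest = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Tracker> entry : candidatesBySequence.entrySet()) {
            Tracker t = entry.getValue();
            t.master = false;
            if (t.alive && entry.getKey() < lowest) {
                lowest = entry.getKey();
                winner = t;
            }
        }
        if (winner != null) {
            winner.master = true;
        }
        return winner;
    }

    /** Client side: try each configured address in order until one accepts. */
    static String submitWithFailover(List<Tracker> configured, String jobName) {
        for (Tracker t : configured) {
            try {
                return t.submit(jobName);
            } catch (IllegalStateException refusedOrDown) {
                // Silently move on to the next configured address.
            }
        }
        throw new IllegalStateException("no JobTracker accepted " + jobName);
    }

    public static void main(String[] args) {
        Tracker a = new Tracker("jt-a:9001"); // hypothetical addresses
        Tracker b = new Tracker("jt-b:9001");
        Map<Integer, Tracker> candidates = new LinkedHashMap<>();
        candidates.put(1, a);
        candidates.put(2, b);
        List<Tracker> configured = Arrays.asList(a, b);

        elect(candidates);
        System.out.println(submitWithFailover(configured, "job-1"));

        a.alive = false;   // the master fails
        elect(candidates); // the remaining candidate takes over
        System.out.println(submitWithFailover(configured, "job-2"));
    }
}
```

In this sketch the first job is accepted by jt-a and, after jt-a is marked as failed and the election is re-run, the second job is accepted by jt-b. Because at most one tracker is marked master at any time and the others refuse requests, a client that reaches the wrong address simply moves on to the next one in its list.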

We have not yet started our work. Please provide your valuable opinions. 

thanks
Hari


> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: MAPREDUCE-225
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-225
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, HADOOP-4586v0.3.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an effort to enhance performance,
> with a single JobTracker (master node). Its responsibilities range from
> managing the job submission process and computing the input splits to
> scheduling tasks on the slave nodes (TaskTrackers) and monitoring their health.
> In some environments, like the IBM and Google's Internet-scale computing
> initiative, there is a need for high availability, and performance
> becomes a secondary issue. In these environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many different ways.
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!
