Posted to common-dev@hadoop.apache.org by Sanjay Radia <sr...@yahoo-inc.com> on 2008/11/14 19:12:27 UTC

Hadoop 1.0 Tasklist/Prerequisites discussion

=========Hadoop 1.0 Tasks/Prerequisites (strawman) =============
================================================================


Release terminology used below:
Standard release numbering:
- Only bug fixes in dot releases: m.x.y
	- no changes to API, disk format, protocols or config etc.
- new features in major (m.0) and minor (m.x.0) releases


The task list below has been separated into the following 3 categories
1. Cleanup and Interface work
2. Mechanisms to support versioning and compatibility (manual or automated)
3. 1.0 Features or Hooks for 1.x Features
    Features that we need before 2.0 and that are likely to break compatibility.
    This needs to be a small list; otherwise we will have feature creep and 1.0
    will take too long.



1. =========Cleanup and Interface Work ======================
1a. Split hdfs, mapRed, core projects

1.b Decide on the visibility and stability of interfaces.
	We need to decide which interfaces are external facing and which are
internal facing.

      *I will shortly start a separate email thread to discuss Hadoop
interface classification.*
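One possible mechanism for such a classification is a set of marker annotations on classes and interfaces. The sketch below is purely illustrative; the annotation and interface names are hypothetical, not a committed Hadoop API.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical marker annotations recording which interfaces are
// external facing (stable for applications) and which are internal.
class InterfaceClassification {
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Public {}     // safe for external applications to use

    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Evolving {}   // external facing, but may still change

    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Private {}    // internal to Hadoop; may change any time

    // Example use on an internal-facing interface (name made up):
    @Private
    public interface DatanodeCommandHandler {
        void handle(int command);
    }
}
```

Runtime retention lets tools (and tests) verify the classification mechanically rather than relying on javadoc conventions alone.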


1.c Interfaces that may deserve cleanup before 1.0
    * MapReduce  (new context objects API)  (targeted for 0.20)
    * FileSystem
	 This is the most important API in the system; let us clean it up
      since we are committing to it for a long time.
        - declare the exceptions for each method (even though they are
          subclasses of IOException).
        - special characters in paths
    * Config
        For example, clients should need to specify only the NN address/
default file
        system. Many of the other parameters should be obtained from the NN.
    * Shell cli interface and shell cli output
    * Client protocols
         - Make data transfer protocol "concrete" to enable versioning
    * Mapred.lib
    * Job logs - if we make them external, classify them as stable or evolving.
    * Intra Hadoop protocols (to enable rolling upgrades)
         - HDFS, MapReduce
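To illustrate the declared-exceptions point for FileSystem above: each method would enumerate the specific IOException subclasses it can throw, so callers can handle each failure mode distinctly. The class and exception names below are hypothetical, not the actual FileSystem API.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical exception type for illustration.
class PermissionDeniedException extends IOException {
    PermissionDeniedException(String msg) { super(msg); }
}

// Sketch of a cleaned-up method signature: the throws clause names the
// specific failure modes rather than a bare IOException.
abstract class CleanedUpFileSystem {
    // Callers can catch "no such path" separately from "not permitted"
    // instead of parsing a generic IOException message.
    abstract boolean delete(String path, boolean recursive)
        throws FileNotFoundException, PermissionDeniedException, IOException;
}
```

Since the declared types subclass IOException, existing code that catches IOException keeps working; only callers that want the finer distinctions need to change.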

1.d Remove deprecated methods



2. =========Mechanisms to support versioning ======================

2a. Serialization and RPC - manual or automated versioning
    A mechanism for versioning (manual or automated) must be selected so that
    we can easily support compatibility while allowing methods to be added and
    fields to be added to RPC parameter data types.
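The manual approach could look like the following sketch: each serialized record carries a version number, and readers supply defaults for fields that older writers did not emit. The record and field names are illustrative, not actual Hadoop types.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch of manual versioning for an RPC parameter data type.
class BlockInfo {
    static final byte VERSION = 2;

    long blockId;         // present since version 1
    long generationStamp; // added in version 2

    void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        out.writeLong(blockId);
        out.writeLong(generationStamp);
    }

    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        blockId = in.readLong();
        // An older (version 1) writer did not emit this field, so a new
        // reader must tolerate its absence and supply a default.
        generationStamp = (version >= 2) ? in.readLong() : 0L;
    }
}
```

The cost of the manual scheme is that every record type must get this discipline right by hand; an automated serialization framework would handle field addition uniformly.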

2b. Dealing with old calling new
	- new hdfs clients calling old servers
     - new mapReduce framework calling clients via the old interface.
     Note we may not need a new mechanism but merely an awareness in the
     community to watch out for such issues.

2c. Support for Protocol Transition at major releases
     Note we may be able to delay this work till release 1.9.
     Since the protocol can break at major releases and customers have
     multiple clusters that will not be upgraded simultaneously, we have to
     consider issues related to cross-cluster access.

     - Need a mechanism for transferring data out
         - Today http serves that purpose. If that is all we need then we
           are done.

     - Today customers do not write apps that access data across clusters
       because the wire protocol can break on any minor release. This will
       change in the 1.x series, where Hadoop will provide wire protocol
       compatibility across minor releases. As a result customers are likely
       to write cross-cluster apps (easy to do using URI file names).
       So we will need to consider our client side being able to talk
       multiple versions of our protocols.
       Again, the good news is that we can probably wait till 1.9 to do this.
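To make the cross-cluster point concrete: with fully qualified URI file names, the cluster is identified by the URI authority (the namenode host:port), so a single client may end up speaking to namenodes running different protocol versions. The sketch below only shows URI-based cluster identification; class and method names are made up (a real app would hand such URIs to the file system layer).

```java
import java.net.URI;

// Sketch: cross-cluster apps name files with fully qualified URIs, so
// the cluster is identified by the URI authority component.
class CrossClusterPaths {
    // e.g. "hdfs://nn1.example.com:8020/user/alice/data" -> "nn1.example.com:8020"
    static String namenodeOf(String path) {
        return URI.create(path).getAuthority();
    }

    // A copy between two such paths crosses clusters when the authorities
    // differ; the client may then face two protocol versions at once.
    static boolean crossesClusters(String src, String dst) {
        return !namenodeOf(src).equals(namenodeOf(dst));
    }
}
```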



3. =========1.0 Features or Hooks for 1.x Features ======================
Hadoop 1.0 has backward compatibility rules (API and wire protocol) that will
require that changes that break compatibility happen only at major release
boundaries (i.e. 1.0, 2.0, 3.0, etc., and not 1.1, 1.2, etc.). Hence features
that we need before 2.0 that are likely to break compatibility need to be
considered now. This needs to be a small list; otherwise we will have feature
creep and 1.0 will take too long.

  3a. Security - authentication is surely going to break the wire protocol.

  3b. Clients survive NN and JT restarts

  Others?
    - Hooks for rolling upgrades
    - Hooks for HA??