Posted to common-commits@hadoop.apache.org by ac...@apache.org on 2013/06/17 11:32:28 UTC

svn commit: r1493693 - in /hadoop/common/trunk/hadoop-common-project/hadoop-common: CHANGES.txt src/site/apt/Compatibility.apt.vm

Author: acmurthy
Date: Mon Jun 17 09:32:27 2013
New Revision: 1493693

URL: http://svn.apache.org/r1493693
Log:
HADOOP-9517. Documented various aspects of compatibility for Apache Hadoop. Contributed by Karthik Kambatla.

Added:
    hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/Compatibility.apt.vm
Modified:
    hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt

Modified: hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt?rev=1493693&r1=1493692&r2=1493693&view=diff
==============================================================================
--- hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt (original)
+++ hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt Mon Jun 17 09:32:27 2013
@@ -457,6 +457,9 @@ Release 2.1.0-beta - UNRELEASED
     HADOOP-9649. Promoted YARN service life-cycle libraries into Hadoop Common
     for usage across all Hadoop projects. (Zhijie Shen via vinodkv)
 
+    HADOOP-9517. Documented various aspects of compatibility for Apache
+    Hadoop. (Karthik Kambatla via acmurthy)
+
   OPTIMIZATIONS
 
     HADOOP-9150. Avoid unnecessary DNS resolution attempts for logical URIs

Added: hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/Compatibility.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/Compatibility.apt.vm?rev=1493693&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/Compatibility.apt.vm (added)
+++ hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/Compatibility.apt.vm Mon Jun 17 09:32:27 2013
@@ -0,0 +1,509 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+Apache Hadoop Compatibility
+  ---
+  ---
+  ${maven.build.timestamp}
+
+Apache Hadoop Compatibility
+
+%{toc|section=1|fromDepth=0}
+
+* Purpose
+
+  This document captures the compatibility goals of the Apache Hadoop
+  project. The different types of compatibility between Hadoop
+  releases that affect Hadoop developers, downstream projects, and
+  end-users are enumerated. For each type of compatibility we:
+  
+  * describe the impact on downstream projects or end-users
+ 
+  * where applicable, call out the policy adopted by the Hadoop
+   developers when incompatible changes are permitted.
+
+* Compatibility types
+
+** Java API
+
+   Hadoop interfaces and classes are annotated to describe the intended
+   audience and stability in order to maintain compatibility with previous
+   releases. See {{{./InterfaceClassification.html}Hadoop Interface
+   Classification}} for details; a short example follows the list below.
+
+   * InterfaceAudience: captures the intended audience; possible
+   values are Public (for end users and external projects),
+   LimitedPrivate (for other Hadoop components and closely related
+   projects like YARN, MapReduce, HBase etc.), and Private (for
+   intra-component use).
+ 
+   * InterfaceStability: describes what types of interface changes are
+   permitted. Possible values are Stable, Evolving, Unstable, and Deprecated.
+
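+   For example, a class intended for downstream use might be annotated as
+   follows. This is a minimal sketch: the annotations are the real ones from
+   org.apache.hadoop.classification, but the class itself is hypothetical.
+
++---+
+import org.apache.hadoop.classification.InterfaceAudience;
+import org.apache.hadoop.classification.InterfaceStability;
+
+// Safe for end users and downstream projects; changes are governed by
+// the Public-Stable policy described below.
+@InterfaceAudience.Public
+@InterfaceStability.Stable
+public class RecordUtils {  // hypothetical class
+
+  // Individual members may carry more restrictive annotations.
+  @InterfaceAudience.Private
+  @InterfaceStability.Unstable
+  static void internalHelper() { /* may change at any time */ }
+}
++---+
+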
+*** Use Cases
+
+    * Public-Stable API compatibility is required to ensure end-user programs
+     and downstream projects continue to work without modification.
+
+    * LimitedPrivate-Stable API compatibility is required to allow upgrade of
+     individual components across minor releases.
+
+    * Private-Stable API compatibility is required for rolling upgrades.
+
+*** Policy
+
+    * Public-Stable APIs must be deprecated for at least one major release
+    prior to their removal in a major release.
+
+    * LimitedPrivate-Stable APIs can change across major releases,
+    but not within a major release.
+
+    * Private-Stable APIs can change across major releases,
+    but not within a major release.
+
+    * Note: APIs generated from the proto files need to be compatible for
+    rolling upgrades. See the section on wire compatibility for more details.
+    The compatibility policies for APIs and wire communication need to go
+    hand-in-hand to address this.
+
+** Semantic compatibility
+
+   Apache Hadoop strives to ensure that the behavior of APIs remains
+   consistent over versions, though changes for correctness may result in
+   changes in behavior. Tests and javadocs specify the API's behavior.
+   The community is in the process of specifying some APIs more rigorously,
+   and enhancing test suites to verify compliance with the specification,
+   effectively creating a formal specification for the subset of behaviors
+   that can be easily tested.
+
+*** Policy
+
+   The behavior of an API may be changed to fix incorrect behavior;
+   such a change must be accompanied by updating existing buggy tests or
+   adding tests in cases where there were none prior to the change.
+
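+   As an illustration, documented behavior is typically pinned down by tests
+   like the following minimal sketch. It is not an actual Hadoop test; it
+   assumes JUnit 4 and uses the real org.apache.hadoop.conf.Configuration
+   API:
+
++---+
+import static org.junit.Assert.assertEquals;
+
+import org.apache.hadoop.conf.Configuration;
+import org.junit.Test;
+
+public class TestConfigurationSemantics {
+
+  // Pins down behavior callers rely on: an unset key yields the
+  // supplied default. Silently changing this would be a semantic
+  // incompatibility of the kind described above.
+  @Test
+  public void unsetKeyReturnsDefault() {
+    Configuration conf = new Configuration(false); // skip default resources
+    assertEquals(42, conf.getInt("test.key.never.set", 42));
+  }
+}
++---+
+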
+** Wire compatibility
+
+   Wire compatibility concerns data being transmitted over the wire
+   between Hadoop processes. Hadoop uses Protocol Buffers for most RPC
+   communication. Preserving compatibility requires prohibiting
+   modification to the required fields of the corresponding protocol
+   buffer. Optional fields may be added without breaking backwards
+   compatibility. Non-RPC communication should be considered as well,
+   for example using HTTP to transfer an HDFS image as part of
+   snapshotting or transferring MapTask output. The potential
+   communications can be categorized as follows:
+ 
+   * Client-Server: communication between Hadoop clients and servers (e.g.,
+   the HDFS client to NameNode protocol, or the YARN client to
+   ResourceManager protocol).
+
+   * Client-Server (Admin): It is worth distinguishing a subset of the
+   Client-Server protocols used solely by administrative commands (e.g.,
+   the HAAdmin protocol) as these protocols only impact administrators,
+   who can tolerate changes that end users (who use the general
+   Client-Server protocols) cannot.
+
+   * Server-Server: communication between servers (e.g., the protocol between
+   the DataNode and NameNode, or NodeManager and ResourceManager)
+
+*** Use Cases
+    
+    * Client-Server compatibility is required to allow users to
+    continue using the old clients even after upgrading the server
+    (cluster) to a later version (or vice versa).  For example, a
+    Hadoop 2.1.0 client talking to a Hadoop 2.3.0 cluster.
+
+    * Client-Server compatibility is also required to allow upgrading
+    individual components without upgrading others. For example,
+    upgrade HDFS from version 2.1.0 to 2.2.0 without upgrading MapReduce.
+
+    * Server-Server compatibility is required to allow mixed versions
+    within an active cluster so the cluster may be upgraded without
+    downtime.
+
+*** Policy
+
+    * Both Client-Server and Server-Server compatibility are preserved within
+    a major release. (Different policies for different categories are yet to
+    be considered.)
+
+    * The source files generated from the proto files need to be
+    compatible within a major release to facilitate rolling
+    upgrades. The proto files are governed by the following:
+
+      * The following changes are NEVER allowed:
+
+        * Change a field id.
+
+        * Reuse an old field id that was previously deleted. Field ids are
+          cheap, so changing and reusing them is never a good idea.
+
+      * The following changes cannot be made to a stable .proto except at a 
+      major release:
+
+        * Modify a field type in an incompatible way (as defined recursively)
+
+        * Add or delete a required field
+
+        * Delete an optional field
+
+      * The following changes are allowed at any time:
+
+        * Add an optional field, but ensure the code allows communication with
+          prior versions of the client code which did not have that field
+          (see the sketch after this list).
+
+        * Rename a field
+
+        * Rename a .proto file
+
+        * Change .proto annotations that affect code generation (e.g. name of
+          java package)
+
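+    To illustrate the "add an optional field" rule above, the Java code
+    generated by protobuf for an optional field provides a presence check
+    that lets a server cope with older clients. The following is a minimal
+    sketch; the message type, field name, and default are hypothetical:
+
++---+
+public class RegistrationHandler {
+  static final int DEFAULT_SHUFFLE_PORT = 13562; // hypothetical default
+
+  // RegistrationProto is a hypothetical protobuf-generated message type
+  // with a newly added optional field 'shuffle_port'.
+  void handleRegistration(RegistrationProto req) {
+    int port;
+    if (req.hasShufflePort()) {
+      port = req.getShufflePort();  // newer client: use the value it sent
+    } else {
+      port = DEFAULT_SHUFFLE_PORT;  // older client: fall back to a default
+    }
+    // ... proceed using 'port' ...
+  }
+}
++---+
+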
+** Java Binary compatibility for end-user applications i.e. Apache Hadoop ABI
+
+  As Apache Hadoop revisions are upgraded, end-users reasonably expect that 
+  their applications should continue to work without any modifications. 
+  This is fulfilled as a result of supporting API compatibility, semantic 
+  compatibility, and wire compatibility.
+  
+  However, Apache Hadoop is a very complex, distributed system and services a 
+  very wide variety of use-cases. In particular, Apache Hadoop MapReduce 
+  exposes a very wide API, in the sense that end-users may make wide-ranging 
+  assumptions such as the layout of the local disk when their map/reduce 
+  tasks are executing, the environment variables of their tasks, etc. In such 
+  cases, it becomes very hard to fully specify, and support, absolute 
+  compatibility.
+ 
+*** Use cases
+
+    * Existing MapReduce applications, including jars of existing packaged 
+      end-user applications and projects such as Apache Pig, Apache Hive, 
+      Cascading etc. should work unmodified when pointed to an upgraded Apache 
+      Hadoop cluster within a major release. 
+
+    * Existing YARN applications, including jars of existing packaged 
+      end-user applications and projects such as Apache Tez etc. should work 
+      unmodified when pointed to an upgraded Apache Hadoop cluster within a 
+      major release. 
+
+    * Existing applications which transfer data in/out of HDFS, including jars 
+      of existing packaged end-user applications and frameworks such as Apache 
+      Flume, should work unmodified when pointed to an upgraded Apache Hadoop 
+      cluster within a major release. 
+
+*** Policy
+
+    * Existing MapReduce, YARN & HDFS applications and frameworks should work 
+      unmodified within a major release, i.e. the Apache Hadoop ABI is 
+      supported.
+
+    * A very minor fraction of applications may be affected by changes to 
+      disk layouts etc.; the developer community will strive to minimize these 
+      changes and will not make them within a minor version. In more egregious 
+      cases, we will strongly consider reverting these breaking changes and 
+      invalidating offending releases if necessary.
+
+    * In particular for MapReduce applications, the developer community will 
+      try its best to provide binary compatibility across major 
+      releases, e.g. applications using org.apache.hadoop.mapred.* APIs are 
+      supported compatibly across hadoop-1.x and hadoop-2.x. See 
+      {{{../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html}
+      Compatibility for MapReduce applications between hadoop-1.x and hadoop-2.x}} 
+      for more details.
+
+** REST APIs
+
+  REST API compatibility covers both the request (URLs) and the responses
+   to each request (content, which may contain other URLs). Hadoop REST APIs
+   are specifically meant for stable use by clients across releases,
+   even major releases. The following are the exposed REST APIs (a usage
+   sketch follows the list):
+
+  * {{{../hadoop-hdfs/WebHDFS.html}WebHDFS}} - Stable
+
+  * {{{../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html}ResourceManager}}
+
+  * {{{../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html}NodeManager}}
+
+  * {{{../hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html}MR Application Master}}
+
+  * {{{../hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html}History Server}}
+  
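+  For instance, a client coded against the stable WebHDFS API can expect a
+  call like the following to keep working across releases. This is a minimal
+  sketch; the host, port, and path are placeholders, while LISTSTATUS is a
+  real WebHDFS operation:
+
++---+
+import java.io.BufferedReader;
+import java.io.InputStreamReader;
+import java.net.HttpURLConnection;
+import java.net.URL;
+
+public class WebHdfsList {
+  public static void main(String[] args) throws Exception {
+    // List a directory via the versioned, stable WebHDFS REST endpoint.
+    URL url = new URL(
+        "http://namenode.example.com:50070/webhdfs/v1/user?op=LISTSTATUS");
+    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
+    try (BufferedReader in = new BufferedReader(
+        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
+      String line;
+      while ((line = in.readLine()) != null) {
+        System.out.println(line); // JSON "FileStatuses" response body
+      }
+    }
+  }
+}
++---+
+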
+*** Policy
+    
+    The APIs annotated stable in the text above preserve compatibility
+    across at least one major release, and may be deprecated by a newer 
+    version of the REST API in a major release.
+
+** Metrics/JMX
+
+   While the Metrics API compatibility is governed by Java API compatibility,
+   the actual metrics exposed by Hadoop need to be compatible for users to
+   be able to automate using them (scripts etc.). Adding additional metrics
+   is compatible. Modifying (e.g., changing the unit or measurement) or
+   removing existing metrics breaks compatibility. Similarly, changes to
+   JMX MBean object names also break compatibility.
+
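+   For example, monitoring tools often read Hadoop metrics through JMX. The
+   following minimal sketch reads an MBean attribute in-process; the
+   Hadoop:service=...,name=... object name shown is illustrative, and real
+   setups usually connect to a remote daemon or scrape its /jmx servlet:
+
++---+
+import java.lang.management.ManagementFactory;
+
+import javax.management.MBeanServer;
+import javax.management.ObjectName;
+
+public class ReadMetric {
+  public static void main(String[] args) throws Exception {
+    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
+    // Illustrative MBean name; renaming it breaks tools like this one.
+    ObjectName name =
+        new ObjectName("Hadoop:service=NameNode,name=FSNamesystem");
+    Object capacity = server.getAttribute(name, "CapacityTotal");
+    System.out.println("CapacityTotal = " + capacity);
+  }
+}
++---+
+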
+*** Policy 
+
+    Metrics should preserve compatibility within the major release.
+
+** File formats & Metadata
+
+   User and system level data (including metadata) is stored in files of
+   different formats. Changes to the metadata or the file formats used to
+   store data/metadata can lead to incompatibilities between versions.
+
+*** User-level file formats
+
+    Changes to formats that end-users use to store their data can prevent
+    them from accessing the data in later releases, and hence it is highly
+    important to keep those file-formats compatible. One can always add a
+    "new" format improving upon an existing format. Examples of these formats
+    include har, war, SequenceFileFormat etc.
+
+**** Policy
+
+     * Non-forward-compatible user-file format changes are
+     restricted to major releases. When user-file formats change, new
+     releases are expected to read existing formats, but may write data
+     in formats incompatible with prior releases. Also, the community
+     shall prefer to create a new format that programs must opt in to
+     instead of making incompatible changes to existing formats.
+
+*** System-internal file formats
+
+    Hadoop internal data is also stored in files, and changing these
+    formats can likewise lead to incompatibilities. While such changes are
+    not as devastating as changes to user-level file formats, a policy on
+    when compatibility can be broken is important.
+
+**** MapReduce
+
+     MapReduce uses formats like IFile to store MapReduce-specific data.
+
+***** Policy
+
+     MapReduce-internal formats like IFile maintain compatibility within a
+     major release. Changes to these formats can cause in-flight jobs to fail 
+     and hence we should ensure newer clients can fetch shuffle-data from old 
+     servers in a compatible manner.
+
+**** HDFS Metadata
+
+    HDFS persists metadata (the image and edit logs) in a particular format.
+    Incompatible changes to either the format or the metadata prevent
+    subsequent releases from reading older metadata. Such incompatible
+    changes might require an HDFS "upgrade" to convert the metadata to make
+    it accessible. Some changes can require more than one such "upgrade".
+
+    Depending on the degree of incompatibility in the changes, the following
+    potential scenarios can arise:
+
+    * Automatic: The image upgrades automatically, no need for an explicit
+    "upgrade".
+
+    * Direct: The image is upgradable, but might require one explicit release
+    "upgrade".
+
+    * Indirect: The image is upgradable, but might require upgrading to
+    intermediate release(s) first.
+
+    * Not upgradeable: The image is not upgradeable.
+
+***** Policy
+
+    * A release upgrade must allow a cluster to roll back to the older
+    version and its older disk format. The rollback needs to restore the
+    original data, but is not required to restore the updated data.
+
+    * HDFS metadata changes must be upgradeable via any of the upgrade
+    paths - automatic, direct or indirect.
+
+    * More detailed policies based on the kind of upgrade are yet to be
+    considered.
+
+** Command Line Interface (CLI)
+
+   The Hadoop command line programs may be used either directly via the
+   system shell or via shell scripts. Changing the path of a command,
+   removing or renaming command line options, changing the order of
+   arguments, or changing the command return code or output breaks
+   compatibility and may adversely affect users.
+   
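+   Automation commonly depends on those return codes. The following minimal
+   sketch assumes a "hadoop" launcher on the PATH and uses the real
+   "hadoop fs -test -e" command, which exits 0 when the path exists:
+
++---+
+import java.util.Arrays;
+
+public class CliCheck {
+  public static void main(String[] args) throws Exception {
+    // Scripts rely on the documented return code of CLI commands;
+    // changing the code or the option names breaks automation like this.
+    Process p = new ProcessBuilder(
+        Arrays.asList("hadoop", "fs", "-test", "-e", "/user/data"))
+        .inheritIO()
+        .start();
+    int rc = p.waitFor();
+    System.out.println(rc == 0 ? "path exists" : "path missing");
+  }
+}
++---+
+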
+*** Policy 
+
+    CLI commands are to be deprecated (warning when used) for one
+    major release before they are removed or incompatibly modified in
+    a subsequent major release.
+
+** Web UI
+
+   Changes to the web UI, particularly the content and layout of web
+   pages, could potentially interfere with attempts to screen-scrape
+   the web pages for information.
+
+*** Policy
+
+    Web pages are not meant to be scraped and hence incompatible
+    changes to them are allowed at any time. Users are expected to use
+    REST APIs to get any information.
+
+** Hadoop Configuration Files
+
+   Users use (1) Hadoop-defined properties to configure and provide hints to
+   Hadoop and (2) custom properties to pass information to jobs. Hence,
+   compatibility of config properties is two-fold:
+
+   * Modifying key-names, units of values, and default values of
+     Hadoop-defined properties breaks compatibility.
+
+   * Custom configuration property keys should not conflict with the
+     namespace of Hadoop-defined properties. Typically, users should
+     avoid using prefixes used by Hadoop: hadoop, io, ipc, fs, net,
+     file, ftp, s3, kfs, ha, dfs, mapred, mapreduce, yarn (a sketch
+     follows this list).
+
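+   A minimal sketch of the two kinds of properties, using the real
+   org.apache.hadoop.conf.Configuration API; the custom "myapp." key is
+   hypothetical:
+
++---+
+import org.apache.hadoop.conf.Configuration;
+
+public class ConfExample {
+  public static void main(String[] args) {
+    Configuration conf = new Configuration();
+
+    // (1) Hadoop-defined property: its key name, units, and default
+    // value are governed by the policy below.
+    String fsUri = conf.get("fs.defaultFS");
+
+    // (2) Custom job property: the hypothetical "myapp." prefix stays
+    // clear of the Hadoop namespace listed above.
+    conf.set("myapp.input.validation", "strict");
+
+    System.out.println("fs.defaultFS = " + fsUri);
+  }
+}
++---+
+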
+*** Policy 
+
+    * Hadoop-defined properties are to be deprecated at least for one
+      major release before being removed. Modifying units for existing
+      properties is not allowed.
+
+    * The default values of Hadoop-defined properties can
+      be changed across minor/major releases, but will remain the same
+      across point releases within a minor release.
+
+    * Currently, there is NO explicit policy regarding when new
+      prefixes can be added/removed, or the list of prefixes to be
+      avoided for custom configuration properties. However, as noted above, 
+      users should avoid using prefixes used by Hadoop: hadoop, io, ipc, fs, 
+      net, file, ftp, s3, kfs, ha, dfs, mapred, mapreduce, yarn.
+           
+** Directory Structure 
+
+   Source code, artifacts (source and tests), user logs, configuration files,
+   output, and job history are all stored on disk, either on the local file
+   system or in HDFS. Changing the directory structure of these
+   user-accessible files breaks compatibility, even in cases where the
+   original path is preserved via symbolic links (for example, if the path
+   is accessed by a servlet that is configured to not follow symbolic links).
+
+*** Policy
+
+    * The layout of source code and build artifacts can change
+      anytime, particularly so across major versions. Within a major
+      version, the developers will attempt (no guarantees) to preserve
+      the directory structure; however, individual files can be
+      added/moved/deleted. The best way to ensure patches stay in sync
+      with the code is to get them committed to the Apache source tree.
+
+    * The directory structure of configuration files, user logs, and
+      job history will be preserved across minor and point releases
+      within a major release.
+
+** Java Classpath
+
+   User applications built against Hadoop might add all Hadoop jars
+   (including Hadoop's library dependencies) to the application's
+   classpath. Adding new dependencies or updating the version of
+   existing dependencies may interfere with those in applications'
+   classpaths.
+
+*** Policy
+
+    Currently, there is NO policy on when Hadoop's dependencies can
+    change.
+
+** Environment variables
+
+   Users and related projects often utilize the exported environment
+   variables (e.g., HADOOP_CONF_DIR); therefore, removing or renaming
+   environment variables is an incompatible change.
+
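+   For example, code that reads such variables defensively is less exposed
+   to change. A minimal sketch; HADOOP_CONF_DIR is taken from the text above
+   and the fallback path is hypothetical:
+
++---+
+public class EnvExample {
+  public static void main(String[] args) {
+    // HADOOP_CONF_DIR is one of the exported variables noted above.
+    String confDir = System.getenv("HADOOP_CONF_DIR");
+    if (confDir == null) {
+      confDir = "/etc/hadoop/conf"; // hypothetical site-specific fallback
+    }
+    System.out.println("Using configuration from " + confDir);
+  }
+}
++---+
+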
+*** Policy
+
+    Currently, there is NO policy on when the environment variables
+    can change. Developers try to limit changes to major releases.
+
+** Build artifacts
+
+   Hadoop uses Maven for project management, and changing the artifacts
+   can affect existing user workflows.
+
+*** Policy
+
+   * Test artifacts: The test jars generated are strictly for internal
+     use and are not expected to be used outside of Hadoop, similar to
+     APIs annotated @Private, @Unstable.
+
+   * Built artifacts: The hadoop-client artifact (maven
+     groupId:artifactId) stays compatible within a major release,
+     while the other artifacts can change in incompatible ways.
+
+** Hardware/Software Requirements
+
+   To keep up with the latest advances in hardware, operating systems,
+   JVMs, and other software, new Hadoop releases or some of their
+   features might require higher versions of the same. For a specific
+   environment, upgrading Hadoop might require upgrading other
+   dependent software components.
+
+*** Policies
+
+    * Hardware
+
+      * Architecture: The community has no plans to restrict Hadoop to
+        specific architectures, but can have family-specific
+        optimizations.
+
+      * Minimum resources: While there are no guarantees on the
+        minimum resources required by Hadoop daemons, the community
+        attempts to not increase requirements within a minor release.
+
+    * Operating Systems: The community will attempt to maintain the
+      same OS requirements (OS kernel versions) within a minor
+      release. Currently GNU/Linux and Microsoft Windows are the OSes officially 
+      supported by the community, while Apache Hadoop is known to work reasonably 
+      well on other OSes such as Apple MacOSX, Solaris etc.
+
+    * The JVM requirements will not change across point releases
+      within the same minor release, except if the JVM version in
+      question becomes unsupported. Minor/major releases might require
+      later versions of the JVM for some/all of the supported operating
+      systems.
+
+    * Other software: The community tries to maintain the minimum
+      versions of additional software required by Hadoop, for example
+      ssh, Kerberos etc.
+  
+* References
+  
+  Here are some relevant JIRAs and pages related to the topic:
+
+  * The evolution of this document -
+    {{{https://issues.apache.org/jira/browse/HADOOP-9517}HADOOP-9517}}
+
+  * Binary compatibility for MapReduce end-user applications between hadoop-1.x and hadoop-2.x -
+    {{{../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html}MapReduce Compatibility between hadoop-1.x and hadoop-2.x}}
+
+  * Annotations for interfaces as per interface classification
+    schedule -
+    {{{https://issues.apache.org/jira/browse/HADOOP-7391}HADOOP-7391}}
+    {{{InterfaceClassification.html}Hadoop Interface Classification}}
+
+  * Compatibility for Hadoop 1.x releases -
+    {{{https://issues.apache.org/jira/browse/HADOOP-5071}HADOOP-5071}}
+
+  * The {{{http://wiki.apache.org/hadoop/Roadmap}Hadoop Roadmap}} page
+    that captures other release policies
+