Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2008/03/27 22:09:58 UTC

[Hadoop Wiki] Update of "HdfsFutures" by SanjayRadia

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SanjayRadia:
http://wiki.apache.org/hadoop/HdfsFutures

New page:
= HDFS Futures: Categorized List and Descriptions of HDFS Future Features =


'''The following page is under development - it is being converted from TWiki to Wiki'''

== Goal: Make HDFS Real ==

   1. Reliable and Secure: The file system is solid enough for users to feel comfortable using it in "production"
         * Availability and integrity of HDFS are good enough
            * Availability of the NN and integrity of NN data
            * Availability of the file data and its integrity
         * Secure
            * Access control - in 0.16
            * Secure authentication (0.18?)
   1. Good Enough Performance: HDFS should not limit the scaling of the Grid or the utilization of the nodes in the Grid
      * Handle a large number of files
      * Handle a large number of clients
      * Low latency of HDFS operations - this affects the utilization of the client nodes
      * High throughput of HDFS operations
   1. Rich Enough FS Features for applications
      * e.g. append
      * e.g. good performance for random IO
   1. Sufficient Operations and Management features to manage a large 4K-node cluster
      * Easy to configure, upgrade, etc.
      * BCP, snapshots, backups

== Service Scaling ==


There are two main issues here:
   * scaling the name space (i.e. the number of files and directories we can handle)
   * scaling the performance of the name service - i.e. its throughput and latency, and in particular the number of concurrent clients
Improving one may improve the other.
   * E.g. moving Block map functionality to slave NNs will free up storage for Name entries
   * E.g. partitioning the name space can also improve the performance of each NN slave

=== Summary of various options that scale name space and performance (details below) ===
 (Also see [[http://twiki.corp.yahoo.com/pub/Grid/HdfsFeaturePlanning/ScaleNN_Sea_of_Options.pdf][Scaling NN: Sea of Options]])
   * Grow memory
      * Scales name space but not performance
      * Issue: GC and Java scaling for large memories
   * Read-only Replicas of NN
      * Scales performance but not namespace
      * Adds reliability and one of the steps towards HA
   * Partition Namespace: Multiple namespace volumes
      * Scales both
      * Retains HDFS’s design philosophy but needs a simplified automounter and management of horizontally scaled NN servers
   * Split function on NN (Namespace and Block maps)
      * Scales the name space x3, with a little performance scaling
   * Page-in partial name space from disk (as in traditional FSs)
      * Scales namespace but not performance, unless multiple volumes

=== Scaling Name Service Throughput and Response Time ===

   * Distribute/Partition/Replicate the NN functionality across multiple computers
      * Read-only replicas of the name node
         * What is the ratio of reads to writes? Get data from Simon
         * Note: RO replicas can be useful for the HA solution and for checkpoint rolling
         * Add pointer to the RO replicas design we did on the whiteboard. TBD (a sketch of read/write splitting follows below)
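
A minimal sketch of read/write splitting against NN replicas. The stub interface and all names here are hypothetical, not the real client protocol; the point is only that reads can fan out while mutations must still go to the primary:

{{{
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical stand-in for the NN client protocol.
interface NameNodeStub {
    FileStatus getFileInfo(String path);  // read-only operation
    boolean mkdirs(String path);          // mutating operation
}

class FileStatus { }

class ReplicaAwareClient {
    private final NameNodeStub primary;
    private final List<NameNodeStub> readOnlyReplicas;

    ReplicaAwareClient(NameNodeStub primary, List<NameNodeStub> replicas) {
        this.primary = primary;
        this.readOnlyReplicas = replicas;
    }

    // Reads may be served by any replica. Caveat: a replica can lag the
    // primary slightly, so clients may see slightly stale metadata.
    FileStatus getFileInfo(String path) {
        NameNodeStub nn = readOnlyReplicas.isEmpty() ? primary
            : readOnlyReplicas.get(ThreadLocalRandom.current().nextInt(readOnlyReplicas.size()));
        return nn.getFileInfo(path);
    }

    // Anything that mutates the namespace must still go to the primary NN.
    boolean mkdirs(String path) {
        return primary.mkdirs(path);
    }
}
}}}

Whether this pays off depends directly on the read/write ratio asked about above.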

      * Partition by function (also scales namespace and addressable storage space)
         * E.g. move block management and processing to slave NN.
         * E.g. move Replica management to slave NN

      * Partition by name space - i.e. different parts of the name space are handled by different NNs (see below)
         * This helps scale both the performance of the NN and the name space itself

   * RPC and Timeout issues
      * When load spikes occur, the clients time out and a spiral of death ensues
      * See [[#Hadoop_Protocol_RPC][ Hadoop Protocol RPC]]

   * Higher concurrency in Namespace access (more sophisticated Namespace locking)
      * This is probably an issue only on NN restart, not during normal operation
      * Improving concurrency is hard since it will require redesign and testing
         * Better to do this when the NN is being redesigned for other reasons.

   * Journaling and  Sync
      * '''Benefits''': improved latency, better client utilization, fewer timeouts, greater throughput
      * Improve remote syncs
         * Approach 1 - NVRAM NFS file system - investigate this
         * Approach 2 - if a flush on NFS pushes the data to the NFS server, this may be good enough if there is a local sync - investigate
      * Lazy syncs - need to investigate the benefit and cost (latency); see the sketch after this list
         * Delay the reply by a few milliseconds to allow for more bunching of syncs
         * This increases the latency
      * NVRAM for journal
      * Async syncs [No!!!]
         * Reply as soon as memory is updated
         * This changes the semantics
            * Note: even though UNIX works like this, the failure of the machine implies failure of the client and the FS ''together''
            * In a distributed file system there is partial failure; furthermore, one expects an HA'ed NN not to lose data.
      * Issue: if we use async syncs the semantics change - this is an issue. What about HA?
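
A minimal sketch of the lazy-sync (group commit) idea: handler threads wait on a single background syncer that bunches a few milliseconds of transactions into one flush to stable storage. The class and method names are made up for illustration:

{{{
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class GroupCommitJournal {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition syncDone = lock.newCondition();
    private long lastWrittenTxId = 0;  // highest txid appended to the in-memory buffer
    private long lastSyncedTxId = 0;   // highest txid known to be on stable storage

    // Append an edit record; returns its transaction id.
    long append(byte[] edit) {
        lock.lock();
        try {
            // (buffer the record here)
            return ++lastWrittenTxId;
        } finally {
            lock.unlock();
        }
    }

    // Handler threads call this after append(); returns once the edit is durable.
    void logSync(long myTxId) throws InterruptedException {
        lock.lock();
        try {
            while (lastSyncedTxId < myTxId) {
                syncDone.await();  // woken by the syncer after each batched flush
            }
        } finally {
            lock.unlock();
        }
    }

    // A single background thread runs this loop.
    void syncerLoop() throws InterruptedException {
        while (true) {
            TimeUnit.MILLISECONDS.sleep(5);  // the "few milliseconds" of bunching
            long syncUpTo;
            lock.lock();
            try {
                if (lastWrittenTxId == lastSyncedTxId) continue;  // nothing pending
                syncUpTo = lastWrittenTxId;
            } finally {
                lock.unlock();
            }
            fsyncEditLog();  // one slow flush covers every txid up to syncUpTo
            lock.lock();
            try {
                lastSyncedTxId = syncUpTo;
                syncDone.signalAll();  // release all requests batched behind this sync
            } finally {
                lock.unlock();
            }
        }
    }

    private void fsyncEditLog() { /* e.g. FileChannel.force(false) on the edit log */ }
}
}}}

The trade-off is the one noted above: each call waits up to the bunching interval, but a single flush now covers many transactions.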

   * Move more functionality to the data node
      * Distributed replica creation - not simple

   * Improve Block report processing [[https://issues.apache.org/jira/browse/HADOOP-2448][HADOOP-2448]]
      * 2K nodes mean a block report arriving every 3 sec.
      * Currently: each DN sends a full BR as an array of longs every hour. The initial BR has a random backoff (configurable)
      * Incremental and event-based B-reports - [[https://issues.apache.org/jira/browse/HADOOP-1079][HADOOP-1079]] (see the sketch after this list)
         * E.g. when a disk is lost, or blocks are deleted, etc.
         * The DN can determine what, if anything, has changed and send a report only if there are changes
      * Send only checksums
         * The NN recalculates the checksum, or keeps a rolling checksum
      * Make the initial block report's random backoff dynamically set via the NN when DNs register - [[https://issues.apache.org/jira/browse/HADOOP-2444][HADOOP-2444]]
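
A minimal sketch of the incremental-report idea, in the spirit of HADOOP-1079: the DN remembers the set of block ids it last reported and sends only the delta, rather than the full array of longs. All names here are hypothetical:

{{{
import java.util.HashSet;
import java.util.Set;

class IncrementalBlockReporter {
    private Set<Long> lastReported = new HashSet<Long>();

    // Compute the delta against the previous report; null means "nothing to send".
    BlockDelta buildReport(Set<Long> currentBlocks) {
        Set<Long> added = new HashSet<Long>(currentBlocks);
        added.removeAll(lastReported);      // blocks that appeared since the last report
        Set<Long> removed = new HashSet<Long>(lastReported);
        removed.removeAll(currentBlocks);   // blocks deleted or lost (e.g. a bad disk)
        lastReported = new HashSet<Long>(currentBlocks);
        if (added.isEmpty() && removed.isEmpty()) {
            return null;                    // no change: send nothing at all
        }
        return new BlockDelta(added, removed);
    }

    static class BlockDelta {
        final Set<Long> added;
        final Set<Long> removed;
        BlockDelta(Set<Long> added, Set<Long> removed) {
            this.added = added;
            this.removed = removed;
        }
    }
}
}}}

The "send only checksums" variant replaces the delta with a digest of the block list that the NN can compare against its own.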



=== Scaling Namespace (i.e. number of files/dirs) ===

   * Partition/distribute the Name node (will also help performance)
   * Partition the namespace hierarchically and mount the volumes (see the sketch below)
      * In this scheme, there are multiple namespace volumes in a cluster.
      * All the namespace volumes share the physical block storage (i.e. one storage pool)
      * Optionally, all namespaces (i.e. volumes) are mounted at the top level using an automounter-like approach
      * A namespace can be explicitly mounted onto a node in another namespace (a la mount in POSIX)
         * Note the Ceph file system [ref] partitions automatically and mounts the partitions
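
A minimal sketch of client-side volume resolution, assuming a hypothetical mount table that maps absolute path prefixes to the NN serving each namespace volume (all volumes sharing one block-storage pool):

{{{
import java.util.NavigableMap;
import java.util.TreeMap;

class MountTable {
    // Maps a mount point (absolute path prefix) to the address of the NN
    // serving that namespace volume.
    private final NavigableMap<String, String> mounts = new TreeMap<String, String>();

    void addMount(String prefix, String nnAddress) {
        mounts.put(prefix, nnAddress);
    }

    // Longest-prefix match: walk up the path until a mount point is found.
    String resolve(String path) {
        for (String p = path; !p.isEmpty(); p = p.substring(0, p.lastIndexOf('/'))) {
            String nn = mounts.get(p);
            if (nn != null) return nn;
        }
        return mounts.get("/");  // fall back to the root volume
    }
}

// Usage (hypothetical addresses):
//   mt.addMount("/", "nn0:8020");
//   mt.addMount("/user", "nn1:8020");
//   mt.resolve("/user/sanjay/f") -> "nn1:8020"
}}}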

   * Partition by a hash function - clients go to one NN by hashing the name (see the sketch below)
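
A minimal sketch of the hash alternative. It spreads load evenly, but unlike the mount-table scheme above it destroys directory locality and makes renames cross-partition operations:

{{{
class HashPartitioner {
    // Pick the NN responsible for a path purely by hashing its full name.
    static String pickNameNode(String path, String[] nameNodes) {
        return nameNodes[Math.floorMod(path.hashCode(), nameNodes.length)];
    }
}
}}}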

   * Only keep part of the namespace in memory.
      * This is like a traditional file system, where the entire namespace is stored in secondary storage and paged in as needed (see the sketch below).
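
A minimal sketch of paging the namespace: an LRU cache of inodes in front of secondary storage, so only the hot part of the namespace occupies NN memory. The Inode type and loadFromDisk are hypothetical stubs:

{{{
import java.util.LinkedHashMap;
import java.util.Map;

class Inode { }

class InodeCache extends LinkedHashMap<String, Inode> {
    private final int capacity;

    InodeCache(int capacity) {
        super(16, 0.75f, true);  // access-order: iteration order becomes LRU order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Inode> eldest) {
        return size() > capacity;  // evict the least-recently-used inode
    }

    Inode lookup(String path) {
        Inode inode = get(path);
        if (inode == null) {
            inode = loadFromDisk(path);  // page the entry in on demand
            put(path, inode);
        }
        return inode;
    }

    private Inode loadFromDisk(String path) { return new Inode(); }  // stub
}
}}}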

   *    Reduce accidental space growth - name space quotas

== Name Service Availability (includes integrity of NN data, HA, etc) ==

=== Integrity of NN Image and Journal ===


   * Handling of incomplete transactions in journal on startup
   * Keep 2 generations of fsimage - the checkpoint daemon verifies the fsimage each time it creates a new one.
   * CRCs for fsimage and journal
   * Make the NN persistent data solid
      * Add internal consistency counters - to detect bugs in our code
         * Number of files, number of dirs, number of blocks, sentinels between fields, string lengths
   * Recycling of block-Ids - problems if old data nodes come back - a fix has been designed
   * If there is a failure in the FSImage, recover from an alternate fsimage
   * Versioning of NN persistent data (use jute)
   * Smart fsck
      * Bad entry in journal - ignore the rest
      * Bad entry in journal - skip only the affected entries and apply the remaining unaffected ones (hard)
      * If there are multiple journals, recover from the best one or merge the entries
      * The NN has a flag controlling whether to continue on such an error (see the sketch after this list)
   * Recreating NN data from the DNs would require fundamental changes in the design
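
A minimal sketch of journal recovery under the simplest policy above, assuming each edit record carries a length and a CRC (the record format is hypothetical). Everything before the first bad entry is applied; the flag decides whether to ignore the rest or abort. An incomplete transaction at the tail of the journal is handled the same way:

{{{
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.CRC32;

class JournalRecovery {
    // Returns the number of edits applied before EOF or the first bad entry.
    static long replay(DataInputStream journal, boolean continueOnError) throws IOException {
        long applied = 0;
        while (true) {
            try {
                int len = journal.readInt();
                byte[] record = new byte[len];
                journal.readFully(record);        // EOF here = incomplete tail record
                long storedCrc = journal.readLong();
                CRC32 crc = new CRC32();
                crc.update(record);
                if (crc.getValue() != storedCrc) {
                    if (!continueOnError) {
                        throw new IOException("Corrupt edit after " + applied + " records");
                    }
                    break;                        // policy: ignore the rest of the journal
                }
                applyEdit(record);
                applied++;
            } catch (EOFException eof) {
                break;                            // end of journal (possibly mid-record)
            }
        }
        return applied;
    }

    private static void applyEdit(byte[] record) { /* update the in-memory namespace */ }
}
}}}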

=== Faster Startup ===


   * Faster Block report processing (see above)
   * Reload FS Image faster

=== Restart and Failover ===
   * Automatic NN restart on NN failure (we will let the operations folks handle this)
   * Hot standby & failover

== Security: Authorization and ACLs ==


   * 0.16 has access control without authentication (i.e. the client is trusted)
   * Security use cases and requirements (2008 - Q1)
   * Secure authentication (2008 - Q2, Q3)
   * Service-level access control - i.e. which users can access the HDFS service (as opposed to ACLs for specific files)

== File Features ==
   * File data visible as flushed
      *   Motivation: logging and tail -f
   * Open files are NOT accessible by readers in the event of deletion or renaming
   * Growable Files
      * Via atomic append with multiple writers
      * Via append with 1 writer [[http://issues.apache.org/jira/secure/QuickSearch.jspa][Hadoop-1700]]
   * Truncate files
      * Use case for this?
      * Note: truncate and append need to be designed together
   * Concatenate files
      * May reduce name space growth if apps merge small files into a larger one
      * To make this work well, we will need to support variable length blocks
      * Is there a way for the map-reduce framework to take advantage of this feature automatically?
   * Support multiple writers for log file
      * Alternatives
         1. A logging toolkit that adapts to Hadoop
            * Logging is from within a single application
            * No changes needed to Hadoop
         1. Atomic appends take care of this - overkill for logging?

   * Block view of files - a file is a list of blocks; blocks can be moved from one file to another, files can have holes, etc.

== File IO Performance ==
   * In-memory checksum caching (full or partial) on Datanodes (what is this, Sameer?)
   * Reduce CPU utilization on IO
      * Remove double buffering [[http://issues.apache.org/jira/browse/HADOOP-1702][Hadoop-1702]]
      * Take advantage of sendfile (see the sketch after this list)
   * Random access performance [[http://issues.apache.org/jira/browse/HADOOP-2736][Hadoop-2736]]
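
On the sendfile point: in Java, FileChannel.transferTo lets the kernel copy file data straight to a socket without surfacing it in user space, which also removes the double-buffering copy noted above. A minimal sketch (real HDFS would also have to interleave checksum data):

{{{
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

class BlockSender {
    // Stream a block file to a client socket with no user-space copies.
    static void sendBlock(String blockFile, SocketChannel out) throws IOException {
        FileInputStream fis = new FileInputStream(blockFile);
        try {
            FileChannel ch = fis.getChannel();
            long pos = 0, size = ch.size();
            while (pos < size) {
                // transferTo may send less than requested; loop until done.
                pos += ch.transferTo(pos, size - pos, out);
            }
        } finally {
            fis.close();
        }
    }
}
}}}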


== Namespace Features ==
   * Hardlinks
      * Will need to add file-ids to make this work
   *  Symbolic links
   *    Native mounts - mount Hadoop on Linux
   *    Mount Hadoop as NFS

   * "Flat name" to file mapping
      * Idea is remove the notion of directories. This has the potential of helping us scale the NN
      * Main issue with this approach is that many of our applciations and users have a notion of a name space that they own. For example many mapReduce jobs process all files in a directory; another user creating files in that directory can mess up the application.
   * Notion of a fileset - kind of like a directory, except that it cannot contain directories - under discussion - has potential to scale name node.


== File Data Integrity (For NN see NN Availability) ==


   * Periodic data verification

== Operations and Management Features ==

   * Improved Grid Configuration management
      * Put config in ZooKeeper (would require that the NN gets started with at least one ZooKeeper instance); see the sketch below
      * DNs can get most of their config from the NN. The only DN-specific config is the directories for the DN's data blocks
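
A minimal sketch of a node pulling a config value from ZooKeeper at startup. The znode layout under /hadoop/conf is hypothetical; the ZooKeeper client calls are the standard API:

{{{
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

class ZkConfig {
    static String readConf(String zkQuorum, String key) throws Exception {
        ZooKeeper zk = new ZooKeeper(zkQuorum, 30000, new Watcher() {
            public void process(WatchedEvent event) { /* connection events */ }
        });
        try {
            // Hypothetical layout: one znode per config key.
            byte[] data = zk.getData("/hadoop/conf/" + key, false, null);
            return new String(data, "UTF-8");
        } finally {
            zk.close();
        }
    }
}

// Usage: String nnAddr = ZkConfig.readConf("zk1:2181,zk2:2181", "fs.default.name");
}}}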

   * Software version upgrades and version rollback P???
      * See [[#Versioning][Rpc Protocol Versioning]]
   * Rolling upgrade when we have HA


   * NN data rollback
      * This depends on keeping old fsImages/journals around
      * A startup parameter for this?
      * Need an advisory on how much data will be discarded so that the operator can make an intelligent decision

   * Snapshots
      * We allow snapshots only when the system is offline
      * Need live snapshots
      * Subtree snapshots (rather than the whole system)

   * Replica Management
      * Ensure that there are (R-1) racks for R replicas
      * FSCK should warn when there aren't R-1 racks (see the sketch below)
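
A minimal sketch of the fsck-side rack check: with the default placement policy, R replicas should span at least R-1 racks (two replicas may share one rack). Names are illustrative:

{{{
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RackCheck {
    // replicaRacks holds the rack id of each replica of one block.
    static boolean warnIfUnderRacked(String blockId, List<String> replicaRacks) {
        Set<String> racks = new HashSet<String>(replicaRacks);
        int r = replicaRacks.size();
        if (racks.size() < r - 1) {
            System.err.println("WARN: block " + blockId + " has " + r
                + " replicas on only " + racks.size() + " racks (want >= " + (r - 1) + ")");
            return true;
        }
        return false;
    }
}
}}}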



== Hadoop Protocol RPC ==

=== RPC Timeouts, Connection handling, Q handling, threading ===
   * When load spikes occur, the clients time out and a spiral of death ensues
   * Remove the timeout; instead, ping to detect server failures [[http://issues.apache.org/jira/browse/HADOOP-2188][HADOOP-2188]] (see the sketch after this list)
   * Improve connection handling, idle connections, etc.
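
A minimal sketch of the HADOOP-2188 direction: while a call is outstanding the client pings instead of arming a fixed timeout, so a slow-but-alive server under a load spike is never declared dead by a timer alone. The ping() stub is hypothetical:

{{{
import java.io.IOException;

class PingingRpcClient {
    private static final long PING_INTERVAL_MS = 60000;
    private volatile long lastHeardFrom = System.currentTimeMillis();

    // Runs in a background thread while a call is outstanding.
    void pingLoop() throws IOException, InterruptedException {
        while (true) {
            Thread.sleep(PING_INTERVAL_MS);
            if (System.currentTimeMillis() - lastHeardFrom < PING_INTERVAL_MS) {
                continue;  // heard from the server recently; nothing to do
            }
            if (!ping()) {
                // Only an unanswered ping, not elapsed time, fails the call.
                throw new IOException("server unreachable: failing outstanding RPC");
            }
            lastHeardFrom = System.currentTimeMillis();  // alive; keep waiting
        }
    }

    private boolean ping() { return true; }  // stub: send a no-op frame on the connection
}
}}}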

=== Client-side recovery from NN restarts and failovers ===
   * HDFS client side (or just MapRed?) should be able to recover from NN restarts and failovers

=== Versioning ===
   * Across clusters
      * Versioning of the Hadoop client protocol, so the server can deal with clients of different versions
         * The data types change frequently; fields are added and deleted
   *    Within cluster - between NN and DN

=== Multiple Language Support ===
   * Are all interfaces well defined/cleaned up?
   * Generate stubs automatically for Java, C, Python
   *    Service IDL


== Benchmarks and Performance Measurements  ==

   *     Where are the cycles going for data and name nodes?
   *    For HDFS and Map-Reduce

== Diagnosability ==
   * NN - what do we need here - log analysis?
   * DN - what do we need here?

== Development Support ==
   *     What do we need here?

== Intercluster Features ==
   * HDFS supports access to remote grids/clusters through URIs
   * Federation features - investigate
   * What else?

== BCP support ==

   * Support for keeping data in sync across data-centers