= HDFS Futures: Categorized List and Descriptions of HDFS Future Features =
'''The following page is under development - it is being converted from TWiki to Wiki'''
== Goal: Make HDFS Real ==
1. Reliable and Secure: The file system is solid enough for users to feel comfortable using it in "production"
* Availability and integrity of HDFS are good enough
* Availability of the NN and integrity of NN data
* Availability of the file data and its integrity
* Secure
* Access control - in 0.16
* Secure authentication (0.18?)
1. Good Enough Performance: HDFS should not limit the scaling of the Grid and the utilization of the nodes in the Grid
* Handle large number of files
* Handle large number of clients
* Low latency of HDFS operations - this will affect the utilization of the client nodes
* High throughput of HDFS operations
1. Rich Enough FS Features for applications
* e.g. append
* e.g. good performance for random IO
1. Sufficient Operations and Management features to manage a large 4K-node cluster
* Easy to configure, upgrade, etc.
* BCP, snapshots, backups
== Service Scaling ==
There are two main issues here:
* Scaling the name space (i.e. the number of files and directories we can handle)
* Scaling the performance of the name service - i.e. its throughput and latency, and in particular the number of concurrent clients it can handle
Improving one may improve the other.
* E.g. moving Block map functionality to slave NNs will free up storage for Name entries
* E.g. partitioning the name space can also improve the performance of each NN slave
=== Summary of various options that scale name space and performance (details below) ===
(Also see [[http://twiki.corp.yahoo.com/pub/Grid/HdfsFeaturePlanning/ScaleNN_Sea_of_Options.pdf][Scaling NN: Sea of Options]])
* Grow memory
* Scales name space but not performance
* Issue: GC and Java scaling for large memories
* Read-only Replicas of NN
* Scales performance but not namespace
* Adds reliability and one of the steps towards HA
* Partition Namespace: Multiple namespace volumes
* Scales both
* Retains HDFS's design philosophy but needs a simplified automounter and management of horizontal scaling of NN servers
* Split function on NN (Namespace and Block maps)
* Scales the name space ~3x, with a little performance scaling
* Page-in partial name space from disk (as in traditional FSs)
* Scales namespace but not performance, unless multiple volumes
=== Scaling Name Service Throughput and Response Time ===
* Distribute/Partition/Replicate the NN functionality across multiple computers
* Read-only replicas of the name node
* What is the ratio of Rs to Ws - get data from Simon
* Note: RO replicas can be useful for the HA solution and for checkpoint rolling
* Add pointer to the RO replicas design we did on the whiteboard. TBD
* Partition by function (also scales namespace and addressable storage space)
* E.g. move block management and processing to slave NN.
* E.g. move Replica management to slave NN
* Partition by name space - i.e. different parts of the name space are handled by different NNs (see below)
* This helps scale both the performance of the NN and the name space
* RPC and Timeout issues
* When load spikes occur, clients time out and a spiral of death occurs
* See [[#Hadoop_Protocol_RPC][ Hadoop Protocol RPC]]
* Higher concurrency in Namespace access (more sophisticated Namespace locking)
* This is probably an issue only on NN restart, not during normal operation
* Improving concurrency is hard since it will require redesign and testing
* Better to do this when the NN is being redesigned for other reasons.
* Journaling and Sync
* '''Benefits''': improved latency, better client utilization, fewer timeouts, greater throughput
* Improve Remote syncs
* Approach 1 - NVRAM NFS file system - investigate this
* Approach 2 - If flush on NFS pushes the data to the NFS server, this may be good enough if there is a local sync - investigate
* Lazy syncs - need to investigate the benefit and cost (latency)
* Delay the reply by a few milliseconds to allow for more bunching of syncs (a group-commit sketch follows this list)
* This increases the latency
* NVRAM for journal
* Async syncs [No!!!]
* Reply as soon as memory is updated
* This changes the semantics
* Note: even though UNIX is like this, the failure of the machine implies failure of the client and the FS '''together'''
* In a distributed file system there is partial failure; furthermore one expects an HA'ed NN not to lose data.
* Issue: async syncs change the semantics - this is an issue. What about HA?
* Move more functionality to data node
* Distributed replica creation - not simple
* Improve Block report processing [[https://issues.apache.org/jira/browse/HADOOP-2448][HADOOP-2448]]
2K nodes mean a block report every 3 sec.
* Currently: each DN sends a full BR as an array of longs every hour. The initial BR has a random backoff (configurable)
* Incremental and Event based B-reports - [[https://issues.apache.org/jira/browse/HADOOP-1079][HADOOP-1079]]
* E.g. when a disk is lost, or blocks are deleted, etc.
* The DN can determine what, if anything, has changed and send a report only if there are changes
* Send only checksums (see the digest sketch after this list)
* The NN recalculates the checksum, OR keeps a rolling checksum
* Make the initial block report's random backoff dynamically set via the NN when DNs register - [[https://issues.apache.org/jira/browse/HADOOP-2444][HADOOP-2444]]
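A minimal sketch of the "bunching of syncs" (group commit) idea above, assuming a hypothetical GroupCommitLog class rather than the actual FSEditLog code: edits are appended under a lock, the reply is delayed briefly, and a single fsync then covers every edit appended since the last sync.
{{{
import java.io.*;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class GroupCommitLog {
    private final FileOutputStream out;
    private final Lock lock = new ReentrantLock();
    private long lastWrittenTxId = 0;   // highest edit appended to the stream
    private long lastSyncedTxId = 0;    // highest edit known to be on disk

    public GroupCommitLog(File journal) throws IOException {
        this.out = new FileOutputStream(journal, true);   // append to the journal
    }

    /** Append one edit, then make sure it is durable; one fsync may cover many edits. */
    public void logEdit(byte[] record) throws IOException, InterruptedException {
        long myTxId;
        lock.lock();
        try {
            out.write(record);                // append under the lock
            myTxId = ++lastWrittenTxId;
        } finally {
            lock.unlock();
        }
        Thread.sleep(2);                      // "delay the reply": let other edits bunch up
        lock.lock();
        try {
            if (lastSyncedTxId >= myTxId) {
                return;                       // another thread's fsync already covered us
            }
            long syncUpTo = lastWrittenTxId;  // everything appended so far
            out.getFD().sync();               // single fsync covers the whole bunch
            lastSyncedTxId = syncUpTo;
        } finally {
            lock.unlock();
        }
    }
}
}}}
The fixed 2 ms delay is only a placeholder; the real trade-off between added latency and sync bunching would have to be measured.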
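For the "send only checksums" block-report item, here is a rough sketch; the digest routine and the idea of piggybacking it on the heartbeat are illustrative assumptions, not an agreed protocol. Both the DN and the NN compute the same order-independent digest over the block-id list, and a full report is sent only when the digests differ.
{{{
import java.util.Arrays;
import java.util.zip.CRC32;

public class BlockReportDigest {
    /** Digest of a block-id list; both DN and NN must compute this the same way. */
    public static long digest(long[] blockIds) {
        long[] sorted = blockIds.clone();
        Arrays.sort(sorted);                  // canonicalize so order does not matter
        CRC32 crc = new CRC32();
        for (long id : sorted) {
            for (int shift = 56; shift >= 0; shift -= 8) {
                crc.update((int) (id >>> shift) & 0xFF);
            }
        }
        return crc.getValue();
    }

    /** Hypothetical DN-side check: send a full report only if the NN's digest is stale. */
    public static boolean fullReportNeeded(long[] myBlocks, long digestNameNodeHas) {
        return digest(myBlocks) != digestNameNodeHas;
    }
}
}}}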
=== Scaling Namespace (i.e. number of files/dirs) ===
* Partition/distribute Name node (will also help performance)
* Partition the namespace hierarchically and mount the volumes
* In this scheme, there are multiple namespace volumes in a cluster.
* All the name space volumes share the physical block storage (i.e. One storage pool)
* Optionally, all namespaces (i.e. volumes) are mounted at the top level using an automounter-like approach (see the mount-table sketch after this list)
* A namespace can be explicitly mounted onto a node in another namespace (a la mount in POSIX)
* Note the Ceph file system [ref] partitions automatically and mounts the partitions
* Partition by a hash function - clients go to one NN by hashing the name
* Only keep part of the namespace in memory.
* This is like a traditional file system where the entire namespace is stored in secondary storage and paged in as needed.
* Reduce accidental space growth - name space quotas
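As a rough illustration of the "multiple namespace volumes mounted using an automounter-like approach" option, here is a client-side mount-table sketch; the mount points and namenode addresses are made up, and managing the table itself is the part that still needs design.
{{{
import java.util.Map;
import java.util.TreeMap;

public class MountTable {
    // Mount point -> namenode (volume) address; TreeMap keeps keys in sorted order.
    private final TreeMap<String, String> mounts = new TreeMap<String, String>();

    public void addMount(String mountPoint, String namenode) {
        mounts.put(mountPoint, namenode);
    }

    /** Resolve a path to the volume that owns it, using the longest matching prefix. */
    public String resolve(String path) {
        String best = null;
        for (Map.Entry<String, String> e : mounts.entrySet()) {
            String mp = e.getKey();
            if (path.equals(mp) || path.startsWith(mp + "/")) {
                best = e.getValue();          // sorted iteration => last match is longest
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No volume for " + path);
        }
        return best;
    }

    public static void main(String[] args) {
        MountTable mt = new MountTable();
        mt.addMount("/user", "nn1.example.com:8020");
        mt.addMount("/projects", "nn2.example.com:8020");
        mt.addMount("/projects/search", "nn3.example.com:8020");
        System.out.println(mt.resolve("/projects/search/crawl"));   // -> nn3.example.com:8020
    }
}
}}}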
== Name Service Availability (includes integrity of NN data, HA, etc) ==
=== Integrity of NN Image and Journal ===
* Handling of incomplete transactions in journal on startup
* Keep 2 generations of fsimage - the checkpoint daemon verifies the fsimage each time it creates a new one.
* CRCs for fsimage and journal
* Make the NN persistent data solid
* Add internal consistency counters - to detect bugs in our code
* Num files, num dirs, num blocks, sentinels between fields, string lengths
* Recycling of block-ids - problems if old data nodes come back - a fix has been designed
* If failure in FSImage, recover from alternate fsimages
* Versioning of NN persistent data (use jute)
* Smart fsck
* Bad entry in journal - ignore the rest (see the CRC-framed journal sketch after this list)
* Bad entry in journal - ignore only those remaining entries not affected (hard)
* If multiple journals, recover from the best one or merge the entries
* The NN has a flag for whether to continue on such an error
* Recreating NN data from the DNs would require fundamental changes in design
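One way to make "bad entry in journal - ignore the rest" safe is to CRC-frame every journal record so that recovery can tell a torn or corrupt tail from good data. The framing below (length, payload, CRC32) is an assumption for illustration, not the current edits-log format.
{{{
import java.io.*;
import java.util.zip.CRC32;

public class CheckedJournal {
    private static final int MAX_RECORD = 1 << 20;   // sanity bound on record length

    /** Append one record framed as [int length][payload bytes][long crc32]. */
    public static void append(DataOutputStream out, byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload);
        out.writeInt(payload.length);
        out.write(payload);
        out.writeLong(crc.getValue());
    }

    /** Count the valid records on recovery; stop at the first truncated or corrupt one. */
    public static int validRecords(DataInputStream in) throws IOException {
        int good = 0;
        while (true) {
            int len;
            try {
                len = in.readInt();
            } catch (EOFException cleanEnd) {
                return good;                          // clean end of the journal
            }
            if (len < 0 || len > MAX_RECORD) {
                return good;                          // garbage length: ignore the rest
            }
            byte[] payload = new byte[len];
            try {
                in.readFully(payload);
                long stored = in.readLong();
                CRC32 crc = new CRC32();
                crc.update(payload);
                if (crc.getValue() != stored) {
                    return good;                      // corrupt entry: ignore the rest
                }
            } catch (EOFException tornTail) {
                return good;                          // incomplete last entry
            }
            good++;
        }
    }
}
}}}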
=== Faster Startup ===
* Faster block report processing (see above)
* Reload FS Image faster
=== Restart and Failover ===
* Automatic NN restart on NN failure (we will let the operations folks handle this)
* Hot standby & failover
== Security: Authorization and ACLs ==
* 0.16 has access control without secure authentication (i.e. the client's identity is trusted)
* Security use cases and requirements (2008 - Q1)
* Secure authorization (2008 - Q2, Q3)
* Service-level access control - i.e. which users can access the HDFS service (as opposed to ACLs for specific files)
== File Features ==
* File data visible as flushed
* Motivation: logging and tail -f
* Open files are NOT accessible by readers in the event of deletion or renaming
* Growable Files
* via atomic append with multiple writers
* Via append with 1 writer [[http://issues.apache.org/jira/browse/HADOOP-1700][HADOOP-1700]]
* Truncate files
* Use case for this?
* Note: truncate and append need to be designed together
* Concatenate files
* May reduce name space growth if apps merge small files into a larger one
* To make this work well, we will need to support variable length blocks
* Is there a way for the map-reduce framework to take advantage of this feature automatically?
* Support multiple writers for log file
* Alternatives:
1. A logging toolkit that adapts to Hadoop
* Logging is from within a single application
* No changes needed to Hadoop
1. Atomic appends take care of that - overkill for logging?
* Block view of files - a file is a list of blocks; one can move blocks from one file to another, have holes, etc.
== File IO Performance ==
* In-memory checksum caching (full or partial) on Datanodes (what is this, Sameer?)
* Reduce CPU utilization on IO
* Remove double buffering [[http://issues.apache.org/jira/browse/HADOOP-1702][Hadoop-1702]]
* Take advantage of sendfile (see the zero-copy sketch after this list)
* Random access performance [[http://issues.apache.org/jira/browse/HADOOP-2736][Hadoop-2736]]
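The "take advantage of sendfile" item could, for example, use Java NIO's FileChannel.transferTo so block data never passes through a user-space buffer. A minimal sketch, assuming the datanode already has the client connection as a SocketChannel:
{{{
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopyBlockSender {
    /** Stream a whole block file to a client socket without copying through user space. */
    public static void sendBlock(File blockFile, SocketChannel client) throws IOException {
        FileInputStream fis = new FileInputStream(blockFile);
        try {
            FileChannel fc = fis.getChannel();
            long pos = 0;
            long size = fc.size();
            while (pos < size) {
                // transferTo may send less than requested, so loop until the block is done.
                pos += fc.transferTo(pos, size - pos, client);
            }
        } finally {
            fis.close();
        }
    }
}
}}}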
== Namespace Features ==
* Hardlinks
* Will need to add file-ids to make this work
* Symbolic links
* Native mounts - mount Hadoop on Linux
* Mount Hadoop as NFS
* "Flat name" to file mapping
* The idea is to remove the notion of directories. This has the potential of helping us scale the NN.
* The main issue with this approach is that many of our applications and users have a notion of a name space that they own. For example, many MapReduce jobs process all files in a directory; another user creating files in that directory can mess up the application.
* Notion of a fileset - kind of like a directory, except that it cannot contain directories - under discussion - has potential to scale name node.
== File Data Integrity (For NN see NN Availability) ==
* Periodic data verification
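A minimal sketch of what periodic data verification on the datanode might look like: rescan a block file and compare against its stored checksum, reporting a corrupt replica on mismatch. Where the expected checksum comes from (e.g. the replica's metadata file) is left as an assumption here.
{{{
import java.io.*;
import java.util.zip.CRC32;

public class BlockVerifier {
    /** Recompute the block's checksum and compare it with the stored value. */
    public static boolean verify(File blockFile, long expectedCrc) throws IOException {
        CRC32 crc = new CRC32();
        InputStream in = new BufferedInputStream(new FileInputStream(blockFile));
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return crc.getValue() == expectedCrc;   // false => report a corrupt replica to the NN
    }
}
}}}
A background thread would walk all replicas at a throttled rate so that one full pass is spread over the verification period.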
== Operations and Management Features ==
* Improved Grid Configuration management
* Put config in ZooKeeper (would require that the NN is started with at least one ZooKeeper instance); see the ZooKeeper sketch after this list
* DNs can get most of their config from the NN. The only DN-specific config is the directories for the DN data blocks
* Software version upgrades and version rollback P???
* See [[#Versioning][Rpc Protocol Versioning]]
* Rolling upgrade when we have HA
* NN data rollback
* This depends on keeping old fsimage/journals around
* A startup parameter for this?
* Need an advisory on how much data will be discarded so that the operator can make an intelligent decision
* Snapshots
* We allow snapshots only when the system is offline
* Need live snapshots
* Subtree snapshots (rather than the whole system)
* Replica Management
* Ensure that there are (R-1) racks for R replicas
* FSCK should warn if there aren't R-1 racks
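For the "put config in ZooKeeper" item above, a minimal sketch of a datanode pulling shared configuration from a znode at startup; the znode path and the key=value format are assumptions for illustration, not an agreed layout.
{{{
import java.io.ByteArrayInputStream;
import java.util.Properties;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigFetcher {
    /** One-shot read of the cluster's shared HDFS config from ZooKeeper. */
    public static Properties fetch(String zkQuorum, String clusterId) throws Exception {
        ZooKeeper zk = new ZooKeeper(zkQuorum, 30000, new Watcher() {
            public void process(WatchedEvent event) { /* no watches needed for a one-shot read */ }
        });
        try {
            byte[] data = zk.getData("/hadoop/" + clusterId + "/conf/hdfs", false, null);
            Properties conf = new Properties();
            conf.load(new ByteArrayInputStream(data));   // assumes key=value pairs in the znode
            return conf;
        } finally {
            zk.close();
        }
    }
}
}}}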
== Hadoop Protocol RPC ==
=== RPC Timeouts, Connection handling, Q handling, threading ===
* When load spikes occur, clients time out and a spiral of death occurs
* Remove the timeout; instead, ping to detect server failures (see the sketch after this list) [[http://issues.apache.org/jira/browse/HADOOP-2188][HADOOP-2188]]
* Improve connection handling, idle connections, etc.
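A sketch of the "ping instead of timeout" idea from HADOOP-2188: while a call is outstanding the client periodically pings, so a call fails only when the server stops answering pings, not merely because it is slow. The RpcConnection interface below is a hypothetical stand-in for the real client connection code.
{{{
import java.io.IOException;

public class PingingCall {
    /** Hypothetical view of a client connection with an outstanding call. */
    public interface RpcConnection {
        void sendPing() throws IOException;   // lightweight no-op request
        boolean callDone();                   // has the response arrived yet?
    }

    /** Wait indefinitely for the response, aborting only if the server stops answering pings. */
    public static void waitForResponse(RpcConnection conn, long pingIntervalMs)
            throws IOException, InterruptedException {
        while (!conn.callDone()) {
            Thread.sleep(pingIntervalMs);
            if (!conn.callDone()) {
                conn.sendPing();              // an IOException here means the server is really gone
            }
        }
    }
}
}}}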
=== Client-side recovery from NN restarts and failovers ===
* HDFS client side (or just MapRed?) should be able to recover from NN restarts and failovers
=== Versioning ===
* Across clusters
* Versioning of the Hadoop client protocol, so the server can deal with clients of different versions (see the sketch after this list)
* The data types change frequently; fields are added and deleted
* Within cluster - between NN and DN
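A rough sketch of what cross-version handling of the client protocol could look like: the client states the protocol version it was built against and the server either serves it with the older semantics or rejects it. The names and version numbers below are illustrative, not the existing version-check mechanism.
{{{
public class ProtocolVersioning {
    static final long SERVER_VERSION = 16;     // bumped whenever fields are added or removed
    static final long OLDEST_SUPPORTED = 14;   // the server still understands these clients

    /** Server-side check when a client connects. */
    public static void checkVersion(long clientVersion) {
        if (clientVersion > SERVER_VERSION || clientVersion < OLDEST_SUPPORTED) {
            throw new RuntimeException("Incompatible client protocol version " + clientVersion
                + "; server supports " + OLDEST_SUPPORTED + " through " + SERVER_VERSION);
        }
        // Otherwise the server answers using the semantics of clientVersion,
        // e.g. omitting fields an older client does not know about.
    }
}
}}}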
=== Multiple Language Support ===
* Are all interfaces well defined / cleaned up?
* Generate stubs automatically for Java, C, Python
* Service IDL
== Benchmarks and Performance Measurements ==
* Where are the cycles going for data and name nodes?
* For HDFS and Map-Reduce
== Diagnosability ==
* NN - what do we need here - log analysis?
* DN - what do we need here?
== Development Support ==
* What do we need here?
== Intercluster Features ==
* HDFS supports access to remote grids/clusters through URIs
* Federation features - investigate
* What else?
== BCP support ==
* Support for keeping data in sync across data-centers