Posted to hdfs-dev@hadoop.apache.org by "Yan (JIRA)" <ji...@apache.org> on 2014/03/19 02:30:44 UTC

[jira] [Created] (HDFS-6121) Support of "mount" onto HDFS directories

Yan created HDFS-6121:
-------------------------

             Summary: Support of "mount" onto HDFS directories
                 Key: HDFS-6121
                 URL: https://issues.apache.org/jira/browse/HDFS-6121
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: datanode
            Reporter: Yan


Currently, HDFS can only be configured to store its data on one or several existing local file system directories. This largely abstracts physical disk drives away from HDFS users.

While this may provide conveniences in data movement/manipulation/management/formatting, it deprives users of a way to access physical disks in a more directly controlled manner.

For instance, a multi-threaded server may wish to access its disk blocks sequentially within each thread, for fear of incurring random I/O otherwise. If the cluster boxes have multiple physical disk drives, and the server load is largely I/O-bound, then it is quite reasonable to hope for disk performance typical of sequential I/O. Disk read-ahead and/or buffering at various layers may alleviate the problem to some degree, but they cannot eliminate it entirely. This can hurt the performance of workloads that need to scan data.

Map/Reduce may experience the same problem as well.

For instance, HBase region servers may wish to scan the disk data for each region sequentially, again to avoid random I/O. HBase's own limitations in this regard aside, one major obstacle is HDFS's inability to specify mappings from local directories to HDFS directories. Specifically, the "dfs.data.dir" configuration setting only allows one or multiple local directories to be mapped to the HDFS root directory. On data nodes with multiple disk drives mounted as multiple local file system directories, the HDFS data will be spread across all disk drives in a fairly random manner, potentially resulting in random I/O when a multi-threaded server reads multiple data blocks in each thread.
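For illustration, today's setting takes only a comma-separated list of local directories (the paths below are made up), and every listed directory feeds the same flat block store, with no way to tie a local directory to a particular HDFS path:

  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data</value>
  </property>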

A seemingly simple enhancement is the introduction of mappings from one or multiple local FS directories to a single HDFS directory, plus the necessary sanity checks, replication policies, advice on best practices, ..., etc., of course. Note that this should be a one-to-one or many-to-one mapping from local to HDFS directories; the other way around, though probably feasible, won't serve our purpose at all. This is similar to mounting different disks onto different local FS directories, and will give users an option to place and access their data in a more controlled and efficient way.
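As a sketch only -- the property name and syntax below are hypothetical, not an existing Hadoop setting -- the proposed local-to-HDFS mapping might be expressed along these lines:

  <!-- Hypothetical syntax: map each local directory to a specific HDFS directory -->
  <property>
    <name>dfs.data.dir.mounts</name>
    <value>/mnt/disk1/dfs/data=/hbase/region1,/mnt/disk2/dfs/data=/hbase/region2</value>
  </property>

Blocks of files written under /hbase/region1 would then land only on the local directory (and hence the physical drive) mapped to it, giving the one-to-one or many-to-one mapping described above.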

Conceptually, this option will allow for local physical partitioning of application data on HDFS in a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)