Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2006/11/20 22:43:07 UTC

[jira] Updated: (HADOOP-571) Path should use URI syntax

     [ http://issues.apache.org/jira/browse/HADOOP-571?page=all ]

Doug Cutting updated HADOOP-571:
--------------------------------

    Attachment: uri.patch

Here's a first version of this.  It passes all unit tests except TestCopyFiles.  I haven't yet figured out why that one fails.

A few notes:

Path is now a wrapper around a URI.  For backward compatibility, Path differs from URI in a number of ways.  Path.toString() returns an unescaped string, and 'new Path(String)' does not expect an escaped string, unlike the corresponding URI methods.  Thus, one can easily construct Paths containing characters like question marks (used in globbing).  A Path's URI never has a query or fragment part.  Path directories are always normalized as follows: // and \\ are replaced with /, and terminal slashes are removed.
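
To make the escaping and normalization rules concrete, here is a minimal sketch that uses only java.net.URI.  It is not code from uri.patch: the class PathUriSketch and its helpers normalize() and toUri() are hypothetical names, and the handling of backslashes is an assumption about what the rule above intends.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PathUriSketch {
    // Hypothetical helper (not from uri.patch): turn backslashes into
    // slashes, collapse repeated slashes, and drop a terminal slash
    // unless the path is just the root.
    static String normalize(String path) {
        String p = path.replace('\\', '/');
        while (p.contains("//")) {
            p = p.replace("//", "/");
        }
        if (p.length() > 1 && p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        return p;
    }

    // Build a URI from an UNESCAPED path string.  The multi-argument
    // java.net.URI constructor quotes illegal characters itself, whereas
    // new URI(String) would expect an already-escaped string and would
    // treat '?' as the start of a query.
    static URI toUri(String unescapedPath) {
        try {
            return new URI(null, null, normalize(unescapedPath), null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI u = toUri("/logs//2006/part-?/");
        System.out.println(u.getRawPath()); // /logs/2006/part-%3F
        System.out.println(u.getPath());    // /logs/2006/part-?
        System.out.println(u.getQuery());   // null
    }
}
```

The question mark survives as part of the path rather than starting a query, which is the behavior globbing needs.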

A FileSystem is now named by a URI containing only a scheme and authority.  The local filesystem is thus now named "file:///", and an HDFS filesystem is named something like "hdfs://namenode:50002".

Configuration properties are used to map from URI scheme to FileSystem implementation.  The implementation class for a scheme is named by the property fs.<scheme>.impl.  A FileSystem instance is cached for each unique scheme and authority string.  Instances are constructed using a no-arg constructor, then the initialize(URI,Configuration) method is called.  FileSystem implementations typically check that a Path provided to them indeed belongs to them, and thereafter typically ignore the scheme and authority of the Path's URI, using the URI's path to determine the file.
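
The lookup, caching, and initialization steps described above could be sketched roughly as follows.  This is a self-contained illustration rather than the code in the patch: FsRegistrySketch, Fs, and LocalFs are hypothetical stand-ins for the real classes, and java.util.Properties stands in for Configuration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class FsRegistrySketch {
    /** Hypothetical stand-in for the FileSystem base class. */
    public abstract static class Fs {
        private URI name;
        // Called once, right after the no-arg constructor.
        public void initialize(URI name, Properties conf) { this.name = name; }
        public URI getName() { return name; }
    }

    /** Hypothetical implementation registered for the "file" scheme. */
    public static class LocalFs extends Fs {
        public LocalFs() {}
    }

    // One cached instance per unique scheme-and-authority string.
    private static final Map<String, Fs> CACHE = new HashMap<>();

    public static synchronized Fs get(URI uri, Properties conf) {
        String scheme = uri.getScheme();
        String authority = uri.getAuthority();
        // The filesystem name keeps only scheme and authority,
        // e.g. "file:///" or "hdfs://namenode:50002".
        String name = (authority == null)
            ? scheme + ":///"
            : scheme + "://" + authority;
        Fs fs = CACHE.get(name);
        if (fs == null) {
            String impl = conf.getProperty("fs." + scheme + ".impl");
            if (impl == null) {
                throw new IllegalArgumentException("No FileSystem for scheme: " + scheme);
            }
            try {
                // No-arg constructor first, then initialize(...).
                fs = (Fs) Class.forName(impl).getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + impl, e);
            }
            fs.initialize(URI.create(name), conf);
            CACHE.put(name, fs);
        }
        return fs;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("fs.file.impl", LocalFs.class.getName());
        Fs a = get(URI.create("file:///tmp/a"), conf);
        Fs b = get(URI.create("file:///tmp/b"), conf);
        System.out.println(a == b);       // true: same scheme and authority
        System.out.println(a.getName());  // file:///
    }
}
```

Keying the cache on scheme plus authority means two paths on the same filesystem share one instance, while a second namenode gets its own.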

In general, a file is now identified by a Path and a Configuration (to determine the default FileSystem).  Thus, to operate on a Path, one typically does something like 'path.getFileSystem(conf).open(path)'.  We could add convenience methods for things like this to Path, but I have not yet done that.

In MapReduce, input formats are modified to generate and consume fully-qualified paths.  This makes the FileSystem parameter in most InputFormat and OutputFormat methods redundant.

> Path should use URI syntax
> --------------------------
>
>                 Key: HADOOP-571
>                 URL: http://issues.apache.org/jira/browse/HADOOP-571
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Doug Cutting
>         Assigned To: Doug Cutting
>         Attachments: uri.patch
>
>
> The following changes are proposed:
> 1. Add a factory/registry of FileSystem implementations.  Given a protocol, hostname and port, it should be possible to get a FileSystem implementation.
> 2. Path's constructor should accept URI-formatted strings & a configuration.
> 3. A new Path method should be added: FileSystem getFileSystem().  This returns the filesystem named in the path or the default configured filesystem.
> 4. Most methods which currently take FileSystem and Path parameters can be changed to take only Path.
> 5. Many FileSystem operations (create, open, delete, list, etc.) can become convenience methods on Path.
> 6. A URLStreamHandler can be defined in terms of the FileSystem API, so that URLs for any protocol with a registered FileSystem implementation can be accessed with a java.net.URL, permitting FileSystem implementations to be used on the classpath, etc.
> It is tempting to try to replace Path with java.net.URL, but URL's methods are insufficient for mapreduce.  We require directory listings, random access, location hints, etc., which are not supported by existing URLStreamHandler implementations.  But we can expose all FileSystem implementations for access with java.net.URL.
> (From a brainstorm with Owen.)

-- 
This message is automatically generated by JIRA.