Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2007/07/04 01:01:06 UTC

[jira] Created: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Create FileSystem implementation to read HDFS data via http
-----------------------------------------------------------

                 Key: HADOOP-1563
                 URL: https://issues.apache.org/jira/browse/HADOOP-1563
             Project: Hadoop
          Issue Type: New Feature
            Reporter: Owen O'Malley
            Assignee: Chris Douglas


There should be a FileSystem implementation that can read from a Namenode's http interface. This would have a couple of useful abilities:
  1. Copy using distcp between different versions of HDFS.
  2. Use map/reduce inputs from a different version of HDFS. 



[jira] Commented: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510998 ] 

Tom White commented on HADOOP-1563:
-----------------------------------

bq. Then we should try to use this as a source for MapReduce and distcp and see how it fares. The HTTP client may need to be replaced, file status may need to be cached, etc. But this simple approach will get us up and going, and avoid investing too much time designing a schema, parsing XML, etc. when that may not be required.

+1

A couple of points regarding the patch:

In HttpFileSystem#initialize the name variable is set to itself, so it's always null. 
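A minimal sketch of what that self-assignment looks like (hypothetical; the class body here is illustrative, not the patch itself):
{noformat}
import java.net.URI;

// The parameter shadows the field, so the assignment below is a
// self-assignment and the field silently stays null.
class HttpFileSystemSketch {
  private URI name;

  void initialize(URI name) {
    name = name;          // BUG: assigns the parameter to itself
    // this.name = name;  // fix: qualify with "this" to set the field
  }
}
{noformat}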

Removing getDefaultBlockSize() from S3FileSystem drops the "fs.s3.block.size" property (though it's still in hadoop-default.xml). This looks like a change that was made earlier in the checksumming work, so it's probably fine in the context of this patch.

Finally, some unit tests would be welcome. Otherwise, it looks good.



[jira] Commented: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510818 ] 

Doug Cutting commented on HADOOP-1563:
--------------------------------------

I think we should implement a servlet that:
1. Treats everything after HttpServletRequest#getContextPath() as a path.
2. If the path names an HDFS file, sets its attributes as HTTP headers and then, if the request is HEAD, returns an empty body; if GET, returns the content; otherwise returns an error.
3. If it's a HEAD or GET of a non-slash-terminated directory, redirects to the slash-terminated directory.
4. If it's a HEAD or GET of a slash-terminated directory name, sets attributes and, if GET, returns HTML containing links to that directory's files.
5. Otherwise returns an error.

Then we should try to use this as a source for MapReduce and distcp and see how it fares.  The HTTP client may need to be replaced, file status may need to be cached, etc.  But this simple approach will get us up and going, and avoid investing too much time designing a schema, parsing XML, etc. when that may not be required.
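Such a servlet might look like the following sketch (hypothetical: the class name, the injected FileSystem, and the use of listPaths/getFileStatus are illustrative, not code from any attached patch):
{noformat}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirListServletSketch extends HttpServlet {
  private FileSystem fs; // assumed to be initialized from the server's conf

  @Override
  protected void service(HttpServletRequest req, HttpServletResponse res)
      throws IOException {
    String method = req.getMethod();
    if (!"GET".equals(method) && !"HEAD".equals(method)) {
      res.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED);   // (5)
      return;
    }
    // (1) Everything after the context path names an HDFS path.
    String path = req.getRequestURI().substring(req.getContextPath().length());
    Path p = new Path(path.length() == 0 ? "/" : path);

    FileStatus status;
    try {
      status = fs.getFileStatus(p);
    } catch (IOException e) {
      res.sendError(HttpServletResponse.SC_NOT_FOUND);            // (5)
      return;
    }
    if (!status.isDir()) {
      // (2) A file: attributes as headers; HEAD sends no body.  A string
      // header avoids truncating lengths over 2^31.
      res.setHeader("Content-Length", Long.toString(status.getLen()));
      if ("GET".equals(method)) {
        InputStream in = fs.open(p);
        OutputStream out = res.getOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {
          out.write(buf, 0, n);
        }
        in.close();
      }
    } else if (!path.endsWith("/")) {
      res.sendRedirect(req.getRequestURI() + "/");                // (3)
    } else if ("GET".equals(method)) {
      // (4) Directory: HTML links to the children.
      PrintWriter out = res.getWriter();
      out.println("<html><body>");
      for (Path child : fs.listPaths(p)) {
        out.println("<a href=\"" + child.getName() + "\">"
            + child.getName() + "</a><br/>");
      }
      out.println("</body></html>");
    }
  }
}
{noformat}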

Thoughts?



[jira] Commented: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510788 ] 

Owen O'Malley commented on HADOOP-1563:
---------------------------------------

Please factor out the directory content parsing into a separate method so that it can be replaced in subclasses.



[jira] Commented: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510796 ] 

Doug Cutting commented on HADOOP-1563:
--------------------------------------

> factor out the directory content parsing into a separate method so that it can be replaced in subclasses.

Sure, that'd be easy, if we need to subclass.  Do we really?  Are there required features that cannot be supported by delivering HTML?  Will those features be guaranteed not to change from version-to-version, potentially compromising bi-directional compatibility?



[jira] Commented: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510929 ] 

Doug Cutting commented on HADOOP-1563:
--------------------------------------

A couple of thoughts:

1. If, for performance, we find we must cache FileStatus in most FileSystem#listPaths implementations, then that means the FileSystem API is inappropriate.  In this case, we should replace FileSystem#listPaths() and #getFileStatus() with a single new method:

public abstract Map<Path,FileStatus> listStatus(Path path) throws IOException;
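As a sketch of the difference (the listStatus call is only the proposal above, so it is commented out; listPaths and getFileStatus are the current API):
{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class ListStatusSketch {
  static void scan(FileSystem fs, Path dir) throws IOException {
    // Today: one call to list the directory, then one call per child.
    for (Path child : fs.listPaths(dir)) {
      FileStatus status = fs.getFileStatus(child);
      // ... use status ...
    }
    // Proposed: a single call returns each path with its status.
    // Map<Path,FileStatus> statuses = fs.listStatus(dir);
  }
}
{noformat}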

2. If, in HttpFileSystem, we find that (e.g., in order to efficiently support #listStatus) an HTML-based implementation is insufficient for HDFS, then we should not implement other directory formats by subclassing.  Rather, HttpFileSystem should use plugins for the various formats.  That better fits the existing FileSystem extension mechanism, which dispatches on protocol only.

The plugin interface might look like:

public interface HttpFileServer {
  /** Set connection properties prior to connect, typically authentication headers. */
  void prepareConnection(HttpURLConnection connection);
  /** Parse directory content. */
  Map<Path,FileStatus> parseDirectoryContent(byte[] content);
}

HttpFileSystem would pick an HttpFileServer implementation based on hostname, content type, or something similar.  Content-type would be elegant, but probably insufficient, since, e.g., S3 returns a content-type of application/xml.  Hostname would require reconfiguration for each site.  Perhaps we can use the "Server" header.  That would work for S3, and we could set it for HDFS.
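A sketch of that dispatch, assuming a small registry keyed on the Server header (the registry and its names are illustrative, not part of any attached patch):
{noformat}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.util.HashMap;
import java.util.Map;

class HttpFileServerRegistry {
  private final Map<String,HttpFileServer> byServer =
    new HashMap<String,HttpFileServer>();

  void register(String serverPrefix, HttpFileServer plugin) {
    byServer.put(serverPrefix, plugin);
  }

  /** Choose a plugin from the response's "Server" header. */
  HttpFileServer select(HttpURLConnection connection) throws IOException {
    String server = connection.getHeaderField("Server");
    if (server != null) {
      for (Map.Entry<String,HttpFileServer> e : byServer.entrySet()) {
        if (server.startsWith(e.getKey())) {
          return e.getValue();
        }
      }
    }
    throw new IOException("No HttpFileServer plugin for server: " + server);
  }
}
{noformat}
Registration might then look like registry.register("AmazonS3", new S3FileServer()) (S3 does send "AmazonS3" as its Server header), with a Hadoop entry keyed on whatever Server value the HDFS servlet sets.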



[jira] Commented: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511353 ] 

Tom White commented on HADOOP-1563:
-----------------------------------

bq. I don't see an easy way to handle S3 with this, exposing it as a hierarchical space of slash-delimited directories, except perhaps to write a servlet that proxies directory listings and redirects for file content.

The proxy idea sounds good - the servlet pseudocode would be something like:
{noformat} 
if path is not slash-terminated
  if HEAD S3 path is successful
    redirect to S3 resource at path
  else
    redirect to path/
else
  GET S3 bucket with prefix = path, delimiter = /
  if bucket is empty
    return 404
  else
    return bucket contents as XHTML
{noformat} 

(Of course, the work to do this would go in a new Jira issue.)
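Rendered as a servlet skeleton, that might look like the following (s3Head, s3ResourceUrl, and s3ListKeys are placeholders for whatever S3 client the proxy uses, not real APIs):
{noformat}
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class S3ProxyServletSketch extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse res)
      throws IOException {
    String path = req.getPathInfo() == null ? "/" : req.getPathInfo();
    if (!path.endsWith("/")) {
      if (s3Head(path)) {
        res.sendRedirect(s3ResourceUrl(path));        // a file: hand off to S3
      } else {
        res.sendRedirect(req.getRequestURI() + "/");  // retry as a directory
      }
      return;
    }
    // GET the bucket with prefix = path and delimiter = "/".
    List<String> keys = s3ListKeys(path, "/");
    if (keys.isEmpty()) {
      res.sendError(HttpServletResponse.SC_NOT_FOUND);
      return;
    }
    res.setContentType("text/html");
    for (String key : keys) {
      res.getWriter().println("<a href=\"" + key + "\">" + key + "</a><br/>");
    }
  }

  // Placeholder helpers; real versions would call the S3 REST API.
  private boolean s3Head(String path) { return false; }
  private String s3ResourceUrl(String path) {
    return "https://s3.amazonaws.com" + path;
  }
  private List<String> s3ListKeys(String prefix, String delimiter) {
    return Collections.emptyList();
  }
}
{noformat}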



[jira] Commented: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513077 ] 

Doug Cutting commented on HADOOP-1563:
--------------------------------------

I moved the FileSystem API cleanups from the patch here to HADOOP-1620, updating them to the current trunk.



[jira] Updated: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1563:
---------------------------------

    Attachment: httpfs.patch

Attaching a FileSystem for HTTP-browsable directories.

This assumes that:
 * the URL space is hierarchical and slash-delimited;
 * URLs of directories always end in a slash;
 * GET of a non-slash-terminated directory URL redirects to the slash-terminated URL;
 * the content of a directory URL contains the URLs of its children;
 * child URLs can be extracted from parent content with a regular expression.

This seems to work for directory listings produced by Apache, Tomcat, Jetty, and Subversion.  If we make HDFS browsable over HTTP in the above manner, then this will work for HDFS too.
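A sketch of that last assumption, extracting child links with a regular expression (the pattern here is illustrative; the patch may use a different one):
{noformat}
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractorSketch {
  // Matches href="..." values in generated directory listings.
  private static final Pattern HREF =
      Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

  static List<String> childLinks(String directoryHtml) {
    List<String> links = new ArrayList<String>();
    Matcher m = HREF.matcher(directoryHtml);
    while (m.find()) {
      String link = m.group(1);
      // Skip parent links and absolute URLs that leave the directory.
      if (!link.startsWith("..") && !link.contains("://")) {
        links.add(link);
      }
    }
    return links;
  }
}
{noformat}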

I've also added default implementations for a bunch of abstract FileSystem methods, and removed overriding definitions that simply matched these defaults, simplifying most FileSystem implementations.



[jira] Updated: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-1563:
----------------------------------

          Component/s: fs
    Affects Version/s: 0.14.0



[jira] Commented: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511237 ] 

Doug Cutting commented on HADOOP-1563:
--------------------------------------

Should HTML scraping prove inadequate, WebDAV might be useful for this.  Its PROPFIND method permits directory enumeration.

http://www.webdav.org/specs/rfc2518.html#METHOD_PROPFIND




[jira] Updated: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1563:
---------------------------------

    Attachment: httpfs2.patch

This fixes the 'name = name' issue Tom pointed out, and permits file lengths longer than 2^31 bytes.  I agree that this needs unit tests before it can be committed.  I'd also like to first implement a servlet for HDFS to test that performance is acceptable.

I don't see an easy way to handle S3 with this, exposing it as a hierarchical space of slash-delimited directories, except perhaps to write a servlet that proxies directory listings and redirects for file content.
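One likely shape of the 2^31 fix, assuming the patch reads the Content-Length header directly rather than using HttpURLConnection#getContentLength(), which returns an int (a sketch, not necessarily the patch's code):
{noformat}
import java.net.HttpURLConnection;

class ContentLengthSketch {
  /** Parse Content-Length as a long to support files over 2^31 bytes. */
  static long contentLength(HttpURLConnection connection) {
    String header = connection.getHeaderField("Content-Length");
    if (header == null) {
      return -1; // unknown length
    }
    try {
      return Long.parseLong(header.trim());
    } catch (NumberFormatException e) {
      return -1;
    }
  }
}
{noformat}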



[jira] Resolved: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas resolved HADOOP-1563.
-----------------------------------

    Resolution: Duplicate

See HADOOP-1568



[jira] Commented: (HADOOP-1563) Create FileSystem implementation to read HDFS data via http

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511823 ] 

Doug Cutting commented on HADOOP-1563:
--------------------------------------

I talked with Owen about this, and what he wants is more like a 'tar' format for the FileSystem API: something that preserves standard properties without being specific to the FileSystem implementation.  The URI for this should be something like hftp://host:port/a/b/c since, while HTTP will be used as the transport, this will not be a FileSystem for arbitrary HTTP URLs.  Finally, we agreed that the FileSystem API should be altered so that listStatus() is the primary method, replacing both listPaths() and getFileStatus().  Whether or not my HttpFileSystem (included above) is ever used, that patch also has some cleanups to the FileSystem API that should be committed.
