Posted to common-dev@hadoop.apache.org by "Marco Nicosia (JIRA)" <ji...@apache.org> on 2009/01/10 06:19:59 UTC

[jira] Created: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Replace HFTP/HSFTP with plain HTTP/HTTPS
----------------------------------------

                 Key: HADOOP-5010
                 URL: https://issues.apache.org/jira/browse/HADOOP-5010
             Project: Hadoop Core
          Issue Type: Improvement
          Components: contrib/hdfsproxy
    Affects Versions: 0.18.0
            Reporter: Marco Nicosia


In HADOOP-1563, [~cutting] wrote:
bq. The URI for this should be something like hftp://host:port/a/b/c, since, while HTTP will be used as the transport, this will not be a FileSystem for arbitrary HTTP urls.

Recently, we've been talking about implementing an HDFS proxy (HADOOP-4575) which would be a secure way to make HFTP/HSFTP available. In so doing, we may even remove HFTP/HSFTP from being offered on the HDFS itself (that's another discussion).

In the case of the HDFS proxy, does it make sense to do away with the artificial HFTP/HSFTP protocols, and instead simply offer standard HTTP and HTTPS? That would allow non-HDFS-specific clients, as well as using various standard HTTP infrastructure, such as load balancers, etc.

NB, to the best of my knowledge, HFTP is only documented on the [distcp|http://hadoop.apache.org/core/docs/current/distcp.html] page, and HSFTP is not documented at all?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Marco Nicosia (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663388#action_12663388 ] 

Marco Nicosia commented on HADOOP-5010:
---------------------------------------

bq. So maybe all we need is better documentation of what's passed over HTTP?

If there's a guarantee that subsequent requests need not always reach the same server (i.e., any session over the protocol is managed via a single persistent HTTP/1.1 connection, cookies, or some other session mechanism), then yes, better documentation of how the HTTP protocol is used would allow "creative" admins to use existing HTTP infrastructure in their Hadoop deployments.

bq. I think we cannot simply use standard HTTP because it does not support file system access.

If the limitation is that HTTP doesn't specify how to get/put structured data (such as a directory listing), why not use some well accepted standard, such as REST?

The reason I'm pushing for this is that the closer Hadoop comes to presenting some standards-compliant interface, the easier it becomes for users to integrate Hadoop into existing infrastructure(s). Currently, one of the least obvious points of integration is how to get data both onto, and back off of, an HDFS.




[jira] Updated: (HADOOP-5010) Document HTTP/HTTPS methods to read directory and file data

Posted by "Marco Nicosia (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Nicosia updated HADOOP-5010:
----------------------------------

    Summary: Document HTTP/HTTPS methods to read directory and file data  (was: Replace HFTP/HSFTP with plain HTTP/HTTPS)

> Document HTTP/HTTPS methods to read directory and file data
> -----------------------------------------------------------
>
>                 Key: HADOOP-5010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5010
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.18.0
>            Reporter: Marco Nicosia
>             Fix For: 0.20.0
>
>
> In HADOOP-1563, [~cutting] wrote:
> bq. The URI for this should be something like hftp://host:port/a/b/c, since, while HTTP will be used as the transport, this will not be a FileSystem for arbitrary HTTP urls.
> Recently, we've been talking about implementing an HDFS proxy (HADOOP-4575) which would be a secure way to make HFTP/HSFTP available. In so doing, we may even remove HFTP/HSFTP from being offered on the HDFS itself (that's another discussion).
> In the case of the HDFS proxy, does it make sense to do away with the artificial HFTP/HSFTP protocols, and instead simply offer standard HTTP and HTTPS? That would allow non-HDFS-specific clients, as well as using various standard HTTP infrastructure, such as load balancers, etc.
> NB, to the best of my knowledge, HFTP is only documented on the [distcp|http://hadoop.apache.org/core/docs/current/distcp.html] page, and HSFTP is not documented at all?



[jira] Commented: (HADOOP-5010) Document HTTP/HTTPS methods to read directory and file data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663887#action_12663887 ] 

Doug Cutting commented on HADOOP-5010:
--------------------------------------

So, if this is a documentation request, then we need to decide whether we indeed want to document this HTTP-based protocol.  If we do, that will encourage folks to build other tools that use it, and we will need to support it longer-term than we might otherwise.  If only GET of a file's content is required, then perhaps that's all we should document?



[jira] Commented: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663410#action_12663410 ] 

Doug Cutting commented on HADOOP-5010:
--------------------------------------

> use some well accepted standard

That's possible.  An appropriate HTTP-based standard for filesystem access might be WebDAV.

HFTP was designed to meet a particular goal: cross-version filesystem access.  Implementing an accepted standard is a more ambitious project.

> Currently, one of the least obvious points of integration is how to get data both onto, and back off of, an HDFS.

Distcp is the best tool today for this.  How is it insufficient?  We have an FTP FileSystem implementation, so we can import data from external systems that way.  One can also use file: URIs to import data from NFS.  We could implement a WebDAV filesystem, so that folks could use dav: URIs to import and export datasets from web servers that have mod_dav installed.  Would that help?





[jira] Issue Comment Edited: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Kan Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663594#action_12663594 ] 

kzhang edited comment on HADOOP-5010 at 1/13/09 5:51 PM:
------------------------------------------------------------

> Can you say more about the requirement?

We have an internal interface by which clients can query a metadata server to get file locations and other metadata, and then use HTTPS clients like curl to retrieve the actual files from datastore servers. We want hdfsproxy to act as those datastore servers. Directory listings are not required, and neither is browser support; all that's needed is an HTTP GET request to fetch a file. It could be a thin wrapper around the existing streamFile servlet. Some convention is needed, such as how to specify which cluster a file is stored on, since we want one hdfsproxy to proxy for multiple HDFS clusters.

      was (Author: kzhang):
    > Can you say more about the requirement?

We have an internal interface by which clients can query a metadata server to get file locations and then use HTTPS clients like curl to retrieve the actual files from datastore servers. We want hdfsproxy to act as datastore servers. Directory listings are not required. Neither is support for browsers. Just using HTTP GET request to fetch a file. Could be a thin wrapper around existing streamFile servlet. Some convention is needed, like how to specify the cluster that the file is stored on since we want one hdfsproxy to proxy for multiple HDFS clusters.
  


[jira] Commented: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663523#action_12663523 ] 

Chris Douglas commented on HADOOP-5010:
---------------------------------------

bq. HFTP is only documented on the distcp page, and HSFTP is not documented at all?

HSFTP is the same protocol (the same server) over an SSL connector; we can speak of them interchangeably. The HFTP protocol is not documented outside of its FileSystem implementation, which should be remedied, but the premise for this issue seems ill-defined.

I don't know what "plain", "pure", and "standard" HTTP refer to in a filesystem context, if not adherence to an RFC for which there are already tools. If not WebDAV, then either some other standard must be chosen, or we define our own conventions for listing directories, writing/appending to files, deleting resources, managing permissions, etc. Unless we also want to write a client (which returns us to where we started), are there better options than picking a standard and (partially?) implementing it?

bq. Here the focus seems to be on a servlet that implements the server-side of this for HDFS. That seems reasonable. It would also be browsable, which is nice.

Counting the listPaths servlet, there are already two interfaces for browsing HDFS over HTTP, aren't there? This seems to be asking for a way to manipulate HDFS without the Hadoop jar. If reading is sufficient, then the HFTP servlets should suffice for hand-rolled tools, but they need to be documented.



[jira] Commented: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663047#action_12663047 ] 

Doug Cutting commented on HADOOP-5010:
--------------------------------------

> simply offer standard HTTP and HTTPS

HFTP and HSFTP are just internal naming schemes, a way to encode HDFS file names but indicate that a different mechanism should be used to access them.

> That would allow non-HDFS-specific clients, as well as using various standard HTTP infrastructure, such as load balancers, etc.

We already use HTTP and HTTPS as the transport for HFTP and HSFTP.  So maybe all we need is better documentation of what's passed over HTTP?





[jira] Commented: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Kan Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663433#action_12663433 ] 

Kan Zhang commented on HADOOP-5010:
-----------------------------------

I suggest
1. We keep the HSFTP interface on hdfsproxy as is, so that existing filesystem clients like distcp can continue to work.
2. In the short term, add *pure* HTTP support for retrieving files using standard HTTP clients like curl. This may fall short of a full-fledged system like WebDAV, but it's very useful by itself (we actually have a requirement for it at Yahoo) and a good starting point.



[jira] Updated: (HADOOP-5010) Document HTTP/HTTPS methods to read directory and file data

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-5010:
----------------------------------

    Attachment: 5010-0.patch

bq. So, if this is a documentation request, then we need to decide whether we indeed want to document this HTTP-based protocol.

IMHO, hftp should live while it is necessary for cross-version compatibility, and that alone will keep it around longer than we otherwise might support it. This patch adds a line to the "web interface" section of the HDFS user guide, noting the syntax for the FileDataServlet but ignoring any permissions issues.



[jira] Updated: (HADOOP-5010) Document HTTP/HTTPS methods to read directory and file data

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-5010:
----------------------------------

         Priority: Trivial  (was: Major)
    Fix Version/s:     (was: 0.20.0)
                   0.21.0



[jira] Commented: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663475#action_12663475 ] 

Doug Cutting commented on HADOOP-5010:
--------------------------------------

> There is currently no external way to push data to an HDFS nor pull from an HDFS using an existing standard; instead anyone wishing to do so must install HDFS clients on computers that do not otherwise run Hadoop software.

Or simply run something like 'ssh foo distcp ...', where foo is a host in the cluster.  It would be better to know more about the requirement.

> add a pure HTTP support for retrieving files using standard HTTP clients like curl

https://issues.apache.org/jira/browse/HADOOP-1563?focusedCommentId=12510760#action_12510760

In that comment I suggest a convention for encoding directory listings as links in the HTML of slash-ending URLs.  I also provided a patch there that implements a client for this.  Here the focus seems to be on a servlet that implements the server-side of this for HDFS.  That seems reasonable.  It would also be browsable, which is nice.
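As a hedged illustration (not Hadoop's actual implementation), the convention Doug describes can be exercised with only a few lines of client code: a GET on a slash-ending URL returns HTML whose anchor hrefs are the directory's entries. The HTML shape below is an assumption chosen for illustration; any anchor-per-entry listing would parse the same way.

```python
# Minimal sketch of a client for a slash-ending-URL directory listing:
# every <a href> in the returned HTML is taken as one directory entry.
# (The HTML layout here is an illustrative assumption, not a documented
# Hadoop format.)
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        # Collect the href of every anchor tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.entries.append(href)

def parse_listing(html):
    """Return the directory entries encoded as links in a listing page."""
    parser = ListingParser()
    parser.feed(html)
    return parser.entries

# Example: a hypothetical listing for /user/marco/
sample = '<html><body><a href="a.txt">a.txt</a><a href="logs/">logs/</a></body></html>'
print(parse_listing(sample))  # ['a.txt', 'logs/']
```

Entries ending in "/" would be subdirectories, which a client could recurse into with further GETs, which is what makes the convention browsable as well.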

> we actually have a requirement for it in Yahoo

Can you say more about the requirement?  Are directory listings required?  Is other file status information required?  Some file status information has a natural place in HTTP (e.g., the last-modified header), but some does not (e.g., owner, group & permissions).
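The asymmetry noted above can be made concrete with a small sketch: modification time and length map directly onto standard HTTP response headers, while owner, group, and permissions have no standard header and would need a custom convention. (The plain-dict header representation is an assumption for illustration.)

```python
# Which pieces of HDFS file status plain HTTP can carry naturally.
# Headers are modeled as a plain dict for illustration.
from email.utils import parsedate_to_datetime

def status_from_headers(headers):
    """Extract the file status that standard HTTP headers can express."""
    status = {}
    if "Last-Modified" in headers:
        # RFC 2822 date, e.g. "Tue, 13 Jan 2009 17:51:00 GMT"
        status["mtime"] = parsedate_to_datetime(headers["Last-Modified"])
    if "Content-Length" in headers:
        status["length"] = int(headers["Content-Length"])
    # Owner, group and permissions have no standard header; a server
    # would have to invent custom X- headers or another convention.
    return status

hdrs = {"Last-Modified": "Tue, 13 Jan 2009 17:51:00 GMT",
        "Content-Length": "4096"}
print(status_from_headers(hdrs))
```

Anything beyond the two fields above is exactly where "plain HTTP" stops and a custom (or WebDAV-style) convention begins.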



[jira] Commented: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Kan Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663594#action_12663594 ] 

Kan Zhang commented on HADOOP-5010:
-----------------------------------

> Can you say more about the requirement?

We have an internal interface by which clients can query a metadata server to get file locations and then use HTTPS clients like curl to retrieve the actual files from datastore servers. We want hdfsproxy to act as those datastore servers. Directory listings are not required, and neither is browser support; all that's needed is an HTTP GET request to fetch a file. It could be a thin wrapper around the existing streamFile servlet. Some convention is needed, such as how to specify which cluster a file is stored on, since we want one hdfsproxy to proxy for multiple HDFS clusters.
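One way to picture the missing convention is a URL scheme that names the cluster as the first path component. The sketch below is hypothetical (the host, port, and path layout are assumptions for illustration, not the syntax of any released hdfsproxy):

```python
# Hypothetical URL convention for a multi-cluster hdfsproxy: the first
# path component names the cluster, the remainder is the HDFS path.
# (Illustrative assumption only, not documented hdfsproxy behavior.)
from urllib.parse import quote

def proxy_url(proxy_host, cluster, hdfs_path, port=443):
    """Build an HTTPS URL naming both the target cluster and the file."""
    return "https://%s:%d/%s%s" % (proxy_host, port, cluster, quote(hdfs_path))

url = proxy_url("proxy.example.com", "clusterA", "/user/kan/part-00000")
print(url)  # https://proxy.example.com:443/clusterA/user/kan/part-00000
```

A standard client could then fetch the file with something like `curl -O <url>`, with no Hadoop jar installed on the client machine.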



[jira] Commented: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Marco Nicosia (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663431#action_12663431 ] 

Marco Nicosia commented on HADOOP-5010:
---------------------------------------

bq. Distcp is the best tool today for this. How is it insufficient?

Distcp works for pulling data from a source, or pushing data to a destination. In both cases, distcp implies running a Hadoop job. There is currently no external way to push data to an HDFS, nor pull from an HDFS, using an existing standard; instead, anyone wishing to do so must install HDFS clients on computers that do not otherwise run Hadoop software.

bq. That's possible. An appropriate HTTP-based standard for filesystem access might be WebDav.
bq. Implementing an accepted standard is a more ambitious project.

I remember previous attempts to make WebDav available, and recognize that as an ambitious goal.

My naive thought is that HFTP is very close to a much simpler feature. The main purpose of the HDFS proxy _could be_ to make HDFS files available to a standard web client (curl, Net::HTTP, etc.), so that it can retrieve file listings and file contents from the HDFS proxy without installing an HDFS client, which is otherwise required to speak H{S}FTP.

The only difference is that HDFS proxy/H{S}FTP have invented an internal way of exposing this data where existing standards could have been used?




[jira] Commented: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663148#action_12663148 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-5010:
------------------------------------------------

>.. does it make sense to do away with the artificial HFTP/HSFTP protocols, and instead simply offer standard HTTP and HTTPS? 

HFTP is a file system interface, which is currently implemented over HTTP.  I agree if you say that the name HFTP is bad or misleading.

The "artificial" part of HFTP defines the way of accessing the file system.  For example, 
{code}
Path p = new Path("hftp://namenode:port/foo/bar");
FileStatus[] statuses = p.getFileSystem(conf).listStatus(p);
...
{code}
the code above actually accesses "http://namenode:port/listPaths/foo/bar?ugi=user,groups", where listPaths is a servlet running on the NameNode and ugi is a request parameter.  The NameNode replies with the listing in XML format; the HFTP client parses it, constructs FileStatus objects, and returns them.  Without the HFTP interface, clients would have to know all the details, including the servlet name, URL parameters, XML format, etc., in order to access the file system.

I think we cannot simply use standard HTTP because it does not support file system access.
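The translation described above can be sketched as a plain string mapping, using the servlet name and ugi parameter from the listPaths example (the exact servlet name and parameter format should be treated as version-dependent):

```java
// Sketch of the hftp-to-HTTP mapping from the example above: listing an
// hftp://host:port/path becomes a GET on the NameNode's listPaths servlet.
// The servlet name and parameter format are version-dependent details.
public class HftpUrlSketch {
    static String listPathsUrl(String host, String port, String path, String ugi) {
        return "http://" + host + ":" + port + "/listPaths" + path + "?ugi=" + ugi;
    }

    public static void main(String[] args) {
        // Mirrors the hftp://namenode:port/foo/bar example in the comment.
        System.out.println(listPathsUrl("namenode", "50070", "/foo/bar", "user,groups"));
    }
}
```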



[jira] Commented: (HADOOP-5010) Document HTTP/HTTPS methods to read directory and file data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666665#action_12666665 ] 

Doug Cutting commented on HADOOP-5010:
--------------------------------------

So the expectation is that this will work for world-readable files?  Should we document that?  Marco, is this sufficient?

> Document HTTP/HTTPS methods to read directory and file data
> -----------------------------------------------------------
>
>                 Key: HADOOP-5010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5010
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.18.0
>            Reporter: Marco Nicosia
>            Priority: Trivial
>             Fix For: 0.21.0
>
>         Attachments: 5010-0.patch
>



[jira] Updated: (HADOOP-5010) Replace HFTP/HSFTP with plain HTTP/HTTPS

Posted by "Marco Nicosia (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Nicosia updated HADOOP-5010:
----------------------------------

      Component/s:     (was: contrib/hdfsproxy)
                   documentation
    Fix Version/s: 0.20.0

bq. Counting the listPaths servlet, there are already two interfaces for browsing HDFS over HTTP, aren't there? This seems to be asking for a way to manipulate HDFS without the Hadoop jar. If reading is sufficient, then the HFTP servlets should suffice for hand-rolled tools

Reading is sufficient (per my original request). I didn't know that there's a combination of HTTP requests which allows an HTTP client to get directory listings and file data.

Do listPaths and the .../data/... component respect the dfs.web.ugi directive? (Then again, this is what the HDFS proxy was invented for, so permissions should be a non-issue.) When Hadoop becomes Kerberized, these servlets will need to require credentials over HTTP.

bq. but they need to be documented.

Yes. Switching this bug to a documentation task.
