Posted to hdfs-dev@hadoop.apache.org by Colin McCabe <cm...@apache.org> on 2016/05/02 19:32:12 UTC

Re: Another thought on client-side support of HDFS federation

Hi Tianyi HE,

Thanks for sharing this!  This reminds me of the httpfs daemon.  This
daemon basically sits in front of an HDFS cluster and accepts requests,
which it serves by forwarding them to the underlying HDFS instance. 
There is some documentation about it here:
https://hadoop.apache.org/docs/stable/hadoop-hdfs-httpfs/index.html

Since httpfs uses an org.apache.hadoop.fs.FileSystem instance, it seems
like you could plug in the org.apache.hadoop.fs.viewfs.ViewFileSystem
class and be up and running with federation.  I haven't tried this, but I
would expect it to work, unless there are bugs in ViewFS itself.

The big advantage of httpfs is that it provides a webhdfs-style REST
interface.  As you said, this kind of interface makes it simple to use
any language with REST bindings, without worrying about using a thick
client.

The big disadvantage of httpfs is that you must move both metadata and
data operations through the httpfs daemon.  This could become a
performance bottleneck.  It seems like you are concerned about this
bottleneck.

We also have webhdfs.  Unlike httpfs, webhdfs doesn't require all the
data to move through its daemon.  With webhdfs, the client talks to
DataNodes directly.
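
As a rough illustration of the webhdfs flow described above: a client
sends an op=OPEN request to the NameNode's HTTP address, the NameNode
answers with a redirect whose Location header names a DataNode, and the
client then reads the data from that DataNode.  The sketch below only
builds the request URL and parses a redirect target.  It does not touch
the network.  The host names, the ports (50070 and 50075), and the sample
Location value are assumptions made for illustration, not values from
this thread.

```python
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1/<path>?op=..."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def redirect_target(location):
    """Return the (host, port) that a redirect Location header points at."""
    parts = urlsplit(location)
    return parts.hostname, parts.port


# Step 1: the client asks the NameNode (hypothetical host and port).
open_url = webhdfs_url("nn.example.com", 50070, "/user/alice/data.txt", "OPEN")

# Step 2: the NameNode replies with a redirect; this Location is made up.
sample_location = (
    "http://dn1.example.com:50075/webhdfs/v1/user/alice/data.txt"
    "?op=OPEN&offset=0"
)
datanode = redirect_target(sample_location)
```

The point of the two steps is that only the small metadata request goes
to the NameNode; the bytes themselves come from whichever DataNode the
redirect names.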

I wonder if extending httpfs or webhdfs would be a better approach than
starting from scratch.  There is a maintenance burden for adding new
services and daemons.  This was our motivation for removing hftp, for
example.  It's certainly something to think about.

best,
Colin

On Thu, Apr 28, 2016, at 17:55, 何天一 wrote:
> Hey guys,
> 
> My associates have recently investigated HDFS federation, which turns out
> to be quite a good solution for improving scalability on the
> NameNode/DataNode side.
> 
> However, we encountered some problems on the client side:
> A) For historical reasons, we use clients in multiple languages to access
> HDFS (e.g. python-snakebite, or perhaps libhdfs++). So we must either
> implement multiple versions of ViewFS or give up the consistent view
> (which can be confusing to users).
> B) We have Hadoop client configuration deployed on client nodes that we
> do not control. Also, releasing a new configuration could be a really
> heavy operation, because it needs to be pushed to several thousand
> nodes while maintaining consistency (say a node is down throughout the
> operation and then comes back online; it could still hold a stale
> version of the configuration).
> 
> So we explored another solution to these problems and came up with a
> proxy model: build an RPC proxy in front of the NameNodes.
> All clients talk to the proxy when they need to consult a NameNode, and
> the proxy decides which NameNode each request should go to according to
> the mount table.
> This solved our problem. All clients are seamlessly upgraded with
> federation support.
> We open-sourced the proxy recently: https://github.com/bytedance/nnproxy
> (BTW, all kinds of feedback are welcome)
> 
> But there are still a few issues. For example, several modifications
> need to be made inside Hadoop IPC to support RPC forwarding. We released
> the corresponding patches with the nnproxy project (
> https://github.com/bytedance/nnproxy/tree/master/hadoop-patches), but it
> would be better to have these merged into Apache trunk. Does anyone think
> this is worthwhile?
> 
> 
> -- 
> Cheers,
> Tianyi HE
> (+86) 185 0042 4096
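
To make the routing step in the quoted proposal concrete, here is a
minimal sketch of how a proxy might pick a NameNode from a mount table.
It assumes the mount table is a plain mapping from mount point to
NameNode address and that the longest matching mount point wins.  Both
are assumptions for illustration; the code is not taken from nnproxy or
ViewFS, and the addresses and the resolve() helper are hypothetical.

```python
def resolve(mount_table, path):
    """Return the NameNode for the longest mount point that covers path."""
    best_prefix, best_target = None, None
    for mount_point, namenode in mount_table.items():
        # "/" becomes "", so a root mount matches every absolute path.
        prefix = mount_point.rstrip("/")
        # Match whole path components only: "/user" must not cover "/username".
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and (best_prefix is None or len(prefix) > len(best_prefix)):
            best_prefix, best_target = prefix, namenode
    if best_target is None:
        raise LookupError(f"no mount point covers {path}")
    return best_target


# Hypothetical mount table: three mount points on three NameNodes.
table = {
    "/user": "nn1.example.com:8020",
    "/user/logs": "nn2.example.com:8020",
    "/tmp": "nn3.example.com:8020",
}

logs_nn = resolve(table, "/user/logs/2016/05")  # deeper mount point wins
home_nn = resolve(table, "/user/alice")
```

Keeping this lookup in one proxy is what lets the mount table change
without pushing new configuration to every client node.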

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org

Re: Another thought on client-side support of HDFS federation

Posted by Chris Nauroth <cn...@hortonworks.com>.

Hello Tianyi HE,

I noticed that a similar design for a federation proxying model has just
been proposed on Apache JIRA HDFS-10467.  You might want to join the
conversation there.

https://issues.apache.org/jira/browse/HDFS-10467


--Chris Nauroth





