Posted to common-dev@hadoop.apache.org by Jun Rao <ju...@almaden.ibm.com> on 2009/01/07 18:19:58 UTC

short-circuiting HDFS reads


Hi,

Today, HDFS always reads through a socket even when the data is local to
the client. This adds a lot of overhead, especially for warm reads. It
should be possible for a DFS client to test whether a block to be read is
local and, if so, bypass the socket and read through the local FS API
directly. This should improve random access performance significantly
(e.g., for HBase). Has this been considered in HDFS? Thanks,

Jun
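
To make the idea concrete, here is a rough sketch of the client-side check.
The locality test uses the public FileSystem API; findLocalBlockFile is
hypothetical, since HDFS does not expose block file paths to clients (that
missing interface is exactly what the replies below wrestle with):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.RandomAccessFile;
    import java.net.InetAddress;

    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShortCircuitSketch {

      /** Open 'file' for reading at 'offset'; bypass the datanode socket
          when a replica of the enclosing block lives on this host. */
      public static InputStream openPossiblyLocal(FileSystem fs, Path file,
                                                  long offset) throws IOException {
        FileStatus stat = fs.getFileStatus(file);
        // Which hosts have a replica of the block covering 'offset'?
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, offset, 1);
        String self = InetAddress.getLocalHost().getHostName();
        for (String host : blocks[0].getHosts()) {
          if (self.equals(host)) {
            // Hypothetical call: HDFS gives clients no way to map a block
            // to its file under the datanode's storage directories.
            File blockFile = findLocalBlockFile(file, blocks[0]);
            if (blockFile != null) {
              RandomAccessFile raf = new RandomAccessFile(blockFile, "r");
              raf.seek(offset - blocks[0].getOffset());  // position within the block
              // Note: this path skips HDFS checksum verification entirely.
              return new FileInputStream(raf.getFD());
            }
          }
        }
        // Fall back to the normal path: read through the datanode's socket.
        FSDataInputStream in = fs.open(file);
        in.seek(offset);
        return in;
      }

      // Placeholder only; not implementable from outside the datanode today.
      private static File findLocalBlockFile(Path file, BlockLocation block) {
        return null;
      }
    }

Even the sketch shows the rough edges: the client would need to know the
datanode's disk layout, and the direct path loses checksumming and any
access checks, which is where the discussion below picks up.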

Re: short-circuiting HDFS reads

Posted by stack <st...@duboce.net>.
On Thu, Jan 8, 2009 at 10:39 AM, Doug Cutting <cu...@apache.org> wrote:

> So this sort of optimization may only have a significant impact for jobs
> whose maps do not produce much output.


Don't forget the lowly non-MR users of HDFS.  The short-circuit looks
promising for improving HDFS random access times.
St.Ack

Re: short-circuiting HDFS reads

Posted by Doug Cutting <cu...@apache.org>.
Chris K Wensel wrote:
> Any comments on the probability (currently) that reads by a Task are 
> over the network vs. being "local", as seen in your tests? That is, are 
> 10% of block reads over the network, or 90% of reads?

Greater than 90% of map reads are typically local in a sort job, like 
98-99%.  But map input is not the bottleneck in sort.  Shuffle and 
reduce output are both considerably slower.  So this sort of 
optimization may only have a significant impact for jobs whose maps do 
not produce much output.

Doug

Re: short-circuiting HDFS reads

Posted by Chris K Wensel <ch...@wensel.net>.
Hey George

Any comments on the probability (currently) that reads by a Task are  
over the network vs. being "local", as seen in your tests? That is,  
are 10% of block reads over the network, or 90% of reads?

I haven't looked, but am wondering if this metric is stuffed somewhere
by Hadoop...
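
At task granularity the JobTracker does keep this: the job counters include
entries like "Data-local map tasks" and "Rack-local map tasks". A rough
sketch of pulling them for a completed job; the group and counter names
below follow the 0.19-era mapred counters and may differ in other versions,
and the job id is just an example:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class MapLocality {
      public static void main(String[] args) throws Exception {
        JobClient jc = new JobClient(new JobConf());
        // e.g. args[0] = "job_200901081012_0001" (hypothetical job id)
        RunningJob job = jc.getJob(JobID.forName(args[0]));
        Counters counters = job.getCounters();
        String group = "org.apache.hadoop.mapred.JobInProgress$Counter";
        long launched = counters.getGroup(group).getCounter("TOTAL_LAUNCHED_MAPS");
        long dataLocal = counters.getGroup(group).getCounter("DATA_LOCAL_MAPS");
        System.out.printf("%d of %d maps were data-local (%.1f%%)%n",
            dataLocal, launched, 100.0 * dataLocal / launched);
      }
    }

That answers the 10% vs. 90% question at the map-task level; per-block read
locality isn't directly exported, though the datanode metrics may also
distinguish local-client from remote-client reads.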

ckw


On Jan 8, 2009, at 10:13 AM, George Porter wrote:

> Hi Jun,
>
> The earlier responses to your email reference the JIRA that I opened  
> about this issue.  Short-circuiting the primary HDFS datapath does  
> improve throughput, and the amount depends on your workload (random  
> reads especially).  Some initial experimental results are posted to  
> that JIRA.  A second advantage is that since the JVM hosting the  
> HDFS client is doing the reading, the O/S will satisfy future disk  
> requests from the cache, which isn't really possible when you read  
> over the network (even to another JVM on the same host).
>
> There are several real disadvantages, the largest of which include  
> 1) it adds a new datapath, and 2) bypasses various security and  
> auditing features of HDFS.  I would certainly like to think through  
> a more clean interface for achieving this goal, especially since  
> reading local data should be the common case.  Any thoughts you  
> might have would be appreciated.
>
> Thanks,
> George
>
> Jun Rao wrote:
>> Hi,
>>
>> Today, HDFS always reads through a socket even when the data is  
>> local to
>> the client. This adds a lot of overhead, especially for warm reads.  
>> It
>> should be possible for a dfs client to test if a block to be read  
>> is local
>> and if so, bypass socket and read through local FS api directly. This
>> should improve random access performance significantly (e.g., for  
>> HBase).
>> Has this been considered in HDFS? Thanks,
>>
>> Jun
>>
>>

--
Chris K Wensel
chris@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/


Re: short-circuiting HDFS reads

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi folks,

This is a very interesting discussion. We have been considering divvying up
a set of "old" NetApp storage among a small number of nodes that run HDFS.
This HDFS could be used to archive rarely used data from our production
cluster. The NetApp storage could actually be mounted on all HDFS client
machines, and it would be nice if there were a short-circuit protocol that
let HDFS clients write directly to the block file.

This is helpful in the scenario where shared storage can be accessed
directly by datanodes as well as HDFS clients. I understand this exposes
problems with security and such.

-dhruba

On Fri, Feb 13, 2009 at 2:39 PM, Sanjay Radia <sr...@yahoo-inc.com> wrote:

>
> On Jan 8, 2009, at 10:13 AM, George Porter wrote:
>
>> Hi Jun,
>>
>> The earlier responses to your email reference the JIRA that I opened
>> about this issue.  Short-circuiting the primary HDFS datapath does
>> improve throughput, and the amount depends on your workload (random
>> reads especially).  Some initial experimental results are posted to that
>> JIRA.  A second advantage is that since the JVM hosting the HDFS client
>> is doing the reading, the O/S will satisfy future disk requests from the
>> cache, which isn't really possible when you read over the network (even
>> to another JVM on the same host).
>>
>> There are several real disadvantages, the largest of which include 1) it
>> adds a new datapath, and 2) bypasses various security and auditing
>> features of HDFS.
>>
>
> We are in middle of adding security to HDFS.
> Having the client read the blocks directly would violate security. Security
> is a specially thorny problem to solve in this case.
> Further the internal structure and hence the path name of the file are not
> visible outside.
> One could consider hacking this (ignoring security) but even this gets
> tricky as the directory in which the block is saved may change if
> some one starts to write to the file (which  can happen with the recent
>  append work),
>
> Interesting optimization but tricky to do in a clean way (at least not
> obvious to me).
>
>
> sanjay
>
>
>
>> I would certainly like to think through a more clean
>> interface for achieving this goal, especially since reading local data
>> should be the common case.  Any thoughts you might have would be
>> appreciated.
>>
>> Thanks,
>> George
>>
>> Jun Rao wrote:
>> > Hi,
>> >
>> > Today, HDFS always reads through a socket even when the data is local to
>> > the client. This adds a lot of overhead, especially for warm reads. It
>> > should be possible for a dfs client to test if a block to be read is local
>> > and if so, bypass socket and read through local FS api directly. This
>> > should improve random access performance significantly (e.g., for HBase).
>> > Has this been considered in HDFS? Thanks,
>> >
>> > Jun
>> >
>> >
>>
>>
>

Re: short-circuiting HDFS reads

Posted by Sanjay Radia <sr...@yahoo-inc.com>.
On Jan 8, 2009, at 10:13 AM, George Porter wrote:

> Hi Jun,
>
> The earlier responses to your email reference the JIRA that I opened
> about this issue.  Short-circuiting the primary HDFS datapath does
> improve throughput, and the amount depends on your workload (random
> reads especially).  Some initial experimental results are posted to  
> that
> JIRA.  A second advantage is that since the JVM hosting the HDFS  
> client
> is doing the reading, the O/S will satisfy future disk requests from  
> the
> cache, which isn't really possible when you read over the network  
> (even
> to another JVM on the same host).
>
> There are several real disadvantages, the largest of which include  
> 1) it
> adds a new datapath, and 2) bypasses various security and auditing
> features of HDFS.
>
We are in the middle of adding security to HDFS.
Having the client read the blocks directly would violate security, and
security is an especially thorny problem to solve in this case.
Further, the internal structure, and hence the path name of the block
file, are not visible outside HDFS.
One could consider hacking this (ignoring security), but even that gets
tricky, as the directory in which the block is saved may change if
someone starts to write to the file (which can happen with the recent
append work).

An interesting optimization, but tricky to do in a clean way (at least not
obvious to me).


sanjay


> I would certainly like to think through a more clean
> interface for achieving this goal, especially since reading local data
> should be the common case.  Any thoughts you might have would be
> appreciated.
>
> Thanks,
> George
>
> Jun Rao wrote:
> > Hi,
> >
> > Today, HDFS always reads through a socket even when the data is local to
> > the client. This adds a lot of overhead, especially for warm reads. It
> > should be possible for a dfs client to test if a block to be read is local
> > and if so, bypass socket and read through local FS api directly. This
> > should improve random access performance significantly (e.g., for HBase).
> > Has this been considered in HDFS? Thanks,
> >
> > Jun
> >
> >
>


Re: short-circuiting HDFS reads

Posted by Jun Rao <ju...@almaden.ibm.com>.
George,

Thanks for the comment.

Note that HDFS files are never updated in-place. This should make it easy
to deal with cache consistency.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com
(408)927-1886 (phone)
(408)927-3215 (fax)


George.Porter@Sun.COM wrote on 01/12/2009 10:11:08 AM:

> Hi Jun,
>
> Thanks for the pointer to clear the disk cache, as well as the
> suggestion for creating a DFS Client cache layer.  As for the double
> buffering overhead, I think that there is not going to be a large
> benefit to buffering in the DataNode, since the DataNode itself does not
> use the data in the buffer (it just forwards that data to HDFS
> clients).  With the ability to perform zero-copy I/O operations, it
> probably shouldn't buffer any data at all, since it could just
> sendfile() the data directly from the disk to the network client via
> DMA, rather than copying it from disk into its memory space, then from
> memory to the socket.  The downside of a DFS client cache is that it
> would need to be kept consistent, which would probably add a lot of
> complexity to the client.  It is an interesting idea, though, and I
> think we should keep thinking about it.
>
> Thanks,
> George
>
> Jun Rao wrote:
> >
> > Hi, George,
> >
> > I read the results in your JIRA. Very encouraging. It would be useful
> > to test the improvement on both cold and warm data (warm data likely
> > has larger improvment). There is a simple way to clear the file cache
> > on Linux (http://www.linuxinsight.com/proc_sys_vm_drop_caches.html).
> >
> > An alternative approach is to build an in-memory caching layer on top
> > of a DFS Client. The advantages are (1) less security issues; (2)
> > probably even better performance since checksum can be avoided once
> > the data is cached in memory; (3) the caching layer can be used
> > anywhere, no just nodes owning a block locally. The disadvantage is
> > that data is buffered twice in memory: once in the caching layer and
> > once in the OS file cache. One can probably limit the OS file cache
> > size (not sure if there is an easy way in Linux). What's your thought
> > on this?
> >
> > Jun
> >
> > George.Porter@Sun.COM wrote on 01/08/2009 10:13:25 AM:
> >
> > > Hi Jun,
> > >
> > > The earlier responses to your email reference the JIRA that I opened
> > > about this issue.  Short-circuiting the primary HDFS datapath does
> > > improve throughput, and the amount depends on your workload (random
> > > reads especially).  Some initial experimental results are posted to that
> > > JIRA.  A second advantage is that since the JVM hosting the HDFS client
> > > is doing the reading, the O/S will satisfy future disk requests from the
> > > cache, which isn't really possible when you read over the network (even
> > > to another JVM on the same host).
> > >
> > > There are several real disadvantages, the largest of which include 1) it
> > > adds a new datapath, and 2) bypasses various security and auditing
> > > features of HDFS.  I would certainly like to think through a more clean
> > > interface for achieving this goal, especially since reading local data
> > > should be the common case.  Any thoughts you might have would be
> > > appreciated.
> > >
> > > Thanks,
> > > George
> > >
> > > Jun Rao wrote:
> > > > Hi,
> > > >
> > > > Today, HDFS always reads through a socket even when the data is local to
> > > > the client. This adds a lot of overhead, especially for warm reads. It
> > > > should be possible for a dfs client to test if a block to be read is local
> > > > and if so, bypass socket and read through local FS api directly. This
> > > > should improve random access performance significantly (e.g., for HBase).
> > > > Has this been considered in HDFS? Thanks,
> > > >
> > > > Jun

Re: short-circuiting HDFS reads

Posted by George Porter <Ge...@Sun.COM>.
Hi Jun,

Thanks for the pointer to clear the disk cache, as well as the 
suggestion for creating a DFS Client cache layer.  As for the double 
buffering overhead, I think that there is not going to be a large 
benefit to buffering in the DataNode, since the DataNode itself does not 
use the data in the buffer (it just forwards that data to HDFS 
clients).  With the ability to perform zero-copy I/O operations, it 
probably shouldn't buffer any data at all, since it could just 
sendfile() the data directly from the disk to the network client via 
DMA, rather than copying it from disk into its memory space, then from 
memory to the socket.  The downside of a DFS client cache is that it 
would need to be kept consistent, which would probably add a lot of 
complexity to the client.  It is an interesting idea, though, and I 
think we should keep thinking about it.
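
In Java, that zero-copy path is FileChannel.transferTo(), which the JVM
implements with sendfile() on Linux. A minimal sketch of a datanode-style
sender built on it (a sketch of the idea only, not the actual DataNode
code):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class ZeroCopySend {
      /** Stream a block file to a connected client without copying it
          through user-space buffers; the kernel moves disk pages straight
          to the socket. */
      public static void sendBlock(File blockFile, SocketChannel client)
          throws IOException {
        FileInputStream fis = new FileInputStream(blockFile);
        try {
          FileChannel ch = fis.getChannel();
          long pos = 0, len = ch.size();
          while (pos < len) {
            // transferTo maps to sendfile(2) on Linux; on a blocking
            // channel it returns the bytes actually handed to the socket.
            pos += ch.transferTo(pos, len - pos, client);
          }
        } finally {
          fis.close();
        }
      }
    }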

Thanks,
George

Jun Rao wrote:
>
> Hi, George,
>
> I read the results in your JIRA. Very encouraging. It would be useful 
> to test the improvement on both cold and warm data (warm data likely 
> has larger improvment). There is a simple way to clear the file cache 
> on Linux (http://www.linuxinsight.com/proc_sys_vm_drop_caches.html).
>
> An alternative approach is to build an in-memory caching layer on top 
> of a DFS Client. The advantages are (1) less security issues; (2) 
> probably even better performance since checksum can be avoided once 
> the data is cached in memory; (3) the caching layer can be used 
> anywhere, no just nodes owning a block locally. The disadvantage is 
> that data is buffered twice in memory: once in the caching layer and 
> once in the OS file cache. One can probably limit the OS file cache 
> size (not sure if there is an easy way in Linux). What's your thought 
> on this?
>
> Jun
>
> George.Porter@Sun.COM wrote on 01/08/2009 10:13:25 AM:
>
> > Hi Jun,
> >
> > The earlier responses to your email reference the JIRA that I opened
> > about this issue.  Short-circuiting the primary HDFS datapath does
> > improve throughput, and the amount depends on your workload (random
> > reads especially).  Some initial experimental results are posted to that
> > JIRA.  A second advantage is that since the JVM hosting the HDFS client
> > is doing the reading, the O/S will satisfy future disk requests from the
> > cache, which isn't really possible when you read over the network (even
> > to another JVM on the same host).
> >
> > There are several real disadvantages, the largest of which include 1) it
> > adds a new datapath, and 2) bypasses various security and auditing
> > features of HDFS.  I would certainly like to think through a more clean
> > interface for achieving this goal, especially since reading local data
> > should be the common case.  Any thoughts you might have would be
> > appreciated.
> >
> > Thanks,
> > George
> >
> > Jun Rao wrote:
> > > Hi,
> > >
> > > Today, HDFS always reads through a socket even when the data is local to
> > > the client. This adds a lot of overhead, especially for warm reads. It
> > > should be possible for a dfs client to test if a block to be read is local
> > > and if so, bypass socket and read through local FS api directly. This
> > > should improve random access performance significantly (e.g., for HBase).
> > > Has this been considered in HDFS? Thanks,
> > >
> > > Jun
>

Re: short-circuiting HDFS reads

Posted by Jun Rao <ju...@almaden.ibm.com>.
Hi, George,

I read the results in your JIRA. Very encouraging. It would be useful to
test the improvement on both cold and warm data (warm data likely shows a
larger improvement). There is a simple way to clear the file cache on Linux
(http://www.linuxinsight.com/proc_sys_vm_drop_caches.html).

An alternative approach is to build an in-memory caching layer on top of a
DFS client. The advantages are (1) fewer security issues; (2) probably even
better performance, since the checksum can be avoided once the data is
cached in memory; (3) the caching layer can be used anywhere, not just on
nodes owning a block locally. The disadvantage is that data is buffered
twice in memory: once in the caching layer and once in the OS file cache.
One can probably limit the OS file cache size (not sure if there is an easy
way in Linux). What are your thoughts on this?
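
To sketch one way such a layer could look (the class, chunk size, and
eviction bound below are illustrative, not an existing Hadoop API): cache
fixed-size chunks in an access-ordered LRU map, and read through the normal
checksummed path only on a miss.

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CachingReader {
      private static final int CHUNK = 64 * 1024;   // cache granularity
      private static final int MAX_CHUNKS = 4096;   // ~256 MB ceiling

      private final FileSystem fs;

      // Access-ordered map: the least recently used chunk is evicted first.
      private final Map<String, byte[]> cache =
          new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
              return size() > MAX_CHUNKS;
            }
          };

      public CachingReader(FileSystem fs) {
        this.fs = fs;
      }

      /** Return the CHUNK-aligned chunk containing 'offset', going to
          HDFS (checksummed) only on a miss. */
      public synchronized byte[] readChunk(Path file, long offset)
          throws IOException {
        long base = offset - (offset % CHUNK);
        String key = file + "@" + base;
        byte[] chunk = cache.get(key);
        if (chunk == null) {
          chunk = new byte[CHUNK];  // assumes a full chunk remains before EOF
          FSDataInputStream in = fs.open(file);
          try {
            in.readFully(base, chunk);
          } finally {
            in.close();
          }
          cache.put(key, chunk);
        }
        return chunk;
      }
    }

Since HDFS files are write-once, a cached chunk can never go stale, so no
invalidation protocol is needed.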

Jun

George.Porter@Sun.COM wrote on 01/08/2009 10:13:25 AM:

> Hi Jun,
>
> The earlier responses to your email reference the JIRA that I opened
> about this issue.  Short-circuiting the primary HDFS datapath does
> improve throughput, and the amount depends on your workload (random
> reads especially).  Some initial experimental results are posted to that
> JIRA.  A second advantage is that since the JVM hosting the HDFS client
> is doing the reading, the O/S will satisfy future disk requests from the
> cache, which isn't really possible when you read over the network (even
> to another JVM on the same host).
>
> There are several real disadvantages, the largest of which include 1) it
> adds a new datapath, and 2) bypasses various security and auditing
> features of HDFS.  I would certainly like to think through a more clean
> interface for achieving this goal, especially since reading local data
> should be the common case.  Any thoughts you might have would be
> appreciated.
>
> Thanks,
> George
>
> Jun Rao wrote:
> > Hi,
> >
> > Today, HDFS always reads through a socket even when the data is local to
> > the client. This adds a lot of overhead, especially for warm reads. It
> > should be possible for a dfs client to test if a block to be read is local
> > and if so, bypass socket and read through local FS api directly. This
> > should improve random access performance significantly (e.g., for HBase).
> > Has this been considered in HDFS? Thanks,
> >
> > Jun
> >
> >

Re: short-circuiting HDFS reads

Posted by George Porter <Ge...@Sun.COM>.
Hi Jun,

The earlier responses to your email reference the JIRA that I opened 
about this issue.  Short-circuiting the primary HDFS datapath does 
improve throughput, and the amount depends on your workload (random 
reads especially).  Some initial experimental results are posted to that 
JIRA.  A second advantage is that since the JVM hosting the HDFS client 
is doing the reading, the O/S will satisfy future disk requests from the 
cache, which isn't really possible when you read over the network (even 
to another JVM on the same host).

There are several real disadvantages, the largest of which are that 1) it
adds a new datapath, and 2) it bypasses various security and auditing
features of HDFS.  I would certainly like to think through a cleaner
interface for achieving this goal, especially since reading local data
should be the common case.  Any thoughts you might have would be
appreciated.

Thanks,
George

Jun Rao wrote:
> Hi,
>
> Today, HDFS always reads through a socket even when the data is local to
> the client. This adds a lot of overhead, especially for warm reads. It
> should be possible for a dfs client to test if a block to be read is local
> and if so, bypass socket and read through local FS api directly. This
> should improve random access performance significantly (e.g., for HBase).
> Has this been considered in HDFS? Thanks,
>
> Jun
>
>   

Re: short-circuiting HDFS reads

Posted by Doug Cutting <cu...@apache.org>.
Please see https://issues.apache.org/jira/browse/HADOOP-4801.

Doug

Jun Rao wrote:
> 
> Hi,
> 
> Today, HDFS always reads through a socket even when the data is local to
> the client. This adds a lot of overhead, especially for warm reads. It
> should be possible for a dfs client to test if a block to be read is local
> and if so, bypass socket and read through local FS api directly. This
> should improve random access performance significantly (e.g., for HBase).
> Has this been considered in HDFS? Thanks,
> 
> Jun

Re: short-circuiting HDFS reads

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
https://issues.apache.org/jira/browse/HADOOP-4801

Jun Rao wrote:
> 
> Hi,
> 
> Today, HDFS always reads through a socket even when the data is local to
> the client. This adds a lot of overhead, especially for warm reads. It
> should be possible for a dfs client to test if a block to be read is local
> and if so, bypass socket and read through local FS api directly. This
> should improve random access performance significantly (e.g., for HBase).
> Has this been considered in HDFS? Thanks,
> 
> Jun