Posted to common-user@hadoop.apache.org by Rita <rm...@gmail.com> on 2012/01/15 02:30:10 UTC

hadoop filesystem cache

After reading this article,
http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was
wondering if there is a filesystem cache for HDFS. For example, if a large
file (10 gigabytes) keeps getting accessed on the cluster, instead of
fetching it from the network every time, why not store the contents of the
file locally on the client itself? A client-side configuration could look
like this:



<property>
  <name>dfs.client.cachedirectory</name>
  <value>/var/cache/hdfs</value>
</property>


<property>
  <name>dfs.client.cachesize</name>
  <description>in megabytes</description>
  <value>100000</value>
</property>
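
Roughly, I picture the client checking that directory before going over the
network. A minimal sketch of the idea in Java (the property and the class
here are hypothetical; nothing like this exists in HDFS today, and eviction
against dfs.client.cachesize is left out):

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalHdfsCache {
  private final FileSystem fs;
  private final File cacheDir;

  public LocalHdfsCache(Configuration conf) throws Exception {
    this.fs = FileSystem.get(conf);
    // "dfs.client.cachedirectory" is the proposed property, not a real one
    this.cacheDir = new File(
        conf.get("dfs.client.cachedirectory", "/var/cache/hdfs"));
  }

  // Return a local copy of the file, pulling it over the network only
  // on a cache miss.
  public File open(Path hdfsPath) throws Exception {
    File local = new File(cacheDir, hdfsPath.getName());
    if (!local.exists()) {
      fs.copyToLocalFile(hdfsPath, new Path(local.getAbsolutePath()));
    }
    return local; // later reads are served by the local fs / buffer cache
  }
}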


Any thoughts on a feature like this?


-- 
--- Get your facts first, then you can distort them as you please.--

Re: hadoop filesystem cache

Posted by Rita <rm...@gmail.com>.
My intention isn't to make it a mandatory feature, just an option. Keeping
data locally on a filesystem as a kind of Lx cache is far better than
fetching it from the network, and a read served from the fs buffer cache is
much cheaper than an RPC call.
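
That difference is easy to see with a crude timing test; here is a sketch
(the paths are placeholders, and the local file is assumed to be a copy of
the same data):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadTiming {
  // Drain a stream and return the elapsed time in nanoseconds.
  static long drain(InputStream in) throws Exception {
    byte[] buf = new byte[65536];
    long start = System.nanoTime();
    while (in.read(buf) != -1) { /* discard */ }
    in.close();
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // DFS read: namenode/datanode RPCs plus the network transfer
    long remote = drain(fs.open(new Path("/data/bigfile")));
    // local read: likely served straight from the OS buffer cache
    long local = drain(new FileInputStream("/var/cache/hdfs/bigfile"));
    System.out.println("hdfs ms: " + remote / 1000000
        + ", local ms: " + local / 1000000);
  }
}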

On Mon, Jan 16, 2012 at 1:07 PM, Edward Capriolo <ed...@gmail.com> wrote:

> The challenge of this design is that accessing the same data over and
> over again is the uncommon use case for Hadoop. Hadoop's bread and butter
> is streaming through large datasets that do not fit in memory. Also, your
> shuffle-sort-spill is going to play havoc on any filesystem-based cache.
> The distributed cache roughly fits this role, except that it does not
> persist after a job.
>
> Replicating content to N nodes is also not a hard problem to tackle (you
> can hack up a content delivery system with ssh+rsync) and get similar
> results. The approach often taken has been to keep data that is accessed
> repeatedly and fits in memory in some other system
> (hbase/cassandra/mysql/whatever).
>
> Edward



-- 
--- Get your facts first, then you can distort them as you please.--

Re: hadoop filesystem cache

Posted by Edward Capriolo <ed...@gmail.com>.
The challenge of this design is that accessing the same data over and
over again is the uncommon use case for Hadoop. Hadoop's bread and butter
is streaming through large datasets that do not fit in memory. Also, your
shuffle-sort-spill is going to play havoc on any filesystem-based cache.
The distributed cache roughly fits this role, except that it does not
persist after a job.

Replicating content to N nodes is also not a hard problem to tackle (you
can hack up a content delivery system with ssh+rsync) and get similar
results. The approach often taken has been to keep data that is accessed
repeatedly and fits in memory in some other system
(hbase/cassandra/mysql/whatever).
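
For instance, a bare-bones rsync fan-out can be as simple as this sketch
(the hostnames and path are made up; it assumes passwordless ssh to each
node):

import java.util.Arrays;
import java.util.List;

public class RsyncFanout {
  public static void main(String[] args) throws Exception {
    // made-up node list and file; push one file to N nodes with rsync
    List<String> nodes = Arrays.asList("node1", "node2", "node3");
    String file = "/data/bigfile";
    for (String node : nodes) {
      Process p = new ProcessBuilder("rsync", "-az", file, node + ":" + file)
          .inheritIO()
          .start();
      if (p.waitFor() != 0) {
        System.err.println("rsync to " + node + " failed");
      }
    }
  }
}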

Edward


On Mon, Jan 16, 2012 at 11:33 AM, Rita <rm...@gmail.com> wrote:

> Thanks. I believe this is a good feature to have for clients, especially
> if you are reading the same large file over and over.

Re: hadoop filesystem cache

Posted by Rita <rm...@gmail.com>.
Thanks. I believe this is a good feature to have for clients, especially if
you are reading the same large file over and over.


On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon <to...@cloudera.com> wrote:

> There is some work being done in this area by some folks over at UC
> Berkeley's AMP Lab in coordination with Facebook. I don't believe it
> has been published quite yet, but the title of the project is "PACMan"
> -- I expect it will be published soon.
>
> -Todd



-- 
--- Get your facts first, then you can distort them as you please.--

Re: hadoop filesystem cache

Posted by Todd Lipcon <to...@cloudera.com>.
There is some work being done in this area by some folks over at UC
Berkeley's AMP Lab in coordination with Facebook. I don't believe it
has been published quite yet, but the title of the project is "PACMan"
-- I expect it will be published soon.

-Todd




-- 
Todd Lipcon
Software Engineer, Cloudera

Re: hadoop filesystem cache

Posted by Rita <rm...@gmail.com>.
Yes, something different from that. To my knowledge, the DistributedCache
is only for MapReduce.
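
For instance, the usual pattern ties the cached file to a single MapReduce
job; here is a sketch against the old org.apache.hadoop.filecache API (the
URI is made up):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class DistCacheExample {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(DistCacheExample.class);
    // ship an HDFS file to every task node, for this job only
    DistributedCache.addCacheFile(
        new URI("hdfs://namenode/data/lookup.dat"), job);
    // ... set mapper/reducer and submit; the local copies go away once
    // the job completes, which is why it is not a client-side cache.
  }
}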

On Sat, Jan 14, 2012 at 8:33 PM, Prashant Kommireddi <pr...@gmail.com> wrote:

> You mean something different from the DistributedCache?
>
> Sent from my iPhone



-- 
--- Get your facts first, then you can distort them as you please.--

Re: hadoop filesystem cache

Posted by Prashant Kommireddi <pr...@gmail.com>.
You mean something different from the DistributedCache?

Sent from my iPhone
