Posted to common-user@hadoop.apache.org by Demai Ni <ni...@gmail.com> on 2015/08/11 19:53:32 UTC

hadoop/hdfs cache question, do client processes share cache?

hi, folks,

I have a quick question about how HDFS handles caching. In this lab
experiment, I have a 4-node Hadoop cluster (2.x), and each node has fairly
large memory (96GB). There is a single 256MB HDFS file, which also fits
in one HDFS block. The local filesystem is Linux.

Now, from one of the DataNodes, I started 10 Hadoop client processes that
repeatedly read the above file, on the assumption that HDFS will cache
the 256MB in memory, so that (after the 1st read) READs involve no disk I/O
anymore.

My question is: *how many COPIES of the 256MB will be in the memory of this
DataNode? 10 or 1?*

What if the 10 client processes are located on a 5th Linux box,
independent of the cluster? Will we have 10 copies of the 256MB, or just
1?

Many thanks. Appreciate your help on this.

Demai

RE: hadoop/hdfs cache question, do client processes share cache?

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi Demai,
 centralized cache required 'explicit' configuration, so by default, there is no HDFS-managed cache?
YES, only explicit centralized caching is supported. The amount of data stored in HDFS is too large, and with multiple clients accessing a DataNode the cache hit ratio would be very low, so there is no point in an implicit HDFS-managed cache.

Will the cache occur at local filesystem level like Linux?
Please refer to dfs.datanode.drop.cache.behind.reads & dfs.datanode.drop.cache.behind.writes in http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml;
the descriptions there give more details about the OS buffer cache.
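
For reference, those two knobs live in hdfs-site.xml on the DataNodes. The values below are only an illustrative sketch, not a recommendation (both properties default to off, i.e. the OS page cache is left alone):

```xml
<!-- hdfs-site.xml (DataNode side); illustrative values only -->
<property>
  <name>dfs.datanode.drop.cache.behind.reads</name>
  <!-- drop served data from the OS buffer cache after delivering it to clients -->
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.drop.cache.behind.writes</name>
  <!-- drop written data from the OS buffer cache after it is flushed to disk -->
  <value>true</value>
</property>
```

These are useful when the working set is far larger than RAM, so cached pages would be evicted before any re-read anyway; for a hot 256MB file like the one in this thread, leaving them off lets the Linux page cache serve repeat reads.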

the client has 10 processes repeatedly read the same HDFS file. will HDFS client API be able to cache the file content at Client side?
Each individual process will have its own HDFS Client so this caching needs to be done at the application layer.

or every READ will have to move the whole file through network, and no sharing  between processes?
Yes, every READ will have to move the whole file through the network; there is no sharing between multiple clients/processes on a given node.
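
A minimal sketch of the application-layer caching Naga suggests, for repeat reads within one process. Here fetch_from_hdfs is a hypothetical stand-in for a real HDFS read (e.g. via a client library); it just counts calls so the caching behaviour is observable:

```python
# Application-layer cache sketch: repeated reads of the same path hit
# process memory instead of going back over the network.
from functools import lru_cache

call_count = 0  # how many times we actually "went to the network"

def fetch_from_hdfs(path: str) -> bytes:
    """Hypothetical stand-in for pulling a file's bytes from a DataNode."""
    global call_count
    call_count += 1
    return b"256MB of data for " + path.encode()

@lru_cache(maxsize=8)  # keep the most recently read files in process memory
def read_cached(path: str) -> bytes:
    return fetch_from_hdfs(path)

# 10 repeated reads of the same file -> only 1 simulated network fetch.
for _ in range(10):
    data = read_cached("/lab/block256mb.dat")

print(call_count)  # 1
```

Note this cache lives inside one process; 10 separate OS processes would each keep their own copy, exactly as Naga describes, unless the application shares the data via shared memory or a local cache service.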

+ Naga
________________________________
From: Demai Ni [nidmgg@gmail.com]
Sent: Wednesday, August 12, 2015 02:05
To: user@hadoop.apache.org
Subject: Re: hadoop/hdfs cache question, do client processes share cache?

Ritesh,

many thanks for your response. I just read through the centralized Cache document. Thanks for the pointer. A couple follow-up questions.

First, the centralized cache required 'explicit' configuration, so by default, there is no HDFS-managed cache? Will the cache occur at local filesystem level like Linux?

The 2nd question. The centralized Cache is among the DN of HDFS. Let's say the client is a stand-alone Linux(not part of the cluster), which connects to the HDFS cluster with centralized cache configured. So on HDFS cluster, the file is cached. In the same scenario, the client has 10 processes repeatedly read the same HDFS file. will HDFS client API be able to cache the file content at Client side? or every READ will have to move the whole file through network, and no sharing  between processes?

Demai


On Tue, Aug 11, 2015 at 12:58 PM, Ritesh Kumar Singh <ri...@gmail.com> wrote:
Let's assume that HDFS maintains 3 replicas of the 256MB block; then each of these 3 datanodes will have only one copy of the block in its respective memory cache, thus avoiding repeated I/O reads. This goes with the centralized cache management policy of HDFS, which also gives you the option to pin only 2 of these 3 replicas in cache and save the remaining 256MB of cache space. Here's a link<https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html> on the same.

Hope that helps.

Ritesh


Re: hadoop/hdfs cache question, do client processes share cache?

Posted by Demai Ni <ni...@gmail.com>.
Ritesh,

Many thanks for your response. I just read through the centralized cache
document; thanks for the pointer. A couple of follow-up questions.

First, the centralized cache requires 'explicit' configuration, so by
default there is no HDFS-managed cache? Will caching instead occur at the
local filesystem level, e.g. in Linux?

The 2nd question: the centralized cache lives on the DataNodes of HDFS.
Let's say the client is a stand-alone Linux box (not part of the cluster)
that connects to an HDFS cluster with the centralized cache configured, so
on the HDFS cluster the file is cached. In that scenario, the client has 10
processes repeatedly reading the same HDFS file. Will the HDFS client API
be able to cache the file content on the client side? Or will every READ
have to move the whole file through the network, with no sharing between
processes?

Demai


On Tue, Aug 11, 2015 at 12:58 PM, Ritesh Kumar Singh <
riteshoneinamillion@gmail.com> wrote:

> Let's assume that hdfs maintains 3 replicas of the 256MB block, then all
> of these 3 datanodes will have only one copy of the block in their
> respective mem cache and thus avoiding the repeated i/o reads. This goes
> with the centralized cache management policy of hdfs that also gives you an
> option to pin 2 of these 3 blocks in cache and save the remaining 256MB of
> cache space. Here's a link
> <https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html> on
> the same.
>
> Hope that helps.
>
> Ritesh
>

Re: hadoop/hdfs cache question, do client processes share cache?

Posted by Ritesh Kumar Singh <ri...@gmail.com>.
Let's assume that HDFS maintains 3 replicas of the 256MB block; then each of
these 3 datanodes will have only one copy of the block in its respective
memory cache, thus avoiding repeated I/O reads. This goes with the
centralized cache management policy of HDFS, which also gives you the
option to pin only 2 of these 3 replicas in cache and save the remaining
256MB of cache space. Here's a link
<https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html>
on the same.

Hope that helps.

Ritesh
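
The pinning Ritesh describes is driven by the hdfs cacheadmin CLI: create a cache pool, then add a directive for the path, with -replication controlling how many cached replicas to keep. A sketch, with made-up pool and path names:

```shell
# Create a cache pool, then pin the file into DataNode memory.
# "labPool" and the path are example names only.
hdfs cacheadmin -addPool labPool
hdfs cacheadmin -addDirective -path /lab/block256mb.dat -pool labPool -replication 2

# Inspect what is cached and how many bytes are actually resident:
hdfs cacheadmin -listDirectives -stats
```

Note that DataNodes cache nothing until dfs.datanode.max.locked.memory is set in hdfs-site.xml (and the DataNode user's memlock ulimit allows at least that much).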
