Posted to common-user@hadoop.apache.org by Patrick Donnelly <ba...@batbytes.com> on 2010/06/11 19:05:42 UTC

Caching in HDFS C API Client

Hi List,

I need to explain a higher-than-expected throughput (bandwidth) for an
HDFS C API client. Specifically, the client is getting bandwidth
higher than its link rate :). The client first writes a 512 MB
file and then reads the entire file back. The file read is what's
exceeding the link rate. I assume this is a
consequence of caching? Is this done by HDFS or by Linux?

Thanks for any help,

-- 
- Patrick Donnelly

Re: Caching in HDFS C API Client

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
Nice, thanks Brian!

On Jun 14, 2010, at 7:39 AM, Brian Bockelman wrote:

> Hey Owen, all,
>
> I find this one handy if you have root access:
>
> http://linux-mm.org/Drop_Caches
>
> echo 3 > /proc/sys/vm/drop_caches
>
> Drops the pagecache, dentries, and inodes.  Without this, you can
> still get caching effects doing the normal "read and write large
> files" if the Linux pagecache outsmarts you (and I don't know about
> you, but it often outsmarts me...).
>
> Brian
>
> On Jun 14, 2010, at 9:35 AM, Owen O'Malley wrote:
>
>> Indeed. On the terasort benchmark, I had to run intermediate jobs
>> that were larger than RAM on the cluster to ensure that the data
>> was not coming from the file cache.
>>
>> -- Owen
>


Re: Caching in HDFS C API Client

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Owen, all,

I find this one handy if you have root access:

http://linux-mm.org/Drop_Caches

echo 3 > /proc/sys/vm/drop_caches

Drops the pagecache, dentries, and inodes.  Without this, you can still get caching effects doing the normal "read and write large files" if the Linux pagecache outsmarts you (and I don't know about you, but it often outsmarts me...).

Brian

On Jun 14, 2010, at 9:35 AM, Owen O'Malley wrote:

> Indeed. On the terasort benchmark, I had to run intermediate jobs that
> were larger than RAM on the cluster to ensure that the data was not
> coming from the file cache.
> 
> -- Owen


Re: Caching in HDFS C API Client

Posted by Owen O'Malley <om...@apache.org>.
Indeed. On the terasort benchmark, I had to run intermediate jobs that
were larger than RAM on the cluster to ensure that the data was not
coming from the file cache.

-- Owen

Re: Caching in HDFS C API Client

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
I'd bet on the Linux file cache. Assuming you wrote the file with the
default replication factor of 3, there is one replica on the local
filesystem, which is what you are reading...

Try writing multiple GBs of data and randomly reading large files to
blow out your file cache?

Arun

On Jun 11, 2010, at 10:05 AM, Patrick Donnelly wrote:

> Hi List,
>
> I need to explain a higher-than-expected throughput (bandwidth) for an
> HDFS C API client. Specifically, the client is getting bandwidth
> higher than its link rate :). The client first writes a 512 MB
> file and then reads the entire file back. The file read is what's
> exceeding the link rate. I assume this is a
> consequence of caching? Is this done by HDFS or by Linux?
>
> Thanks for any help,
>
> -- 
> - Patrick Donnelly