Posted to common-user@hadoop.apache.org by Brian Bockelman <bb...@math.unl.edu> on 2009/04/13 03:53:30 UTC

Interesting Hadoop/FUSE-DFS access patterns

Hey all,

I was doing some research on I/O patterns of our applications, and I
noticed the attached pattern.  In case the mail server strips out
attachments, I also uploaded it:

http://t2.unl.edu/store/Hadoop_64KB_ra.png
http://t2.unl.edu/store/Hadoop_1024KB_ra.png

This was taken using the FUSE mounts of Hadoop; the first one was with
a 64KB read-ahead and the second with a 1MB read-ahead.  The reads were
taken from a 2GB file at randomly 'seek'ed offsets within the file.
Each read size was tried 20 times, advancing the read size in 4KB
increments.  Each blue dot is the read time of one experiment; the red
dot is the median read time for that read size.  The graphs show the
absolute read time.
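
(For reference, the timing loop is roughly the sketch below.  This is
not my actual harness - the FUSE mount path and file name here are made
up - but it shows the seek-then-read pattern being timed.)

import java.io.RandomAccessFile;
import java.util.Random;

public class FuseReadBench {
  public static void main(String[] args) throws Exception {
    // Hypothetical FUSE mount path; substitute the real 2GB test file.
    RandomAccessFile raf = new RandomAccessFile("/mnt/hadoop/testfile", "r");
    long fileLen = raf.length();
    Random rand = new Random();
    int iters = 20;
    for (int size = 4096; size <= 1024 * 1024; size += 4096) {
      byte[] buf = new byte[size];
      long total = 0;
      for (int i = 0; i < iters; i++) {
        long pos = (long) (rand.nextDouble() * (fileLen - size));
        long st = System.nanoTime();
        raf.seek(pos);        // random seek
        raf.readFully(buf);   // timed read of 'size' bytes
        total += System.nanoTime() - st;
      }
      System.out.println(size + "\t" + (total / iters / 1e6) + " ms");
    }
    raf.close();
  }
}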

There's very interesting behavior - it seems that there is a change in
behavior around reads of size 800KB.  The time for the reads goes down
significantly when you issue *larger* reads.  I thought this was just
an artifact of the 64KB read-ahead I set in FUSE, so I upped the
read-ahead significantly, to 1MB.  In this case, the difference between
the small read sizes and the large read sizes is *very* pronounced.  If
it were an artifact of FUSE, I'd expect the point where the change
occurs to be a function of the read-ahead size.

Anyone out there who knows the code have any ideas?  What could I be  
doing wrong?

Brian


Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by jason hadoop <ja...@gmail.com>.
Oh, I agree - caching is wonderful when you plan to re-use the data in
the near term.

Solaris has an interesting feature: if the application writes enough
contiguous data in a short time window (tunable in later Nevada
builds), Solaris bypasses the buffer cache for the writes.

For reasons I have never had time to look into, there is a significant
impact on overall system responsiveness when there is significant cache
store activity going on, and there are patterns that work in the
general case but fail in others.  Take the tar example from earlier: my
theory is that the blocks written to the tar file take priority over
the read-ahead, so the next file to be read for the tar archive is not
pre-cached.  Using the cache flush on the tar file allows the
read-aheads to go ahead.
The other nice thing that happens is that the size of the dirty pool
tends not to grow to the point that the periodic sync operations pause
the system.

We had an interesting problem with Solaris under VMware some years
back, where we were running IMAP servers as part of JES for testing a
middleware mail application.  The IMAP writes would accumulate in the
buffer cache, and the middleware performance was great; then the
must-flush-now threshold would be crossed, it would take 2 minutes to
flush all of the accumulated writes out, and the middleware app would
block waiting on that to finish.  In the end, as a quick hack, we ran
*while true; do sync; sleep 30; done*, which prevented the stalls by
keeping the flush time down.  Otherwise the flushes totally fill the
disk queues and cause starvation for other apps.

I believe this is part of the block report stall problem in 4584.

On Tue, Apr 14, 2009 at 4:52 AM, Brian Bockelman <bb...@cse.unl.edu>wrote:

> Hey Jason,
>
> Thanks, I'll keep this on hand as I do more tests.  I now have a C, Java,
> and python version of my testing program ;)
>
> However, I particularly *like* the fact that there's caching going on -
> it'll help out our application immensely, I think.  I'll be looking at the
> performance both with and without the cache.
>
> Brian
>
>
> On Apr 14, 2009, at 12:01 AM, jason hadoop wrote:
>
>  The following very simple program will tell the VM to drop the pages being
>> cached for a file. I tend to spin this in a for loop when making large tar
>> files, or otherwise working with large files, and the system performance
>> really smooths out.
>> Since it uses open(path) it will churn through the inode cache and
>> directories.
>> Something like this might actually significantly speed up HDFS by running
>> over the blocks on the datanodes, for busy clusters.
>>
>>
>> #define _XOPEN_SOURCE 600
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>>
>> /** Simple program to dump buffered data for specific files from the
>> buffer
>> cache. Copyright Jason Venner 2009, License GPL*/
>>
>> int main( int argc, char** argv )
>> {
>>  int failCount = 0;
>>  int i;
>>  for( i = 1; i < argc; i++ ) {
>>   char* file = argv[i];
>>   int fd = open( file, O_RDONLY|O_LARGEFILE );
>>   if (fd == -1) {
>>     perror( file );
>>     failCount++;
>>     continue;
>>   }
>>   /* posix_fadvise returns the error number directly (it does not set errno) */
>>   int rc = posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED );
>>   if (rc != 0) {
>>     fprintf( stderr, "Failed to flush cache for %s: %s\n", file, strerror( rc ) );
>>     failCount++;
>>   }
>>   close(fd);
>>  }
>>  exit( failCount );
>> }
>>
>>
>> On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey <scott@richrelevance.com
>> >wrote:
>>
>>
>>> On 4/12/09 9:41 PM, "Brian Bockelman" <bb...@cse.unl.edu> wrote:
>>>
>>>  Ok, here's something perhaps even more strange.  I removed the "seek"
>>>> part out of my timings, so I was only timing the "read" instead of the
>>>> "seek + read" as in the first case.  I also turned the read-ahead down
>>>> to 1-byte (aka, off).
>>>>
>>>> The jump *always* occurs at 128KB, exactly.
>>>>
>>>
>>> Some random ideas:
>>>
>>> I have no idea how FUSE interops with the Linux block layer, but 128K
>>> happens to be the default 'readahead' value for block devices, which may
>>> just be a coincidence.
>>>
>>> For a disk 'sda', you check and set the value (in 512 byte blocks) with:
>>>
>>> /sbin/blockdev --getra /dev/sda
>>> /sbin/blockdev --setra [num blocks] /dev/sda
>>>
>>>
>>> I know on my file system tests, the OS readahead is not activated until a
>>> series of sequential reads go through the block device, so truly random
>>> access is not affected by this.  I've set it to 128MB and random iops
>>> does
>>> not change on a ext3 or xfs file system.  If this applies to FUSE too,
>>> there
>>> may be reasons that this behavior differs.
>>> Furthermore, one would not expect it to be slower to randomly read 4k
>>> than
>>> randomly read up to the readahead size itself even if it did.
>>>
>>> I also have no idea how much of the OS device queue and block device
>>> scheduler is involved with FUSE.  If those are involved, then there's a
>>> bunch of stuff to tinker with there as well.
>>>
>>> Lastly, an FYI if you don't already know the following.  If the OS is
>>> caching pages, there is a way to flush these in Linux to evict the cache.
>>> See /proc/sys/vm/drop_caches .
>>>
>>>
>>>
>>>
>>>> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
>>>> sequential reads, not random reads - but it seems that it's one bug-
>>>> fix away from being a good general-purpose system.  Heck if I can find
>>>> what's causing the issues though...
>>>>
>>>> Brian
>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>> Alpha Chapters of my book on Hadoop are available
>> http://www.apress.com/book/view/9781430219422
>>
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Jason,

Thanks, I'll keep this on hand as I do more tests.  I now have a C,  
Java, and python version of my testing program ;)

However, I particularly *like* the fact that there's caching going on  
- it'll help out our application immensely, I think.  I'll be looking  
at the performance both with and without the cache.

Brian

On Apr 14, 2009, at 12:01 AM, jason hadoop wrote:

> The following very simple program will tell the VM to drop the pages  
> being
> cached for a file. I tend to spin this in a for loop when making  
> large tar
> files, or otherwise working with large files, and the system  
> performance
> really smooths out.
> Since it uses open(path) it will churn through the inode cache and
> directories.
> Something like this might actually significantly speed up HDFS by  
> running
> over the blocks on the datanodes, for busy clusters.
>
>
> #define _XOPEN_SOURCE 600
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> /** Simple program to dump buffered data for specific files from the  
> buffer
> cache. Copyright Jason Venner 2009, License GPL*/
>
> int main( int argc, char** argv )
> {
>  int failCount = 0;
>  int i;
>  for( i = 1; i < argc; i++ ) {
>    char* file = argv[i];
>    int fd = open( file, O_RDONLY|O_LARGEFILE );
>    if (fd == -1) {
>      perror( file );
>      failCount++;
>      continue;
>    }
>    /* posix_fadvise returns the error number directly (it does not set errno) */
>    int rc = posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED );
>    if (rc != 0) {
>      fprintf( stderr, "Failed to flush cache for %s: %s\n", file, strerror( rc ) );
>      failCount++;
>    }
>    close(fd);
>  }
>  exit( failCount );
> }
>
>
> On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey  
> <sc...@richrelevance.com>wrote:
>
>>
>> On 4/12/09 9:41 PM, "Brian Bockelman" <bb...@cse.unl.edu> wrote:
>>
>>> Ok, here's something perhaps even more strange.  I removed the  
>>> "seek"
>>> part out of my timings, so I was only timing the "read" instead of  
>>> the
>>> "seek + read" as in the first case.  I also turned the read-ahead  
>>> down
>>> to 1-byte (aka, off).
>>>
>>> The jump *always* occurs at 128KB, exactly.
>>
>> Some random ideas:
>>
>> I have no idea how FUSE interops with the Linux block layer, but 128K
>> happens to be the default 'readahead' value for block devices,  
>> which may
>> just be a coincidence.
>>
>> For a disk 'sda', you check and set the value (in 512 byte blocks)  
>> with:
>>
>> /sbin/blockdev --getra /dev/sda
>> /sbin/blockdev --setra [num blocks] /dev/sda
>>
>>
>> I know on my file system tests, the OS readahead is not activated  
>> until a
>> series of sequential reads go through the block device, so truly  
>> random
>> access is not affected by this.  I've set it to 128MB and random  
>> iops does
>> not change on a ext3 or xfs file system.  If this applies to FUSE  
>> too,
>> there
>> may be reasons that this behavior differs.
>> Furthermore, one would not expect it to be slower to randomly read  
>> 4k than
>> randomly read up to the readahead size itself even if it did.
>>
>> I also have no idea how much of the OS device queue and block device
>> scheduler is involved with FUSE.  If those are involved, then  
>> there's a
>> bunch of stuff to tinker with there as well.
>>
>> Lastly, an FYI if you don't already know the following.  If the OS is
>> caching pages, there is a way to flush these in Linux to evict the  
>> cache.
>> See /proc/sys/vm/drop_caches .
>>
>>
>>
>>>
>>> I'm a bit befuddled.  I know we say that HDFS is optimized for  
>>> large,
>>> sequential reads, not random reads - but it seems that it's one bug-
>>> fix away from being a good general-purpose system.  Heck if I can  
>>> find
>>> what's causing the issues though...
>>>
>>> Brian
>>>
>>>
>>
>>
>
>
> -- 
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422


Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by jason hadoop <ja...@gmail.com>.
The following very simple program will tell the VM to drop the pages being
cached for a file. I tend to spin this in a for loop when making large tar
files, or otherwise working with large files, and the system performance
really smooths out.
Since it uses open(path) it will churn through the inode cache and
directories.
Something like this might actually significantly speed up HDFS by running
over the blocks on the datanodes, for busy clusters.


#define _XOPEN_SOURCE 600
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/** Simple program to dump buffered data for specific files from the buffer
cache. Copyright Jason Venner 2009, License GPL*/

int main( int argc, char** argv )
{
  int failCount = 0;
  int i;
  for( i = 1; i < argc; i++ ) {
    char* file = argv[i];
    int fd = open( file, O_RDONLY|O_LARGEFILE );
    if (fd == -1) {
      perror( file );
      failCount++;
      continue;
    }
    /* posix_fadvise returns the error number directly (it does not set errno) */
    int rc = posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED );
    if (rc != 0) {
      fprintf( stderr, "Failed to flush cache for %s: %s\n", file, strerror( rc ) );
      failCount++;
    }
    close(fd);
  }
  exit( failCount );
}


On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey <sc...@richrelevance.com>wrote:

>
> On 4/12/09 9:41 PM, "Brian Bockelman" <bb...@cse.unl.edu> wrote:
>
> > Ok, here's something perhaps even more strange.  I removed the "seek"
> > part out of my timings, so I was only timing the "read" instead of the
> > "seek + read" as in the first case.  I also turned the read-ahead down
> > to 1-byte (aka, off).
> >
> > The jump *always* occurs at 128KB, exactly.
>
> Some random ideas:
>
> I have no idea how FUSE interops with the Linux block layer, but 128K
> happens to be the default 'readahead' value for block devices, which may
> just be a coincidence.
>
> For a disk 'sda', you check and set the value (in 512 byte blocks) with:
>
> /sbin/blockdev --getra /dev/sda
> /sbin/blockdev --setra [num blocks] /dev/sda
>
>
> I know on my file system tests, the OS readahead is not activated until a
> series of sequential reads go through the block device, so truly random
> access is not affected by this.  I've set it to 128MB and random iops does
> not change on a ext3 or xfs file system.  If this applies to FUSE too,
> there
> may be reasons that this behavior differs.
> Furthermore, one would not expect it to be slower to randomly read 4k than
> randomly read up to the readahead size itself even if it did.
>
> I also have no idea how much of the OS device queue and block device
> scheduler is involved with FUSE.  If those are involved, then there's a
> bunch of stuff to tinker with there as well.
>
> Lastly, an FYI if you don't already know the following.  If the OS is
> caching pages, there is a way to flush these in Linux to evict the cache.
> See /proc/sys/vm/drop_caches .
>
>
>
> >
> > I'm a bit befuddled.  I know we say that HDFS is optimized for large,
> > sequential reads, not random reads - but it seems that it's one bug-
> > fix away from being a good general-purpose system.  Heck if I can find
> > what's causing the issues though...
> >
> > Brian
> >
> >
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hey Brian,
>
> This is really interesting stuff. I'm curious - have you tried these same
> experiments using the Java API? I'm wondering whether this is FUSE-specific
> or inherent to all HDFS reads. I'll try to reproduce this over here as well.
>

I just tried this on a localhost single-node cluster with the following test
program:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.conf.Configuration;
import java.io.IOException;
import java.net.URI;

public class Test {
  public static void main(String[] args) throws Exception {
    URI uri = new URI("hdfs://localhost:8020/");
    FileSystem fs = FileSystem.get(uri, new Configuration());
    Path path = new Path("/testfile");
    FSDataInputStream dis = fs.open(path);

    for (int size=0; size < 1024*1024; size += 4096) {
      for (int i = 0; i < 100; i++) {
        long st = System.currentTimeMillis();
        byte buf[] = new byte[size];
        dis.read(0, buf, 0, size);
        long et = System.currentTimeMillis();

        System.out.println(String.valueOf(size) + "\t" + String.valueOf(et -
st));
      }
    }
    fs.close();
  }
}


I didn't see the same behavior as you're reporting. Can you give this a try
on your cluster and see if it shows the 128K jump?

-Todd

Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Tom,

Yup, that's one of the things I've been looking at - however, it
doesn't appear to be the likely culprit for why the data access times
are fairly random.  The time the operation took does not seem to be a
function of the number of bytes read, at least in the smaller range.

Brian

On Apr 16, 2009, at 5:17 AM, Tom White wrote:

> Not sure if it will affect your findings, but when you read from an
> FSDataInputStream you should check how many bytes were actually read by
> inspecting the return value, and re-read if it was fewer than you wanted.
> See Hadoop's IOUtils readFully() method.
>
> Tom
>
> On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelman  
> <bb...@cse.unl.edu> wrote:
>>
>> Hey Todd,
>>
>> Been playing more this morning after thinking about it for the  
>> night -- I
>> think the culprit is not the network, but actually the cache.   
>> Here's the
>> output of your script adjusted to do the same calls as I was doing  
>> (you had
>> left out the random I/O part).
>>
>> [brian@red tmp]$ java hdfs_tester
>> Mean value for reads of size 0: 0.0447
>> Mean value for reads of size 16384: 10.4872
>> Mean value for reads of size 32768: 10.82925
>> Mean value for reads of size 49152: 6.2417
>> Mean value for reads of size 65536: 7.0511003
>> Mean value for reads of size 81920: 9.411599
>> Mean value for reads of size 98304: 9.378799
>> Mean value for reads of size 114688: 8.99065
>> Mean value for reads of size 131072: 5.1378503
>> Mean value for reads of size 147456: 6.1324
>> Mean value for reads of size 163840: 17.1187
>> Mean value for reads of size 180224: 6.5492
>> Mean value for reads of size 196608: 8.45695
>> Mean value for reads of size 212992: 7.4292
>> Mean value for reads of size 229376: 10.7843
>> Mean value for reads of size 245760: 9.29095
>> Mean value for reads of size 262144: 6.57865
>>
>> Copy of the script below.
>>
>> So, without the FUSE layer, we don't see much (if any) patterns  
>> here.  The
>> overhead of randomly skipping through the file is higher than the  
>> overhead
>> of reading out the data.
>>
>> Upon further inspection, the biggest factor affecting the FUSE  
>> layer is
>> actually the Linux VFS caching -- if you notice, the bandwidth in  
>> the given
>> graph for larger read sizes is *higher* than 1Gbps, which is the  
>> limit of
>> the network on that particular node.  If I go in the opposite  
>> direction -
>> starting with the largest reads first, then going down to the  
>> smallest
>> reads, the graph entirely smooths out for the small values -  
>> everything is
>> read from the filesystem cache in the client RAM.  Graph attached.
>>
>> So, on the upside, mounting through FUSE gives us the opportunity  
>> to speed
>> up reads for very complex, non-sequential patterns - for free,  
>> thanks to the
>> hardworking Linux kernel.  On the downside, it's incredibly  
>> difficult to
>> come up with simple cases to demonstrate performance for an  
>> application --
>> the cache performance and size depends on how much activity there's  
>> on the
>> client, the previous file system activity that the application did,  
>> and the
>> amount of concurrent activity on the server.  I can give you  
>> results for
>> performance, but it's not going to be the performance you see in  
>> real life.
>>  (Gee, if only file systems were easy...)
>>
>> Ok, sorry for the list noise -- it seems I'm going to have to think  
>> more
>> about this problem before I can come up with something coherent.
>>
>> Brian
>>
>>
>>
>>
>>
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.FileStatus;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.fs.FSDataInputStream;
>> import org.apache.hadoop.conf.Configuration;
>> import java.io.IOException;
>> import java.net.URI;
>> import java.util.Random;
>>
>> public class hdfs_tester {
>>  public static void main(String[] args) throws Exception {
>>   URI uri = new URI("hdfs://hadoop-name:9000/");
>>   FileSystem fs = FileSystem.get(uri, new Configuration());
>>   Path path = new
>> Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/ 
>> Nebraska/LoadTest07_Nebraska_33");
>>   FSDataInputStream dis = fs.open(path);
>>   Random rand = new Random();
>>   FileStatus status = fs.getFileStatus(path);
>>   long file_len = status.getLen();
>>   int iters = 20;
>>   for (int size=0; size < 1024*1024; size += 4*4096) {
>>     long csum = 0;
>>     for (int i = 0; i < iters; i++) {
>>       int pos = rand.nextInt((int)((file_len-size-1)/8))*8;
>>       byte buf[] = new byte[size];
>>       if (pos < 0)
>>         pos = 0;
>>       long st = System.nanoTime();
>>       dis.read(pos, buf, 0, size);
>>       long et = System.nanoTime();
>>       csum += et-st;
>>       //System.out.println(String.valueOf(size) + "\t" +  
>> String.valueOf(pos)
>> + "\t" + String.valueOf(et - st));
>>     }
>>     float csum2 = csum; csum2 /= iters;
>>     System.out.println("Mean value for reads of size " + size + ":  
>> " +
>> (csum2/1000/1000));
>>   }
>>   fs.close();
>>  }
>> }
>>
>>
>> On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote:
>>
>>> On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon <to...@cloudera.com>  
>>> wrote:
>>>
>>>> Hey Brian,
>>>>
>>>> This is really interesting stuff. I'm curious - have you tried  
>>>> these same
>>>> experiments using the Java API? I'm wondering whether this is
>>>> FUSE-specific
>>>> or inherent to all HDFS reads. I'll try to reproduce this over  
>>>> here as
>>>> well.
>>>>
>>>> This smells sort of nagle-related to me... if you get a chance,  
>>>> you may
>>>> want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 *  
>>>> 1024, and
>>>> see if the magic number jumps up to 256KB. If so, I think it  
>>>> should be a
>>>> pretty easy bugfix.
>>>>
>>>
>>> Oops - spoke too fast there... looks like TCP_WINDOW_SIZE isn't  
>>> actually
>>> used for any socket configuration, so I don't think that will make a
>>> difference... still think networking might be the culprit, though.
>>>
>>> -Todd
>>>
>>>
>>>>
>>>> On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman
>>>> <bb...@cse.unl.edu>wrote:
>>>>
>>>>> Ok, here's something perhaps even more strange.  I removed the  
>>>>> "seek"
>>>>> part
>>>>> out of my timings, so I was only timing the "read" instead of  
>>>>> the "seek
>>>>> +
>>>>> read" as in the first case.  I also turned the read-ahead down  
>>>>> to 1-byte
>>>>> (aka, off).
>>>>>
>>>>> The jump *always* occurs at 128KB, exactly.
>>>>>
>>>>> I'm a bit befuddled.  I know we say that HDFS is optimized for  
>>>>> large,
>>>>> sequential reads, not random reads - but it seems that it's one  
>>>>> bug-fix
>>>>> away
>>>>> from being a good general-purpose system.  Heck if I can find  
>>>>> what's
>>>>> causing
>>>>> the issues though...
>>>>>
>>>>> Brian
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>>>>>
>>>>> Hey all,
>>>>>>
>>>>>> I was doing some research on I/O patterns of our applications,  
>>>>>> and I
>>>>>> noticed the attached pattern.  In case if the mail server  
>>>>>> strips out
>>>>>> attachments, I also uploaded it:
>>>>>>
>>>>>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>>>>>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>>>>>
>>>>>> This was taken using the FUSE mounts of Hadoop; the first one  
>>>>>> was with
>>>>>> a
>>>>>> 64KB read-ahead and the second with a 1MB read-ahead.  This was  
>>>>>> taken
>>>>>> from a
>>>>>> 2GB file and randomly 'seek'ed in the file.  This was performed  
>>>>>> 20
>>>>>> times for
>>>>>> each read size, advancing in 4KB increments.  Each blue dot is  
>>>>>> the read
>>>>>> time
>>>>>> of one experiment; the red dot is the median read time for the  
>>>>>> read
>>>>>> size.
>>>>>> The graphs show the absolute read time.
>>>>>>
>>>>>> There's very interesting behavior - it seems that there is a  
>>>>>> change in
>>>>>> behavior around reads of size of 800KB.  The time for the reads  
>>>>>> go down
>>>>>> significantly when you read *larger* files.  I thought this was  
>>>>>> just an
>>>>>> artifact of the 64KB read-ahead I set in FUSE, so I upped the
>>>>>> read-ahead
>>>>>> significantly, to 1MB.  In this case, the difference between  
>>>>>> the the
>>>>>> small
>>>>>> read sizes and large read sizes are *very* pronounced.  If it  
>>>>>> was an
>>>>>> artifact from FUSE, I'd expect the place where the change  
>>>>>> occurred
>>>>>> would be
>>>>>> a function of the readahead-size.
>>>>>>
>>>>>> Anyone out there who knows the code have any ideas?  What could  
>>>>>> I be
>>>>>> doing wrong?
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> <Hadoop_64KB_ra.png>
>>>>>>
>>>>>> <Hadoop_1024KB_ra.png>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>>


Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by Tom White <to...@cloudera.com>.
Not sure if it will affect your findings, but when you read from an
FSDataInputStream you should check how many bytes were actually read by
inspecting the return value, and re-read if it was fewer than you wanted.
See Hadoop's IOUtils readFully() method.
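
Something along these lines - just a sketch of the re-read loop, here
using the positioned read; Hadoop's IOUtils.readFully() does the same
kind of loop for you:

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

public class ReadFullyHelper {
  // Keep issuing positioned reads until 'len' bytes have arrived,
  // since a single read() may return fewer bytes than requested.
  public static void readFully(FSDataInputStream in, long pos,
                               byte[] buf, int off, int len) throws IOException {
    int done = 0;
    while (done < len) {
      int n = in.read(pos + done, buf, off + done, len - done);
      if (n < 0) {
        throw new EOFException("Hit EOF after " + done + " of " + len + " bytes");
      }
      done += n;
    }
  }
}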

Tom

On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelman <bb...@cse.unl.edu> wrote:
>
> Hey Todd,
>
> Been playing more this morning after thinking about it for the night -- I
> think the culprit is not the network, but actually the cache.  Here's the
> output of your script adjusted to do the same calls as I was doing (you had
> left out the random I/O part).
>
> [brian@red tmp]$ java hdfs_tester
> Mean value for reads of size 0: 0.0447
> Mean value for reads of size 16384: 10.4872
> Mean value for reads of size 32768: 10.82925
> Mean value for reads of size 49152: 6.2417
> Mean value for reads of size 65536: 7.0511003
> Mean value for reads of size 81920: 9.411599
> Mean value for reads of size 98304: 9.378799
> Mean value for reads of size 114688: 8.99065
> Mean value for reads of size 131072: 5.1378503
> Mean value for reads of size 147456: 6.1324
> Mean value for reads of size 163840: 17.1187
> Mean value for reads of size 180224: 6.5492
> Mean value for reads of size 196608: 8.45695
> Mean value for reads of size 212992: 7.4292
> Mean value for reads of size 229376: 10.7843
> Mean value for reads of size 245760: 9.29095
> Mean value for reads of size 262144: 6.57865
>
> Copy of the script below.
>
> So, without the FUSE layer, we don't see much (if any) patterns here.  The
> overhead of randomly skipping through the file is higher than the overhead
> of reading out the data.
>
> Upon further inspection, the biggest factor affecting the FUSE layer is
> actually the Linux VFS caching -- if you notice, the bandwidth in the given
> graph for larger read sizes is *higher* than 1Gbps, which is the limit of
> the network on that particular node.  If I go in the opposite direction -
> starting with the largest reads first, then going down to the smallest
> reads, the graph entirely smooths out for the small values - everything is
> read from the filesystem cache in the client RAM.  Graph attached.
>
> So, on the upside, mounting through FUSE gives us the opportunity to speed
> up reads for very complex, non-sequential patterns - for free, thanks to the
> hardworking Linux kernel.  On the downside, it's incredibly difficult to
> come up with simple cases to demonstrate performance for an application --
> the cache performance and size depends on how much activity there's on the
> client, the previous file system activity that the application did, and the
> amount of concurrent activity on the server.  I can give you results for
> performance, but it's not going to be the performance you see in real life.
>  (Gee, if only file systems were easy...)
>
> Ok, sorry for the list noise -- it seems I'm going to have to think more
> about this problem before I can come up with something coherent.
>
> Brian
>
>
>
>
>
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.conf.Configuration;
> import java.io.IOException;
> import java.net.URI;
> import java.util.Random;
>
> public class hdfs_tester {
>  public static void main(String[] args) throws Exception {
>   URI uri = new URI("hdfs://hadoop-name:9000/");
>   FileSystem fs = FileSystem.get(uri, new Configuration());
>   Path path = new
> Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/Nebraska/LoadTest07_Nebraska_33");
>   FSDataInputStream dis = fs.open(path);
>   Random rand = new Random();
>   FileStatus status = fs.getFileStatus(path);
>   long file_len = status.getLen();
>   int iters = 20;
>   for (int size=0; size < 1024*1024; size += 4*4096) {
>     long csum = 0;
>     for (int i = 0; i < iters; i++) {
>       int pos = rand.nextInt((int)((file_len-size-1)/8))*8;
>       byte buf[] = new byte[size];
>       if (pos < 0)
>         pos = 0;
>       long st = System.nanoTime();
>       dis.read(pos, buf, 0, size);
>       long et = System.nanoTime();
>       csum += et-st;
>       //System.out.println(String.valueOf(size) + "\t" + String.valueOf(pos)
> + "\t" + String.valueOf(et - st));
>     }
>     float csum2 = csum; csum2 /= iters;
>     System.out.println("Mean value for reads of size " + size + ": " +
> (csum2/1000/1000));
>   }
>   fs.close();
>  }
> }
>
>
> On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote:
>
>> On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Hey Brian,
>>>
>>> This is really interesting stuff. I'm curious - have you tried these same
>>> experiments using the Java API? I'm wondering whether this is
>>> FUSE-specific
>>> or inherent to all HDFS reads. I'll try to reproduce this over here as
>>> well.
>>>
>>> This smells sort of nagle-related to me... if you get a chance, you may
>>> want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 * 1024, and
>>> see if the magic number jumps up to 256KB. If so, I think it should be a
>>> pretty easy bugfix.
>>>
>>
>> Oops - spoke too fast there... looks like TCP_WINDOW_SIZE isn't actually
>> used for any socket configuration, so I don't think that will make a
>> difference... still think networking might be the culprit, though.
>>
>> -Todd
>>
>>
>>>
>>> On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman
>>> <bb...@cse.unl.edu>wrote:
>>>
>>>> Ok, here's something perhaps even more strange.  I removed the "seek"
>>>> part
>>>> out of my timings, so I was only timing the "read" instead of the "seek
>>>> +
>>>> read" as in the first case.  I also turned the read-ahead down to 1-byte
>>>> (aka, off).
>>>>
>>>> The jump *always* occurs at 128KB, exactly.
>>>>
>>>> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
>>>> sequential reads, not random reads - but it seems that it's one bug-fix
>>>> away
>>>> from being a good general-purpose system.  Heck if I can find what's
>>>> causing
>>>> the issues though...
>>>>
>>>> Brian
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>>>>
>>>> Hey all,
>>>>>
>>>>> I was doing some research on I/O patterns of our applications, and I
>>>>> noticed the attached pattern.  In case if the mail server strips out
>>>>> attachments, I also uploaded it:
>>>>>
>>>>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>>>>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>>>>
>>>>> This was taken using the FUSE mounts of Hadoop; the first one was with
>>>>> a
>>>>> 64KB read-ahead and the second with a 1MB read-ahead.  This was taken
>>>>> from a
>>>>> 2GB file and randomly 'seek'ed in the file.  This was performed 20
>>>>> times for
>>>>> each read size, advancing in 4KB increments.  Each blue dot is the read
>>>>> time
>>>>> of one experiment; the red dot is the median read time for the read
>>>>> size.
>>>>> The graphs show the absolute read time.
>>>>>
>>>>> There's very interesting behavior - it seems that there is a change in
>>>>> behavior around reads of size of 800KB.  The time for the reads go down
>>>>> significantly when you read *larger* files.  I thought this was just an
>>>>> artifact of the 64KB read-ahead I set in FUSE, so I upped the
>>>>> read-ahead
>>>>> significantly, to 1MB.  In this case, the difference between the the
>>>>> small
>>>>> read sizes and large read sizes are *very* pronounced.  If it was an
>>>>> artifact from FUSE, I'd expect the place where the change occurred
>>>>> would be
>>>>> a function of the readahead-size.
>>>>>
>>>>> Anyone out there who knows the code have any ideas?  What could I be
>>>>> doing wrong?
>>>>>
>>>>> Brian
>>>>>
>>>>> <Hadoop_64KB_ra.png>
>>>>>
>>>>> <Hadoop_1024KB_ra.png>
>>>>>
>>>>
>>>>
>>>>
>>>
>
>
>

Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Todd,

Been playing more this morning after thinking about it for the night  
-- I think the culprit is not the network, but actually the cache.   
Here's the output of your script adjusted to do the same calls as I  
was doing (you had left out the random I/O part).

[brian@red tmp]$ java hdfs_tester
Mean value for reads of size 0: 0.0447
Mean value for reads of size 16384: 10.4872
Mean value for reads of size 32768: 10.82925
Mean value for reads of size 49152: 6.2417
Mean value for reads of size 65536: 7.0511003
Mean value for reads of size 81920: 9.411599
Mean value for reads of size 98304: 9.378799
Mean value for reads of size 114688: 8.99065
Mean value for reads of size 131072: 5.1378503
Mean value for reads of size 147456: 6.1324
Mean value for reads of size 163840: 17.1187
Mean value for reads of size 180224: 6.5492
Mean value for reads of size 196608: 8.45695
Mean value for reads of size 212992: 7.4292
Mean value for reads of size 229376: 10.7843
Mean value for reads of size 245760: 9.29095
Mean value for reads of size 262144: 6.57865

Copy of the script below.

So, without the FUSE layer, we don't see much (if any) pattern here.
The overhead of randomly skipping through the file is higher than the
overhead of reading out the data.

Upon further inspection, the biggest factor affecting the FUSE layer  
is actually the Linux VFS caching -- if you notice, the bandwidth in  
the given graph for larger read sizes is *higher* than 1Gbps, which is  
the limit of the network on that particular node.  If I go in the  
opposite direction - starting with the largest reads first, then going  
down to the smallest reads, the graph entirely smooths out for the  
small values - everything is read from the filesystem cache in the  
client RAM.  Graph attached.

So, on the upside, mounting through FUSE gives us the opportunity to
speed up reads for very complex, non-sequential patterns - for free,
thanks to the hardworking Linux kernel.  On the downside, it's
incredibly difficult to come up with simple cases to demonstrate
performance for an application -- the cache performance and size
depend on how much activity there is on the client, the previous file
system activity that the application did, and the amount of concurrent
activity on the server.  I can give you results for performance, but
it's not going to be the performance you see in real life.  (Gee, if
only file systems were easy...)

Ok, sorry for the list noise -- it seems I'm going to have to think  
more about this problem before I can come up with something coherent.

Brian


Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hey Brian,
>
> This is really interesting stuff. I'm curious - have you tried these same
> experiments using the Java API? I'm wondering whether this is FUSE-specific
> or inherent to all HDFS reads. I'll try to reproduce this over here as well.
>
> This smells sort of nagle-related to me... if you get a chance, you may
> want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 * 1024, and
> see if the magic number jumps up to 256KB. If so, I think it should be a
> pretty easy bugfix.
>

Oops - spoke too fast there... looks like TCP_WINDOW_SIZE isn't actually
used for any socket configuration, so I don't think that will make a
difference... still think networking might be the culprit, though.

-Todd


>
> On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman <bb...@cse.unl.edu>wrote:
>
>> Ok, here's something perhaps even more strange.  I removed the "seek" part
>> out of my timings, so I was only timing the "read" instead of the "seek +
>> read" as in the first case.  I also turned the read-ahead down to 1-byte
>> (aka, off).
>>
>> The jump *always* occurs at 128KB, exactly.
>>
>> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
>> sequential reads, not random reads - but it seems that it's one bug-fix away
>> from being a good general-purpose system.  Heck if I can find what's causing
>> the issues though...
>>
>> Brian
>>
>>
>>
>>
>>
>> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>>
>>  Hey all,
>>>
>>> I was doing some research on I/O patterns of our applications, and I
>>> noticed the attached pattern.  In case if the mail server strips out
>>> attachments, I also uploaded it:
>>>
>>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>>
>>> This was taken using the FUSE mounts of Hadoop; the first one was with a
>>> 64KB read-ahead and the second with a 1MB read-ahead.  This was taken from a
>>> 2GB file and randomly 'seek'ed in the file.  This was performed 20 times for
>>> each read size, advancing in 4KB increments.  Each blue dot is the read time
>>> of one experiment; the red dot is the median read time for the read size.
>>>  The graphs show the absolute read time.
>>>
>>> There's very interesting behavior - it seems that there is a change in
>>> behavior around reads of size of 800KB.  The time for the reads go down
>>> significantly when you read *larger* files.  I thought this was just an
>>> artifact of the 64KB read-ahead I set in FUSE, so I upped the read-ahead
>>> significantly, to 1MB.  In this case, the difference between the the small
>>> read sizes and large read sizes are *very* pronounced.  If it was an
>>> artifact from FUSE, I'd expect the place where the change occurred would be
>>> a function of the readahead-size.
>>>
>>> Anyone out there who knows the code have any ideas?  What could I be
>>> doing wrong?
>>>
>>> Brian
>>>
>>> <Hadoop_64KB_ra.png>
>>>
>>> <Hadoop_1024KB_ra.png>
>>>
>>
>>
>>
>

Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by Todd Lipcon <to...@cloudera.com>.
Hey Brian,

This is really interesting stuff. I'm curious - have you tried these same
experiments using the Java API? I'm wondering whether this is FUSE-specific
or inherent to all HDFS reads. I'll try to reproduce this over here as well.

This smells sort of Nagle-related to me... if you get a chance, you may want
to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 * 1024, and see if
the magic number jumps up to 256KB. If so, I think it should be a pretty
easy bugfix.

-Todd


On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman <bb...@cse.unl.edu>wrote:

> Ok, here's something perhaps even more strange.  I removed the "seek" part
> out of my timings, so I was only timing the "read" instead of the "seek +
> read" as in the first case.  I also turned the read-ahead down to 1-byte
> (aka, off).
>
> The jump *always* occurs at 128KB, exactly.
>
> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
> sequential reads, not random reads - but it seems that it's one bug-fix away
> from being a good general-purpose system.  Heck if I can find what's causing
> the issues though...
>
> Brian
>
>
>
>
>
> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>
>  Hey all,
>>
>> I was doing some research on I/O patterns of our applications, and I
>> noticed the attached pattern.  In case if the mail server strips out
>> attachments, I also uploaded it:
>>
>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>
>> This was taken using the FUSE mounts of Hadoop; the first one was with a
>> 64KB read-ahead and the second with a 1MB read-ahead.  This was taken from a
>> 2GB file and randomly 'seek'ed in the file.  This was performed 20 times for
>> each read size, advancing in 4KB increments.  Each blue dot is the read time
>> of one experiment; the red dot is the median read time for the read size.
>>  The graphs show the absolute read time.
>>
>> There's very interesting behavior - it seems that there is a change in
>> behavior around reads of size of 800KB.  The time for the reads go down
>> significantly when you read *larger* files.  I thought this was just an
>> artifact of the 64KB read-ahead I set in FUSE, so I upped the read-ahead
>> significantly, to 1MB.  In this case, the difference between the the small
>> read sizes and large read sizes are *very* pronounced.  If it was an
>> artifact from FUSE, I'd expect the place where the change occurred would be
>> a function of the readahead-size.
>>
>> Anyone out there who knows the code have any ideas?  What could I be doing
>> wrong?
>>
>> Brian
>>
>> <Hadoop_64KB_ra.png>
>>
>> <Hadoop_1024KB_ra.png>
>>
>
>
>

Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by Scott Carey <sc...@richrelevance.com>.
On 4/12/09 9:41 PM, "Brian Bockelman" <bb...@cse.unl.edu> wrote:

> Ok, here's something perhaps even more strange.  I removed the "seek"
> part out of my timings, so I was only timing the "read" instead of the
> "seek + read" as in the first case.  I also turned the read-ahead down
> to 1-byte (aka, off).
> 
> The jump *always* occurs at 128KB, exactly.

Some random ideas:

I have no idea how FUSE interops with the Linux block layer, but 128K
happens to be the default 'readahead' value for block devices, which may
just be a coincidence.

For a disk 'sda', you check and set the value (in 512 byte blocks) with:

/sbin/blockdev --getra /dev/sda
/sbin/blockdev --setra [num blocks] /dev/sda


I know from my file system tests that the OS readahead is not activated
until a series of sequential reads goes through the block device, so
truly random access is not affected by this.  I've set it to 128MB and
random iops does not change on an ext3 or xfs file system.  If this
applies to FUSE too, there
may be reasons that this behavior differs.
Furthermore, one would not expect it to be slower to randomly read 4k than
randomly read up to the readahead size itself even if it did.

I also have no idea how much of the OS device queue and block device
scheduler is involved with FUSE.  If those are involved, then there's a
bunch of stuff to tinker with there as well.

Lastly, an FYI in case you don't already know: if the OS is caching
pages, there is a way to flush these in Linux and evict the cache.
See /proc/sys/vm/drop_caches.
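
For example, something along these lines (a rough sketch, and it has to
run as root; writing 1 drops only the page cache, 3 also drops dentries
and inodes):

import java.io.FileWriter;

public class DropCaches {
  public static void main(String[] args) throws Exception {
    // Flush dirty pages first so the kernel can actually evict them.
    Runtime.getRuntime().exec("sync").waitFor();
    FileWriter w = new FileWriter("/proc/sys/vm/drop_caches");
    try {
      w.write("3\n");   // 1 = page cache, 2 = dentries+inodes, 3 = both
    } finally {
      w.close();
    }
  }
}

(The same thing is normally done from a root shell with sync followed
by echoing 3 into that file.)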



> 
> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
> sequential reads, not random reads - but it seems that it's one bug-
> fix away from being a good general-purpose system.  Heck if I can find
> what's causing the issues though...
> 
> Brian
> 
> 


Re: Interesting Hadoop/FUSE-DFS access patterns

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Ok, here's something perhaps even more strange.  I removed the "seek"  
part out of my timings, so I was only timing the "read" instead of the  
"seek + read" as in the first case.  I also turned the read-ahead down  
to 1-byte (aka, off).

The jump *always* occurs at 128KB, exactly.

I'm a bit befuddled.  I know we say that HDFS is optimized for large,  
sequential reads, not random reads - but it seems that it's one bug- 
fix away from being a good general-purpose system.  Heck if I can find  
what's causing the issues though...

Brian