Posted to common-user@hadoop.apache.org by Dennis Kubes <ku...@apache.org> on 2007/07/31 16:07:01 UTC

Hard Disk Failures

Can anyone who is running large clusters (50+) tell me what hard disk
failure rates you are seeing? We are seeing certain machines
consistently carry double or triple the load of other machines running
the same tasks. I believe this is due to some hard disks beginning to
fail, and I wanted to know whether anyone else is seeing similar
behavior.

Dennis Kubes

Re: Hard Disk Failures

Posted by Raghu Angadi <ra...@yahoo-inc.com>.

I should also add that disks can degrade and keep working with poor
performance for a long time, not just shortly before they fail. While
looking at some benchmarks I was able to isolate some of those machines.
See HADOOP-1649.

If DFS schedules fewer writes on such nodes, it helps the whole cluster
and would make speculative execution of tasks in map/reduce more
effective.
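
To make the idea concrete, here is a minimal Python sketch of weighting
write targets by observed latency. It is only an illustration and not
how DFS actually places blocks; the function names, the node names, and
the latency figures are all hypothetical.

```python
import random

def write_weights(latency_ms):
    """Weight each node by the inverse of its observed write latency.

    Assumes every latency is greater than zero. The weights sum to 1.
    """
    inverse = {node: 1.0 / ms for node, ms in latency_ms.items()}
    total = sum(inverse.values())
    return {node: w / total for node, w in inverse.items()}

def pick_write_target(latency_ms, rng=random):
    """Pick one node at random, favoring nodes with lower latency."""
    weights = write_weights(latency_ms)
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]

# Made-up latencies: node-c stands in for a machine with a degraded disk.
latencies = {"node-a": 10.0, "node-b": 10.0, "node-c": 40.0}
print(write_weights(latencies))
print(pick_write_target(latencies))
```

With these sample numbers the slow node still receives some writes, but
only a quarter as many as each healthy node.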

Raghu.

Re: Hard Disk Failures

Posted by Raghu Angadi <ra...@yahoo-inc.com>.

I have not watched large Hadoop clusters closely, but from my experience
with other large clusters that have heavy (seek-dominated) disk loads,
the behavior you see seems consistent. Some disks do become very slow,
and if they are part of a RAID array, the whole array runs at the speed
of the slowest disk. iostat -x also helps confirm this.
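
As a rough illustration of that check, the Python sketch below flags
disks whose average wait per request is far above the median of their
peers. It assumes the per-device wait times have already been collected
(for example, read off an iostat -x report by hand); the function name,
the factor of 3, and the sample readings are assumptions made for the
example.

```python
from statistics import median

def flag_slow_disks(await_ms, factor=3.0):
    """Return devices whose average wait exceeds `factor` times the median.

    `await_ms` maps a device name to its average wait per request in
    milliseconds.
    """
    if not await_ms:
        return []
    typical = median(await_ms.values())
    return sorted(dev for dev, ms in await_ms.items() if ms > factor * typical)

# Made-up sample readings for four disks on one machine.
sample = {"sda": 8.1, "sdb": 7.6, "sdc": 41.3, "sdd": 9.0}
print(flag_slow_disks(sample))  # prints ['sdc']
```

Comparing against the median rather than the mean keeps a single very
slow disk from raising the threshold it is measured against.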

Also, comparing ext2 and ext3, ext3 did not show a noticeable slowdown.
Often the application's access patterns dictate disk performance more
than the native filesystem implementation itself. The filesystem would
probably matter more when dealing with a lot of small files.
