Posted to common-user@hadoop.apache.org by Igor Bolotin <ig...@collarity.com> on 2009/03/05 19:10:28 UTC

DataNode stops cleaning disk?

Normally I dislike writing about problems without being able to provide
more information, but unfortunately in this case I just can't find
anything.

 

Here is the situation: a DFS cluster running Hadoop version 0.19.0. The
cluster is running on multiple servers with practically identical
hardware. Everything works perfectly well, except for one thing: from
time to time one of the data nodes (every time it's a different node)
starts to consume more and more disk space. The node keeps going, and if
we don't do anything it runs out of space completely (ignoring the 20GB
reserved-space setting). Once restarted, it cleans the disk rapidly and
goes back to approximately the same utilization as the rest of the data
nodes in the cluster.
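
For reference, the reservation is set via dfs.datanode.du.reserved; a
sketch of what we have (the config path and exact value are illustrative):

  # 20GB in bytes, reserved per volume for non-DFS use:
  grep -A 1 dfs.datanode.du.reserved conf/hadoop-site.xml
  <name>dfs.datanode.du.reserved</name>
  <value>21474836480</value>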

 

Scanning the datanode and namenode logs and comparing thread dumps
(stacks) from nodes experiencing the problem with those that run
normally didn't produce any clues. Running the balancer tool didn't help
at all. FSCK shows that everything is healthy and the number of
over-replicated blocks is not significant.
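
Roughly, those checks looked like this (a sketch - the threshold, paths,
and pid are illustrative):

  bin/hadoop balancer -threshold 10   # rebalance; made no difference
  bin/hadoop fsck /                   # summary: HEALTHY, few over-replicated blocks
  jstack <datanode pid> > dn.stack    # thread dump, compared across nodes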

 

To me - it just looks like at some point the data node stops cleaning
invalidated/deleted blocks, but keeps reporting space consumed by these
blocks as "not used", but I'm not familiar enough with the internals and
just plain don't have enough free time to start digging deeper.

 

Does anyone have an idea what is wrong, what else we can do to find out
what's wrong, or maybe where to start looking in the code?

 

Thanks,

Igor

 


RE: DataNode stops cleaning disk?

Posted by Igor Bolotin <ig...@collarity.com>.
My mistake about the 'current' directory - that's the one that consumes
all the disk space, and 'du' on that directory matches exactly the size
reported in the NameNode web UI.
I'm waiting for the next time this happens to collect more details, but
ever since I wrote the first email everything has worked perfectly well
(another application of Murphy's law).

Thanks,
Igor


RE: DataNode stops cleaning disk?

Posted by Igor Bolotin <ig...@collarity.com>.
Filed HADOOP-5523 on this and will continue the investigation.
Thanks for the help!
Igor


Re: DataNode stops cleaning disk?

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Igor,

A few things you could do (it may be better to file a jira, with a short
description and more details in follow-up comments):

1) Pick one of the block ids from the open files and grep for it in the
DataNode and NameNode logs (there is one log file for each day) - see
the sketch below.
2) Pick one of the over-replicated blocks (if the above block is not one
of them) and trace it in the NameNode log.
3) Take a jstack of the datanode in this state.
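
For example (a rough sketch - the block id, log locations, and pid file
are illustrative):

  grep 'blk_-1234567890123456789' $HADOOP_LOG_DIR/hadoop-*-datanode-*.log*
  grep 'blk_-1234567890123456789' $HADOOP_LOG_DIR/hadoop-*-namenode-*.log*
  jstack `cat $HADOOP_PID_DIR/hadoop-*-datanode.pid` > datanode.jstack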

Since you still have over-replicated blocks, you probably have more
datanodes in the early stages of this problem.

Igor Bolotin wrote:
> Caught this issue again on one of the clusters. DF and DU sizes match
> very closely with information reported by dfsadmin command. 

If the DN reports the space properly, then the original problem you
reported, that the DataNode runs out of disk, should not happen.

You can follow up with more investigation if you are interested. The
other alternative is to upgrade to the latest 0.19.x to see if the
problem persists. There have been many fixes since 0.19.0.

Raghu.


RE: DataNode stops cleaning disk?

Posted by Igor Bolotin <ig...@collarity.com>.
Caught this issue again on one of the clusters. df and du sizes match
very closely with the information reported by the dfsadmin command.
lsof reports some 1000 open files in the DFS data directories on the
problematic datanode, but the total size of the open files is only about
10GB. I can't really track the space usage down to individual files -
there are way too many files/blocks for detailed analysis.
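
Those numbers came from something like this (a sketch - the data path is
illustrative):

  bin/hadoop dfsadmin -report   # per-node 'DFS Used', compared against df/du
  # rough sum of the SIZE/OFF column for files open under the data dir:
  lsof +D /data/dfs 2>/dev/null | awk 'NR>1 {sum+=$7} END {printf "%.1f GB\n", sum/2^30}'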

Here is something interesting - fsck before the datanode restart reports
a very significant number of over-replicated blocks (~10% of blocks are
over-replicated):

Status: HEALTHY
 Total size:    1472758591906 B (Total open files size: 29050588133 B)
 Total dirs:    58431
 Total files:   375703 (Files currently being written: 418)
 Total blocks (validated):      387205 (avg. block size 3803562 B)
(Total open file blocks (not validated): 595)
 Minimally replicated blocks:   387205 (100.0 %)
 Over-replicated blocks:        38782 (10.015883 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.1003888
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          7
 Number of racks:               1


After the datanode restart, the over-replicated blocks are practically gone:

Status: HEALTHY
 Total size:    1310669475298 B (Total open files size: 29535016933 B)
 Total dirs:    59431
 Total files:   377177 (Files currently being written: 387)
 Total blocks (validated):      386661 (avg. block size 3389712 B)
(Total open file blocks (not validated): 607)
 Minimally replicated blocks:   386661 (100.0 %)
 Over-replicated blocks:        272 (0.070345856 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0007036
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          7
 Number of racks:               1

What might be the cause of the over-replication?

Best regards,
Igor


Re: DataNode stops cleaning disk?

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Igor Bolotin wrote:
> That's what I saw just yesterday on one of the data nodes in this
> situation (I'll confirm next time it happens):
> - 'tmp' and 'current' were either empty or almost empty last time I checked.
> - du on the entire data directory matched exactly the used space
> reported in the NameNode web UI, and it did report that it uses most of
> the available disk space.
> - nothing else was using disk space (actually, it's a dedicated DFS
> cluster).

If the 'du' command (which you can run in the shell) counts properly,
then you should be able to see which files are taking the space.

If 'du' can't, but 'df' reports much less space available, then it is
possible (though I've never seen it) that the datanode is keeping a lot
of these files open. 'ls -l /proc/<datanode pid>/fd' lists these files.
If it is not the datanode, then check lsof to find who is holding these
files.
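
For example (a sketch - the pid lookup and the mount point are
illustrative):

  DN_PID=`jps | grep DataNode | awk '{print $1}'`
  ls -l /proc/$DN_PID/fd | grep -c '(deleted)'  # deleted-but-open block files still hold space
  lsof +L1 /data                                # open files with link count 0 on that filesystem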

Hope this helps.
Raghu.


RE: DataNode stops cleaning disk?

Posted by Igor Bolotin <ig...@collarity.com>.
That's what I saw just yesterday on one of the data nodes in this
situation (I'll confirm next time it happens):
- 'tmp' and 'current' were either empty or almost empty last time I checked.
- du on the entire data directory matched exactly the used space
reported in the NameNode web UI, and it did report that it uses most of
the available disk space.
- nothing else was using disk space (actually, it's a dedicated DFS
cluster).

Thank you for help!
Igor


Re: DataNode stops cleaning disk?

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
This is unexpected unless some other process is eating up space.

A couple of things to collect next time (along with the log); see the
sketch below:

  - All the contents under the datanode directory (especially 'tmp' and
'current')
  - Does 'du' of this directory match what is reported to the NameNode
(shown on the web UI) by this DataNode?
  - Is there anything else taking disk space on the machine?
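
Something along these lines (a sketch - the data directory path is
illustrative):

  du -sh /data/dfs/data/*   # breakdown: current/, tmp/, detach/, ...
  du -sb /data/dfs/data     # compare with 'Used' for this node on the web UI
  df -h /data               # is anything besides DFS taking the disk?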

Raghu.
