Posted to common-user@hadoop.apache.org by Jason Venner <ja...@attributor.com> on 2009/01/09 01:03:50 UTC

0.18.1 datanode pseudo deadlock problem

The problem we are having is that datanodes periodically stall for 10-15 
minutes and drop off the active list and then come back.

What is going on is that a long-running set of operations is holding the 
lock on FSDataset.volumes, and all of the other block service requests 
stall behind this lock.

"DataNode: [/data/dfs-video-18/dfs/data]" daemon prio=10 tid=0x4d7ad400 
nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
   java.lang.Thread.State: RUNNABLE
    at java.lang.String.lastIndexOf(String.java:1628)
    at java.io.File.getName(File.java:399)
    at org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148)
    at org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
    at org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
    at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511)
    - locked <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
    at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
    at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
    at java.lang.Thread.run(Thread.java:619)

This is basically taking a stat on every HDFS block on the datanode, 
which in our case is ~2 million, and can take 10+ minutes (we may be 
experiencing problems with our raid controller but have no visibility 
into it). At the OS level the file system seems fine and operations 
eventually finish.

It appears that a couple of different data structures are being locked 
via the single FSDataset$FSVolumeSet object.

Then this happens:
"org.apache.hadoop.dfs.DataNode$DataXceiver@1bcee17" daemon prio=10 
tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry [0x459fe000..0x459ff0d0]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473)
    - waiting to lock <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
    - locked <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
    at java.lang.Thread.run(Thread.java:619)

This locks the FSDataset while waiting on the volume object, and now all 
of the other DataNode operations stall waiting on the FSDataset object.
----------

Our particular installation doesn't use multiple directories for HDFS, 
so a first simple hack for a local fix would be to modify getNextVolume 
to just return the single volume and not be synchronized.

A richer alternative would be to make the locking on 
FSDataset$FSVolumeSet more fine grained.
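
To make the multi-volume case concrete, here is a rough sketch of what the 
finer-grained direction could look like (illustrative names only, not the 0.18 
code): pick the next volume round-robin off an atomic counter so writers never 
queue behind the FSVolumeSet monitor while the block report scan holds it, 
assuming the volume list is fixed after startup.

// Illustrative sketch only (not the 0.18 FSDataset code): choose the next
// volume round-robin via an atomic counter instead of synchronizing on the
// volume set, so block writers never wait behind the block-report scan.
// Assumes the set of volumes is fixed after startup.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class VolumeChooserSketch {
    interface Volume {
        long getAvailable() throws IOException;   // free space on this volume
    }

    private final Volume[] volumes;               // immutable after construction
    private final AtomicInteger next = new AtomicInteger(0);

    VolumeChooserSketch(Volume[] volumes) {
        this.volumes = volumes;
    }

    Volume getNextVolume(long blockSize) throws IOException {
        int start = next.getAndIncrement();
        for (int i = 0; i < volumes.length; i++) {
            int idx = (start + i) % volumes.length;
            if (idx < 0) {                        // counter wrapped negative
                idx += volumes.length;
            }
            Volume v = volumes[idx];
            if (v.getAvailable() >= blockSize) {
                return v;
            }
        }
        throw new IOException("Insufficient space for block of " + blockSize + " bytes");
    }
}

In the single-volume case this degenerates to returning volumes[0] with no 
locking at all, which is the local hack described above.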

Of course we are also trying to fix the file system performance and dfs 
block loading that results in the block report taking a long time.

Any suggestions or warnings?

Thanks.




Re: 0.18.1 datanode pseudo deadlock problem

Posted by Jason Venner <ja...@attributor.com>.
The thought is that the notifier would stat each file as it was notified 
about it, and thus would also have real-time disk-usage information. 
There would be no need for the current du task or the block-scan task 
after startup (i.e., run them once at startup to compute the current 
blocks and space).

After startup, the recipient of the notifications would keep the block 
information and the du information up to date.
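
A rough sketch of that receiver (class and method names are made up for 
illustration): stat each block file once when the create notification arrives, 
remember its length, and keep a running used-bytes total so neither du nor a 
directory rescan is needed afterwards.

// Sketch of the receiver side (illustrative names, not DataNode code): one stat
// per block at notification time, a map of block id -> length, and a running
// used-bytes counter, so the du task and the rescans go away after the single
// startup pass.
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class BlockMapReceiver {
    // block id -> on-disk length, recorded when the create notification arrives
    private final ConcurrentHashMap<Long, Long> blockSizes =
        new ConcurrentHashMap<Long, Long>();
    private final AtomicLong usedBytes = new AtomicLong(0);

    // Called for each CREATE / MOVED_TO notification of a block file.
    void blockAdded(long blockId, File f) {
        long len = f.length();                        // the one stat for this block
        Long previous = blockSizes.put(blockId, len);
        usedBytes.addAndGet(len - (previous == null ? 0L : previous.longValue()));
    }

    // Called for each DELETE / MOVED_FROM notification.
    void blockRemoved(long blockId) {
        Long previous = blockSizes.remove(blockId);
        if (previous != null) {
            usedBytes.addAndGet(-previous.longValue());
        }
    }

    long getDfsUsed()   { return usedBytes.get(); }    // replaces the du value
    int  getNumBlocks() { return blockSizes.size(); }  // block report size
}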

Raghu Angadi wrote:
> Jason Venner wrote:
>> There is no reason to do the block scans. All of the modern kernels 
>> will provide you notification when an file or directory is altered.
>>
>> This could be readily handled with a native application that writes 
>> structured data to a receiver in the Datanode, or via JNA/JNI for 
>> pure java or mixed solutions.
>
> Agreed. pure java would be best. I didn't mean to suggest scan as the 
> main solution. I only stressed that the problem needs to be fixed.
>
> Unlike DU, since scan would be in our control and it could be 
> throttled at our own pace so that it does not thrash the disks. It 
> will have minimal effect on DN.
>
> Please open a jira for this issue, so the these solutions could be 
> proposed.
>
> would such a notifier help with thrashing from DU? May be we should 
> have alternatives to calling du shell command on the entire directory.
>
> Raghu.

Re: 0.18.1 datanode pseudo deadlock problem

Posted by Jason Venner <ja...@attributor.com>.
Here is some simple code I wrote using JNA to handle Linux inotify. 
This code was my first and only attempt to use JNA.

The JNA jars are available from https://jna.dev.java.net/
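
The attachment is not reproduced inline, but the shape of the binding is 
roughly the following minimal sketch, assuming a direct JNA mapping of the 
libc inotify calls (the interface name and error handling are illustrative, 
not the attached code):

// Minimal sketch of the binding: map the libc inotify calls directly through
// JNA. Requires jna.jar on the classpath; Linux only. Mask values match
// /usr/include/sys/inotify.h.
import com.sun.jna.Library;
import com.sun.jna.Native;

public interface CLib extends Library {
    CLib INSTANCE = (CLib) Native.loadLibrary("c", CLib.class);

    int IN_MODIFY     = 0x002;
    int IN_MOVED_FROM = 0x040;
    int IN_MOVED_TO   = 0x080;
    int IN_CREATE     = 0x100;
    int IN_DELETE     = 0x200;

    int inotify_init();                                   // returns an inotify fd
    int inotify_add_watch(int fd, String path, int mask); // returns a watch descriptor
    int read(int fd, byte[] buf, int count);              // raw events land in buf
    int close(int fd);
}

Watching a block directory is then inotify_init() followed by 
inotify_add_watch(fd, dir, mask) and a loop that read()s the descriptor and 
decodes the packed events. Note that inotify watches are per-directory, so 
each subdir under the block directory needs its own watch.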


Raghu Angadi wrote:
> Jason Venner wrote:
>> There is no reason to do the block scans. All of the modern kernels 
>> will provide you notification when an file or directory is altered.
>>
>> This could be readily handled with a native application that writes 
>> structured data to a receiver in the Datanode, or via JNA/JNI for 
>> pure java or mixed solutions.
>
> Agreed. pure java would be best. I didn't mean to suggest scan as the 
> main solution. I only stressed that the problem needs to be fixed.
>
> Unlike DU, since scan would be in our control and it could be 
> throttled at our own pace so that it does not thrash the disks. It 
> will have minimal effect on DN.
>
> Please open a jira for this issue, so the these solutions could be 
> proposed.
>
> would such a notifier help with thrashing from DU? May be we should 
> have alternatives to calling du shell command on the entire directory.
>
> Raghu.

Re: 0.18.1 datanode pseudo deadlock problem

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Jason Venner wrote:
> There is no reason to do the block scans. All of the modern kernels will 
> provide you notification when an file or directory is altered.
> 
> This could be readily handled with a native application that writes 
> structured data to a receiver in the Datanode, or via JNA/JNI for pure 
> java or mixed solutions.

Agreed, pure Java would be best. I didn't mean to suggest the scan as the 
main solution; I only stressed that the problem needs to be fixed.

Unlike DU, the scan would be under our control and could be throttled at 
our own pace so that it does not thrash the disks. It would have minimal 
effect on the DN.

Please open a jira for this issue, so that these solutions can be proposed.

Would such a notifier help with thrashing from DU? Maybe we should have 
alternatives to calling the du shell command on the entire directory.

Raghu.

Re: 0.18.1 datanode pseudo deadlock problem

Posted by Jason Venner <ja...@attributor.com>.
There is no reason to do the block scans. All of the modern kernels will 
provide you notification when a file or directory is altered.

This could be readily handled with a native application that writes 
structured data to a receiver in the DataNode, or via JNA/JNI for pure 
Java or mixed solutions.



Raghu Angadi wrote:
> Sagar Naik wrote:
>> Hi Raghu,
>>
>>
>> The periodic "du" and block reports thread thrash the disk. (Block 
>> Reports takes abt on an avg 21 mins )
>>
>> and I think all the datanode threads are not able to do much and freeze
>
> yes, that is the known problem we talked about in the earlier mails in 
> this thread.
>
> When you have millions of blocks, one hour for du and block report 
> intervals is too often for you. May be you could increase it to 
> something like 6 or 12 hours.
>
> It still does not fix the block report problem since DataNode does the 
> scan in-line.
>
> As I mentioned in earlier mails, we should really fix the block report 
> problem. As simple fix would scan (very slowly, unlike DU) the 
> directories in the background.
>
> Even after fixing block reports, you should be aware that excessive 
> number of block does impact the performance. No system can guarantee 
> performance when overloaded. What we want to do is to make Hadoop 
> degrade gracefully.. rather than DNs getting killed.
>
> Raghu.

Re: 0.18.1 datanode pseudo deadlock problem

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Sagar Naik wrote:
> Hi Raghu,
> 
> 
> The periodic "du" and block reports thread thrash the disk. (Block 
> Reports takes abt on an avg 21 mins )
> 
> and I think all the datanode threads are not able to do much and freeze

Yes, that is the known problem we talked about in the earlier mails in 
this thread.

When you have millions of blocks, a one-hour interval for du and block 
reports is too frequent. Maybe you could increase it to something like 
6 or 12 hours.
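
For example, in 0.18 the block report interval is the 
dfs.blockreport.intervalMsec property (default 3600000 ms, one hour); the du 
refresh interval has its own property in hadoop-default.xml and can be raised 
the same way. A hadoop-site.xml entry on the datanodes for a 12-hour block 
report interval would look roughly like this:

<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>43200000</value>
  <!-- 12 hours; default is 3600000 (1 hour). This only spaces the scans out,
       it does not remove them. -->
</property>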

It still does not fix the block report problem, since the DataNode does 
the scan in-line.

As I mentioned in earlier mails, we should really fix the block report 
problem. A simple fix would scan the directories in the background (very 
slowly, unlike DU).
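
A sketch of what such a background scan could look like, throttled by a small 
sleep per file so a full pass never competes with client I/O (names and the 
rate are illustrative; the real fix would live inside FSDataset and feed 
getBlockReport from the last completed pass):

// Sketch only: a background block scan throttled by a per-entry sleep, so a
// full pass trickles along instead of hammering the disk the way the in-line
// block report scan does.
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class ThrottledBlockScanner implements Runnable {
    private final File root;                  // a dfs.data.dir current/ directory
    private final long millisPerEntry;        // e.g. 2 ms ~ 500 stats/sec
    private volatile List<File> lastPass = new ArrayList<File>();

    ThrottledBlockScanner(File root, long millisPerEntry) {
        this.root = root;
        this.millisPerEntry = millisPerEntry;
    }

    public void run() {
        try {
            while (true) {
                List<File> blocks = new ArrayList<File>();
                scan(root, blocks);
                lastPass = blocks;            // block reports read this snapshot
                Thread.sleep(60000L);         // idle a bit between full passes
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void scan(File dir, List<File> out) throws InterruptedException {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return;
        }
        for (File f : entries) {
            if (f.isDirectory()) {
                scan(f, out);
            } else if (f.getName().startsWith("blk_") && !f.getName().endsWith(".meta")) {
                out.add(f);                   // block data file
            }
            Thread.sleep(millisPerEntry);     // the throttle
        }
    }

    List<File> getLastPass() {
        return lastPass;
    }
}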

Even after fixing block reports, you should be aware that an excessive 
number of blocks does impact performance. No system can guarantee 
performance when overloaded. What we want is to make Hadoop degrade 
gracefully rather than have DNs get killed.

Raghu.

Re: 0.18.1 datanode pseudo deadlock problem

Posted by Sagar Naik <sn...@attributor.com>.
Hi Raghu,


The periodic "du" and block report threads thrash the disk (block 
reports take about 21 minutes on average), and I think all of the 
datanode threads are unable to do much and freeze:

"org.apache.hadoop.dfs.DataNode$DataXceiver@f2127a" daemon prio=10 
tid=0x41f06000 nid=0x7c7c waiting for monitor entry [0x43918000..0x43918f50]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.dfs.FSDataset.getFile(FSDataset.java:1158)
    - waiting to lock <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
    at org.apache.hadoop.dfs.FSDataset.validateBlockFile(FSDataset.java:1074)
    at org.apache.hadoop.dfs.FSDataset.isValidBlock(FSDataset.java:1066)
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:894)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
    at java.lang.Thread.run(Thread.java:619)

   Locked ownable synchronizers:
    - None

"org.apache.hadoop.dfs.DataNode$DataXceiver@1bcee17" daemon prio=10 
tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry [0x459fe000..0x459ff0d0]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473)
    - waiting to lock <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
    - locked <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
    at java.lang.Thread.run(Thread.java:619)

   Locked ownable synchronizers:
    - None

"DataNode: [/data/dfs-video-18/dfs/data]" daemon prio=10 tid=0x4d7ad400 
nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
   java.lang.Thread.State: RUNNABLE
    at java.lang.String.lastIndexOf(String.java:1628)
    at java.io.File.getName(File.java:399)
    at org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148)
    at org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
    at org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
    at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511)
    - locked <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
    at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
    at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
    at java.lang.Thread.run(Thread.java:619)


The lock <0x54e550e0> is held by another similar thread, and that thread 
is waiting on the FSVolumeSet monitor held by getBlockReport().

In fact, during this time the datanode appears as a dead node, and 
clients keep getting createBlockException with timeouts.

We don't see this problem on other DNs with fewer blocks, so I think the 
~2 million files are the issue here.

Please correct me if I missed something.

-Sagar

Raghu Angadi wrote:
>
> The scan required for each block report is well known issue and it can 
> be fixed. It was discussed multiple times (e.g. 
> https://issues.apache.org/jira/browse/HADOOP-3232?focusedCommentId=12587795#action_12587795 
> ).
>
> Earlier, inline 'du' on datanodes used to cause the same problem and 
> they they were moved to a separate thread (HADOOP-3232). block reports 
> can do the same...
>
> Though 2M blocks on DN is very large, there is no reason block reports 
> should break things. Once we fix block reports, something else might 
> break.. but that is different issue.
>
> Raghu.
>
> Jason Venner wrote:
>> The problem we are having is that datanodes periodically stall for 
>> 10-15 minutes and drop off the active list and then come back.
>>
>> What is going on is that a long operation set is holding the lock on 
>> on FSDataset.volumes, and all of the other block service requests 
>> stall behind this lock.
>>
>> "DataNode: [/data/dfs-video-18/dfs/data]" daemon prio=10 
>> tid=0x4d7ad400 nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
>>   java.lang.Thread.State: RUNNABLE
>>    at java.lang.String.lastIndexOf(String.java:1628)
>>    at java.io.File.getName(File.java:399)
>>    at 
>> org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148) 
>>
>>    at 
>> org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
>>    at 
>> org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412) 
>>
>>    at 
>> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511) 
>>
>>    - locked <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>>    at 
>> org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
>>    at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
>>    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
>>    at java.lang.Thread.run(Thread.java:619)
>>
>> This is basically taking a stat on every hdfs block on the datanode, 
>> which in our case is ~ 2million, and can take 10+ minutes (we may be 
>> experiencing problems with our raid controller but have no visibility 
>> into it) at the OS level the file system seems fine and operations 
>> eventually finish.
>>
>> It appears that a couple of different data structures are being 
>> locked with the single object FSDataset$Volume.
>>
>> Then this happens:
>> "org.apache.hadoop.dfs.DataNode$DataXceiver@1bcee17" daemon prio=10 
>> tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry 
>> [0x459fe000..0x459ff0d0]
>>   java.lang.Thread.State: BLOCKED (on object monitor)
>>    at 
>> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473) 
>>
>>    - waiting to lock <0x551e8d48> (a 
>> org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>>    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
>>    - locked <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
>>    at 
>> org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
>>    at 
>> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187) 
>>
>>    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
>>    at java.lang.Thread.run(Thread.java:619)
>>
>> which locks the FSDataset while waiting on the volume object
>>
>> and now all of the Datanode operations stall waiting on the FSDataset 
>> object.
>> ----------
>>
>> Our particular installation doesn't use multiple directories for 
>> hdfs, so a first simple hack for a local fix would be to modify 
>> getNextVolume to just return the single volume and not be synchronized
>>
>> A richer alternative would be to make the locking more fine grained 
>> on FSDataset$FSVolumeSet.
>>
>> Of course we are also trying to fix the file system performance and 
>> dfs block loading that results in the block report taking a long time.
>>
>> Any suggestions or warnings?
>>
>> Thanks.
>>
>>
>>
>


Re: 0.18.1 datanode pseudo deadlock problem

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
The scan required for each block report is a well-known issue and it can 
be fixed. It was discussed multiple times (e.g. 
https://issues.apache.org/jira/browse/HADOOP-3232?focusedCommentId=12587795#action_12587795 
).

Earlier, inline 'du' on datanodes used to cause the same problem, and it 
was moved to a separate thread (HADOOP-3232). Block reports can do the 
same...

Though 2M blocks on a DN is very large, there is no reason block reports 
should break things. Once we fix block reports, something else might 
break, but that is a different issue.

Raghu.

Jason Venner wrote:
> The problem we are having is that datanodes periodically stall for 10-15 
> minutes and drop off the active list and then come back.
> 
> What is going on is that a long operation set is holding the lock on on 
> FSDataset.volumes, and all of the other block service requests stall 
> behind this lock.
> 
> "DataNode: [/data/dfs-video-18/dfs/data]" daemon prio=10 tid=0x4d7ad400 
> nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
>   java.lang.Thread.State: RUNNABLE
>    at java.lang.String.lastIndexOf(String.java:1628)
>    at java.io.File.getName(File.java:399)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148) 
> 
>    at 
> org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511) 
> 
>    - locked <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>    at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
>    at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
>    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
>    at java.lang.Thread.run(Thread.java:619)
> 
> This is basically taking a stat on every hdfs block on the datanode, 
> which in our case is ~ 2million, and can take 10+ minutes (we may be 
> experiencing problems with our raid controller but have no visibility 
> into it) at the OS level the file system seems fine and operations 
> eventually finish.
> 
> It appears that a couple of different data structures are being locked 
> with the single object FSDataset$Volume.
> 
> Then this happens:
> "org.apache.hadoop.dfs.DataNode$DataXceiver@1bcee17" daemon prio=10 
> tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry 
> [0x459fe000..0x459ff0d0]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473) 
> 
>    - waiting to lock <0x551e8d48> (a 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
>    - locked <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
>    at 
> org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
>    at 
> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
>    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
>    at java.lang.Thread.run(Thread.java:619)
> 
> which locks the FSDataset while waiting on the volume object
> 
> and now all of the Datanode operations stall waiting on the FSDataset 
> object.
> ----------
> 
> Our particular installation doesn't use multiple directories for hdfs, 
> so a first simple hack for a local fix would be to modify getNextVolume 
> to just return the single volume and not be synchronized
> 
> A richer alternative would be to make the locking more fine grained on 
> FSDataset$FSVolumeSet.
> 
> Of course we are also trying to fix the file system performance and dfs 
> block loading that results in the block report taking a long time.
> 
> Any suggestions or warnings?
> 
> Thanks.
> 
> 
> 


Re: 0.18.1 datanode pseudo deadlock problem

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
2M files is excessive. But there is no reason block reports should 
break. My preference is to make block reports handle this better. DNs 
dropping in and out of the cluster causes too many other problems.

Raghu.

Konstantin Shvachko wrote:
> Hi Jason,
> 
> 2 million blocks per data-node is not going to work.
> There were discussions about it previously, please
> check the mail archives.
> 
> This means you have a lot of very small files, which
> HDFS is not designed to support. A general recommendation
> is to group small files into large ones, introducing
> some kind of record structure delimiting those small files,
> and control it in on the application level.
> 
> Thanks,
> --Konstantin
>

Re: 0.18.1 datanode pseudo deadlock problem

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Hi Jason,

2 million blocks per data-node is not going to work. There were 
discussions about it previously; please check the mail archives.

This means you have a lot of very small files, which HDFS is not 
designed to support. A general recommendation is to group small files 
into large ones, introducing some kind of record structure delimiting 
those small files, and controlling it at the application level.
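
For example, here is a sketch of packing local small files into one 
SequenceFile keyed by the original file name (the key/value types and 
compression choices are up to the application):

// Sketch: pack many small local files into a single HDFS SequenceFile,
// key = original file name, value = raw file contents.
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void pack(File[] smallFiles, Path dest) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, dest, Text.class, BytesWritable.class);
        try {
            for (File f : smallFiles) {
                byte[] data = new byte[(int) f.length()];
                DataInputStream in = new DataInputStream(new FileInputStream(f));
                try {
                    in.readFully(data);
                } finally {
                    in.close();
                }
                // key = original file name, value = raw file contents
                writer.append(new Text(f.getName()), new BytesWritable(data));
            }
        } finally {
            writer.close();
        }
    }
}

MapReduce jobs can then read the container with SequenceFileInputFormat 
instead of opening each small file individually.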

Thanks,
--Konstantin


Jason Venner wrote:
> The problem we are having is that datanodes periodically stall for 10-15 
> minutes and drop off the active list and then come back.
> 
> What is going on is that a long operation set is holding the lock on on 
> FSDataset.volumes, and all of the other block service requests stall 
> behind this lock.
> 
> "DataNode: [/data/dfs-video-18/dfs/data]" daemon prio=10 tid=0x4d7ad400 
> nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
>   java.lang.Thread.State: RUNNABLE
>    at java.lang.String.lastIndexOf(String.java:1628)
>    at java.io.File.getName(File.java:399)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148) 
> 
>    at 
> org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511) 
> 
>    - locked <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>    at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
>    at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
>    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
>    at java.lang.Thread.run(Thread.java:619)
> 
> This is basically taking a stat on every hdfs block on the datanode, 
> which in our case is ~ 2million, and can take 10+ minutes (we may be 
> experiencing problems with our raid controller but have no visibility 
> into it) at the OS level the file system seems fine and operations 
> eventually finish.
> 
> It appears that a couple of different data structures are being locked 
> with the single object FSDataset$Volume.
> 
> Then this happens:
> "org.apache.hadoop.dfs.DataNode$DataXceiver@1bcee17" daemon prio=10 
> tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry 
> [0x459fe000..0x459ff0d0]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473) 
> 
>    - waiting to lock <0x551e8d48> (a 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
>    - locked <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
>    at 
> org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
>    at 
> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
>    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
>    at java.lang.Thread.run(Thread.java:619)
> 
> which locks the FSDataset while waiting on the volume object
> 
> and now all of the Datanode operations stall waiting on the FSDataset 
> object.
> ----------
> 
> Our particular installation doesn't use multiple directories for hdfs, 
> so a first simple hack for a local fix would be to modify getNextVolume 
> to just return the single volume and not be synchronized
> 
> A richer alternative would be to make the locking more fine grained on 
> FSDataset$FSVolumeSet.
> 
> Of course we are also trying to fix the file system performance and dfs 
> block loading that results in the block report taking a long time.
> 
> Any suggestions or warnings?
> 
> Thanks.
> 
> 
> 

Re: 0.18.1 datanode pseudo deadlock problem

Posted by Jason Venner <ja...@attributor.com>.
I propose an alternate solution for this.
If the block information was managed by an inotify task (on 
Linux/Solaris), or the Windows equivalent which I forget, the datanode 
could be informed each time a file in the DFS tree is created, updated, 
or deleted.

With this information being delivered, the datanode can maintain an 
accurate block map with only one full scan of its blocks, at start time.

With this algorithm the data nodes will be able to scale to a much 
larger number of blocks.
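
A sketch of how the notification loop could feed such a block map, decoding 
the raw inotify_event records (int wd, uint32 mask, uint32 cookie, uint32 len, 
then a NUL-padded name of len bytes) from the read buffer; CLib and 
BlockMapReceiver are the illustrative types sketched earlier in this thread:

// Sketch: turn raw inotify events on the block directory into block-map updates.
import java.io.File;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class InotifyEventLoop implements Runnable {
    private final int fd;                      // from inotify_init() + inotify_add_watch()
    private final File watchedDir;
    private final BlockMapReceiver receiver;

    InotifyEventLoop(int fd, File watchedDir, BlockMapReceiver receiver) {
        this.fd = fd;
        this.watchedDir = watchedDir;
        this.receiver = receiver;
    }

    public void run() {
        byte[] buf = new byte[64 * 1024];
        while (true) {
            int n = CLib.INSTANCE.read(fd, buf, buf.length);  // blocks until events arrive
            if (n <= 0) {
                break;                                        // fd closed or read error
            }
            ByteBuffer bb = ByteBuffer.wrap(buf, 0, n).order(ByteOrder.nativeOrder());
            while (bb.remaining() >= 16) {
                bb.getInt();                                  // wd, unused here
                int mask = bb.getInt();
                bb.getInt();                                  // cookie, unused here
                int len = bb.getInt();
                if (bb.remaining() < len) {
                    break;                                    // truncated event
                }
                byte[] nameBytes = new byte[len];
                bb.get(nameBytes);
                String name = new String(nameBytes).trim();   // strip NUL padding
                if (!name.startsWith("blk_") || name.endsWith(".meta")) {
                    continue;                                 // only block data files
                }
                long blockId = Long.parseLong(name.substring("blk_".length()));
                if ((mask & (CLib.IN_CREATE | CLib.IN_MOVED_TO)) != 0) {
                    receiver.blockAdded(blockId, new File(watchedDir, name));
                } else if ((mask & (CLib.IN_DELETE | CLib.IN_MOVED_FROM)) != 0) {
                    receiver.blockRemoved(blockId);
                }
            }
        }
    }
}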

The other thing is that the way the synchronized blocks on 
FSDataset.FSVolumeSet are held totally aggravates this bug in 0.18.1.

I have implemented a pure Java version of inotify using JNA 
(https://jna.dev.java.net/); there is a Windows equivalent also 
available, or some simple JNI could be written.

The jason@attributor.com address will be going away shortly; I will be 
switching to jason.hadoop@gmail.com in the next little bit.



Jason Venner wrote:
> The problem we are having is that datanodes periodically stall for 
> 10-15 minutes and drop off the active list and then come back.
>
> What is going on is that a long operation set is holding the lock on 
> on FSDataset.volumes, and all of the other block service requests 
> stall behind this lock.
>
> "DataNode: [/data/dfs-video-18/dfs/data]" daemon prio=10 
> tid=0x4d7ad400 nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
>   java.lang.Thread.State: RUNNABLE
>    at java.lang.String.lastIndexOf(String.java:1628)
>    at java.io.File.getName(File.java:399)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148) 
>
>    at 
> org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511) 
>
>    - locked <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>    at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
>    at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
>    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
>    at java.lang.Thread.run(Thread.java:619)
>
> This is basically taking a stat on every hdfs block on the datanode, 
> which in our case is ~ 2million, and can take 10+ minutes (we may be 
> experiencing problems with our raid controller but have no visibility 
> into it) at the OS level the file system seems fine and operations 
> eventually finish.
>
> It appears that a couple of different data structures are being locked 
> with the single object FSDataset$Volume.
>
> Then this happens:
> "org.apache.hadoop.dfs.DataNode$DataXceiver@1bcee17" daemon prio=10 
> tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry 
> [0x459fe000..0x459ff0d0]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>    at 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473) 
>
>    - waiting to lock <0x551e8d48> (a 
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
>    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
>    - locked <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
>    at 
> org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
>    at 
> org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
>    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
>    at java.lang.Thread.run(Thread.java:619)
>
> which locks the FSDataset while waiting on the volume object
>
> and now all of the Datanode operations stall waiting on the FSDataset 
> object.
> ----------
>
> Our particular installation doesn't use multiple directories for hdfs, 
> so a first simple hack for a local fix would be to modify 
> getNextVolume to just return the single volume and not be synchronized
>
> A richer alternative would be to make the locking more fine grained on 
> FSDataset$FSVolumeSet.
>
> Of course we are also trying to fix the file system performance and 
> dfs block loading that results in the block report taking a long time.
>
> Any suggestions or warnings?
>
> Thanks.
>
>
>