Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2009/01/10 01:21:59 UTC

[jira] Commented: (HBASE-24) Scaling: Too many open file handles to datanodes

    [ https://issues.apache.org/jira/browse/HBASE-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662570#action_12662570 ] 

stack commented on HBASE-24:
----------------------------

Here are related remarks made by Jean-Adrien over in hadoop.  Baseline is that we've been bandaging over this ugliness for a while now.  Time to address the pus.

{code}
xceiverCount limit reason

by Jean-Adrien Jan 08, 2009; 03:03am

Hello all,

I'm running HBase on top of Hadoop and I have some difficulty tuning the Hadoop configuration so that it works well with HBase.
My configuration is 4 desktop-class machines with 1 GB RAM each: 2 run a datanode/region server, 1 runs only a region server, and 1 runs the namenode/HBase master.

When I start HBase, about 300 regions must be loaded on 3 region servers, so a lot of concurrent accesses are made to Hadoop. My first problem, using the default configuration, was seeing too many of:
DataXceiver: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write.

I was wondering what the reason for such a timeout is. Where is the bottleneck? First I believed it was a network problem (I have 100 Mbit/s interfaces), but after monitoring the network it seems the load is low when it happens.
Anyway, I found the parameter
dfs.datanode.socket.write.timeout and I set it to 0 to disable the timeout.

Then I saw in the datanode logs
xceiverCount 256 exceeds the limit of concurrent xcievers 255
What exactly is the role of the xceivers? To receive replicated blocks and/or to receive files from clients?
When are their threads created? When do they end?

Anyway, I found the parameter
dfs.datanode.max.xcievers
I upped it to 511, then to 1023 and today to 2047; but my cluster is not so big (300 HBase regions, 200 GB including a replication factor of 2), and I'm not sure I can keep raising this limit for long. Moreover, it considerably increases the amount of virtual memory needed for the datanode JVM (about 2 GB now, only 500 MB of it heap). That leads to excessive swapping, and a new problem arises: some leases expire, and my entire cluster eventually fails.

Can I tune another parameter to avoid so many concurrent xceivers being created?
Could upping dfs.replication.interval help, for example?

Could the fact that I run the regionserver on the same machine as the datanode increase the number of xceivers? In that case I'll try a different layout, and use the network bottleneck to avoid stressing the datanodes.

Any clue about the inner workings of the hadoop xceivers would be appreciated.
Thanks.

-- Jean-Adrien
{code}
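
For reference, here is a rough sketch of the two knobs Jean-Adrien mentions. The property names are the ones quoted above; in practice they belong in hadoop-site.xml on each datanode (the datanode reads them at startup), so the programmatic form below is just illustration, and the values are the ones tried in the thread, not recommendations.

{code}
// Illustrative only: the two datanode settings discussed above.
// In a real deployment these go in hadoop-site.xml on the datanodes.
import org.apache.hadoop.conf.Configuration;

public class XceiverTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    // Cap on concurrent DataXceiver threads per datanode (the limit hit above was 255).
    conf.setInt("dfs.datanode.max.xcievers", 2047);
    // Socket write timeout in milliseconds; 480000 (8 minutes) is the default, 0 disables it.
    conf.setInt("dfs.datanode.socket.write.timeout", 0);
    return conf;
  }
}
{code}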


... and


{code}
Some more information about the case.

I read HADOOP-3633 / 3859 / 3831 in JIRA.
I run version 18.1 of Hadoop, therefore I have no fix for 3831.
Nevertheless my problem seems different.
The threads are created as soon as the client (HBase) requests data. The data
arrives at HBase without problem, but the threads never end. Looking at the
thread-count graph:

http://www.nabble.com/file/p21352818/launch_tests.png
(you might need to go to nabble to see the image:
http://www.nabble.com/xceiverCount-limit-reason-tp21349807p21349807.html)

In the graph, hadoop / HBase are run 3 times (A/B/C):
A:
I configured hadoop with dfs.datanode.max.xcievers=2023 and
dfs.datanode.socket.write.timeout=0.
As soon as I start HBase, the regions load their data from dfs and the number of
threads climbs to 1100 in about 2-3 minutes. Then it stays in that range.
All DataXceiver threads are in one of these two states:

"org.apache.hadoop.dfs.DataNode$DataXceiver@6a2f81" daemon prio=10
tid=0x08289c00 nid=0x6bb6 runnable [0x8f980000..0x8f981140]
  java.lang.Thread.State: RUNNABLE
       at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
       at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
       at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
       at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
       - locked <0x95838858> (a sun.nio.ch.Util$1)
       - locked <0x95838868> (a java.util.Collections$UnmodifiableSet)
       - locked <0x95838818> (a sun.nio.ch.EPollSelectorImpl)
       at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
       at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:260)
       at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
       at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
       at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
       at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
       at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
       - locked <0x95838b90> (a java.io.BufferedInputStream)
       at java.io.DataInputStream.readShort(DataInputStream.java:295)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1115)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
       at java.lang.Thread.run(Thread.java:619)

"org.apache.hadoop.dfs.DataNode$DataXceiver@1abf87e" daemon prio=10
tid=0x90bbd400 nid=0x61ae runnable [0x7b68a000..0x7b68afc0]
  java.lang.Thread.State: RUNNABLE
       at java.net.SocketInputStream.socketRead0(Native Method)
       at java.net.SocketInputStream.read(SocketInputStream.java:129)
       at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
       at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
       - locked <0x9671a8e0> (a java.io.BufferedInputStream)
       at java.io.DataInputStream.readShort(DataInputStream.java:295)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1115)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
       at java.lang.Thread.run(Thread.java:619)


B:
I changed the hadoop configuration, re-introducing the default 8-minute timeout.
Once again, as soon as HBase gets data from dfs, the number of threads grows to
1100. After 8 minutes the timeout fires, and they fail one after the other
with the exception:

2009-01-08 14:21:09,305 WARN org.apache.hadoop.dfs.DataNode: DatanodeRegistration(192.168.1.13:50010, storageID=DS-1681396969-127.0.1.1-50010-1227536709605, infoPort=50075, ipcPort=50020):Got exception while serving blk_-1718199459793984230_722338 to /192.168.1.13:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.13:50010 remote=/192.168.1.13:37462]
       at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
       at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
       at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
       at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
       at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
       at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
       at java.lang.Thread.run(Thread.java:619)

C:
During this third session I made the same run, but I stopped HBase before the
timeout fired. In this case, the threads ended correctly.

Is it the responsibility of the hadoop client to manage its connection pool
with the server? In that case, would the problem be an HBase problem?
Anyway, I have found my problem; it is not a matter of performance.

Thanks for your answers
Have a nice day.

-- Jean-Adrien
--
View this message in context: http://www.nabble.com/xceiverCount-limit-reason-tp21349807p21352818.html
{code}
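
To make the pattern above concrete, here is a hypothetical client-side snippet (not from Jean-Adrien's setup) that reproduces what he describes: streams are opened and read but never closed, so each one keeps a DataXceiver thread alive on the serving datanode until the write timeout, if enabled, kills it.

{code}
// Hypothetical reproduction sketch: hold many HDFS streams open without closing
// them and watch the datanode thread count climb, as in graphs A and B above.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HoldOpenStreams {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    List<FSDataInputStream> held = new ArrayList<FSDataInputStream>();
    for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
      if (stat.isDir()) continue;
      FSDataInputStream in = fs.open(stat.getPath());
      in.readByte();   // triggers a block read, starting a DataXceiver on the datanode
      held.add(in);    // never closed, so the datanode-side thread is not released
    }
    System.out.println("Holding " + held.size() + " open streams; check datanode thread count.");
    Thread.sleep(Long.MAX_VALUE);
  }
}
{code}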

And Raghu:

{code}
Jean-Adrien wrote:

    Is it the responsibility of the hadoop client to manage its connection pool
    with the server? In that case, would the problem be an HBase problem?
    Anyway, I have found my problem; it is not a matter of performance.


Essentially, yes. The client has to close the file to relinquish connections, if it is using the common read/write interface.

Currently, if a client keeps many hdfs files open, it results in many threads held at the DataNodes. As you noticed, the timeout at the DNs helps.

Various solutions are possible at different levels: application (hbase), client API, HDFS, etc. https://issues.apache.org/jira/browse/HADOOP-3856 is a proposal at the HDFS level.
{code}
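
A minimal sketch of Raghu's point from the client side: under the common read interface, it is the close() that relinquishes the datanode-side connection (and with it the DataXceiver thread), so reads should not leave streams dangling.

{code}
// Minimal sketch: read what you need, then close promptly so the datanode
// connection (and its DataXceiver thread) is relinquished.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloseToRelinquish {
  public static byte[] readHead(FileSystem fs, Path path, int len) throws IOException {
    byte[] buf = new byte[len];
    FSDataInputStream in = fs.open(path);
    try {
      in.readFully(0, buf);  // positioned read of the first len bytes
    } finally {
      in.close();            // this is what releases the datanode-side resources
    }
    return buf;
  }
}
{code}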

> Scaling: Too many open file handles to datanodes
> ------------------------------------------------
>
>                 Key: HBASE-24
>                 URL: https://issues.apache.org/jira/browse/HBASE-24
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.20.0
>
>
> We've been here before (HADOOP-2341).
> Today the Rapleaf folks gave me an lsof listing from a regionserver.  It had thousands of open sockets to datanodes, all in ESTABLISHED and CLOSE_WAIT state.  On average they seem to have about ten file descriptors/sockets open per region (they have 3 column families IIRC; each family can have between 1-5 or so mapfiles open -- 3 is the max, but while compacting we open a new one, etc.).
> They have thousands of regions.  400 regions -- ~100G, which is not that much -- take about 4k open file handles.
> If they want a regionserver to serve a decent disk's worth -- 300-400G -- then that's maybe 1600 regions... 16k file handles.  With more than just 3 column families, we are in danger of blowing out limits if they are 32k.
> We've been here before with HADOOP-2341.
> A dfsclient that used non-blocking i/o would help applications like hbase (the datanode doesn't have this problem as badly -- the CLOSE_WAIT sockets on the regionserver side, the bulk of the open fds in the rapleaf log, don't have a corresponding open resource on the datanode end).
> Could also just open mapfiles as needed, but that'd kill our random read performance and it's bad enough already.
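
As a rough idea of what "open mapfiles as needed" would look like, here is a hypothetical sketch (not HBase code) using the plain MapFile.Reader API: every lookup opens and closes the reader, which keeps the datanode handle count low but pays the open/index-read cost on each random get -- exactly the performance hit noted above.

{code}
// Hypothetical open-on-demand lookup: no long-lived readers, so no long-lived
// datanode sockets, at the cost of an open + index read per random get.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class OnDemandMapFileGet {
  public static Writable get(Configuration conf, String mapFileDir, Text key, Writable value)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir, conf);  // opens dfs streams
    try {
      return reader.get(key, value);  // seek via the index, then read the value
    } finally {
      reader.close();                 // releases the datanode connections immediately
    }
  }
}
{code}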

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.