Posted to user@hbase.apache.org by Xu-Feng Mao <m9...@gmail.com> on 2011/08/22 12:33:19 UTC

The number of fd and CLOSE_WAIT keep increasing.

Hi,

We are running the cdh3u0 hbase/hadoop suite on 28 nodes. Since last Friday, three of our
regionservers have had their numbers of open fds and CLOSE_WAIT sockets increasing steadily.

It looks like whenever lines such as

====
2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240. has too many store files; delaying flush up to 90000ms
2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2. has too many store files; delaying flush up to 90000ms
====

increase, the number of open fds and CLOSE_WAIT sockets increases accordingly.

We're not sure whether this is some kind of fd leak triggered under an unexpected
circumstance or along an exceptional code path.
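
For what it's worth, the classic shape of such a leak is a stream that is opened but never
closed when an exception is thrown partway through. Below is a minimal, hypothetical Java
sketch of that pattern and of the try/finally fix; it is only an illustration, not actual
HBase or HDFS code, and the class and method names are made up.

====
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LeakSketch {

    // Leaky version: if parse() throws, the fd held by 'in' is never released.
    static void readLeaky(String path) throws IOException {
        InputStream in = new FileInputStream(path);
        parse(in);      // an exception here skips the close() below
        in.close();
    }

    // Safe version: the fd is released on both the normal and the exceptional path.
    static void readSafely(String path) throws IOException {
        InputStream in = new FileInputStream(path);
        try {
            parse(in);
        } finally {
            in.close(); // always runs, even if parse() throws
        }
    }

    // Hypothetical placeholder for whatever work is done on the stream.
    static void parse(InputStream in) throws IOException {
        if (in.read() == -1) {
            throw new IOException("unexpected end of stream");
        }
    }

    public static void main(String[] args) throws IOException {
        readSafely(args[0]);
    }
}
====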

With netstat -lntp, we found that there are lots of connections like

====
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp       65      0 10.150.161.64:23241         10.150.161.64:50010         CLOSE_WAIT  27748/java
====
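
For reference, a similar count can be taken from inside a JVM by parsing /proc/net/tcp on
Linux, where a state column value of 08 means CLOSE_WAIT. This is a rough, Linux-only,
IPv4-only sketch written for this thread, not part of HBase or Hadoop:

====
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CloseWaitCounter {
    public static void main(String[] args) throws IOException {
        // /proc/net/tcp columns: sl local_address rem_address st ...
        // The 'st' column holds the TCP state in hex; 08 is CLOSE_WAIT.
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/tcp"));
        long closeWait = lines.stream()
                .skip(1)                                    // header line
                .map(l -> l.trim().split("\\s+"))
                .filter(f -> f.length > 3 && "08".equals(f[3]))
                .count();
        System.out.println("CLOSE_WAIT sockets: " + closeWait);
    }
}
====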

The connections stay in this state. It looks as if, for some connections to HDFS, the
datanode has already sent its FIN, but the regionserver is still sitting on unread data in
the recv queue and never closes its end of the socket, so the fds and CLOSE_WAIT sockets
are probably leaked.
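
That interpretation matches how CLOSE_WAIT works in general: the state shows up on the side
that has received a FIN but has not yet closed its own socket. The following toy Java
program (made up for this thread, not DFSClient code) reproduces the same state on
localhost:

====
import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // any free local port
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            Socket accepted = server.accept();

            // The "datanode" side closes first and sends a FIN ...
            accepted.close();

            // ... but the "client" side never calls client.close(), so its socket
            // sits in CLOSE_WAIT until the process exits. While this sleeps,
            // something like `netstat -ntp | grep CLOSE_WAIT` will show it.
            Thread.sleep(60000);
        }
    }
}
====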

We also see some logs like
====
2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.150.161.73:50010, add to deadNodes and continue
java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.150.161.64:55229, remote=/10.150.161.73:50010 for file /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241 for block 2791681537571770744_132142063
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
====

The number of these errors is much smaller than the number of "too many store files"
WARNs, so this is probably not the cause of the excessive fds, but is it dangerous to the
whole cluster?

Thanks and regards,

Mao Xu-Feng

Re: The number of fd and CLOSE_WAIT keep increasing.

Posted by Xu-Feng Mao <m9...@gmail.com>.
Thanks Harsh.

Mao Xu-Feng


Re: The number of fd and CLOSE_WAIT keep increasing.

Posted by Harsh J <ha...@cloudera.com>.
Yes, since it is a minor version update, you should be all set with a replacement of the packages and a restart of the nodes with the same configuration.

No additional procedure should generally be required when updating between dot versions, because compatibility is maintained :)


Re: The number of fd and CLOSE_WAIT keep increasing.

Posted by Xu-Feng Mao <m9...@gmail.com>.
Thanks Andy!

cdh3u1 is based on hbase 0.90.3, which has some nice admin scripts, like
graceful_stop.sh.
Is it easy to upgrade hbase from cdh3u0 to cdh3u1? I guess we can simply replace the
packages while keeping our own configuration, right?

Thanks and regards,

Mao Xu-Feng


Re: The number of fd and CLOSE_WAIT keep increasing.

Posted by Andrew Purtell <ap...@apache.org>.
> We are running cdh3u0 hbase/hadoop suites on 28 nodes. 


For your information, CDH3u1 does contain this fix:

  Author: Eli Collins <el...@cloudera.com>
  Date:   Tue Jul 5 16:02:22 2011 -0700

      HDFS-1836. Thousand of CLOSE_WAIT socket.

      Reason: Bug
      Author: Bharath Mundlapudi
      Ref: CDH-3200

Best regards,


   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)



Re: The number of fd and CLOSE_WAIT keep increasing.

Posted by Xu-Feng Mao <m9...@gmail.com>.
On average we have about 3000 CLOSE_WAIT sockets, while on the three problematic
regionservers we have about 30k.
We set the open files limit to 130k, so things still work for now, but it doesn't seem
healthy.
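
As a side note, a HotSpot JVM on Unix can report its own fd usage against that limit, which
makes this kind of creep easier to watch for. Here is a small sketch using
com.sun.management.UnixOperatingSystemMXBean (it assumes a Sun/Oracle JVM; this is not
something the regionserver logs out of the box):

====
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdUsage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // On a HotSpot JVM on Unix this bean exposes file descriptor counters.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount();
            long max = unix.getMaxFileDescriptorCount();
            System.out.printf("open fds: %d / limit: %d (%.1f%%)%n",
                    open, max, 100.0 * open / max);
        } else {
            System.out.println("fd counters not available on this JVM/OS");
        }
    }
}
====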
