Posted to common-dev@hadoop.apache.org by "Goel, Ankur" <an...@corp.aol.com> on 2008/09/26 13:29:31 UTC

IPC Client error | Too many files open

Hi Folks,

    We have developed a simple log writer in Java that is plugged into
Apache custom log and writes log entries directly to our Hadoop cluster
(50 machines, quad core, each with 16 GB RAM and an 800 GB hard disk; one
machine as a dedicated NameNode, another machine as JobTracker &
TaskTracker + DataNode).

There are around 8 Apache servers dumping logs into HDFS via our writer.
Everything was working fine, and we were getting around 15-20 MB of log
data per hour from each server.
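
For reference, a minimal sketch of what such a piped stdin-to-HDFS writer
can look like (class and path names are hypothetical; our actual writer is
not shown here):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads log lines piped in by Apache's CustomLog on stdin and appends
// them to a single file in HDFS.
public class StdinToHdfs {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path(args[0]));
        BufferedReader in =
            new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            out.write((line + "\n").getBytes());
        }
        out.close();
        fs.close();
    }
}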

 

Recently we have been experiencing problems with 2-3 of our Apache
servers, where a file is opened by the log-writer in HDFS for writing but
never receives any data.

Looking at the Apache error logs shows the following errors:

08/09/22 05:02:13 INFO ipc.Client: java.io.IOException: Too many open files
        at sun.nio.ch.IOUtil.initPipe(Native Method)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:312)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:227)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)

        ...

        ...

These are followed by connection errors saying

"Retrying to connect to server: hadoop-server.com:9000. Already tried
'n' times".

(same as above) ...

and the client retries constantly (the log-writer is set up so that it
waits and retries).

 

Running lsof on the log-writer Java process shows that it is holding a
large number of pipe/eventpoll descriptors and eventually ran out of file
handles. (Each java.nio epoll selector accounts for a wakeup pipe pair
plus an eventpoll descriptor, which matches the pipe/pipe/eventpoll
triplets below.)

Below is part of the lsof output:

 

lsof -p 2171
COMMAND  PID USER   FD   TYPE             DEVICE     SIZE     NODE NAME
....

....
java    2171 root   20r  FIFO                0,7          24090207 pipe
java    2171 root   21w  FIFO                0,7          24090207 pipe
java    2171 root   22r  0000                0,8        0 24090208 eventpoll
java    2171 root   23r  FIFO                0,7          23323281 pipe
java    2171 root   24r  FIFO                0,7          23331536 pipe
java    2171 root   25w  FIFO                0,7          23306764 pipe
java    2171 root   26r  0000                0,8        0 23306765 eventpoll
java    2171 root   27r  FIFO                0,7          23262160 pipe
java    2171 root   28w  FIFO                0,7          23262160 pipe
java    2171 root   29r  0000                0,8        0 23262161 eventpoll
java    2171 root   30w  FIFO                0,7          23299329 pipe
java    2171 root   31r  0000                0,8        0 23299330 eventpoll
java    2171 root   32w  FIFO                0,7          23331536 pipe
java    2171 root   33r  FIFO                0,7          23268961 pipe
java    2171 root   34w  FIFO                0,7          23268961 pipe
java    2171 root   35r  0000                0,8        0 23268962 eventpoll
java    2171 root   36w  FIFO                0,7          23314889 pipe

...

...

...
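
A simple Linux-only way to watch descriptor growth from inside the writer
(not something we run today, just a diagnostic sketch) is to count the
entries in /proc/self/fd:

import java.io.File;

// Counts this JVM's open file descriptors by listing /proc/self/fd.
// Useful from a periodic monitoring thread to catch a leak well before
// the per-process limit is reached.
public class FdCount {
    public static int openFds() {
        String[] fds = new File("/proc/self/fd").list();
        return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) {
        System.out.println("open fds: " + openFds());
    }
}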

What in the DFS client (if anything) could have caused this? Could it be
something else?

Is it a bad idea to use an HDFS writer to write logs directly from Apache
into HDFS?

Is 'Chukwa' (the Hadoop log collection and analysis framework contributed
by Yahoo!) a better fit for our case?

 

I would greatly appreciate help with any or all of the above questions.

 

Thanks and Regards

-Ankur


Re: IPC Client error | Too many files open

Posted by 何永强 <he...@software.ict.ac.cn>.
Try updating the JDK to 1.6; there is a bug in JDK 1.5's NIO.
On 2008-09-26, at 7:29 PM, Goel, Ankur wrote:

> Hi Folks,
>
>     We have developed a simple log writer in Java that is plugged into
> Apache custom log and writes log entries directly to our Hadoop cluster
> [...]


Re: IPC Client error | Too many files open

Posted by Ankur Goel <an...@corp.aol.com>.
We have only one thread per process doing HDFS I/O, and there are around
30 such processes, each running on a different machine. So there's a 1:1:1
mapping of machine:Apache:log-writer.
I think HADOOP-4346 explains a lot.

Thanks Raghu :-)

Raghu Angadi wrote:
> I filed https://issues.apache.org/jira/browse/HADOOP-4346 and it might
> explain what's happening here.
> [...]


Re: IPC Client error | Too many files open

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
I filed https://issues.apache.org/jira/browse/HADOOP-4346 and it might
explain what's happening here.

Raghu.

Raghu Angadi wrote:
> We are seeing a similar issue at Yahoo! as well. 'jmap -histo' and
> 'jmap -histo:live' are turning out to be pretty helpful. Stay tuned.
> [...]


Re: IPC Client error | Too many files open

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
We are seeing a similar issue at Yahoo! as well. 'jmap -histo' and 'jmap
-histo:live' are turning out to be pretty helpful. Stay tuned.

How many threads do you expect to be doing HDFS I/O in your case? Both
the max and the normal counts are helpful.

Thanks,
Raghu.

Goel, Ankur wrote:
> Hi Dhruba,
>          Thanks for the reply.
> 1. We are using Hadoop version 0.17.2.
> [...]

RE: IPC Client error | Too many files open

Posted by "Goel, Ankur" <an...@corp.aol.com>.
Hi Dhruba,
         Thanks for the reply.
1. We are using Hadoop version 0.17.2.
2. The max file-descriptor setting per process at the time the error
occurred was 1024. lsof -p <java-proc-id> confirms this, as the process
ran out of file handles after reaching the limit. Here is a snippet:
java    2171 root    0r  FIFO                0,7          23261756 pipe
java    2171 root    1w   CHR                1,3              2067 /dev/null
java    2171 root    2w  FIFO                0,7          23261747 pipe
..
..
java    2171 root 1006w  FIFO                0,7          26486656 pipe
java    2171 root 1007r  0000                0,8        0 26486657 eventpoll
java    2171 root 1008r  FIFO                0,7          26492141 pipe
java    2171 root 1009w  FIFO                0,7          26492141 pipe
java    2171 root 1010r  0000                0,8        0 26492142 eventpoll
java    2171 root 1011r  FIFO                0,7          26497184 pipe
java    2171 root 1012w  FIFO                0,7          26497184 pipe
java    2171 root 1013r  0000                0,8        0 26497185 eventpoll
java    2171 root 1014r  FIFO                0,7          26514795 pipe
java    2171 root 1015w  FIFO                0,7          26514795 pipe
java    2171 root 1016r  0000                0,8        0 26514796 eventpoll
java    2171 root 1017r  FIFO                0,7          26510109 pipe
java    2171 root 1018w  FIFO                0,7          26510109 pipe
java    2171 root 1019r  0000                0,8        0 26510110 eventpoll
java    2171 root 1020u  IPv6           27549169               TCP server.domain.com:46551->hadoop.aol.com:9000 (ESTABLISHED)
java    2171 root 1021r  FIFO                0,7          26527653 pipe
java    2171 root 1022w  FIFO                0,7          26527653 pipe
java    2171 root 1023u  IPv6           26527645               TCP server.domain.com:15245->hadoop.aol.com:9000 (CLOSE_WAIT)

We tried upping the limit and restarting the servers but the problem
recurred after 1-2 days.

3. Yes, there are multiple threads in the Apache server, and they are
created dynamically.
4. The Java log writer plugged into Apache's custom log closes the
current log file and opens a new one periodically. The log-writer has two
threads: one that writes data to an FSDataOutputStream, and another that
wakes up periodically to close the old stream and open a new one. I am
trying to see if this is the place where file handles could be leaking.
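
To illustrate where a handle could slip away, here is a minimal sketch of
the rotation step (hypothetical names; not our actual code). The crucial
detail is that the old stream must be closed before its reference is
dropped, otherwise its sockets and selector pipes are never released:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RotatingHdfsWriter {
    private final FileSystem fs;
    private FSDataOutputStream out;

    public RotatingHdfsWriter(Configuration conf) throws IOException {
        fs = FileSystem.get(conf);
        out = fs.create(newLogPath());
    }

    // The writer thread appends incoming log lines.
    public synchronized void write(String line) throws IOException {
        out.write((line + "\n").getBytes());
    }

    // The roller thread calls this periodically.
    public synchronized void rotate() throws IOException {
        FSDataOutputStream old = out;
        out = fs.create(newLogPath());
        old.close();   // skipping this leaks descriptors on every rotation
    }

    private Path newLogPath() {
        return new Path("/logs/access." + System.currentTimeMillis());
    }
}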

Another thing to note is that we have a signal handler implementation
that uses the sun.misc package. The signal handler is installed in the
Java processes to ensure that when Apache sends the Java process SIGTERM
or SIGINT, we close the file handles properly.
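
The handler is along these lines (a sketch, assuming the sun.misc.Signal
API available in Sun JDKs of this vintage; class and method names here
are hypothetical):

import java.io.Closeable;
import java.io.IOException;
import sun.misc.Signal;
import sun.misc.SignalHandler;

public class ShutdownSignals {
    // Installs handlers so that SIGTERM/SIGINT from Apache close the
    // HDFS output stream (or any other resource) before the JVM exits.
    public static void install(final Closeable resource) {
        SignalHandler handler = new SignalHandler() {
            public void handle(Signal sig) {
                try {
                    resource.close();
                } catch (IOException ignored) {
                    // best effort during shutdown
                }
                System.exit(0);
            }
        };
        Signal.handle(new Signal("TERM"), handler);
        Signal.handle(new Signal("INT"), handler);
    }
}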

I will do some more analysis of our code to find out whether it's an
issue in our code or in the HDFS client. If it turns out to be an HDFS
client issue, I'll move this discussion to a Hadoop JIRA.

Thanks and Regards
-Ankur


-----Original Message-----
From: Dhruba Borthakur [mailto:dhruba@gmail.com] 
Sent: Friday, September 26, 2008 10:30 PM
To: core-dev@hadoop.apache.org
Subject: Re: IPC Client error | Too many files open

[...]

Re: IPC Client error | Too many files open

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi Ankur,

1. Which version of Hadoop are you using?
2. What is the max file-descriptor setting on the Linux boxes on which
the Apache server is running?
3. When you do an lsof, how many descriptors are listed as "being open"?
4. Are there multiple threads in the Apache server that write logs?
Are these threads created dynamically?
5. Does one Apache server open an HDFS file and keep writing to it for
its entire lifetime? Or does it close it and open a new log file
periodically?

thanks,
dhruba
On Fri, Sep 26, 2008 at 8:40 AM, Raghu Angadi <ra...@yahoo-inc.com> wrote:
> What does jstack show for this?
>
> Probably better suited for a JIRA discussion.
> Raghu.
> [...]

Re: IPC Client error | Too many files open

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
What does jstack show for this?

Probably better suited for a JIRA discussion.
Raghu.
Goel, Ankur wrote:
> Hi Folks,
>
>     We have developed a simple log writer in Java that is plugged into
> Apache custom log and writes log entries directly to our Hadoop cluster
> [...]
