Posted to common-issues@hadoop.apache.org by "Andrew Ryan (JIRA)" <ji...@apache.org> on 2009/10/10 01:24:31 UTC

[jira] Created: (HADOOP-6308) make number of IPC accepts configurable

make number of IPC accepts configurable
---------------------------------------

                 Key: HADOOP-6308
                 URL: https://issues.apache.org/jira/browse/HADOOP-6308
             Project: Hadoop Common
          Issue Type: Improvement
          Components: ipc
    Affects Versions: 0.20.0
         Environment: Linux, running Yahoo-based 0.20
            Reporter: Andrew Ryan


We were recently seeing issues in our environments where HDFS clients would get RSTs from the NN when trying to do RPC to get file info, which caused the task to fatal out. After some debugging we traced this to the IPC server listen queue -- ipc.server.listen.queue.size -- being far too low: we had been using the default value of 128 and found we needed to bump it up to 10240 before the resets went away (although this value is a bit suspect, as explained later in the issue).

When a large map job starts, lots of clients very quickly issue RPC requests to the namenode, and the listen queue fills up because clients are opening connections faster than Hadoop's RPC server can process them. We went back to our 0.17 cluster, instrumented it with tcpdump, and found that we had been sending RSTs for a long time there too, but the retry handling was implemented differently in 0.17, so a single TCP failure wasn't task-fatal.

In our environment we have the TCP stack set to explicitly send resets when the listen queue overflows (sysctl net.ipv4.tcp_abort_on_overflow = 1); the default Linux behavior is to drop SYN packets and let the client retransmit. Other people may be experiencing this issue without noticing it, because with the default behavior the NN drops packets on the floor and clients silently retransmit.

So we've identified (at least) 3 improvements that can be made here:
1) In src/core/org/apache/hadoop/ipc/Server.java, Listener.doAccept() is currently hardcoded to do 10 accept()s at a time before it starts to read. It would be better to let the server be configured to do more than 10 accepts at a time via a configurable parameter. We can still leave 10 as the default.
2) Increase the default value of ipc.server.listen.queue.size from 128, or at least document that people with larger clusters starting thousands of mappers at once should increase this value. I wonder if a lot of people running larger clusters are dropping packets and don't realize it because TCP is covering for them. On one hand, yay TCP; on the other hand, those are needless delays and retries because the server can handle more connections.
3) Document that ipc.server.listen.queue.size may be capped at the value of SOMAXCONN (Linux sysctl net.core.somaxconn; default 4096 on our systems). The Java docs are not completely clear about this, and it's difficult to test because you can't query the backlog of a listening socket. We were under some time pressure in our case and tried 1024, which was not enough, and 10240, which worked, so we stuck with that.
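The change proposed in point 1 can be sketched roughly as follows. This is an illustration only, not the actual Hadoop code: the class and method names are hypothetical, and the divide-by-10 heuristic is the suggestion made later in this thread.

```java
import java.io.IOException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class AcceptBatchSketch {
    // Illustrative only: derive the per-iteration accept limit from the
    // configured listen queue size instead of hardcoding 10. Keeping 10
    // as the floor preserves the out-of-the-box behavior.
    static int acceptBatchSize(int listenQueueSize) {
        return Math.max(10, listenQueueSize / 10);
    }

    // Sketch of a doAccept() loop bounded by that limit: drain up to
    // 'batch' pending connections from a non-blocking server channel
    // before going back to servicing reads.
    static int doAccept(ServerSocketChannel server, int batch) throws IOException {
        int accepted = 0;
        SocketChannel channel;
        while (accepted < batch && (channel = server.accept()) != null) {
            channel.configureBlocking(false);
            // real code would register the channel with a read selector here
            accepted++;
        }
        return accepted;
    }
}
```

With the default ipc.server.listen.queue.size of 128 this yields 12 accepts per iteration (about the same as today); with 10240 it yields 1024, which drains a large backlog much faster.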


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6308) make number of IPC accepts configurable

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793292#action_12793292 ] 

dhruba borthakur commented on HADOOP-6308:
------------------------------------------

Can we please get a patch for this one posted to this JIRA?



[jira] Commented: (HADOOP-6308) make number of IPC accepts configurable

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796466#action_12796466 ] 

Hadoop QA commented on HADOOP-6308:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12429404/HADOOP-6308.patch
  against trunk revision 895831.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/35/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/35/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/35/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/35/console

This message is automatically generated.



[jira] Commented: (HADOOP-6308) make number of IPC accepts configurable

Posted by "Andrew Ryan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769349#action_12769349 ] 

Andrew Ryan commented on HADOOP-6308:
-------------------------------------

To follow up on Dhruba's comment, we've deployed the patch, which changes the hardcoded "10" to "ipc.server.listen.queue.size/10", on our JobTracker. This has eliminated the frequent TCP connection resets we were seeing before. We haven't deployed to our Namenode yet since we haven't done a DFS restart recently.

It's a one-line patch, but the area of code it's in can't readily be unit tested, other than generically by existing tests which instantiate a cluster and hit it with RPCs.



[jira] Commented: (HADOOP-6308) make number of IPC accepts configurable

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764253#action_12764253 ] 

dhruba borthakur commented on HADOOP-6308:
------------------------------------------

I like option 1: in src/core/org/apache/hadoop/ipc/Server.java, Listener.doAccept(), we can accept more than 10 connections (configurable) before doing a doRead().





[jira] Commented: (HADOOP-6308) make number of IPC accepts configurable

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928649#action_12928649 ] 

Hadoop QA commented on HADOOP-6308:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12429404/HADOOP-6308.patch
  against trunk revision 1031422.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/68//console

This message is automatically generated.



[jira] Updated: (HADOOP-6308) make number of IPC accepts configurable

Posted by "Andrew Ryan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Ryan updated HADOOP-6308:
--------------------------------

    Release Note: Increases the accept() rate on the IPC socket, to avoid TCP connection refused errors on busy namenodes and jobtrackers that get a lot of connections.
          Status: Patch Available  (was: Open)

This is a hard area of code to test: there aren't any unit tests right now, and the code is not easily exercised except by real load.

For the default out-of-the-box config (ipc.server.listen.queue.size == 128), this patch will produce a value of 12, which is about the same as now. But for sites which bump this up to 10000 or higher, doing 10 accept()s per iteration is not fast enough to drain the socket backlog.

We're still experimenting with optimal values. With this patch and ipc.server.listen.queue.size increased to 240, our connection reset issues went away on our JobTracker. But our namenode is still getting connection resets; we suspect we will have to bump up ipc.server.listen.queue.size (currently 10240) and net.core.somaxconn.
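The somaxconn interaction described in point 3 of the issue is worth restating with concrete numbers. The tiny helper below is a hypothetical model of the clamp, not real Hadoop or JDK code: the kernel silently caps a listen() backlog at net.core.somaxconn, and Java gives no API to read the effective value back.

```java
public class BacklogCapSketch {
    // Models the kernel clamping the backlog passed to listen(2) at
    // net.core.somaxconn. A Java ServerSocket offers no way to query
    // the effective backlog after the fact.
    static int effectiveBacklog(int requestedBacklog, int somaxconn) {
        return Math.min(requestedBacklog, somaxconn);
    }
}
```

So with net.core.somaxconn = 4096, requesting a backlog of 10240 may still only get you 4096 slots, which could explain why 1024 was not enough while 10240 "worked": the effective backlog jumped from 1024 to the 4096 cap.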



> make number of IPC accepts configurable
> ---------------------------------------
>
>                 Key: HADOOP-6308
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6308
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 0.20.0
>         Environment: Linux, running Yahoo-based 0.20
>            Reporter: Andrew Ryan
>         Attachments: HADOOP-6308.patch
>
>
> We were recently seeing issues in our environments where HDFS clients would experience RST's from the NN when trying to do RPC to get file info, which would cause the task to fatal out. After some debugging we identified this to be that the IPC server listen queue -- ipc.server.listen.queue.size -- was far too low, we had been using the default value of 128 and found we needed to bump it up to 10240 before resets went away (although this value is a bit suspect, as I will explain later in the issue).
> When a large map job starts, lots of clients very quickly start to issue RPC requests to the namenode, which creates this listen queue filling up problem, because clients are opening connections faster than Hadoop's RPC server can process them. We went back to our 0.17 cluster and instrumented that with tcpdump and found that we had been sending RST's for a long time there, but the retry handling was implemented differently back in 0.17 so a single TCP failure wasn't task-fatal.
> In our environment we have our TCP stack set to explicitly send resets when the listen queue gets overflowed (syctl net.ipv4.tcp_abort_on_overflow = 1), default linux behavior is to start dropping SYN packets and let the client retransmit. Other people may be experiencing this issue and not noticing it because they are using the default behavior, which is to let the NN drop packets on the floor and let clients retransmit.
> So we've identified (at least) 3 improvements that can be made here:
> 1) In src/core/org/apache/hadoop/ipc/Server.java, Listener.doAccept() is currently hardcoded to do 10 accept()s at a time before it starts to read. We feel it would be better to let the server be configured to do more than 10 accepts at a time via a configurable parameter, keeping 10 as the default.
> 2) Increase the default value of ipc.server.listen.queue.size from 128, or at least document that people with larger clusters starting thousands of mappers at once should increase this value. I suspect a lot of people running larger clusters are dropping packets and don't realize it because TCP is covering for them. On one hand, yay TCP; on the other hand, those are needless delays and retries because the server could handle more connections.
> 3) Document that ipc.server.listen.queue.size may be capped at the value of SOMAXCONN (Linux sysctl net.core.somaxconn; default 4096 on our systems). The Java docs are not completely clear about this, and it's difficult to test because you can't query the backlog of a listening socket. We were under some time pressure in our case and tried 1024, which was not enough, then 10240, which worked, so we stuck with that.
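To make point 1 concrete, here is a minimal sketch of what a batched, configurable accept loop looks like with NIO. This is not the actual Hadoop Listener.doAccept() code -- the class name, the maxAccepts parameter, and the demo harness are all hypothetical, shown only to illustrate how a batch cap on accepts per pass behaves.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class AcceptLoop {
    // Accept up to maxAccepts pending connections per call, instead of a
    // hardcoded batch of 10. Returns how many connections were accepted.
    static int doAccept(ServerSocketChannel server, int maxAccepts) throws IOException {
        int accepted = 0;
        SocketChannel channel;
        // On a non-blocking server channel, accept() returns null when the
        // listen queue has no more completed connections.
        while (accepted < maxAccepts && (channel = server.accept()) != null) {
            channel.configureBlocking(false);
            // A real server would hand the channel to a reader thread here;
            // this sketch just closes it.
            channel.close();
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        // The second bind argument is the listen backlog -- this is what
        // ipc.server.listen.queue.size feeds into (capped by somaxconn).
        server.bind(new InetSocketAddress("127.0.0.1", 0), 128);
        server.configureBlocking(false);

        // Five clients connect; with maxAccepts = 3 the server drains them
        // in batches of at most 3 per pass.
        SocketChannel[] clients = new SocketChannel[5];
        for (int i = 0; i < clients.length; i++) {
            clients[i] = SocketChannel.open(server.socket().getLocalSocketAddress());
        }
        int total = 0;
        while (total < clients.length) {
            int n = doAccept(server, 3);
            System.out.println("accepted " + n + " this pass");
            total += n;
        }
        for (SocketChannel c : clients) c.close();
        server.close();
    }
}
```

The tradeoff the batch cap mediates: a larger cap drains the listen queue faster (fewer overflows under connection storms), while a smaller cap keeps the listener from starving reads of already-accepted connections.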

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6308) make number of IPC accepts configurable

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769345#action_12769345 ] 

dhruba borthakur commented on HADOOP-6308:
------------------------------------------

Looks like it works better if we adopt Option 1, but instead of adding yet another configuration variable we can divide ipc.server.listen.queue.size by 10 and use that value to replace the hardcoded 10 in the loop in Listener.doAccept().
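A sketch of that derivation (the method name acceptsPerSelect is made up for illustration; the real change would replace the hardcoded 10 inside Listener.doAccept()):

```java
public class AcceptBatchSize {
    // Derive the per-select accept batch from the configured listen queue
    // size rather than adding a new configuration variable. With the
    // default queue size of 128 this yields 12; with 10240 it yields 1024.
    // The floor of 1 guards against tiny configured values.
    static int acceptsPerSelect(int listenQueueSize) {
        return Math.max(1, listenQueueSize / 10);
    }

    public static void main(String[] args) {
        System.out.println(acceptsPerSelect(128));   // default queue size
        System.out.println(acceptsPerSelect(10240)); // the value settled on above
    }
}
```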

> make number of IPC accepts configurable
> ---------------------------------------
>
>                 Key: HADOOP-6308
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6308
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 0.20.0
>         Environment: Linux, running Yahoo-based 0.20
>            Reporter: Andrew Ryan
>



[jira] Updated: (HADOOP-6308) make number of IPC accepts configurable

Posted by "Andrew Ryan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Ryan updated HADOOP-6308:
--------------------------------

    Attachment: HADOOP-6308.patch

Patch we are using in production.

