Posted to common-dev@hadoop.apache.org by "Raghu Angadi (JIRA)" <ji...@apache.org> on 2007/09/06 21:46:31 UTC

[jira] Created: (HADOOP-1849) IPC server max queue size should be configurable

IPC server max queue size should be configurable
------------------------------------------------

                 Key: HADOOP-1849
                 URL: https://issues.apache.org/jira/browse/HADOOP-1849
             Project: Hadoop
          Issue Type: Improvement
            Reporter: Raghu Angadi
             Fix For: 0.15.0



Currently the max queue size for the IPC server is set to (100 * handlers). Usually when RPC failures are observed (e.g. HADOOP-1763), we increase the number of handlers and the problem goes away. I think a big part of such a fix is the increase in max queue size. We should make the per-handler maxQsize configurable (with a bigger default than 100). There are other related improvements as well (HADOOP-1841).

The server keeps reading RPC requests from clients. When the number of in-flight RPCs is larger than maxQsize, the earliest RPCs are dropped. This is the main feedback the server gives its clients. I have often heard from users that Hadoop doesn't handle bursty traffic well.

Say the handler count is 10 (the default) and the server can handle 1000 RPCs a second (quite conservative/low for a typical server). With a queue of 100 * 10 = 1000 entries, an RPC can wait only about 1 second before it is dropped. If there are 3000 clients and all of them send RPCs around the same time (not very rare, with heartbeats etc.), 2000 will be dropped. Instead of dropping the earliest RPCs, if the server delayed reading new RPCs, the feedback to clients would be much smoother; I will file another jira regarding queue management.

For this jira I propose making the per-handler queue size configurable, with a larger default (maybe 500).
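
To make the proposal concrete, here is a minimal sketch of how such a knob could be wired up. The configuration key and factory class below are made up for illustration only; they are not existing Hadoop names.

    // Sketch only: read a per-handler queue size from the configuration and
    // size the call queue from it, instead of hard-coding 100 per handler.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import org.apache.hadoop.conf.Configuration;

    public class CallQueueFactory {
      // Hypothetical key; the proposed default is 500 calls per handler.
      static final String QUEUE_SIZE_KEY = "ipc.server.max.queue.size.per.handler";

      public static BlockingQueue<Object> newCallQueue(Configuration conf, int handlerCount) {
        int perHandler = conf.getInt(QUEUE_SIZE_KEY, 500);
        int maxQueueSize = perHandler * handlerCount;  // e.g. 500 * 10 = 5000 queued calls
        return new LinkedBlockingQueue<Object>(maxQueueSize);
      }
    }

A cluster that needs more burst headroom could then raise the value in its site configuration without also paying for extra handler threads.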


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-1849) IPC server max queue size should be configurable

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525519 ] 

Raghu Angadi commented on HADOOP-1849:
--------------------------------------

The server log for HADOOP-1763 would have been very useful here. As far as I remember, Dhruba looked for "dropping because max q reached" messages while working on scalability improvements for the Namenode; when those messages went away, that was a good indicator of improvement. With a large cluster this is pretty easy to test.

Yes, memory should also be a concern, though increasing the handler count brings the same increase in queue memory plus memory for each of the added threads (maybe 512k of virtual memory per thread). Datanode blockReports are one example where each RPC takes a lot of memory.
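
As a rough back-of-the-envelope comparison of the two options (illustrative numbers only: the ~10 KB average call size is an assumption, and the 512k-per-thread figure is the estimate from the comment above):

    // Sketch only: rough memory comparison of growing capacity by adding 10
    // handlers (threads plus their queue slots) versus only enlarging the queue.
    public class QueueMemoryEstimate {
      public static void main(String[] args) {
        double callBytes   = 10 * 1024;   // assumed average size of a queued RPC; blockReports can be much larger
        double threadBytes = 512 * 1024;  // ~512k virtual memory per handler thread
        int    extraSlots  = 10 * 100;    // 10 more handlers also add 10 * 100 queue slots

        double viaHandlers = 10 * threadBytes + extraSlots * callBytes;  // threads + queue growth
        double viaQueue    = extraSlots * callBytes;                     // queue growth alone
        System.out.printf("extra handlers: ~%.1f MB, bigger queue only: ~%.1f MB%n",
            viaHandlers / (1 << 20), viaQueue / (1 << 20));
      }
    }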



[jira] Commented: (HADOOP-1849) IPC server max queue size should be configurable

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525516 ] 

Owen O'Malley commented on HADOOP-1849:
---------------------------------------

The 100*handlers cap is just there as an upper bound on memory. Have you observed it actually triggering? (It is a different log message.) I have not. Timeouts, yes, but not the queue-length capping. I deliberately chose a very high upper bound to make sure it didn't happen randomly. I think introducing a config variable is a bad idea. Having more handlers does make a server more responsive under load, but I doubt it has anything to do with the queue length.



[jira] Updated: (HADOOP-1849) IPC server max queue size should be configurable

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-1849:
---------------------------------

    Fix Version/s:     (was: 0.15.0)


[jira] Commented: (HADOOP-1849) IPC server max queue size should be configurable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525528 ] 

Doug Cutting commented on HADOOP-1849:
--------------------------------------

If 500 proves a better value then, again, I would prefer we just change that constant for now rather than introduce a new parameter.



[jira] Updated: (HADOOP-1849) IPC server max queue size should be configurable

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-1849:
---------------------------------

    Component/s: ipc



[jira] Commented: (HADOOP-1849) IPC server max queue size should be configurable

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526517 ] 

Raghu Angadi commented on HADOOP-1849:
--------------------------------------


At least while testing, if this is configurable, it would be easy to ask users to experiment with different values.



[jira] Commented: (HADOOP-1849) IPC server max queue size should be configurable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525518 ] 

Doug Cutting commented on HADOOP-1849:
--------------------------------------

Let's experiment before we commit to a new parameter.  If increasing the per-handler queue constant to, e.g., 200 fixes things for now, that would be preferable.  HADOOP-1841 will alter the meaning of the handler count, and HADOOP-1850 will change the way it is set.  So it would be a mistake to base new parameters on it at this point.
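
For concreteness, the minimal change being suggested would look something like this; the constant and class names are illustrative, not necessarily what Server.java actually uses.

    // Sketch only: bump the existing per-handler constant rather than
    // introducing a new configuration parameter.
    class IpcQueueSizing {
      static final int MAX_QUEUE_SIZE_PER_HANDLER = 200;  // previously 100

      static int maxQueueSize(int handlerCount) {
        return handlerCount * MAX_QUEUE_SIZE_PER_HANDLER;  // 10 handlers -> 2000 queued calls
      }
    }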



Re: Use HDFS as a long term storage solution?

Posted by Lance Boomerang <lr...@boomerang.com>.
Are there any recent updates on dealing with the single point of failure 
in the single namenode?

Has anyone considered using DRBD 
<http://www.linux-ha.org/DataRedundancyByDrbd#head-820e05581351be70bb34b9ff88b6fbce6a83fba9> 
to replicate the FsImage and EditLog(s) and then use Linux HA heartbeat 
to bring up the redundant node in the event of failure?
Also, has anyone experimented with putting the alternate copies of the
FsImage and EditLog on a shared disk (NFS / NAS)?

Another big question: Has anybody tried using Hadoop / HDFS across
multiple geographic sites?

Thanks in advance,

Lance





[jira] Commented: (HADOOP-1849) IPC server max queue size should be configurable

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525508 ] 

Raghu Angadi commented on HADOOP-1849:
--------------------------------------

Also, this is quite easy to test with a large enough cluster, of course.



[jira] Commented: (HADOOP-1849) IPC server max queue size should be configurable

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526905 ] 

Christian Kunz commented on HADOOP-1849:
----------------------------------------

As part of HADOOP-1874 (a job running on a 1400-node cluster with 60 RPC handlers for both the namenode and the jobtracker), we see many call-queue overflows on both the namenode and the jobtracker, resulting in an escalation of lost tasktrackers.
