Posted to common-issues@hadoop.apache.org by "Erik Krogen (JIRA)" <ji...@apache.org> on 2019/02/08 17:31:00 UTC

[jira] [Commented] (HADOOP-9640) RPC Congestion Control with FairCallQueue

    [ https://issues.apache.org/jira/browse/HADOOP-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763772#comment-16763772 ] 

Erik Krogen commented on HADOOP-9640:
-------------------------------------

Hi [~linyiqun], I just uploaded some documentation at HADOOP-16097. Please take a look!

[~jojochuang], regarding your comment about cleaning up this JIRA, I am thinking of moving all unresolved subtasks out (I would consider all of the as-yet-unresolved tasks to be follow-ons) and closing this umbrella. Let me know your thoughts. Also, if you have some time to help review HADOOP-10286, it would be appreciated :)

> RPC Congestion Control with FairCallQueue
> -----------------------------------------
>
>                 Key: HADOOP-9640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9640
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 2.2.0, 3.0.0-alpha1
>            Reporter: Xiaobo Peng
>            Assignee: Chris Li
>            Priority: Major
>              Labels: hdfs, qos, rpc
>         Attachments: FairCallQueue-PerformanceOnCluster.pdf, MinorityMajorityPerformance.pdf, NN-denial-of-service-updated-plan.pdf, faircallqueue.patch, faircallqueue2.patch, faircallqueue3.patch, faircallqueue4.patch, faircallqueue5.patch, faircallqueue6.patch, faircallqueue7_with_runtime_swapping.patch, rpc-congestion-control-draft-plan.pdf
>
>
> For an easy-to-read summary see: http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/
> Several production Hadoop cluster incidents occurred where the NameNode was overloaded and failed to respond.
> We can improve quality of service for users during NameNode peak loads by replacing the FIFO call queue with a [Fair Call Queue|https://issues.apache.org/jira/secure/attachment/12616864/NN-denial-of-service-updated-plan.pdf]. (This plan supersedes rpc-congestion-control-draft-plan.)
> Excerpted from the communication of one incident, “The map task of a user was creating huge number of small files in the user directory. Due to the heavy load on NN, the JT also was unable to communicate with NN...The cluster became responsive only once the job was killed.”
> Excerpted from the communication of another incident, “Namenode was overloaded by GetBlockLocation requests (Correction: should be getFileInfo requests. the job had a bug that called getFileInfo for a nonexistent file in an endless loop). All other requests to namenode were also affected by this and hence all jobs slowed down. Cluster almost came to a grinding halt…Eventually killed jobtracker to kill all jobs that are running.”
> Excerpted from HDFS-945, “We've seen defective applications cause havoc on the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”
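
For readers unfamiliar with the idea in the quoted description, replacing a single FIFO call queue with a fair one, the following is a rough, self-contained sketch of the general technique. It is not the Hadoop FairCallQueue implementation: the class name, the two priority levels, the 4:1 weights, and the 50% "heavy caller" threshold are all assumptions made up for illustration. Calls from a caller who accounts for most of the observed traffic are placed in a lower-priority queue, and the queues are drained by weighted round-robin so that light callers are not stuck behind a flood.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy illustration only; NOT the actual Hadoop FairCallQueue. */
class FairQueueSketch {
    // Assumed weights: level 0 (light callers) gets up to 4 calls per turn, level 1 gets 1.
    private static final int[] WEIGHTS = {4, 1};
    // Assumed threshold: a caller with more than half of all observed calls is demoted.
    private static final double HEAVY_SHARE = 0.5;

    private final List<Queue<String>> levels = new ArrayList<>();
    private final Map<String, Integer> callsByUser = new HashMap<>();
    private int totalCalls = 0;
    private int currentLevel = 0;   // level currently being drained
    private int servedAtLevel = 0;  // calls served from it during this turn

    FairQueueSketch() {
        for (int i = 0; i < WEIGHTS.length; i++) {
            levels.add(new ArrayDeque<>());
        }
    }

    /** Enqueue a call, choosing the level from the caller's share of all calls seen so far. */
    void put(String user, String call) {
        int count = callsByUser.merge(user, 1, Integer::sum);
        totalCalls++;
        int level = ((double) count / totalCalls > HEAVY_SHARE) ? 1 : 0;
        levels.get(level).add(user + ":" + call);
    }

    /** Dequeue by weighted round-robin; returns null when every level is empty. */
    String take() {
        // One extra attempt lets us come back to the starting level with a fresh quota.
        for (int attempt = 0; attempt <= WEIGHTS.length; attempt++) {
            Queue<String> q = levels.get(currentLevel);
            if (!q.isEmpty() && servedAtLevel < WEIGHTS[currentLevel]) {
                servedAtLevel++;
                return q.poll();
            }
            // Level is empty or has used up its quota: move on to the next level.
            currentLevel = (currentLevel + 1) % WEIGHTS.length;
            servedAtLevel = 0;
        }
        return null;
    }
}
```

In this sketch, if one job enqueues ten calls and another user then enqueues a single call, the single call is served first, whereas a plain FIFO queue would serve it last. A real scheduler would also need to decay old call counts over time and handle concurrent access; both are omitted here.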


