Posted to common-issues@hadoop.apache.org by "Chris Li (JIRA)" <ji...@apache.org> on 2014/10/27 22:37:35 UTC

[jira] [Updated] (HADOOP-11238) Group cache expiry causes namenode slowdown

     [ https://issues.apache.org/jira/browse/HADOOP-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Li updated HADOOP-11238:
------------------------------
    Description: 
Our NameNode pauses for 12-60 seconds several times every hour or so. During these pauses, no new requests get through.

Around the time of the pauses, we see log messages such as:
2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential performance problem: getGroups(user=xxxxx) took 34507 milliseconds.

The current theory is:
1. Groups maintains a cache whose entries are refreshed periodically.
2. When the cache is cleared, a thundering herd of lookups overwhelms our LDAP servers (we are using ShellBasedUnixGroupsMapping with sssd; exactly how this happens has yet to be established).
3. Group resolution queries then begin to take longer: measuring with `time groups myself` in the shell, I've observed 1.2 seconds instead of the usual 0.01-0.03 seconds.
4. If there is mutual exclusion somewhere along this path, a 1-second lookup could snowball into a 60-second pause as all the handler threads compete for the same resource (a sketch follows this list). The exact cause hasn't been established.
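
To illustrate step 4, here is a minimal sketch, assuming (not verified against the real code) that the cache refresh path is serialized behind a single lock. The class and method names are invented; this is not the actual org.apache.hadoop.security.Groups implementation:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (NOT the actual org.apache.hadoop.security.Groups code)
// of how a single lock on the refresh path stalls every caller while one
// thread waits on a slow lookup.
public class NaiveGroupCache {

  private static class Entry {
    final String groups;
    final long expiresAt;
    Entry(String groups, long expiresAt) {
      this.groups = groups;
      this.expiresAt = expiresAt;
    }
  }

  private final Map<String, Entry> cache = new HashMap<String, Entry>();
  private final long ttlMs;

  public NaiveGroupCache(long ttlMs) {
    this.ttlMs = ttlMs;
  }

  // Every caller funnels through this one monitor. While one thread is
  // stuck in the (suddenly slow) LDAP/shell lookup, every other handler
  // thread blocks here: a 1.2 s lookup under load becomes a long pause.
  public synchronized String getGroups(String user) {
    Entry e = cache.get(user);
    long now = System.currentTimeMillis();
    if (e == null || now >= e.expiresAt) {
      String groups = slowLookup(user); // e.g. forks `id -Gn <user>`
      // Entries refreshed in the same burst get (nearly) the same expiry
      // instant, so the whole cache expires together again one TTL later:
      // the thundering herd repeats.
      e = new Entry(groups, now + ttlMs);
      cache.put(user, e);
    }
    return e.groups;
  }

  private String slowLookup(String user) {
    // Stand-in for ShellBasedUnixGroupsMapping -> sssd -> LDAP.
    return "users";
  }
}
{code}

Note the second problem the sketch shows: entries refreshed in the same burst share the same expiry instant, so the herd repeats every TTL.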

Potential solutions include:
1. Increasing the group cache time (hadoop.security.groups.cache.secs), which would make the issue less frequent.
2. Rolling evictions of the cache, i.e. staggering entry expiry, so we prevent the large spike in LDAP queries (see the sketch below).
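
One way to implement rolling evictions is to add random jitter to each entry's TTL so the whole cache never expires in the same instant. A hypothetical sketch (JitteredTtl, jitteredTtlMs, and jitterRatio are invented names, not existing Hadoop APIs):

{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for "rolling eviction": randomize each entry's TTL
// so refreshes are smeared over a window instead of landing all at once.
public final class JitteredTtl {

  private JitteredTtl() {}

  /**
   * @param baseTtlMs   configured cache lifetime, e.g. the value of
   *                    hadoop.security.groups.cache.secs * 1000
   * @param jitterRatio fraction of the TTL to randomize, e.g. 0.2
   * @return a per-entry TTL in [baseTtlMs, baseTtlMs * (1 + jitterRatio))
   */
  public static long jitteredTtlMs(long baseTtlMs, double jitterRatio) {
    long maxExtra = (long) (baseTtlMs * jitterRatio);
    long extra = maxExtra > 0
        ? ThreadLocalRandom.current().nextLong(maxExtra)
        : 0L;
    return baseTtlMs + extra;
  }
}
{code}

The cache sketched above would then store entries with now + JitteredTtl.jitteredTtlMs(ttlMs, 0.2) instead of now + ttlMs, smearing the refreshes (and the resulting LDAP load) across 20% of the TTL window.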



> Group cache expiry causes namenode slowdown
> -------------------------------------------
>
>                 Key: HADOOP-11238
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11238
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.5.1
>            Reporter: Chris Li
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)