Posted to user@ambari.apache.org by Schätzle Alexander <Al...@badenit.de> on 2018/03/29 09:16:29 UTC

Ambari Metrics not responding periodically

Hi,

We have a Kerberos-secured cluster and are currently facing issues with Ambari Metrics.
After starting Ambari Metrics everything is fine, but after a couple of days we get alerts from Ambari like this:

NameNode Service RPC Processing Latency (Hourly)
Unable to retrieve metrics from the Ambari Metrics service.

When I check the logs of the Metrics Collector, I find entries like:

2018-03-28 11:19:47,013 WARN org.apache.hadoop.security.UserGroupInformation: Exception encountered while running the renewal command for amshbase/s0202.cl.psiori.com@PSIORI.COM. (TGT end time:1522228847000, renewalFailures: org.apache.hadoop.metrics2.lib.MutableGaugeInt@388f50cd,renewalFailuresTotal: org.apache.hadoop.metrics2.lib.MutableGaugeLong@7d8dc9b8)
ExitCodeException exitCode=1: kinit: KDC can't fulfill requested option while renewing credentials

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:954)
        at org.apache.hadoop.util.Shell.run(Shell.java:855)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1163)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:1257)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:1239)
        at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:987)
        at java.lang.Thread.run(Thread.java:745)
2018-03-28 11:19:47,014 ERROR org.apache.hadoop.security.UserGroupInformation: TGT is expired. Aborting renew thread for amshbase/s0202.cl.psiori.com@PSIORI.COM.

Shortly afterwards I see aggregation errors:

2018-03-28 11:27:08,188 INFO TimelineClusterAggregatorMinute: Started Timeline aggregator thread @ Wed Mar 28 11:27:08 CEST 2018
2018-03-28 11:27:08,189 INFO TimelineClusterAggregatorMinute: Skipping aggregation function not owned by this instance.
2018-03-28 11:27:08,205 ERROR TimelineMetricHostAggregatorHourly: Exception during aggregating metrics.
java.sql.SQLTimeoutException: Operation timed out.
        at org.apache.phoenix.exception.SQLExceptionCode$14.newException(SQLExceptionCode.java:364)
        at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
        at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:831)

So this seems to be related to Kerberos. When I check the log of the KDC, there is not much information:

Mar 28 11:19:47 sql.cl.psiori.com krb5kdc[879](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 10.11.1.21: TICKET NOT RENEWABLE: authtime 0,  amshbase/s0202.cl.psiori.com@PSIORI.COM for krbtgt/PSIORI.COM@PSIORI.COM, KDC can't fulfill requested option
...
Mar 28 11:20:48 sql.cl.psiori.com krb5kdc[879](info): AS_REQ (4 etypes {18 17 16 23}) 10.11.1.21: ISSUE: authtime 1522228848, etypes {rep=18 tkt=18 ses=18}, amshbase/s0202.cl.psiori.com@PSIORI.COM for krbtgt/PSIORI.COM@PSIORI.COM
Mar 28 11:20:48 sql.cl.psiori.com krb5kdc[879](info): TGS_REQ (4 etypes {18 17 16 23}) 10.11.1.21: ISSUE: authtime 1522228848, etypes {rep=18 tkt=18 ses=18}, amshbase/s0202.cl.psiori.com@PSIORI.COM for nn/m0201.cl.psiori.com@PSIORI.COM
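
To line up the two logs, the epoch values can be decoded: the "TGT end time" in the collector warning is in milliseconds and the "authtime" in the KDC log is in seconds. A minimal Python sketch, assuming the hosts run on CEST (UTC+2), as the aggregator log timestamps suggest; the helper name epoch_to_local is only illustrative:

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time on the cluster is CEST (UTC+2).
CEST = timezone(timedelta(hours=2), "CEST")


def epoch_to_local(value, millis=False):
    """Convert an epoch timestamp taken from the logs to a CEST datetime."""
    seconds = value / 1000 if millis else value
    return datetime.fromtimestamp(seconds, tz=CEST)


# "TGT end time" from the Metrics Collector warning (milliseconds).
tgt_end = epoch_to_local(1522228847000, millis=True)
# "authtime" from the KDC AS_REQ line (seconds).
new_auth = epoch_to_local(1522228848)

print("TGT end time:", tgt_end.strftime("%Y-%m-%d %H:%M:%S %Z"))
print("new authtime:", new_auth.strftime("%Y-%m-%d %H:%M:%S %Z"))
```

So the old TGT ended at 11:20:47 local time, one minute after the failed renewal attempt, and the KDC issued a fresh ticket for the same principal one second later.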

When I check the principal amshbase/s0202.cl.psiori.com@PSIORI.COM in the KDC I get the following:

Principal: amshbase/s0202.cl.psiori.com@PSIORI.COM
Expiration date: [never]
Last password change: Mon Mar 19 11:24:05 CET 2018
Password expiration date: [never]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Mon Mar 19 11:24:05 CET 2018 (admin/admin@PSIORI.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 2
Key: vno 1, aes256-cts-hmac-sha1-96
Key: vno 1, aes128-cts-hmac-sha1-96
MKey: vno 1
Attributes:
Policy: [none]
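
The same check can be scripted for every principal. A rough Python sketch that reads the two lifetime fields from output in the format shown above; the function names parse_life and principal_limits are hypothetical, and the sample text is copied from the listing above:

```python
import re
from datetime import timedelta


def parse_life(text):
    """Parse a lifetime such as '1 day 00:00:00' or '0 days 00:00:00'."""
    match = re.fullmatch(r"(\d+) days? (\d+):(\d+):(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised lifetime: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def principal_limits(getprinc_output):
    """Extract the ticket and renewable lifetimes from the principal listing."""
    wanted = ("Maximum ticket life", "Maximum renewable life")
    limits = {}
    for line in getprinc_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            limits[key.strip()] = parse_life(value)
    return limits


sample = """\
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
"""

limits = principal_limits(sample)
renewable = limits["Maximum renewable life"] > timedelta(0)
print("ticket life:", limits["Maximum ticket life"])
print("renewable:", renewable)
```

For this principal the renewable life is zero, which matches the "TICKET NOT RENEWABLE" message in the KDC log.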

Is that normal? Maximum renewable life is set to 0, so ticket renewal is not possible. But that is also true for all other principals in the KDC, and all other services work normally.
This is the content of krb5.conf:

[libdefaults]
  renew_lifetime = 7d
  forwardable = true
  default_realm = PSIORI.COM
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
  default_ccache_name = /tmp/krb5cc_%{uid}
  #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
  #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

[domain_realm]
  .cl.psiori.com = PSIORI.COM
  cl.psiori.com = PSIORI.COM

[logging]
  default = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log
  kdc = FILE:/var/log/krb5kdc.log

[realms]
  PSIORI.COM = {
    admin_server = sql.cl.psiori.com
    kdc = sql.cl.psiori.com
  }

I have not applied any changes to the kdc.conf so it has the default content:

[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88

[realms]
EXAMPLE.COM = {
  #master_key_type = aes256-cts
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}

Is there any misconfiguration?
When I restart the service, everything is fine again (for some time).

Any suggestions or help would be very welcome.

Best regards,
Alex