Posted to user@ambari.apache.org by Schätzle Alexander <Al...@badenit.de> on 2018/03/29 09:16:29 UTC
Ambari Metrics not responding periodically
Hi,
We have a Kerberos-secured cluster and are currently facing issues with Ambari Metrics.
After starting Ambari Metrics everything is fine, but after a couple of days we get alerts from Ambari like this:
NameNode Service RPC Processing Latency (Hourly)
Unable to retrieve metrics from the Ambari Metrics service.
When I check the logs of the Metrics Collector, I find entries like:
2018-03-28 11:19:47,013 WARN org.apache.hadoop.security.UserGroupInformation: Exception encountered while running the renewal command for amshbase/s0202.cl.psiori.com@PSIORI.COM. (TGT end time:1522228847000, renewalFailures: org.apache.hadoop.metrics2.lib.MutableGaugeInt@388f50cd,renewalFailuresTotal: org.apache.hadoop.metrics2.lib.MutableGaugeLong@7d8dc9b8)
ExitCodeException exitCode=1: kinit: KDC can't fulfill requested option while renewing credentials
at org.apache.hadoop.util.Shell.runCommand(Shell.java:954)
at org.apache.hadoop.util.Shell.run(Shell.java:855)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1163)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1257)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1239)
at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:987)
at java.lang.Thread.run(Thread.java:745)
2018-03-28 11:19:47,014 ERROR org.apache.hadoop.security.UserGroupInformation: TGT is expired. Aborting renew thread for amshbase/s0202.cl.psiori.com@PSIORI.COM.
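For reference, the "TGT end time" in the warning above looks like an epoch timestamp in milliseconds (that is an assumption based on its magnitude, not something the log states). A minimal Python sketch to make it readable; the helper name tgt_end_time_utc is hypothetical:

```python
from datetime import datetime, timezone


def tgt_end_time_utc(end_time_ms: int) -> datetime:
    """Convert an epoch-milliseconds value to a timezone-aware UTC datetime.

    Assumption: the value logged as "TGT end time" is milliseconds since
    the Unix epoch.
    """
    return datetime.fromtimestamp(end_time_ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Value copied from the WARN line above.
    print(tgt_end_time_utc(1522228847000).isoformat())
    # prints 2018-03-28T09:20:47+00:00
```

In CEST (UTC+2) that is 11:20:47, i.e. one minute after the failed renewal attempt logged at 11:19:47.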
After that, I see aggregation errors:
2018-03-28 11:27:08,188 INFO TimelineClusterAggregatorMinute: Started Timeline aggregator thread @ Wed Mar 28 11:27:08 CEST 2018
2018-03-28 11:27:08,189 INFO TimelineClusterAggregatorMinute: Skipping aggregation function not owned by this instance.
2018-03-28 11:27:08,205 ERROR TimelineMetricHostAggregatorHourly: Exception during aggregating metrics.
java.sql.SQLTimeoutException: Operation timed out.
at org.apache.phoenix.exception.SQLExceptionCode$14.newException(SQLExceptionCode.java:364)
at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:831)
So this seems to be related to Kerberos. When I check the log of the KDC, there is not much information:
Mar 28 11:19:47 sql.cl.psiori.com krb5kdc[879](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 10.11.1.21: TICKET NOT RENEWABLE: authtime 0, amshbase/s0202.cl.psiori.com@PSIORI.COM for krbtgt/PSIORI.COM@PSIORI.COM, KDC can't fulfill requested option
...
Mar 28 11:20:48 sql.cl.psiori.com krb5kdc[879](info): AS_REQ (4 etypes {18 17 16 23}) 10.11.1.21: ISSUE: authtime 1522228848, etypes {rep=18 tkt=18 ses=18}, amshbase/s0202.cl.psiori.com@PSIORI.COM for krbtgt/PSIORI.COM@PSIORI.COM
Mar 28 11:20:48 sql.cl.psiori.com krb5kdc[879](info): TGS_REQ (4 etypes {18 17 16 23}) 10.11.1.21: ISSUE: authtime 1522228848, etypes {rep=18 tkt=18 ses=18}, amshbase/s0202.cl.psiori.com@PSIORI.COM for nn/m0201.cl.psiori.com@PSIORI.COM
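To find out which principals run into this, the KDC log can be scanned for rejected renewals. The following is only a sketch: the regular expression is derived from the single "TICKET NOT RENEWABLE" line quoted above, and the names NOT_RENEWABLE and find_not_renewable are hypothetical.

```python
import re

# Pattern modelled on the one rejected TGS_REQ line shown above; other
# krb5kdc versions may format the line differently.
NOT_RENEWABLE = re.compile(
    r"TGS_REQ .*?TICKET NOT RENEWABLE: authtime \d+,\s*"
    r"(?P<client>\S+) for (?P<service>[^,\s]+)"
)


def find_not_renewable(lines):
    """Return (client, service) pairs for every rejected renewal request."""
    hits = []
    for line in lines:
        match = NOT_RENEWABLE.search(line)
        if match:
            hits.append((match.group("client"), match.group("service")))
    return hits


if __name__ == "__main__":
    sample = [
        "Mar 28 11:19:47 sql.cl.psiori.com krb5kdc[879](info): TGS_REQ "
        "(8 etypes {18 17 20 19 16 23 25 26}) 10.11.1.21: TICKET NOT RENEWABLE: "
        "authtime 0, amshbase/s0202.cl.psiori.com@PSIORI.COM for "
        "krbtgt/PSIORI.COM@PSIORI.COM, KDC can't fulfill requested option",
    ]
    for client, service in find_not_renewable(sample):
        print(client, "->", service)
```

In practice the lines would come from the file configured under [logging] in krb5.conf rather than from an inline sample.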
When I check the principal amshbase/s0202.cl.psiori.com@PSIORI.COM in the KDC I get the following:
Principal: amshbase/s0202.cl.psiori.com@PSIORI.COM
Expiration date: [never]
Last password change: Mo Mär 19 11:24:05 CET 2018
Password expiration date: [never]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Mo Mär 19 11:24:05 CET 2018 (admin/admin@PSIORI.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 2
Key: vno 1, aes256-cts-hmac-sha1-96
Key: vno 1, aes128-cts-hmac-sha1-96
MKey: vno 1
Attributes:
Policy: [none]
Is that normal? Maximum renewable life is set to 0, so ticket renewal is not possible. But that is also true for all other principals in the KDC, and all other services work normally.
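The check described above (a maximum renewable life of zero means the ticket cannot be renewed) can be sketched in Python. This assumes the principal details are available as plain text in the format shown above; the helper names parse_duration_field and is_renewable are hypothetical.

```python
import re
from datetime import timedelta

# Matches durations as printed above, e.g. "1 day 00:00:00" or "0 days 00:00:00".
DURATION = re.compile(r"(\d+) days? (\d{2}):(\d{2}):(\d{2})")


def parse_duration_field(principal_text: str, field: str) -> timedelta:
    """Extract a duration field such as "Maximum renewable life"."""
    for line in principal_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == field:
            match = DURATION.fullmatch(value.strip())
            if match is None:
                raise ValueError(f"unexpected duration format: {value.strip()!r}")
            days, hours, minutes, seconds = map(int, match.groups())
            return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    raise KeyError(field)


def is_renewable(principal_text: str) -> bool:
    """True if the principal's maximum renewable life is greater than zero."""
    return parse_duration_field(principal_text, "Maximum renewable life") > timedelta(0)


if __name__ == "__main__":
    sample = (
        "Maximum ticket life: 1 day 00:00:00\n"
        "Maximum renewable life: 0 days 00:00:00\n"
    )
    print(is_renewable(sample))  # prints False for the values shown above
```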
This is the content of krb5.conf:
[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = PSIORI.COM
ticket_lifetime = 24h
dns_lookup_realm = false
dns_lookup_kdc = false
default_ccache_name = /tmp/krb5cc_%{uid}
#default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
#default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
[domain_realm]
.cl.psiori.com = PSIORI.COM
cl.psiori.com = PSIORI.COM
[logging]
default = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
kdc = FILE:/var/log/krb5kdc.log
[realms]
PSIORI.COM = {
admin_server = sql.cl.psiori.com
kdc = sql.cl.psiori.com
}
I have not changed kdc.conf, so it still has the default content:
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
EXAMPLE.COM = {
#master_key_type = aes256-cts
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}
Is there any misconfiguration?
When I restart the service, everything is fine again (for some time).
Any suggestions or help are very welcome.
Best regards,
Alex