Posted to user@hbase.apache.org by Wei Tan <wt...@us.ibm.com> on 2014/03/10 22:26:24 UTC
Occasional GSSException that brings down region server
Hi,
We are running an HBase cluster with Kerberos enabled, on the following
versions:
HBase: 0.96.1.1
ZooKeeper: 3.4.5
Hadoop: 1.1.1
We put data into HBase continuously, and every several hours the error
below appears on a random region server, which then kills itself.
ERROR:
2014-02-28 09:32:39,755 ERROR [hconnection-0x116987ad-shared--pool1378-t9]
security.UserGroupInformation: PriviledgedActionException
as:XXXXXXXX@DOMAIN cause:javax.security.sasl.SaslException: GSS initiate
failed [Caused by GSSException: No valid credentials provided (Mechanism
level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
We also tried multiple versions of the KDC, all the way up to the latest,
1.12.1, and still see this error. What is weird is that most puts are
processed successfully until this error occurs and kills the RS.
Thanks,
Wei
---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan
Re: Occasional GSSException that brings down region server
Posted by Bharath Vissapragada <bh...@cloudera.com>.
Hey Wei,
Can you try adding "-Dsun.security.krb5.debug=true" to the region server JVM
opts and see if it prints something before the crash?
- Bharath
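For a JVM that you start from your own code (for example a client that issues the puts), the same switch can be set programmatically. The sketch below is a minimal illustration, not something posted in this thread: `KerberosDebug` is a hypothetical helper name, and it assumes the property is set before any Kerberos class has been loaded, since the JDK may read it only once. For the region server itself, the command-line flag above is the simpler route.

```java
// Sketch only: KerberosDebug is a hypothetical helper, not part of HBase or Hadoop.
final class KerberosDebug {
    /** JDK system property that turns on Kerberos debug tracing. */
    static final String KRB5_DEBUG = "sun.security.krb5.debug";

    /** Sets the flag; call this before any Kerberos or SASL activity in the JVM. */
    static void enable() {
        System.setProperty(KRB5_DEBUG, "true");
    }

    /** Reports whether the flag is currently set to "true". */
    static boolean isEnabled() {
        return Boolean.getBoolean(KRB5_DEBUG);
    }

    public static void main(String[] args) {
        enable();
        System.out.println(KRB5_DEBUG + " enabled: " + isEnabled());
    }
}
```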
On Tue, Mar 11, 2014 at 6:35 PM, Wei Tan <wt...@us.ibm.com> wrote:
> Thanks Ted. Yes, our team looked at the doc you pointed out. The key here
> is "every several hours", so we can rule out:
> 1) an invalid Kerberos ticket: klist shows a valid ticket;
> 2) the causes in [0]: it does not have our error message, and our password /
> keytab / clocks / realm are not incorrect. All the errors on that page seem
> to be for "does not work at all" conditions, not for "fails after a random,
> long amount of time";
> 3) the "problematic combination of components": we don't have the one
> listed, and again, that is a work / no work dichotomy.
>
>
> Thanks,
> Wei
--
Bharath Vissapragada
<http://www.cloudera.com>
Re: Occasional GSSException that brings down region server
Posted by Wei Tan <wt...@us.ibm.com>.
Thanks Ted. Yes, our team looked at the doc you pointed out. The key here
is "every several hours", so we can rule out:
1) an invalid Kerberos ticket: klist shows a valid ticket;
2) the causes in [0]: it does not have our error message, and our password /
keytab / clocks / realm are not incorrect. All the errors on that page seem
to be for "does not work at all" conditions, not for "fails after a random,
long amount of time";
3) the "problematic combination of components": we don't have the one
listed, and again, that is a work / no work dichotomy.
Thanks,
Wei
From: Ted Yu <yu...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>,
Date: 03/10/2014 05:31 PM
Subject: Re: Occasional GSSException that brings down region server
Have you looked at
http://hbase.apache.org/book.html#trouble.client.security.rpc ?
Re: Occasional GSSException that brings down region server
Posted by Ted Yu <yu...@gmail.com>.
Have you looked at
http://hbase.apache.org/book.html#trouble.client.security.rpc ?
Re: Occasional GSSException that brings down region server
Posted by wuzesheng <wu...@vip.qq.com>.
Hi Andrew,
We use a keytab file for HBase and there is no ticket cache file under /tmp,
but we still encounter the same error as above. What about this case?
Re: Occasional GSSException that brings down region server
Posted by Stack <st...@duboce.net>.
On Fri, Mar 14, 2014 at 2:55 PM, Wei Tan <wt...@us.ibm.com> wrote:
> The comment at this link [0] is why they initially went down the BAD path:
> "Instances of this class can be constructed directly but it is
> encouraged that users get instances via HConnection and HConnectionManager.
> See HConnectionManager class comment for an example."
>
>
I made it a blocker to change the wording (HBASE-10757). We want folks to
go via HCM ALWAYS from here on out.
Thanks for reporting back Wei Tan,
St.Ack
Re: Occasional GSSException that brings down region server
Posted by Wei Tan <wt...@us.ibm.com>.
Hi All, we seem to have overcome this occasional exception by instantiating
HTable directly instead of obtaining it through HConnectionManager.
OLD BAD:
connection = HConnectionManager.createConnection(config);
protected HTableInterface getHTable(String tableName) throws IOException
{
    // Connection might have been closed since we acquired it...create a
    // new one if this is the case
    if (connection == null || connection.isClosed())
    {
        log.info("HConnection null or closed...reopening");
        connection = HConnectionManager.createConnection(config);
    }
    return connection.getTable(tableName);
}
NEW GOOD:
public HTableInterface getHTable(byte[] tableName) throws IOException
{
    return new HTable(conf, tableName);
}
The comment at this link [0] is why they initially went down the BAD path:
"Instances of this class can be constructed directly but it is
encouraged that users get instances via HConnection and HConnectionManager.
See HConnectionManager class comment for an example."
[0]
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
We have run the workload for a few days and it seems fine. We are not sure why
this occurs, but somebody following this thread may have a better idea.
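One detail of the OLD snippet as posted: the null-or-closed check and the reopen are not visibly synchronized, so two threads calling getHTable at the same time could each create a connection. Nothing in this thread establishes that as the cause of the GSSException. Purely as an illustration, a guarded reopen might look like the sketch below; `Conn` and `SharedConnection` are hypothetical stand-ins, not HBase classes, and the factory is assumed to wrap whatever call creates the real connection.

```java
import java.util.function.Supplier;

// Sketch only: Conn and SharedConnection are hypothetical stand-ins, not HBase classes.
interface Conn {
    boolean isClosed();
}

final class SharedConnection<C extends Conn> {
    private final Supplier<C> factory; // assumed to wrap the real "create connection" call
    private C connection;              // guarded by "this"

    SharedConnection(Supplier<C> factory) {
        this.factory = factory;
    }

    /** Returns the live connection, reopening it under the lock if it is missing or closed. */
    synchronized C get() {
        if (connection == null || connection.isClosed()) {
            connection = factory.get();
        }
        return connection;
    }
}
```

Callers would then ask the holder for the connection on every use instead of caching it in a field of their own.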
Thanks for all your help!
Best regards,
Wei
From: Zesheng Wu <wu...@gmail.com>
To: user@hbase.apache.org,
Date: 03/13/2014 09:33 PM
Subject: Re: Occasional GSSException that brings down region server
Thank you all the same :)
Re: Occasional GSSException that brings down region server
Posted by Zesheng Wu <wu...@gmail.com>.
Thank you all the same :)
2014-03-13 22:50 GMT+08:00 Andrew Purtell <ap...@apache.org>:
> Then nothing immediately comes to mind, sorry.
--
Best Wishes!
Yours, Zesheng
Re: Occasional GSSException that brings down region server
Posted by Andrew Purtell <ap...@apache.org>.
Then nothing immediately comes to mind, sorry.
On Wednesday, March 12, 2014, Zesheng Wu <wu...@gmail.com> wrote:
> Hi Andrew,
>
> We use a keytab file for HBase and there is no ticket cache file under /tmp,
> but we still encounter the same error as above. What about this case?
--
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
Re: Occasional GSSException that brings down region server
Posted by Zesheng Wu <wu...@gmail.com>.
Hi Andrew,
We use a keytab file for HBase and there is no ticket cache file under /tmp,
but we still encounter the same error as above. What about this case?
2014-03-12 7:37 GMT+08:00 Andrew Purtell <ap...@apache.org>:
> If you might have more than one security enabled Java process running under
> the same UNIX user on the servers, then this and other weirdness can happen
> because they share the same ticket cache file in /tmp. Someone does a
> (re)login and another process concurrently accessing the cache gets garbage
> or unexpected state.
--
Best Wishes!
Yours, Zesheng
Re: Occasional GSSException that brings down region server
Posted by Andrew Purtell <ap...@apache.org>.
If you might have more than one security enabled Java process running under
the same UNIX user on the servers, then this and other weirdness can happen
because they share the same ticket cache file in /tmp. Someone does a
(re)login and another process concurrently accessing the cache gets garbage
or unexpected state.
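A quick way to check the shared-cache theory on a given host is to look for credential cache files in /tmp. The sketch below assumes the usual MIT Kerberos file naming (`krb5cc_<uid>`) and the `KRB5CCNAME` override; `TicketCacheCheck` is a hypothetical helper, not an HBase or Hadoop class. Finding a file only shows that a cache exists, not that any process is actually using it.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch only: TicketCacheCheck is a hypothetical helper, not an HBase or Hadoop API.
final class TicketCacheCheck {
    /** Returns the entries in dir whose names start with "krb5cc_". */
    static List<Path> findTicketCaches(Path dir) throws IOException {
        List<Path> caches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "krb5cc_*")) {
            for (Path p : stream) {
                caches.add(p);
            }
        }
        return caches;
    }

    public static void main(String[] args) throws IOException {
        // KRB5CCNAME, when set, points at a non-default cache location.
        System.out.println("KRB5CCNAME=" + System.getenv("KRB5CCNAME"));
        for (Path p : findTicketCaches(Paths.get("/tmp"))) {
            System.out.println(p + " owner=" + Files.getOwner(p));
        }
    }
}
```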
--
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)