Posted to user@hbase.apache.org by Wei Tan <wt...@us.ibm.com> on 2014/03/10 22:26:24 UTC

Occasional GSSException that brings down region server

Hi,

  We are running an HBase cluster with Kerberos enabled, using these
versions:
HBase: 0.96.1.1
ZooKeeper: 3.4.5
Hadoop: 1.1.1


We put data into HBase continuously, and every several hours the error
below appears on a random region server, which then kills itself.

ERROR:
2014-02-28 09:32:39,755 ERROR [hconnection-0x116987ad-shared--pool1378-t9] 
security.UserGroupInformation: PriviledgedActionException 
as:XXXXXXXX@DOMAIN cause:javax.security.sasl.SaslException: GSS initiate 
failed [Caused by GSSException: No valid credentials provided (Mechanism 
level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
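
For anyone who wants to watch for this failure in logs, here is a minimal sketch that pulls the Kerberos error code out of a message shaped like the one above. The number in parentheses is the Kerberos protocol error code; 35 is KRB_AP_ERR_NOT_US in RFC 4120. The class KrbErrorCode and its parse method are hypothetical helpers, not part of HBase or Hadoop, and the regex only assumes the "Mechanism level: <text> (<code>)" shape seen in this one log line.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an HBase/Hadoop API): extracts the Kerberos
// error code from a GSSException message shaped like the log line above.
class KrbErrorCode {
    // Matches "Mechanism level: <text> (<digits>)" and captures the digits.
    // \s+ and DOTALL tolerate the line wrapping seen in mail and log output;
    // the 1-9 digit limit keeps Integer.parseInt from overflowing.
    private static final Pattern MECH_LEVEL = Pattern.compile(
            "Mechanism\\s+level:\\s+.*?\\((\\d{1,9})\\)", Pattern.DOTALL);

    static OptionalInt parse(String logMessage) {
        Matcher m = MECH_LEVEL.matcher(logMessage);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        String msg = "GSS initiate failed [Caused by GSSException: No valid "
                + "credentials provided (Mechanism level: The ticket isn't for "
                + "us (35) - BAD TGS SERVER NAME)]";
        System.out.println(parse(msg));
    }
}
```

A monitor could count how often a given code shows up per region server; that is a design choice for the reader, not something this thread prescribes.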



We also tried multiple versions of the KDC, all the way up to the latest,
1.12.1, and still see this error. What is weird is that most puts are
processed successfully until this error occurs and kills the RS.

Thanks,
Wei
---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

Re: Occasional GSSException that brings down region server

Posted by Bharath Vissapragada <bh...@cloudera.com>.

Hey Wei,

Can you try adding "-Dsun.security.krb5.debug=true" to the regionserver JVM
opts and see if it prints something before the crash?
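
One way to confirm the flag actually reached a JVM is to read it back as a system property. The sketch below is a standalone illustration, not region server code; the class name Krb5DebugFlag is made up for the example, and sun.security.krb5.debug is the property named in the option above.

```java
// Hypothetical standalone check (not part of HBase): reports whether the JVM
// was started with -Dsun.security.krb5.debug=true.
class Krb5DebugFlag {
    static final String PROPERTY = "sun.security.krb5.debug";

    // Boolean.getBoolean returns true only if the system property exists
    // and equals "true" (case-insensitive).
    static boolean enabled() {
        return Boolean.getBoolean(PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + " enabled: " + enabled());
    }
}
```

Running it as "java -Dsun.security.krb5.debug=true Krb5DebugFlag" exercises the same option outside the region server.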

- Bharath


On Tue, Mar 11, 2014 at 6:35 PM, Wei Tan <wt...@us.ibm.com> wrote:

> Thanks Ted. Yes our team looked at the doc you pointed out and:
>
> The key here is "every several hours" - so we can rule out 1) valid
> kerberos ticket ~ klist shows a valid ticket
> , 2) [0] does not have our error message ~ link password / keytab / clocks
> / realm is not incorrect ~ all these errors on this page seem to be for
> "does not work at all" conditions... not a "fails every randomly long
> amount of time"
> 3) we don't have this "problematic combination of components" listed...
> but again - this is a work / no work dichotomy...


-- 
Bharath Vissapragada
<http://www.cloudera.com>

Re: Occasional GSSException that brings down region server

Posted by Wei Tan <wt...@us.ibm.com>.

Thanks Ted. Yes, our team looked at the doc you pointed out.

The key here is "every several hours", so we can rule out the following:
1) An invalid Kerberos ticket: klist shows a valid ticket.
2) The page you linked does not have our error message, and our password,
keytab, clocks, and realm are not incorrect. All the errors on that page
seem to be for "does not work at all" conditions, not for a "fails after a
random, long amount of time" condition.
3) We don't have the "problematic combination of components" listed there,
but again, that is a work / no-work dichotomy.

Thanks,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

From:   Ted Yu <yu...@gmail.com>
To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
Date:   03/10/2014 05:31 PM
Subject:        Re: Occasional GSSException that brings down region server

Have you looked at
http://hbase.apache.org/book.html#trouble.client.security.rpc ?

Re: Occasional GSSException that brings down region server

Posted by Ted Yu <yu...@gmail.com>.

Have you looked at
http://hbase.apache.org/book.html#trouble.client.security.rpc ?


Re: Occasional GSSException that brings down region server

Posted by wuzesheng <wu...@vip.qq.com>.

Hi Andrew,

We use a keytab file for HBase and there is no ticket cache file under /tmp,
but we still encounter the same error as above. What about that case?




Re: Occasional GSSException that brings down region server

Posted by Stack <st...@duboce.net>.

On Fri, Mar 14, 2014 at 2:55 PM, Wei Tan <wt...@us.ibm.com> wrote:

> Hi All, we seem to have overcome this occasional exception by changing
> HTable from being called through ConnectionManager vs. directly
> instantiating  HTable.
>
> OLD BAD:
> connection = HConnectionManager.createConnection(config);
>
> protected HTableInterface getHTable(String tableName) throws IOException
> {
>     // Connection might have been closed since we acquired it...create a
>     // new one if this is the case
>     if (connection == null || connection.isClosed())
>     {
>         log.info("HConnection null or closed...reopening");
>         connection = HConnectionManager.createConnection(config);
>     }
>
>     return connection.getTable(tableName);
> }
>
>
> NEW GOOD:
> public HTableInterface getHTable(byte[] tableName) throws IOException
> {
>     return new HTable(conf, tableName);
> }
>
> Comment on this link [0] is why they opted down the initial - BAD path:
> ". Instances of this class can be constructed directly but it is
> encouraged that users get instances via HConnection and HConnectionManager
> . See HConnectionManager class comment for an example. "
>
>
I made it a blocker to change the wording (HBASE-10757).  We want folks to
go via HCM ALWAYS from here on out.

Thanks for reporting back Wei Tan,
St.Ack

Re: Occasional GSSException that brings down region server

Posted by Wei Tan <wt...@us.ibm.com>.

Hi All, we seem to have overcome this occasional exception by changing how
we obtain HTable: instead of getting it through HConnectionManager, we now
instantiate HTable directly.

OLD BAD:
connection = HConnectionManager.createConnection(config);

protected HTableInterface getHTable(String tableName) throws IOException
{
    // Connection might have been closed since we acquired it...create a
    // new one if this is the case
    if (connection == null || connection.isClosed())
    {
        log.info("HConnection null or closed...reopening");
        connection = HConnectionManager.createConnection(config);
    }

    return connection.getTable(tableName);
}


NEW GOOD:
public HTableInterface getHTable(byte[] tableName) throws IOException
{
    return new HTable(conf, tableName);
}

The comment at this link [0] is why they initially went down the BAD path:
"Instances of this class can be constructed directly but it is encouraged
that users get instances via HConnection and HConnectionManager. See
HConnectionManager class comment for an example."


[0] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html

We ran the workload for a few days and it seems fine. We are not sure why
this occurs, but somebody following this thread may have a better idea.
Thanks for all your help!
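
Separately from the Kerberos error, the OLD BAD snippet earlier in this message checks and reassigns the shared connection field without any locking, so two threads that both see a closed connection could each create a replacement. The sketch below shows that check done under a lock. It is generic Java with hypothetical names (ClosableConn, SharedConnHolder) standing in for the HBase types, and it is not offered as the explanation for the GSSException.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a closable connection; not the HBase HConnection API.
interface ClosableConn {
    boolean isClosed();
}

// Holds one shared connection and reopens it under a lock, so two threads
// that both observe a closed connection do not each create a replacement.
class SharedConnHolder<C extends ClosableConn> {
    private final Supplier<C> factory;
    private C conn; // guarded by "this"

    SharedConnHolder(Supplier<C> factory) {
        this.factory = factory;
    }

    // The null/closed check and the reassignment happen atomically.
    synchronized C get() {
        if (conn == null || conn.isClosed()) {
            conn = factory.get();
        }
        return conn;
    }
}
```

The holder would be created once and shared; each caller asks it for the current connection instead of touching a field directly.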

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



From:   Zesheng Wu <wu...@gmail.com>
To:     user@hbase.apache.org, 
Date:   03/13/2014 09:33 PM
Subject:        Re: Occasional GSSException that brings down region server



Thank you all the same :)



Re: Occasional GSSException that brings down region server

Posted by Zesheng Wu <wu...@gmail.com>.

Thank you all the same :)


2014-03-13 22:50 GMT+08:00 Andrew Purtell <ap...@apache.org>:

> Then nothing immediately comes to mind, sorry.

-- 
Best Wishes!

Yours, Zesheng

Re: Occasional GSSException that brings down region server

Posted by Andrew Purtell <ap...@apache.org>.

Then nothing immediately comes to mind, sorry.

On Wednesday, March 12, 2014, Zesheng Wu <wu...@gmail.com> wrote:

> Hi Andrew,
>
> We use keytab file for hbase, there's no ticket cache file under /tmp, but
> we still encounter the same error as above,  how about this?


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Occasional GSSException that brings down region server

Posted by Zesheng Wu <wu...@gmail.com>.

Hi Andrew,

We use a keytab file for HBase and there is no ticket cache file under /tmp,
but we still encounter the same error as above. What about that case?


2014-03-12 7:37 GMT+08:00 Andrew Purtell <ap...@apache.org>:

> If you might have more than one security enabled Java process running under
> the same UNIX user on the servers, then this and other weirdness can happen
> because they share the same ticket cache file in /tmp. Someone does a
> (re)login and another process concurrently accessing the cache gets garbage
> or unexpected state.



-- 
Best Wishes!

Yours, Zesheng

Re: Occasional GSSException that brings down region server

Posted by Andrew Purtell <ap...@apache.org>.

If you have more than one security-enabled Java process running under the
same UNIX user on the servers, then this and other weirdness can happen
because they share the same ticket cache file in /tmp. One process does a
(re)login, and another process concurrently accessing the cache gets garbage
or unexpected state.
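
To make the sharing concrete, here is a minimal sketch of how a file-based ticket cache location is commonly resolved on Linux: the KRB5CCNAME environment variable if it is set, otherwise /tmp/krb5cc_<uid>. The class TicketCachePath is a hypothetical helper, the uid is passed in rather than looked up, and cache types other than FILE are deliberately left unresolved, so treat it as an illustration of the convention rather than what any particular Kerberos library does.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper: resolves the file-based ticket cache a process would
// use, following the common /tmp/krb5cc_<uid> convention on Linux.
class TicketCachePath {
    static Optional<Path> resolve(String krb5ccname, long uid) {
        // No override: every process running as this uid lands on the same file.
        if (krb5ccname == null || krb5ccname.isEmpty()) {
            return Optional.of(Paths.get("/tmp", "krb5cc_" + uid));
        }
        // Explicit file cache, e.g. FILE:/var/run/myapp/krb5cc
        if (krb5ccname.startsWith("FILE:")) {
            return Optional.of(Paths.get(krb5ccname.substring("FILE:".length())));
        }
        // Any other "TYPE:" prefix (for example KEYRING:) is not a file path.
        if (krb5ccname.matches("^[A-Za-z]+:.*")) {
            return Optional.empty();
        }
        // A bare path is treated as a file cache.
        return Optional.of(Paths.get(krb5ccname));
    }

    public static void main(String[] args) {
        // The uid must be supplied by the caller (for example from "id -u");
        // 0 is only a placeholder default for this demo.
        long uid = args.length > 0 ? Long.parseLong(args[0]) : 0L;
        System.out.println(resolve(System.getenv("KRB5CCNAME"), uid));
    }
}
```

Two processes with the same uid and no KRB5CCNAME resolve to the same path, which is the situation described above; giving each process its own KRB5CCNAME separates them.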

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)