Posted to issues@hive.apache.org by "Chris Nauroth (Jira)" <ji...@apache.org> on 2022/10/27 21:38:00 UTC

[jira] [Resolved] (HIVE-26669) Hive Metastore become unresponsive

     [ https://issues.apache.org/jira/browse/HIVE-26669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth resolved HIVE-26669.
----------------------------------
      Assignee: Chris Nauroth
    Resolution: Not A Problem

Hello [~sandygade].

Thank you for the full thread dump. I can see from the example thread below that Hikari's connection adder threads are blocked indefinitely in I/O inside the Oracle JDBC driver. It also appears that Hikari creates new connections on a dedicated thread pool (one per connection pool):

[https://github.com/brettwooldridge/HikariCP/blob/dev/src/main/java/com/zaxxer/hikari/pool/HikariPool.java#L115]

Additionally, this thread pool is hard-coded to a size of 1 thread:

[https://github.com/brettwooldridge/HikariCP/blob/dev/src/main/java/com/zaxxer/hikari/util/UtilityElf.java#L139]

This means that if the metastore's threads need a new database connection, and the adder thread blocks indefinitely in its connection attempt, then all of the other threads get blocked behind it. Effectively, HiveMetaStore cannot make progress for clients until after a restart, just like you described.
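
To make the failure mode concrete, here is a minimal, self-contained Java sketch of the pattern (not HikariCP's actual code): connection creation runs on a single-threaded executor, so one connection attempt that hangs stalls every later request for a new connection. The class and method names below are made up for illustration.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: illustrates a single-threaded "connection adder" executor,
// not HikariCP's real implementation.
public class SingleAdderThreadSketch {

    // Stand-in for the per-pool connection adder executor, hard-coded to one thread.
    private static final ExecutorService CONNECTION_ADDER = Executors.newSingleThreadExecutor();

    // Hypothetical connection attempt that never returns, e.g. blocked in
    // socket I/O inside a JDBC driver with no connect/read timeout configured.
    private static void openConnectionThatNeverReturns() throws InterruptedException {
        Thread.sleep(Long.MAX_VALUE);
    }

    public static void main(String[] args) throws Exception {
        // First request: the single adder thread blocks forever in the "driver".
        CONNECTION_ADDER.submit(() -> {
            openConnectionThatNeverReturns();
            return null;
        });

        // Every later request queues behind it and never completes, which is
        // what the metastore worker threads end up waiting on.
        Future<String> later = CONNECTION_ADDER.submit(() -> "never returned");
        try {
            later.get(5, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("Second connection request is still stuck behind the first.");
        }

        System.exit(0); // The stuck non-daemon adder thread would otherwise keep the JVM alive.
    }
}
{code}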

Here are some recommended next steps:
 * Like you said, this could indicate a networking error (e.g. high packet loss), so that's worth investigating.
 * It is suspicious that the socket connections do not time out and report an error back to the caller. That would at least give an opportunity for retries instead of hanging the whole process. I don't know Oracle myself, but I'm seeing some indications online that the Oracle JDBC driver supports a {{CONNECT_TIMEOUT}} property. Perhaps it would help to add that to the connection string ({{javax.jdo.option.ConnectionURL}} in hive-site.xml) with a relatively short value, like 10-30 seconds (see the configuration sketch after this list).
 * There is also support for [Apache Commons DBCP|https://commons.apache.org/proper/commons-dbcp/] as the connection pool, as documented at [Hive Metastore Connection Pooling Configuration|https://cwiki.apache.org/confluence/display/hive/configuration+properties#ConfigurationProperties-HiveMetastoreConnectionPoolingConfiguration.1]. I'm not certain, but you might see different results with it if it doesn't funnel new connection attempts through a single thread (also included in the sketch below).
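
For illustration only, here is roughly what those two changes could look like in hive-site.xml. The way the timeout is appended to the URL ({{oracle.net.CONNECT_TIMEOUT}} as a URL property, value in milliseconds), the host and service names, and the pool-type value are my assumptions rather than a tested configuration, so please verify them against your Oracle JDBC driver version and the Hive configuration docs linked above.

{code:xml}
<!-- Sketch only: placeholder values, not a tested configuration. -->

<!-- Assumed Oracle thin-driver URL with a hypothetical 30-second connect timeout
     appended as a URL property; the exact property name and syntax may differ
     by driver version. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL?oracle.net.CONNECT_TIMEOUT=30000</value>
</property>

<!-- Optional: switch the metastore connection pool from HikariCP to DBCP. -->
<property>
  <name>datanucleus.connectionPoolingType</name>
  <value>DBCP</value>
</property>
{code}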

I'm going to close out this issue as there doesn't appear to be a Hive bug. I hope these suggestions help.

> Hive Metastore become unresponsive
> ----------------------------------
>
>                 Key: HIVE-26669
>                 URL: https://issues.apache.org/jira/browse/HIVE-26669
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 3.1.0
>            Reporter: Sandeep Gade
>            Assignee: Chris Nauroth
>            Priority: Critical
>         Attachments: metastore-server1
>
>
> We are experiencing issues with the Hive Metastore where it goes unresponsive. Initial investigation shows thousands of threads in the WAITING (parking) state, as shown below:
>     1    java.lang.Thread.State: BLOCKED (on object monitor)
>     772    java.lang.Thread.State: RUNNABLE
>       2    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>      13    java.lang.Thread.State: TIMED_WAITING (parking)
>       5    java.lang.Thread.State: TIMED_WAITING (sleeping)
>       3    java.lang.Thread.State: WAITING (on object monitor)
>   14308    java.lang.Thread.State: WAITING (parking)
> ==============
> Almost all of the threads are stuck at 'parking to wait for  <0x00007f9ad0795c48> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)'
>  
>      15         - parking to wait for  <0x00007f9ad06c9c10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   14288         - parking to wait for  <0x00007f9ad0795c48> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>       1         - parking to wait for  <0x00007f9ad0a161f8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       1         - parking to wait for  <0x00007f9ad0a39248> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       1         - parking to wait for  <0x00007f9ad0adb0a0> (a java.util.concurrent.SynchronousQueue$TransferQueue)
>       5         - parking to wait for  <0x00007f9ad0b12278> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       1         - parking to wait for  <0x00007f9ad0b12518> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       1         - parking to wait for  <0x00007f9ad0b44878> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       1         - parking to wait for  <0x00007f9ad0cbe8f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       1         - parking to wait for  <0x00007f9ad1318d60> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       1         - parking to wait for  <0x00007f9ad1478c10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       5         - parking to wait for  <0x00007f9ad1494ff8> (a java.util.concurrent.SynchronousQueue$TransferQueue)
> ======================
> complete stack:
> "pool-8-thread-62238" #3582305 prio=5 os_prio=0 tid=0x00007f977bfc9800 nid=0x62011 waiting on condition [0x00007f959d917000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007f9ad0795c48> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>         at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
>         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>         at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:351)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:77)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:137)
>         at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:59)
>         at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
>         at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStoreForConf(HiveMetaStore.java:750)
>         at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMSForConf(HiveMetaStore.java:718)
>         at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:712)
>         at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database_core(HiveMetaStore.java:1488)
>         at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_database(HiveMetaStore.java:1470)
>         at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)
>         at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)
>         at com.sun.proxy.$Proxy30.get_database(Unknown Source)
>         at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_database.getResult(ThriftHiveMetastore.java:15014)
>         at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_database.getResult(ThriftHiveMetastore.java:14998)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>         at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:636)
>         at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:631)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:631)
>         at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
>    Locked ownable synchronizers:
>         - <0x00007fae9f0d8c20> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> ======================
> Looking at the Linux process limits, Hive exhausts its 'Max processes' count while the issue is happening. The limit is set to:
> Max processes             16000                16000                processes
> As a workaround, we restart the Metastores, and everything works fine for a few days.


