You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/16 09:13:22 UTC

[GitHub] [iceberg] pvary opened a new pull request #1620: Hive: Fix TestHiveMetastore worker exhaustion

pvary opened a new pull request #1620:
URL: https://github.com/apache/iceberg/pull/1620


   Last night I was able to reproduce these failures for a while:
   ```
   org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithHiveCatalog > testJoinTablesParquet FAILED
      java.lang.IllegalArgumentException: Failed to executeQuery Hive query SHOW TABLES: Error while compiling statement: FAILED: SemanticException org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed)
          Caused by:
          org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed)
              Caused by:
              org.apache.hadoop.hive.ql.parse.SemanticException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed)
                  Caused by:
                  org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed)
                      Caused by:
                      org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed)
                          Caused by:
                          java.net.SocketException: Broken pipe (Write failed)
   ```
   
   The issue is that the number of worker threads is exhausted in the TestHiveMetastore.
   We are creating HMSClients in several different places:
   - TestHiveMetastore - we set `iceberg.hive.client-pool-size` to 2, so if HiveCatalog is used we create a pool with 2 connections
   - TestHiveMetastore - own `clientPool`. The size is 1, so we create 1 more connection
   - When we are initializing HiveServer2 for every worker and background thread we create 1-1 connections
   - In Hive3 HiveServer2 creates a NotificationEventPoll thread which also uses a connection
   
   We initialize the TestHiveMetastore with maximum of 5 threads and the connections are not cleaned up consistently. The main culprit is that HiveMetastore and HiveServer2 is not designed as something which is started and stopped in the same thread/JVM multiple times, so cleaning up of the Metastore connections during HiveServer2 restart was never a priority. ThreadLocal connections are kept and reused for the worker threads, and the connections are only closed if the Finalizer thread destroys the object.
   
   I have tested 2 fixes which were working consistently when I have tested them (only one of them was enough to fix the issue, but I wanted to verify my theory of the RC):
   - Increased the threadpool size for the `TestHiveMetastore`
   - Added a `System.gc()` to the `HiveIcebergStorageHandlerBaseTest.after()` - As System.gc() is not guaranteed to run a GC this is just something which might work
   
   I have added one more fix where I have turned off the NotificationEventPoller, which saves 1 connection for sure.
   When I wanted to test this fix I was unable to reproduce the issue again - even without any changes.
   
   Since I have no way to test the fixes again and all of them are test only changes, I have created a pull request which contains all of them. I really hope this will fix all off the flakiness issues once and for all.
   
   Could you please check @massdosage, @marton-bod, @lcspinter, @rdblue 
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1620: Hive: Fix TestHiveMetastore worker exhaustion

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1620:
URL: https://github.com/apache/iceberg/pull/1620#issuecomment-710225570


   Thanks for looking into the problem, @pvary! I know these are hard to track down so I really appreciate you taking the time.
   
   While I'd like to avoid increasing the number of threads for tests that we control, I'm going to merge this to fix the tests. We can follow up to decrease the number of threads for the HMS tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1620: Hive: Fix TestHiveMetastore worker exhaustion

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1620:
URL: https://github.com/apache/iceberg/pull/1620


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #1620: Hive: Fix TestHiveMetastore worker exhaustion

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #1620:
URL: https://github.com/apache/iceberg/pull/1620#discussion_r507529208



##########
File path: hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java
##########
@@ -166,8 +166,8 @@ private TServer newThriftServer(TServerSocket socket, HiveConf conf) throws Exce
         .processor(new TSetIpAddressProcessor<>(handler))
         .transportFactory(new TTransportFactory())
         .protocolFactory(new TBinaryProtocol.Factory())
-        .minWorkerThreads(3)
-        .maxWorkerThreads(5);
+        .minWorkerThreads(5)
+        .maxWorkerThreads(15);

Review comment:
       Created the new PR: #1629




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1620: Hive: Fix TestHiveMetastore worker exhaustion

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1620:
URL: https://github.com/apache/iceberg/pull/1620#discussion_r506601681



##########
File path: hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java
##########
@@ -166,8 +166,8 @@ private TServer newThriftServer(TServerSocket socket, HiveConf conf) throws Exce
         .processor(new TSetIpAddressProcessor<>(handler))
         .transportFactory(new TTransportFactory())
         .protocolFactory(new TBinaryProtocol.Factory())
-        .minWorkerThreads(3)
-        .maxWorkerThreads(5);
+        .minWorkerThreads(5)
+        .maxWorkerThreads(15);

Review comment:
       I don't think we should change these values for the majority of Hive tests, or at least Hive Metastore tests, because this has helped us track down connection leaks. It sounds like we should change these for HiveRunner tests, where we have less control of the connections. If Hive itself doesn't pool or close connections, then we would need to bump this up.
   
   Could you make this configurable so that we can set it for the tests that need it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #1620: Hive: Fix TestHiveMetastore worker exhaustion

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #1620:
URL: https://github.com/apache/iceberg/pull/1620#issuecomment-711757170


   Thanks for the merge. Create the new PR: #1629


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org