You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Szehon Ho <sz...@cloudera.com> on 2014/02/04 00:04:03 UTC

Review Request 17678: HIVE-4996 unbalanced calls to openTransaction/commitTransaction

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17678/
-----------------------------------------------------------

Review request for hive.


Bugs: HIVE-4996
    https://issues.apache.org/jira/browse/HIVE-4996


Repository: hive-git


Description
-------

Background:
First issue:
There are two levels of retrying in case of transient JDO/CP/DB errors: the RetryingHMSHandler and RetryingMetaStore.  But the RetryingMetaStore is flawed in the case of a nested transaction of a larger RetryingHMSHandler transaction (which is majority of cases).
Consider the following sample RetryingHMSHandler call, where variable "ms" is a RetryingRawStore.
   HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, then openTx = 1 upon intermediate commit
      ms.createTable()  //openTx = 2, then openTx = 1 upon intermediate commit
      ms.commit();  //openTx = 0

If there is any transient error in any intermediate operation and RetryingRawStore tries again, there will always be an unbalanced transaction, like:
  HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, transient error, then openTx=0 upon rollback.  After a retry, openTx=1, then openTx=0 upon successful intermediate commit 
      ms.createTable()  //openTx = 1, then openTx = 0 upon intermediate commit
      ms.commit();  //unbalanced transaction!

Retrying RawStore operations doesn't make sense in nested transaction cases, as the first part of the transaction is rolled-back upon transient error, and retry logic only saves a second half which may not make sense without the first.  It makes much more sense to retry the entire transaction from the top, which is what RetryingHMSHandler would already be doing if the RetryingMetaStore did not interfere.

Second issue:
The recent upgrade to BoneCP 0.8.0 seemed to cause more transient errors that triggered this problem.  In these cases, in-use connections are finalized, as follows:
WARN  bonecp.ConnectionPartition (ConnectionPartition.java:finalizeReferent(162)) - BoneCP detected an unclosed connection and will now attempt to close it for you. You should be closing this connection in your application - enable connectionWatch for additional debugging assistance or set disableConnectionTracking to true to disable this feature entirely.
The retry of this operation seems to get a good connection and allow the operation to proceed.  Reading forums, it seems some others have hit this issue after the upgrade, and switching back to 0.7.1 in our environment eliminated this issue for us.  But that reversion is outside the scope of this JIRA, and would be better-done in either the original or follow-up JIRA that upgraded the version.  

This fix targets the first issue only, as anyway it is needed for any sort of transient error, not just the BoneCP one that I observed.

Changes:
1. Removes RetryingRawStore in favor of RetryingHMSHandler, and removes the configuration property of the former.
2. Addresses the resultant holes in retry, in particular in the RetryingHMSHandler's construction of RawStore (before, RetryingRawStore would have retried failures like in creating the defaultDB).  It didn't seem necessary to increase the default RetryingHMSHandler retries to 2 to compensate, but I am open to that as well.
3. Contribute the instrumentation code that helped me to find the issue.  This includes printing missing stacks of exceptions that triggered retry, and adding debug-level tracing of ObjectStore calls to give better correlation with other errors/warnings in hive log.


Diffs
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 22bb22d 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java d7854fe 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestRawStoreTxn.java 0b87077 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 2d8e483 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 0715e22 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingHMSHandler.java fb70589 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingRawStore.java dcf97ec 

Diff: https://reviews.apache.org/r/17678/diff/


Testing
-------


Thanks,

Szehon Ho


Re: Review Request 17678: HIVE-4996 unbalanced calls to openTransaction/commitTransaction

Posted by Szehon Ho <sz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17678/
-----------------------------------------------------------

(Updated Feb. 11, 2014, 2:22 a.m.)


Review request for hive.


Changes
-------

Getting rid of the first retry to create the default database, on metastore startup.  Now, the RawStoreProxy is just a rename of RetryingRawStore + getting rid of retry at that layer.


Bugs: HIVE-4996
    https://issues.apache.org/jira/browse/HIVE-4996


Repository: hive-git


Description
-------

Background:
First issue:
There are two levels of retrying in case of transient JDO/CP/DB errors: the RetryingHMSHandler and RetryingRawStore.  But the RetryingRawStore is flawed in the case of a nested transaction of a larger RetryingHMSHandler transaction (which is majority of cases).
Consider the following sample RetryingHMSHandler call, where variable "ms" is a RetryingRawStore.
   HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, then openTx = 1 upon intermediate commit
      ms.createTable()  //openTx = 2, then openTx = 1 upon intermediate commit
      ms.commit();  //openTx = 0

If there is any transient error in any intermediate operation and RetryingRawStore tries again, there will always be an unbalanced transaction, like:
  HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, transient error, then openTx=0 upon rollback.  After a retry, openTx=1, then openTx=0 upon successful intermediate commit 
      ms.createTable()  //openTx = 1, then openTx = 0 upon intermediate commit
      ms.commit();  //unbalanced transaction!

Retrying RawStore operations doesn't make sense in nested transaction cases, as the first part of the transaction is rolled-back upon transient error, and retry logic only saves a second half which may not make sense without the first.  It makes much more sense to retry the entire transaction from the top, which is what RetryingHMSHandler would already be doing if the RetryingRawStore did not interfere.

Second issue:
The recent upgrade to BoneCP 0.8.0 seemed to cause more transient errors that triggered this problem.  In these cases, in-use connections are finalized, as follows:
WARN  bonecp.ConnectionPartition (ConnectionPartition.java:finalizeReferent(162)) - BoneCP detected an unclosed connection and will now attempt to close it for you. You should be closing this connection in your application - enable connectionWatch for additional debugging assistance or set disableConnectionTracking to true to disable this feature entirely.
The retry of this operation seems to get a good connection and allow the operation to proceed.  Reading forums, it seems some others have hit this issue after the upgrade, and switching back to 0.7.1 in our environment eliminated this issue for us.  But that reversion is outside the scope of this JIRA, and would be better-done in either the original or follow-up JIRA that upgraded the version.  

This fix targets the first issue only, as anyway it is needed for any sort of transient error, not just the BoneCP one that I observed.

Changes:
1. Removes RetryingRawStore in favor of RetryingHMSHandler, and removes the configuration property of the former.
2. Addresses the resultant holes in retry, in particular in the RetryingHMSHandler's construction of RawStore (before, RetryingRawStore would have retried failures like in creating the defaultDB).  It didn't seem necessary to increase the default RetryingHMSHandler retries to 2 to compensate, but I am open to that as well.
3. Contribute the instrumentation code that helped me to find the issue.  This includes printing missing stacks of exceptions that triggered retry, including 'unbalanced calls' errors to hive log, and adding debug-level tracing of ObjectStore calls to give better correlation with other errors/warnings in hive log.


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ec1f09e 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java d7854fe 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestRawStoreTxn.java 0b87077 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 4905be0 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 0715e22 
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStoreProxy.java PRE-CREATION 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingHMSHandler.java fb70589 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingRawStore.java dcf97ec 

Diff: https://reviews.apache.org/r/17678/diff/


Testing
-------


Thanks,

Szehon Ho


Re: Review Request 17678: HIVE-4996 unbalanced calls to openTransaction/commitTransaction

Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17678/#review33980
-----------------------------------------------------------



metastore/src/java/org/apache/hadoop/hive/metastore/RetryingHMSHandler.java
<https://reviews.apache.org/r/17678/#comment63850>

    I think it's fine if we don't retry operation (creating default db, etc.) when MetaStore is constructed. Thus, the logic in this file can largely remain unchanged.


- Xuefu Zhang


On Feb. 6, 2014, 1:54 a.m., Szehon Ho wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/17678/
> -----------------------------------------------------------
> 
> (Updated Feb. 6, 2014, 1:54 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-4996
>     https://issues.apache.org/jira/browse/HIVE-4996
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Background:
> First issue:
> There are two levels of retrying in case of transient JDO/CP/DB errors: the RetryingHMSHandler and RetryingRawStore.  But the RetryingRawStore is flawed in the case of a nested transaction of a larger RetryingHMSHandler transaction (which is majority of cases).
> Consider the following sample RetryingHMSHandler call, where variable "ms" is a RetryingRawStore.
>    HMSHandler.createTable()
>       ms.open()  //openTx = 1
>       ms.getTable() // openTx = 2, then openTx = 1 upon intermediate commit
>       ms.createTable()  //openTx = 2, then openTx = 1 upon intermediate commit
>       ms.commit();  //openTx = 0
> 
> If there is any transient error in any intermediate operation and RetryingRawStore tries again, there will always be an unbalanced transaction, like:
>   HMSHandler.createTable()
>       ms.open()  //openTx = 1
>       ms.getTable() // openTx = 2, transient error, then openTx=0 upon rollback.  After a retry, openTx=1, then openTx=0 upon successful intermediate commit 
>       ms.createTable()  //openTx = 1, then openTx = 0 upon intermediate commit
>       ms.commit();  //unbalanced transaction!
> 
> Retrying RawStore operations doesn't make sense in nested transaction cases, as the first part of the transaction is rolled-back upon transient error, and retry logic only saves a second half which may not make sense without the first.  It makes much more sense to retry the entire transaction from the top, which is what RetryingHMSHandler would already be doing if the RetryingRawStore did not interfere.
> 
> Second issue:
> The recent upgrade to BoneCP 0.8.0 seemed to cause more transient errors that triggered this problem.  In these cases, in-use connections are finalized, as follows:
> WARN  bonecp.ConnectionPartition (ConnectionPartition.java:finalizeReferent(162)) - BoneCP detected an unclosed connection and will now attempt to close it for you. You should be closing this connection in your application - enable connectionWatch for additional debugging assistance or set disableConnectionTracking to true to disable this feature entirely.
> The retry of this operation seems to get a good connection and allow the operation to proceed.  Reading forums, it seems some others have hit this issue after the upgrade, and switching back to 0.7.1 in our environment eliminated this issue for us.  But that reversion is outside the scope of this JIRA, and would be better-done in either the original or follow-up JIRA that upgraded the version.  
> 
> This fix targets the first issue only, as anyway it is needed for any sort of transient error, not just the BoneCP one that I observed.
> 
> Changes:
> 1. Removes RetryingRawStore in favor of RetryingHMSHandler, and removes the configuration property of the former.
> 2. Addresses the resultant holes in retry, in particular in the RetryingHMSHandler's construction of RawStore (before, RetryingRawStore would have retried failures like in creating the defaultDB).  It didn't seem necessary to increase the default RetryingHMSHandler retries to 2 to compensate, but I am open to that as well.
> 3. Contribute the instrumentation code that helped me to find the issue.  This includes printing missing stacks of exceptions that triggered retry, including 'unbalanced calls' errors to hive log, and adding debug-level tracing of ObjectStore calls to give better correlation with other errors/warnings in hive log.
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 89c9349 
>   itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java d7854fe 
>   itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestRawStoreTxn.java 0b87077 
>   metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ddc6aed 
>   metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 0715e22 
>   metastore/src/java/org/apache/hadoop/hive/metastore/RawStoreProxy.java PRE-CREATION 
>   metastore/src/java/org/apache/hadoop/hive/metastore/RetryingHMSHandler.java fb70589 
>   metastore/src/java/org/apache/hadoop/hive/metastore/RetryingRawStore.java dcf97ec 
> 
> Diff: https://reviews.apache.org/r/17678/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Szehon Ho
> 
>


Re: Review Request 17678: HIVE-4996 unbalanced calls to openTransaction/commitTransaction

Posted by Szehon Ho <sz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17678/
-----------------------------------------------------------

(Updated Feb. 6, 2014, 1:54 a.m.)


Review request for hive.


Changes
-------

Minor fix to eliminate wrapping an exception, to fix the negative test case comparing the exception messages.


Bugs: HIVE-4996
    https://issues.apache.org/jira/browse/HIVE-4996


Repository: hive-git


Description
-------

Background:
First issue:
There are two levels of retrying in case of transient JDO/CP/DB errors: the RetryingHMSHandler and RetryingRawStore.  But the RetryingRawStore is flawed in the case of a nested transaction of a larger RetryingHMSHandler transaction (which is majority of cases).
Consider the following sample RetryingHMSHandler call, where variable "ms" is a RetryingRawStore.
   HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, then openTx = 1 upon intermediate commit
      ms.createTable()  //openTx = 2, then openTx = 1 upon intermediate commit
      ms.commit();  //openTx = 0

If there is any transient error in any intermediate operation and RetryingRawStore tries again, there will always be an unbalanced transaction, like:
  HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, transient error, then openTx=0 upon rollback.  After a retry, openTx=1, then openTx=0 upon successful intermediate commit 
      ms.createTable()  //openTx = 1, then openTx = 0 upon intermediate commit
      ms.commit();  //unbalanced transaction!

Retrying RawStore operations doesn't make sense in nested transaction cases, as the first part of the transaction is rolled-back upon transient error, and retry logic only saves a second half which may not make sense without the first.  It makes much more sense to retry the entire transaction from the top, which is what RetryingHMSHandler would already be doing if the RetryingRawStore did not interfere.

Second issue:
The recent upgrade to BoneCP 0.8.0 seemed to cause more transient errors that triggered this problem.  In these cases, in-use connections are finalized, as follows:
WARN  bonecp.ConnectionPartition (ConnectionPartition.java:finalizeReferent(162)) - BoneCP detected an unclosed connection and will now attempt to close it for you. You should be closing this connection in your application - enable connectionWatch for additional debugging assistance or set disableConnectionTracking to true to disable this feature entirely.
The retry of this operation seems to get a good connection and allow the operation to proceed.  Reading forums, it seems some others have hit this issue after the upgrade, and switching back to 0.7.1 in our environment eliminated this issue for us.  But that reversion is outside the scope of this JIRA, and would be better-done in either the original or follow-up JIRA that upgraded the version.  

This fix targets the first issue only, as anyway it is needed for any sort of transient error, not just the BoneCP one that I observed.

Changes:
1. Removes RetryingRawStore in favor of RetryingHMSHandler, and removes the configuration property of the former.
2. Addresses the resultant holes in retry, in particular in the RetryingHMSHandler's construction of RawStore (before, RetryingRawStore would have retried failures like in creating the defaultDB).  It didn't seem necessary to increase the default RetryingHMSHandler retries to 2 to compensate, but I am open to that as well.
3. Contribute the instrumentation code that helped me to find the issue.  This includes printing missing stacks of exceptions that triggered retry, including 'unbalanced calls' errors to hive log, and adding debug-level tracing of ObjectStore calls to give better correlation with other errors/warnings in hive log.


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 89c9349 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java d7854fe 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestRawStoreTxn.java 0b87077 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java ddc6aed 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 0715e22 
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStoreProxy.java PRE-CREATION 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingHMSHandler.java fb70589 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingRawStore.java dcf97ec 

Diff: https://reviews.apache.org/r/17678/diff/


Testing
-------


Thanks,

Szehon Ho


Re: Review Request 17678: HIVE-4996 unbalanced calls to openTransaction/commitTransaction

Posted by Szehon Ho <sz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17678/
-----------------------------------------------------------

(Updated Feb. 5, 2014, 1:08 a.m.)


Review request for hive.


Changes
-------

Seems there is a jdourlhook mechanism that the RetryingRawStore was supporting.  Adding that functionality back to preserve backward compatibility.


Bugs: HIVE-4996
    https://issues.apache.org/jira/browse/HIVE-4996


Repository: hive-git


Description
-------

Background:
First issue:
There are two levels of retrying in case of transient JDO/CP/DB errors: the RetryingHMSHandler and RetryingRawStore.  But the RetryingRawStore is flawed in the case of a nested transaction of a larger RetryingHMSHandler transaction (which is majority of cases).
Consider the following sample RetryingHMSHandler call, where variable "ms" is a RetryingRawStore.
   HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, then openTx = 1 upon intermediate commit
      ms.createTable()  //openTx = 2, then openTx = 1 upon intermediate commit
      ms.commit();  //openTx = 0

If there is any transient error in any intermediate operation and RetryingRawStore tries again, there will always be an unbalanced transaction, like:
  HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, transient error, then openTx=0 upon rollback.  After a retry, openTx=1, then openTx=0 upon successful intermediate commit 
      ms.createTable()  //openTx = 1, then openTx = 0 upon intermediate commit
      ms.commit();  //unbalanced transaction!

Retrying RawStore operations doesn't make sense in nested transaction cases, as the first part of the transaction is rolled-back upon transient error, and retry logic only saves a second half which may not make sense without the first.  It makes much more sense to retry the entire transaction from the top, which is what RetryingHMSHandler would already be doing if the RetryingRawStore did not interfere.

Second issue:
The recent upgrade to BoneCP 0.8.0 seemed to cause more transient errors that triggered this problem.  In these cases, in-use connections are finalized, as follows:
WARN  bonecp.ConnectionPartition (ConnectionPartition.java:finalizeReferent(162)) - BoneCP detected an unclosed connection and will now attempt to close it for you. You should be closing this connection in your application - enable connectionWatch for additional debugging assistance or set disableConnectionTracking to true to disable this feature entirely.
The retry of this operation seems to get a good connection and allow the operation to proceed.  Reading forums, it seems some others have hit this issue after the upgrade, and switching back to 0.7.1 in our environment eliminated this issue for us.  But that reversion is outside the scope of this JIRA, and would be better-done in either the original or follow-up JIRA that upgraded the version.  

This fix targets the first issue only, as anyway it is needed for any sort of transient error, not just the BoneCP one that I observed.

Changes:
1. Removes RetryingRawStore in favor of RetryingHMSHandler, and removes the configuration property of the former.
2. Addresses the resultant holes in retry, in particular in the RetryingHMSHandler's construction of RawStore (before, RetryingRawStore would have retried failures like in creating the defaultDB).  It didn't seem necessary to increase the default RetryingHMSHandler retries to 2 to compensate, but I am open to that as well.
3. Contribute the instrumentation code that helped me to find the issue.  This includes printing missing stacks of exceptions that triggered retry, including 'unbalanced calls' errors to hive log, and adding debug-level tracing of ObjectStore calls to give better correlation with other errors/warnings in hive log.


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 89c9349 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java d7854fe 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestRawStoreTxn.java 0b87077 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 2d8e483 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 0715e22 
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStoreProxy.java PRE-CREATION 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingHMSHandler.java fb70589 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingRawStore.java dcf97ec 

Diff: https://reviews.apache.org/r/17678/diff/


Testing
-------


Thanks,

Szehon Ho


Re: Review Request 17678: HIVE-4996 unbalanced calls to openTransaction/commitTransaction

Posted by Szehon Ho <sz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17678/
-----------------------------------------------------------

(Updated Feb. 3, 2014, 11:24 p.m.)


Review request for hive.


Changes
-------

Fixing redundant logging for NoSuchObjectException.


Bugs: HIVE-4996
    https://issues.apache.org/jira/browse/HIVE-4996


Repository: hive-git


Description
-------

Background:
First issue:
There are two levels of retrying in case of transient JDO/CP/DB errors: the RetryingHMSHandler and RetryingRawStore.  But the RetryingRawStore is flawed in the case of a nested transaction of a larger RetryingHMSHandler transaction (which is majority of cases).
Consider the following sample RetryingHMSHandler call, where variable "ms" is a RetryingRawStore.
   HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, then openTx = 1 upon intermediate commit
      ms.createTable()  //openTx = 2, then openTx = 1 upon intermediate commit
      ms.commit();  //openTx = 0

If there is any transient error in any intermediate operation and RetryingRawStore tries again, there will always be an unbalanced transaction, like:
  HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, transient error, then openTx=0 upon rollback.  After a retry, openTx=1, then openTx=0 upon successful intermediate commit 
      ms.createTable()  //openTx = 1, then openTx = 0 upon intermediate commit
      ms.commit();  //unbalanced transaction!

Retrying RawStore operations doesn't make sense in nested transaction cases, as the first part of the transaction is rolled-back upon transient error, and retry logic only saves a second half which may not make sense without the first.  It makes much more sense to retry the entire transaction from the top, which is what RetryingHMSHandler would already be doing if the RetryingRawStore did not interfere.

Second issue:
The recent upgrade to BoneCP 0.8.0 seemed to cause more transient errors that triggered this problem.  In these cases, in-use connections are finalized, as follows:
WARN  bonecp.ConnectionPartition (ConnectionPartition.java:finalizeReferent(162)) - BoneCP detected an unclosed connection and will now attempt to close it for you. You should be closing this connection in your application - enable connectionWatch for additional debugging assistance or set disableConnectionTracking to true to disable this feature entirely.
The retry of this operation seems to get a good connection and allow the operation to proceed.  Reading forums, it seems some others have hit this issue after the upgrade, and switching back to 0.7.1 in our environment eliminated this issue for us.  But that reversion is outside the scope of this JIRA, and would be better-done in either the original or follow-up JIRA that upgraded the version.  

This fix targets the first issue only, as anyway it is needed for any sort of transient error, not just the BoneCP one that I observed.

Changes:
1. Removes RetryingRawStore in favor of RetryingHMSHandler, and removes the configuration property of the former.
2. Addresses the resultant holes in retry, in particular in the RetryingHMSHandler's construction of RawStore (before, RetryingRawStore would have retried failures like in creating the defaultDB).  It didn't seem necessary to increase the default RetryingHMSHandler retries to 2 to compensate, but I am open to that as well.
3. Contribute the instrumentation code that helped me to find the issue.  This includes printing missing stacks of exceptions that triggered retry, including 'unbalanced calls' errors to hive log, and adding debug-level tracing of ObjectStore calls to give better correlation with other errors/warnings in hive log.


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 22bb22d 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java d7854fe 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestRawStoreTxn.java 0b87077 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 2d8e483 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 0715e22 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingHMSHandler.java fb70589 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingRawStore.java dcf97ec 

Diff: https://reviews.apache.org/r/17678/diff/


Testing
-------


Thanks,

Szehon Ho


Re: Review Request 17678: HIVE-4996 unbalanced calls to openTransaction/commitTransaction

Posted by Szehon Ho <sz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17678/
-----------------------------------------------------------

(Updated Feb. 3, 2014, 11:10 p.m.)


Review request for hive.


Bugs: HIVE-4996
    https://issues.apache.org/jira/browse/HIVE-4996


Repository: hive-git


Description (updated)
-------

Background:
First issue:
There are two levels of retrying in case of transient JDO/CP/DB errors: the RetryingHMSHandler and RetryingRawStore.  But the RetryingRawStore is flawed in the case of a nested transaction of a larger RetryingHMSHandler transaction (which is majority of cases).
Consider the following sample RetryingHMSHandler call, where variable "ms" is a RetryingRawStore.
   HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, then openTx = 1 upon intermediate commit
      ms.createTable()  //openTx = 2, then openTx = 1 upon intermediate commit
      ms.commit();  //openTx = 0

If there is any transient error in any intermediate operation and RetryingRawStore tries again, there will always be an unbalanced transaction, like:
  HMSHandler.createTable()
      ms.open()  //openTx = 1
      ms.getTable() // openTx = 2, transient error, then openTx=0 upon rollback.  After a retry, openTx=1, then openTx=0 upon successful intermediate commit 
      ms.createTable()  //openTx = 1, then openTx = 0 upon intermediate commit
      ms.commit();  //unbalanced transaction!

Retrying RawStore operations doesn't make sense in nested transaction cases, as the first part of the transaction is rolled-back upon transient error, and retry logic only saves a second half which may not make sense without the first.  It makes much more sense to retry the entire transaction from the top, which is what RetryingHMSHandler would already be doing if the RetryingRawStore did not interfere.

Second issue:
The recent upgrade to BoneCP 0.8.0 seemed to cause more transient errors that triggered this problem.  In these cases, in-use connections are finalized, as follows:
WARN  bonecp.ConnectionPartition (ConnectionPartition.java:finalizeReferent(162)) - BoneCP detected an unclosed connection and will now attempt to close it for you. You should be closing this connection in your application - enable connectionWatch for additional debugging assistance or set disableConnectionTracking to true to disable this feature entirely.
The retry of this operation seems to get a good connection and allow the operation to proceed.  Reading forums, it seems some others have hit this issue after the upgrade, and switching back to 0.7.1 in our environment eliminated this issue for us.  But that reversion is outside the scope of this JIRA, and would be better-done in either the original or follow-up JIRA that upgraded the version.  

This fix targets the first issue only, as anyway it is needed for any sort of transient error, not just the BoneCP one that I observed.

Changes:
1. Removes RetryingRawStore in favor of RetryingHMSHandler, and removes the configuration property of the former.
2. Addresses the resultant holes in retry, in particular in the RetryingHMSHandler's construction of RawStore (before, RetryingRawStore would have retried failures like in creating the defaultDB).  It didn't seem necessary to increase the default RetryingHMSHandler retries to 2 to compensate, but I am open to that as well.
3. Contribute the instrumentation code that helped me to find the issue.  This includes printing missing stacks of exceptions that triggered retry, including 'unbalanced calls' errors to hive log, and adding debug-level tracing of ObjectStore calls to give better correlation with other errors/warnings in hive log.


Diffs
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 22bb22d 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java d7854fe 
  itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestRawStoreTxn.java 0b87077 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 2d8e483 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 0715e22 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingHMSHandler.java fb70589 
  metastore/src/java/org/apache/hadoop/hive/metastore/RetryingRawStore.java dcf97ec 

Diff: https://reviews.apache.org/r/17678/diff/


Testing
-------


Thanks,

Szehon Ho