Posted to dev@hive.apache.org by guangyy <gi...@git.apache.org> on 2018/11/02 23:29:04 UTC

[GitHub] hive pull request #484: HIVE-16839: Fix a race condition during concurrent ...

GitHub user guangyy opened a pull request:

    https://github.com/apache/hive/pull/484

    HIVE-16839: Fix a race condition during concurrent partition drops

    We have seen a leaked lock on the Hive metastore DB that caused all
    PARTITION insertions to fail with a lock-wait timeout until the
    metastore service was restarted.
    
    A transaction dump on the DB shows a thread in the Sleep state that
    potentially holds the lock:
    ```
      trx_id: 33603171058
                     trx_state: RUNNING
                   trx_started: 2018-10-23 06:43:22
         trx_requested_lock_id: NULL
              trx_wait_started: NULL
                    trx_weight: 70298
           trx_mysql_thread_id: 275402202
                     trx_query: NULL
           trx_operation_state: NULL
             trx_tables_in_use: 0
             trx_tables_locked: 0
              trx_lock_structs: 21286
         trx_lock_memory_bytes: 2881064
               trx_rows_locked: 98810
             trx_rows_modified: 49012
       trx_concurrency_tickets: 0
           trx_isolation_level: READ COMMITTED
             trx_unique_checks: 1
        trx_foreign_key_checks: 1
    trx_last_foreign_key_error: NULL
     trx_adaptive_hash_latched: 0
     trx_adaptive_hash_timeout: 0
              trx_is_read_only: 0
    trx_autocommit_non_locking: 0
                            ID: 275402202
                          USER: metastore_gold
                          HOST: 10.37.182.82:36684
                            DB: metastoregold
                       COMMAND: Sleep
                          TIME: 1
                         STATE:
                          INFO: NULL
                      duration: 1316
    ```
    Given the HOST IP, we traced the connection back to the Hive metastore instance and found the following exception:
    
    ```
    No such database row
    org.datanucleus.exceptions.NucleusObjectNotFoundException: No such database row
            at org.datanucleus.store.rdbms.request.FetchRequest.execute(FetchRequest.java:357)
            at org.datanucleus.store.rdbms.RDBMSPersistenceHandler.fetchObject(RDBMSPersistenceHandler.java:324)
            at org.datanucleus.state.AbstractStateManager.loadFieldsFromDatastore(AbstractStateManager.java:1120)
            at org.datanucleus.state.JDOStateManager.loadSpecifiedFields(JDOStateManager.java:2916)
            at org.datanucleus.state.JDOStateManager.isLoaded(JDOStateManager.java:3219)
    ```
    The problem is that the caller expects NULL if the partition does not exist; however, the convertToPart function
    throws an exception instead, which leads to the leaked lock.
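    
    A minimal sketch of the failure mode described above, not the actual patch. It assumes
    the lookup runs inside a transaction that must be committed or rolled back on every
    path. Apart from convertToPart, every name here (RowNotFoundException, FakeTransaction,
    ROWS, getPartitionOrNull) is a hypothetical stand-in, and this convertToPart is a toy
    version rather than the metastore's own method:
    
    ```java
    import java.util.HashMap;
    import java.util.Map;
    
    public class PartitionLookupSketch {
    
        /** Hypothetical stand-in for the "No such database row" exception. */
        static class RowNotFoundException extends RuntimeException {
            RowNotFoundException(String msg) { super(msg); }
        }
    
        /** Hypothetical transaction handle; it only tracks whether it is still open. */
        static class FakeTransaction {
            boolean open = true;
            void commit() { open = false; }
            void rollback() { open = false; }
        }
    
        /** Hypothetical backing store: partition name -> location. */
        static final Map<String, String> ROWS = new HashMap<>();
    
        /** Toy conversion that fails when the row was dropped by a concurrent caller. */
        static String convertToPart(String name) {
            String location = ROWS.get(name);
            if (location == null) {
                throw new RowNotFoundException("No such database row");
            }
            return name + "@" + location;
        }
    
        /** Returns null for a missing partition and never leaves the transaction open. */
        static String getPartitionOrNull(String name, FakeTransaction tx) {
            boolean committed = false;
            try {
                String part;
                try {
                    part = convertToPart(name);
                } catch (RowNotFoundException e) {
                    // Translate "row vanished" into the null the caller expects.
                    part = null;
                }
                tx.commit();
                committed = true;
                return part;
            } finally {
                // Any other failure still releases the transaction and its locks.
                if (!committed) {
                    tx.rollback();
                }
            }
        }
    
        public static void main(String[] args) {
            ROWS.put("ds=2018-10-23", "/warehouse/t/ds=2018-10-23");
            FakeTransaction tx = new FakeTransaction();
            // Look up a partition that is not in the store, as after a concurrent drop.
            System.out.println(getPartitionOrNull("ds=2018-10-22", tx));
            System.out.println("transaction still open: " + tx.open);
        }
    }
    ```
    
    The two points the sketch illustrates are that the missing-row case is mapped to the
    value the caller already handles, and that the transaction is closed in a finally
    block so an unexpected exception cannot leave row locks behind.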

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/guangyy/hive HIVE-16839

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hive/pull/484.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #484
    
----
commit 5137027ee658990dd1503c09c13a73e2848d8deb
Author: Guang Yang <gu...@...>
Date:   2018-11-02T23:21:35Z

    HIVE-16839: Fix a race condition during concurrent partition drops

----

