Posted to issues@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2016/12/05 19:25:58 UTC

[jira] [Commented] (HIVE-15352) MVCC (Multi Versioned Concurrency Control) in Hive

    [ https://issues.apache.org/jira/browse/HIVE-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723103#comment-15723103 ] 

Sergey Shelukhin commented on HIVE-15352:
-----------------------------------------

Something similar to sequence-based partitioning is being pursued in HIVE-14535 (although for different reasons). We were thinking of MVCC as one of the next logical steps there.
Also, you might want to take a look at the Hive ACID implementation.

> MVCC (Multi Versioned Concurrency Control) in Hive
> --------------------------------------------------
>
>                 Key: HIVE-15352
>                 URL: https://issues.apache.org/jira/browse/HIVE-15352
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Garima Dosi
>         Attachments: Hive MVCC - Requirement & Design.pdf
>
>
> Use Case
> While building solutions for various applications, we sometimes need multi-version concurrency support for certain datasets. The requirement arises mainly for two reasons:
> • Simultaneous querying and loading of tables or datasets, which requires maintaining separate versions for reading and writing (locking is not the right option here)
> • Maintaining a history of table/dataset loads up to some extent
> Both of these requirements are common in data management systems (warehouses, etc.).
> What happens without MVCC in Hive?
> Where MVCC was needed, a design similar to the one in https://dzone.com/articles/zookeeper-a-real-world-example-of-how-to-use-it was followed. ZooKeeper was used to maintain versions and provide MVCC support. However, this design is limiting for a normal user who wants to query a Hive table, because they will not know which version is current. The additional layer needed to match the versions in ZooKeeper with the dataset to be queried adds overhead for normal users; hence the request to make this feature available in Hive.
> Hive Design for Support of MVCC
> The Hive design for MVCC support could be as described below (it would loosely follow the article mentioned in the previous section):
> 1. First, the user should be able to declare that a table is an MVCC table, with DDL something like this:
> create table <table_name>  ( <column_specs>) MULTI_VERSIONED ON [sequence, time]
> Internally, this DDL can be translated to a table partitioned either on a sequence number (auto-generated by Hive) or on a timestamp. The metastore would keep this information.
> 2. DMLs for inserting or loading data into the table would remain the same for the end user. However, internally Hive would automatically detect that the table is multi-versioned and write the new data to a new partition holding a new version of the dataset. The Hive Metastore would also be updated with the current version.
> 3. DMLs for querying data from the table would remain the same for the user. However, internally Hive would use the latest version for queries. The latest version is always stored in the metastore.
> Management of obsolete versions
> Obsolete versions can be deleted in either of the following ways:
> 1. By a setting that deletes any version that is older than a threshold and is not active, OR
> 2. By tracking the count of queries running on older versions and deleting those that are not the latest and are not being used by any query. This would require some sort of background thread monitoring the table for obsolete versions. As shown in the article mentioned above, it would also require incrementing a per-version count whenever a version is queried and decrementing it once the query is done.
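
The design in the quoted description (each load writes a new version, queries read the latest version, and obsolete versions are removed once no query uses them) can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not Hive code: the class VersionedTable and all of its method names are hypothetical, a dictionary stands in for partitions, and a plain attribute stands in for the "current version" pointer that the proposal would keep in the metastore.

```python
import threading


class VersionedTable:
    """Toy in-memory model of the proposed multi-versioned table (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}   # version number -> rows (stands in for a partition)
        self._readers = {}    # version number -> number of queries still reading it
        self._current = None  # "latest version" pointer the metastore would keep
        self._next_seq = 1    # auto-generated sequence number

    def load(self, rows):
        """Write the data as a new version, then publish it as the current one."""
        with self._lock:
            version = self._next_seq
            self._next_seq += 1
            self._versions[version] = list(rows)
            self._readers[version] = 0
            self._current = version
            return version

    def begin_query(self):
        """Pin the latest version for a query and return (version, rows)."""
        with self._lock:
            if self._current is None:
                raise LookupError("table has no published version yet")
            version = self._current
            self._readers[version] += 1
            return version, self._versions[version]

    def end_query(self, version):
        """Unpin a version once the query that read it is done."""
        with self._lock:
            self._readers[version] -= 1

    def purge_obsolete(self):
        """Drop versions that are neither current nor in use; return their numbers."""
        with self._lock:
            obsolete = [v for v in self._versions
                        if v != self._current and self._readers[v] == 0]
            for v in obsolete:
                del self._versions[v]
                del self._readers[v]
            return obsolete


table = VersionedTable()
table.load(["a", "b"])                 # first load becomes version 1
pinned, rows = table.begin_query()     # a query pins version 1
table.load(["a", "b", "c"])            # version 2 becomes current; the reader is not blocked
still_pinned = table.purge_obsolete()  # nothing is dropped: version 1 is still being read
table.end_query(pinned)
purged = table.purge_obsolete()        # version 1 is now unused and not current
```

In this sketch, the background thread mentioned in the description would simply call purge_obsolete() periodically. The threshold-based option could be modelled by storing a load timestamp per version and adding an age check to the same filter.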


