Posted to issues@hive.apache.org by "Eugene Koifman (JIRA)" <ji...@apache.org> on 2019/01/24 02:10:00 UTC
[jira] [Assigned] (HIVE-21158) Perform update split early

     [ https://issues.apache.org/jira/browse/HIVE-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman reassigned HIVE-21158:
-------------------------------------


> Perform update split early
> --------------------------
>
>                 Key: HIVE-21158
>                 URL: https://issues.apache.org/jira/browse/HIVE-21158
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>    Affects Versions: 3.0.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Major
>
> Currently Acid 2.0 does U=D+I (Update = Delete + Insert) in the OrcRecordUpdater.  This means that all Updates (wide rows) are shuffled AND sorted.
> We could modify the multi-insert statement that results from a Merge statement so that, instead of having one of the legs represent the Update, we create 2 legs: 1 representing the Delete of the original row and 1 representing the Insert of the new version.
> Delete events are very small, so sorting them is cheap.  The Inserts are written to disk in sorted order by virtue of how ROW__IDs are generated.
> Exactly the same idea applies to a regular Update statement.
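
To make the idea above concrete, here is a minimal Python sketch of splitting an update into a delete leg and an insert leg. It is a toy model, not Hive code: `RowId`, `split_update`, and the data layout are hypothetical names chosen for illustration, and the only detail borrowed from Hive is that a ROW__ID consists of a write id, a bucket id, and a row id. The delete leg carries only the old ROW__IDs (narrow, so cheap to sort), while the insert leg carries the wide rows and receives row ids in increasing order, so it is already sorted.

```python
from typing import NamedTuple


class RowId(NamedTuple):
    """Toy stand-in for a ROW__ID: (write id, bucket id, row id)."""
    write_id: int
    bucket_id: int
    row_id: int


def split_update(rows, new_values, write_id, bucket_id=0):
    """Split an update over `rows` into delete events and insert events.

    rows:       list of (RowId, dict) pairs for the rows matched by the update
    new_values: dict of column -> new value (the SET clause)
    Returns (deletes, inserts).
    """
    deletes = []
    inserts = []
    for next_row_id, (old_id, row) in enumerate(rows):
        # Delete leg: only the identifier of the old row version is needed.
        deletes.append(old_id)
        # Insert leg: the full (wide) row with a freshly assigned identifier.
        new_row = {**row, **new_values}
        inserts.append((RowId(write_id, bucket_id, next_row_id), new_row))
    # Only the narrow delete events have to be sorted explicitly.
    deletes.sort()
    return deletes, inserts
```

In this model the explicit sort touches only the small `RowId` tuples; the wide rows are never reordered.
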
> Note that the U=D+I in OrcRecordUpdater must be retained so that the [Streaming Mutate API |https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API] keeps working on 2.0.
> *This requires that TxnHandler flags 2 Deletes as a conflict - it doesn't currently*
> Incidentally, 2.0 + early split allows updating all columns, including bucketing and partition columns.
> What is lock acquisition based on?  We need to make sure that conflict detection (write set tracking) still works.
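
As a rough illustration of the conflict rule being asked for, the Python sketch below treats both 'update' and 'delete' as mutating operations, so two overlapping transactions that each delete from the same partition are flagged. This models the desired behavior only; `WriteSetEntry`, `conflicts`, and the logical `start`/`commit` timestamps are hypothetical and do not reflect TxnHandler's actual schema or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSetEntry:
    """Hypothetical record of one committed write to one partition."""
    txn_id: int
    partition: str
    op: str      # "update", "delete", or "insert"
    start: int   # logical time at which the transaction started
    commit: int  # logical time at which the transaction committed


MUTATING_OPS = {"update", "delete"}


def conflicts(committed, txn_start, partition, op):
    """Return True if a transaction that started at `txn_start` and wants to
    record `op` on `partition` overlaps a committed mutating write there."""
    if op not in MUTATING_OPS:
        return False
    return any(
        entry.partition == partition
        and entry.op in MUTATING_OPS
        and entry.commit > txn_start  # the other writer committed after we started
        for entry in committed
    )
```

Under this rule the first committer wins: the later of two overlapping deletes on the same partition would have to abort.
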
> So we want to transform
> {noformat}
> update T set B = 7 where A=1
> {noformat}
> into 
> {noformat}
> from T
> insert into T select ROW__ID where a = 1 SORT BY ROW__ID
> insert into T select a, 7 where a = 1
> {noformat}
> or, even better, into
> {noformat}
> from T where a = 1
> insert into T select ROW__ID SORT BY ROW__ID
> insert into T select a, 7
> {noformat}
> but this won't parse currently.
> This is very similar to how the MERGE statement is handled.
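
The first (parseable) form of the rewrite can be sketched as plain string templating in Python. This is an assumption-laden illustration, not how Hive does it: `rewrite_update` and its parameters are hypothetical, a real implementation would presumably work on the parsed statement rather than on strings, and the output simply mirrors the shorthand used in the example above.

```python
def rewrite_update(table, columns, assignments, where):
    """Build the two-leg multi-insert text for a simple UPDATE.

    table:       table name, e.g. "T"
    columns:     all column names of the table, in order
    assignments: dict of column -> new value expression (the SET clause)
    where:       the WHERE predicate as text
    """
    # Updated columns take their new expression; the rest pass through unchanged.
    select_list = ", ".join(str(assignments.get(col, col)) for col in columns)
    return "\n".join([
        f"from {table}",
        # Delete leg: only ROW__ID is selected, and it is sorted.
        f"insert into {table} select ROW__ID where {where} SORT BY ROW__ID",
        # Insert leg: the new version of each matching row, unsorted.
        f"insert into {table} select {select_list} where {where}",
    ])
```

Calling `rewrite_update("T", ["a", "b"], {"b": 7}, "a = 1")` reproduces the three-line statement shown in the first rewritten form.
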
> Need some thought on how WriteSet tracking works.  If we don't allow updating partition columns, then even with dynamic partitions TxnHandler.addDynamicPartitions() should see 1 entry (of Update type) for each partition, since both the insert and the delete land in the same partition.  If partition columns can be updated, then we may insert a Delete event into P1 and the corresponding Insert event into P2, so addDynamicPartitions() should see both partitions.  I guess both need to be recorded in Write_Set but with different types: the delete as 'update' and the insert as 'insert', so that it can conflict with some IOW (Insert Overwrite) on the 'new' partition.
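
The bookkeeping described in the last paragraph can be sketched in Python as follows. The function name `write_set_entries` and the (partition, type) tuples are hypothetical and stand in for whatever would actually be recorded in Write_Set; the sketch only encodes the rule proposed above: the delete side is recorded as 'update' on the old partition, and the insert side is recorded as 'insert' only when the row moves to a different partition.

```python
def write_set_entries(row_moves):
    """Compute the distinct write-set entries for a split update.

    row_moves: iterable of (old_partition, new_partition) pairs, one per
               updated row.
    Returns a sorted list of (partition, type) tuples.
    """
    entries = set()
    for old_part, new_part in row_moves:
        # The delete event always lands in the partition holding the old row.
        entries.add((old_part, "update"))
        if new_part != old_part:
            # The partition column changed, so the new version lands elsewhere.
            entries.add((new_part, "insert"))
    return sorted(entries)
```

When no partition column changes, every pair has equal elements and each touched partition yields a single 'update' entry, matching the first case in the paragraph above.
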


