Posted to issues@hive.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2016/04/11 20:27:25 UTC
[jira] [Commented] (HIVE-13479) Relax sorting requirement in ACID tables

    [ https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235674#comment-15235674 ] 

Owen O'Malley commented on HIVE-13479:
--------------------------------------

Setting hive.optimize.sort.dynamic.partition=true has other issues, but ACID shouldn't need to disable it. It sorts only on the partition columns, which does not interfere with the ACID sort, since that sort is only required inside each bucket file.

The bigger requirement is that we need to support sorting on user-defined primary keys instead of our internal row ids. That will enable implementation of the upsert/merge commands. That does NOT require moving to a split insert + delete for modifications. The split approach has other advantages (like enabling predicate push down on the deltas), but they don't help very much for the case of sorted primary keys.
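
For illustration only, here is a minimal Python sketch of the sort/merge idea that the per-bucket sort exists to support. It is not Hive's reader code: the function name, the event tuple layout, and the use of a single integer key in place of the internal row id are all assumptions made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical event layout: (key, txn_id, op, value).
# Every input file is assumed to be already sorted by key.
def sort_merge_snapshot(*sorted_files):
    # Streaming k-way merge; relies on each input being sorted by key.
    merged = heapq.merge(*sorted_files, key=itemgetter(0))
    snapshot = []
    # Events for the same key arrive consecutively, so groupby can collect them.
    for key, events in groupby(merged, key=itemgetter(0)):
        latest = max(events, key=itemgetter(1))  # highest txn_id wins
        if latest[2] != "delete":
            snapshot.append((key, latest[3]))
    return snapshot

base = [(1, 10, "insert", "a"), (2, 10, "insert", "b"), (3, 10, "insert", "c")]
delta = [(2, 11, "update", "B"), (3, 12, "delete", None)]
print(sort_merge_snapshot(base, delta))
```

Because the merge only needs events for equal keys to arrive together, the same structure would work with a user-defined primary key as the sort key, as long as every file in the bucket is sorted on it.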

> Relax sorting requirement in ACID tables
> ----------------------------------------
>
>                 Key: HIVE-13479
>                 URL: https://issues.apache.org/jira/browse/HIVE-13479
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>    Affects Versions: 1.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>   Original Estimate: 160h
>  Remaining Estimate: 160h
>
> Currently ACID tables require data to be sorted according to the internal primary key.  This is so that base + delta files can be efficiently sort/merged to produce the snapshot for the current transaction.
> This prevents the user from sorting the table on any other criteria, which can be useful.  One example is dynamic partition insert (which also occurs for update/delete SQL).  This may create lots of writers (buckets*partitions) and tax cluster resources.
> The usual solution is hive.optimize.sort.dynamic.partition=true, which won't be honored for ACID tables.
> We could rely on a hash-table-based algorithm to merge delta files and then not require any particular sort on ACID tables.  One way to do that is to treat each update event as an insert (new internal PK) + delete (old PK).  Delete events are very small since they just need to contain PKs.  So the hash table would just need to contain delete events and be reasonably memory efficient.
> This is a significant amount of work but worth doing.
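
The hash-table proposal in the description can be sketched roughly as follows. This is a toy Python model under stated assumptions, not the proposed Hive implementation: the function name and the (pk, value) event shape are hypothetical, and a plain set stands in for the hash table of delete events.

```python
# Hypothetical shapes: insert events are (pk, value); delete events are bare PKs.
def hash_merge_snapshot(insert_events, delete_events):
    # Only the deleted PKs are held in memory, which is why the table stays small.
    deleted = set(delete_events)
    # Inserts can be scanned in any order; no sort on the files is required.
    return [(pk, value) for pk, value in insert_events if pk not in deleted]

# An update of pk 2 is modeled as delete(2) + insert(4) with a new internal PK.
inserts = [(3, "c"), (1, "a"), (2, "b"), (4, "B")]
deletes = [2]
print(hash_merge_snapshot(inserts, deletes))
```

The trade-off this illustrates is that the reader no longer depends on file order, at the cost of keeping every relevant deleted PK in memory during the scan.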



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)