You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Edward Capriolo (JIRA)" <ji...@apache.org> on 2013/11/18 19:51:30 UTC
[jira] [Commented] (HIVE-5317) Implement insert, update, and delete in Hive with full ACID support

    [ https://issues.apache.org/jira/browse/HIVE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825611#comment-13825611 ] 

Edward Capriolo commented on HIVE-5317:
---------------------------------------

I have two fundamental problems with this concept.
{quote}
The only requirement is that the file format must be able to support a rowid. With things like text and sequence file this can be done via a byte offset.
{quote}

This is a good reason not to do this. Things that  only work for some formats create fragmentation. What about format's that do not have a row id? What if the user is already using the key for something else like data?

{quote}
Once an hour a log of transactions is exported from a RDBS and the fact tables need to be updated (up to 1m rows) to reflect the new data. The transactions are a combination of inserts, updates, and deletes. The table is partitioned and bucketed.
{quote}

What this ticket describes seems like a bad use case for hive. Why would the user not simply create a new table partitioned by hour? What is the need to transaction ally in-place update a table? 

It seems like the better solution would be for the user to log these updates themselves and then export the table with a tool like squoop periodically.  

I see this as a really complicated piece of work, for a narrow use case, and I have a very difficult time believing adding transactions to hive to support this is the right answer.

> Implement insert, update, and delete in Hive with full ACID support
> -------------------------------------------------------------------
>
>                 Key: HIVE-5317
>                 URL: https://issues.apache.org/jira/browse/HIVE-5317
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: InsertUpdatesinHive.pdf
>
>
> Many customers want to be able to insert, update and delete rows from Hive tables with full ACID support. The use cases are varied, but the form of the queries that should be supported are:
> * INSERT INTO tbl SELECT …
> * INSERT INTO tbl VALUES ...
> * UPDATE tbl SET … WHERE …
> * DELETE FROM tbl WHERE …
> * MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED THEN ...
> * SET TRANSACTION LEVEL …
> * BEGIN/END TRANSACTION
> Use Cases
> * Once an hour, a set of inserts and updates (up to 500k rows) for various dimension tables (eg. customer, inventory, stores) needs to be processed. The dimension tables have primary keys and are typically bucketed and sorted on those keys.
> * Once a day a small set (up to 100k rows) of records need to be deleted for regulatory compliance.
> * Once an hour a log of transactions is exported from a RDBS and the fact tables need to be updated (up to 1m rows)  to reflect the new data. The transactions are a combination of inserts, updates, and deletes. The table is partitioned and bucketed.



--
This message was sent by Atlassian JIRA
(v6.1#6144)