Posted to user@hive.apache.org by Vijay Ramachandran <vi...@linkedin.com> on 2016/09/23 08:46:44 UTC

on duplicate update equivalent?

Hello.

Is there a way to write a query with a behaviour equivalent to mysql's "on
duplicate update"? i.e., try to insert, and if key exists, update the row
instead?
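For reference, the MySQL construct in question is INSERT ... ON DUPLICATE KEY UPDATE; a made-up illustration (table and column names are invented):

```sql
-- MySQL: insert the row, or update the existing row if the key already exists
INSERT INTO counters (id, hits) VALUES (1, 1)
ON DUPLICATE KEY UPDATE hits = hits + 1;
```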

thanks,

Re: on duplicate update equivalent?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Vijay,

If the dimension tables are of reasonable size and frequently updated, then
you can deploy *Spark SQL* to read the data directly from your MySQL table
through JDBC and do your join with the fact table stored in Hive.
In general these days one can do better with Spark SQL. Your fact table
still remains immutable in Hive. You can parallelize reads from the RDBMS
to speed up the query and see the dimension tables as they currently are.
Some dimension tables are updated frequently, or receive recurring updates
as in our case.

//
// Get the FACT table from Hive
//
val s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID, ..... FROM HiveDB.sales")
//
// Get the dimension tables from the RDBMS (in this case Oracle) through JDBC
//
val c = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT CHANNEL_ID, CHANNEL_DESC, ..... FROM OracleDB.channels)",
      "user" -> _username,
      "password" -> _password))
val t = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT TIME_ID, CALENDAR_MONTH_DESC, ...... FROM OracleDB.times)",
      "user" -> _username,
      "password" -> _password))
//
// Register the three DataFrames as temporary tables using registerTempTable
//
s.registerTempTable("t_s")
c.registerTempTable("t_c")
t.registerTempTable("t_t")

Then you can run SQL against the temp tables much as you would in Hive;
HiveContext in Spark accepts HiveQL here.

val sqltext = """
SELECT rs.Month, rs.SalesChannel, round(TotalSales,2) AS Sales, ........
FROM
(
SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
SUM(t_s.AMOUNT_SOLD) AS TotalSales
FROM t_s, t_t, t_c
...) rs
"""
HiveContext.sql(sqltext).collect


HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 23 September 2016 at 11:46, Vijay Ramachandran <vi...@linkedin.com>
wrote:

>
> On Fri, Sep 23, 2016 at 3:47 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> What is the use case for UPSERT in Hive. The functionality does not exist
>> but there are other solutions.
>>
>> Are we talking about a set of dimension tables with primary keys that need
>> to be updated (existing rows) or inserted (new rows)?
>>
>
>
> Hi Mich.
>
> Exactly, I'm looking at dimension tables.
>
> thanks,
>
>

Re: on duplicate update equivalent?

Posted by Damien Carol <da...@gmail.com>.
Another solution is to use Hive over HBase.
When you insert into such a table, Hive does an upsert.
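A minimal sketch of such an HBase-backed table, assuming the Hive HBase storage handler jar is on the classpath (the table, column, and column-family names are illustrative):

```sql
-- Hive table backed by HBase. HBase treats every put as an upsert on the
-- row key, so re-inserting an existing key overwrites that row.
CREATE TABLE dim_customer_hbase (customer_id INT, customer_name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name")
TBLPROPERTIES ("hbase.table.name" = "dim_customer");
```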




2016-09-23 21:00 GMT+02:00 Mich Talebzadeh <mi...@gmail.com>:

> The fundamental question is: do you want these recurring updates to
> dimension tables to throttle your Hive tables?
>
> Besides, why bother with ETL when one can do ELT?
>
> For a dimension table, just add two additional columns, namely
>
>    , op_type int
>    , op_time timestamp
>
> where op_type = 1/2/3 (INSERT/UPDATE/DELETE) and op_time is the timestamp
> Hive applies to the original record. New records will be appended to the
> dimension table, so you have the full Entity Life History (one INSERT,
> multiple UPDATEs and one DELETE) for a given primary key. Then you can do
> whatever you want, plus of course a full audit of every record (for
> example, what happened to every trade, who changed what, etc.).
>
> In your join with the FACT table you will need to use analytic functions
> to find the last entry for a given primary key (ignoring deletes), or just
> use standard HQL.
>
> If you are going to bring HBase etc. into it, then the Spark solution
> that I suggested earlier may serve better.
>
> HTH
>
>
>
>
>
>
>
>
> On 23 September 2016 at 19:36, Gopal Vijayaraghavan <go...@apache.org>
> wrote:
>
>> > Dimensions change, and I'd rather do update than recreate a snapshot.
>>
>> Slowly changing dimensions are the common use case for Hive's ACID MERGE.
>>
>> The feature you need is most likely covered by
>>
>> https://issues.apache.org/jira/browse/HIVE-10924
>>
>> 2nd comment from that JIRA
>>
>> "Once an hour, a set of inserts and updates (up to 500k rows) for various
>> dimension tables (eg. customer, inventory, stores) needs to be processed.
>> The dimension tables have primary keys and are typically bucketed and
>> sorted on those keys."
>>
>> Any other approach would need a full snapshot re-materialization, because
>> ACID can generate DELETE + INSERT instead of rewriting the original file
>> for a 2% upsert.
>>
>> If you do not have any isolation concerns (as in, a query doing a read
>> when 50% of your update has applied), using HBase backed dimension tables
>> in Hive is possible, but it does not offer the same transactional
>> consistency as the ACID merge will.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>

Re: on duplicate update equivalent?

Posted by Mich Talebzadeh <mi...@gmail.com>.
The fundamental question is: do you want these recurring updates to
dimension tables to throttle your Hive tables?

Besides, why bother with ETL when one can do ELT?

For a dimension table, just add two additional columns, namely

   , op_type int
   , op_time timestamp

where op_type = 1/2/3 (INSERT/UPDATE/DELETE) and op_time is the timestamp
Hive applies to the original record. New records will be appended to the
dimension table, so you have the full Entity Life History (one INSERT,
multiple UPDATEs and one DELETE) for a given primary key. Then you can do
whatever you want, plus of course a full audit of every record (for
example, what happened to every trade, who changed what, etc.).

In your join with the FACT table you will need to use analytic functions to
find the last entry for a given primary key (ignoring deletes), or just use
standard HQL.
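A minimal sketch of that analytic-function lookup, assuming a dimension table dim_customer maintained with this op_type/op_time scheme (all table and column names are illustrative):

```sql
-- Latest surviving version of each dimension row:
-- rank versions of each key by op_time, keep the newest,
-- and drop keys whose newest operation was a DELETE (op_type = 3).
SELECT customer_id, customer_name
FROM (
  SELECT customer_id, customer_name, op_type,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY op_time DESC) AS rn
  FROM   dim_customer
) latest
WHERE rn = 1
  AND op_type <> 3;
```

The result can be joined against the fact table in place of the raw dimension table.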

If you are going to bring HBase etc. into it, then the Spark solution that
I suggested earlier may serve better.

HTH








On 23 September 2016 at 19:36, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> > Dimensions change, and I'd rather do update than recreate a snapshot.
>
> Slowly changing dimensions are the common use case for Hive's ACID MERGE.
>
> The feature you need is most likely covered by
>
> https://issues.apache.org/jira/browse/HIVE-10924
>
> 2nd comment from that JIRA
>
> "Once an hour, a set of inserts and updates (up to 500k rows) for various
> dimension tables (eg. customer, inventory, stores) needs to be processed.
> The dimension tables have primary keys and are typically bucketed and
> sorted on those keys."
>
> Any other approach would need a full snapshot re-materialization, because
> ACID can generate DELETE + INSERT instead of rewriting the original file
> for a 2% upsert.
>
> If you do not have any isolation concerns (as in, a query doing a read
> when 50% of your update has applied), using HBase backed dimension tables
> in Hive is possible, but it does not offer the same transactional
> consistency as the ACID merge will.
>
> Cheers,
> Gopal
>
>
>

Re: on duplicate update equivalent?

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Dimensions change, and I'd rather do update than recreate a snapshot.

Slowly changing dimensions are the common use case for Hive's ACID MERGE.

The feature you need is most likely covered by 

https://issues.apache.org/jira/browse/HIVE-10924

The second comment on that JIRA:

"Once an hour, a set of inserts and updates (up to 500k rows) for various dimension tables (eg. customer, inventory, stores) needs to be processed. The dimension tables have primary keys and are typically bucketed and sorted on those keys."

Any other approach would need a full snapshot re-materialization, because ACID can generate DELETE + INSERT instead of rewriting the original file for a 2% upsert.
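For reference, HIVE-10924 implements the SQL:2011 MERGE statement; once it lands, the hourly upsert could be sketched roughly as follows (requires a transactional/ACID target table; all table and column names are illustrative):

```sql
-- Sketch of the SQL:2011 MERGE implemented by HIVE-10924:
-- update matching keys in place, insert the rest.
MERGE INTO dim_customer AS t
USING staging_customer AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET customer_name = s.customer_name
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.customer_name);
```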

If you do not have any isolation concerns (as in, a query doing a read when 50% of your update has applied), using HBase backed dimension tables in Hive is possible, but it does not offer the same transactional consistency as the ACID merge will.

Cheers,
Gopal



RE: on duplicate update equivalent?

Posted by Vijay Ramachandran <vi...@linkedin.com>.
Dimensions change, and I'd rather do update than recreate a snapshot.

On 23-Sep-2016 17:23, "Markovitz, Dudu" <dm...@paypal.com> wrote:

> If these are dimension tables, what do you need to update there?
>
>
>
> Dudu
>
>
>
> *From:* Vijay Ramachandran [mailto:vijay@linkedin.com]
> *Sent:* Friday, September 23, 2016 1:46 PM
> *To:* user@hive.apache.org
> *Subject:* Re: on duplicate update equivalent?
>
>
>
>
>
> On Fri, Sep 23, 2016 at 3:47 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
> What is the use case for UPSERT in Hive. The functionality does not exist
> but there are other solutions.
>
>
>
> Are we talking about a set of dimension tables with primary keys that need
> to be updated (existing rows) or inserted (new rows)?
>
>
>
> Hi Mich.
>
> Exactly, I'm looking at dimension tables.
>
> thanks,
>
>
>

RE: on duplicate update equivalent?

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
If these are dimension tables, what do you need to update there?

Dudu

From: Vijay Ramachandran [mailto:vijay@linkedin.com]
Sent: Friday, September 23, 2016 1:46 PM
To: user@hive.apache.org
Subject: Re: on duplicate update equivalent?


On Fri, Sep 23, 2016 at 3:47 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
What is the use case for UPSERT in Hive. The functionality does not exist but there are other solutions.

Are we talking about a set of dimension tables with primary keys that need to be updated (existing rows) or inserted (new rows)?

Hi Mich.
Exactly, I'm looking at dimension tables.
thanks,


Re: on duplicate update equivalent?

Posted by Vijay Ramachandran <vi...@linkedin.com>.
On Fri, Sep 23, 2016 at 3:47 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> What is the use case for UPSERT in Hive. The functionality does not exist
> but there are other solutions.
>
> Are we talking about a set of dimension tables with primary keys hat need
> to be updated (existing rows) or inserted (new rows)?
>


Hi Mich.

Exactly, I'm looking at dimension tables.

thanks,

Re: on duplicate update equivalent?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Vijay,

What is the use case for UPSERT in Hive. The functionality does not exist
but there are other solutions.

Are we talking about a set of dimension tables with primary keys that need
to be updated (existing rows) or inserted (new rows)?

HTH






On 23 September 2016 at 09:46, Vijay Ramachandran <vi...@linkedin.com>
wrote:

> Hello.
>
> Is there a way to write a query with a behaviour equivalent to mysql's "on
> duplicate update"? i.e., try to insert, and if key exists, update the row
> instead?
>
> thanks,
>
>

RE: on duplicate update equivalent?

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
You may, however, use code similar to the following.
The main idea is to work with two target tables.
Instead of merging the source table into a target table, we create an additional target table based on the merge results.
A view points at all times to the most up-to-date target table.

Dudu


Initialize demo -

create table src (i int,c char(1));
insert into src values (2,'b'),(3,'c');

create table trg1 (i int,c char(1)) stored as orc;
insert into trg1 values (1,'X'),(2,'Y');

create view trg as select * from trg1;


Ongoing process -

create table if not exists trg2 as select coalesce(s.i, t.i) as i, coalesce(s.c, t.c) as c from src as s full join trg as t on t.i = s.i;

alter view trg as select * from trg2;

drop table if exists trg1;


After some time passes and the source table contains new data -

create table if not exists trg1 as select coalesce(s.i, t.i) as i, coalesce(s.c, t.c) as c from src as s full join trg as t on t.i = s.i;

alter view trg as select * from trg1;

drop table if exists trg2;


etc…




From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Friday, September 23, 2016 1:02 PM
To: user@hive.apache.org
Subject: RE: on duplicate update equivalent?

We’re not there yet…
https://issues.apache.org/jira/browse/HIVE-10924

Dudu

From: Vijay Ramachandran [mailto:vijay@linkedin.com]
Sent: Friday, September 23, 2016 11:47 AM
To: user@hive.apache.org
Subject: on duplicate update equivalent?

Hello.
Is there a way to write a query with a behaviour equivalent to mysql's "on duplicate update"? i.e., try to insert, and if key exists, update the row instead?
thanks,

RE: on duplicate update equivalent?

Posted by "Markovitz, Dudu" <dm...@paypal.com>.
We’re not there yet…
https://issues.apache.org/jira/browse/HIVE-10924

Dudu

From: Vijay Ramachandran [mailto:vijay@linkedin.com]
Sent: Friday, September 23, 2016 11:47 AM
To: user@hive.apache.org
Subject: on duplicate update equivalent?

Hello.
Is there a way to write a query with a behaviour equivalent to mysql's "on duplicate update"? i.e., try to insert, and if key exists, update the row instead?
thanks,