You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@trafodion.apache.org by Hans Zeller <ha...@esgyn.com> on 2016/03/15 22:39:34 UTC

Upsert semantics

Hi,

Here is a question on how we should define the meaning of an UPSERT
statement. UPSERT is not part of the ISO/ANSI SQL standard, so we have some
leeway to define it.

My personal feeling is that UPSERT should either insert a brand-new row or
it should completely replace an existing row, but it should never combine
columns from a new and an existing row. If users want the latter then they
should use the MERGE command.

We should probably follow what other DBMSs do. I could not yet find a DBMS
that had an UPSERT command, although there probably is one.

- PostgreSQL: Has an insert with a conflict clause, similar to our
MERGE: http://www.postgresql.org/docs/current/static/sql-insert.html

- MySQL: Has an insert with ON DUPLICATE KEY clause, similar to our
MERGE: http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html

- Teradata: Has an update ... else insert ... command, similar to MERGE:
https://forums.teradata.com/forum/enterprise/problem-using-upsert

- Oracle just seems to have the MERGE statement and various methods to
do upsert in PL/SQL:
http://stackoverflow.com/questions/237327/oracle-how-to-upsert-update-or-insert-into-a-table

This seems to support indirectly what I'm proposing. If we want to merge
old and new row then we should use syntax specifying how to merge, which is
what the other DBMSs have done.

See also the discussion in
https://github.com/apache/incubator-trafodion/pull/380. I wanted to see
whether the user group has any input on this.

Thanks,

Hans

Re: Upsert semantics

Posted by Qifan Chen <qi...@esgyn.com>.

I agree omitting default values from storage is an optimization and as such it should provide the same UPSERT semantics as with other storage formats/optimizations.

Specially our code could insert default value checking expression to verify that an value is exact the same as the default value and omit it for storage (extra overhead), or insert/update otherwise.

The other option would be not checking the default value at all and allow mixed storage model for default values (fast upsert but some extra storage overhead).

Any change on the handling of CURRENT defaults should still stick  to ANSI. 

Thanks

-Qifan

Sent from my iPhone

> On Mar 18, 2016, at 7:23 AM, Suresh Subbiah <su...@gmail.com> wrote:
> 
> Hi,
> 
> To me upsert has meant a faster performing version of insert, with duplicate key errors ignored. I would claim that most users are drawn towards upsert since it performs better than insert.
> I do not think compatibility with Phoenix syntax is an important requirement. 
> As everyone has said we would not want a statement to have different semantics depending on row format.   
> I do not quite understand why an omitted CURRENT default is treated differently from other omitted defaults, so I could see the last column in the first row below also being transformed to "Replace the given columns", but this I do feel is not crucial. Whichever is easier for us to implement as long as it is defined should be sufficient.
> 
> 
> With these principles in mind my vote would be for the proposal Hans gave above.
> 
> I am sorry for not stating my opinion clearly during review. 
> 
> Thank you
> Suresh
> 
>                                  Aligned Format                    Aligned format with            Non-Aligned with                     Non-Aligned with
> 
>                                    With no omitted                    omitted columns               with no omitted                       omitted current default
> 
>                                     columns                                                                    / omitted non-current columns
> 
>  
> 
> CQD off                          Replaces row                          MERGE                         Replace the given columns                 MERGE
> 
> CQD on (default)             Replaces row                          Replaces row                Replace all columns                  Replace all columns      
> 
> 
> 
> 
> 
> 
> 
> 
>> On Thu, Mar 17, 2016 at 4:58 PM, Selva Govindarajan <se...@esgyn.com> wrote:
>> Here is what I found with phoenix, just to compare with phoenix’s behavior for upsert.
>> 
>>  
>> 
>> Phoenix expects the table to have a primary key. Upsert specification is
>> 
>> Inserts if not present and updates otherwise the value in the table. The list of columns is optional and if not present, the values will map to the column in the order they are declared in the schema. The values must evaluate to constants.
>> 
>> create table phoenix.testtbl (c1 integer not null primary key, c2 integer , c3 integer) ;
>> upsert into phoenix.testtbl (c1, c2)  values (1,1) ;
>> upsert into phoenix.testtbl (c1,c3)  values (1,1) ;
>> upsert into phoenix.testtbl (c1,c2)  values (1,null) ;
>>  
>> 0: jdbc:phoenix:localhost:51670> select * from phoenix.testtbl ;
>> +------------------------------------------+------------------------------------------+------------------------------------------+
>> |                    C1                    |                    C2                    |                    C3                    |
>> +------------------------------------------+------------------------------------------+------------------------------------------+
>> | 1                                        | null                                     | 1                                        |
>> +------------------------------------------+------------------------------------------+------------------------------------------+
>>  
>> In the raw hbase table, I see the following cells after the above 3 upserts. It looks like phoenix deletes the cell if it updated with null value.
>>  
>> hbase(main):006:0> scan 'PHOENIX.TESTTBL'
>> ROW                                      COLUMN+CELL                                                                                                         
>>  \x80\x00\x00\x01                        column=0:C3, timestamp=1458249350858, value=\x80\x00\x00\x01                                                        
>>  \x80\x00\x00\x01                        column=0:_0, timestamp=1458249392491, value=                                                                        
>> 1 row(s) in 0.0210 seconds
>>  
>> 
>> Selva
>> 
>>  
>> 
>> From: Dave Birdsall [mailto:dave.birdsall@esgyn.com] 
>> Sent: Thursday, March 17, 2016 11:09 AM
>> 
>> 
>> To: user@trafodion.incubator.apache.org
>> Subject: RE: Upsert semantics
>>  
>> 
>> Hi,
>> 
>>  
>> 
>> It sounds to me like this makes the semantics of UPSERT depend on physical row layout, which seems contrary to the philosophy of SQL language design as a declarative language.
>> 
>>  
>> 
>> I’d much rather have different syntax for each of these semantics. A different verb perhaps. Or a clause added to it. Then it is clear to the application developer what semantics he is getting. He does not have to examine the physical schema to figure this out.
>> 
>>  
>> 
>> Dave
>> 
>>  
>> 
>> From: Selva Govindarajan [mailto:selva.govindarajan@esgyn.com] 
>> Sent: Thursday, March 17, 2016 11:01 AM
>> To: user@trafodion.incubator.apache.org
>> Subject: RE: Upsert semantics
>> 
>>  
>> 
>> I wonder if the CQD TRAF_UPSERT_WITH_INSERT_DEFAULT_SEMANTICS should be set to “SYSTEM” by default. It can take ‘SYSTEM’, ‘ON’ or ‘OFF’.
>> 
>>  
>> 
>> For aligned format – SYSTEM would be treated as ‘ON’ – User can override it with ‘OFF’ if he/she needs merge semantics.
>> 
>>  
>> 
>> For non-aligned format – SYSTEM would be treated as ‘OFF’. This would ensure that all the columns are not inserted all the time into raw hbase table.  User can avoid merge semantics for omitted default current columns by overriding it with ‘ON’ semantics.
>> 
>>  
>> 
>> Selva
>> 
>>  
>> 
>> From: Hans Zeller [mailto:hans.zeller@esgyn.com] 
>> Sent: Tuesday, March 15, 2016 6:36 PM
>> To: user@trafodion.incubator.apache.org
>> Subject: Re: Upsert semantics
>> 
>>  
>> 
>> Thank you, Selva. The JIRA is https://issues.apache.org/jira/browse/TRAFODION-1896.
>> 
>> 
>> 
>> Hans
>> 
>>  
>> 
>> On Tue, Mar 15, 2016 at 6:15 PM, Selva Govindarajan <se...@esgyn.com> wrote:
>> 
>> Hans,
>> 
>>  
>> 
>> It didn’t occur to me your proposed change would work. I was always thinking we shouldn’t be adding the omitted columns in non-aligned format.   You can file a JIRA and I will fix it.
>> 
>>  
>> 
>> Selva
>> 
>>  
>> 
>> From: Anoop Sharma [mailto:anoop.sharma@esgyn.com] 
>> Sent: Tuesday, March 15, 2016 6:03 PM
>> 
>> 
>> To: user@trafodion.incubator.apache.org
>> Subject: RE: Upsert semantics
>> 
>>  
>> 
>> yes, one cqd to switch between one or the other behavior in all formats is the right way to go.
>> 
>>  
>> 
>> Doing the other way based on the row format would cause more issues when we
>> 
>> support hybrid format rows where some columns are in aligned format and others
>> 
>> are not.
>> 
>>  
>> 
>> anoop
>> 
>>  
>> 
>> From: Hans Zeller [mailto:hans.zeller@esgyn.com] 
>> Sent: Tuesday, March 15, 2016 5:58 PM
>> To: user@trafodion.incubator.apache.org
>> Subject: Re: Upsert semantics
>> 
>>  
>> 
>> Again, IMHO that's the wrong way to go, but I hope others will chime in. Dave gave the best reason, it's a bad idea to make the semantics of UPSERT depend on the internal format. Here is what I would suggest, using Selva's table (proposed changes in red - hope Apache won't mangle them):
>>  
>>                                  Aligned Format                    Aligned format with            Non-Aligned with                     Non-Aligned with
>> 
>>                                    With no omitted                    omitted columns               with no omitted                       omitted current default
>> 
>>                                     columns                                                                    / omitted non-current columns
>> 
>>  
>> 
>> CQD off                          Replaces row                          MERGE                         Replace the given columns                 MERGE
>> 
>> CQD on (default)             Replaces row                          Replaces row                Replace all columns                  Replace all columns                             
>> 
>> 
>> 
>> Hans
>> 
>>  
>> 
>> On Tue, Mar 15, 2016 at 5:36 PM, Selva Govindarajan <se...@esgyn.com> wrote:
>> 
>> I believe phoenix doesn’t support insert semantics or the non-null default value columns.  Trafodion supports insert, upsert, non-null default value columns as well as current default values like current timestamp and current user.
>> 
>>  
>> 
>> Upsert handling in Trafodion is same as phoenix for non-aligned format. For aligned format it can be controlled via CQD.
>> 
>>  
>> 
>>                                  Aligned Format                    Aligned format with            Non-Aligned with                     Non-Aligned with
>> 
>>                                    With no omitted                    omitted columns               with no omitted                       omitted current default
>> 
>>                                     columns                                                                    / omitted non-current columns
>> 
>>  
>> 
>> Default behavior             Replaces row                          MERGE                         Replace the given columns                 MERGE
>> 
>> With the CQD                 Replaces row                          Replaces row                Replace the given columns                 MERGE
>> 
>>          set to on                                       
>> 
>>  
>> 
>> The CQD to be used is TRAF_UPSERT_WITH_INSERT_DEFAULT_SEMANTICS (Default is off). In short, this CQD is a no-op for non-aligned format.  
>> 
>>  
>> 
>> The behavior of the non-aligned format can’t be controlled by the CQD because we don’t store values for the omitted columns in hbase and hence when the user switches the CQD settings for upserts with different sets of omitted columns, we could end up with non-deterministic values for these columns.    
>> 
>> For eq. upsert with the cqd set to ‘on’ with a set of omitted columns
>> 
>> Upsert with the cqd set to ‘off’ with a different set of omitted columns
>> 
>> If we switch to insert all column values all the time for non-aligned format, then we can let user to control what value needs to be put in for the omitted column.
>> 
>>  
>> 
>> Selva                                             
>> 
>>  
>> 
>> From: Hans Zeller [mailto:hans.zeller@esgyn.com] 
>> Sent: Tuesday, March 15, 2016 4:01 PM
>> To: user@trafodion.incubator.apache.org
>> Subject: Re: Upsert semantics
>> 
>>  
>> 
>> Yes, that's what I had in mind, using a CQD as the syntax:
>> 
>>  
>> 
>> UPSERT handling                 aligned format        non-aligned format
>> 
>> ------------------------------  --------------------  -------------------------------
>> 
>> default behavior                replace row           replace row (create all values)
>> 
>> Phoenix behavior (via CQD):     transform to MERGE    insert only specified cols (*)
>> 
>>  
>> 
>> (*) One issue here is with "default current". In that case we may also need to transform the statement into a MERGE.
>> 
>>  
>> 
>> From a performance point of view, the "default behavior" would work better for aligned format, the Phoenix behavior would work better for non-aligned format.
>> 
>>  
>> 
>> In some cases it won't matter. Selva's code will detect many of these and automatically choose the faster implementation.
>> 
>>  
>> 
>> Thanks,
>> 
>> 
>> 
>> Hans
>> 
>>  
>> 
>> On Tue, Mar 15, 2016 at 3:41 PM, Dave Birdsall <da...@esgyn.com> wrote:
>> 
>> <Cringe> Not sure we want the logical semantics of an operation to depend on the physical layout of the row.
>> 
>>  
>> 
>> Would be better to have different syntax for each. With an explanation that one works faster on one format, and the other faster on the other format.
>> 
>>  
>> 
>> From: Eric Owhadi [mailto:eric.owhadi@esgyn.com] 
>> Sent: Tuesday, March 15, 2016 3:38 PM
>> 
>> 
>> To: user@trafodion.incubator.apache.org
>> Subject: RE: Upsert semantics
>> 
>>  
>> 
>> Would there be a problem if we implemented the phoenix semantic for non align format, and the  upsert semantic proposed by Hans in align format?
>> 
>> This would allow speed optimization without having the user to know about subtle differences?
>> 
>> Eric
>> 
>>  
>> 
>>  
>> 
>> From: Anoop Sharma [mailto:anoop.sharma@esgyn.com] 
>> Sent: Tuesday, March 15, 2016 5:14 PM
>> To: user@trafodion.incubator.apache.org
>> Subject: RE: Upsert semantics
>> 
>>  
>> 
>> Phoenix has upsert command and from what can tell, they originally came up with upsert syntax.
>> 
>> Their semantic is to insert if not present and update the specified columns with the specified values if present.
>> 
>> We did do an experiment and upsert only updates the specified columns.
>> 
>> Maybe we can add a cqd so full row update vs. specified column update behavior could be chosen.
>> 
>>  
>> 
>> Here is their specification.
>> 
>> Inserts if not present and updates otherwise the value in the table. The list of columns is optional and if not present, the values will map to the column in the order they are declared in the schema. The values must evaluate to constants.
>> 
>> Example:
>> 
>> UPSERT INTO TEST VALUES('foo','bar',3);
>> UPSERT INTO TEST(NAME,ID) VALUES('foo',123);
>> 
>>  
>> 
>>  
>> 
>> From: Dave Birdsall [mailto:dave.birdsall@esgyn.com] 
>> Sent: Tuesday, March 15, 2016 2:55 PM
>> To: user@trafodion.incubator.apache.org
>> Subject: RE: Upsert semantics
>> 
>>  
>> 
>> Hi,
>> 
>>  
>> 
>> It seems that when ANSI first added MERGE to the standard, it was portrayed as “upsert” (see https://en.wikipedia.org/wiki/Merge_(SQL)).
>> 
>>  
>> 
>> I agree though that we are free to define our UPSERT to mean anything we want.
>> 
>>  
>> 
>> I like what you suggest. Since our UPSERT syntax already specifies values for all the columns, it makes sense for it to have “replace” semantics. That is, if the row exists, replace it with all the new stuff (with defaults for columns omitted). If the row doesn’t exist, it’s just a straight insert (with defaults for omitted columns).
>> 
>>  
>> 
>> And if one really wants UPDATE semantics as opposed to “replace” semantics, then the ANSI MERGE statement (which Trafodion also supports) is the way to go.
>> 
>>  
>> 
>> There is an analogy to this in linguistic theory. Whenever a language has two words that at a point in time mean the same thing, there is a tendency for the meanings to change over time so they diverge. For example, English “shirt” and “skirt”, originally from the same root, but one via Anglo-Saxon the other via Old Norse.
>> 
>>  
>> 
>> Dave
>> 
>>  
>> 
>>  
>> 
>> From: Hans Zeller [mailto:hans.zeller@esgyn.com] 
>> Sent: Tuesday, March 15, 2016 2:40 PM
>> To: user@trafodion.incubator.apache.org
>> Subject: Upsert semantics
>> 
>>  
>> 
>> Hi,
>> 
>>  
>> 
>> Here is a question on how we should define the meaning of an UPSERT statement. UPSERT is not part of the ISO/ANSI SQL standard, so we have some leeway to define it.
>> 
>>  
>> 
>> My personal feeling is that UPSERT should either insert a brand-new row or it should completely replace an existing row, but it should never combine columns from a new and an existing row. If users want the latter then they should use the MERGE command.
>> 
>>  
>> 
>> We should probably follow what other DBMSs do. I could not yet find a DBMS that had an UPSERT command, although there probably is one.
>> 
>>  
>> 
>> PostgreSQL: Has an insert with a conflict clause, similar to our MERGE: http://www.postgresql.org/docs/current/static/sql-insert.html
>> MySQL: Has an insert with ON DUPLICATE KEY clause, similar to our MERGE: http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html
>> Teradata: Has an update ... else insert ... command, similar to MERGE: https://forums.teradata.com/forum/enterprise/problem-using-upsert
>> Oracle just seems to have the MERGE statement and various methods to do upsert in PL/SQL: http://stackoverflow.com/questions/237327/oracle-how-to-upsert-update-or-insert-into-a-table
>> This seems to support indirectly what I'm proposing. If we want to merge old and new row then we should use syntax specifying how to merge, which is what the other DBMSs have done.
>> 
>>  
>> 
>> See also the discussion in https://github.com/apache/incubator-trafodion/pull/380. I wanted to see whether the user group has any input on this.
>> 
>>  
>> 
>> Thanks,
>> 
>> 
>> 
>> Hans
>> 
>

Re: Upsert semantics

Posted by Suresh Subbiah <su...@gmail.com>.

Hi,

To me upsert has meant a faster performing version of insert, with
duplicate key errors ignored. I would claim that most users are drawn
towards upsert since it performs better than insert.
I do not think compatibility with Phoenix syntax is an important
requirement.
As everyone has said we would not want a statement to have different
semantics depending on row format.
I do not quite understand why an omitted CURRENT default is treated
differently from other omitted defaults, so I could see the last column in
the first row below also being transformed to "Replace the given columns",
but this I do feel is not crucial. Whichever is easier for us to implement
as long as it is defined should be sufficient.


With these principles in mind my vote would be for the proposal Hans gave
above.

I am sorry for not stating my opinion clearly during review.

Thank you
Suresh

                                 Aligned Format                    Aligned
format with            Non-Aligned with                     Non-Aligned with

                                   With no omitted
                   omitted columns               with no omitted
                      omitted current default

                                    columns
                                                        / omitted
non-current columns



CQD off                          Replaces row
MERGE                         Replace the given columns
MERGE

CQD on (default)             Replaces row                          Replaces
row                Replace all columns                  Replace all columns






On Thu, Mar 17, 2016 at 4:58 PM, Selva Govindarajan <
selva.govindarajan@esgyn.com> wrote:

> Here is what I found with phoenix, just to compare with phoenix’s behavior
> for upsert.
>
>
>
> Phoenix expects the table to have a primary key. Upsert specification is
>
> Inserts if not present and updates otherwise the value in the table. The
> list of columns is optional and if not present, the values will map to the
> column in the order they are declared in the schema. The values must
> evaluate to constants.
>
> create table phoenix.testtbl (c1 integer not null primary key, c2 integer
> , c3 integer) ;
>
> upsert into phoenix.testtbl (c1, c2)  values (1,1) ;
>
> upsert into phoenix.testtbl (c1,c3)  values (1,1) ;
>
> upsert into phoenix.testtbl (c1,c2)  values (1,null) ;
>
>
>
> 0: jdbc:phoenix:localhost:51670> select * from phoenix.testtbl ;
>
>
> +------------------------------------------+------------------------------------------+------------------------------------------+
>
> |                    C1                    |
> C2                    |                    C3                    |
>
>
> +------------------------------------------+------------------------------------------+------------------------------------------+
>
> | 1                                        |
> null                                     |
> 1                                        |
>
>
> +------------------------------------------+------------------------------------------+------------------------------------------+
>
>
>
> In the raw hbase table, I see the following cells after the above 3
> upserts. It looks like phoenix deletes the cell if it updated with null
> value.
>
>
>
> hbase(main):006:0> scan 'PHOENIX.TESTTBL'
>
> ROW
> COLUMN+CELL
>
>
>  \x80\x00\x00\x01                        column=0:C3,
> timestamp=1458249350858,
> value=\x80\x00\x00\x01
>
>
>  \x80\x00\x00\x01                        column=0:_0,
> timestamp=1458249392491,
> value=
>
>
> 1 row(s) in 0.0210 seconds
>
>
>
> Selva
>
>
>
> *From:* Dave Birdsall [mailto:dave.birdsall@esgyn.com]
> *Sent:* Thursday, March 17, 2016 11:09 AM
>
> *To:* user@trafodion.incubator.apache.org
> *Subject:* RE: Upsert semantics
>
>
>
> Hi,
>
>
>
> It sounds to me like this makes the semantics of UPSERT depend on physical
> row layout, which seems contrary to the philosophy of SQL language design
> as a declarative language.
>
>
>
> I’d much rather have different syntax for each of these semantics. A
> different verb perhaps. Or a clause added to it. Then it is clear to the
> application developer what semantics he is getting. He does not have to
> examine the physical schema to figure this out.
>
>
>
> Dave
>
>
>
> *From:* Selva Govindarajan [mailto:selva.govindarajan@esgyn.com]
> *Sent:* Thursday, March 17, 2016 11:01 AM
> *To:* user@trafodion.incubator.apache.org
> *Subject:* RE: Upsert semantics
>
>
>
> I wonder if the CQD TRAF_UPSERT_WITH_INSERT_DEFAULT_SEMANTICS should be
> set to “SYSTEM” by default. It can take ‘SYSTEM’, ‘ON’ or ‘OFF’.
>
>
>
> For aligned format – SYSTEM would be treated as ‘ON’ – User can override
> it with ‘OFF’ if he/she needs merge semantics.
>
>
>
> For non-aligned format – SYSTEM would be treated as ‘OFF’. This would
> ensure that all the columns are not inserted all the time into raw hbase
> table.  User can avoid merge semantics for omitted default current columns
> by overriding it with ‘ON’ semantics.
>
>
>
> Selva
>
>
>
> *From:* Hans Zeller [mailto:hans.zeller@esgyn.com]
> *Sent:* Tuesday, March 15, 2016 6:36 PM
> *To:* user@trafodion.incubator.apache.org
> *Subject:* Re: Upsert semantics
>
>
>
> Thank you, Selva. The JIRA is
> https://issues.apache.org/jira/browse/TRAFODION-1896.
>
>
> Hans
>
>
>
> On Tue, Mar 15, 2016 at 6:15 PM, Selva Govindarajan <
> selva.govindarajan@esgyn.com> wrote:
>
> Hans,
>
>
>
> It didn’t occur to me your proposed change would work. I was always
> thinking we shouldn’t be adding the omitted columns in non-aligned format.
>   You can file a JIRA and I will fix it.
>
>
>
> Selva
>
>
>
> *From:* Anoop Sharma [mailto:anoop.sharma@esgyn.com]
> *Sent:* Tuesday, March 15, 2016 6:03 PM
>
>
> *To:* user@trafodion.incubator.apache.org
> *Subject:* RE: Upsert semantics
>
>
>
> yes, one cqd to switch between one or the other behavior in all formats is
> the right way to go.
>
>
>
> Doing the other way based on the row format would cause more issues when we
>
> support hybrid format rows where some columns are in aligned format and
> others
>
> are not.
>
>
>
> anoop
>
>
>
> *From:* Hans Zeller [mailto:hans.zeller@esgyn.com]
> *Sent:* Tuesday, March 15, 2016 5:58 PM
> *To:* user@trafodion.incubator.apache.org
> *Subject:* Re: Upsert semantics
>
>
>
> Again, IMHO that's the wrong way to go, but I hope others will chime in.
> Dave gave the best reason, it's a bad idea to make the semantics of UPSERT
> depend on the internal format. Here is what I would suggest, using Selva's
> table (proposed changes in red - hope Apache won't mangle them):
>
>
>
>                                  Aligned Format                    Aligned
> format with            Non-Aligned with                     Non-Aligned with
>
>                                    With no omitted
>                    omitted columns               with no omitted
>                       omitted current default
>
>                                     columns
>                                                         / omitted
> non-current columns
>
>
>
> CQD off                          Replaces row
> MERGE                         Replace the given columns
> MERGE
>
> CQD on (default)             Replaces row
> Replaces row                Replace all columns                  Replace all columns
>
>
>
> Hans
>
>
>
> On Tue, Mar 15, 2016 at 5:36 PM, Selva Govindarajan <
> selva.govindarajan@esgyn.com> wrote:
>
> I believe phoenix doesn’t support insert semantics or the non-null default
> value columns.  Trafodion supports insert, upsert, non-null default value
> columns as well as current default values like current timestamp and
> current user.
>
>
>
> Upsert handling in Trafodion is same as phoenix for non-aligned format.
> For aligned format it can be controlled via CQD.
>
>
>
>                                  Aligned Format                    Aligned
> format with            Non-Aligned with                     Non-Aligned with
>
>                                    With no omitted
>                    omitted columns               with no omitted
>                       omitted current default
>
>                                     columns
>                                                         / omitted
> non-current columns
>
>
>
> Default behavior             Replaces row
> MERGE                         Replace the given columns
> MERGE
>
> With the CQD                 Replaces row
> Replaces row                Replace the given columns                 MERGE
>
>
>          set to on
>
>
>
> The CQD to be used is TRAF_UPSERT_WITH_INSERT_DEFAULT_SEMANTICS (Default
> is off). In short, this CQD is a no-op for non-aligned format.
>
>
>
> The behavior of the non-aligned format can’t be controlled by the CQD
> because we don’t store values for the omitted columns in hbase and hence
> when the user switches the CQD settings for upserts with different sets of
> omitted columns, we could end up with non-deterministic values for these
> columns.
>
> For eq. upsert with the cqd set to ‘on’ with a set of omitted columns
>
> Upsert with the cqd set to ‘off’ with a different set of omitted columns
>
> If we switch to insert all column values all the time for non-aligned
> format, then we can let user to control what value needs to be put in for
> the omitted column.
>
>
>
> Selva
>
>
>
> *From:* Hans Zeller [mailto:hans.zeller@esgyn.com]
> *Sent:* Tuesday, March 15, 2016 4:01 PM
> *To:* user@trafodion.incubator.apache.org
> *Subject:* Re: Upsert semantics
>
>
>
> Yes, that's what I had in mind, using a CQD as the syntax:
>
>
>
> UPSERT handling                 aligned format        non-aligned format
>
> ------------------------------  --------------------
>  -------------------------------
>
> default behavior                replace row           replace row (create
> all values)
>
> Phoenix behavior (via CQD):     transform to MERGE    insert only
> specified cols (*)
>
>
>
> (*) One issue here is with "default current". In that case we may also
> need to transform the statement into a MERGE.
>
>
>
> From a performance point of view, the "default behavior" would work better
> for aligned format, the Phoenix behavior would work better for non-aligned
> format.
>
>
>
> In some cases it won't matter. Selva's code will detect many of these and
> automatically choose the faster implementation.
>
>
>
> Thanks,
>
>
> Hans
>
>
>
> On Tue, Mar 15, 2016 at 3:41 PM, Dave Birdsall <da...@esgyn.com>
> wrote:
>
> <Cringe> Not sure we want the logical semantics of an operation to depend
> on the physical layout of the row.
>
>
>
> Would be better to have different syntax for each. With an explanation
> that one works faster on one format, and the other faster on the other
> format.
>
>
>
> *From:* Eric Owhadi [mailto:eric.owhadi@esgyn.com]
> *Sent:* Tuesday, March 15, 2016 3:38 PM
>
>
> *To:* user@trafodion.incubator.apache.org
> *Subject:* RE: Upsert semantics
>
>
>
> Would there be a problem if we implemented the phoenix semantic for non
> align format, and the  upsert semantic proposed by Hans in align format?
>
> This would allow speed optimization without having the user to know about
> subtle differences?
>
> Eric
>
>
>
>
>
> *From:* Anoop Sharma [mailto:anoop.sharma@esgyn.com]
> *Sent:* Tuesday, March 15, 2016 5:14 PM
> *To:* user@trafodion.incubator.apache.org
> *Subject:* RE: Upsert semantics
>
>
>
> Phoenix has upsert command and from what can tell, they originally came up
> with upsert syntax.
>
> Their semantic is to insert if not present and update the specified
> columns with the specified values if present.
>
> We did do an experiment and upsert only updates the specified columns.
>
> Maybe we can add a cqd so full row update vs. specified column update
> behavior could be chosen.
>
>
>
> Here is their specification.
>
> Inserts if not present and updates otherwise the value in the table. The
> list of columns is optional and if not present, the values will map to the
> column in the order they are declared in the schema. The values must
> evaluate to constants.
>
> Example:
>
> UPSERT INTO TEST VALUES('foo','bar',3);
> UPSERT INTO TEST(NAME,ID) VALUES('foo',123);
>
>
>
>
>
> *From:* Dave Birdsall [mailto:dave.birdsall@esgyn.com]
> *Sent:* Tuesday, March 15, 2016 2:55 PM
> *To:* user@trafodion.incubator.apache.org
> *Subject:* RE: Upsert semantics
>
>
>
> Hi,
>
>
>
> It seems that when ANSI first added MERGE to the standard, it was
> portrayed as “upsert” (see https://en.wikipedia.org/wiki/Merge_(SQL)).
>
>
>
> I agree though that we are free to define our UPSERT to mean anything we
> want.
>
>
>
> I like what you suggest. Since our UPSERT syntax already specifies values
> for all the columns, it makes sense for it to have “replace” semantics.
> That is, if the row exists, replace it with all the new stuff (with
> defaults for columns omitted). If the row doesn’t exist, it’s just a
> straight insert (with defaults for omitted columns).
>
>
>
> And if one really wants UPDATE semantics as opposed to “replace”
> semantics, then the ANSI MERGE statement (which Trafodion also supports) is
> the way to go.
>
>
>
> There is an analogy to this in linguistic theory. Whenever a language has
> two words that at a point in time mean the same thing, there is a tendency
> for the meanings to change over time so they diverge. For example, English
> “shirt” and “skirt”, originally from the same root, but one via Anglo-Saxon
> the other via Old Norse.
>
>
>
> Dave
>
>
>
>
>
> *From:* Hans Zeller [mailto:hans.zeller@esgyn.com]
> *Sent:* Tuesday, March 15, 2016 2:40 PM
> *To:* user@trafodion.incubator.apache.org
> *Subject:* Upsert semantics
>
>
>
> Hi,
>
>
>
> Here is a question on how we should define the meaning of an UPSERT
> statement. UPSERT is not part of the ISO/ANSI SQL standard, so we have some
> leeway to define it.
>
>
>
> My personal feeling is that UPSERT should either insert a brand-new row or
> it should completely replace an existing row, but it should never combine
> columns from a new and an existing row. If users want the latter then they
> should use the MERGE command.
>
>
>
> We should probably follow what other DBMSs do. I could not yet find a DBMS
> that had an UPSERT command, although there probably is one.
>
>
>
>    - PostgreSQL: Has an insert with a conflict clause, similar to our
>    MERGE: http://www.postgresql.org/docs/current/static/sql-insert.html
>
>
>    - MySQL: Has an insert with ON DUPLICATE KEY clause, similar to our
>    MERGE: http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html
>
>
>    - Teradata: Has an update ... else insert ... command, similar to
>    MERGE:
>    https://forums.teradata.com/forum/enterprise/problem-using-upsert
>
>
>    - Oracle just seems to have the MERGE statement and various methods to
>    do upsert in PL/SQL:
>    http://stackoverflow.com/questions/237327/oracle-how-to-upsert-update-or-insert-into-a-table
>
> This seems to support indirectly what I'm proposing. If we want to merge
> old and new row then we should use syntax specifying how to merge, which is
> what the other DBMSs have done.
>
>
>
> See also the discussion in
> https://github.com/apache/incubator-trafodion/pull/380. I wanted to see
> whether the user group has any input on this.
>
>
>
> Thanks,
>
>
> Hans
>
>
>
>
>
>
>

RE: Upsert semantics

Posted by Selva Govindarajan <se...@esgyn.com>.

Here is what I found with phoenix, just to compare with phoenix’s behavior
for upsert.

Phoenix expects the table to have a primary key. Upsert specification is

Inserts if not present and updates otherwise the value in the table. The
list of columns is optional and if not present, the values will map to the
column in the order they are declared in the schema. The values must
evaluate to constants.

create table phoenix.testtbl (c1 integer not null primary key, c2 integer ,
c3 integer) ;

upsert into phoenix.testtbl (c1, c2) values (1,1) ;

upsert into phoenix.testtbl (c1,c3) values (1,1) ;

upsert into phoenix.testtbl (c1,c2) values (1,null) ;

0: jdbc:phoenix:localhost:51670> select * from phoenix.testtbl ;

+------------------------------------------+------------------------------------------+------------------------------------------+

| C1 |
C2 | C3 |

+------------------------------------------+------------------------------------------+------------------------------------------+

| 1 |
null |
1 |

+------------------------------------------+------------------------------------------+------------------------------------------+

In the raw hbase table, I see the following cells after the above 3
upserts. It looks like phoenix deletes the cell if it updated with null
value.

hbase(main):006:0> scan 'PHOENIX.TESTTBL'

ROW
COLUMN+CELL

\x80\x00\x00\x01 column=0:C3,
timestamp=1458249350858,
value=\x80\x00\x00\x01

\x80\x00\x00\x01 column=0:_0,
timestamp=1458249392491,
value=

1 row(s) in 0.0210 seconds

Selva

*From:* Dave Birdsall [mailto:dave.birdsall@esgyn.com]
*Sent:* Thursday, March 17, 2016 11:09 AM
*To:* user@trafodion.incubator.apache.org
*Subject:* RE: Upsert semantics

Hi,

It sounds to me like this makes the semantics of UPSERT depend on physical
row layout, which seems contrary to the philosophy of SQL language design
as a declarative language.

I’d much rather have different syntax for each of these semantics. A
different verb perhaps. Or a clause added to it. Then it is clear to the
application developer what semantics he is getting. He does not have to
examine the physical schema to figure this out.

Dave

*From:* Selva Govindarajan [mailto:selva.govindarajan@esgyn.com]
*Sent:* Thursday, March 17, 2016 11:01 AM
*To:* user@trafodion.incubator.apache.org
*Subject:* RE: Upsert semantics

I wonder if the CQD TRAF_UPSERT_WITH_INSERT_DEFAULT_SEMANTICS should be set
to “SYSTEM” by default. It can take ‘SYSTEM’, ‘ON’ or ‘OFF’.

For aligned format – SYSTEM would be treated as ‘ON’ – User can override it
with ‘OFF’ if he/she needs merge semantics.

For non-aligned format – SYSTEM would be treated as ‘OFF’. This would
ensure that all the columns are not inserted all the time into raw hbase
table. User can avoid merge semantics for omitted default current columns
by overriding it with ‘ON’ semantics.

Selva

*From:* Hans Zeller [mailto:hans.zeller@esgyn.com]
*Sent:* Tuesday, March 15, 2016 6:36 PM
*To:* user@trafodion.incubator.apache.org
*Subject:* Re: Upsert semantics

Thank you, Selva. The JIRA is
https://issues.apache.org/jira/browse/TRAFODION-1896.

Hans

On Tue, Mar 15, 2016 at 6:15 PM, Selva Govindarajan <
selva.govindarajan@esgyn.com> wrote:

Hans,

It didn’t occur to me your proposed change would work. I was always
thinking we shouldn’t be adding the omitted columns in non-aligned format.
You can file a JIRA and I will fix it.

Selva

*From:* Anoop Sharma [mailto:anoop.sharma@esgyn.com]
*Sent:* Tuesday, March 15, 2016 6:03 PM

*To:* user@trafodion.incubator.apache.org
*Subject:* RE: Upsert semantics

yes, one cqd to switch between one or the other behavior in all formats is
the right way to go.

Doing the other way based on the row format would cause more issues when we

support hybrid format rows where some columns are in aligned format and
others

are not.

anoop

*From:* Hans Zeller [mailto:hans.zeller@esgyn.com]
*Sent:* Tuesday, March 15, 2016 5:58 PM
*To:* user@trafodion.incubator.apache.org
*Subject:* Re: Upsert semantics

Again, IMHO that's the wrong way to go, but I hope others will chime in.
Dave gave the best reason, it's a bad idea to make the semantics of UPSERT
depend on the internal format. Here is what I would suggest, using Selva's
table (proposed changes in red - hope Apache won't mangle them):

Aligned Format Aligned
format with Non-Aligned with Non-Aligned with

With no omitted
omitted columns with no omitted
omitted current default

columns
/ omitted
non-current columns

CQD off Replaces row
MERGE Replace the given columns
MERGE

CQD on (default) Replaces row Replaces
row Replace all columns Replace all columns

Hans

On Tue, Mar 15, 2016 at 5:36 PM, Selva Govindarajan <
selva.govindarajan@esgyn.com> wrote:

I believe phoenix doesn’t support insert semantics or the non-null default
value columns. Trafodion supports insert, upsert, non-null default value
columns as well as current default values like current timestamp and
current user.

Upsert handling in Trafodion is same as phoenix for non-aligned format. For
aligned format it can be controlled via CQD.

Aligned Format Aligned
format with Non-Aligned with Non-Aligned with

With no omitted
omitted columns with no omitted
omitted current default

columns
/ omitted
non-current columns

Default behavior Replaces row
MERGE Replace the given columns
MERGE

With the CQD Replaces row Replaces
row Replace the given columns MERGE

set to on

The CQD to be used is TRAF_UPSERT_WITH_INSERT_DEFAULT_SEMANTICS (Default is
off). In short, this CQD is a no-op for non-aligned format.

The behavior of the non-aligned format can’t be controlled by the CQD
because we don’t store values for the omitted columns in hbase and hence
when the user switches the CQD settings for upserts with different sets of
omitted columns, we could end up with non-deterministic values for these
columns.

For eq. upsert with the cqd set to ‘on’ with a set of omitted columns

Upsert with the cqd set to ‘off’ with a different set of omitted columns

If we switch to insert all column values all the time for non-aligned
format, then we can let user to control what value needs to be put in for
the omitted column.

Selva

*From:* Hans Zeller [mailto:hans.zeller@esgyn.com]
*Sent:* Tuesday, March 15, 2016 4:01 PM
*To:* user@trafodion.incubator.apache.org
*Subject:* Re: Upsert semantics

Yes, that's what I had in mind, using a CQD as the syntax:

UPSERT handling aligned format non-aligned format

------------------------------ --------------------
-------------------------------

default behavior replace row replace row (create
all values)

Phoenix behavior (via CQD): transform to MERGE insert only specified
cols (*)

(*) One issue here is with "default current". In that case we may also need
to transform the statement into a MERGE.

>From a performance point of view, the "default behavior" would work better
for aligned format, the Phoenix behavior would work better for non-aligned
format.

In some cases it won't matter. Selva's code will detect many of these and
automatically choose the faster implementation.

Thanks,

Hans

On Tue, Mar 15, 2016 at 3:41 PM, Dave Birdsall <da...@esgyn.com>
wrote:

<Cringe> Not sure we want the logical semantics of an operation to depend
on the physical layout of the row.

Would be better to have different syntax for each. With an explanation that
one works faster on one format, and the other faster on the other format.

*From:* Eric Owhadi [mailto:eric.owhadi@esgyn.com]
*Sent:* Tuesday, March 15, 2016 3:38 PM

*To:* user@trafodion.incubator.apache.org
*Subject:* RE: Upsert semantics

Would there be a problem if we implemented the phoenix semantic for non
align format, and the upsert semantic proposed by Hans in align format?

This would allow speed optimization without having the user to know about
subtle differences?

Eric

*From:* Anoop Sharma [mailto:anoop.sharma@esgyn.com]
*Sent:* Tuesday, March 15, 2016 5:14 PM
*To:* user@trafodion.incubator.apache.org
*Subject:* RE: Upsert semantics

Phoenix has upsert command and from what can tell, they originally came up
with upsert syntax.

Their semantic is to insert if not present and update the specified columns
with the specified values if present.

We did do an experiment and upsert only updates the specified columns.

Maybe we can add a cqd so full row update vs. specified column update
behavior could be chosen.

Here is their specification.

Example:

UPSERT INTO TEST VALUES('foo','bar',3);
UPSERT INTO TEST(NAME,ID) VALUES('foo',123);

*From:* Dave Birdsall [mailto:dave.birdsall@esgyn.com]
*Sent:* Tuesday, March 15, 2016 2:55 PM
*To:* user@trafodion.incubator.apache.org
*Subject:* RE: Upsert semantics

Hi,

It seems that when ANSI first added MERGE to the standard, it was portrayed
as “upsert” (see https://en.wikipedia.org/wiki/Merge_(SQL)).

I agree though that we are free to define our UPSERT to mean anything we
want.

I like what you suggest. Since our UPSERT syntax already specifies values
for all the columns, it makes sense for it to have “replace” semantics.
That is, if the row exists, replace it with all the new stuff (with
defaults for columns omitted). If the row doesn’t exist, it’s just a
straight insert (with defaults for omitted columns).

And if one really wants UPDATE semantics as opposed to “replace” semantics,
then the ANSI MERGE statement (which Trafodion also supports) is the way to
go.

There is an analogy to this in linguistic theory. Whenever a language has
two words that at a point in time mean the same thing, there is a tendency
for the meanings to change over time so they diverge. For example, English
“shirt” and “skirt”, originally from the same root, but one via Anglo-Saxon
the other via Old Norse.

Dave

*From:* Hans Zeller [mailto:hans.zeller@esgyn.com]
*Sent:* Tuesday, March 15, 2016 2:40 PM
*To:* user@trafodion.incubator.apache.org
*Subject:* Upsert semantics