Posted to user@beam.apache.org by James <xu...@gmail.com> on 2017/06/22 08:32:19 UTC
SQL in Stream Computing: MERGE or INSERT?
Hi team,
I am thinking about a problem related to SQL and stream computing, and I
want to hear your opinions.
In stream computing, there is a typical case like this:
We want to calculate a big, wide result table, which has one rowkey and ten
value columns:

create table result (
    rowkey varchar(127) PRIMARY KEY,
    col1 int,
    col2 int,
    ...
    col10 int
);
Each of the value columns is calculated by a complex query, so there will
be ten SQL statements populating this table. For each statement the logic
is (sketched below):
* First check whether there is already a row for the specified `rowkey`.
* If yes, then `update`; otherwise `insert`.
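Spelled out imperatively, the per-column logic is roughly the following
(the table and columns come from the schema above; the literal values are
placeholders):

-- check whether the row already exists
SELECT count(*) FROM result WHERE rowkey = 'k1';

-- if a row was found, update the single column
UPDATE result SET col1 = 42 WHERE rowkey = 'k1';

-- otherwise, insert a new row carrying only that column
INSERT INTO result (rowkey, col1) VALUES ('k1', 42);

Doing this by hand means a read plus a conditional write per column.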
There is actually a dedicated SQL syntax called `MERGE` designed for
this (it has been in the SQL standard since SQL:2003); a sample usage is:
MERGE INTO result D
USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
ON (D.rowkey = S.rowkey)
WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) VALUES (S.rowkey, S.col1)
The semantics fit perfectly, but it is very verbose, and ordinary users
rarely use this syntax.
So my colleagues invented a new syntax for this scenario (or, more
precisely, a new interpretation of the INSERT statement). For the above
scenario, the user always writes `insert` statements:

insert into result(rowkey, col1) values(...)
insert into result(rowkey, col2) values(...)

The SQL interpreter does a trick behind the scenes: if the `rowkey`
exists, it updates; otherwise it inserts. This solution is very concise,
but it violates the semantics of `insert`: with it, INSERT behaves
differently in batch and stream processing.
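For comparison, several databases already expose this insert-or-update
behavior through explicit, opt-in syntax instead of reinterpreting a plain
INSERT. A MySQL-flavored and a PostgreSQL-flavored sketch (literal values
are placeholders):

-- MySQL: fall back to an update when the primary key already exists
INSERT INTO result (rowkey, col1) VALUES ('k1', 42)
ON DUPLICATE KEY UPDATE col1 = VALUES(col1);

-- PostgreSQL: same idea via ON CONFLICT
INSERT INTO result (rowkey, col1) VALUES ('k1', 42)
ON CONFLICT (rowkey) DO UPDATE SET col1 = EXCLUDED.col1;

Either form makes the insert-or-update behavior explicit in the statement
itself.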
What do you think? Which do you prefer, and what is your reasoning?
Looking forward to your opinions. Thanks in advance.
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by Mingmin Xu <mi...@gmail.com>.
I'd like to share my thoughts from another perspective. IMO this is a
typical scenario for column-based databases like HBase/Cassandra. You may
need to choose the right database if possible.
UPSERT is another alternative, but I wouldn't suggest a customized
check-insert/check-update implementation. The actual work should be done
on the database side.
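As an illustration of pushing that work to the database side, Apache
Phoenix (SQL on HBase) uses UPSERT as its write statement; a sketch against
the `result` table from the original mail (values are placeholders):

-- Phoenix: writes the listed cells, creating the row if it does not exist;
-- columns not listed keep their current values
UPSERT INTO result (rowkey, col1) VALUES ('k1', 42);
UPSERT INTO result (rowkey, col2) VALUES ('k1', 7);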
On Thu, Jun 22, 2017 at 6:59 PM, James <xu...@gmail.com> wrote:
> Hi Tyler,
>
> I think upsert is a good alternative, concise as INSERT and have the valid
> semantics. Just that user seems rarely use UPSERT either(might because
> there's no UPDATE in batch big data processing).
>
> By *"INSERT will behave differently in batch & stream processing"* I
> mean, if we use the "INSERT" solution I described above, there will be ten
> INSERTs:
>
> *INSERT INTO result(rowkey, col1) values(...)*
>
> *INSERT INTO result(rowkey, col2) values(...)*
>
> *...INSERT INTO result(rowkey, col10) values(...)*
>
> Although we issued ten INSERTs, but there will be only ONE new records in
> the target table, because 9 of the INSERTs are actually UPDATing the
> record, so in stream computing *INSERT = (INSERT or UPDATE)*, while in
> batch,* INSERT is just INSERT*.
>
> I think the essence of this problem is, there is no UPDATE in batch, but
> require UPDATE in streaming.
>
>
>
> Tyler Akidau <ta...@google.com> wrote on Thu, Jun 22, 2017 at 11:35 PM:
>
>> Calcite appears to have UPSERT
>> <https://issues.apache.org/jira/browse/CALCITE-492> support, can we just
>> use that instead?
>>
>> Also, I don't understand your statement that "INSERT will behave
>> differently in batch & stream processing". Can you explain further?
>>
>>
>> -Tyler
>>
>>
>> On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
>> wrote:
>>
>>> If I'm understanding correctly, Hive does that with a insert into
>>> followed by a select statement that does the aggregation. https://cwiki.
>>> apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-
>>> InsertingdataintoHiveTablesfromqueries
>>>
>>> On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
>>>
>>>> Hi team,
>>>>
>>>> I am thinking about a SQL and stream computing related problem, want to
>>>> hear your opinions.
>>>>
>>>> In stream computing, there is a typical case like this:
>>>>
>>>> *We want to calculate a big wide result table, which has one rowkey and
>>>> ten
>>>> value columns:*
>>>> *create table result (*
>>>> * rowkey varchar(127) PRIMARY KEY,*
>>>> * col1 int,*
>>>> * col2 int,*
>>>> * ...*
>>>> * col10 int*
>>>> *);*
>>>>
>>>> Each of the value columns is calculated by a complex query, so there
>>>> will
>>>> be ten SQLs to calculate
>>>> data for this table, for each sql:
>>>>
>>>> * First check whether there is a row for the specified `rowkey`.
>>>> * If yes, then `update`, otherwise `insert`.
>>>>
>>>> There is actually a dedicated sql syntax called `MERGE` designed for
>>>> this(SQL2008), a sample usage is:
>>>>
>>>> MERGE INTO result D
>>>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>>>> ON (D.rowkey = S.rowkey)
>>>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>>>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>>>
>>>>
>>>> *The semantic fits perfectly, but it is very verbose, and normal users
>>>> rarely used this syntax.*
>>>>
>>>> So my colleagues invented a new syntax for this scenario (Or more
>>>> precisely, a new interpretation for the INSERT statement). For the above
>>>> scenario, user will always write `insert` statement:
>>>>
>>>> insert into result(rowkey, col1) values(...)
>>>> insert into result(rowkey, col2) values(...)
>>>>
>>>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>>>> exists, then update, otherwise `insert`. This solution is very concise,
>>>> but
>>>> violates the semantics of `insert`, using this solution INSERT will
>>>> behave
>>>> differently in batch & stream processing.
>>>>
>>>> How do you guys think? which do you prefer? What's your reasoning?
>>>>
>>>> Looking forward to your opinions, thanks in advance.
>>>>
>>> --
>>> Thanks,
>>>
>>> Jesse
>>>
>>
--
----
Mingmin
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by James <xu...@gmail.com>.
Hi Tyler,
I think UPSERT is a good alternative: as concise as INSERT and with valid
semantics. It's just that users seem to rarely use UPSERT either (maybe
because there's no UPDATE in batch big data processing).
By *"INSERT will behave differently in batch & stream processing"* I mean,
if we use the "INSERT" solution I described above, there will be ten
INSERTs:
*INSERT INTO result(rowkey, col1) values(...)*
*INSERT INTO result(rowkey, col2) values(...)*
*...INSERT INTO result(rowkey, col10) values(...)*
Although we issue ten INSERTs, there will be only ONE new record in
the target table, because 9 of the INSERTs are actually UPDATing the
record. So in stream computing INSERT = (INSERT or UPDATE), while in
batch INSERT is just INSERT.
I think the essence of this problem is that there is no UPDATE in batch,
but UPDATE is required in streaming.
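To make that concrete, here is a hypothetical trace for a single rowkey
'k1' under the reinterpreted INSERT (the values are invented for
illustration):

INSERT INTO result (rowkey, col1) VALUES ('k1', 10);  -- no row yet: a real insert
INSERT INTO result (rowkey, col2) VALUES ('k1', 20);  -- row exists: becomes an update
-- ... seven more statements, all updates of the same row ...
INSERT INTO result (rowkey, col10) VALUES ('k1', 99); -- still the same single row

-- final state of the table: one row ('k1', 10, 20, ..., 99), not ten rows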
Tyler Akidau <ta...@google.com> wrote on Thu, Jun 22, 2017 at 11:35 PM:
> Calcite appears to have UPSERT
> <https://issues.apache.org/jira/browse/CALCITE-492> support, can we just
> use that instead?
>
> Also, I don't understand your statement that "INSERT will behave
> differently in batch & stream processing". Can you explain further?
>
>
> -Tyler
>
>
> On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
> wrote:
>
>> If I'm understanding correctly, Hive does that with a insert into
>> followed by a select statement that does the aggregation.
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
>>
>> On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> I am thinking about a SQL and stream computing related problem, want to
>>> hear your opinions.
>>>
>>> In stream computing, there is a typical case like this:
>>>
>>> *We want to calculate a big wide result table, which has one rowkey and
>>> ten
>>> value columns:*
>>> *create table result (*
>>> * rowkey varchar(127) PRIMARY KEY,*
>>> * col1 int,*
>>> * col2 int,*
>>> * ...*
>>> * col10 int*
>>> *);*
>>>
>>> Each of the value columns is calculated by a complex query, so there will
>>> be ten SQLs to calculate
>>> data for this table, for each sql:
>>>
>>> * First check whether there is a row for the specified `rowkey`.
>>> * If yes, then `update`, otherwise `insert`.
>>>
>>> There is actually a dedicated sql syntax called `MERGE` designed for
>>> this(SQL2008), a sample usage is:
>>>
>>> MERGE INTO result D
>>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>>> ON (D.rowkey = S.rowkey)
>>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>>
>>>
>>> *The semantic fits perfectly, but it is very verbose, and normal users
>>> rarely used this syntax.*
>>>
>>> So my colleagues invented a new syntax for this scenario (Or more
>>> precisely, a new interpretation for the INSERT statement). For the above
>>> scenario, user will always write `insert` statement:
>>>
>>> insert into result(rowkey, col1) values(...)
>>> insert into result(rowkey, col2) values(...)
>>>
>>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>>> exists, then update, otherwise `insert`. This solution is very concise,
>>> but
>>> violates the semantics of `insert`, using this solution INSERT will
>>> behave
>>> differently in batch & stream processing.
>>>
>>> How do you guys think? which do you prefer? What's your reasoning?
>>>
>>> Looking forward to your opinions, thanks in advance.
>>>
>> --
>> Thanks,
>>
>> Jesse
>>
>
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by Tyler Akidau <ta...@google.com>.
Calcite appears to have UPSERT
<https://issues.apache.org/jira/browse/CALCITE-492> support; can we just
use that instead?
Also, I don't understand your statement that "INSERT will behave
differently in batch & stream processing". Can you explain further?
-Tyler
On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
wrote:
> If I'm understanding correctly, Hive does that with a insert into followed
> by a select statement that does the aggregation.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
>
> On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
>
>> Hi team,
>>
>> I am thinking about a SQL and stream computing related problem, want to
>> hear your opinions.
>>
>> In stream computing, there is a typical case like this:
>>
>> *We want to calculate a big wide result table, which has one rowkey and
>> ten
>> value columns:*
>> *create table result (*
>> * rowkey varchar(127) PRIMARY KEY,*
>> * col1 int,*
>> * col2 int,*
>> * ...*
>> * col10 int*
>> *);*
>>
>> Each of the value columns is calculated by a complex query, so there will
>> be ten SQLs to calculate
>> data for this table, for each sql:
>>
>> * First check whether there is a row for the specified `rowkey`.
>> * If yes, then `update`, otherwise `insert`.
>>
>> There is actually a dedicated sql syntax called `MERGE` designed for
>> this(SQL2008), a sample usage is:
>>
>> MERGE INTO result D
>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>> ON (D.rowkey = S.rowkey)
>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>
>>
>> *The semantic fits perfectly, but it is very verbose, and normal users
>> rarely used this syntax.*
>>
>> So my colleagues invented a new syntax for this scenario (Or more
>> precisely, a new interpretation for the INSERT statement). For the above
>> scenario, user will always write `insert` statement:
>>
>> insert into result(rowkey, col1) values(...)
>> insert into result(rowkey, col2) values(...)
>>
>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>> exists, then update, otherwise `insert`. This solution is very concise,
>> but
>> violates the semantics of `insert`, using this solution INSERT will behave
>> differently in batch & stream processing.
>>
>> How do you guys think? which do you prefer? What's your reasoning?
>>
>> Looking forward to your opinions, thanks in advance.
>>
> --
> Thanks,
>
> Jesse
>
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by James <xu...@gmail.com>.
Hi Jesse,
Yeah, I know the insert...select grammar. In my scenario, each of the value
columns is calculated separately (possibly from different datasources), so
insert...select might not be sufficient.
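A hypothetical sketch of what the per-column writes look like when each
column comes from its own source (the source tables input1/input2 are made
up for illustration; MERGE is used here only as the standard way to express
the upsert):

-- col1 comes from one source...
MERGE INTO result D
USING (SELECT rowkey, col1 FROM input1) S
ON (D.rowkey = S.rowkey)
WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) VALUES (S.rowkey, S.col1)

-- ...col2 comes from another, so a single INSERT ... SELECT cannot cover both
MERGE INTO result D
USING (SELECT rowkey, col2 FROM input2) S
ON (D.rowkey = S.rowkey)
WHEN MATCHED THEN UPDATE SET D.col2 = S.col2
WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col2) VALUES (S.rowkey, S.col2)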
Jesse Anderson <je...@bigdatainstitute.io> wrote on Thu, Jun 22, 2017 at 10:35 PM:
> If I'm understanding correctly, Hive does that with a insert into followed
> by a select statement that does the aggregation.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
>
> On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
>
>> Hi team,
>>
>> I am thinking about a SQL and stream computing related problem, want to
>> hear your opinions.
>>
>> In stream computing, there is a typical case like this:
>>
>> *We want to calculate a big wide result table, which has one rowkey and
>> ten
>> value columns:*
>> *create table result (*
>> * rowkey varchar(127) PRIMARY KEY,*
>> * col1 int,*
>> * col2 int,*
>> * ...*
>> * col10 int*
>> *);*
>
>
>>
>> Each of the value columns is calculated by a complex query, so there will
>> be ten SQLs to calculate
>> data for this table, for each sql:
>>
>> * First check whether there is a row for the specified `rowkey`.
>> * If yes, then `update`, otherwise `insert`.
>>
>> There is actually a dedicated sql syntax called `MERGE` designed for
>> this(SQL2008), a sample usage is:
>>
>> MERGE INTO result D
>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>> ON (D.rowkey = S.rowkey)
>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>
>>
>> *The semantic fits perfectly, but it is very verbose, and normal users
>> rarely used this syntax.*
>
>
>>
>> So my colleagues invented a new syntax for this scenario (Or more
>> precisely, a new interpretation for the INSERT statement). For the above
>> scenario, user will always write `insert` statement:
>>
>> insert into result(rowkey, col1) values(...)
>> insert into result(rowkey, col2) values(...)
>>
>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>> exists, then update, otherwise `insert`. This solution is very concise,
>> but
>> violates the semantics of `insert`, using this solution INSERT will behave
>> differently in batch & stream processing.
>>
>> How do you guys think? which do you prefer? What's your reasoning?
>>
>> Looking forward to your opinions, thanks in advance.
>>
> --
> Thanks,
>
> Jesse
>
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by Jesse Anderson <je...@bigdatainstitute.io>.
If I'm understanding correctly, Hive does that with an insert into followed
by a select statement that does the aggregation.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
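For reference, the Hive pattern looks roughly like this (the staging table
staged_result and its contents are placeholders; the target schema is the
one from the original mail):

-- Hive: append rows produced by a query into the target table
INSERT INTO TABLE result
SELECT rowkey, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
FROM staged_result;

Note that INSERT INTO in Hive appends rows rather than updating existing
ones, which is part of why the per-column upsert question comes up.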
On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
> Hi team,
>
> I am thinking about a SQL and stream computing related problem, want to
> hear your opinions.
>
> In stream computing, there is a typical case like this:
>
> *We want to calculate a big wide result table, which has one rowkey and ten
> value columns:*
> *create table result (*
> * rowkey varchar(127) PRIMARY KEY,*
> * col1 int,*
> * col2 int,*
> * ...*
> * col10 int*
> *);*
>
> Each of the value columns is calculated by a complex query, so there will
> be ten SQLs to calculate
> data for this table, for each sql:
>
> * First check whether there is a row for the specified `rowkey`.
> * If yes, then `update`, otherwise `insert`.
>
> There is actually a dedicated sql syntax called `MERGE` designed for
> this(SQL2008), a sample usage is:
>
> MERGE INTO result D
> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
> ON (D.rowkey = S.rowkey)
> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>
>
> *The semantic fits perfectly, but it is very verbose, and normal users
> rarely used this syntax.*
>
> So my colleagues invented a new syntax for this scenario (Or more
> precisely, a new interpretation for the INSERT statement). For the above
> scenario, user will always write `insert` statement:
>
> insert into result(rowkey, col1) values(...)
> insert into result(rowkey, col2) values(...)
>
> The sql interpreter will do a trick behind the scene: if the `rowkey`
> exists, then update, otherwise `insert`. This solution is very concise, but
> violates the semantics of `insert`, using this solution INSERT will behave
> differently in batch & stream processing.
>
> How do you guys think? which do you prefer? What's your reasoning?
>
> Looking forward to your opinions, thanks in advance.
>
--
Thanks,
Jesse