Posted to user@beam.apache.org by James <xu...@gmail.com> on 2017/06/22 08:32:19 UTC
SQL in Stream Computing: MERGE or INSERT?
Hi team,
I am thinking about a problem related to SQL and stream computing, and I
want to hear your opinions.
In stream computing, there is a typical case like this:
We want to calculate a big, wide result table, which has one rowkey and ten
value columns:

create table result (
    rowkey varchar(127) PRIMARY KEY,
    col1 int,
    col2 int,
    ...
    col10 int
);
Each of the value columns is calculated by a complex query, so there will
be ten SQL statements populating this table. For each statement the logic
is (sketched below):
* First check whether there is already a row for the specified `rowkey`.
* If yes, then `update`; otherwise `insert`.
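Spelled out imperatively, the per-column logic is roughly the following
(the table and columns come from the schema above; the literal values are
placeholders):

-- check whether the row already exists
SELECT count(*) FROM result WHERE rowkey = 'k1';

-- if a row was found, update the single column
UPDATE result SET col1 = 42 WHERE rowkey = 'k1';

-- otherwise, insert a new row carrying only that column
INSERT INTO result (rowkey, col1) VALUES ('k1', 42);

Doing this by hand means a read plus a conditional write per column.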
There is actually a dedicated SQL syntax called `MERGE` designed for
this (it has been in the SQL standard since SQL:2003); a sample usage is:
MERGE INTO result D
USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
ON (D.rowkey = S.rowkey)
WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) VALUES (S.rowkey, S.col1)
The semantics fit perfectly, but it is very verbose, and ordinary users
rarely use this syntax.
So my colleagues invented a new syntax for this scenario (or, more
precisely, a new interpretation of the INSERT statement). For the above
scenario, the user always writes `insert` statements:

insert into result(rowkey, col1) values(...)
insert into result(rowkey, col2) values(...)

The SQL interpreter does a trick behind the scenes: if the `rowkey`
exists, it updates; otherwise it inserts. This solution is very concise,
but it violates the semantics of `insert`: with it, INSERT behaves
differently in batch and stream processing.
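For comparison, several databases already expose this insert-or-update
behavior through explicit, opt-in syntax instead of reinterpreting a plain
INSERT. A MySQL-flavored and a PostgreSQL-flavored sketch (literal values
are placeholders):

-- MySQL: fall back to an update when the primary key already exists
INSERT INTO result (rowkey, col1) VALUES ('k1', 42)
ON DUPLICATE KEY UPDATE col1 = VALUES(col1);

-- PostgreSQL: same idea via ON CONFLICT
INSERT INTO result (rowkey, col1) VALUES ('k1', 42)
ON CONFLICT (rowkey) DO UPDATE SET col1 = EXCLUDED.col1;

Either form makes the insert-or-update behavior explicit in the statement
itself.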
What do you think? Which do you prefer, and what is your reasoning?
Looking forward to your opinions. Thanks in advance.
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by Mingmin Xu <mi...@gmail.com>.
I'd like to share my thoughts from another perspective. IMO this is a
typical scenario for column-based databases like HBase/Cassandra. You may
need to choose the right database if possible.
UPSERT is another alternative, but I wouldn't suggest a customized
check-insert/check-update implementation. The actual work should be done
on the database side.
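As an illustration of pushing that work to the database side, Apache
Phoenix (SQL on HBase) uses UPSERT as its write statement; a sketch against
the `result` table from the original mail (values are placeholders):

-- Phoenix: writes the listed cells, creating the row if it does not exist;
-- columns not listed keep their current values
UPSERT INTO result (rowkey, col1) VALUES ('k1', 42);
UPSERT INTO result (rowkey, col2) VALUES ('k1', 7);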
On Thu, Jun 22, 2017 at 6:59 PM, James <xu...@gmail.com> wrote:
> Hi Tyler,
>
> I think upsert is a good alternative, concise as INSERT and have the valid
> semantics. Just that user seems rarely use UPSERT either(might because
> there's no UPDATE in batch big data processing).
>
> By *"INSERT will behave differently in batch & stream processing"* I
> mean, if we use the "INSERT" solution I described above, there will be ten
> INSERTs:
>
> *INSERT INTO result(rowkey, col1) values(...)*
>
> *INSERT INTO result(rowkey, col2) values(...)*
>
> *...INSERT INTO result(rowkey, col10) values(...)*
>
> Although we issued ten INSERTs, but there will be only ONE new records in
> the target table, because 9 of the INSERTs are actually UPDATing the
> record, so in stream computing *INSERT = (INSERT or UPDATE)*, while in
> batch,* INSERT is just INSERT*.
>
> I think the essence of this problem is, there is no UPDATE in batch, but
> require UPDATE in streaming.
>
>
>
> Tyler Akidau <ta...@google.com> wrote on Thu, Jun 22, 2017 at 11:35 PM:
>
>> Calcite appears to have UPSERT
>> <https://issues.apache.org/jira/browse/CALCITE-492> support, can we just
>> use that instead?
>>
>> Also, I don't understand your statement that "INSERT will behave
>> differently in batch & stream processing". Can you explain further?
>>
>>
>> -Tyler
>>
>>
>> On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
>> wrote:
>>
>>> If I'm understanding correctly, Hive does that with a insert into
>>> followed by a select statement that does the aggregation. https://cwiki.
>>> apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-
>>> InsertingdataintoHiveTablesfromqueries
>>>
>>> On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
>>>
>>>> Hi team,
>>>>
>>>> I am thinking about a SQL and stream computing related problem, want to
>>>> hear your opinions.
>>>>
>>>> In stream computing, there is a typical case like this:
>>>>
>>>> *We want to calculate a big wide result table, which has one rowkey and
>>>> ten
>>>> value columns:*
>>>> *create table result (*
>>>> * rowkey varchar(127) PRIMARY KEY,*
>>>> * col1 int,*
>>>> * col2 int,*
>>>> * ...*
>>>> * col10 int*
>>>> *);*
>>>>
>>>> Each of the value columns is calculated by a complex query, so there
>>>> will
>>>> be ten SQLs to calculate
>>>> data for this table, for each sql:
>>>>
>>>> * First check whether there is a row for the specified `rowkey`.
>>>> * If yes, then `update`, otherwise `insert`.
>>>>
>>>> There is actually a dedicated sql syntax called `MERGE` designed for
>>>> this(SQL2008), a sample usage is:
>>>>
>>>> MERGE INTO result D
>>>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>>>> ON (D.rowkey = S.rowkey)
>>>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>>>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>>>
>>>>
>>>> *The semantic fits perfectly, but it is very verbose, and normal users
>>>> rarely used this syntax.*
>>>>
>>>> So my colleagues invented a new syntax for this scenario (Or more
>>>> precisely, a new interpretation for the INSERT statement). For the above
>>>> scenario, user will always write `insert` statement:
>>>>
>>>> insert into result(rowkey, col1) values(...)
>>>> insert into result(rowkey, col2) values(...)
>>>>
>>>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>>>> exists, then update, otherwise `insert`. This solution is very concise,
>>>> but
>>>> violates the semantics of `insert`, using this solution INSERT will
>>>> behave
>>>> differently in batch & stream processing.
>>>>
>>>> How do you guys think? which do you prefer? What's your reasoning?
>>>>
>>>> Looking forward to your opinions, thanks in advance.
>>>>
>>> --
>>> Thanks,
>>>
>>> Jesse
>>>
>>
--
----
Mingmin
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by James <xu...@gmail.com>.
Hi Tyler,
I think UPSERT is a good alternative: as concise as INSERT and with valid
semantics. It's just that users seem to rarely use UPSERT either (maybe
because there's no UPDATE in batch big data processing).
By *"INSERT will behave differently in batch & stream processing"* I mean,
if we use the "INSERT" solution I described above, there will be ten
INSERTs:
*INSERT INTO result(rowkey, col1) values(...)*
*INSERT INTO result(rowkey, col2) values(...)*
*...INSERT INTO result(rowkey, col10) values(...)*
Although we issue ten INSERTs, there will be only ONE new record in
the target table, because 9 of the INSERTs are actually UPDATing the
record. So in stream computing INSERT = (INSERT or UPDATE), while in
batch INSERT is just INSERT.
I think the essence of this problem is that there is no UPDATE in batch,
but UPDATE is required in streaming.
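To make that concrete, here is a hypothetical trace for a single rowkey
'k1' under the reinterpreted INSERT (the values are invented for
illustration):

INSERT INTO result (rowkey, col1) VALUES ('k1', 10);  -- no row yet: a real insert
INSERT INTO result (rowkey, col2) VALUES ('k1', 20);  -- row exists: becomes an update
-- ... seven more statements, all updates of the same row ...
INSERT INTO result (rowkey, col10) VALUES ('k1', 99); -- still the same single row

-- final state of the table: one row ('k1', 10, 20, ..., 99), not ten rows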
Tyler Akidau <ta...@google.com> wrote on Thu, Jun 22, 2017 at 11:35 PM:
> Calcite appears to have UPSERT
> <https://issues.apache.org/jira/browse/CALCITE-492> support, can we just
> use that instead?
>
> Also, I don't understand your statement that "INSERT will behave
> differently in batch & stream processing". Can you explain further?
>
>
> -Tyler
>
>
> On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
> wrote:
>
>> If I'm understanding correctly, Hive does that with a insert into
>> followed by a select statement that does the aggregation.
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
>>
>> On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> I am thinking about a SQL and stream computing related problem, want to
>>> hear your opinions.
>>>
>>> In stream computing, there is a typical case like this:
>>>
>>> *We want to calculate a big wide result table, which has one rowkey and
>>> ten
>>> value columns:*
>>> *create table result (*
>>> * rowkey varchar(127) PRIMARY KEY,*
>>> * col1 int,*
>>> * col2 int,*
>>> * ...*
>>> * col10 int*
>>> *);*
>>>
>>> Each of the value columns is calculated by a complex query, so there will
>>> be ten SQLs to calculate
>>> data for this table, for each sql:
>>>
>>> * First check whether there is a row for the specified `rowkey`.
>>> * If yes, then `update`, otherwise `insert`.
>>>
>>> There is actually a dedicated sql syntax called `MERGE` designed for
>>> this(SQL2008), a sample usage is:
>>>
>>> MERGE INTO result D
>>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>>> ON (D.rowkey = S.rowkey)
>>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>>
>>>
>>> *The semantic fits perfectly, but it is very verbose, and normal users
>>> rarely used this syntax.*
>>>
>>> So my colleagues invented a new syntax for this scenario (Or more
>>> precisely, a new interpretation for the INSERT statement). For the above
>>> scenario, user will always write `insert` statement:
>>>
>>> insert into result(rowkey, col1) values(...)
>>> insert into result(rowkey, col2) values(...)
>>>
>>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>>> exists, then update, otherwise `insert`. This solution is very concise,
>>> but
>>> violates the semantics of `insert`, using this solution INSERT will
>>> behave
>>> differently in batch & stream processing.
>>>
>>> How do you guys think? which do you prefer? What's your reasoning?
>>>
>>> Looking forward to your opinions, thanks in advance.
>>>
>> --
>> Thanks,
>>
>> Jesse
>>
>
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by Tyler Akidau <ta...@google.com>.
Calcite appears to have UPSERT
<https://issues.apache.org/jira/browse/CALCITE-492> support; can we just
use that instead?
Also, I don't understand your statement that "INSERT will behave
differently in batch & stream processing". Can you explain further?
-Tyler
On Thu, Jun 22, 2017 at 7:35 AM Jesse Anderson <je...@bigdatainstitute.io>
wrote:
> If I'm understanding correctly, Hive does that with a insert into followed
> by a select statement that does the aggregation.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
>
> On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
>
>> Hi team,
>>
>> I am thinking about a SQL and stream computing related problem, want to
>> hear your opinions.
>>
>> In stream computing, there is a typical case like this:
>>
>> *We want to calculate a big wide result table, which has one rowkey and
>> ten
>> value columns:*
>> *create table result (*
>> * rowkey varchar(127) PRIMARY KEY,*
>> * col1 int,*
>> * col2 int,*
>> * ...*
>> * col10 int*
>> *);*
>>
>> Each of the value columns is calculated by a complex query, so there will
>> be ten SQLs to calculate
>> data for this table, for each sql:
>>
>> * First check whether there is a row for the specified `rowkey`.
>> * If yes, then `update`, otherwise `insert`.
>>
>> There is actually a dedicated sql syntax called `MERGE` designed for
>> this(SQL2008), a sample usage is:
>>
>> MERGE INTO result D
>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>> ON (D.rowkey = S.rowkey)
>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>
>>
>> *The semantic fits perfectly, but it is very verbose, and normal users
>> rarely used this syntax.*
>>
>> So my colleagues invented a new syntax for this scenario (Or more
>> precisely, a new interpretation for the INSERT statement). For the above
>> scenario, user will always write `insert` statement:
>>
>> insert into result(rowkey, col1) values(...)
>> insert into result(rowkey, col2) values(...)
>>
>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>> exists, then update, otherwise `insert`. This solution is very concise,
>> but
>> violates the semantics of `insert`, using this solution INSERT will behave
>> differently in batch & stream processing.
>>
>> How do you guys think? which do you prefer? What's your reasoning?
>>
>> Looking forward to your opinions, thanks in advance.
>>
> --
> Thanks,
>
> Jesse
>
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by James <xu...@gmail.com>.
Hi Jesse,
Yeah, I know the insert...select grammar. In my scenario, each of the value
columns is calculated separately (possibly from different datasources), so
insert...select might not be sufficient.
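A hypothetical sketch of what the per-column writes look like when each
column comes from its own source (the source tables input1/input2 are made
up for illustration; MERGE is used here only as the standard way to express
the upsert):

-- col1 comes from one source...
MERGE INTO result D
USING (SELECT rowkey, col1 FROM input1) S
ON (D.rowkey = S.rowkey)
WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) VALUES (S.rowkey, S.col1)

-- ...col2 comes from another, so a single INSERT ... SELECT cannot cover both
MERGE INTO result D
USING (SELECT rowkey, col2 FROM input2) S
ON (D.rowkey = S.rowkey)
WHEN MATCHED THEN UPDATE SET D.col2 = S.col2
WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col2) VALUES (S.rowkey, S.col2)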
Jesse Anderson <je...@bigdatainstitute.io> wrote on Thu, Jun 22, 2017 at 10:35 PM:
> If I'm understanding correctly, Hive does that with a insert into followed
> by a select statement that does the aggregation.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
>
> On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
>
>> Hi team,
>>
>> I am thinking about a SQL and stream computing related problem, want to
>> hear your opinions.
>>
>> In stream computing, there is a typical case like this:
>>
>> *We want to calculate a big wide result table, which has one rowkey and
>> ten
>> value columns:*
>> *create table result (*
>> * rowkey varchar(127) PRIMARY KEY,*
>> * col1 int,*
>> * col2 int,*
>> * ...*
>> * col10 int*
>> *);*
>
>
>>
>> Each of the value columns is calculated by a complex query, so there will
>> be ten SQLs to calculate
>> data for this table, for each sql:
>>
>> * First check whether there is a row for the specified `rowkey`.
>> * If yes, then `update`, otherwise `insert`.
>>
>> There is actually a dedicated sql syntax called `MERGE` designed for
>> this(SQL2008), a sample usage is:
>>
>> MERGE INTO result D
>> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
>> ON (D.rowkey = S.rowkey)
>> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
>> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>>
>>
>> *The semantic fits perfectly, but it is very verbose, and normal users
>> rarely used this syntax.*
>
>
>>
>> So my colleagues invented a new syntax for this scenario (Or more
>> precisely, a new interpretation for the INSERT statement). For the above
>> scenario, user will always write `insert` statement:
>>
>> insert into result(rowkey, col1) values(...)
>> insert into result(rowkey, col2) values(...)
>>
>> The sql interpreter will do a trick behind the scene: if the `rowkey`
>> exists, then update, otherwise `insert`. This solution is very concise,
>> but
>> violates the semantics of `insert`, using this solution INSERT will behave
>> differently in batch & stream processing.
>>
>> How do you guys think? which do you prefer? What's your reasoning?
>>
>> Looking forward to your opinions, thanks in advance.
>>
> --
> Thanks,
>
> Jesse
>
Re: SQL in Stream Computing: MERGE or INSERT?
Posted by Jesse Anderson <je...@bigdatainstitute.io>.
If I'm understanding correctly, Hive does that with an insert into followed
by a select statement that does the aggregation.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
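For reference, the Hive pattern looks roughly like this (the staging table
staged_result and its contents are placeholders; the target schema is the
one from the original mail):

-- Hive: append rows produced by a query into the target table
INSERT INTO TABLE result
SELECT rowkey, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
FROM staged_result;

Note that INSERT INTO in Hive appends rows rather than updating existing
ones, which is part of why the per-column upsert question comes up.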
On Thu, Jun 22, 2017 at 1:32 AM James <xu...@gmail.com> wrote:
> Hi team,
>
> I am thinking about a SQL and stream computing related problem, want to
> hear your opinions.
>
> In stream computing, there is a typical case like this:
>
> *We want to calculate a big wide result table, which has one rowkey and ten
> value columns:*
> *create table result (*
> * rowkey varchar(127) PRIMARY KEY,*
> * col1 int,*
> * col2 int,*
> * ...*
> * col10 int*
> *);*
>
> Each of the value columns is calculated by a complex query, so there will
> be ten SQLs to calculate
> data for this table, for each sql:
>
> * First check whether there is a row for the specified `rowkey`.
> * If yes, then `update`, otherwise `insert`.
>
> There is actually a dedicated sql syntax called `MERGE` designed for
> this(SQL2008), a sample usage is:
>
> MERGE INTO result D
> USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
> ON (D.rowkey = S.rowkey)
> WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
> WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1)
>
>
> *The semantic fits perfectly, but it is very verbose, and normal users
> rarely used this syntax.*
>
> So my colleagues invented a new syntax for this scenario (Or more
> precisely, a new interpretation for the INSERT statement). For the above
> scenario, user will always write `insert` statement:
>
> insert into result(rowkey, col1) values(...)
> insert into result(rowkey, col2) values(...)
>
> The sql interpreter will do a trick behind the scene: if the `rowkey`
> exists, then update, otherwise `insert`. This solution is very concise, but
> violates the semantics of `insert`, using this solution INSERT will behave
> differently in batch & stream processing.
>
> How do you guys think? which do you prefer? What's your reasoning?
>
> Looking forward to your opinions, thanks in advance.
>
--
Thanks,
Jesse