Posted to dev@storm.apache.org by Jungtaek Lim <ka...@gmail.com> on 2016/10/21 07:18:10 UTC

[DISCUSS] Drop features about GROUP BY, JOIN from Storm SQL Trident mode

Hi devs,

Sorry to send multiple mails at once. I had been resolving issues
sequentially, and have now paused for a bit to reflect on the direction of
Storm SQL.

I'd like to propose a destructive action: dropping the GROUP BY and JOIN
features from Storm SQL, which fortunately have not been released yet.

The reason for dropping these features is simple: they borrow Trident
semantics (aggregation within a micro-batch, or stateful aggregation), which
do not match true "streaming" semantics.

Spark and Flink interpret "streaming" aggregation and join as windowed
operators. Since there's no SQL standard for streaming (not even a de facto
one), they are adding these features to their APIs (Structured Streaming for
Spark, and the Table API for Flink), and don't address them on the SQL side
yet.
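
To make the difference concrete, here is a minimal sketch in plain Python
(not Storm, Trident, Spark, or Flink code; the event data and function names
are made up for illustration). It runs the same "GROUP BY key, COUNT(*)" once
per micro-batch and once per tumbling event-time window. The micro-batch
answer changes when only the batch size changes, while the windowed answer
depends only on the event times.

```python
from collections import Counter, defaultdict

# Hypothetical event stream: (event_time_seconds, key).
EVENTS = [(1, "a"), (2, "a"), (4, "b"), (11, "a"), (12, "b"), (13, "b")]


def count_per_micro_batch(events, batch_size):
    """GROUP BY key, COUNT(*) evaluated separately inside each micro-batch."""
    results = []
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        results.append(dict(Counter(key for _, key in batch)))
    return results


def count_per_tumbling_window(events, window_seconds):
    """GROUP BY key, COUNT(*) per tumbling event-time window."""
    windows = defaultdict(Counter)
    for event_time, key in events:
        # Assign each event to the window that contains its event time.
        window_start = (event_time // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


if __name__ == "__main__":
    # Same query, different answers, only because the batch size changed.
    print(count_per_micro_batch(EVENTS, batch_size=2))
    print(count_per_micro_batch(EVENTS, batch_size=3))
    # The windowed result is a function of the event times alone.
    print(count_per_tumbling_window(EVENTS, window_seconds=10))
```

Batch boundaries are an execution detail that users do not control from a SQL
statement, which is why the per-batch result is hard to explain to them.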

I was eager to add more features to Storm SQL to make progress (Bobby
pointed out something similar as well), but after working on these things I
changed my mind: not confusing users is more important than adding features.

Btw, Storm SQL "temporarily" relies on Trident since we don't have a
higher-level API on core and we don't want to build topologies from the
ground up. AFAIK, Trident was not chosen in order to live with micro-batch,
and IMHO Storm SQL should run in a per-tuple streaming manner instead of
micro-batch. Integrating the streams API into Storm SQL could be a great
internal project as a POC of the streams API. Exactly-once needs to be
addressed before that.

"GROUP BY" is also what SQE supports now (SQE aggregates in a stateful and
exactly-once way), so I would like to hear opinions on this.

Flink and Storm are waiting for Calcite to make progress on Streaming SQL:
https://calcite.apache.org/docs/stream.html (For now, most of the
definitions are not implemented yet.)
This means that we might not support Streaming SQL semantics in SQL
statements until Calcite finishes its work. I think this is OK since there
is much other work left on Storm SQL, and Storm SQL is experimental for now
anyway. (Spark Structured Streaming and Flink Streaming SQL are also in an
alpha or experimental state.)

While waiting, we might want to have a LINQ-style API like the Table API and
address aggregation and join there, but that requires a huge amount of work
and it would duplicate the work on the streams API (STORM-1961
<https://issues.apache.org/jira/browse/STORM-1961>) in terms of adding a
high-level API. IMHO, if the streams API is well defined, it should be fairly
easy to use, and a LINQ-style API would not necessarily be needed (though
some people may find it more convenient to use 'select', 'where', and so on).
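
For readers unfamiliar with the term, a LINQ-style API roughly means chaining
'select' and 'where' calls on a stream object. The following is only a toy
sketch in plain Python; the Stream class and its methods are hypothetical and
are not the Flink Table API or the proposed Storm streams API. It covers only
stateless per-row operations, since aggregation and join are exactly the
parts that would need agreed window semantics.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Hypothetical sample rows, used only for the demonstration below.
ROWS = [
    {"id": 1, "product": "apple", "units": 5},
    {"id": 2, "product": "pear", "units": 12},
    {"id": 3, "product": "apple", "units": 20},
]


@dataclass(frozen=True)
class Stream:
    """Hypothetical fluent wrapper over an iterable of dict rows."""
    rows: Iterable[dict]

    def where(self, predicate: Callable[[dict], bool]) -> "Stream":
        # Keep only the rows for which the predicate holds (lazy).
        return Stream(row for row in self.rows if predicate(row))

    def select(self, *fields: str) -> "Stream":
        # Project each row onto the named fields (lazy).
        return Stream({f: row[f] for f in fields} for row in self.rows)

    def __iter__(self) -> Iterator[dict]:
        return iter(self.rows)


if __name__ == "__main__":
    large_orders = Stream(ROWS).where(lambda r: r["units"] > 10).select("id", "product")
    print(list(large_orders))
```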

Please share your opinions about this. I'd especially like to see JW Player
participate in the discussion, since aggregation is already supported by SQE.

Thanks for reading such a long mail.

Thanks,
Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] Drop features about GROUP BY, JOIN from Storm SQL Trident mode

Posted by Jungtaek Lim <ka...@gmail.com>.

There seem to be no other opinions, so I will go ahead with dropping
"group by" and "join" from Storm SQL Trident mode until we're ready to
handle windowing in SQL semantics.

- Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] Drop features about GROUP BY, JOIN from Storm SQL Trident mode

Posted by Jungtaek Lim <ka...@gmail.com>.

Yes, there seem to be many things that already support CQL (for example
InfluxDB); what I mean is that there's no Streaming SQL or CQL standard from
the point of view of "SQL semantics".

We can define a LINQ-style "API" and include aggregation and join after
enough discussion (if my understanding is right, that's what Structured
Streaming is, and the Flink Table API is going the same way), but we don't
even have a higher-level API, and it would be duplicated work (and it should
be, if the higher-level API is well defined).

As I linked earlier, the Calcite project proposes its own streaming SQL
semantics, which define new keywords, new concepts, and so on. [1
<https://calcite.apache.org/docs/stream.html>] The thing is that it's not
implemented yet, since Calcite has a small community and most of its
contributors come from SQL on Hadoop, not from the streaming area. Only a
few contributions have come from our side and the Flink side.

Storm SQL has some remaining work even if we don't address aggregation for
now, so it would not be easy to jump over to the Calcite side to discuss,
persuade, or even help with the implementation. One of the Flink committers
initiated a discussion on "defining Streaming SQL semantics" [2
<http://mail-archives.apache.org/mod_mbox/flink-dev/201610.mbox/%3CCAAdrtT2T397E_jpnJM6zH-ysN8i0oOUnO8BnXSoTvFLMwh5L1g@mail.gmail.com%3E>]
but it seems that not many devs are interested. It still seems to be an
early stage for everyone.

So working on our own and joining later is one valid option, and joining
Flink's discussion now is another. Storm SQL has limited contributors, so
IMHO we need to prioritize and concentrate. Unless more contributors come to
Storm SQL, the former looks more realistic.

- Jungtaek Lim (HeartSaVioR)

[1] https://calcite.apache.org/docs/stream.html
[2]
http://mail-archives.apache.org/mod_mbox/flink-dev/201610.mbox/%3CCAAdrtT2T397E_jpnJM6zH-ysN8i0oOUnO8BnXSoTvFLMwh5L1g@mail.gmail.com%3E

Re: [DISCUSS] Drop features about GROUP BY, JOIN from Storm SQL Trident mode

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.

I am not currently very involved with Storm SQL, so take my comments with a grain of salt. I am +1 on waiting to define groupby and join until we have a solid base that we can build this on.
But I do want to push back a bit on there being no standard for streaming SQL. Technically that is true, but in general the streaming SQL solutions I have seen restrict the supported queries to either have no aggregations at all (pure element-by-element operations) or rely on the OVER clause, and more specifically on time-based windows.
The issue is that the output of a streaming operation is not a table; it is a protocol that includes updates. This is where Beam really shines, because it exposes all of that ugly underbelly and lets users have complete control over it. The API is rather complex, and I think a bit ugly, because of that. I would suggest that we define what we want it to look like, and let that drive the underlying implementation. If we are feeling ambitious, we can get together with Flink, Spark, and anyone else who might be interested and see if we can come to an understanding on what extensions to SQL we should put in to really make streaming work properly.
- Bobby
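
The point that the output is "a protocol that includes updates" can be sketched in plain Python. This is an illustration under stated assumptions, not Beam, Storm, or Calcite code: the function name and the "retract"/"add" record labels are made up. An unwindowed running SUM per key cannot emit a final row, so every input event produces a correction to a result that was already emitted.

```python
def running_sum_changelog(events):
    """Yield changelog records for a running SUM(value) GROUP BY key.

    Each input event retracts the previously emitted total for its key
    (if any) and then adds the new total. The record labels are
    hypothetical, chosen only for this sketch.
    """
    totals = {}
    for key, value in events:
        if key in totals:
            # The total emitted earlier for this key is no longer valid.
            yield ("retract", key, totals[key])
        totals[key] = totals.get(key, 0) + value
        yield ("add", key, totals[key])


if __name__ == "__main__":
    for record in running_sum_changelog([("a", 1), ("b", 2), ("a", 3)]):
        print(record)
```

A downstream consumer has to understand these records to reconstruct the current result, which is the part a plain SQL table abstraction hides.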
