Posted to dev@storm.apache.org by Jungtaek Lim <ka...@gmail.com> on 2016/11/03 13:48:24 UTC

Re: [DISCUSS] Drop features about GROUP BY, JOIN from Storm SQL Trident mode

There seem to be no other opinions, so I will go ahead and drop "group by"
and "join" from Storm SQL Trident mode until we're ready to handle windowing
in SQL semantics.

- Jungtaek Lim (HeartSaVioR)
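
For readers unfamiliar with the windowing mentioned above, here is a minimal
Python sketch of a GROUP BY count over tumbling (fixed, non-overlapping) time
windows. It is not Storm, Trident, or Calcite code; the function name, the
(timestamp, key) event shape, and the window size are all assumptions made
purely for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per (window_start, key) over fixed, non-overlapping windows.

    `events` is an iterable of (timestamp, key) pairs. This shape is an
    assumption for illustration, not a Storm SQL data model.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = (ts // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sample data: (timestamp in seconds, grouping key).
events = [(1, "a"), (4, "b"), (7, "a"), (11, "a"), (12, "b"), (19, "b")]
result = tumbling_window_counts(events, window_size=10)
```

Without a window, a GROUP BY over an unbounded stream never has a point at
which a group is complete; the window bounds each group so a result can be
emitted.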

On Sat, Oct 22, 2016 at 12:23 AM, Jungtaek Lim <ka...@gmail.com> wrote:

> Yes, there seem to be many systems that already support CQL (for example
> InfluxDB). What I mean is that there's no Streaming SQL or CQL standard
> from the point of view of "SQL semantics".
>
> We can define a LINQ-style "API" and include aggregation and join after
> enough discussion (if my understanding is right, that's what Structured
> Streaming is, and the Flink Table API is heading the same way), but we
> don't even have a higher-level API, and it would be duplicated work (and it
> should be, if the higher-level API is well defined).
>
> As I linked earlier, the Calcite project proposes its own streaming SQL
> semantics, which define new keywords, new concepts, and so on. [1
> <https://calcite.apache.org/docs/stream.html>] The thing is that it's not
> implemented yet, since Calcite has a small community and most of its
> contributors come from SQL on Hadoop, not the streaming area. Only a small
> number of contributions have come from us and the Flink side.
>
> Storm SQL has work remaining even if we don't address aggregation for
> now, so it would not be easy to jump over to the Calcite side to discuss,
> persuade, or even help implement. One of the Flink committers initiated a
> discussion on "defining Streaming SQL semantics" [2
> <http://mail-archives.apache.org/mod_mbox/flink-dev/201610.mbox/%3CCAAdrtT2T397E_jpnJM6zH-ysN8i0oOUnO8BnXSoTvFLMwh5L1g@mail.gmail.com%3E>]
> but it seems that not many devs are interested. It still seems to be an
> early stage for everyone.
>
> So working on our own and joining later is one valid option, and joining
> Flink's discussion now is another. Storm SQL has limited contributors, so
> IMHO we need to prioritize and concentrate. Unless more contributors come
> to Storm SQL, the former looks more realistic.
>
> - Jungtaek Lim (HeartSaVioR)
>
> [1] https://calcite.apache.org/docs/stream.html
> [2]
> http://mail-archives.apache.org/mod_mbox/flink-dev/201610.mbox/%3CCAAdrtT2T397E_jpnJM6zH-ysN8i0oOUnO8BnXSoTvFLMwh5L1g@mail.gmail.com%3E
>
> On Fri, Oct 21, 2016 at 10:50 PM, Bobby Evans
> <ev...@yahoo-inc.com.invalid> wrote:
>
> I am not currently very involved with Storm SQL, so take my comments
> with a grain of salt.  I am +1 on waiting to define groupby and join until
> we have a solid base that we can build this on.
> But I do want to push back a bit on there being no standard for
> streaming SQL.  Technically that is true, but in general the streaming SQL
> solutions I have seen restrict the supported queries to either have no
> aggregations at all (pure element by element operations) or rely on the
> OVER clause, and more specifically time based windows.
> The issue is that the output of a streaming operation is not a table, it
> is a protocol that includes updates.  This is where BEAM really shines
> because it exposes all of that ugly underbelly and lets users have complete
> control over that. The API is rather complex and I think a bit ugly because
> of that.  I would suggest that we define what we want it to look like, and
> let that drive the underlying implementation.  If we are feeling ambitious
> we can get together with Flink, Spark, and anyone else who might be
> interested and see if we can come to an understanding on what extensions to
> SQL we should put in to really make streaming work properly.
> - Bobby
>
>     On Friday, October 21, 2016 2:18 AM, Jungtaek Lim <ka...@gmail.com>
> wrote:
>
>
>  Hi devs,
>
> Sorry to send multiple mails at once. I had been resolving issues
> sequentially, and have now paused to reflect on the direction of
> Storm SQL.
>
> I'd like to propose a destructive action: dropping the GROUP BY and JOIN
> features from Storm SQL, which fortunately have not been released yet.
>
> The reason for dropping them is simple: they borrow Trident semantics
> (within a micro-batch, or stateful), which do not make sense as true
> "streaming" semantics.
>
> Spark and Flink interpret "streaming" aggregation and join as windowed
> operators. Since there's no SQL standard for streaming (not even a de facto
> one), they are adding these features to their APIs (Structured Streaming
> for Spark, and Table API for Flink), and don't address them on the SQL side
> yet.
>
> I was eager to add more features to Storm SQL to make progress (Bobby even
> pointed out something similar), but after working on these things, I have
> changed my mind: not confusing users is more important than adding features.
>
> Btw, Storm SQL "temporarily" relies on Trident since we don't have a
> higher-level API on core and we don't want to build topologies from the
> ground up. AFAIK, Trident was not chosen in order to live with
> micro-batching, and IMHO Storm SQL should run in a per-tuple streaming
> manner instead of micro-batch.
> Integrating the streams API into Storm SQL could be a great internal
> project as a POC of the streams API. Exactly-once needs to be addressed
> first.
>
> "GROUP BY" is also what SQE supports now (SQE aggregates in a stateful,
> exactly-once way), so I would like to hear your opinions regarding this.
>
> Flink and Storm are waiting for Calcite to make progress on Streaming SQL:
> https://calcite.apache.org/docs/stream.html (For now most of the
> definitions are not implemented yet.)
> This means that we might not support Streaming SQL semantics in SQL
> statements until Calcite finishes its work. I think this is OK since
> there is much other work left on Storm SQL, and Storm SQL is experimental
> anyway (Spark Structured Streaming and Flink Streaming SQL are also alpha
> or experimental.)
>
> While waiting, we might want to have a LINQ-style API like the Table API
> and address aggregation and join there, but that requires a huge amount of
> work and duplicates the streams API (STORM-1961
> <https://issues.apache.org/jira/browse/STORM-1961>) in terms of adding a
> high-level API. IMHO, if the streams API is well defined, that should be
> fairly easy, and a LINQ-style API would not necessarily be needed (though
> some may find 'select', 'where', and so on more convenient to use).
>
> Please share your opinion about this. I'd especially like to see JW Player
> participate in the discussion, since aggregation is already supported by SQE.
>
> Thanks for reading quite a long thread.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>