Posted to dev@beam.apache.org by Jan Lukavský <je...@seznam.cz> on 2018/11/30 14:29:23 UTC

[DISCUSS] Structuring Java based DSLs

Hi community,

I'm part of the Euphoria DSL team, and on behalf of this team I'd like to
discuss the possible development of the Java based DSLs currently present in
Beam. To my knowledge, there are currently two DSLs based on the Java SDK:
Euphoria and SQL. These DSLs currently share only the SDK itself,
although there might be room to share more of the effort. We already know
that both Euphoria and SQL need retractions, and there are
probably many more features that the two could share.

So, I'd like to open a discussion on what it would cost and what it 
would possibly bring, if instead of the current structure

   Java SDK

     | ---- SQL

     | ---- Euphoria

these DSLs would be structured as

   Java SDK ---> Euphoria ---> SQL

I'm well aware that this would be a large investment and a huge
change, but I'd like to gather some opinions and the general feeling of the
community about it. As a starting point from my side: structuring the DSLs
like this is logically consistent, because each API layer further narrows
completeness but offers a simpler API for simpler tasks, while adding a
higher-level view of the data processing pipeline and thus enabling more
optimizations. On the Euphoria side, these are various join implementations
(the most effective one depends on the data), pipeline sampling and
more. Some (or maybe most) of these optimizations would otherwise have to be
implemented in both DSLs, so implementing them once is beneficial.
Another benefit is that this would bring Euphoria "closer" to Beam core
development (which would be good, it is part of the project anyway,
right? :)) and help better drive features that, although currently
needed mostly by SQL, might be needed by other Java users anyway.
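
To make the point about data-dependent join implementations concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: JoinSketch, its methods and the row-count threshold are hypothetical names and are not part of Euphoria or Beam. Rows are modeled as {key, value} string pairs. The sketch shows a layer choosing between a broadcast-style and a shuffle-style inner join while the caller sees the same result either way.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are illustrative and are not Euphoria or Beam APIs.
final class JoinSketch {

  enum Strategy { BROADCAST, SHUFFLE }

  // Assumed heuristic: broadcast when the right input has at most broadcastLimit rows.
  static Strategy choose(long rightRows, long broadcastLimit) {
    return rightRows <= broadcastLimit ? Strategy.BROADCAST : Strategy.SHUFFLE;
  }

  // Groups {key, value} rows by key; stands in for a shuffle or a broadcast index.
  static Map<String, List<String>> groupByKey(List<String[]> rows) {
    Map<String, List<String>> grouped = new HashMap<>();
    for (String[] row : rows) {
      grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
    }
    return grouped;
  }

  // Broadcast-style inner join: index the right side once, then stream the left side.
  static List<String> broadcastJoin(List<String[]> left, List<String[]> right) {
    Map<String, List<String>> index = groupByKey(right);
    List<String> out = new ArrayList<>();
    for (String[] l : left) {
      for (String r : index.getOrDefault(l[0], Collections.emptyList())) {
        out.add(l[0] + ":" + l[1] + "," + r);
      }
    }
    Collections.sort(out); // sorted so both strategies are directly comparable
    return out;
  }

  // Shuffle-style inner join: group both sides by key, then pair the values per key.
  static List<String> shuffleJoin(List<String[]> left, List<String[]> right) {
    Map<String, List<String>> leftByKey = groupByKey(left);
    Map<String, List<String>> rightByKey = groupByKey(right);
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, List<String>> e : leftByKey.entrySet()) {
      for (String lv : e.getValue()) {
        for (String rv : rightByKey.getOrDefault(e.getKey(), Collections.emptyList())) {
          out.add(e.getKey() + ":" + lv + "," + rv);
        }
      }
    }
    Collections.sort(out);
    return out;
  }

  // Single entry point: callers never see which implementation ran.
  static List<String> join(List<String[]> left, List<String[]> right, long broadcastLimit) {
    return choose(right.size(), broadcastLimit) == Strategy.BROADCAST
        ? broadcastJoin(left, right)
        : shuffleJoin(left, right);
  }
}
```

If both DSLs called something like join(...), the selection logic would live in one place and neither front end would need its own copy.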

Thanks for the discussion; I'm looking forward to any opinions.

   Jan

Re: [DISCUSS] Structuring Java based DSLs

Posted by Jan Lukavský <je...@seznam.cz>.

Hi Robert, 

Euphoria must be a superset of SQL for the proposed approach to work, and I
think that it already is, or at least can be made so. Some subtleties might
be missing or different, but that is the nice thing: by building the DSLs
bottom up, we can make sure that they are mutually consistent, i.e. that
there are not multiple implementations of join semantics with slightly
different behavior. It is of course possible to take the parts that are
common and make a separate library, but the way I see it, it should be
possible for this shared library to be Euphoria itself; there are
(currently) no known features that would imply incompatibility between the
two (which would force the approach you propose).
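
The consistency argument can be sketched as follows, again in plain Java and purely as an illustration: FluentJoin and sqlInnerJoinOnKey are hypothetical names, not Euphoria or Beam SQL classes, and rows are modeled as {key, value} string pairs. The assumption shown is that the upper (SQL-like) layer contains no join logic of its own and only translates into the user-facing builder below it, so the two cannot drift apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch: FluentJoin and sqlInnerJoinOnKey are illustrative names,
// not Euphoria or Beam SQL classes.
final class LayeringSketch {

  // Lower layer: a fluent, user-facing builder that owns the only join implementation.
  static final class FluentJoin {
    private final List<String[]> left;
    private final List<String[]> right;
    private BiPredicate<String[], String[]> condition = (l, r) -> true;

    private FluentJoin(List<String[]> left, List<String[]> right) {
      this.left = left;
      this.right = right;
    }

    static FluentJoin of(List<String[]> left, List<String[]> right) {
      return new FluentJoin(left, right);
    }

    FluentJoin on(BiPredicate<String[], String[]> condition) {
      this.condition = condition;
      return this;
    }

    // Nested-loop inner join over {key, value} rows; output rows are "key:left,right".
    List<String> output() {
      List<String> out = new ArrayList<>();
      for (String[] l : left) {
        for (String[] r : right) {
          if (condition.test(l, r)) {
            out.add(l[0] + ":" + l[1] + "," + r[1]);
          }
        }
      }
      return out;
    }
  }

  // Upper layer: stands in for a planned "... JOIN ... ON l.key = r.key" node.
  // It has no join logic of its own and only translates to the layer below.
  static List<String> sqlInnerJoinOnKey(List<String[]> left, List<String[]> right) {
    return FluentJoin.of(left, right).on((l, r) -> l[0].equals(r[0])).output();
  }
}
```

The trade-off raised elsewhere in the thread still applies: anything the upper layer needs has to be expressible through the builder first.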

 Jan

---------- Original message ----------
From: Robert Bradshaw <ro...@google.com>
To: dev@beam.apache.org
Date: Nov 30, 2018 21:39:01
Subject: Re: [DISCUSS] Structuring Java based DSLs
"I don't really see Euphoria as a subset of SQL or the other way 
around, and I think it makes sense to use either without the other, so
by this criterion it makes more sense to keep them as siblings than to
nest them.

That said, I think it's really good to have a bunch of shared code,
e.g. a join library that could be used by both. One could even depend
on the other without having to abandon the sibling relationship.
Something like retractions belongs in the core SDK itself. Deeper than
that, actually, it should be part of the model.

- Robert 

On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dm...@apache.org> wrote: 
> 
> Jan, we made Kryo optional recently (it is a separate module and is used
> only in tests). From a quick look it seems that we forgot to remove the
> compile time dependency from Euphoria's build.gradle. The only "strong"
> dependencies I'm aware of are the core SDK and Guava. We'll probably be
> adding a sketching extension dependency soon.
> 
> D. 
> 
> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz> wrote: 
>> 
>> Hi Anton, 
>> reactions inline. 
>> 
>> ---------- Original message ----------
>> From: Anton Kedin <ke...@google.com>
>> To: dev@beam.apache.org
>> Date: Nov 30, 2018 18:17:06
>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>> 
>> I think this approach makes sense in general, Euphoria can be the
>> implementation detail of SQL, similar to Join Library or core SDK Schemas.
>>
>> I wonder though whether it would be better to bring Euphoria closer to
>> core SDK first, maybe even merge them together. If you look at Reuven's
>> recent work around schemas it seems like there are already similarities
>> between that and Euphoria's approach, unless I'm missing the point (e.g.
>> Filter transforms, FullJoin vs CoGroup... see [2]). And we're already
>> switching parts of SQL to those transforms (e.g. SQL Aggregation is now
>> implemented by core SDK's Group[3]).
>> 
>> 
>> 
>> Yes, these transforms seem to be very similar to those Euphoria has.
>> Whether or not to merge Euphoria with core is essentially just a decision of
>> the community (from my point of view).
>> 
>> 
>> 
>> Adding explicit Schema support to Euphoria will bring it both closer to
>> core SDK and make it natural to use for SQL. Can this be a first step
>> towards this integration?
>> 
>> 
>> 
>> Euphoria currently operates on pure PCollections, so when a PCollection has
>> a schema, it will be accessible by Euphoria. It makes sense to make use of
>> the schema in Euphoria: it seems natural on inputs to Euphoria operators,
>> but it might be tricky (not saying impossible) to actually produce
>> schema-aware PCollections as outputs from Euphoria operators (generally
>> speaking; in special cases that might be possible). Regarding inputs, there
>> is actually an intention to act on the type of the PCollection, e.g. when a
>> PCollection is already of type KV, then it is possible to make the key
>> extractor and value extractor optional in Euphoria builders. So it feels
>> natural to adapt the builders when a schema-aware PCollection is provided,
>> and make use of the provided schema. The rest of the Euphoria team might
>> correct me if I'm wrong.
>> 
>> 
>> 
>> 
>> One question I have is, does Euphoria bring dependencies that are not
>> needed by SQL, or does it more or less only rely on the core SDK?
>> 
>> 
>> 
>> I think the only relevant dependency that Euphoria has besides the core SDK
>> is Kryo. It is the default coder when no coder is provided, but that could
>> be made optional, e.g. the default coder would be supported only if an
>> appropriate module were available. That way I think that Euphoria has no
>> special dependencies.
>> 
>> 
>> 
>> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>> 
>> 
>> 
"

Re: [DISCUSS] Structuring Java based DSLs

Posted by Reuven Lax <re...@google.com>.

I'll send an update on schemas soon. But the tl;dr is that by the end of
this month, I expect it to be generally usable across a variety of input
formats.

Reuven

On Wed, Dec 12, 2018 at 9:38 AM Xinyu Liu <xi...@gmail.com> wrote:

> Agree with Kenn on this. From our SamzaRunner point of view, we would like
> Beam SQL to be self-contained and flexible enough for our users to use it
> in different scenarios, e.g. pure SQL and embedded in different SDKs. We are
> also extremely interested in the DataFrame-like API mentioned above. To
> digress a little bit from this topic, this is actually the current hurdle
> to letting our users try it out in Hadoop, since they expect this kind of
> API with columnar data set IO support, e.g. ORC. If there are any more
> details about the status of the DF API and columnar support, I will be very
> happy to learn more about it.
>
> Thanks,
> Xinyu
>
> On Wed, Dec 12, 2018 at 8:55 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi all,
>>
>> after letting this sink in for a while, I'd like to summarize the feedback
>> and emphasize some questions that appeared:
>>
>>  a) there were several 'it makes sense' opinions
>>
>>  b) there was one 'not right now' - which makes sense, but the purpose of
>> this discussion was to try to first answer the what and then the when :-)
>>
>>  c) there were several 'maybe, but':
>>
>>   i) it would be more complicated to code SQL against user-facing API,
>> because that way, each change needed by SQL would have to be first
>> implemented in this user-friendly API layer
>>
>>      I can absolutely agree with this, it would definitely be more
>> complicated and more work. I see basically two ways out. The first one
>> would be to move all the code from Euphoria into something similar to the
>> Join library, and let Euphoria be just the user-friendly layer on top of
>> this library (basically just the builders). That way, we could reuse the
>> code and be pretty much sure that the implementations of SQL transforms are
>> identical to what Euphoria would offer, which is one of the goals of this
>> discussion. The drawback would be that there would be no guarantee that
>> what this underlying library offers would also be accessible from
>> Euphoria. The complexity would not disappear, it would just be
>> moved to a different component: each feature newly added to the shared
>> library would have to be made accessible in Euphoria. The other way
>> would be to accept this added complexity in favor of making sure that
>> every feature that is needed by SQL is also available in Euphoria, because
>> the user-facing API would be used by SQL itself. I'd really like to
>> hear further community opinions on the pros and cons of these two (or maybe
>> I'm overlooking something and there is a third way).
>>
>>  ii) in some cases, we might want to support relational operators in SDK
>> harness for performance, and we don't want to close doors for this
>>
>>      Again, the motivation for this seems to be clear and valid, but the
>> question that arises is: under some conditions (something like having a
>> schema-aware PCollection), would we want to enable code reuse between logic
>> written in SQL and Euphoria to ensure consistent behavior? That would
>> probably mean that Euphoria would have to make use of the provided schema
>> of the PCollection and switch to a different behavior on the API level (more
>> DataFrame-like) and/or create different operators and pass them to the SDK
>> harness. This feature is currently completely missing, but it seems
>> plausible and maybe there could be benefits for both sides.
>>
>> Many thanks for any more opinions on this.
>>
>>  Jan
>>
>>
>> On 12/4/18 11:32 PM, Rui Wang wrote:
>>
>> For pure SQL users, there shouldn't be any SDK concepts. The SQL shell and
>> JDBC driver should be the way to interact with Beam through SQL.
>>
>>
>> For the embedded SQL use case in all SDKs (Python, Go, etc.), even assuming
>> there are relational algebra operators defined in the SDKs, each SDK still has
>> to implement its own way to parse SQL into operators (SQL is just a string).
>> To avoid that overhead, I would imagine that SDKs should keep SQL queries
>> and wait for a later but shared processing (I don't know if Portability
>> should handle SQL or if it could).
>>
>>
>> -Rui
>>
>> On Tue, Dec 4, 2018 at 2:04 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Kenn,
>>>
>>> my intent really was not to propose any changes right now. I'm trying to
>>> create a clear understanding of what the relation between Euphoria and
>>> SQL should be in the long run. From my point of view, Euphoria should always
>>> be a superset of SQL, because it should support complete relational algebra
>>> (and I'm not saying it does so right now, it should just be our goal) plus more
>>> flexible UDFs (not limited to the SQL standard) and stateful processing (which
>>> will probably not be part of SQL any time soon). There should be some sort
>>> of guarantee that the semantics of SQL and Euphoria are the same, because
>>> that is what users would expect. This can certainly be ensured by
>>> introducing another layer between Euphoria and the core SDK (e.g. the join
>>> library), but the question is: what makes this solution different from
>>> creating this shared library from Euphoria itself (when looking at the big
>>> picture)? And it is not only about implementations of joins or any other
>>> operators; there are other techniques that could be beneficial for SQL,
>>> e.g. pipeline sampling, automatic pipeline optimizations based on
>>> statistics from previous runs of batch queries, etc.
>>>
>>> The other way, in which relational algebra nodes become an essential part
>>> of (some) SDK, is equivalent to actually creating a SQL SDK, am I right?
>>> I understand that this approach can bring performance benefits, but
>>> besides that, is the language which implements SQL really important for
>>> users? Do we need SQL implementing Go UDFs, Java UDFs, Python UDFs? What
>>> would the resulting SQL query look like? If it is about allowing the use of
>>> SQL from all other SDKs (I want to do some basic preprocessing using SQL and
>>> then optimize some hard part in my favorite SDK), can this be solved by
>>> enabling SQL in all SDKs by mixing various SDK harnesses in a single
>>> pipeline instead (e.g. I want to use SQL in the Go SDK, I just tell the
>>> portable layer to run these operators using the Java SDK and these using Go)?
>>> That seems plausible, solving interoperability issues, while leaving the
>>> whole implementation of SQL as an internal detail. Generally this solves
>>> more issues, like the ability to reuse IOs in all SDKs (I'm aware that there
>>> are caveats, but that is beyond the scope of the intended topic of this
>>> thread).
>>>
>>>  Jan
>>> On 12/3/18 7:27 PM, Kenneth Knowles wrote:
>>>
>>> To be honest, I don't think there's much worth doing right now. I think
>>> more self-contained is better for Beam SQL, generally. Two things I have on
>>> my mind are (1) SQL as an inline transform in every SDK and (2) supporting
>>> pure SQL like the CLI and JDBC driver, where the underlying language is an
>>> implementation detail.
>>>
>>> Big picture / long term, I would envision pure SQL, embedded SQL
>>> transform, and a DataFrame-like API in ~each SDK all desugaring to
>>> relational algebra nodes, sharing an optimizer, sharing some amount of
>>> mapping the physical plan to Beam transforms. The necessarily SDK-specific
>>> parts are the embedded transform API and UDFs in the host language. The
>>> rest should remain an implementation detail that we can change.
>>>
>>>  - For example, it is easy to imagine a customized columnar
>>> element/bundle encoding and SDK harness that only works for SQL to remove
>>> overhead of being general purpose. It could be written in C/C++/Go if we
>>> wanted to squeeze it for perf. Such things are made harder by having an
>>> elaborate end-user API between SQL and the core Beam model.
>>>  - Conversely, for whatever is chosen to underlie SQL's execution,
>>> stability is paramount. Ideally the simplest and least likely to change
>>> transforms would be the foundation. And I wouldn't want to have to design a
>>> user-friendly API for Euphoria or the join library just to enable a
>>> different join algorithm in SQL.
>>>
>>> So my take is keep SQL flexible, implement SQL on low-level and stable
>>> APIs, use join library, Euphoria, etc, if it looks like a big win, but
>>> don't build any policy here or do big refactors right now.
>>>
>>> Kenn
>>>
>>> On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hi Robert,
>>>>
>>>> currently there is no actual proposal, I was just trying to gather
>>>> feedback from the community. But my original thoughts would be [1]. I
>>>> actually don't see much need for restructuring the code by nesting
>>>> directories. If the community sees that it would make sense to structure
>>>> the dependencies, the second step would probably be to figure out how to
>>>> accomplish this. I don't have any exact solution in mind so far; we
>>>> would probably first need to identify features that are needed by
>>>> SQL and not currently supported by Euphoria. Then we can actually
>>>> identify the costs and see if this still makes sense.
>>>>
>>>>   Jan
>>>>
>>>> On 12/3/18 6:17 PM, Robert Bradshaw wrote:
>>>> > Taking a step back, what exactly is the proposal? Looking at the
>>>> > original message, I see
>>>> >
>>>> > (1) Letting SQL take a dependency on Euphoria, sharing more code and
>>>> > taking advantage of the logical nesting of levels of abstraction. This
>>>> > makes sense to me.
>>>> > (2) Nesting the directories (but not the gradle targets or module
>>>> > names?). Here I'm not so sure about the benefit, especially vs. the
>>>> > cost.
>>>> > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>> >> I think that the fact that SQL uses some other internal dependency
>>>> >> should remain a hidden implementation detail. I absolutely agree that the
>>>> >> dependency should of course remain sdks-java-sql in all cases.
>>>> >>
>>>> >>     Jan
>>>> >>
>>>> >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
>>>> >>> I suppose what I'm trying to say is that I see this module structure
>>>> >>> as a tool for discoverability and enumerating end-user endpoints. In
>>>> >>> other words, if one wants to use SQL, it would seem odd to have to
>>>> >>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
>>>> >>> sdks-java-euphoria is also a DSL one might use. A sibling relationship
>>>> >>> does not prohibit the layered approach to implementation that sounds
>>>> >>> like it makes sense.
>>>> >>>
>>>> >>> (As for merging Euphoria into core, my initial impression is that's
>>>> >>> probably a good idea, and something we should consider for 3.0 at the
>>>> >>> very least.)
>>>> >>>
>>>> >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>> >>>> Hi Rui,
>>>> >>>>
>>>> >>>> yes, there are optimizations that could be added by each layer.
>>>> >>>> The purpose of the Euphoria layer actually is not to reorder or modify
>>>> >>>> any user operators that are present in the pipeline (because it might
>>>> >>>> not have enough information to do this), but it can for instance choose
>>>> >>>> between various join implementations (shuffle join, broadcast join, ...),
>>>> >>>> so the optimizations it can do are more low level. But this plays nicely
>>>> >>>> with the DSL hierarchy: each layer adds a few more restrictions, but can
>>>> >>>> therefore do more optimizations. And I think that the layer between the
>>>> >>>> SDK and SQL wouldn't have to support SQL optimizations, it would only
>>>> >>>> have to provide a way for SQL to express these optimizations.
>>>> >>>>
>>>> >>>>     Jan
>>>> >>>>
>>>> >>>> ---------- Original message ----------
>>>> >>>> From: Rui Wang <ru...@google.com>
>>>> >>>> To: dev@beam.apache.org
>>>> >>>> Date: Nov 30, 2018 22:43:04
>>>> >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>>> >>>>
>>>> >>>> SQL's optimization is another area to consider for integration.
>>>> >>>> SQL optimization includes pushing down filters/projections, merging,
>>>> >>>> removing or swapping plan nodes, and comparing plan costs to choose the
>>>> >>>> best plan. Adding another layer between SQL and the Java core might
>>>> >>>> require that layer to support SQL optimizations if there is a need.
>>>> >>>>
>>>> >>>> I don't have a clear picture of what SQL needs from Euphoria for
>>>> >>>> optimization (best case is nothing). As those optimizations are happening
>>>> >>>> or will happen, we might start to get a sense of it.
>>>> >>>>
>>>> >>>> -Rui
>>>> >>>>
>>>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Xinyu Liu <xi...@gmail.com>.

Agree with Kenn on this. From our SamzaRunner point of view, we would like
Beam SQL to be self-contained and flexible enough for our users to use it
in different scenarios, e.g. pure SQL and embedded in different SDKs. We are
also extremely interested in the DataFrame-like API mentioned above. To
digress a little bit from this topic, this is actually the current hurdle
to letting our users try it out in Hadoop, since they expect this kind of
API with columnar data set IO support, e.g. ORC. If there are any more
details about the status of the DF API and columnar support, I will be very
happy to learn more about it.

Thanks,
Xinyu

On Wed, Dec 12, 2018 at 8:55 AM Jan Lukavský <je...@seznam.cz> wrote:

> Hi all,
>
> after letting this sink in for a while, I'd like to summarize the feedback
> and emphasize some questions that appeared:
>
>  a) there were several 'it makes sense' opinions
>
>  b) there was one 'not right now' - which makes sense, but the purpose of
> this discussion was to try to first answer the what and then the when :-)
>
>  c) there were several 'maybe, but':
>
>   i) it would be more complicated to code SQL against user-facing API,
> because that way, each change needed by SQL would have to be first
> implemented in this user-friendly API layer
>
>      I can absolutely agree with this, it would definitely be more
> complicated and more work. I see basically two ways out. The first one
> would be to move all the code from Euphoria into something similar to the
> Join library, and let Euphoria be just the user-friendly layer on top of
> this library (basically just the builders). That way, we could reuse the
> code and be pretty much sure that the implementations of SQL transforms are
> identical to what Euphoria would offer, which is one of the goals of this
> discussion. The drawback would be that there would be no guarantee that
> what this underlying library offers would also be accessible from
> Euphoria. The complexity would not disappear, it would just be
> moved to a different component: each feature newly added to the shared
> library would have to be made accessible in Euphoria. The other way
> would be to accept this added complexity in favor of making sure that
> every feature that is needed by SQL is also available in Euphoria, because
> the user-facing API would be used by SQL itself. I'd really like to
> hear further community opinions on the pros and cons of these two (or maybe
> I'm overlooking something and there is a third way).
>
>  ii) in some cases, we might want to support relational operators in SDK
> harness for performance, and we don't want to close doors for this
>
>      Again, the motivation for this seems clear and valid, but the
> question that arises is: under certain conditions (for example, when we
> have a schema-aware PCollection), would we want to enable code reuse
> between logic written in SQL and in Euphoria to ensure consistent
> behavior? That would probably mean that Euphoria would have to make use of
> the schema provided by the PCollection and switch to different behavior at
> the API level (more DataFrame-like) and/or create different operators and
> pass them to the SDK harness. This feature is currently completely
> missing, but it seems plausible, and there could be benefits for both
> sides.
>
> Many thanks for any more opinions on this.
>
>  Jan
>
>
> On 12/4/18 11:32 PM, Rui Wang wrote:
>
> For pure SQL users, there shouldn't be any SDK concepts. The SQL shell and
> JDBC driver should be the way to interact with Beam through SQL.
>
>
> For the embedded SQL use case in all SDKs (Python, Go, etc.), even
> assuming there are relational algebra operators defined in the SDKs, each
> SDK still has to implement its own way to parse SQL into operators (SQL is
> just a string). To avoid that overhead, I would imagine that SDKs should
> keep SQL queries and wait for later but shared processing (I don't know if
> Portability should handle SQL or if it could).
>
>
> -Rui
>
> On Tue, Dec 4, 2018 at 2:04 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi Kenn,
>>
>> my intent really was not to propose any changes right now. I'm trying to
>> create a clear understanding about what the relation between Euphoria and
>> SQL should be in long run. In my point of view, Euphoria should be always
>> superset of SQL, because it should support complete relational algebra (and
>> I'm not saying it does so right now, it should just be our goal) plus more
>> flexible UDFs (not limited to SQL standard) and stateful processing (which
>> will probably not be part of SQL any time soon). There should be some sort
>> of guarantees that the semantics of SQL and Euphoria are the same, because
>> that is what users would expect it to be. This can be for sure ensured by
>> introducing another layer between Euphoria and core SDK (e.g. the join
>> library), but the question is - what makes this solution different from
>> creating this shared library from Euphoria itself (when looking at the big
>> picture)? And it is not only about implementations of joins or any other
>> operators, but there are other techniques that could be beneficial for SQL
>> - e.g. pipeline sampling, automatic pipeline optimizations based on
>> statistics from previous runs of batch queries, etc.
>>
>> The other way - that relational algebra nodes will become essential part
>> of (some) SDK, that is equivalent to actually creating SQL SDK, am I right?
>> I understand, that this approach can bring performance benefits, but
>> besides that - is the language which implements SQL really important for
>> users? Do we need SQL implementing Go UDFs, Java UDFs, Python UDFs? How
>> would the resulting SQL query look like? If it is about allowing using SQL
>> from all other SDKs (I want to do some basic preprocessing using SQL and
>> then optimize some hard part in my favorite SDK) - can this be solved by
>> enabling SQL in all SDKs by mixing various SDKs harnesses in single
>> pipeline instead (e.g. I want to use SQL in Go SDK, I just tell the
>> portable layer to run these operators using Java SDK and these using Go)?
>> That seems plausible, solving interoperability issues, while leaving the
>> whole implementation of SQL as an internal detail. Generally this solves
>> more issues, like ability to reuse IOs in all SDKs (I'm aware that there
>> are caveats, but that is beyond scope of intended discussion topic of this
>> thread).
>>
>>  Jan
>> On 12/3/18 7:27 PM, Kenneth Knowles wrote:
>>
>> To be honest, I don't think there's much worth doing right now. I think
>> more self-contained is better for Beam SQL, generally. Two things I have on
>> my mind are (1) SQL as an inline transform in every SDK and (2) supporting
>> pure SQL like the CLI and JDBC driver, where the underlying language is an
>> implementation detail.
>>
>> Big picture / long term, I would envision pure SQL, embedded SQL
>> transform, and a DataFrame-like API in ~each SDK all desugaring to
>> relational algebra nodes, sharing an optimizer, sharing some amount of
>> mapping the physical plan to Beam transforms. The necessarily SDK-specific
>> parts are the embedded transform API and UDFs in the host language. The
>> rest should remain an implementation detail that we can change.
>>
>>  - For example, it is easy to imagine a customized columnar
>> element/bundle encoding and SDK harness that only works for SQL to remove
>> overhead of being general purpose. It could be written in C/C++/Go if we
>> wanted to squeeze it for perf. Such things are made harder by having an
>> elaborate end-user API between SQL and the core Beam model.
>>  - Conversely, for whatever is chosen to underlie SQL's execution,
>> stability is paramount. Ideally the simplest and least likely to change
>> transforms would be the foundation. And I wouldn't want to have to design a
>> user-friendly API for Euphoria or the join library just to enable a
>> different join algorithm in SQL.
>>
>> So my take is keep SQL flexible, implement SQL on low-level and stable
>> APIs, use join library, Euphoria, etc, if it looks like a big win, but
>> don't build any policy here or do big refactors right now.
>>
>> Kenn
>>
>> On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Robert,
>>>
>>> currently there is no actual proposal, I was just trying to gather
>>> feedback from the community. But my original thoughts would be [1]. I
>>> actually don't see much need for restructuring the code by nesting
>>> directories. If the community sees that it would make sense to structure
>>> the dependencies, the second step would probably be to figure out how to
>>> accomplish this. I don't have any exact solution in mind so far, it
>>> would be probably needed to first identify features that are needed by
>>> SQL and not supported by Euphoria currently. Then we can actually
>>> identify costs and see if this still makes sense.
>>>
>>>   Jan
>>>
>>> On 12/3/18 6:17 PM, Robert Bradshaw wrote:
>>> > Taking a step back, what exactly is the proposal. Looking at the
>>> > original message, I see
>>> >
>>> > (1) Letting SQL take a dependency on Euphoria, sharing more code and
>>> > taking advantage of the logical nesting of levels of abstraction. This
>>> > makes sense to me.
>>> > (2) Nesting the directories (but not the gradle targets or module
>>> > names?). Here I'm not so sure about the benefit, especially vs. the
>>> > cost.
>>> > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz> wrote:
>>> >> I think that the fact that SQL uses some other internal dependency
>>> >> should remain a hidden implementation detail. I absolutely agree that
>>> >> the dependency should of course remain sdks-java-sql in all cases.
>>> >>
>>> >>     Jan
>>> >>
>>> >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
>>> >>> I suppose what I'm trying to say is that I see this module structure
>>> >>> as a tool for discoverability and enumerating end-user endpoints. In
>>> >>> other words, if one wants to use SQL, it would seem odd to have to
>>> >>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
>>> >>> sdks-java-euphoria is also a DSL one might use. A sibling
>>> >>> relationship does not prohibit the layered approach to
>>> >>> implementation that sounds like it makes sense.
>>> >>>
>>> >>> (As for merging Euphoria into core, my initial impression is that's
>>> >>> probably a good idea, and something we should consider for 3.0 at the
>>> >>> very least.)
>>> >>>
>>> >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz>
>>> wrote:
>>> >>>> Hi Rui,
>>> >>>>
>>> >>>> yes, there are optimizations that could be added by each layer. The
>>> purpose of Euphoria layer actually is not to reorder or modify any user
>>> operators that are present in the pipeline (because it might not have
>>> enough information to do this), but it can for instance choose between
>>> various join implementations (shuffle join, broadcast join, ...) - so the
>>> optimizations it can do are more low level. But this plays nicely with the
>>> DSL hierarchy - each layer adds a little more restrictions, but can
>>> therefore do more optimizations. And I think that the layer between SDK and
>>> SQL wouldn't have to support SQL optimizations, it would only have to
>>> support way for SQL to express these optimizations.
>>> >>>>
>>> >>>>     Jan
>>> >>>>
>>> >>>> ---------- Original e-mail ----------
>>> >>>> From: Rui Wang <ru...@google.com>
>>> >>>> To: dev@beam.apache.org
>>> >>>> Date: 30. 11. 2018 22:43:04
>>> >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>> >>>>
>>> >>>> SQL's optimization is another area to consider for integration. SQL
>>> optimization includes pushing down filters/projections, merging or removing
>>> or swapping plan nodes and comparing plan costs to choose best plan.  Add
>>> another layer between SQL and java core might need the layer to support SQL
>>> optimizations if there is a need.
>>> >>>>
>>> >>>> I don't have a clear image on what SQL needs from Euphoria for
>>> optimization(best case is nothing). As those optimizations are happening or
>>> will happen, we might start to have a sense of it.
>>> >>>>
>>> >>>> -Rui
>>> >>>>
>>> >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >>>>
>>> >>>> I don't really see Euphoria as a subset of SQL or the other way
>>> >>>> around, and I think it makes sense to use either without the other,
>>> >>>> so by this criterion I'd keep them as siblings rather than nesting
>>> >>>> them.
>>> >>>>
>>> >>>> That said, I think it's really good to have a bunch of shared code,
>>> >>>> e.g. a join library that could be used by both. One could even
>>> >>>> depend on the other without having to abandon the sibling
>>> >>>> relationship. Something like retractions belongs in the core SDK
>>> >>>> itself. Deeper than that, actually, it should be part of the model.
>>> >>>>
>>> >>>> - Robert
>>> >>>>
>>> >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dm...@apache.org>
>>> wrote:
>>> >>>>> Jan, we made Kryo optional recently (it is a separate module and
>>> is used only in tests). From a quick look it seems that we forgot to remove
>>> compile time dependency from euphoria's build.gradle. Only "strong"
>>> dependencies I'm aware of are core SDK and guava. We'll be probably adding
>>> sketching extension dependency soon.
>>> >>>>>
>>> >>>>> D.
>>> >>>>>
>>> >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz>
>>> wrote:
>>> >>>>>> Hi Anton,
>>> >>>>>> reactions inline.
>>> >>>>>>
>>> >>>>>> ---------- Original e-mail ----------
>>> >>>>>> From: Anton Kedin <ke...@google.com>
>>> >>>>>> To: dev@beam.apache.org
>>> >>>>>> Date: 30. 11. 2018 18:17:06
>>> >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>> >>>>>>
>>> >>>>>> I think this approach makes sense in general, Euphoria can be the
>>> implementation detail of SQL, similar to Join Library or core SDK Schemas.
>>> >>>>>>
>>> >>>>>> I wonder though whether it would be better to bring Euphoria
>>> closer to core SDK first, maybe even merge them together. If you look at
>>> Reuven's recent work around schemas it seems like there are already
>>> similarities between that and Euphoria's approach, unless I'm missing the
>>> point (e.g. Filter transforms, FullJoin vs CoGroup... see [2]). And we're
>>> already switching parts of SQL to those transforms (e.g. SQL Aggregation is
>>> now implemented by core SDK's Group[3]).
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Yes, these transforms seem to be very similar to those Euphoria
>>> has. Whether or not to merge Euphoria with core is essentially just a
>>> decision of the community (in my point of view).
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Adding explicit Schema support to Euphoria will bring it both
>>> closer to core SDK and make it natural to use for SQL. Can this be a first
>>> step towards this integration?
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Euphoria currently operates on pure PCollections, so when
>>> PCollection has a schema, it will be accessible by Euphoria. It makes sense
>>> to make use of the schema in Euphoria - it seems natural on inputs to
>>> Euphoria operators, but it might be tricky (not saying impossible) to
>>> actually produce schema-aware PCollections as outputs from Euphoria
>>> operators (generally speaking, in special cases that might be possible).
>>> Regarding inputs, there is actually intention to act on type of PCollection
>>> - e.g. when PCollection is already of type KV, then it is possible to make
>>> key extractor and value extractor optional in Euphoria builders, so it
>>> feels natural to enable changing the builders when a schema-aware
>>> PCollection, and make use of the provided schema. The rest of Euphoria team
>>> might correct me, if I'm wrong.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> One question I have is, does Euphoria bring dependencies that are
>>> not needed by SQL, or does more or less only rely on the core SDK?
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> I think the only relevant dependency that Euphoria has besides
>>> core SDK is Kryo. It is the default coder when no coder is provided, but
>>> that could be made optional - e.g. the default coder would be supported
>>> only if an appropriate module would be available. That way I think that
>>> Euphoria has no special dependencies.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> [1]
>>> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>>> >>>>>> [2]
>>> https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>>> >>>>>> [3]
>>> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz>
>>> wrote:
>>> >>>>>>
>>> >>>>>> Hi community,
>>> >>>>>>
>>> >>>>>> I'm part of Euphoria DSL team, and on behalf of this team, I'd
>>> like to
>>> >>>>>> discuss possible development of Java based DSLs currently present
>>> in
>>> >>>>>> Beam. In my knowledge, there are currently two DSLs based on Java
>>> SDK -
>>> >>>>>> Euphoria and SQL. These DSLs currently share only the SDK itself,
>>> >>>>>> although there might be room to share some more effort. We
>>> already know
>>> >>>>>> that both Euphoria and SQL have need for retractions, but there
>>> are
>>> >>>>>> probably many more features that these two could share.
>>> >>>>>>
>>> >>>>>> So, I'd like to open a discussion on what it would cost and what
>>> it
>>> >>>>>> would possibly bring, if instead of the current structure
>>> >>>>>>
>>> >>>>>>      Java SDK
>>> >>>>>>
>>> >>>>>>        | ---- SQL
>>> >>>>>>
>>> >>>>>>        | ---- Euphoria
>>> >>>>>>
>>> >>>>>> these DSLs would be structured as
>>> >>>>>>
>>> >>>>>>      Java SDK ---> Euphoria ---> SQL
>>> >>>>>>
>>> >>>>>> I'm absolutely sure that this would be a great investment and a
>>> huge
>>> >>>>>> change, but I'd like to gather some opinions and general feelings
>>> of the
>>> >>>>>> community about this. Some points to start the discussion from my
>>> side
>>> >>>>>> would be, that structuring DSLs like this has internal logical
>>> >>>>>> consistency, because each API layer further narrows completeness,
>>> but
>>> >>>>>> brings simpler API for simpler tasks, while adding additional
>>> high-level
>>> >>>>>> view of the data processing pipeline and thus enabling more
>>> >>>>>> optimizations. On Euphoria side, these are various
>>> implementations joins
>>> >>>>>> (most effective implementation depends on data), pipeline
>>> sampling and
>>> >>>>>> more. Some (or maybe most) of these optimizations would have to be
>>> >>>>>> implemented in both DSLs, so implementing them once is beneficial.
>>> >>>>>> Another benefit is that this would bring Euphoria "closer" to
>>> Beam core
>>> >>>>>> development (which would be good, it is part of the project
>>> anyway,
>>> >>>>>> right? :)) and help better drive features, that although currently
>>> >>>>>> needed mostly by SQL, might be needed by other Java users anyway.
>>> >>>>>>
>>> >>>>>> Thanks for discussion and looking forward to any opinions.
>>> >>>>>>
>>> >>>>>>      Jan
>>> >>>>>>
>>>
>>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Jan Lukavský <je...@seznam.cz>.

Hi all,

after letting this sink in for a while, I'd like to summarize the feedback 
and emphasize some questions that appeared:

  a) there were several 'it makes sense' opinions

  b) there was one 'not right now' - which makes sense, but the purpose 
of this discussion was to try to first answer the what and then the when :-)

  c) there were several 'maybe, but':

   i) it would be more complicated to code SQL against user-facing API, 
because that way, each change needed by SQL would have to be first 
implemented in this user-friendly API layer

      I absolutely agree with this; it would definitely be more 
complicated and more work. I see basically two ways out. The first 
would be to move all the code from Euphoria into something similar to 
the Join library, and let Euphoria be just the user-friendly layer on 
top of this library (basically just the builders). That way, we could 
reuse the code and be fairly sure that the implementations of the SQL 
transforms are identical to what Euphoria offers, which is one of the 
goals of this discussion. The drawback is that there would be no 
guarantee that everything the underlying library offers is also 
accessible from Euphoria. The complexity would not disappear, it would 
just move to a different component: each feature newly added to the 
shared library would still have to be exposed in Euphoria. The second 
way would be to accept this added complexity in exchange for the 
assurance that every feature needed by SQL is also available in 
Euphoria, because SQL itself would use the user-facing API. I'd really 
like to hear further community opinions on the pros and cons of these 
two (or maybe I'm overlooking something and there is a third way).

  ii) in some cases, we might want to support relational operators in 
SDK harness for performance, and we don't want to close doors for this

      Again, the motivation for this seems clear and valid, but the 
question that arises is: under certain conditions (for example, when we 
have a schema-aware PCollection), would we want to enable code reuse 
between logic written in SQL and in Euphoria to ensure consistent 
behavior? That would probably mean that Euphoria would have to make use 
of the schema provided by the PCollection and switch to different 
behavior at the API level (more DataFrame-like) and/or create different 
operators and pass them to the SDK harness. This feature is currently 
completely missing, but it seems plausible, and there could be 
benefits for both sides.
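
To make the first option under c) i) a bit more concrete, here is a 
minimal Java sketch. It is purely illustrative: it uses plain in-memory 
lists instead of PCollections, and every name in it (DslLayeringSketch, 
SharedJoinLibrary, FluentJoin, sqlLikeJoin) is hypothetical and exists 
in neither Beam nor Euphoria. It only shows the structure: the join 
logic lives in one shared library, the user-facing layer is a thin 
builder over it, and the SQL-like layer calls the library directly, so 
both paths produce identical results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; none of these names exist in Beam or Euphoria. */
final class DslLayeringSketch {

  /** Shared low-level layer: the only place where the join logic lives. */
  static final class SharedJoinLibrary {
    /** In-memory inner hash join; each output row is "left|right". */
    static <L, R, K> List<String> innerJoin(
        List<L> left, List<R> right, Function<L, K> leftKey, Function<R, K> rightKey) {
      // Index the right side by key.
      Map<K, List<R>> index = new HashMap<>();
      for (R r : right) {
        index.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
      }
      // Probe the index with every left element, preserving left order.
      List<String> out = new ArrayList<>();
      for (L l : left) {
        for (R r : index.getOrDefault(leftKey.apply(l), Collections.emptyList())) {
          out.add(l + "|" + r);
        }
      }
      return out;
    }
  }

  /** User-facing layer: a fluent builder only, with no join logic of its own. */
  static final class FluentJoin<L, R> {
    private final List<L> left;
    private final List<R> right;

    private FluentJoin(List<L> left, List<R> right) {
      this.left = left;
      this.right = right;
    }

    static <L, R> FluentJoin<L, R> of(List<L> left, List<R> right) {
      return new FluentJoin<>(left, right);
    }

    /** Delegates straight to the shared library. */
    <K> List<String> by(Function<L, K> leftKey, Function<R, K> rightKey) {
      return SharedJoinLibrary.innerJoin(left, right, leftKey, rightKey);
    }
  }

  /** SQL-like layer: bypasses the builder and calls the same library. */
  static List<String> sqlLikeJoin(List<String> left, List<String> right) {
    // Assumption for this sketch: rows are "key:value" strings joined on the key.
    Function<String, String> key = row -> row.substring(0, row.indexOf(':'));
    return SharedJoinLibrary.innerJoin(left, right, key, key);
  }

  public static void main(String[] args) {
    List<String> users = Arrays.asList("1:alice", "2:bob");
    List<String> orders = Arrays.asList("1:book", "1:pen", "3:lamp");
    Function<String, String> key = row -> row.substring(0, row.indexOf(':'));

    List<String> viaBuilder = FluentJoin.of(users, orders).by(key, key);
    List<String> viaSql = sqlLikeJoin(users, orders);
    System.out.println(viaBuilder.equals(viaSql)); // prints true
  }
}
```

The drawback described above shows up here as well: anything added to 
SharedJoinLibrary reaches users of the builder only once FluentJoin 
exposes it.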

Many thanks for any more opinions on this.

  Jan


On 12/4/18 11:32 PM, Rui Wang wrote:
> For pure SQL users, there shouldn't be any SDK concepts. The SQL shell 
> and JDBC driver should be the way to interact with Beam through SQL.
>
>
> For the embedded SQL use case in all SDKs (Python, Go, etc.), even 
> assuming there are relational algebra operators defined in the SDKs, 
> each SDK still has to implement its own way to parse SQL into operators 
> (SQL is just a string). To avoid that overhead, I would imagine that 
> SDKs should keep SQL queries and wait for later but shared processing 
> (I don't know if Portability should handle SQL or if it could).
>
>
> -Rui
>
> On Tue, Dec 4, 2018 at 2:04 AM Jan Lukavský <je.ik@seznam.cz> wrote:
>
>     Hi Kenn,
>
>     my intent really was not to propose any changes right now. I'm
>     trying to create a clear understanding about what the relation
>     between Euphoria and SQL should be in long run. In my point of
>     view, Euphoria should be always superset of SQL, because it should
>     support complete relational algebra (and I'm not saying it does so
>     right now, it should just be our goal) plus more flexible UDFs
>     (not limited to SQL standard) and stateful processing (which will
>     probably not be part of SQL any time soon). There should be some
>     sort of guarantees that the semantics of SQL and Euphoria are the
>     same, because that is what users would expect it to be. This can
>     be for sure ensured by introducing another layer between Euphoria
>     and core SDK (e.g. the join library), but the question is - what
>     makes this solution different from creating this shared library
>     from Euphoria itself (when looking at the big picture)? And it is
>     not only about implementations of joins or any other operators,
>     but there are other techniques that could be beneficial for SQL -
>     e.g. pipeline sampling, automatic pipeline optimizations based on
>     statistics from previous runs of batch queries, etc.
>
>     The other way - that relational algebra nodes will become
>     essential part of (some) SDK, that is equivalent to actually
>     creating SQL SDK, am I right? I understand, that this approach can
>     bring performance benefits, but besides that - is the language
>     which implements SQL really important for users? Do we need SQL
>     implementing Go UDFs, Java UDFs, Python UDFs? How would the
>     resulting SQL query look like? If it is about allowing using SQL
>     from all other SDKs (I want to do some basic preprocessing using
>     SQL and then optimize some hard part in my favorite SDK) - can
>     this be solved by enabling SQL in all SDKs by mixing various SDKs
>     harnesses in single pipeline instead (e.g. I want to use SQL in Go
>     SDK, I just tell the portable layer to run these operators using
>     Java SDK and these using Go)? That seems plausible, solving
>     interoperability issues, while leaving the whole implementation of
>     SQL as an internal detail. Generally this solves more issues, like
>     ability to reuse IOs in all SDKs (I'm aware that there are
>     caveats, but that is beyond scope of intended discussion topic of
>     this thread).
>
>      Jan
>
>     On 12/3/18 7:27 PM, Kenneth Knowles wrote:
>>     To be honest, I don't think there's much worth doing right now. I
>>     think more self-contained is better for Beam SQL, generally. Two
>>     things I have on my mind are (1) SQL as an inline transform in
>>     every SDK and (2) supporting pure SQL like the CLI and JDBC
>>     driver, where the underlying language is an implementation detail.
>>
>>     Big picture / long term, I would envision pure SQL, embedded SQL
>>     transform, and a DataFrame-like API in ~each SDK all desugaring
>>     to relational algebra nodes, sharing an optimizer, sharing some
>>     amount of mapping the physical plan to Beam transforms. The
>>     necessarily SDK-specific parts are the embedded transform API and
>>     UDFs in the host language. The rest should remain an
>>     implementation detail that we can change.
>>
>>      - For example, it is easy to imagine a customized columnar
>>     element/bundle encoding and SDK harness that only works for SQL
>>     to remove overhead of being general purpose. It could be written
>>     in C/C++/Go if we wanted to squeeze it for perf. Such things are
>>     made harder by having an elaborate end-user API between SQL and
>>     the core Beam model.
>>      - Conversely, for whatever is chosen to underlie SQL's
>>     execution, stability is paramount. Ideally the simplest and least
>>     likely to change transforms would be the foundation. And I
>>     wouldn't want to have to design a user-friendly API for Euphoria
>>     or the join library just to enable a different join algorithm in SQL.
>>
>>     So my take is keep SQL flexible, implement SQL on low-level and
>>     stable APIs, use join library, Euphoria, etc, if it looks like a
>>     big win, but don't build any policy here or do big refactors
>>     right now.
>>
>>     Kenn
>>
>>     On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je.ik@seznam.cz> wrote:
>>
>>         Hi Robert,
>>
>>         currently there is no actual proposal, I was just trying to
>>         gather feedback from the community. But my original thoughts
>>         would be [1]. I actually don't see much need for restructuring
>>         the code by nesting directories. If the community sees that it
>>         would make sense to structure the dependencies, the second step
>>         would probably be to figure out how to accomplish this. I don't
>>         have any exact solution in mind so far; it would probably be
>>         necessary to first identify features that are needed by SQL and
>>         not currently supported by Euphoria. Then we can actually
>>         identify costs and see if this still makes sense.
>>
>>           Jan
>>
>>         On 12/3/18 6:17 PM, Robert Bradshaw wrote:
>>         > Taking a step back, what exactly is the proposal? Looking at
>>         > the original message, I see
>>         >
>>         > (1) Letting SQL take a dependency on Euphoria, sharing more
>>         > code and taking advantage of the logical nesting of levels of
>>         > abstraction. This makes sense to me.
>>         > (2) Nesting the directories (but not the gradle targets or
>>         > module names?). Here I'm not so sure about the benefit,
>>         > especially vs. the cost.
>>         > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je.ik@seznam.cz> wrote:
>>         >> I think that the fact that SQL uses some other internal
>>         >> dependency should remain a hidden implementation detail. I
>>         >> absolutely agree that the dependency should of course remain
>>         >> sdks-java-sql in all cases.
>>         >>
>>         >>     Jan
>>         >>
>>         >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
>>         >>> I suppose what I'm trying to say is that I see this module
>>         >>> structure as a tool for discoverability and enumerating
>>         >>> end-user endpoints. In other words, if one wants to use SQL,
>>         >>> it would seem odd to have to depend on sdks-java-euphoria-sql
>>         >>> rather than just sdks-java-sql if sdks-java-euphoria is also
>>         >>> a DSL one might use. A sibling relationship does not prohibit
>>         >>> the layered approach to implementation that sounds like it
>>         >>> makes sense.
>>         >>>
>>         >>> (As for merging Euphoria into core, my initial impression is
>>         >>> that's probably a good idea, and something we should consider
>>         >>> for 3.0 at the very least.)
>>         >>>
>>         >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je.ik@seznam.cz> wrote:
>>         >>>> Hi Rui,
>>         >>>>
>>         >>>> yes, there are optimizations that could be added by each
>>         >>>> layer. The purpose of the Euphoria layer is actually not to
>>         >>>> reorder or modify any user operators that are present in
>>         >>>> the pipeline (because it might not have enough information
>>         >>>> to do this), but it can, for instance, choose between
>>         >>>> various join implementations (shuffle join, broadcast join,
>>         >>>> ...) - so the optimizations it can do are more low level.
>>         >>>> But this plays nicely with the DSL hierarchy - each layer
>>         >>>> adds a few more restrictions, but can therefore do more
>>         >>>> optimizations. And I think that the layer between SDK and
>>         >>>> SQL wouldn't have to support SQL optimizations, it would
>>         >>>> only have to support a way for SQL to express these
>>         >>>> optimizations.
>>         >>>>
>>         >>>>     Jan
>>         >>>>
>>         >>>> ---------- Original e-mail ----------
>>         >>>> From: Rui Wang <ruwang@google.com>
>>         >>>> To: dev@beam.apache.org
>>         >>>> Date: 30. 11. 2018 22:43:04
>>         >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>         >>>>
>>         >>>> SQL's optimization is another area to consider for
>>         >>>> integration. SQL optimization includes pushing down
>>         >>>> filters/projections, merging, removing or swapping plan
>>         >>>> nodes, and comparing plan costs to choose the best plan.
>>         >>>> Adding another layer between SQL and the Java core might
>>         >>>> require that layer to support SQL optimizations if there is
>>         >>>> a need.
>>         >>>>
>>         >>>> I don't have a clear picture of what SQL needs from Euphoria
>>         >>>> for optimization (best case is nothing). As those
>>         >>>> optimizations are happening or will happen, we might start
>>         >>>> to get a sense of it.
>>         >>>>
>>         >>>> -Rui
>>         >>>>
>>         >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <robertwb@google.com> wrote:
>>         >>>>
>>         >>>> I don't really see Euphoria as a subset of SQL or the other
>>         >>>> way around, and I think it makes sense to use either without
>>         >>>> the other, so by this criterion I'd keep them as siblings
>>         >>>> rather than nesting them.
>>         >>>>
>>         >>>> That said, I think it's really good to have a bunch of
>>         >>>> shared code, e.g. a join library that could be used by both.
>>         >>>> One could even depend on the other without having to abandon
>>         >>>> the sibling relationship. Something like retractions belongs
>>         >>>> in the core SDK itself. Deeper than that, actually, it should
>>         >>>> be part of the model.
>>         >>>>
>>         >>>> - Robert
>>         >>>>
>>         >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dmvk@apache.org> wrote:
>>         >>>>> Jan, we made Kryo optional recently (it is a separate
>>         >>>>> module and is used only in tests). From a quick look it
>>         >>>>> seems that we forgot to remove the compile-time dependency
>>         >>>>> from Euphoria's build.gradle. The only "strong" dependencies
>>         >>>>> I'm aware of are the core SDK and Guava. We'll probably be
>>         >>>>> adding a sketching extension dependency soon.
>>         >>>>>
>>         >>>>> D.
>>         >>>>>
>>         >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je.ik@seznam.cz> wrote:
>>         >>>>>> Hi Anton,
>>         >>>>>> reactions inline.
>>         >>>>>>
>>         >>>>>> ---------- Original e-mail ----------
>>         >>>>>> From: Anton Kedin <kedin@google.com>
>>         >>>>>> To: dev@beam.apache.org
>>         >>>>>> Date: 30. 11. 2018 18:17:06
>>         >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>         >>>>>>
>>         >>>>>> I think this approach makes sense in general, Euphoria
>>         can be the implementation detail of SQL, similar to Join
>>         Library or core SDK Schemas.
>>         >>>>>>
>>         >>>>>> I wonder though whether it would be better to bring
>>         Euphoria closer to core SDK first, maybe even merge them
>>         together. If you look at Reuven's recent work around schemas
>>         it seems like there are already similarities between that and
>>         Euphoria's approach, unless I'm missing the point (e.g.
>>         Filter transforms, FullJoin vs CoGroup... see [2]). And we're
>>         already switching parts of SQL to those transforms (e.g. SQL
>>         Aggregation is now implemented by core SDK's Group[3]).
>>         >>>>>>
>>         >>>>>>
>>         >>>>>>
>>         >>>>>> Yes, these transforms seem to be very similar to those
>>         Euphoria has. Whether or not to merge Euphoria with core is
>>         essentially just a decision of the community (in my point of
>>         view).
>>         >>>>>>
>>         >>>>>>
>>         >>>>>>
>>         >>>>>> Adding explicit Schema support to Euphoria will bring
>>         it both closer to core SDK and make it natural to use for
>>         SQL. Can this be a first step towards this integration?
>>         >>>>>>
>>         >>>>>>
>>         >>>>>>
>>         >>>>>> Euphoria currently operates on pure PCollections, so
>>         when PCollection has a schema, it will be accessible by
>>         Euphoria. It makes sense to make use of the schema in
>>         Euphoria - it seems natural on inputs to Euphoria operators,
>>         but it might be tricky (not saying impossible) to actually
>>         produce schema-aware PCollections as outputs from Euphoria
>>         operators (generally speaking, in special cases that might be
>>         possible). Regarding inputs, there is actually intention to
>>         act on type of PCollection - e.g. when PCollection is already
>>         of type KV, then it is possible to make key extractor and
>>         value extractor optional in Euphoria builders, so it feels
>>         natural to enable changing the builders when a schema-aware
>>         PCollection, and make use of the provided schema. The rest of
>>         Euphoria team might correct me, if I'm wrong.
>>         >>>>>>
>>         >>>>>>
>>         >>>>>>
>>         >>>>>>
>>         >>>>>> One question I have is, does Euphoria bring
>>         dependencies that are not needed by SQL, or does more or less
>>         only rely on the core SDK?
>>         >>>>>>
>>         >>>>>>
>>         >>>>>>
>>         >>>>>> I think the only relevant dependency that Euphoria has
>>         besides core SDK is Kryo. It is the default coder when no
>>         coder is provided, but that could be made optional - e.g. the
>>         default coder would be supported only if an appropriate
>>         module would be available. That way I think that Euphoria has
>>         no special dependencies.
>>         >>>>>>
>>         >>>>>>
>>         >>>>>>
>>         >>>>>> [1]
>>         https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>>         >>>>>> [2]
>>         https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>>         >>>>>> [3]
>>         https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>>         >>>>>>
>>         >>>>>>
>>         >>>>>>
>>         >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský
>>         <je.ik@seznam.cz> wrote:
>>         >>>>>>
>>         >>>>>> Hi community,
>>         >>>>>>
>>         >>>>>> I'm part of Euphoria DSL team, and on behalf of this
>>         team, I'd like to
>>         >>>>>> discuss possible development of Java based DSLs
>>         currently present in
>>         >>>>>> Beam. In my knowledge, there are currently two DSLs
>>         based on Java SDK -
>>         >>>>>> Euphoria and SQL. These DSLs currently share only the
>>         SDK itself,
>>         >>>>>> although there might be room to share some more
>>         effort. We already know
>>         >>>>>> that both Euphoria and SQL have need for retractions,
>>         but there are
>>         >>>>>> probably many more features that these two could share.
>>         >>>>>>
>>         >>>>>> So, I'd like to open a discussion on what it would
>>         cost and what it
>>         >>>>>> would possibly bring, if instead of the current structure
>>         >>>>>>
>>         >>>>>>      Java SDK
>>         >>>>>>
>>         >>>>>>        | ---- SQL
>>         >>>>>>
>>         >>>>>>        | ---- Euphoria
>>         >>>>>>
>>         >>>>>> these DSLs would be structured as
>>         >>>>>>
>>         >>>>>>      Java SDK ---> Euphoria ---> SQL
>>         >>>>>>
>>         >>>>>> I'm absolutely sure that this would be a great
>>         investment and a huge
>>         >>>>>> change, but I'd like to gather some opinions and
>>         general feelings of the
>>         >>>>>> community about this. Some points to start the
>>         discussion from my side
>>         >>>>>> would be, that structuring DSLs like this has internal
>>         logical
>>         >>>>>> consistency, because each API layer further narrows
>>         completeness, but
>>         >>>>>> brings simpler API for simpler tasks, while adding
>>         additional high-level
>>         >>>>>> view of the data processing pipeline and thus enabling
>>         more
>>         >>>>>> optimizations. On Euphoria side, these are various
>>         implementations joins
>>         >>>>>> (most effective implementation depends on data),
>>         pipeline sampling and
>>         >>>>>> more. Some (or maybe most) of these optimizations
>>         would have to be
>>         >>>>>> implemented in both DSLs, so implementing them once is
>>         beneficial.
>>         >>>>>> Another benefit is that this would bring Euphoria
>>         "closer" to Beam core
>>         >>>>>> development (which would be good, it is part of the
>>         project anyway,
>>         >>>>>> right? :)) and help better drive features, that
>>         although currently
>>         >>>>>> needed mostly by SQL, might be needed by other Java
>>         users anyway.
>>         >>>>>>
>>         >>>>>> Thanks for discussion and looking forward to any opinions.
>>         >>>>>>
>>         >>>>>>      Jan
>>         >>>>>>
>>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Rui Wang <ru...@google.com>.

For pure SQL users, there shouldn't be any SDK concepts. The SQL shell and
JDBC driver should be the way to interact with Beam through SQL.


For the embedded SQL use case in all SDKs (Python, Go, etc.), even assuming
relational algebra operators are defined in the SDKs, each SDK would still
have to implement its own way to parse SQL into operators (SQL is just a
string). To avoid that overhead, I would imagine that SDKs should keep SQL
queries as they are and defer them to a later, shared processing step (I
don't know whether Portability should handle SQL, or whether it could).
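
A minimal sketch of the "keep the query, process it later" idea, in plain
Java with no Beam dependency. Every name here (DeferredSqlSketch, SqlNode,
expandAll, the toy planner) is hypothetical and only illustrates the shape:
the SDK stores the SQL text verbatim, and one shared component turns it into
a plan afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: an SDK records SQL text verbatim and defers parsing. */
final class DeferredSqlSketch {

  /** What an SDK would keep in its pipeline graph: the raw query, never parsed locally. */
  static final class SqlNode {
    final String query;

    SqlNode(String query) {
      this.query = query;
    }
  }

  /**
   * Stand-in for the single shared step that turns SQL text into operators.
   * The planner is passed in, so no SDK needs its own SQL parser.
   */
  static List<String> expandAll(List<SqlNode> nodes, Function<String, String> sharedPlanner) {
    List<String> plans = new ArrayList<>();
    for (SqlNode node : nodes) {
      plans.add(sharedPlanner.apply(node.query));
    }
    return plans;
  }

  public static void main(String[] args) {
    List<SqlNode> graph = new ArrayList<>();
    graph.add(new SqlNode("SELECT user, COUNT(*) FROM clicks GROUP BY user"));

    // Toy planner (assumption): a real one would parse and optimize the query.
    Function<String, String> planner = q -> "PLAN[" + q + "]";
    System.out.println(expandAll(graph, planner));
  }
}
```

The point of the design is that the only SDK-specific piece is the node
holding the string; where the shared planner would actually live is left
open, as in the paragraph above.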


-Rui

On Tue, Dec 4, 2018 at 2:04 AM Jan Lukavský <je...@seznam.cz> wrote:

> Hi Kenn,
>
> my intent really was not to propose any changes right now. I'm trying to
> create a clear understanding about what the relation between Euphoria and
> SQL should be in long run. In my point of view, Euphoria should be always
> superset of SQL, because it should support complete relational algebra (and
> I'm not saying it does so right now, it should just be our goal) plus more
> flexible UDFs (not limited to SQL standard) and stateful processing (which
> will probably not be part of SQL any time soon). There should be some sort
> of guaranties that the semantics of SQL and Euphoria are the same, because
> that is what users would expect it to be. This can be for sure ensured by
> introducing another layer between Euphoria and core SDK (e.g. the join
> library), but the question is - what makes this solution different from
> creating this shared library from Euphoria itself (when looking at the big
> picture)? And it is not only about implementations of joins or any other
> operators, but there are other techniques that could be beneficial for SQL
> - e.g. pipeline sampling, automatic pipeline optimizations based on
> statistics from previous runs of batch queries, etc.
>
> The other way - that relational algebra nodes will become essential part
> of (some) SDK, that is equivalent to actually creating SQL SDK, am I right?
> I understand, that this approach can bring performance benefits, but
> besides that - is the language which implements SQL really important for
> users? Do we need SQL implementing Go UDFs, Java UDFs, Python UDFs? How
> would the resulting SQL query look like? If it is about allowing using SQL
> from all other SDKs (I want to do some basic preprocessing using SQL and
> then optimize some hard part in my favorite SDK) - can this be solved by
> enabling SQL in all SDKs by mixing various SDKs harnesses in single
> pipeline instead (e.g. I want to use SQL in Go SDK, I just tell the
> portable layer to run these operators using Java SDK and these using Go)?
> That seems plausible, solving interoperability issues, while leaving the
> whole implementation of SQL as an internal detail. Generally this solves
> more issues, like ability to reuse IOs in all SDKs (I'm aware that there
> are caveats, but that is beyond scope of intended discussion topic of this
> thread).
>
>  Jan
> On 12/3/18 7:27 PM, Kenneth Knowles wrote:
>
> To be honest, I don't think there's much worth doing right now. I think
> more self-contained is better for Beam SQL, generally. Two things I have on
> my mind are (1) SQL as an inline transform in every SDK and (2) supporting
> pure SQL like the CLI and JDBC driver, where the underlying language is an
> implementation detail.
>
> Big picture / long term, I would envision pure SQL, embedded SQL
> transform, and a DataFrame-like API in ~each SDK all desugaring to
> relational algebra nodes, sharing an optimizer, sharing some amount of
> mapping the physical plan to Beam transforms. The necessarily SDK-specific
> parts are the embedded transform API and UDFs in the host language. The
> rest should remain an implementation detail that we can change.
>
>  - For example, it is easy to imagine a customized columnar element/bundle
> encoding and SDK harness that only works for SQL to remove overhead of
> being general purpose. It could be written in C/C++/Go if we wanted to
> squeeze it for perf. Such things are made harder by having an elaborate
> end-user API between SQL and the core Beam model.
>  - Conversely, for whatever is chosen to underlie SQL's execution,
> stability is paramount. Ideally the simplest and least likely to change
> transforms would be the foundation. And I wouldn't want to have to design a
> user-friendly API for Euphoria or the join library just to enable a
> different join algorithm in SQL.
>
> So my take is keep SQL flexible, implement SQL on low-level and stable
> APIs, use join library, Euphoria, etc, if it looks like a big win, but
> don't build any policy here or do big refactors right now.
>
> Kenn
>
> On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi Robert,
>>
>> currently there is no actual proposal, I was just trying to gather
>> feedback from the community. But my original thoughts would be [1]. I
>> actually don't see much need for restructuring the code by nesting
>> directories. If the community sees that it would make sense to structure
>> the dependencies, the second step would probably be to figure out how to
>> accomplish this. I don't have any exact solution in mind so far, it
>> would be probably needed to first identify features that are needed by
>> SQL and not supported by Euphoria currently. Then we can actually
>> identify costs and see it this still makes sense.
>>
>>   Jan
>>
>> On 12/3/18 6:17 PM, Robert Bradshaw wrote:
>> > Taking a step back, what exactly is the proposal. Looking at the
>> > original message, I see
>> >
>> > (1) Letting SQL take a dependency on Euphoria, sharing more code and
>> > taking advantage of the logical nesting of levels of abstraction. This
>> > makes sense to me.
>> > (2) Nesting the directories (but not the gradle targets or module
>> > names?). Here I'm not so sure about the benefit, especially vs. the
>> > cost.
>> > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz> wrote:
>> >> I think that the fact that SQL uses some other internal dependency
>> >> should remain hidden implementation detail. I absolutely agree that the
>> >> dependency should of course remain sdks-java-sql in all cases.
>> >>
>> >>     Jan
>> >>
>> >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
>> >>> I suppose what I'm trying to say is that I see this module structure
>> >>> as a tool for discoverability and enumerating end-user endpoints. In
>> >>> other words, if one wants to use SQL, it would seem odd to have to
>> >>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
>> >>> sdks-java-euphoria is also a DSL one might use. A sibling relationship
>> >>> does not prohibit the layered approach to implementation that sounds
>> >>> like it makes sense.
>> >>>
>> >>> (As for merging Euphoria into core, my initial impression is that's
>> >>> probably a good idea, and something we should consider for 3.0 at the
>> >>> very least.)
>> >>>
>> >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz>
>> wrote:
>> >>>> Hi Rui,
>> >>>>
>> >>>> yes, there are optimizations that could be added by each layer. The
>> purpose of Euphoria layer actually is not to reorder or modify any user
>> operators that are present in the pipeline (because it might not have
>> enough information to do this), but it can for instance choose between
>> various join implementations (shuffle join, broadcast join, ...) - so the
>> optimizations it can do are more low level. But this plays nicely with the
>> DSL hierarchy - each layer adds a little more restrictions, but can
>> therefore do more optimizations. And I think that the layer between SDK and
>> SQL wouldn't have to support SQL optimizations, it would only have to
>> support way for SQL to express these optimizations.
>> >>>>
>> >>>>     Jan
>> >>>>
>> >>>> ---------- Original e-mail ----------
>> >>>> From: Rui Wang <ru...@google.com>
>> >>>> To: dev@beam.apache.org
>> >>>> Date: 30. 11. 2018 22:43:04
>> >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>> >>>>
>> >>>> SQL's optimization is another area to consider for integration. SQL
>> optimization includes pushing down filters/projections, merging or removing
>> or swapping plan nodes and comparing plan costs to choose best plan.  Add
>> another layer between SQL and java core might need the layer to support SQL
>> optimizations if there is a need.
>> >>>>
>> >>>> I don't have a clear image on what SQL needs from Euphoria for
>> optimization(best case is nothing). As those optimizations are happening or
>> will happen, we might start to have a sense of it.
>> >>>>
>> >>>> -Rui
>> >>>>
>> >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >>>>
>> >>>> I don't really see Euphoria as a subset of SQL or the other way
>> >>>> around, and I think it makes sense to use either without the other,
>> so
>> >>>> by this criteria keeping them as siblings than a nesting.
>> >>>>
>> >>>> That said, I think it's really good to have a bunch of shared code,
>> >>>> e.g. a join library that could be used by both. One could even depend
>> >>>> on the other without having to abandon the sibling relationship.
>> >>>> Something like retractions belong in the core SDK itself. Deeper than
>> >>>> that, actually, it should be part of the model.
>> >>>>
>> >>>> - Robert
>> >>>>
>> >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dm...@apache.org>
>> wrote:
>> >>>>> Jan, we made Kryo optional recently (it is a separate module and is
>> used only in tests). From a quick look it seems that we forgot to remove
>> compile time dependency from euphoria's build.gradle. Only "strong"
>> dependencies I'm aware of are core SDK and guava. We'll be probably adding
>> sketching extension dependency soon.
>> >>>>>
>> >>>>> D.
>> >>>>>
>> >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz>
>> wrote:
>> >>>>>> Hi Anton,
>> >>>>>> reactions inline.
>> >>>>>>
>> >>>>>> ---------- Original e-mail ----------
>> >>>>>> From: Anton Kedin <ke...@google.com>
>> >>>>>> To: dev@beam.apache.org
>> >>>>>> Date: 30. 11. 2018 18:17:06
>> >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>> >>>>>>
>> >>>>>> I think this approach makes sense in general, Euphoria can be the
>> implementation detail of SQL, similar to Join Library or core SDK Schemas.
>> >>>>>>
>> >>>>>> I wonder though whether it would be better to bring Euphoria
>> closer to core SDK first, maybe even merge them together. If you look at
>> Reuven's recent work around schemas it seems like there are already
>> similarities between that and Euphoria's approach, unless I'm missing the
>> point (e.g. Filter transforms, FullJoin vs CoGroup... see [2]). And we're
>> already switching parts of SQL to those transforms (e.g. SQL Aggregation is
>> now implemented by core SDK's Group[3]).
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Yes, these transforms seem to be very similar to those Euphoria
>> has. Whether or not to merge Euphoria with core is essentially just a
>> decision of the community (in my point of view).
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Adding explicit Schema support to Euphoria will bring it both
>> closer to core SDK and make it natural to use for SQL. Can this be a first
>> step towards this integration?
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Euphoria currently operates on pure PCollections, so when
>> PCollection has a schema, it will be accessible by Euphoria. It makes sense
>> to make use of the schema in Euphoria - it seems natural on inputs to
>> Euphoria operators, but it might be tricky (not saying impossible) to
>> actually produce schema-aware PCollections as outputs from Euphoria
>> operators (generally speaking, in special cases that might be possible).
>> Regarding inputs, there is actually intention to act on type of PCollection
>> - e.g. when PCollection is already of type KV, then it is possible to make
>> key extractor and value extractor optional in Euphoria builders, so it
>> feels natural to enable changing the builders when a schema-aware
>> PCollection, and make use of the provided schema. The rest of Euphoria team
>> might correct me, if I'm wrong.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> One question I have is, does Euphoria bring dependencies that are
>> not needed by SQL, or does more or less only rely on the core SDK?
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> I think the only relevant dependency that Euphoria has besides
>> core SDK is Kryo. It is the default coder when no coder is provided, but
>> that could be made optional - e.g. the default coder would be supported
>> only if an appropriate module would be available. That way I think that
>> Euphoria has no special dependencies.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> [1]
>> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>> >>>>>> [2]
>> https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>> >>>>>> [3]
>> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz>
>> wrote:
>> >>>>>>
>> >>>>>> Hi community,
>> >>>>>>
>> >>>>>> I'm part of Euphoria DSL team, and on behalf of this team, I'd
>> like to
>> >>>>>> discuss possible development of Java based DSLs currently present
>> in
>> >>>>>> Beam. In my knowledge, there are currently two DSLs based on Java
>> SDK -
>> >>>>>> Euphoria and SQL. These DSLs currently share only the SDK itself,
>> >>>>>> although there might be room to share some more effort. We already
>> know
>> >>>>>> that both Euphoria and SQL have need for retractions, but there are
>> >>>>>> probably many more features that these two could share.
>> >>>>>>
>> >>>>>> So, I'd like to open a discussion on what it would cost and what it
>> >>>>>> would possibly bring, if instead of the current structure
>> >>>>>>
>> >>>>>>      Java SDK
>> >>>>>>
>> >>>>>>        | ---- SQL
>> >>>>>>
>> >>>>>>        | ---- Euphoria
>> >>>>>>
>> >>>>>> these DSLs would be structured as
>> >>>>>>
>> >>>>>>      Java SDK ---> Euphoria ---> SQL
>> >>>>>>
>> >>>>>> I'm absolutely sure that this would be a great investment and a
>> huge
>> >>>>>> change, but I'd like to gather some opinions and general feelings
>> of the
>> >>>>>> community about this. Some points to start the discussion from my
>> side
>> >>>>>> would be, that structuring DSLs like this has internal logical
>> >>>>>> consistency, because each API layer further narrows completeness,
>> but
>> >>>>>> brings simpler API for simpler tasks, while adding additional
>> high-level
>> >>>>>> view of the data processing pipeline and thus enabling more
>> >>>>>> optimizations. On Euphoria side, these are various implementations
>> joins
>> >>>>>> (most effective implementation depends on data), pipeline sampling
>> and
>> >>>>>> more. Some (or maybe most) of these optimizations would have to be
>> >>>>>> implemented in both DSLs, so implementing them once is beneficial.
>> >>>>>> Another benefit is that this would bring Euphoria "closer" to Beam
>> core
>> >>>>>> development (which would be good, it is part of the project anyway,
>> >>>>>> right? :)) and help better drive features, that although currently
>> >>>>>> needed mostly by SQL, might be needed by other Java users anyway.
>> >>>>>>
>> >>>>>> Thanks for discussion and looking forward to any opinions.
>> >>>>>>
>> >>>>>>      Jan
>> >>>>>>
>>
>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Jan Lukavský <je...@seznam.cz>.

Hi Kenn,

my intent really was not to propose any changes right now. I'm trying to
create a clear understanding of what the relationship between Euphoria
and SQL should be in the long run. From my point of view, Euphoria should
always be a superset of SQL, because it should support complete relational
algebra (I'm not saying it does so right now, it should just be our
goal), plus more flexible UDFs (not limited to the SQL standard) and
stateful processing (which will probably not be part of SQL any time
soon). There should be some sort of guarantee that the semantics of SQL
and Euphoria are the same, because that is what users would expect. This
can certainly be ensured by introducing another layer between Euphoria
and the core SDK (e.g. the join library), but the question is: what makes
this solution different from creating this shared library from Euphoria
itself (when looking at the big picture)? And it is not only about
implementations of joins or other operators; there are other techniques
that could be beneficial for SQL, e.g. pipeline sampling and automatic
pipeline optimizations based on statistics from previous runs of batch
queries.
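
To make the statistics-based optimization concrete, here is a minimal sketch
of picking a join implementation from input sizes observed in a previous run.
It is an assumption-laden illustration, not Euphoria's or Beam SQL's actual
code: JoinStrategySketch, choose, and the size threshold are all
hypothetical.

```java
/** Hypothetical sketch: choose a join implementation from sizes seen in a previous run. */
final class JoinStrategySketch {

  enum Strategy { BROADCAST, SHUFFLE }

  /**
   * If the smaller input fits under the broadcast limit, ship it to every
   * worker; otherwise fall back to a shuffle join on the key.
   * All sizes are in bytes; the limit is an assumed tuning knob.
   */
  static Strategy choose(long leftBytes, long rightBytes, long broadcastLimitBytes) {
    long smaller = Math.min(leftBytes, rightBytes);
    return smaller <= broadcastLimitBytes ? Strategy.BROADCAST : Strategy.SHUFFLE;
  }

  public static void main(String[] args) {
    long limit = 64L * 1024 * 1024; // assumed 64 MiB broadcast limit

    // Small dimension table joined with a large fact table.
    System.out.println(choose(10L * 1024 * 1024, 500L * 1024 * 1024 * 1024, limit));

    // Two large inputs.
    System.out.println(choose(80L * 1024 * 1024 * 1024, 500L * 1024 * 1024 * 1024, limit));
  }
}
```

A decision like this would sit in whichever layer SQL and Euphoria share, so
that it only has to be written once.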

The other way, where relational algebra nodes become an essential part
of (some) SDK, is equivalent to actually creating a SQL SDK, am I
right? I understand that this approach can bring performance benefits,
but besides that, is the language that implements SQL really important
for users? Do we need SQL implementing Go UDFs, Java UDFs, Python UDFs?
What would the resulting SQL query look like? If it is about allowing
the use of SQL from all other SDKs (I want to do some basic
preprocessing using SQL and then optimize some hard part in my favorite
SDK), can this be solved instead by mixing the harnesses of various SDKs
in a single pipeline (e.g. I want to use SQL in the Go SDK, so I just
tell the portable layer to run these operators using the Java SDK and
those using Go)? That seems plausible, and it solves interoperability
issues while leaving the whole implementation of SQL as an internal
detail. More generally, this solves other issues too, such as the
ability to reuse IOs in all SDKs (I'm aware that there are caveats, but
that is beyond the scope of this thread).
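
The mixed-harness idea can be pictured as tagging each operator in one
pipeline with the SDK harness that should execute it. The sketch below is
plain Java and purely illustrative: MixedHarnessSketch, Operator, Harness
and assign are hypothetical names, not part of the portability API.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: each operator in one pipeline names the harness that runs it. */
final class MixedHarnessSketch {

  enum Harness { JAVA, GO }

  static final class Operator {
    final String name;
    final Harness harness;

    Operator(String name, Harness harness) {
      this.name = name;
      this.harness = harness;
    }
  }

  /** Groups operator names by harness, preserving pipeline order within each group. */
  static Map<Harness, List<String>> assign(List<Operator> pipeline) {
    Map<Harness, List<String>> byHarness = new EnumMap<>(Harness.class);
    for (Operator op : pipeline) {
      byHarness.computeIfAbsent(op.harness, h -> new ArrayList<>()).add(op.name);
    }
    return byHarness;
  }

  public static void main(String[] args) {
    List<Operator> pipeline = new ArrayList<>();
    pipeline.add(new Operator("ReadEvents", Harness.GO));
    pipeline.add(new Operator("SqlPreprocess", Harness.JAVA)); // SQL stays on the Java SDK
    pipeline.add(new Operator("CustomScoring", Harness.GO));
    System.out.println(assign(pipeline));
  }
}
```

In this picture the Go user never sees how SQL is implemented; they only see
that one step of their pipeline is executed by a different harness.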

  Jan

On 12/3/18 7:27 PM, Kenneth Knowles wrote:
> To be honest, I don't think there's much worth doing right now. I 
> think more self-contained is better for Beam SQL, generally. Two 
> things I have on my mind are (1) SQL as an inline transform in every 
> SDK and (2) supporting pure SQL like the CLI and JDBC driver, where 
> the underlying language is an implementation detail.
>
> Big picture / long term, I would envision pure SQL, embedded SQL 
> transform, and a DataFrame-like API in ~each SDK all desugaring to 
> relational algebra nodes, sharing an optimizer, sharing some amount of 
> mapping the physical plan to Beam transforms. The necessarily 
> SDK-specific parts are the embedded transform API and UDFs in the host 
> language. The rest should remain an implementation detail that we can 
> change.
>
>  - For example, it is easy to imagine a customized columnar 
> element/bundle encoding and SDK harness that only works for SQL to 
> remove overhead of being general purpose. It could be written in 
> C/C++/Go if we wanted to squeeze it for perf. Such things are made 
> harder by having an elaborate end-user API between SQL and the core 
> Beam model.
>  - Conversely, for whatever is chosen to underlie SQL's execution, 
> stability is paramount. Ideally the simplest and least likely to 
> change transforms would be the foundation. And I wouldn't want to have 
> to design a user-friendly API for Euphoria or the join library just to 
> enable a different join algorithm in SQL.
>
> So my take is keep SQL flexible, implement SQL on low-level and stable 
> APIs, use join library, Euphoria, etc, if it looks like a big win, but 
> don't build any policy here or do big refactors right now.
>
> Kenn
>
> On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je.ik@seznam.cz> wrote:
>
>     Hi Robert,
>
>     currently there is no actual proposal, I was just trying to gather
>     feedback from the community. But my original thoughts would be [1]. I
>     actually don't see much need for restructuring the code by nesting
>     directories. If the community sees that it would make sense to
>     structure
>     the dependencies, the second step would probably be to figure out
>     how to
>     accomplish this. I don't have any exact solution in mind so far, it
>     would be probably needed to first identify features that are
>     needed by
>     SQL and not supported by Euphoria currently. Then we can actually
>     identify costs and see it this still makes sense.
>
>       Jan
>
>     On 12/3/18 6:17 PM, Robert Bradshaw wrote:
>     > Taking a step back, what exactly is the proposal. Looking at the
>     > original message, I see
>     >
>     > (1) Letting SQL take a dependency on Euphoria, sharing more code and
>     > taking advantage of the logical nesting of levels of
>     abstraction. This
>     > makes sense to me.
>     > (2) Nesting the directories (but not the gradle targets or module
>     > names?). Here I'm not so sure about the benefit, especially vs. the
>     > cost.
>     > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je.ik@seznam.cz> wrote:
>     >> I think that the fact that SQL uses some other internal dependency
>     >> should remain hidden implementation detail. I absolutely agree
>     that the
>     >> dependency should of course remain sdks-java-sql in all cases.
>     >>
>     >>     Jan
>     >>
>     >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
>     >>> I suppose what I'm trying to say is that I see this module
>     structure
>     >>> as a tool for discoverability and enumerating end-user
>     endpoints. In
>     >>> other words, if one wants to use SQL, it would seem odd to have to
>     >>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
>     >>> sdks-java-euphoria is also a DSL one might use. A sibling
>     relationship
>     >>> does not prohibit the layered approach to implementation that
>     sounds
>     >>> like it makes sense.
>     >>>
>     >>> (As for merging Euphoria into core, my initial impression is
>     that's
>     >>> probably a good idea, and something we should consider for 3.0
>     at the
>     >>> very least.)
>     >>>
>     >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je.ik@seznam.cz> wrote:
>     >>>> Hi Rui,
>     >>>>
>     >>>> yes, there are optimizations that could be added by each
>     layer. The purpose of Euphoria layer actually is not to reorder or
>     modify any user operators that are present in the pipeline
>     (because it might not have enough information to do this), but it
>     can for instance choose between various join implementations
>     (shuffle join, broadcast join, ...) - so the optimizations it can
>     do are more low level. But this plays nicely with the DSL
>     hierarchy - each layer adds a little more restrictions, but can
>     therefore do more optimizations. And I think that the layer
>     between SDK and SQL wouldn't have to support SQL optimizations, it
>     would only have to support way for SQL to express these optimizations.
>     >>>>
>     >>>>     Jan
>     >>>>
>     >>>> ---------- Original e-mail ----------
>     >>>> From: Rui Wang <ruwang@google.com>
>     >>>> To: dev@beam.apache.org
>     >>>> Date: 30. 11. 2018 22:43:04
>     >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>     >>>>
>     >>>> SQL's optimization is another area to consider for
>     integration. SQL optimization includes pushing down
>     filters/projections, merging or removing or swapping plan nodes
>     and comparing plan costs to choose best plan.  Add another layer
>     between SQL and java core might need the layer to support SQL
>     optimizations if there is a need.
>     >>>>
>     >>>> I don't have a clear image on what SQL needs from Euphoria
>     for optimization(best case is nothing). As those optimizations are
>     happening or will happen, we might start to have a sense of it.
>     >>>>
>     >>>> -Rui
>     >>>>
>     >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw
>     <robertwb@google.com> wrote:
>     >>>>
>     >>>> I don't really see Euphoria as a subset of SQL or the other way
>     >>>> around, and I think it makes sense to use either without the
>     other, so
>     >>>> by this criteria keeping them as siblings than a nesting.
>     >>>>
>     >>>> That said, I think it's really good to have a bunch of shared
>     code,
>     >>>> e.g. a join library that could be used by both. One could
>     even depend
>     >>>> on the other without having to abandon the sibling relationship.
>     >>>> Something like retractions belong in the core SDK itself.
>     Deeper than
>     >>>> that, actually, it should be part of the model.
>     >>>>
>     >>>> - Robert
>     >>>>
>     >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek
>     <dmvk@apache.org> wrote:
>     >>>>> Jan, we made Kryo optional recently (it is a separate module
>     and is used only in tests). From a quick look it seems that we
>     forgot to remove compile time dependency from euphoria's
>     build.gradle. Only "strong" dependencies I'm aware of are core SDK
>     and guava. We'll be probably adding sketching extension dependency
>     soon.
>     >>>>>
>     >>>>> D.
>     >>>>>
>     >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský
>     <je.ik@seznam.cz> wrote:
>     >>>>>> Hi Anton,
>     >>>>>> reactions inline.
>     >>>>>>
>     >>>>>> ---------- Original e-mail ----------
>     >>>>>> From: Anton Kedin <kedin@google.com>
>     >>>>>> To: dev@beam.apache.org
>     >>>>>> Date: 30. 11. 2018 18:17:06
>     >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>     >>>>>>
>     >>>>>> I think this approach makes sense in general, Euphoria can
>     be the implementation detail of SQL, similar to Join Library or
>     core SDK Schemas.
>     >>>>>>
>     >>>>>> I wonder though whether it would be better to bring
>     Euphoria closer to core SDK first, maybe even merge them together.
>     If you look at Reuven's recent work around schemas it seems like
>     there are already similarities between that and Euphoria's
>     approach, unless I'm missing the point (e.g. Filter transforms,
>     FullJoin vs CoGroup... see [2]). And we're already switching parts
>     of SQL to those transforms (e.g. SQL Aggregation is now
>     implemented by core SDK's Group[3]).
>     >>>>>>
>     >>>>>>
>     >>>>>>
>     >>>>>> Yes, these transforms seem to be very similar to those
>     Euphoria has. Whether or not to merge Euphoria with core is
>     essentially just a decision of the community (in my point of view).
>     >>>>>>
>     >>>>>>
>     >>>>>>
>     >>>>>> Adding explicit Schema support to Euphoria will bring it
>     both closer to core SDK and make it natural to use for SQL. Can
>     this be a first step towards this integration?
>     >>>>>>
>     >>>>>>
>     >>>>>>
>     >>>>>> Euphoria currently operates on pure PCollections, so when
>     PCollection has a schema, it will be accessible by Euphoria. It
>     makes sense to make use of the schema in Euphoria - it seems
>     natural on inputs to Euphoria operators, but it might be tricky
>     (not saying impossible) to actually produce schema-aware
>     PCollections as outputs from Euphoria operators (generally
>     speaking, in special cases that might be possible). Regarding
>     inputs, there is actually intention to act on type of PCollection
>     - e.g. when PCollection is already of type KV, then it is possible
>     to make key extractor and value extractor optional in Euphoria
>     builders, so it feels natural to enable changing the builders when
>     a schema-aware PCollection, and make use of the provided schema.
>     The rest of Euphoria team might correct me, if I'm wrong.
>     >>>>>>
>     >>>>>>
>     >>>>>>
>     >>>>>>
>     >>>>>> One question I have is, does Euphoria bring dependencies
>     that are not needed by SQL, or does more or less only rely on the
>     core SDK?
>     >>>>>>
>     >>>>>>
>     >>>>>>
>     >>>>>> I think the only relevant dependency that Euphoria has
>     besides core SDK is Kryo. It is the default coder when no coder is
>     provided, but that could be made optional - e.g. the default coder
>     would be supported only if an appropriate module would be
>     available. That way I think that Euphoria has no special dependencies.
>     >>>>>>
>     >>>>>>
>     >>>>>>
>     >>>>>> [1]
>     https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>     >>>>>> [2]
>     https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>     >>>>>> [3]
>     https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>     >>>>>>
>     >>>>>>
>     >>>>>>
>     >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský
>     <je.ik@seznam.cz> wrote:
>     >>>>>>
>     >>>>>> Hi community,
>     >>>>>>
>     >>>>>> I'm part of Euphoria DSL team, and on behalf of this team,
>     I'd like to
>     >>>>>> discuss possible development of Java based DSLs currently
>     present in
>     >>>>>> Beam. In my knowledge, there are currently two DSLs based
>     on Java SDK -
>     >>>>>> Euphoria and SQL. These DSLs currently share only the SDK
>     itself,
>     >>>>>> although there might be room to share some more effort. We
>     already know
>     >>>>>> that both Euphoria and SQL have need for retractions, but
>     there are
>     >>>>>> probably many more features that these two could share.
>     >>>>>>
>     >>>>>> So, I'd like to open a discussion on what it would cost and
>     what it
>     >>>>>> would possibly bring, if instead of the current structure
>     >>>>>>
>     >>>>>>      Java SDK
>     >>>>>>
>     >>>>>>        | ---- SQL
>     >>>>>>
>     >>>>>>        | ---- Euphoria
>     >>>>>>
>     >>>>>> these DSLs would be structured as
>     >>>>>>
>     >>>>>>      Java SDK ---> Euphoria ---> SQL
>     >>>>>>
>     >>>>>> I'm absolutely sure that this would be a great investment
>     and a huge
>     >>>>>> change, but I'd like to gather some opinions and general
>     feelings of the
>     >>>>>> community about this. Some points to start the discussion
>     from my side
>     >>>>>> would be, that structuring DSLs like this has internal logical
>     >>>>>> consistency, because each API layer further narrows
>     completeness, but
>     >>>>>> brings simpler API for simpler tasks, while adding
>     additional high-level
>     >>>>>> view of the data processing pipeline and thus enabling more
>     >>>>>> optimizations. On Euphoria side, these are various
>     implementations joins
>     >>>>>> (most effective implementation depends on data), pipeline
>     sampling and
>     >>>>>> more. Some (or maybe most) of these optimizations would
>     have to be
>     >>>>>> implemented in both DSLs, so implementing them once is
>     beneficial.
>     >>>>>> Another benefit is that this would bring Euphoria "closer"
>     to Beam core
>     >>>>>> development (which would be good, it is part of the project
>     anyway,
>     >>>>>> right? :)) and help better drive features, that although
>     currently
>     >>>>>> needed mostly by SQL, might be needed by other Java users
>     anyway.
>     >>>>>>
>     >>>>>> Thanks for discussion and looking forward to any opinions.
>     >>>>>>
>     >>>>>>      Jan
>     >>>>>>
>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Kenneth Knowles <ke...@apache.org>.

To be honest, I don't think there's much worth doing right now. I think
more self-contained is better for Beam SQL, generally. Two things I have on
my mind are (1) SQL as an inline transform in every SDK and (2) supporting
pure SQL like the CLI and JDBC driver, where the underlying language is an
implementation detail.

Big picture / long term, I would envision pure SQL, embedded SQL transform,
and a DataFrame-like API in ~each SDK all desugaring to relational algebra
nodes, sharing an optimizer, sharing some amount of mapping the physical
plan to Beam transforms. The necessarily SDK-specific parts are the
embedded transform API and UDFs in the host language. The rest should
remain an implementation detail that we can change.
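
To make the "desugaring to relational algebra nodes, sharing an optimizer" idea concrete, here is a minimal sketch of what such shared nodes and a single rewrite rule could look like. Everything in it is an assumption made for illustration: the class names (RelSketch, Scan, Filter, Project) and the rule are hypothetical and are not taken from Beam SQL, Calcite, or Euphoria.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical, minimal relational-algebra tree; not Beam, Calcite, or Euphoria API. */
final class RelSketch {
  /** A plan node that can print itself as a nested expression. */
  interface Node {
    String show();
  }

  /** Leaf node: scans a named table. */
  static final class Scan implements Node {
    final String table;

    Scan(String table) {
      this.table = table;
    }

    public String show() {
      return "Scan(" + table + ")";
    }
  }

  /** Keeps rows matching a predicate that reads a single column. */
  static final class Filter implements Node {
    final Node input;
    final String column;
    final String predicate;

    Filter(Node input, String column, String predicate) {
      this.input = input;
      this.column = column;
      this.predicate = predicate;
    }

    public String show() {
      return "Filter(" + predicate + ", " + input.show() + ")";
    }
  }

  /** Keeps only the listed columns, unchanged. */
  static final class Project implements Node {
    final Node input;
    final List<String> columns;

    Project(Node input, List<String> columns) {
      this.input = input;
      this.columns = columns;
    }

    public String show() {
      return "Project(" + columns + ", " + input.show() + ")";
    }
  }

  /** One rewrite rule: move a Filter below a Project if the projection keeps the filtered column. */
  static Node pushFilterBelowProject(Node node) {
    if (node instanceof Filter && ((Filter) node).input instanceof Project) {
      Filter filter = (Filter) node;
      Project project = (Project) filter.input;
      if (project.columns.contains(filter.column)) {
        return new Project(
            new Filter(project.input, filter.column, filter.predicate), project.columns);
      }
    }
    return node; // rule does not apply; leave the plan untouched
  }

  public static void main(String[] args) {
    Node plan =
        new Filter(
            new Project(new Scan("orders"), Arrays.asList("user", "amount")),
            "amount",
            "amount > 10");
    System.out.println(pushFilterBelowProject(plan).show());
  }
}
```

The point of the sketch is only that any front end (pure SQL, an embedded SQL transform, a DataFrame-like API) could build the same kind of tree and reuse the same rules, which is why the layer underneath can stay an implementation detail.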

 - For example, it is easy to imagine a customized columnar element/bundle
encoding and SDK harness that only works for SQL to remove overhead of
being general purpose. It could be written in C/C++/Go if we wanted to
squeeze it for perf. Such things are made harder by having an elaborate
end-user API between SQL and the core Beam model.
 - Conversely, for whatever is chosen to underlie SQL's execution,
stability is paramount. Ideally the simplest and least likely to change
transforms would be the foundation. And I wouldn't want to have to design a
user-friendly API for Euphoria or the join library just to enable a
different join algorithm in SQL.
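
As an illustration of "a different join algorithm" being an internal decision rather than a user-facing API, here is a minimal sketch of a strategy chooser. It assumes the planner has size estimates for both inputs; the class name, the enum, and the size limit are hypothetical and do not correspond to anything in Beam SQL, the join library, or Euphoria.

```java
/** Hypothetical join-strategy chooser; names and the limit are illustrative only. */
final class JoinStrategySketch {
  enum Strategy {
    BROADCAST, // ship the smaller input to every worker (e.g. as a side input)
    SHUFFLE // co-partition both inputs by key
  }

  /** Picks BROADCAST when the smaller input fits under the given limit, otherwise SHUFFLE. */
  static Strategy choose(long leftBytes, long rightBytes, long broadcastLimitBytes) {
    long smaller = Math.min(leftBytes, rightBytes);
    return smaller <= broadcastLimitBytes ? Strategy.BROADCAST : Strategy.SHUFFLE;
  }

  public static void main(String[] args) {
    long limit = 64L * 1024 * 1024; // assumed 64 MiB broadcast limit
    System.out.println(choose(10_000_000_000L, 1_000_000L, limit));
    System.out.println(choose(10_000_000_000L, 5_000_000_000L, limit));
  }
}
```

A function like this can live entirely inside the SQL planner, so swapping the heuristic later does not change any public API.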

So my take is: keep SQL flexible, implement SQL on low-level and stable
APIs, use the join library, Euphoria, etc., if it looks like a big win, but
don't build any policy here or do big refactors right now.

Kenn

On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je...@seznam.cz> wrote:

> Hi Robert,
>
> currently there is no actual proposal, I was just trying to gather
> feedback from the community. But my original thoughts would be [1]. I
> actually don't see much need for restructuring the code by nesting
> directories. If the community sees that it would make sense to structure
> the dependencies, the second step would probably be to figure out how to
> accomplish this. I don't have an exact solution in mind so far; it
> would probably be necessary to first identify features that are needed by
> SQL and not currently supported by Euphoria. Then we can actually
> identify the costs and see if this still makes sense.
>
>   Jan
>
> On 12/3/18 6:17 PM, Robert Bradshaw wrote:
> > Taking a step back, what exactly is the proposal? Looking at the
> > original message, I see
> >
> > (1) Letting SQL take a dependency on Euphoria, sharing more code and
> > taking advantage of the logical nesting of levels of abstraction. This
> > makes sense to me.
> > (2) Nesting the directories (but not the gradle targets or module
> > names?). Here I'm not so sure about the benefit, especially vs. the
> > cost.
> > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz> wrote:
> >> I think that the fact that SQL uses some other internal dependency
> >> should remain hidden implementation detail. I absolutely agree that the
> >> dependency should of course remain sdks-java-sql in all cases.
> >>
> >>     Jan
> >>
> >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
> >>> I suppose what I'm trying to say is that I see this module structure
> >>> as a tool for discoverability and enumerating end-user endpoints. In
> >>> other words, if one wants to use SQL, it would seem odd to have to
> >>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
> >>> sdks-java-euphoria is also a DSL one might use. A sibling relationship
> >>> does not prohibit the layered approach to implementation that sounds
> >>> like it makes sense.
> >>>
> >>> (As for merging Euphoria into core, my initial impression is that's
> >>> probably a good idea, and something we should consider for 3.0 at the
> >>> very least.)
> >>>
> >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz> wrote:
> >>>> Hi Rui,
> >>>>
> >>>> yes, there are optimizations that could be added by each layer. The
> purpose of Euphoria layer actually is not to reorder or modify any user
> operators that are present in the pipeline (because it might not have
> enough information to do this), but it can for instance choose between
> various join implementations (shuffle join, broadcast join, ...) - so the
> optimizations it can do are more low level. But this plays nicely with the
> DSL hierarchy - each layer adds a little more restrictions, but can
> therefore do more optimizations. And I think that the layer between SDK and
> SQL wouldn't have to support SQL optimizations, it would only have to
> support way for SQL to express these optimizations.
> >>>>
> >>>>     Jan
> >>>>
> >>>> ---------- Original e-mail ----------
> >>>> From: Rui Wang <ru...@google.com>
> >>>> To: dev@beam.apache.org
> >>>> Date: 30. 11. 2018 22:43:04
> >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
> >>>>
> >>>> SQL's optimization is another area to consider for integration. SQL
> optimization includes pushing down filters/projections, merging or removing
> or swapping plan nodes and comparing plan costs to choose best plan.  Add
> another layer between SQL and java core might need the layer to support SQL
> optimizations if there is a need.
> >>>>
> >>>> I don't have a clear image on what SQL needs from Euphoria for
> optimization(best case is nothing). As those optimizations are happening or
> will happen, we might start to have a sense of it.
> >>>>
> >>>> -Rui
> >>>>
> >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >>>>
> >>>> I don't really see Euphoria as a subset of SQL or the other way
> >>>> around, and I think it makes sense to use either without the other, so
> >>>> by this criteria keeping them as siblings than a nesting.
> >>>>
> >>>> That said, I think it's really good to have a bunch of shared code,
> >>>> e.g. a join library that could be used by both. One could even depend
> >>>> on the other without having to abandon the sibling relationship.
> >>>> Something like retractions belong in the core SDK itself. Deeper than
> >>>> that, actually, it should be part of the model.
> >>>>
> >>>> - Robert
> >>>>
> >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dm...@apache.org>
> wrote:
> >>>>> Jan, we made Kryo optional recently (it is a separate module and is
> used only in tests). From a quick look it seems that we forgot to remove
> compile time dependency from euphoria's build.gradle. Only "strong"
> dependencies I'm aware of are core SDK and guava. We'll be probably adding
> sketching extension dependency soon.
> >>>>>
> >>>>> D.
> >>>>>
> >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz>
> wrote:
> >>>>>> Hi Anton,
> >>>>>> reactions inline.
> >>>>>>
> >>>>>> ---------- Original e-mail ----------
> >>>>>> From: Anton Kedin <ke...@google.com>
> >>>>>> To: dev@beam.apache.org
> >>>>>> Date: 30. 11. 2018 18:17:06
> >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
> >>>>>>
> >>>>>> I think this approach makes sense in general, Euphoria can be the
> implementation detail of SQL, similar to Join Library or core SDK Schemas.
> >>>>>>
> >>>>>> I wonder though whether it would be better to bring Euphoria closer
> to core SDK first, maybe even merge them together. If you look at Reuven's
> recent work around schemas it seems like there are already similarities
> between that and Euphoria's approach, unless I'm missing the point (e.g.
> Filter transforms, FullJoin vs CoGroup... see [2]). And we're already
> switching parts of SQL to those transforms (e.g. SQL Aggregation is now
> implemented by core SDK's Group[3]).
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Yes, these transforms seem to be very similar to those Euphoria
> has. Whether or not to merge Euphoria with core is essentially just a
> decision of the community (in my point of view).
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Adding explicit Schema support to Euphoria will bring it both
> closer to core SDK and make it natural to use for SQL. Can this be a first
> step towards this integration?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Euphoria currently operates on pure PCollections, so when
> PCollection has a schema, it will be accessible by Euphoria. It makes sense
> to make use of the schema in Euphoria - it seems natural on inputs to
> Euphoria operators, but it might be tricky (not saying impossible) to
> actually produce schema-aware PCollections as outputs from Euphoria
> operators (generally speaking, in special cases that might be possible).
> Regarding inputs, there is actually intention to act on type of PCollection
> - e.g. when PCollection is already of type KV, then it is possible to make
> key extractor and value extractor optional in Euphoria builders, so it
> feels natural to enable changing the builders when a schema-aware
> PCollection, and make use of the provided schema. The rest of Euphoria team
> might correct me, if I'm wrong.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> One question I have is, does Euphoria bring dependencies that are
> not needed by SQL, or does more or less only rely on the core SDK?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> I think the only relevant dependency that Euphoria has besides core
> SDK is Kryo. It is the default coder when no coder is provided, but that
> could be made optional - e.g. the default coder would be supported only if
> an appropriate module would be available. That way I think that Euphoria
> has no special dependencies.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> [1]
> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
> >>>>>> [2]
> https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
> >>>>>> [3]
> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz>
> wrote:
> >>>>>>
> >>>>>> Hi community,
> >>>>>>
> >>>>>> I'm part of Euphoria DSL team, and on behalf of this team, I'd like
> to
> >>>>>> discuss possible development of Java based DSLs currently present in
> >>>>>> Beam. In my knowledge, there are currently two DSLs based on Java
> SDK -
> >>>>>> Euphoria and SQL. These DSLs currently share only the SDK itself,
> >>>>>> although there might be room to share some more effort. We already
> know
> >>>>>> that both Euphoria and SQL have need for retractions, but there are
> >>>>>> probably many more features that these two could share.
> >>>>>>
> >>>>>> So, I'd like to open a discussion on what it would cost and what it
> >>>>>> would possibly bring, if instead of the current structure
> >>>>>>
> >>>>>>      Java SDK
> >>>>>>
> >>>>>>        | ---- SQL
> >>>>>>
> >>>>>>        | ---- Euphoria
> >>>>>>
> >>>>>> these DSLs would be structured as
> >>>>>>
> >>>>>>      Java SDK ---> Euphoria ---> SQL
> >>>>>>
> >>>>>> I'm absolutely sure that this would be a great investment and a huge
> >>>>>> change, but I'd like to gather some opinions and general feelings
> of the
> >>>>>> community about this. Some points to start the discussion from my
> side
> >>>>>> would be, that structuring DSLs like this has internal logical
> >>>>>> consistency, because each API layer further narrows completeness,
> but
> >>>>>> brings simpler API for simpler tasks, while adding additional
> high-level
> >>>>>> view of the data processing pipeline and thus enabling more
> >>>>>> optimizations. On Euphoria side, these are various implementations
> joins
> >>>>>> (most effective implementation depends on data), pipeline sampling
> and
> >>>>>> more. Some (or maybe most) of these optimizations would have to be
> >>>>>> implemented in both DSLs, so implementing them once is beneficial.
> >>>>>> Another benefit is that this would bring Euphoria "closer" to Beam
> core
> >>>>>> development (which would be good, it is part of the project anyway,
> >>>>>> right? :)) and help better drive features, that although currently
> >>>>>> needed mostly by SQL, might be needed by other Java users anyway.
> >>>>>>
> >>>>>> Thanks for discussion and looking forward to any opinions.
> >>>>>>
> >>>>>>      Jan
> >>>>>>
>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Jan Lukavský <je...@seznam.cz>.

Hi Robert,

currently there is no actual proposal, I was just trying to gather 
feedback from the community. But my original thoughts would be [1]. I 
actually don't see much need for restructuring the code by nesting 
directories. If the community sees that it would make sense to structure 
the dependencies, the second step would probably be to figure out how to 
accomplish this. I don't have an exact solution in mind so far; it
would probably be necessary to first identify features that are needed by
SQL and not currently supported by Euphoria. Then we can actually
identify the costs and see if this still makes sense.

  Jan

On 12/3/18 6:17 PM, Robert Bradshaw wrote:
> Taking a step back, what exactly is the proposal? Looking at the
> original message, I see
>
> (1) Letting SQL take a dependency on Euphoria, sharing more code and
> taking advantage of the logical nesting of levels of abstraction. This
> makes sense to me.
> (2) Nesting the directories (but not the gradle targets or module
> names?). Here I'm not so sure about the benefit, especially vs. the
> cost.
> On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz> wrote:
>> I think that the fact that SQL uses some other internal dependency
>> should remain hidden implementation detail. I absolutely agree that the
>> dependency should of course remain sdks-java-sql in all cases.
>>
>>     Jan
>>
>> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
>>> I suppose what I'm trying to say is that I see this module structure
>>> as a tool for discoverability and enumerating end-user endpoints. In
>>> other words, if one wants to use SQL, it would seem odd to have to
>>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
>>> sdks-java-euphoria is also a DSL one might use. A sibling relationship
>>> does not prohibit the layered approach to implementation that sounds
>>> like it makes sense.
>>>
>>> (As for merging Euphoria into core, my initial impression is that's
>>> probably a good idea, and something we should consider for 3.0 at the
>>> very least.)
>>>
>>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>> Hi Rui,
>>>>
>>>> yes, there are optimizations that could be added by each layer. The purpose of Euphoria layer actually is not to reorder or modify any user operators that are present in the pipeline (because it might not have enough information to do this), but it can for instance choose between various join implementations (shuffle join, broadcast join, ...) - so the optimizations it can do are more low level. But this plays nicely with the DSL hierarchy - each layer adds a little more restrictions, but can therefore do more optimizations. And I think that the layer between SDK and SQL wouldn't have to support SQL optimizations, it would only have to support way for SQL to express these optimizations.
>>>>
>>>>     Jan
>>>>
>>>> ---------- Original e-mail ----------
>>>> From: Rui Wang <ru...@google.com>
>>>> To: dev@beam.apache.org
>>>> Date: 30. 11. 2018 22:43:04
>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>>>
>>>> SQL's optimization is another area to consider for integration. SQL optimization includes pushing down filters/projections, merging or removing or swapping plan nodes and comparing plan costs to choose best plan.  Add another layer between SQL and java core might need the layer to support SQL optimizations if there is a need.
>>>>
>>>> I don't have a clear image on what SQL needs from Euphoria for optimization(best case is nothing). As those optimizations are happening or will happen, we might start to have a sense of it.
>>>>
>>>> -Rui
>>>>
>>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <ro...@google.com> wrote:
>>>>
>>>> I don't really see Euphoria as a subset of SQL or the other way
>>>> around, and I think it makes sense to use either without the other, so
>>>> by this criteria keeping them as siblings than a nesting.
>>>>
>>>> That said, I think it's really good to have a bunch of shared code,
>>>> e.g. a join library that could be used by both. One could even depend
>>>> on the other without having to abandon the sibling relationship.
>>>> Something like retractions belong in the core SDK itself. Deeper than
>>>> that, actually, it should be part of the model.
>>>>
>>>> - Robert
>>>>
>>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dm...@apache.org> wrote:
>>>>> Jan, we made Kryo optional recently (it is a separate module and is used only in tests). From a quick look it seems that we forgot to remove compile time dependency from euphoria's build.gradle. Only "strong" dependencies I'm aware of are core SDK and guava. We'll be probably adding sketching extension dependency soon.
>>>>>
>>>>> D.
>>>>>
>>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>>>> Hi Anton,
>>>>>> reactions inline.
>>>>>>
>>>>>> ---------- Original e-mail ----------
>>>>>> From: Anton Kedin <ke...@google.com>
>>>>>> To: dev@beam.apache.org
>>>>>> Date: 30. 11. 2018 18:17:06
>>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>>>>>
>>>>>> I think this approach makes sense in general, Euphoria can be the implementation detail of SQL, similar to Join Library or core SDK Schemas.
>>>>>>
>>>>>> I wonder though whether it would be better to bring Euphoria closer to core SDK first, maybe even merge them together. If you look at Reuven's recent work around schemas it seems like there are already similarities between that and Euphoria's approach, unless I'm missing the point (e.g. Filter transforms, FullJoin vs CoGroup... see [2]). And we're already switching parts of SQL to those transforms (e.g. SQL Aggregation is now implemented by core SDK's Group[3]).
>>>>>>
>>>>>>
>>>>>>
>>>>>> Yes, these transforms seem to be very similar to those Euphoria has. Whether or not to merge Euphoria with core is essentially just a decision of the community (in my point of view).
>>>>>>
>>>>>>
>>>>>>
>>>>>> Adding explicit Schema support to Euphoria will bring it both closer to core SDK and make it natural to use for SQL. Can this be a first step towards this integration?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Euphoria currently operates on pure PCollections, so when PCollection has a schema, it will be accessible by Euphoria. It makes sense to make use of the schema in Euphoria - it seems natural on inputs to Euphoria operators, but it might be tricky (not saying impossible) to actually produce schema-aware PCollections as outputs from Euphoria operators (generally speaking, in special cases that might be possible). Regarding inputs, there is actually intention to act on type of PCollection - e.g. when PCollection is already of type KV, then it is possible to make key extractor and value extractor optional in Euphoria builders, so it feels natural to enable changing the builders when a schema-aware PCollection, and make use of the provided schema. The rest of Euphoria team might correct me, if I'm wrong.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> One question I have is, does Euphoria bring dependencies that are not needed by SQL, or does more or less only rely on the core SDK?
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think the only relevant dependency that Euphoria has besides core SDK is Kryo. It is the default coder when no coder is provided, but that could be made optional - e.g. the default coder would be supported only if an appropriate module would be available. That way I think that Euphoria has no special dependencies.
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>>>>>> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>>>>>> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>>>>
>>>>>> Hi community,
>>>>>>
>>>>>> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
>>>>>> discuss possible development of Java based DSLs currently present in
>>>>>> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
>>>>>> Euphoria and SQL. These DSLs currently share only the SDK itself,
>>>>>> although there might be room to share some more effort. We already know
>>>>>> that both Euphoria and SQL have need for retractions, but there are
>>>>>> probably many more features that these two could share.
>>>>>>
>>>>>> So, I'd like to open a discussion on what it would cost and what it
>>>>>> would possibly bring, if instead of the current structure
>>>>>>
>>>>>>      Java SDK
>>>>>>
>>>>>>        | ---- SQL
>>>>>>
>>>>>>        | ---- Euphoria
>>>>>>
>>>>>> these DSLs would be structured as
>>>>>>
>>>>>>      Java SDK ---> Euphoria ---> SQL
>>>>>>
>>>>>> I'm absolutely sure that this would be a great investment and a huge
>>>>>> change, but I'd like to gather some opinions and general feelings of the
>>>>>> community about this. Some points to start the discussion from my side
>>>>>> would be, that structuring DSLs like this has internal logical
>>>>>> consistency, because each API layer further narrows completeness, but
>>>>>> brings simpler API for simpler tasks, while adding additional high-level
>>>>>> view of the data processing pipeline and thus enabling more
>>>>>> optimizations. On Euphoria side, these are various implementations joins
>>>>>> (most effective implementation depends on data), pipeline sampling and
>>>>>> more. Some (or maybe most) of these optimizations would have to be
>>>>>> implemented in both DSLs, so implementing them once is beneficial.
>>>>>> Another benefit is that this would bring Euphoria "closer" to Beam core
>>>>>> development (which would be good, it is part of the project anyway,
>>>>>> right? :)) and help better drive features, that although currently
>>>>>> needed mostly by SQL, might be needed by other Java users anyway.
>>>>>>
>>>>>> Thanks for discussion and looking forward to any opinions.
>>>>>>
>>>>>>      Jan
>>>>>>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Robert Bradshaw <ro...@google.com>.

Taking a step back, what exactly is the proposal? Looking at the
original message, I see

(1) Letting SQL take a dependency on Euphoria, sharing more code and
taking advantage of the logical nesting of levels of abstraction. This
makes sense to me.
(2) Nesting the directories (but not the gradle targets or module
names?). Here I'm not so sure about the benefit, especially vs. the
cost.

On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz> wrote:
>
> I think that the fact that SQL uses some other internal dependency
> should remain hidden implementation detail. I absolutely agree that the
> dependency should of course remain sdks-java-sql in all cases.
>
>    Jan
>
> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
> > I suppose what I'm trying to say is that I see this module structure
> > as a tool for discoverability and enumerating end-user endpoints. In
> > other words, if one wants to use SQL, it would seem odd to have to
> > depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
> > sdks-java-euphoria is also a DSL one might use. A sibling relationship
> > does not prohibit the layered approach to implementation that sounds
> > like it makes sense.
> >
> > (As for merging Euphoria into core, my initial impression is that's
> > probably a good idea, and something we should consider for 3.0 at the
> > very least.)
> >
> > On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz> wrote:
> >> Hi Rui,
> >>
> >> yes, there are optimizations that could be added by each layer. The purpose of Euphoria layer actually is not to reorder or modify any user operators that are present in the pipeline (because it might not have enough information to do this), but it can for instance choose between various join implementations (shuffle join, broadcast join, ...) - so the optimizations it can do are more low level. But this plays nicely with the DSL hierarchy - each layer adds a little more restrictions, but can therefore do more optimizations. And I think that the layer between SDK and SQL wouldn't have to support SQL optimizations, it would only have to support way for SQL to express these optimizations.
> >>
> >>    Jan
> >>
> >> ---------- Original e-mail ----------
> >> From: Rui Wang <ru...@google.com>
> >> To: dev@beam.apache.org
> >> Date: 30. 11. 2018 22:43:04
> >> Subject: Re: [DISCUSS] Structuring Java based DSLs
> >>
> >> SQL's optimization is another area to consider for integration. SQL optimization includes pushing down filters/projections, merging or removing or swapping plan nodes and comparing plan costs to choose best plan.  Add another layer between SQL and java core might need the layer to support SQL optimizations if there is a need.
> >>
> >> I don't have a clear image on what SQL needs from Euphoria for optimization(best case is nothing). As those optimizations are happening or will happen, we might start to have a sense of it.
> >>
> >> -Rui
> >>
> >> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <ro...@google.com> wrote:
> >>
> >> I don't really see Euphoria as a subset of SQL or the other way
> >> around, and I think it makes sense to use either without the other, so
> >> by this criteria keeping them as siblings than a nesting.
> >>
> >> That said, I think it's really good to have a bunch of shared code,
> >> e.g. a join library that could be used by both. One could even depend
> >> on the other without having to abandon the sibling relationship.
> >> Something like retractions belong in the core SDK itself. Deeper than
> >> that, actually, it should be part of the model.
> >>
> >> - Robert
> >>
> >> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dm...@apache.org> wrote:
> >>> Jan, we made Kryo optional recently (it is a separate module and is used only in tests). From a quick look it seems that we forgot to remove compile time dependency from euphoria's build.gradle. Only "strong" dependencies I'm aware of are core SDK and guava. We'll be probably adding sketching extension dependency soon.
> >>>
> >>> D.
> >>>
> >>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz> wrote:
> >>>> Hi Anton,
> >>>> reactions inline.
> >>>>
> >>>> ---------- Original e-mail ----------
> >>>> From: Anton Kedin <ke...@google.com>
> >>>> To: dev@beam.apache.org
> >>>> Date: 30. 11. 2018 18:17:06
> >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
> >>>>
> >>>> I think this approach makes sense in general, Euphoria can be the implementation detail of SQL, similar to Join Library or core SDK Schemas.
> >>>>
> >>>> I wonder though whether it would be better to bring Euphoria closer to core SDK first, maybe even merge them together. If you look at Reuven's recent work around schemas it seems like there are already similarities between that and Euphoria's approach, unless I'm missing the point (e.g. Filter transforms, FullJoin vs CoGroup... see [2]). And we're already switching parts of SQL to those transforms (e.g. SQL Aggregation is now implemented by core SDK's Group[3]).
> >>>>
> >>>>
> >>>>
> >>>> Yes, these transforms seem to be very similar to those Euphoria has. Whether or not to merge Euphoria with core is essentially just a decision of the community (in my point of view).
> >>>>
> >>>>
> >>>>
> >>>> Adding explicit Schema support to Euphoria will bring it both closer to core SDK and make it natural to use for SQL. Can this be a first step towards this integration?
> >>>>
> >>>>
> >>>>
> >>>> Euphoria currently operates on pure PCollections, so when PCollection has a schema, it will be accessible by Euphoria. It makes sense to make use of the schema in Euphoria - it seems natural on inputs to Euphoria operators, but it might be tricky (not saying impossible) to actually produce schema-aware PCollections as outputs from Euphoria operators (generally speaking, in special cases that might be possible). Regarding inputs, there is actually intention to act on type of PCollection - e.g. when PCollection is already of type KV, then it is possible to make key extractor and value extractor optional in Euphoria builders, so it feels natural to enable changing the builders when a schema-aware PCollection, and make use of the provided schema. The rest of Euphoria team might correct me, if I'm wrong.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> One question I have is, does Euphoria bring dependencies that are not needed by SQL, or does more or less only rely on the core SDK?
> >>>>
> >>>>
> >>>>
> >>>> I think the only relevant dependency that Euphoria has besides core SDK is Kryo. It is the default coder when no coder is provided, but that could be made optional - e.g. the default coder would be supported only if an appropriate module would be available. That way I think that Euphoria has no special dependencies.
> >>>>
> >>>>
> >>>>
> >>>> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
> >>>> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
> >>>> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz> wrote:
> >>>>
> >>>> Hi community,
> >>>>
> >>>> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
> >>>> discuss possible development of Java based DSLs currently present in
> >>>> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
> >>>> Euphoria and SQL. These DSLs currently share only the SDK itself,
> >>>> although there might be room to share some more effort. We already know
> >>>> that both Euphoria and SQL have need for retractions, but there are
> >>>> probably many more features that these two could share.
> >>>>
> >>>> So, I'd like to open a discussion on what it would cost and what it
> >>>> would possibly bring, if instead of the current structure
> >>>>
> >>>>     Java SDK
> >>>>
> >>>>       | ---- SQL
> >>>>
> >>>>       | ---- Euphoria
> >>>>
> >>>> these DSLs would be structured as
> >>>>
> >>>>     Java SDK ---> Euphoria ---> SQL
> >>>>
> >>>> I'm absolutely sure that this would be a great investment and a huge
> >>>> change, but I'd like to gather some opinions and general feelings of the
> >>>> community about this. Some points to start the discussion from my side
> >>>> would be, that structuring DSLs like this has internal logical
> >>>> consistency, because each API layer further narrows completeness, but
> >>>> brings simpler API for simpler tasks, while adding additional high-level
> >>>> view of the data processing pipeline and thus enabling more
> >>>> optimizations. On Euphoria side, these are various implementations joins
> >>>> (most effective implementation depends on data), pipeline sampling and
> >>>> more. Some (or maybe most) of these optimizations would have to be
> >>>> implemented in both DSLs, so implementing them once is beneficial.
> >>>> Another benefit is that this would bring Euphoria "closer" to Beam core
> >>>> development (which would be good, it is part of the project anyway,
> >>>> right? :)) and help better drive features, that although currently
> >>>> needed mostly by SQL, might be needed by other Java users anyway.
> >>>>
> >>>> Thanks for discussion and looking forward to any opinions.
> >>>>
> >>>>     Jan
> >>>>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Jan Lukavský <je...@seznam.cz>.

I think the fact that SQL uses some other internal dependency should
remain a hidden implementation detail. I absolutely agree that the
dependency should remain sdks-java-sql in all cases.

   Jan

Re: [DISCUSS] Structuring Java based DSLs

Posted by Robert Bradshaw <ro...@google.com>.

I suppose what I'm trying to say is that I see this module structure
as a tool for discoverability and enumerating end-user endpoints. In
other words, if one wants to use SQL, it would seem odd to have to
depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
sdks-java-euphoria is also a DSL one might use. A sibling relationship
does not prohibit the layered approach to implementation that sounds
like it makes sense.

(As for merging Euphoria into core, my initial impression is that's
probably a good idea, and something we should consider for 3.0 at the
very least.)

Re: [DISCUSS] Structuring Java based DSLs

Posted by Jan Lukavský <je...@seznam.cz>.

Hi Rui,

yes, there are optimizations that could be added by each layer. The purpose
of the Euphoria layer is actually not to reorder or modify any user operators
that are present in the pipeline (because it might not have enough
information to do this), but it can, for instance, choose between various join
implementations (shuffle join, broadcast join, ...), so the optimizations
it can do are more low-level. But this plays nicely with the DSL hierarchy:
each layer adds a few more restrictions, but can therefore do more
optimizations. And I think that the layer between the SDK and SQL wouldn't
have to support SQL optimizations; it would only have to provide a way for
SQL to express these optimizations.
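
To make the join example above a bit more concrete, here is a minimal sketch in plain Java of what "choosing between join implementations" could look like. Everything in it is an assumption for illustration: the class name, the JoinStrategy enum, the size estimate and the threshold are hypothetical and are not taken from Euphoria or the Beam SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: none of these names are Euphoria or Beam API.
final class JoinStrategySketch {

  enum JoinStrategy { BROADCAST, SHUFFLE }

  // Assumption: an estimate of the smaller input's size is available.
  static JoinStrategy choose(long smallerSideBytes, long broadcastLimitBytes) {
    return smallerSideBytes <= broadcastLimitBytes
        ? JoinStrategy.BROADCAST
        : JoinStrategy.SHUFFLE;
  }

  // Broadcast-style inner join: the small side is held as a lookup map and
  // the large side is streamed past it. Each large row is {key, value}.
  static List<String> broadcastJoin(Map<String, String> small, List<String[]> large) {
    List<String> joined = new ArrayList<>();
    for (String[] row : large) {
      String match = small.get(row[0]);
      if (match != null) {
        joined.add(row[0] + ":" + row[1] + "," + match);
      }
    }
    return joined;
  }

  public static void main(String[] args) {
    System.out.println(choose(10L, 100L));
    System.out.println(choose(1_000L, 100L));
  }
}
```

The point of the sketch is only that the user-visible operator (the join) stays the same, while the layer underneath picks the physical implementation based on what it knows about the data.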
  Jan ---------- Původní e-mail ----------
Od: Rui Wang <ru...@google.com>
Komu: dev@beam.apache.org
Datum: 30. 11. 2018 22:43:04
Předmět: Re: [DISCUSS] Structuring Java based DSLs 
"
SQL's optimization is another area to consider for integration. SQL 
optimization includes pushing down filters/projections, merging or removing 
or swapping plan nodes and comparing plan costs to choose best plan.  Add 
another layer between SQL and java core might need the layer to support SQL 
optimizations if there is a need.




I don't have a clear image on what SQL needs from Euphoria for optimization
(best case is nothing). As those optimizations are happening or will happen,
we might start to have a sense of it.




-Rui 




On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <robertwb@google.com
(mailto:robertwb@google.com)> wrote:

"I don't really see Euphoria as a subset of SQL or the other way
around, and I think it makes sense to use either without the other, so
by this criteria keeping them as siblings than a nesting.

That said, I think it's really good to have a bunch of shared code,
e.g. a join library that could be used by both. One could even depend
on the other without having to abandon the sibling relationship.
Something like retractions belong in the core SDK itself. Deeper than
that, actually, it should be part of the model.

- Robert

On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dmvk@apache.org
(mailto:dmvk@apache.org)> wrote:
>
> Jan, we made Kryo optional recently (it is a separate module and is used 
only in tests). From a quick look it seems that we forgot to remove compile 
time dependency from euphoria's build.gradle. Only "strong" dependencies I'm
aware of are core SDK and guava. We'll be probably adding sketching 
extension dependency soon.
>
> D.
>
> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je.ik@seznam.cz
(mailto:je.ik@seznam.cz)> wrote:
>>
>> Hi Anton,
>> reactions inline.
>>
>> ---------- Původní e-mail ----------
>> Od: Anton Kedin <kedin@google.com(mailto:kedin@google.com)>
>> Komu: dev@beam.apache.org(mailto:dev@beam.apache.org)
>> Datum: 30. 11. 2018 18:17:06
>> Předmět: Re: [DISCUSS] Structuring Java based DSLs
>>
>> I think this approach makes sense in general, Euphoria can be the 
implementation detail of SQL, similar to Join Library or core SDK Schemas.
>>
>> I wonder though whether it would be better to bring Euphoria closer to 
core SDK first, maybe even merge them together. If you look at Reuven's 
recent work around schemas it seems like there are already similarities 
between that and Euphoria's approach, unless I'm missing the point (e.g. 
Filter transforms, FullJoin vs CoGroup... see [2]). And we're already 
switching parts of SQL to those transforms (e.g. SQL Aggregation is now 
implemented by core SDK's Group[3]).
>>
>>
>>
>> Yes, these transforms seem to be very similar to those Euphoria has. 
Whether or not to merge Euphoria with core is essentially just a decision of
the community (in my point of view).
>>
>>
>>
>> Adding explicit Schema support to Euphoria will bring it both closer to 
core SDK and make it natural to use for SQL. Can this be a first step 
towards this integration?
>>
>>
>>
>> Euphoria currently operates on pure PCollections, so when PCollection has
a schema, it will be accessible by Euphoria. It makes sense to make use of 
the schema in Euphoria - it seems natural on inputs to Euphoria operators, 
but it might be tricky (not saying impossible) to actually produce schema-
aware PCollections as outputs from Euphoria operators (generally speaking, 
in special cases that might be possible). Regarding inputs, there is 
actually intention to act on type of PCollection - e.g. when PCollection is 
already of type KV, then it is possible to make key extractor and value 
extractor optional in Euphoria builders, so it feels natural to enable 
changing the builders when a schema-aware PCollection, and make use of the 
provided schema. The rest of Euphoria team might correct me, if I'm wrong.
>>
>>
>>
>>
>> One question I have is, does Euphoria bring dependencies that are not 
needed by SQL, or does more or less only rely on the core SDK?
>>
>>
>>
>> I think the only relevant dependency that Euphoria has besides core SDK 
is Kryo. It is the default coder when no coder is provided, but that could 
be made optional - e.g. the default coder would be supported only if an 
appropriate module would be available. That way I think that Euphoria has no
special dependencies.
>>
>>
>>
>> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef
6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/
Group.java#L73
(https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73)
>> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef
6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
(https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms)
>> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef
6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/
extensions/sql/impl/rel/BeamAggregationRel.java#L179
(https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179)
>>
>>
>>
>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je.ik@seznam.cz
(mailto:je.ik@seznam.cz)> wrote:
>>
>> Hi community,
>>
>> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
>> discuss possible development of Java based DSLs currently present in
>> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
>> Euphoria and SQL. These DSLs currently share only the SDK itself,
>> although there might be room to share some more effort. We already know
>> that both Euphoria and SQL have need for retractions, but there are
>> probably many more features that these two could share.
>>
>> So, I'd like to open a discussion on what it would cost and what it
>> would possibly bring, if instead of the current structure
>>
>>    Java SDK
>>
>>      | ---- SQL
>>
>>      | ---- Euphoria
>>
>> these DSLs would be structured as
>>
>>    Java SDK ---> Euphoria ---> SQL
>>
>> I'm absolutely sure that this would be a great investment and a huge
>> change, but I'd like to gather some opinions and general feelings of the
>> community about this. Some points to start the discussion from my side
>> would be, that structuring DSLs like this has internal logical
>> consistency, because each API layer further narrows completeness, but
>> brings simpler API for simpler tasks, while adding additional high-level
>> view of the data processing pipeline and thus enabling more
>> optimizations. On Euphoria side, these are various implementations joins
>> (most effective implementation depends on data), pipeline sampling and
>> more. Some (or maybe most) of these optimizations would have to be
>> implemented in both DSLs, so implementing them once is beneficial.
>> Another benefit is that this would bring Euphoria "closer" to Beam core
>> development (which would be good, it is part of the project anyway,
>> right? :)) and help better drive features, that although currently
>> needed mostly by SQL, might be needed by other Java users anyway.
>>
>> Thanks for discussion and looking forward to any opinions.
>>
>>    Jan
>>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Rui Wang <ru...@google.com>.

SQL's optimization is another area to consider for integration. SQL
optimization includes pushing down filters/projections, merging, removing,
or swapping plan nodes, and comparing plan costs to choose the best plan.
Adding another layer between SQL and the Java core might require that layer
to support SQL optimizations, if there is a need.
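
To make the first kind of rewrite concrete, here is a minimal sketch of
pushing a filter below a projection on a toy logical plan. The node classes
and the pushDown method are hypothetical and are not taken from Beam SQL,
Calcite, or Euphoria. The sketch assumes the projection only selects columns
and the predicate reads a single column.

```java
import java.util.Arrays;
import java.util.List;

// Toy plan model for illustration only; not Beam SQL, Calcite, or Euphoria code.
class FilterPushDownSketch {
  abstract static class Node {}

  static final class Scan extends Node {
    final String table;
    Scan(String table) { this.table = table; }
    @Override public String toString() { return "Scan(" + table + ")"; }
  }

  static final class Project extends Node {
    final List<String> columns;
    final Node input;
    Project(List<String> columns, Node input) {
      this.columns = columns;
      this.input = input;
    }
    @Override public String toString() { return "Project(" + columns + ", " + input + ")"; }
  }

  static final class Filter extends Node {
    final String column;     // the single column the predicate reads
    final String predicate;  // kept as text; this sketch never evaluates it
    final Node input;
    Filter(String column, String predicate, Node input) {
      this.column = column;
      this.predicate = predicate;
      this.input = input;
    }
    @Override public String toString() { return "Filter(" + predicate + ", " + input + ")"; }
  }

  // Rewrites Filter(Project(x)) into Project(Filter(x)) when the filtered
  // column survives the projection, so rows are dropped closer to the source.
  static Node pushDown(Node node) {
    if (node instanceof Project) {
      Project p = (Project) node;
      return new Project(p.columns, pushDown(p.input));
    }
    if (node instanceof Filter) {
      Filter f = (Filter) node;
      Node input = pushDown(f.input);
      if (input instanceof Project && ((Project) input).columns.contains(f.column)) {
        Project p = (Project) input;
        return new Project(p.columns, pushDown(new Filter(f.column, f.predicate, p.input)));
      }
      return new Filter(f.column, f.predicate, input);
    }
    return node; // Scan: nothing to rewrite
  }

  public static void main(String[] args) {
    Node plan = new Filter("age", "age > 30",
        new Project(Arrays.asList("name", "age"), new Scan("users")));
    System.out.println(pushDown(plan));
    // prints: Project([name, age], Filter(age > 30, Scan(users)))
  }
}
```

If an intermediate layer sat between SQL and the core SDK, a rewrite like
this would need either to happen above that layer or to be expressible in
that layer's own operators.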

I don't have a clear picture of what SQL needs from Euphoria for
optimization (best case is nothing). As those optimizations are happening or
will happen, we might start to get a sense of it.

-Rui

On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <ro...@google.com>
wrote:

> I don't really see Euphoria as a subset of SQL or the other way
> around, and I think it makes sense to use either without the other, so
> by this criteria keeping them as siblings than a nesting.
>
> That said, I think it's really good to have a bunch of shared code,
> e.g. a join library that could be used by both. One could even depend
> on the other without having to abandon the sibling relationship.
> Something like retractions belong in the core SDK itself. Deeper than
> that, actually, it should be part of the model.
>
> - Robert
>
> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dm...@apache.org> wrote:
> >
> > Jan, we made Kryo optional recently (it is a separate module and is used only in tests). From a quick look it seems that we forgot to remove compile time dependency from euphoria's build.gradle. Only "strong" dependencies I'm aware of are core SDK and guava. We'll be probably adding sketching extension dependency soon.
> >
> > D.
> >
> > On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz> wrote:
> >>
> >> Hi Anton,
> >> reactions inline.
> >>
> >> ---------- Original e-mail ----------
> >> From: Anton Kedin <ke...@google.com>
> >> To: dev@beam.apache.org
> >> Date: 30. 11. 2018 18:17:06
> >> Subject: Re: [DISCUSS] Structuring Java based DSLs
> >>
> >> I think this approach makes sense in general, Euphoria can be the implementation detail of SQL, similar to Join Library or core SDK Schemas.
> >>
> >> I wonder though whether it would be better to bring Euphoria closer to core SDK first, maybe even merge them together. If you look at Reuven's recent work around schemas it seems like there are already similarities between that and Euphoria's approach, unless I'm missing the point (e.g. Filter transforms, FullJoin vs CoGroup... see [2]). And we're already switching parts of SQL to those transforms (e.g. SQL Aggregation is now implemented by core SDK's Group[3]).
> >>
> >>
> >>
> >> Yes, these transforms seem to be very similar to those Euphoria has. Whether or not to merge Euphoria with core is essentially just a decision of the community (in my point of view).
> >>
> >>
> >>
> >> Adding explicit Schema support to Euphoria will bring it both closer to core SDK and make it natural to use for SQL. Can this be a first step towards this integration?
> >>
> >>
> >>
> >> Euphoria currently operates on pure PCollections, so when PCollection has a schema, it will be accessible by Euphoria. It makes sense to make use of the schema in Euphoria - it seems natural on inputs to Euphoria operators, but it might be tricky (not saying impossible) to actually produce schema-aware PCollections as outputs from Euphoria operators (generally speaking, in special cases that might be possible). Regarding inputs, there is actually intention to act on type of PCollection - e.g. when PCollection is already of type KV, then it is possible to make key extractor and value extractor optional in Euphoria builders, so it feels natural to enable changing the builders when a schema-aware PCollection, and make use of the provided schema. The rest of Euphoria team might correct me, if I'm wrong.
> >>
> >>
> >>
> >>
> >> One question I have is, does Euphoria bring dependencies that are not needed by SQL, or does more or less only rely on the core SDK?
> >>
> >>
> >>
> >> I think the only relevant dependency that Euphoria has besides core SDK is Kryo. It is the default coder when no coder is provided, but that could be made optional - e.g. the default coder would be supported only if an appropriate module would be available. That way I think that Euphoria has no special dependencies.
> >>
> >>
> >>
> >> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
> >> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
> >> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
> >>
> >>
> >>
> >> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz> wrote:
> >>
> >> Hi community,
> >>
> >> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
> >> discuss possible development of Java based DSLs currently present in
> >> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
> >> Euphoria and SQL. These DSLs currently share only the SDK itself,
> >> although there might be room to share some more effort. We already know
> >> that both Euphoria and SQL have need for retractions, but there are
> >> probably many more features that these two could share.
> >>
> >> So, I'd like to open a discussion on what it would cost and what it
> >> would possibly bring, if instead of the current structure
> >>
> >>    Java SDK
> >>
> >>      | ---- SQL
> >>
> >>      | ---- Euphoria
> >>
> >> these DSLs would be structured as
> >>
> >>    Java SDK ---> Euphoria ---> SQL
> >>
> >> I'm absolutely sure that this would be a great investment and a huge
> >> change, but I'd like to gather some opinions and general feelings of the
> >> community about this. Some points to start the discussion from my side
> >> would be, that structuring DSLs like this has internal logical
> >> consistency, because each API layer further narrows completeness, but
> >> brings simpler API for simpler tasks, while adding additional high-level
> >> view of the data processing pipeline and thus enabling more
> >> optimizations. On Euphoria side, these are various implementations joins
> >> (most effective implementation depends on data), pipeline sampling and
> >> more. Some (or maybe most) of these optimizations would have to be
> >> implemented in both DSLs, so implementing them once is beneficial.
> >> Another benefit is that this would bring Euphoria "closer" to Beam core
> >> development (which would be good, it is part of the project anyway,
> >> right? :)) and help better drive features, that although currently
> >> needed mostly by SQL, might be needed by other Java users anyway.
> >>
> >> Thanks for discussion and looking forward to any opinions.
> >>
> >>    Jan
> >>
>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Robert Bradshaw <ro...@google.com>.

I don't really see Euphoria as a subset of SQL or the other way
around, and I think it makes sense to use either without the other, so
by this criterion it makes more sense to keep them as siblings than to
nest one inside the other.

That said, I think it's really good to have a bunch of shared code,
e.g. a join library that could be used by both. One could even depend
on the other without having to abandon the sibling relationship.
Something like retractions belongs in the core SDK itself. Deeper than
that, actually, it should be part of the model.
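
The sibling arrangement with shared code can be sketched in plain Java as
one join routine called by two independent front ends. Everything below is
hypothetical and works on in-memory lists; it is not Beam's join library,
Euphoria, or Beam SQL, and only illustrates the dependency shape.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustration only: one shared join routine, two independent "sibling" front ends.
class SharedJoinSketch {

  // Shared library code: a hash-based inner join over in-memory lists.
  static <L, R, K> List<Map.Entry<L, R>> innerJoin(
      List<L> left, List<R> right, Function<L, K> leftKey, Function<R, K> rightKey) {
    Map<K, List<R>> index = new HashMap<>();
    for (R r : right) {
      index.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
    }
    List<Map.Entry<L, R>> joined = new ArrayList<>();
    for (L l : left) {
      for (R r : index.getOrDefault(leftKey.apply(l), Collections.emptyList())) {
        joined.add(new SimpleEntry<>(l, r));
      }
    }
    return joined;
  }

  // Front end A (programmatic DSL style): key extractors are ordinary functions.
  static List<Map.Entry<String, String>> joinByFirstLetter(
      List<String> left, List<String> right) {
    return innerJoin(left, right, s -> s.substring(0, 1), s -> s.substring(0, 1));
  }

  // Front end B (SQL style): rows are arrays and join keys are column positions.
  static List<Map.Entry<String[], String[]>> joinOnColumns(
      List<String[]> left, int leftCol, List<String[]> right, int rightCol) {
    return innerJoin(left, right, row -> row[leftCol], row -> row[rightCol]);
  }

  public static void main(String[] args) {
    System.out.println(joinByFirstLetter(
        Arrays.asList("apple", "bean"), Arrays.asList("avocado", "cherry")));
    // prints: [apple=avocado]

    List<String[]> users = Arrays.asList(
        new String[] {"1", "ann"}, new String[] {"2", "bob"});
    List<String[]> orders = Arrays.asList(
        new String[] {"o1", "1"}, new String[] {"o2", "1"}, new String[] {"o3", "3"});
    System.out.println(joinOnColumns(users, 0, orders, 1).size());
    // prints: 2
  }
}
```

Neither front end knows about the other; both depend only on the shared
routine, which is the relationship described above.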

- Robert

On Fri, Nov 30, 2018 at 7:20 PM David Morávek <dm...@apache.org> wrote:
>
> Jan, we made Kryo optional recently (it is a separate module and is used only in tests). From a quick look it seems that we forgot to remove compile time dependency from euphoria's build.gradle. Only "strong" dependencies I'm aware of are core SDK and guava. We'll be probably adding sketching extension dependency soon.
>
> D.
>
> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz> wrote:
>>
>> Hi Anton,
>> reactions inline.
>>
>> ---------- Original e-mail ----------
>> From: Anton Kedin <ke...@google.com>
>> To: dev@beam.apache.org
>> Date: 30. 11. 2018 18:17:06
>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>
>> I think this approach makes sense in general, Euphoria can be the implementation detail of SQL, similar to Join Library or core SDK Schemas.
>>
>> I wonder though whether it would be better to bring Euphoria closer to core SDK first, maybe even merge them together. If you look at Reuven's recent work around schemas it seems like there are already similarities between that and Euphoria's approach, unless I'm missing the point (e.g. Filter transforms, FullJoin vs CoGroup... see [2]). And we're already switching parts of SQL to those transforms (e.g. SQL Aggregation is now implemented by core SDK's Group[3]).
>>
>>
>>
>> Yes, these transforms seem to be very similar to those Euphoria has. Whether or not to merge Euphoria with core is essentially just a decision of the community (in my point of view).
>>
>>
>>
>> Adding explicit Schema support to Euphoria will bring it both closer to core SDK and make it natural to use for SQL. Can this be a first step towards this integration?
>>
>>
>>
>> Euphoria currently operates on pure PCollections, so when PCollection has a schema, it will be accessible by Euphoria. It makes sense to make use of the schema in Euphoria - it seems natural on inputs to Euphoria operators, but it might be tricky (not saying impossible) to actually produce schema-aware PCollections as outputs from Euphoria operators (generally speaking, in special cases that might be possible). Regarding inputs, there is actually intention to act on type of PCollection - e.g. when PCollection is already of type KV, then it is possible to make key extractor and value extractor optional in Euphoria builders, so it feels natural to enable changing the builders when a schema-aware PCollection, and make use of the provided schema. The rest of Euphoria team might correct me, if I'm wrong.
>>
>>
>>
>>
>> One question I have is, does Euphoria bring dependencies that are not needed by SQL, or does more or less only rely on the core SDK?
>>
>>
>>
>> I think the only relevant dependency that Euphoria has besides core SDK is Kryo. It is the default coder when no coder is provided, but that could be made optional - e.g. the default coder would be supported only if an appropriate module would be available. That way I think that Euphoria has no special dependencies.
>>
>>
>>
>> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>>
>>
>>
>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>> Hi community,
>>
>> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
>> discuss possible development of Java based DSLs currently present in
>> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
>> Euphoria and SQL. These DSLs currently share only the SDK itself,
>> although there might be room to share some more effort. We already know
>> that both Euphoria and SQL have need for retractions, but there are
>> probably many more features that these two could share.
>>
>> So, I'd like to open a discussion on what it would cost and what it
>> would possibly bring, if instead of the current structure
>>
>>    Java SDK
>>
>>      | ---- SQL
>>
>>      | ---- Euphoria
>>
>> these DSLs would be structured as
>>
>>    Java SDK ---> Euphoria ---> SQL
>>
>> I'm absolutely sure that this would be a great investment and a huge
>> change, but I'd like to gather some opinions and general feelings of the
>> community about this. Some points to start the discussion from my side
>> would be, that structuring DSLs like this has internal logical
>> consistency, because each API layer further narrows completeness, but
>> brings simpler API for simpler tasks, while adding additional high-level
>> view of the data processing pipeline and thus enabling more
>> optimizations. On Euphoria side, these are various implementations joins
>> (most effective implementation depends on data), pipeline sampling and
>> more. Some (or maybe most) of these optimizations would have to be
>> implemented in both DSLs, so implementing them once is beneficial.
>> Another benefit is that this would bring Euphoria "closer" to Beam core
>> development (which would be good, it is part of the project anyway,
>> right? :)) and help better drive features, that although currently
>> needed mostly by SQL, might be needed by other Java users anyway.
>>
>> Thanks for discussion and looking forward to any opinions.
>>
>>    Jan
>>

Re: [DISCUSS] Structuring Java based DSLs

Posted by David Morávek <dm...@apache.org>.

Jan, we made Kryo optional recently (it is a separate module and is used
only in tests). From a quick look it seems that we forgot to remove the
compile-time dependency from Euphoria's build.gradle. The only "strong"
dependencies I'm aware of are the core SDK and Guava. We'll probably be
adding a dependency on the sketching extension soon.

D.

On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz> wrote:

> Hi Anton,
> reactions inline.
>
> ---------- Original e-mail ----------
> From: Anton Kedin <ke...@google.com>
> To: dev@beam.apache.org
> Date: 30. 11. 2018 18:17:06
> Subject: Re: [DISCUSS] Structuring Java based DSLs
>
> I think this approach makes sense in general, Euphoria can be the
> implementation detail of SQL, similar to Join Library or core SDK Schemas.
>
> I wonder though whether it would be better to bring Euphoria closer to
> core SDK first, maybe even merge them together. If you look at Reuven's
> recent work around schemas it seems like there are already similarities
> between that and Euphoria's approach, unless I'm missing the point (e.g.
> Filter transforms, FullJoin vs CoGroup... see [2]). And we're already
> switching parts of SQL to those transforms (e.g. SQL Aggregation is now
> implemented by core SDK's Group[3]).
>
>
>
> Yes, these transforms seem to be very similar to those Euphoria has.
> Whether or not to merge Euphoria with core is essentially just a decision
> of the community (in my point of view).
>
>
>
> Adding explicit Schema support to Euphoria will bring it both closer to
> core SDK and make it natural to use for SQL. Can this be a first step
> towards this integration?
>
>
>
> Euphoria currently operates on pure PCollections, so when PCollection has
> a schema, it will be accessible by Euphoria. It makes sense to make use of
> the schema in Euphoria - it seems natural on inputs to Euphoria operators,
> but it might be tricky (not saying impossible) to actually produce
> schema-aware PCollections as outputs from Euphoria operators (generally
> speaking, in special cases that might be possible). Regarding inputs, there
> is actually intention to act on type of PCollection - e.g. when PCollection
> is already of type KV, then it is possible to make key extractor and value
> extractor optional in Euphoria builders, so it feels natural to enable
> changing the builders when a schema-aware PCollection, and make use of the
> provided schema. The rest of Euphoria team might correct me, if I'm wrong.
>
>
>
>
> One question I have is, does Euphoria bring dependencies that are not
> needed by SQL, or does more or less only rely on the core SDK?
>
>
>
> I think the only relevant dependency that Euphoria has besides core SDK is
> Kryo. It is the default coder when no coder is provided, but that could be
> made optional - e.g. the default coder would be supported only if an
> appropriate module would be available. That way I think that Euphoria has
> no special dependencies.
>
>
>
> [1]
> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
> [2]
> https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
> [3]
> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>
>
>
> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz> wrote:
>
> Hi community,
>
> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
> discuss possible development of Java based DSLs currently present in
> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
> Euphoria and SQL. These DSLs currently share only the SDK itself,
> although there might be room to share some more effort. We already know
> that both Euphoria and SQL have need for retractions, but there are
> probably many more features that these two could share.
>
> So, I'd like to open a discussion on what it would cost and what it
> would possibly bring, if instead of the current structure
>
>    Java SDK
>
>      | ---- SQL
>
>      | ---- Euphoria
>
> these DSLs would be structured as
>
>    Java SDK ---> Euphoria ---> SQL
>
> I'm absolutely sure that this would be a great investment and a huge
> change, but I'd like to gather some opinions and general feelings of the
> community about this. Some points to start the discussion from my side
> would be, that structuring DSLs like this has internal logical
> consistency, because each API layer further narrows completeness, but
> brings simpler API for simpler tasks, while adding additional high-level
> view of the data processing pipeline and thus enabling more
> optimizations. On Euphoria side, these are various implementations joins
> (most effective implementation depends on data), pipeline sampling and
> more. Some (or maybe most) of these optimizations would have to be
> implemented in both DSLs, so implementing them once is beneficial.
> Another benefit is that this would bring Euphoria "closer" to Beam core
> development (which would be good, it is part of the project anyway,
> right? :)) and help better drive features, that although currently
> needed mostly by SQL, might be needed by other Java users anyway.
>
> Thanks for discussion and looking forward to any opinions.
>
>    Jan
>
>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Jan Lukavský <je...@seznam.cz>.

Hi Anton,

reactions inline.

---------- Original e-mail ----------
From: Anton Kedin <ke...@google.com>
To: dev@beam.apache.org
Date: 30. 11. 2018 18:17:06
Subject: Re: [DISCUSS] Structuring Java based DSLs

> I think this approach makes sense in general, Euphoria can be the
> implementation detail of SQL, similar to Join Library or core SDK Schemas.
>
> I wonder though whether it would be better to bring Euphoria closer to core
> SDK first, maybe even merge them together. If you look at Reuven's recent
> work around schemas it seems like there are already similarities between
> that and Euphoria's approach, unless I'm missing the point (e.g. Filter
> transforms, FullJoin vs CoGroup... see [2]). And we're already switching
> parts of SQL to those transforms (e.g. SQL Aggregation is now implemented by
> core SDK's Group[3]).

Yes, these transforms seem to be very similar to those Euphoria has. Whether
or not to merge Euphoria with core is essentially just a decision for the
community (from my point of view).

> Adding explicit Schema support to Euphoria will bring it both closer to core
> SDK and make it natural to use for SQL. Can this be a first step towards
> this integration?

Euphoria currently operates on plain PCollections, so when a PCollection has
a schema, that schema is accessible to Euphoria. It makes sense to make use
of the schema in Euphoria: it seems natural on the inputs to Euphoria
operators, but it might be tricky (not saying impossible) to actually produce
schema-aware PCollections as outputs from Euphoria operators (generally
speaking; in special cases that might be possible). Regarding inputs, there
is actually an intention to act on the type of the PCollection, e.g. when a
PCollection is already of type KV, it is possible to make the key extractor
and value extractor optional in Euphoria builders. So it feels natural to
adapt the builders in the same way when they are given a schema-aware
PCollection, and to make use of the provided schema. The rest of the
Euphoria team might correct me if I'm wrong.
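
The idea of making extractors optional when the input type already carries
the structure can be sketched in plain Java. The GroupBy builder below is
hypothetical and is not Euphoria's API; Map.Entry stands in for Beam's KV,
and an in-memory collection stands in for a PCollection.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical builder for illustration; this is not Euphoria's actual API.
class OptionalExtractorSketch {

  static final class GroupBy<T, K, V> {
    private final Collection<T> input;
    private Function<T, K> keyBy;
    private Function<T, V> valueBy;

    private GroupBy(Collection<T> input, Function<T, K> keyBy, Function<T, V> valueBy) {
      this.input = input;
      this.keyBy = keyBy;
      this.valueBy = valueBy;
    }

    // Arbitrary element type: the caller has to supply both extractors.
    static <T, K, V> GroupBy<T, K, V> of(Collection<T> input) {
      return new GroupBy<T, K, V>(input, null, null);
    }

    // Input is already key/value shaped: extractors are pre-filled, so they become optional.
    static <K, V> GroupBy<Map.Entry<K, V>, K, V> ofPairs(Collection<Map.Entry<K, V>> input) {
      return new GroupBy<Map.Entry<K, V>, K, V>(input, Map.Entry::getKey, Map.Entry::getValue);
    }

    GroupBy<T, K, V> keyBy(Function<T, K> extractor) {
      this.keyBy = extractor;
      return this;
    }

    GroupBy<T, K, V> valueBy(Function<T, V> extractor) {
      this.valueBy = extractor;
      return this;
    }

    Map<K, List<V>> build() {
      Objects.requireNonNull(keyBy, "keyBy(...) is required for this input type");
      Objects.requireNonNull(valueBy, "valueBy(...) is required for this input type");
      Map<K, List<V>> groups = new LinkedHashMap<>();
      for (T element : input) {
        groups.computeIfAbsent(keyBy.apply(element), k -> new ArrayList<>())
            .add(valueBy.apply(element));
      }
      return groups;
    }
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    pairs.add(new SimpleEntry<>("a", 1));
    pairs.add(new SimpleEntry<>("b", 2));
    pairs.add(new SimpleEntry<>("a", 3));
    System.out.println(GroupBy.ofPairs(pairs).build());
    // prints: {a=[1, 3], b=[2]}

    List<String> words = Arrays.asList("to", "be", "or", "not");
    System.out.println(
        GroupBy.<String, Integer, String>of(words)
            .keyBy(String::length)
            .valueBy(w -> w)
            .build());
    // prints: {2=[to, be, or], 3=[not]}
  }
}
```

A schema-aware input could get a third factory method along the same lines,
deriving the extractors from field names instead of from the pair type.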

> One question I have is, does Euphoria bring dependencies that are not needed
> by SQL, or does more or less only rely on the core SDK?

I think the only relevant dependency that Euphoria has besides the core SDK
is Kryo. It is the default coder when no coder is provided, but that could be
made optional, e.g. the default coder would be supported only if an
appropriate module is available. That way, I think Euphoria would have no
special dependencies.
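
One common way to make such a default conditional is to probe the classpath
for the optional module at runtime. The sketch below only shows that probing
pattern; it is not how Euphoria wires its coders, and the returned string is
a placeholder where a real implementation would create a coder.

```java
import java.util.Optional;

// Illustration only: pick a default based on whether an optional module is on the classpath.
class OptionalModuleSketch {

  // True if the class can be located; the class is not initialized by this check.
  static boolean isOnClasspath(String className) {
    try {
      Class.forName(className, false, OptionalModuleSketch.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  // Returns a label for the default coder, or empty when the user must supply a coder.
  // The label is a placeholder; a real implementation would instantiate a coder here.
  static Optional<String> defaultCoder(String markerClass) {
    return isOnClasspath(markerClass)
        ? Optional.of("default coder backed by " + markerClass)
        : Optional.empty();
  }

  public static void main(String[] args) {
    // com.esotericsoftware.kryo.Kryo is Kryo's main class; the result depends on the classpath.
    System.out.println(
        defaultCoder("com.esotericsoftware.kryo.Kryo")
            .orElse("no default coder available; set one explicitly"));
  }
}
```

With this shape the DSL compiles without the optional module, and users who
do not add it simply have to name a coder themselves.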

> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>
> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>
> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179

> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je.ik@seznam.cz> wrote:
>
>> Hi community,
>>
>> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
>> discuss possible development of Java based DSLs currently present in
>> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
>> Euphoria and SQL. These DSLs currently share only the SDK itself,
>> although there might be room to share some more effort. We already know
>> that both Euphoria and SQL have need for retractions, but there are
>> probably many more features that these two could share.
>>
>> So, I'd like to open a discussion on what it would cost and what it
>> would possibly bring, if instead of the current structure
>>
>>    Java SDK
>>
>>      | ---- SQL
>>
>>      | ---- Euphoria
>>
>> these DSLs would be structured as
>>
>>    Java SDK ---> Euphoria ---> SQL
>>
>> I'm absolutely sure that this would be a great investment and a huge
>> change, but I'd like to gather some opinions and general feelings of the
>> community about this. Some points to start the discussion from my side
>> would be, that structuring DSLs like this has internal logical
>> consistency, because each API layer further narrows completeness, but
>> brings simpler API for simpler tasks, while adding additional high-level
>> view of the data processing pipeline and thus enabling more
>> optimizations. On Euphoria side, these are various implementations joins
>> (most effective implementation depends on data), pipeline sampling and
>> more. Some (or maybe most) of these optimizations would have to be
>> implemented in both DSLs, so implementing them once is beneficial.
>> Another benefit is that this would bring Euphoria "closer" to Beam core
>> development (which would be good, it is part of the project anyway,
>> right? :)) and help better drive features, that although currently
>> needed mostly by SQL, might be needed by other Java users anyway.
>>
>> Thanks for discussion and looking forward to any opinions.
>>
>>    Jan
>>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Anton Kedin <ke...@google.com>.

I think this approach makes sense in general; Euphoria can be an
implementation detail of SQL, similar to the Join Library or core SDK Schemas.

I wonder though whether it would be better to bring Euphoria closer to core
SDK first, maybe even merge them together. If you look at Reuven's recent
work around schemas it seems like there are already similarities between
that and Euphoria's approach, unless I'm missing the point (e.g. Filter
transforms, FullJoin vs CoGroup... see [2]). And we're already switching
parts of SQL to those transforms (e.g. SQL Aggregation is now implemented
by core SDK's Group[3]).

Adding explicit Schema support to Euphoria will bring it both closer to
core SDK and make it natural to use for SQL. Can this be a first step
towards this integration?

One question I have is, does Euphoria bring dependencies that are not
needed by SQL, or does it more or less rely only on the core SDK?

[1]
https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
[2]
https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
[3]
https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179



On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz> wrote:

> Hi community,
>
> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
> discuss possible development of Java based DSLs currently present in
> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
> Euphoria and SQL. These DSLs currently share only the SDK itself,
> although there might be room to share some more effort. We already know
> that both Euphoria and SQL have need for retractions, but there are
> probably many more features that these two could share.
>
> So, I'd like to open a discussion on what it would cost and what it
> would possibly bring, if instead of the current structure
>
>    Java SDK
>
>      | ---- SQL
>
>      | ---- Euphoria
>
> these DSLs would be structured as
>
>    Java SDK ---> Euphoria ---> SQL
>
> I'm absolutely sure that this would be a great investment and a huge
> change, but I'd like to gather some opinions and general feelings of the
> community about this. Some points to start the discussion from my side
> would be, that structuring DSLs like this has internal logical
> consistency, because each API layer further narrows completeness, but
> brings simpler API for simpler tasks, while adding additional high-level
> view of the data processing pipeline and thus enabling more
> optimizations. On Euphoria side, these are various implementations joins
> (most effective implementation depends on data), pipeline sampling and
> more. Some (or maybe most) of these optimizations would have to be
> implemented in both DSLs, so implementing them once is beneficial.
> Another benefit is that this would bring Euphoria "closer" to Beam core
> development (which would be good, it is part of the project anyway,
> right? :)) and help better drive features, that although currently
> needed mostly by SQL, might be needed by other Java users anyway.
>
> Thanks for discussion and looking forward to any opinions.
>
>    Jan
>
>

Re: [DISCUSS] Structuring Java based DSLs

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi,

it sounds good to me.

Regards
JB

On 30/11/2018 15:29, Jan Lukavský wrote:
> Hi community,
> 
> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
> discuss possible development of Java based DSLs currently present in
> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
> Euphoria and SQL. These DSLs currently share only the SDK itself,
> although there might be room to share some more effort. We already know
> that both Euphoria and SQL have need for retractions, but there are
> probably many more features that these two could share.
> 
> So, I'd like to open a discussion on what it would cost and what it
> would possibly bring, if instead of the current structure
> 
>   Java SDK
> 
>     | ---- SQL
> 
>     | ---- Euphoria
> 
> these DSLs would be structured as
> 
>   Java SDK ---> Euphoria ---> SQL
> 
> I'm absolutely sure that this would be a great investment and a huge
> change, but I'd like to gather some opinions and general feelings of the
> community about this. Some points to start the discussion from my side
> would be, that structuring DSLs like this has internal logical
> consistency, because each API layer further narrows completeness, but
> brings simpler API for simpler tasks, while adding additional high-level
> view of the data processing pipeline and thus enabling more
> optimizations. On Euphoria side, these are various implementations joins
> (most effective implementation depends on data), pipeline sampling and
> more. Some (or maybe most) of these optimizations would have to be
> implemented in both DSLs, so implementing them once is beneficial.
> Another benefit is that this would bring Euphoria "closer" to Beam core
> development (which would be good, it is part of the project anyway,
> right? :)) and help better drive features, that although currently
> needed mostly by SQL, might be needed by other Java users anyway.
> 
> Thanks for discussion and looking forward to any opinions.
> 
>   Jan
> 

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com