Posted to commits@samza.apache.org by "Milinda Lakmal Pathirage (JIRA)" <ji...@apache.org> on 2015/01/13 15:58:35 UTC

[jira] [Commented] (SAMZA-483) A common representation of relational algebra for streaming SQL

    [ https://issues.apache.org/jira/browse/SAMZA-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275300#comment-14275300 ] 

Milinda Lakmal Pathirage commented on SAMZA-483:
------------------------------------------------

I have started looking into defining a common representation (object model) for the streaming algebra. Below are my thoughts and the questions that came to mind.

If we look at the normal flow from a query to an execution plan, the process involves at least the following steps (in the context of a DSL, steps 1 and 2 may or may not be present):

1. Tokenization
2. Parsing (generates an abstract syntax tree)
3. Semantic analysis
4. Optimization
5. Query plan generation (code generation in the case of compilers)
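
As a rough illustration of how these five steps chain together, here is a minimal Python sketch. Everything in it is made up for this example and is not a proposal for the actual API: the toy grammar, the CATALOG of streams, the function names, and the shape of the plan. The optimizer is a placeholder that returns its input unchanged.

```python
import re

# Hypothetical catalog used by the semantic check: stream name -> field names.
CATALOG = {"S1": {"a", "b"}}

def tokenize(query):
    """Step 1: split query text into identifiers, numbers and comparison symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[<>=]", query)

def parse(tokens):
    """Step 2: build an AST for the toy grammar
    SELECT <field> FROM <stream> WHERE <field> <op> <number>."""
    keywords = [t.upper() for t in tokens[0:5:2]]
    if len(tokens) != 8 or keywords != ["SELECT", "FROM", "WHERE"]:
        raise SyntaxError("unsupported query shape")
    return {"select": tokens[1], "from": tokens[3],
            "where": (tokens[5], tokens[6], int(tokens[7]))}

def analyze(ast):
    """Step 3: check that the stream and every referenced field exist."""
    fields = CATALOG.get(ast["from"])
    if fields is None:
        raise ValueError("unknown stream: " + ast["from"])
    for name in (ast["select"], ast["where"][0]):
        if name not in fields:
            raise ValueError("unknown field: " + name)
    return ast

def optimize(ast):
    """Step 4: placeholder only; a real optimizer would rewrite the tree."""
    return ast

def plan(ast):
    """Step 5: emit an ordered list of operators (scan, filter, project)."""
    return [("scan", ast["from"]),
            ("filter", ast["where"]),
            ("project", [ast["select"]])]

def compile_query(query):
    """Full path for a textual query: steps 1 to 5."""
    return plan(optimize(analyze(parse(tokenize(query)))))

def compile_ast(ast):
    """Entry point for a DSL that builds the tree itself and skips steps 1 and 2."""
    return plan(optimize(analyze(ast)))

if __name__ == "__main__":
    print(compile_query("SELECT a FROM S1 WHERE b > 10"))
```

The two entry points mirror the remark about DSLs: a textual query goes through all five steps, while a DSL that already holds a tree can enter at step 3.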

If we take a compiler infrastructure like LLVM, it starts somewhere between steps 3 and 4 (AFAIK, some types of semantic analysis can happen at the intermediate representation (IR) layer). LLVM defines LLVM IR, and front-ends such as Clang generate LLVM IR from C/C++ code. In addition to generating LLVM IR, Clang takes care of parsing and semantic analysis.

Say we map the LLVM scenario to our problem:

- Do we need something like LLVM IR (with semantic analysis handled by an upper layer)?
- Or do we need to include semantic analysis in this layer as well?

I prefer the LLVM IR-like model, letting an upper layer handle semantic analysis. Even in this case we have several complications:

- Is this model going to be an object model for streaming SQL?
- Or will a relational algebra-like model be enough?

A relational algebra-like model would be a representation of an extended relational algebra expression ('extended' because there are streaming-specific modifications), which looks like the following (I made this expression format up for this example).

σ (expressions) π (field_list) ρ (rename_list) ((ω (window_spec) S1) ⋈ (ω (window_spec) S2))

σ - Selection
π - Projection
ρ - Renaming
ω - Window operator
⋈ - Natural join
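
To make the idea of an object model for such an expression more concrete, here is a minimal Python sketch with one immutable node class per operator in the legend above. The class names, field names, the "range 10s" window spec, and the render function are all assumptions made for illustration. They are not an agreed design, and the window spec syntax in particular is invented.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Stream:            # a named input stream, e.g. S1
    name: str

@dataclass(frozen=True)
class Window:            # ω: applies a window specification to its source
    spec: str
    source: object

@dataclass(frozen=True)
class Join:              # ⋈: natural join of two inputs
    left: object
    right: object

@dataclass(frozen=True)
class Rename:            # ρ: (old, new) field name pairs
    mapping: Tuple[Tuple[str, str], ...]
    source: object

@dataclass(frozen=True)
class Project:           # π: list of fields to keep
    fields: Tuple[str, ...]
    source: object

@dataclass(frozen=True)
class Select:            # σ: predicate kept as plain text in this sketch
    predicate: str
    source: object

def render(node):
    """Print a tree back in the made-up expression format used above."""
    if isinstance(node, Stream):
        return node.name
    if isinstance(node, Window):
        return "(ω ({}) {})".format(node.spec, render(node.source))
    if isinstance(node, Join):
        return "({} ⋈ {})".format(render(node.left), render(node.right))
    if isinstance(node, Rename):
        pairs = ", ".join("{}→{}".format(old, new) for old, new in node.mapping)
        return "ρ ({}) {}".format(pairs, render(node.source))
    if isinstance(node, Project):
        return "π ({}) {}".format(", ".join(node.fields), render(node.source))
    if isinstance(node, Select):
        return "σ ({}) {}".format(node.predicate, render(node.source))
    raise TypeError("unknown node: {!r}".format(node))

# The example expression, built bottom-up from the two windowed streams.
joined = Join(Window("range 10s", Stream("S1")),
              Window("range 10s", Stream("S2")))
expr = Select("a > 10",
              Project(("a", "b"),
                      Rename((("x", "a"),), joined)))

if __name__ == "__main__":
    print(render(expr))
```

A front-end (SQL parser or DSL) would construct a tree like expr, and a later layer would walk it to produce the job description. Keeping the nodes immutable makes it easier for rewrite passes to share subtrees.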

There are pros and cons to both the SQL-like model and the relational algebra-like model. For example, DSL developers need to generate a relational algebra model from their internal representations. Depending on the DSL and its internals, generating a SQL-like model may be easier than generating a relational model. On the other hand, a relational model may be easier to generate if the developer of the DSL (or any other high-level API) knows how to map his/her language/API constructs to relational algebra.

Please let me know what you think about this.

> A common representation of relational algebra for streaming SQL 
> ----------------------------------------------------------------
>
>                 Key: SAMZA-483
>                 URL: https://issues.apache.org/jira/browse/SAMZA-483
>             Project: Samza
>          Issue Type: Sub-task
>            Reporter: Yi Pan (Data Infrastructure)
>            Priority: Minor
>              Labels: project
>
> Per discussion with [~criccomini] and [~milinda], we agreed that it seems to be a good idea to define a common representation of relational algebra on top of the operators defined in the operator layer (see SAMZA-482), which can be the common base that we can use to generate the description/configuration of a Samza job.
> This common layer can also be used by DSL-like language parser as a result of parsing a DSL program.
> Some additional requirements needed in addition to pure relational algebra:
> 1) the common representation should include window operators and stream operators (i.e. IStream/DStream/RStream)
> 2) the common representation should include description on parallelism of the jobs (i.e. how many partitions the resultant Samza job will use)
> Some references:
> http://web.cs.wpi.edu/~mukherab/i/DCAPE.pdf
> https://cs.uwaterloo.ca/~david/cs848/stream-cql.pdf
> http://davis.wpi.edu/dsrg/PROJECTS/CAPE/publications.htm
> http://davis.wpi.edu/dsrg/PROJECTS/CAPE/slides.htm


