You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Julian Hyde (JIRA)" <ji...@apache.org> on 2018/05/16 20:49:00 UTC

[jira] [Commented] (CALCITE-2312) Support Partition By in sql select statement

    [ https://issues.apache.org/jira/browse/CALCITE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478083#comment-16478083 ] 

Julian Hyde commented on CALCITE-2312:
--------------------------------------

Hive added PARTITION BY to their SQL and I think they lived to regret it. It's not good to surface physical stuff (the "how") in logical SQL (the "what").

So, I wouldn't be keen to add this to Calcite's core SQL. You could do it as an extension of course.

But better might be to do CREATE STREAM that creates a stream with a partitioning strategy built in. Or, if it is an intermediate result in the query, let the optimizer figure out what partitioning strategy to use. Optimizers figure out partitioning and sorting for hash joins and merge joins today. 

> Support Partition By in sql select statement
> --------------------------------------------
>
>                 Key: CALCITE-2312
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2312
>             Project: Calcite
>          Issue Type: New Feature
>          Components: core
>            Reporter: Fei Xu
>            Assignee: Julian Hyde
>            Priority: Major
>
> I noticed that calcite already have an Exchange RelNode represents distribution on RelNode's input, but no sql clause support. 
>  
> We are use calcite building SQL layer for out streaming platform. and in streaming computation, data shuffle is a very important function, not only for our engine, but also for our users.
> For example, In stream-join-table, the engine will load the table data to a cache at runtime, and stream join table is actually stream look up table.
> If the stream data could partition by hash or range, before look up table. It will be cache friendly cause particular stream data look up particular table data. 
>  
> So I consider do some extensions on sql clause to support Exchange, e.g. 
> {code:java}
> // shuffle data using hash_distribution
> INSERT INTO output
> SELECT
>   *
> FROM orders o
> PARTITION BY o.productId
> // shuffle data to a singleton node
> INSERT INTO output
> SELECT
>   *
> FROM orders o
> PARTITION BY SINGLETON
> // shuffle data to all nodes 
> INSERT INTO output
> SELECT
>   *
> FROM orders o
> PARTITION BY BROADCAST
> // shuffle data to random node
> INSERT INTO output
> SELECT
>   *
> FROM orders o
> PARTITION BY RANDOM
> // shuffle data using round-robin policy
> INSERT INTO output
> SELECT
>   *
> FROM orders o
> PARTITION BY ROUND_ROBIN
> // shuffle data using range policy
> // Current I'm not sure about the appropriate clause to represents range 
> // shuffle, so it is just an demo.
> INSERT INTO output
> SELECT
>   *
> FROM orders o
> PARTITION BY RANGE o.productId, 0, 4096 
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)