Posted to issues@calcite.apache.org by "Lai Zhou (JIRA)" <ji...@apache.org> on 2019/05/06 09:37:00 UTC

[jira] [Comment Edited] (CALCITE-2741) Add operator table with Hive-specific built-in functions

    [ https://issues.apache.org/jira/browse/CALCITE-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833649#comment-16833649 ] 

Lai Zhou edited comment on CALCITE-2741 at 5/6/19 9:36 AM:
-----------------------------------------------------------

[~zabetak], I also think it is not exactly an adapter. My initial goal was to build a real-time, high-performance, in-memory SQL engine on top of Calcite that supports the Hive SQL dialect.

I tried the JDBC interface first, but I encountered some issues:
 # Custom config issue: for every JDBC connection, we need to put the data of the current session into the schema, which means the current schema is bound to the current session.

So a static SchemaFactory can't handle this; we need to introduce DDL functions like those in the calcite-server module. The SqlDdlNodes in the calcite-server module populate the table through the FrameworkConfig API.

When we execute a statement like
{code:java}
create table t1 as select * from t2 where t2.id > 100{code}
the populate method is invoked; see [SqlDdlNodes.java#L221|https://github.com/apache/calcite/blob/0d504d20d47542e8d461982512ae0e7a94e4d6cb/server/src/main/java/org/apache/calcite/sql/ddl/SqlDdlNodes.java#L221]. We need to customize the FrameworkConfig here, including the OperatorTable, the SqlConformance, and other custom configs. By the way, the FrameworkConfig should be built with all the configs from the current CalcitePrepare.Context rather than only the rootSchema; that was a bug.

Moreover, the config options of CalcitePrepare.Context are just a subset of FrameworkConfig's, so most of the time we need to use the FrameworkConfig API directly to build a new SQL engine.
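To make this concrete, here is a minimal sketch of building such a per-session FrameworkConfig directly. The hiveOperatorTable parameter is a hypothetical custom SqlOperatorTable (for example, one holding Hive built-in functions), and the LENIENT conformance is just an illustrative choice:
{code:java}
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlOperatorTable;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.validate.SqlConformanceEnum;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;

public class SessionConfigs {
  /** Builds a FrameworkConfig for one session; hiveOperatorTable is a
   * hypothetical SqlOperatorTable holding Hive built-in functions. */
  static FrameworkConfig newSessionConfig(SqlOperatorTable hiveOperatorTable) {
    // Each session gets its own root schema, so session data stays private.
    SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    return Frameworks.newConfigBuilder()
        .defaultSchema(rootSchema)
        .parserConfig(SqlParser.configBuilder()
            .setConformance(SqlConformanceEnum.LENIENT)
            .build())
        .operatorTable(hiveOperatorTable)
        .build();
  }
}
{code}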

When we execute a query like
{code:java}
select * from t2 where t2.id > 100
{code}
CalcitePrepareImpl handles the flow. It does a similar thing, but some configs are hard-coded, such as the RexExecutor and the Programs.
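For comparison, the Planner API drives the same parse/validate/convert flow while taking the programs and executor from the FrameworkConfig rather than hard-coding them. A minimal sketch, assuming a config like the one built above:
{code:java}
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class PlannerFlow {
  /** Parses, validates, and converts one query with the supplied config. */
  static RelNode toRel(FrameworkConfig config, String sql) throws Exception {
    Planner planner = Frameworks.getPlanner(config);
    SqlNode parsed = planner.parse(sql);
    SqlNode validated = planner.validate(parsed);
    // Conversion and later optimization use the programs/executor
    // supplied in the FrameworkConfig, not hard-coded defaults.
    return planner.rel(validated).project();
  }
}
{code}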

When implementing the EnumerableRel, the RelImplementor might also need to be customized; see [HiveEnumerableRelImplementor.java|https://github.com/51nb/marble/blob/master/marble-table-hive/src/main/java/org/apache/calcite/adapter/hive/HiveEnumerableRelImplementor.java] for an example.

Currently the JDBC interface doesn't provide a way to customize these configs, so we propose a new Table API, inspired by Apache Flink, to simplify the use of Calcite when building a new SQL engine.

 2. Cache issue: it's not easy to cache the whole SQL plan when handling a query through the JDBC interface, due to its multi-phase processing flow, but it is very easy to do with the Table API; see [TableEnv.java#L412|https://github.com/51nb/marble/blob/master/marble-table/src/main/java/org/apache/calcite/table/TableEnv.java#L412]. A sketch of the idea follows.
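The caching itself can be as simple as keying the fully compiled plan by its SQL text. This is only an illustrative sketch, not the actual TableEnv code; the type parameter P stands in for whatever executable plan representation the engine uses:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class PlanCache<P> {
  // Maps SQL text to its fully compiled plan, so parsing, validation,
  // optimization, and code generation run once per distinct query.
  private final Map<String, P> cache = new ConcurrentHashMap<>();

  public P getOrCompile(String sql, Function<String, P> compiler) {
    return cache.computeIfAbsent(sql, compiler);
  }
}
{code}
Because the data sources and queries are deterministic, a cache hit skips the whole prepare pipeline on every execution after the first.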

Summary:

The proposed Table API makes it easy to configure the SQL engine and to cache the whole SQL plan to improve query performance. It fits scenarios that satisfy these conditions:
* the data sources are deterministic and already in memory, so there is no computation that needs to be pushed down;
* the SQL queries are deterministic, without dynamic parameters, so caching the whole SQL plan is helpful (we could also use placeholders in the execution plan to cache dynamic queries, as sketched below).
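For the dynamic case, one option is to normalize literals into placeholders before the cache lookup, so that queries differing only in constant values share one cached plan. A crude, purely hypothetical sketch; a real implementation would rewrite the parsed SqlNode tree rather than use a regex:
{code:java}
public class SqlNormalizer {
  /** Replaces integer literals with '?' so the normalized text can be
   * used as the plan-cache key. Illustration only. */
  static String normalize(String sql) {
    // "select * from t2 where t2.id > 100"
    //   becomes "select * from t2 where t2.id > ?"
    return sql.replaceAll("\\b\\d+\\b", "?");
  }
}
{code}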

> Add operator table with Hive-specific built-in functions
> --------------------------------------------------------
>
>                 Key: CALCITE-2741
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2741
>             Project: Calcite
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 1.19.0
>            Reporter: Lai Zhou
>            Priority: Minor
>
> I wrote a Hive adapter for Calcite to support Hive SQL, including UDFs, UDAFs, UDTFs, and some SqlSpecialOperators.
> What do you think of supporting a direct implementation of Hive SQL like this?
> I think it will be valuable when someone wants to migrate their Hive ETL jobs to a real-time scenario.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)