Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2020/12/11 07:58:10 UTC

[DISCUSS] SQL Support using Apache Calcite

Hello all,

One feature that keeps coming up is the ability to use UPDATE/MERGE SQL
syntax to write into Hudi tables. We have looked into the Spark 3
DataSource V2 APIs as well and found several issues that hinder us from
implementing this via the Spark APIs:

- As of this writing, the UPDATE/MERGE syntax is not really opened up to
external datasources like Hudi; only DELETE is.
- The DataSource V2 API offers no flexibility to perform any kind of
further transformation on the dataframe. Hudi supports keys, indexes,
preCombining and custom partitioning that ensures file sizes etc. All of this
needs shuffling data, looking up/joining against other dataframes, and so forth.
Today, the DataSource V1 API allows this kind of further
repartitioning/transformation, but the V2 API simply offers partition-level
iteration once the user calls df.write.format("hudi") (see the sketch below).
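For context, this is roughly what a Hudi upsert looks like through the
DataSource V1 path today (a sketch only; the table name, field names and paths
are placeholders). Behind this single call, Hudi is free to do the key/index
lookups, preCombine and file-sizing repartitioning described above, which is
exactly the flexibility the V2 partition-level iteration model takes away:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("hudi-upsert-sketch").getOrCreate()

    // Placeholder input; any DataFrame with a key, ordering and partition field works.
    val df = spark.read.json("s3://some-bucket/trips-input/")

    df.write.format("hudi").
      option("hoodie.table.name", "trips").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.datasource.write.partitionpath.field", "partitionpath").
      option("hoodie.datasource.write.operation", "upsert").
      mode(SaveMode.Append).
      save("s3://some-bucket/hudi/trips")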

One thought I had is to explore Apache Calcite and write an adapter for
Hudi. This frees us from being very dependent on a particular engine's
syntax support, like Spark's. Calcite is very popular in its own right and
supports most of the key SQL keywords (and also a more streaming-friendly
syntax). To be clear, we would still use Spark/Flink underneath to perform
the actual writing; just the SQL grammar is provided by Calcite.
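As a small taste of the parsing side, Calcite's bundled parser already
understands UPDATE (and MERGE) statements. A minimal sketch using the default
parser config, with a made-up table and columns:

    import org.apache.calcite.sql.parser.SqlParser

    // Parse an UPDATE with Calcite's default parser. The resulting SqlNode tree is
    // what a Hudi adapter would translate into a Spark/Flink write job.
    val sql = "UPDATE hudi_trips SET fare = fare * 1.1 WHERE driver = 'driver-213'"
    val sqlNode = SqlParser.create(sql).parseStmt()
    println(sqlNode.getKind)   // prints: UPDATE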

To give a taste of how this would look:

A) If the user wants to mutate a Hudi table using SQL,

instead of writing something like: spark.sql("UPDATE ....")
users would write: hudiSparkSession.sql("UPDATE ....")

B) To save a Spark data frame to a Hudi table,
we continue to use Spark DataSource V1.
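Purely as an illustration of (A) and (B) together, here is a sketch of what
such a session facade might look like. HudiSparkSession is a hypothetical
wrapper, not an existing class, and the actual Hudi write translation is
stubbed out:

    import org.apache.calcite.sql.SqlKind
    import org.apache.calcite.sql.parser.SqlParser
    import org.apache.spark.sql.SparkSession

    // Hypothetical facade: Calcite parses the statement; mutations would be routed
    // into Hudi's write path, everything else is delegated to the plain SparkSession.
    class HudiSparkSession(spark: SparkSession) {
      def sql(statement: String): Unit = {
        val node = SqlParser.create(statement).parseStmt()
        node.getKind match {
          case SqlKind.UPDATE | SqlKind.MERGE | SqlKind.DELETE =>
            println(s"would translate ${node.getKind} into a Hudi upsert/delete job")
          case _ =>
            spark.sql(statement).show()   // non-mutation SQL stays on Spark
        }
      }
    }

DataFrame saves as in (B) would be untouched: df.write.format("hudi") keeps
going through the existing DataSource V1 path.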

The obvious challenge I see is the disconnect with the Spark DataFrame
ecosystem. Users would write MERGE SQL statements by joining against other
Spark DataFrames.
If we want those expressed in Calcite as well, we need to also invest in
the full query-side support, which can increase the scope by a lot.
Some amount of investigation needs to happen, but ideally we should be able
to integrate with the Spark SQL catalog and reuse all the tables there.

I am sure there are some gaps in my thinking. Just starting this thread, so
we can discuss and others can chime in/correct me.

thanks
vinoth

Re: [DISCUSS] SQL Support using Apache Calcite

Posted by vino yang <ya...@gmail.com>.
+1 for Calcite

Best,
Vino


Re: [DISCUSS] SQL Support using Apache Calcite

Posted by David Sheard <da...@datarefactory.com.au>.
I agree with Calcite


Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Danny Chan <da...@apache.org>.
Apache Calcite is a good candidate for parsing and executing the SQL;
Apache Flink has an extension of the SQL grammar based on the Calcite parser [1].

> users will write : hudiSparkSession.sql("UPDATE ....")

Would the user still need to instantiate the hudiSparkSession first? My
desired use case is that users run these SQLs through the Hoodie CLI. They can
choose which engine to use via a CLI config option.

> If we want those expressed in Calcite as well, we need to also invest in
> the full Query side support, which can increase the scope by a lot.

That is true. My thought is that we use Calcite to execute only these
MERGE SQL statements. For DQL or DML, we would delegate parsing/execution
to the underlying engines (Flink or Spark); the Hoodie Calcite parser only
parses the query statements and hands them over to the engines. One thing to
note is the SQL dialect difference: Spark may have its own syntax (keywords)
that Calcite cannot parse/recognize (see the routing sketch below).

[1]
https://github.com/apache/flink/tree/master/flink-table/flink-sql-parser/src/main/codegen
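To make the fallback concrete, here is a rough sketch of such routing. This is
only an assumption of how it could be wired; runOnEngine is a stand-in for
handing the raw statement to Spark or Flink:

    import org.apache.calcite.sql.SqlNode
    import org.apache.calcite.sql.parser.{SqlParseException, SqlParser}

    // Try the common (Calcite-based) parser first; statements in an engine-specific
    // dialect that Calcite rejects are passed through to the configured engine.
    def execute(statement: String, runOnEngine: String => Unit): Unit = {
      val parsed: Option[SqlNode] =
        try Some(SqlParser.create(statement).parseStmt())
        catch { case _: SqlParseException => None }

      parsed match {
        case Some(node) => println(s"handled by the common SQL layer: ${node.getKind}")
        case None       => runOnEngine(statement)  // e.g. spark.sql(_) or a Flink TableEnvironment
      }
    }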


Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Kabeer Ahmed <ka...@linuxmail.org>.
Vinoth and All,

Users on gmail might be missing out on these emails as Gmail is down and emails sent to gmail.com domain are bouncing back.
At 11pm UK time below is the google update:
https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8b67908fadee664c68c240ff9f529ab
Best to bump this thread again tomorrow when GMail is back up and running.

On Dec 15 2020, at 7:06 am, Vinoth Chandar <vi...@apache.org> wrote:
> Hello all,
>
> Just bumping this thread again
> thanks
> vinoth
>


Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Danny Chan <da...@apache.org>.
That's great, I can help with the Apache Calcite integration.

On Wed, Dec 23, 2020 at 12:29 AM, Vinoth Chandar <vi...@apache.org> wrote:

> Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think.
> Love to have you involved.
>
> On Tue, Dec 22, 2020 at 3:20 AM pzwpzw <pe...@icloud.com.invalid> wrote:
>
> > Yes, it looks good.
> > We are building the Spark SQL extensions to support Hudi in our internal version.
> > I am interested in participating in the extension of SparkSQL on Hudi.
> >
> > On Dec 22, 2020 at 4:30 PM, Vinoth Chandar <vi...@apache.org> wrote:
> >
> > Hi,
> >
> > I think what we are landing on finally is:
> >
> > - Keep pushing for SparkSQL support using the Spark extensions route
> > - The Calcite effort will be a separate/orthogonal approach, down the line
> >
> > Please feel free to correct me, if I got this wrong.
> >
> > On Mon, Dec 21, 2020 at 3:30 AM pzwpzw <pengzhiwei2015@icloud.com.invalid> wrote:
> >
> > Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
> > to process engine-independent SQL, for example most of the DDL, the Hoodie CLI
> > commands, and also provide a parser for the common SQL extensions (e.g. MERGE
> > INTO). The engine-related syntax can be left to the respective engines to
> > process. If the common SQL layer can handle the input SQL, it handles it;
> > otherwise it is routed to the engine for processing. In the long term, the
> > common layer will become richer and more complete.
> >
> > On Dec 21, 2020 at 4:38 PM, 受春柏 <sc...@126.com> wrote:
> >
> > Hi, all
> >
> > That's very good: Hudi SQL syntax can support Flink, Hive and other analysis
> > components at the same time.
> > But there are some questions about SparkSQL. SparkSQL syntax is in
> > conflict with Calcite syntax. Is our strategy user migration or syntax
> > compatibility?
> > In addition, will it also support write SQL?
> >
> > On 2020-12-19 02:10:16, "Nishith" <n3...@gmail.com> wrote:
> >
> > That's awesome. Looks like we have a consensus on Calcite. Look forward to
> > the RFC as well!
> >
> > -Nishith
> >
> > On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vi...@apache.org> wrote:
> >
> > Sounds good. Look forward to an RFC/DISCUSS thread.
> >
> > Thanks
> > Vinoth
> >
> > On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <da...@apache.org> wrote:
> >
> > Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite; I would
> > add support for SQL connectors for Hoodie Flink soon ~
> > Currently, I'm preparing a refactoring of the current Flink writer code.
> >
> > On Fri, Dec 18, 2020 at 6:39 AM, Vinoth Chandar <vi...@apache.org> wrote:
> >
> > Thanks Kabeer for the note on gmail. Did not realize that. :)
> >
> > > My desired use case is user use the Hoodie CLI to execute these SQLs.
> > > They can choose what engine to use by a CLI config option.
> >
> > Yes, that is also another attractive aspect of this route. We can build out
> > a common SQL layer and have it translate to the underlying engine (sounds
> > like Hive, huh).
> > Longer term, if we really think we can more easily implement a full DML +
> > DDL + DQL, we can proceed with this.
> >
> > As others pointed out, for Spark SQL, it might be good to try the Spark
> > extensions route, before we take this on more fully.
> >
> > The other part where Calcite is great is all the support for
> > windowing/streaming in its syntax.
> > Danny, I guess we should be able to leverage that through a deeper
> > Flink/Hudi integration?
> >
> > On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vi...@apache.org> wrote:
> >
> > I think Dongwook is investigating along the same lines, and it does seem
> > better to pursue this first, before trying other approaches.
> >
> > On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pengzhiwei2015@icloud.com.invalid> wrote:
> >
> > Yeah, I agree with Nishith that one option is to look at ways to
> > plug in custom logical and physical plans in Spark. It can simplify the
> > implementation and reuse the Spark SQL syntax. Also, users familiar with
> > Spark SQL will be able to use Hudi's SQL features more quickly.
> > In fact, Spark provides the SparkSessionExtensions interface for
> > implementing custom syntax extensions and SQL rewrite rules:
> > https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> > We can use SparkSessionExtensions to extend Hoodie SQL syntax such
> > as MERGE INTO and DDL syntax.
> >
> > On Dec 15, 2020 at 3:27 PM, Nishith <n3...@gmail.com> wrote:
> >
> > Thanks for starting this thread Vinoth.
> > In general, I definitely see the need for SQL-style semantics on Hudi
> > tables. Apache Calcite is a great option to consider given DataSourceV2
> > has the limitations that you described.
> >
> > Additionally, even if Spark DataSourceV2 allowed for the flexibility, the
> > same SQL semantics would need to be supported in other engines like Flink
> > to provide the same experience to users - which in itself could also be a
> > considerable amount of work.
> > So, if we're able to generalize the SQL story along Calcite, that would
> > help reduce redundant work in some sense.
> > Although, I'm worried about a few things:
> >
> > 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
> > can be harder for users who are moving from "Hudi syntax" to "Spark syntax"
> > for cross-table joins, merges etc. using data frames. One option is to look
> > at whether there are ways to plug in custom logical and physical plans in
> > Spark. This way, although the merge-on-SparkSQL functionality may not be as
> > simple to use, it wouldn't take away performance and feature set for
> > starters; in the future we could think of having the entire query space be
> > powered by Calcite like you mentioned.
> > 2) If we continue to use DataSourceV1, is there any downside to this from
> > a performance and optimization perspective when executing the plan? I'm
> > guessing not, but I haven't delved into the code to see if there's anything
> > different apart from the API and spec.
> >
> > On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vi...@apache.org> wrote:
> >
> > Hello all,
> >
> > Just bumping this thread again
> >
> > thanks
> > vinoth

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Vinoth Chandar <vi...@apache.org>.
Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think.
Love to have you involved.

On Tue, Dec 22, 2020 at 3:20 AM pzwpzw <pe...@icloud.com.invalid>
wrote:

> Yes, it looks good .
> We are building the spark sql extensions to support for hudi in
> our internal version.
> I am interested in participating in the extension of SparkSQL on hudi.

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Posted by wei li <lw...@gmail.com>.
First, I think it is necessary to improve Spark SQL support, because the main scenario for Hudi is the data lake or warehouse, and Spark has strong ecosystem capabilities in this field.

Second, in the long run, Hudi needs a more general SQL layer, and it is very necessary to embrace Calcite. Then, based on Hudi, a powerful data management and processing service can be built.

On 2020/12/22 08:30:37, Vinoth Chandar <vi...@apache.org> wrote: 
> Hi,
> 
> I think what we are landing on finally is.
> 
> - Keep pushing for SparkSQL support using Spark extensions route
> - Calcite effort will be a separate/orthogonal approach, down the line
> 
> Please feel free to correct me, if I got this wrong.
> 

Reply:Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Posted by 受春柏 <sc...@126.com>.


Yes, I think it should be OK.



On 2020-12-22 16:30:37, "Vinoth Chandar" <vi...@apache.org> wrote:
>Hi,
>
>I think what we are landing on finally is.
>
>- Keep pushing for SparkSQL support using Spark extensions route
>- Calcite effort will be a separate/orthogonal approach, down the line
>
>Please feel free to correct me, if I got this wrong.

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Posted by pzwpzw <pe...@icloud.com.INVALID>.
Yes, it looks good.
We are building the Spark SQL extensions to support Hudi in our internal version.
I am interested in participating in the extension of SparkSQL on Hudi.
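For readers following along, here is a bare-bones sketch of what that
SparkSessionExtensions route looks like. This is only an illustration, not the
internal implementation mentioned above: the class name is made up, the
injected rule is a no-op placeholder, and the real work of rewriting
MERGE INTO/UPDATE plans into Hudi writes (plus a custom parser via
injectParser) would live behind these hooks.

    import org.apache.spark.sql.SparkSessionExtensions
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Enabled with: --conf spark.sql.extensions=<fully.qualified.ClassName>
    class HoodieSqlExtensionSketch extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        // Analysis-rule hook: a real implementation would rewrite resolved
        // MERGE INTO / UPDATE plans on Hudi tables into Hudi write commands.
        extensions.injectResolutionRule { session =>
          new Rule[LogicalPlan] {
            override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op placeholder
          }
        }
        // Extended grammar (MERGE INTO, Hudi DDL) would be added with
        // extensions.injectParser { (session, delegate) => ... }
      }
    }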
2020年12月22日 下午4:30,Vinoth Chandar <vi...@apache.org> 写道:


Hi,

I think what we are landing on finally is.

- Keep pushing for SparkSQL support using Spark extensions route
- Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me, if I got this wrong.


Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Vinoth Chandar <vi...@apache.org>.
Hi,

I think what we are landing on finally is.

- Keep pushing for SparkSQL support using Spark extensions route
- Calcite effort will be a separate/orthogonal approach, down the line

Please feel free to correct me, if I got this wrong.


Reply:Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Posted by 受春柏 <sc...@126.com>.
Hi pzwpzw,
I see what you mean. It is very necessary to implement a common layer for Hudi, and we are also planning to implement Spark SQL write capabilities for SQL-based ETL processing. The common layer and Spark SQL writes can combine to form Hudi's SQL capabilities.
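
To give a taste of the kind of SQL-based ETL write we have in mind (the exact grammar is still to be decided in this thread, so the statement below is a hypothetical example, not a committed syntax):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-sql-etl-sketch").getOrCreate()

// Hypothetical write SQL against a Hudi table; stock Spark cannot execute a
// MERGE like this today, which is exactly the gap the extension work targets.
spark.sql(
  """
    |MERGE INTO hudi_orders t
    |USING staged_orders s
    |ON t.order_id = s.order_id
    |WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    |WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount)
  """.stripMargin)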

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Posted by pzwpzw <pe...@icloud.com.INVALID>.
Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer to process engine-independent SQL, for example most of the DDL and Hoodie CLI commands, and also to provide a parser for the common SQL extensions (e.g. MERGE INTO). Engine-specific syntax can be handed to the respective engines to process: if the common SQL layer can handle the input SQL, it handles it; otherwise the statement is routed to the engine for processing. In the long term, the common layer will become richer and more complete.
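
As a strawman (all names are hypothetical, and this uses only the Calcite core parser, which already understands MERGE/UPDATE/DELETE; DDL would need calcite-server's extended parser or our own grammar), the routing decision in such a common layer could be as small as:

import org.apache.calcite.sql.SqlKind
import org.apache.calcite.sql.parser.SqlParser

object HoodieSqlRouter {
  // Decide per statement whether the common layer handles the SQL itself
  // or hands it to the underlying engine (Spark/Flink) untouched.
  def route(sql: String): String = {
    val node = SqlParser.create(sql).parseStmt()
    node.getKind match {
      case SqlKind.MERGE | SqlKind.UPDATE | SqlKind.DELETE => "common-layer"
      case _                                               => "engine"
    }
  }
}

// e.g. HoodieSqlRouter.route("MERGE INTO hudi_tbl t USING src s ON t.id = s.id " +
//      "WHEN MATCHED THEN UPDATE SET amount = s.amount") returns "common-layer"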

Reply:Re: [DISCUSS] SQL Support using Apache Calcite

Posted by 受春柏 <sc...@126.com>.
Hi all,

That's very good: Hudi SQL syntax could then support Flink, Hive and other analysis components at the same time.
But there are some questions about Spark SQL. Spark SQL syntax is in conflict with Calcite syntax. Is our strategy
user migration or syntax compatibility?
In addition, will it also support write SQL?

Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Nishith <n3...@gmail.com>.
That’s awesome. Looks like we have a consensus on Calcite. Look forward to the RFC as well!

-Nishith


Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Vinoth Chandar <vi...@apache.org>.
Sounds good. Look forward to a RFC/DISCUSS thread.

Thanks
Vinoth


Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Danny Chan <da...@apache.org>.
Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite; I will
add support for SQL connectors for Hoodie on Flink soon ~
Currently, I'm preparing a refactoring of the current Flink writer code.
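
To sketch how that could surface to SQL users (the connector name and options below are assumptions until the connector actually lands, and the table and columns are made up):

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

val tEnv = TableEnvironment.create(
  EnvironmentSettings.newInstance().inStreamingMode().build())

// Hypothetical Hudi-backed table registered through a Flink SQL connector.
tEnv.executeSql(
  """
    |CREATE TABLE hudi_orders (
    |  order_id STRING,
    |  amount   DOUBLE,
    |  ts       TIMESTAMP(3)
    |) WITH (
    |  'connector' = 'hudi',
    |  'path' = 'hdfs:///tmp/hudi_orders'
    |)
  """.stripMargin)

// Writes then become plain INSERTs that the connector would turn into Hudi writes.
tEnv.executeSql(
  "INSERT INTO hudi_orders SELECT order_id, amount, ts FROM source_orders")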


Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Vinoth Chandar <vi...@apache.org>.
Thanks Kabeer for the note on gmail. Did not realize that.  :)

>>  My desired use case is user use the Hoodie CLI to execute these SQLs.
They can choose what engine to use by a CLI config option.

Yes, that is another attractive aspect of this route. We can build out
a common SQL layer and have it translate to the underlying engine
(sounds like Hive, huh?).
Longer term, if we really think we can more easily implement a full DML +
DDL + DQL, we can proceed with this.

As others pointed out, for Spark SQL, it might be good to try the Spark
extensions route, before we take this on more fully.

The other part where Calcite is great is all the support for
windowing/streaming in its syntax.
Danny, I guess we should be able to leverage that through a deeper
Flink/Hudi integration?
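
To make the common SQL layer idea a bit more concrete, here is a rough
sketch: parse the statement with Calcite's stock parser and hand DML off to
whichever engine is configured. The object name and the hand-off are only
placeholders, not a worked-out design.

    import org.apache.calcite.sql.SqlKind
    import org.apache.calcite.sql.parser.SqlParser

    object HoodieSqlDispatcher {
      // Parse with Calcite's default grammar, then route by statement kind.
      def dispatch(sql: String): Unit = {
        val node = SqlParser.create(sql).parseStmt()
        node.getKind match {
          case SqlKind.MERGE | SqlKind.UPDATE | SqlKind.DELETE =>
            println(s"hand ${node.getKind} to the configured writer (Spark/Flink)")
          case _ =>
            println(s"delegate ${node.getKind} to the engine's own SQL support")
        }
      }
    }

    // e.g. HoodieSqlDispatcher.dispatch(
    //   "MERGE INTO hudi_trips t USING updates u ON t.uuid = u.uuid " +
    //   "WHEN MATCHED THEN UPDATE SET fare = u.fare")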


On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vi...@apache.org> wrote:

> I think Dongwook is investigating on the same lines. and it does seem
> better to pursue this first, before trying other approaches.
>
>
>
> On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pe...@icloud.com.invalid>
> wrote:
>
> >    Yeah I agree with Nishith that an option way is to look at the ways to
> > plug in custom logical and physical plans in Spark. It can simplify the
> > implementation and reuse the Spark SQL syntax. And also users familiar
> with
> > Spark SQL will be able to use HUDi's SQL features more quickly.
> > In fact, spark have provided the SparkSessionExtensions interface for
> > implement custom syntax extensions and SQL rewrite rule.
> >
> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> .
> > We can use the SparkSessionExtensions to extended hoodie sql syntax such
> > as MERGE INTO and DDL syntax.
> >
> > On Dec 15, 2020, at 3:27 PM, Nishith <n3...@gmail.com> wrote:
> >
> > Thanks for starting this thread Vinoth.
> > In general, definitely see the need for SQL style semantics on Hudi
> > tables. Apache Calcite is a great option to considering given
> DatasourceV2
> > has the limitations that you described.
> >
> > Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
> > same SQL semantics needs to be supported in other engines like Flink to
> > provide the same experience to users - which in itself could also be
> > considerable amount of work.
> > So, if we’re able to generalize on the SQL story along Calcite, that
> would
> > help reduce redundant work in some sense.
> > Although, I’m worried about a few things
> >
> > 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
> > can be harder for users who are moving from “Hudi syntax” to “Spark
> syntax”
> > for cross table joins, merges etc using data frames. One option is to
> look
> > at the if there are ways to plug in custom logical and physical plans in
> > Spark, this way, although the merge on sparksql functionality may not be
> as
> > simple to use, but wouldn’t take away performance and feature set for
> > starters, in the future we could think of having the entire query space
> be
> > powered by calcite like you mentioned
> > 2) If we continue to use DatasourceV1, is there any downside to this from
> > a performance and optimization perspective when executing plan - I’m
> > guessing not but haven’t delved into the code to see if there’s anything
> > different apart from the API and spec.
> >
> > On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vi...@apache.org> wrote:
> >
> >
> > Hello all,
> >
> >
> > Just bumping this thread again
> >
> >
> > thanks
> >
> > vinoth
> >
> >
> > On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vi...@apache.org>
> wrote:
> >
> >
> > Hello all,
> >
> >
> > One feature that keeps coming up is the ability to use UPDATE, MERGE sql
> >
> > syntax to support writing into Hudi tables. We have looked into the
> Spark 3
> >
> > DataSource V2 APIs as well and found several issues that hinder us in
> >
> > implementing this via the Spark APIs
> >
> >
> > - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> >
> > external datasources like Hudi. only DELETE is.
> >
> > - DataSource V2 API offers no flexibility to perform any kind of
> >
> > further transformations to the dataframe. Hudi supports keys, indexes,
> >
> > preCombining and custom partitioning that ensures file sizes etc. All
> this
> >
> > needs shuffling data, looking up/joining against other dataframes so
> forth.
> >
> > Today, the DataSource V1 API allows this kind of further
> >
> > partitions/transformations. But the V2 API is simply offers partition
> level
> >
> > iteration once the user calls df.write.format("hudi")
> >
> >
> > One thought I had is to explore Apache Calcite and write an adapter for
> >
> > Hudi. This frees us from being very dependent on a particular engine's
> >
> > syntax support like Spark. Calcite is very popular by itself and supports
> >
> > most of the key words and (also more streaming friendly syntax). To be
> >
> > clear, we will still be using Spark/Flink underneath to perform the
> actual
> >
> > writing, just that the SQL grammar is provided by Calcite.
> >
> >
> > To give a taste of how this will look like.
> >
> >
> > A) If the user wants to mutate a Hudi table using SQL
> >
> >
> > Instead of writing something like : spark.sql("UPDATE ....")
> >
> > users will write : hudiSparkSession.sql("UPDATE ....")
> >
> >
> > B) To save a Spark data frame to a Hudi table
> >
> > we continue to use Spark DataSource V1
> >
> >
> > The obvious challenge I see is the disconnect with the Spark DataFrame
> >
> > ecosystem. Users would write MERGE SQL statements by joining against
> other
> >
> > Spark DataFrames.
> >
> > If we want those expressed in calcite as well, we need to also invest in
> >
> > the full Query side support, which can increase the scope by a lot.
> >
> > Some amount of investigation needs to happen, but ideally we should be
> >
> > able to integrate with the sparkSQL catalog and reuse all the tables
> there.
> >
> >
> > I am sure there are some gaps in my thinking. Just starting this thread,
> >
> > so we can discuss and others can chime in/correct me.
> >
> >
> > thanks
> >
> > vinoth
> >
> >
> >
>

Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Vinoth Chandar <vi...@apache.org>.
I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.



On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pe...@icloud.com.invalid>
wrote:

>    Yeah I agree with Nishith that an option way is to look at the ways to
> plug in custom logical and physical plans in Spark. It can simplify the
> implementation and reuse the Spark SQL syntax. And also users familiar with
> Spark SQL will be able to use HUDi's SQL features more quickly.
> In fact, spark have provided the SparkSessionExtensions interface for
> implement custom syntax extensions and SQL rewrite rule.
> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html.
> We can use the SparkSessionExtensions to extended hoodie sql syntax such
> as MERGE INTO and DDL syntax.
>
> On Dec 15, 2020, at 3:27 PM, Nishith <n3...@gmail.com> wrote:
>
> Thanks for starting this thread Vinoth.
> In general, definitely see the need for SQL style semantics on Hudi
> tables. Apache Calcite is a great option to considering given DatasourceV2
> has the limitations that you described.
>
> Additionally, even if Spark DatasourceV2 allowed for the flexibility, the
> same SQL semantics needs to be supported in other engines like Flink to
> provide the same experience to users - which in itself could also be
> considerable amount of work.
> So, if we’re able to generalize on the SQL story along Calcite, that would
> help reduce redundant work in some sense.
> Although, I’m worried about a few things
>
> 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
> can be harder for users who are moving from “Hudi syntax” to “Spark syntax”
> for cross table joins, merges etc using data frames. One option is to look
> at the if there are ways to plug in custom logical and physical plans in
> Spark, this way, although the merge on sparksql functionality may not be as
> simple to use, but wouldn’t take away performance and feature set for
> starters, in the future we could think of having the entire query space be
> powered by calcite like you mentioned
> 2) If we continue to use DatasourceV1, is there any downside to this from
> a performance and optimization perspective when executing plan - I’m
> guessing not but haven’t delved into the code to see if there’s anything
> different apart from the API and spec.
>
> On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vi...@apache.org> wrote:
>
>
> Hello all,
>
>
> Just bumping this thread again
>
>
> thanks
>
> vinoth
>
>
> On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vi...@apache.org> wrote:
>
>
> Hello all,
>
>
> One feature that keeps coming up is the ability to use UPDATE, MERGE sql
>
> syntax to support writing into Hudi tables. We have looked into the Spark 3
>
> DataSource V2 APIs as well and found several issues that hinder us in
>
> implementing this via the Spark APIs
>
>
> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
>
> external datasources like Hudi. only DELETE is.
>
> - DataSource V2 API offers no flexibility to perform any kind of
>
> further transformations to the dataframe. Hudi supports keys, indexes,
>
> preCombining and custom partitioning that ensures file sizes etc. All this
>
> needs shuffling data, looking up/joining against other dataframes so forth.
>
> Today, the DataSource V1 API allows this kind of further
>
> partitions/transformations. But the V2 API is simply offers partition level
>
> iteration once the user calls df.write.format("hudi")
>
>
> One thought I had is to explore Apache Calcite and write an adapter for
>
> Hudi. This frees us from being very dependent on a particular engine's
>
> syntax support like Spark. Calcite is very popular by itself and supports
>
> most of the key words and (also more streaming friendly syntax). To be
>
> clear, we will still be using Spark/Flink underneath to perform the actual
>
> writing, just that the SQL grammar is provided by Calcite.
>
>
> To give a taste of how this will look like.
>
>
> A) If the user wants to mutate a Hudi table using SQL
>
>
> Instead of writing something like : spark.sql("UPDATE ....")
>
> users will write : hudiSparkSession.sql("UPDATE ....")
>
>
> B) To save a Spark data frame to a Hudi table
>
> we continue to use Spark DataSource V1
>
>
> The obvious challenge I see is the disconnect with the Spark DataFrame
>
> ecosystem. Users would write MERGE SQL statements by joining against other
>
> Spark DataFrames.
>
> If we want those expressed in calcite as well, we need to also invest in
>
> the full Query side support, which can increase the scope by a lot.
>
> Some amount of investigation needs to happen, but ideally we should be
>
> able to integrate with the sparkSQL catalog and reuse all the tables there.
>
>
> I am sure there are some gaps in my thinking. Just starting this thread,
>
> so we can discuss and others can chime in/correct me.
>
>
> thanks
>
> vinoth
>
>
>

Re: [DISCUSS] SQL Support using Apache Calcite

Posted by pzwpzw <pe...@icloud.com.INVALID>.
   Yeah, I agree with Nishith that one option is to look at ways to plug in custom logical and physical plans in Spark. That would simplify the implementation, reuse the Spark SQL syntax, and let users already familiar with Spark SQL pick up Hudi's SQL features more quickly.
  In fact, Spark already provides the SparkSessionExtensions interface for implementing custom syntax extensions and SQL rewrite rules: https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html. We could use SparkSessionExtensions to extend the Hudi SQL syntax with statements such as MERGE INTO and the DDL commands.
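
For instance, a minimal sketch of such an extension could look like the
following. The class name is assumed for illustration and the injected rule
is just an identity placeholder; a real implementation would inject a parser
for MERGE INTO/DDL and rules that rewrite those plans onto Hudi's existing
write path.

    import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Registered with: spark.sql.extensions=org.apache.hudi.HoodieSparkSessionExtension
    class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        // Placeholder resolution rule; a real one would rewrite MERGE INTO /
        // UPDATE plans onto the existing Hudi write path.
        extensions.injectResolutionRule { _: SparkSession =>
          new Rule[LogicalPlan] {
            override def apply(plan: LogicalPlan): LogicalPlan = plan
          }
        }
        // A custom parser for MERGE INTO / DDL could be injected the same way
        // via injectParser, delegating anything else to the default parser.
      }
    }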


On Dec 15, 2020, at 3:27 PM, Nishith <n3...@gmail.com> wrote:


Thanks for starting this thread Vinoth.
In general, definitely see the need for SQL style semantics on Hudi tables. Apache Calcite is a great option to considering given DatasourceV2 has the limitations that you described.

Additionally, even if Spark DatasourceV2 allowed for the flexibility, the same SQL semantics needs to be supported in other engines like Flink to provide the same experience to users - which in itself could also be considerable amount of work.
So, if we’re able to generalize on the SQL story along Calcite, that would help reduce redundant work in some sense.
Although, I’m worried about a few things

1) Like you pointed out, writing complex user jobs using Spark SQL syntax can be harder for users who are moving from “Hudi syntax” to “Spark syntax” for cross table joins, merges etc using data frames. One option is to look at the if there are ways to plug in custom logical and physical plans in Spark, this way, although the merge on sparksql functionality may not be as simple to use, but wouldn’t take away performance and feature set for starters, in the future we could think of having the entire query space be powered by calcite like you mentioned
2) If we continue to use DatasourceV1, is there any downside to this from a performance and optimization perspective when executing plan - I’m guessing not but haven’t delved into the code to see if there’s anything different apart from the API and spec.


On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vi...@apache.org> wrote:


Hello all,


Just bumping this thread again


thanks
vinoth


On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vi...@apache.org> wrote:


Hello all,


One feature that keeps coming up is the ability to use UPDATE, MERGE sql
syntax to support writing into Hudi tables. We have looked into the Spark 3
DataSource V2 APIs as well and found several issues that hinder us in
implementing this via the Spark APIs


- As of this writing, the UPDATE/MERGE syntax is not really opened up to
external datasources like Hudi. only DELETE is.
- DataSource V2 API offers no flexibility to perform any kind of
further transformations to the dataframe. Hudi supports keys, indexes,
preCombining and custom partitioning that ensures file sizes etc. All this
needs shuffling data, looking up/joining against other dataframes so forth.
Today, the DataSource V1 API allows this kind of further
partitions/transformations. But the V2 API is simply offers partition level
iteration once the user calls df.write.format("hudi")


One thought I had is to explore Apache Calcite and write an adapter for
Hudi. This frees us from being very dependent on a particular engine's
syntax support like Spark. Calcite is very popular by itself and supports
most of the key words and (also more streaming friendly syntax). To be
clear, we will still be using Spark/Flink underneath to perform the actual
writing, just that the SQL grammar is provided by Calcite.


To give a taste of how this will look like.


A) If the user wants to mutate a Hudi table using SQL


Instead of writing something like : spark.sql("UPDATE ....")
users will write : hudiSparkSession.sql("UPDATE ....")


B) To save a Spark data frame to a Hudi table
we continue to use Spark DataSource V1


The obvious challenge I see is the disconnect with the Spark DataFrame
ecosystem. Users would write MERGE SQL statements by joining against other
Spark DataFrames.
If we want those expressed in calcite as well, we need to also invest in
the full Query side support, which can increase the scope by a lot.
Some amount of investigation needs to happen, but ideally we should be
able to integrate with the sparkSQL catalog and reuse all the tables there.


I am sure there are some gaps in my thinking. Just starting this thread,
so we can discuss and others can chime in/correct me.


thanks
vinoth


Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Nishith <n3...@gmail.com>.
Thanks for starting this thread Vinoth.
In general, I definitely see the need for SQL-style semantics on Hudi tables. Apache Calcite is a great option to consider, given that DataSource V2 has the limitations you described.

Additionally, even if Spark DataSource V2 allowed for that flexibility, the same SQL semantics would need to be supported in other engines like Flink to provide the same experience to users - which in itself could be a considerable amount of work. 
So, if we’re able to generalize the SQL story around Calcite, that would help reduce redundant work in some sense. 
That said, I’m worried about a few things:

1) Like you pointed out, writing complex user jobs using Spark SQL syntax can be harder for users who are moving from “Hudi syntax” to “Spark syntax” for cross-table joins, merges, etc. using data frames. One option is to look at whether there are ways to plug in custom logical and physical plans in Spark (see the sketch after this list); that way, although MERGE on Spark SQL may not be as simple to use, it wouldn’t take away performance or feature set for starters, and in the future we could think of having the entire query space be powered by Calcite like you mentioned.
2) If we continue to use DataSource V1, is there any downside from a performance and optimization perspective when executing the plan? I’m guessing not, but I haven’t delved into the code to see if there’s anything different apart from the API and spec.
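
As a rough illustration of (1), a planner strategy plugged in through the same
extension mechanism might look like the sketch below. The Hudi-specific MERGE
logical node and its physical counterpart do not exist yet, so they are only
hinted at in a comment; everything else is deferred to Spark’s built-in
strategies.

    import org.apache.spark.sql.{SparkSession, Strategy}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.execution.SparkPlan

    case class HoodieMergeStrategy(session: SparkSession) extends Strategy {
      override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        // case m: MergeIntoHoodieTable => MergeIntoHoodieTableExec(m) :: Nil
        case _ => Nil // let Spark's own strategies plan everything else
      }
    }

    // Registered via extensions.injectPlannerStrategy(HoodieMergeStrategy)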

> On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vi...@apache.org> wrote:
> 
> Hello all,
> 
> Just bumping this thread again
> 
> thanks
> vinoth
> 
>> On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vi...@apache.org> wrote:
>> 
>> Hello all,
>> 
>> One feature that keeps coming up is the ability to use UPDATE, MERGE sql
>> syntax to support writing into Hudi tables. We have looked into the Spark 3
>> DataSource V2 APIs as well and found several issues that hinder us in
>> implementing this via the Spark APIs
>> 
>> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
>> external datasources like Hudi. only DELETE is.
>> - DataSource V2 API offers no flexibility to perform any kind of
>> further transformations to the dataframe. Hudi supports keys, indexes,
>> preCombining and custom partitioning that ensures file sizes etc. All this
>> needs shuffling data, looking up/joining against other dataframes so forth.
>> Today, the DataSource V1 API allows this kind of further
>> partitions/transformations. But the V2 API is simply offers partition level
>> iteration once the user calls df.write.format("hudi")
>> 
>> One thought I had is to explore Apache Calcite and write an adapter for
>> Hudi. This frees us from being very dependent on a particular engine's
>> syntax support like Spark. Calcite is very popular by itself and supports
>> most of the key words and (also more streaming friendly syntax). To be
>> clear, we will still be using Spark/Flink underneath to perform the actual
>> writing, just that the SQL grammar is provided by Calcite.
>> 
>> To give a taste of how this will look like.
>> 
>> A) If the user wants to mutate a Hudi table using SQL
>> 
>> Instead of writing something like : spark.sql("UPDATE ....")
>> users will write : hudiSparkSession.sql("UPDATE ....")
>> 
>> B) To save a Spark data frame to a Hudi table
>> we continue to use Spark DataSource V1
>> 
>> The obvious challenge I see is the disconnect with the Spark DataFrame
>> ecosystem. Users would write MERGE SQL statements by joining against other
>> Spark DataFrames.
>> If we want those expressed in calcite as well, we need to also invest in
>> the full Query side support, which can increase the scope by a lot.
>> Some amount of investigation needs to happen, but ideally we should be
>> able to integrate with the sparkSQL catalog and reuse all the tables there.
>> 
>> I am sure there are some gaps in my thinking. Just starting this thread,
>> so we can discuss and others can chime in/correct me.
>> 
>> thanks
>> vinoth
>> 

Re: [DISCUSS] SQL Support using Apache Calcite

Posted by Vinoth Chandar <vi...@apache.org>.
Hello all,

Just bumping this thread again

thanks
vinoth

On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hello all,
>
> One feature that keeps coming up is the ability to use UPDATE, MERGE sql
> syntax to support writing into Hudi tables. We have looked into the Spark 3
> DataSource V2 APIs as well and found several issues that hinder us in
> implementing this via the Spark APIs
>
> - As of this writing, the UPDATE/MERGE syntax is not really opened up to
> external datasources like Hudi. only DELETE is.
> - DataSource V2 API offers no flexibility to perform any kind of
> further transformations to the dataframe. Hudi supports keys, indexes,
> preCombining and custom partitioning that ensures file sizes etc. All this
> needs shuffling data, looking up/joining against other dataframes so forth.
> Today, the DataSource V1 API allows this kind of further
> partitions/transformations. But the V2 API is simply offers partition level
> iteration once the user calls df.write.format("hudi")
>
> One thought I had is to explore Apache Calcite and write an adapter for
> Hudi. This frees us from being very dependent on a particular engine's
> syntax support like Spark. Calcite is very popular by itself and supports
> most of the key words and (also more streaming friendly syntax). To be
> clear, we will still be using Spark/Flink underneath to perform the actual
> writing, just that the SQL grammar is provided by Calcite.
>
> To give a taste of how this will look like.
>
> A) If the user wants to mutate a Hudi table using SQL
>
> Instead of writing something like : spark.sql("UPDATE ....")
> users will write : hudiSparkSession.sql("UPDATE ....")
>
> B) To save a Spark data frame to a Hudi table
> we continue to use Spark DataSource V1
>
> The obvious challenge I see is the disconnect with the Spark DataFrame
> ecosystem. Users would write MERGE SQL statements by joining against other
> Spark DataFrames.
> If we want those expressed in calcite as well, we need to also invest in
> the full Query side support, which can increase the scope by a lot.
> Some amount of investigation needs to happen, but ideally we should be
> able to integrate with the sparkSQL catalog and reuse all the tables there.
>
> I am sure there are some gaps in my thinking. Just starting this thread,
> so we can discuss and others can chime in/correct me.
>
> thanks
> vinoth
>