Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2020/12/11 07:58:10 UTC
[DISCUSS] SQL Support using Apache Calcite
Hello all,
One feature that keeps coming up is the ability to use UPDATE and MERGE SQL
syntax to write into Hudi tables. We have looked into the Spark 3
DataSource V2 APIs as well and found several issues that hinder us from
implementing this via the Spark APIs:
- As of this writing, the UPDATE/MERGE syntax is not really opened up to
external datasources like Hudi; only DELETE is.
- The DataSource V2 API offers no flexibility to perform any kind of
further transformations on the dataframe. Hudi supports keys, indexes,
preCombining, and custom partitioning that ensures file sizes, etc. All of
this needs shuffling data, looking up/joining against other dataframes, and
so forth. Today, the DataSource V1 API allows this kind of further
partitioning/transformation, but the V2 API simply offers partition-level
iteration once the user calls df.write.format("hudi").
One thought I had is to explore Apache Calcite and write an adapter for
Hudi. This frees us from being very dependent on a particular engine's
syntax support, like Spark's. Calcite is very popular by itself and supports
most of the keywords (and also more streaming-friendly syntax). To be
clear, we would still be using Spark/Flink underneath to perform the actual
writing; just the SQL grammar is provided by Calcite.
To give a taste of how this would look:
A) If the user wants to mutate a Hudi table using SQL
Instead of writing something like: spark.sql("UPDATE ....")
users will write: hudiSparkSession.sql("UPDATE ....")
B) To save a Spark data frame to a Hudi table,
we continue to use Spark DataSource V1.
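To make flow A) concrete, here is a minimal sketch of the idea, in plain Python rather than the actual Hudi/Spark APIs: the names HudiSession, engine_sql, and _run_mutation are invented for illustration. A thin session facade intercepts mutation statements (where the Calcite-backed grammar and Hudi write path would live) and passes everything else through to the underlying engine session.

```python
# Hypothetical sketch: a session facade that owns UPDATE/MERGE/DELETE
# statements itself and delegates all other SQL to the engine session.
MUTATION_KEYWORDS = ("UPDATE", "MERGE", "DELETE")

class HudiSession:
    def __init__(self, engine_sql):
        # engine_sql: callable that runs SQL on the real engine (Spark/Flink)
        self.engine_sql = engine_sql

    def sql(self, statement):
        first_word = statement.lstrip().split(None, 1)[0].upper()
        if first_word in MUTATION_KEYWORDS:
            return self._run_mutation(first_word)
        return self.engine_sql(statement)  # plain queries pass through

    def _run_mutation(self, kind):
        # Placeholder for the Calcite-parsed Hudi write path.
        return "hudi-handled: " + kind

session = HudiSession(
    engine_sql=lambda s: "engine-handled: " + s.lstrip().split(None, 1)[0].upper())
print(session.sql("UPDATE t SET a = 1"))  # hudi-handled: UPDATE
print(session.sql("SELECT * FROM t"))     # engine-handled: SELECT
```

The point of the facade is that users keep a single .sql() entry point, while Hudi decides per statement whether its own grammar or the engine's applies.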
The obvious challenge I see is the disconnect with the Spark DataFrame
ecosystem. Users would write MERGE SQL statements by joining against other
Spark DataFrames.
If we want those expressed in Calcite as well, we need to also invest in
full query-side support, which can increase the scope by a lot.
Some amount of investigation needs to happen, but ideally we should be able
to integrate with the Spark SQL catalog and reuse all the tables there.
I am sure there are some gaps in my thinking. Just starting this thread, so
we can discuss and others can chime in/correct me.
thanks
vinoth
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by vino yang <ya...@gmail.com>.
+1 for Calcite
Best,
Vino
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by David Sheard <da...@datarefactory.com.au>.
I agree with Calcite
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Danny Chan <da...@apache.org>.
Apache Calcite is a good candidate for parsing and executing the SQL;
Apache Flink has an extension for SQL based on the Calcite parser [1].
> users will write : hudiSparkSession.sql("UPDATE ....")
Should the user still need to instantiate the hudiSparkSession first? My
desired use case is that users use the Hoodie CLI to execute these SQLs. They
can choose which engine to use via a CLI config option.
> If we want those expressed in Calcite as well, we need to also invest in
> the full Query side support, which can increase the scope by a lot.
That is true. My thought is that we use Calcite to execute only these
MERGE SQL statements. For DQL or DML, we would delegate the parsing/execution
to the underlying engines (Flink or Spark); the Hoodie Calcite parser only
parses the query statements and hands them over to the engines. One thing to
note is the SQL dialect difference: Spark may have its own syntax (keywords)
that Calcite cannot parse/recognize.
[1]
https://github.com/apache/flink/tree/master/flink-table/flink-sql-parser/src/main/codegen
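The routing described above (the common Calcite layer handles what it can, everything else falls through to the engine) can be sketched as a parse-then-delegate dispatcher. This is plain Python, not real Calcite: common_layer_parse is an invented stand-in for the Calcite parser, and here it recognizes only MERGE.

```python
class CommonLayerParseError(Exception):
    """Raised when the common (Calcite-like) layer cannot parse a statement."""

def common_layer_parse(statement):
    # Stand-in for the Calcite parser: only MERGE is in the common grammar.
    if statement.lstrip().upper().startswith("MERGE"):
        return ("MERGE", statement)
    raise CommonLayerParseError(statement)

def execute(statement, engine):
    """Try the common layer first; on parse failure, route to the engine."""
    try:
        kind, _ = common_layer_parse(statement)
        return "common-layer executed " + kind
    except CommonLayerParseError:
        return engine(statement)  # engine-specific dialect handles the rest

spark_engine = lambda s: "spark executed: " + s.lstrip().split(None, 1)[0].upper()
print(execute("MERGE INTO t USING s ON t.id = s.id", spark_engine))
print(execute("CACHE TABLE t", spark_engine))  # Spark-only keyword falls through
```

The fallback-on-parse-failure shape is what keeps the common layer from having to know every engine's dialect quirks up front.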
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Kabeer Ahmed <ka...@linuxmail.org>.
Vinoth and All,
Users on Gmail might be missing out on these emails, as Gmail is down and
emails sent to the gmail.com domain are bouncing back.
As of 11pm UK time, below is the Google status update:
https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8b67908fadee664c68c240ff9f529ab
Best to bump this thread again tomorrow when Gmail is back up and running.
On Dec 15 2020, at 7:06 am, Vinoth Chandar <vi...@apache.org> wrote:
> Hello all,
>
> Just bumping this thread again
> thanks
> vinoth
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Danny Chan <da...@apache.org>.
That's great, I can help with the Apache Calcite integration.
Vinoth Chandar <vi...@apache.org> wrote on Wed, Dec 23, 2020 at 12:29 AM:
> Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I
> think. Would love to have you involved.
>
> On Tue, Dec 22, 2020 at 3:20 AM pzwpzw <pe...@icloud.com.invalid> wrote:
>
> > Yes, it looks good.
> > We are building the Spark SQL extensions to support Hudi in our internal
> > version. I am interested in participating in the extension of SparkSQL
> > on Hudi.
> >
> > On Dec 22, 2020 at 4:30 PM, Vinoth Chandar <vi...@apache.org> wrote:
> >
> > Hi,
> > I think what we are landing on finally is:
> > - Keep pushing for SparkSQL support using the Spark extensions route
> > - The Calcite effort will be a separate/orthogonal approach, down the line
> > Please feel free to correct me if I got this wrong.
> >
> > On Mon, Dec 21, 2020 at 3:30 AM pzwpzw
> > <pengzhiwei2015@icloud.com.invalid> wrote:
> >
> > Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer
> > to process engine-independent SQL, for example most of the DDL, Hoodie
> > CLI commands, and also provide a parser for the common SQL extensions
> > (e.g. MERGE INTO). Engine-related syntax can be handed to the respective
> > engines to process. If the common SQL layer can handle the input SQL, it
> > handles it; otherwise it is routed to the engine for processing. In the
> > long term, the common layer will become richer and more complete.
> >
> > On Dec 21, 2020 at 4:38 PM, 受春柏 <sc...@126.com> wrote:
> >
> > Hi all,
> > That's very good: Hudi SQL syntax can support Flink, Hive, and other
> > analysis components at the same time.
> > But there are some questions about SparkSQL. SparkSQL syntax is in
> > conflict with Calcite syntax. Is our strategy user migration or syntax
> > compatibility? In addition, will it also support write SQL?
> >
> > On 2020-12-19 at 02:10:16, "Nishith" <n3...@gmail.com> wrote:
> >
> > That's awesome. Looks like we have a consensus on Calcite. Look forward
> > to the RFC as well!
> >
> > -Nishith
> >
> > On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vi...@apache.org> wrote:
> >
> > Sounds good. Look forward to an RFC/DISCUSS thread.
> >
> > Thanks
> > Vinoth
> >
> > On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <da...@apache.org> wrote:
> >
> > Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite; I
> > would add support for the SQL connectors of Hoodie Flink soon ~
> > Currently, I'm preparing a refactoring of the current Flink writer code.
> >
> > Vinoth Chandar <vi...@apache.org> wrote on Fri, Dec 18, 2020 at 6:39 AM:
> >
> > Thanks Kabeer for the note on Gmail. Did not realize that. :)
> >
> > > My desired use case is user use the Hoodie CLI to execute these SQLs.
> > > They can choose what engine to use by a CLI config option.
> >
> > Yes, that is also another attractive aspect of this route. We can build
> > out a common SQL layer and have it translate to the underlying engine
> > (sounds like Hive, huh).
> > Longer term, if we really think we can more easily implement a full
> > DML + DDL + DQL this way, we can proceed with it.
> >
> > As others pointed out, for Spark SQL it might be good to try the Spark
> > extensions route before we take this on more fully.
> >
> > The other part where Calcite is great is all the support for
> > windowing/streaming in its syntax.
> > Danny, I guess we should be able to leverage that through a deeper
> > Flink/Hudi integration?
> >
> > On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vi...@apache.org> wrote:
> >
> > I think Dongwook is investigating along the same lines, and it does seem
> > better to pursue this first, before trying other approaches.
> >
> > On Tue, Dec 15, 2020 at 1:38 AM pzwpzw
> > <pengzhiwei2015@icloud.com.invalid> wrote:
> >
> > Yeah, I agree with Nishith that one option is to look at ways to plug in
> > custom logical and physical plans in Spark. It can simplify the
> > implementation and reuse the Spark SQL syntax. Also, users familiar with
> > Spark SQL will be able to use Hudi's SQL features more quickly.
> > In fact, Spark provides the SparkSessionExtensions interface for
> > implementing custom syntax extensions and SQL rewrite rules:
> > https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
> > We can use SparkSessionExtensions to extend Hudi SQL syntax such as
> > MERGE INTO and DDL syntax.
> >
> > On Dec 15, 2020 at 3:27 PM, Nishith <n3...@gmail.com> wrote:
> >
> > Thanks for starting this thread Vinoth.
> > In general, I definitely see the need for SQL-style semantics on Hudi
> > tables. Apache Calcite is a great option to consider, given that
> > DataSource V2 has the limitations that you described.
> >
> > Additionally, even if Spark DataSource V2 allowed for the flexibility,
> > the same SQL semantics would need to be supported in other engines like
> > Flink to provide the same experience to users - which in itself could
> > also be a considerable amount of work.
> > So, if we're able to generalize the SQL story along Calcite, that would
> > help reduce redundant work in some sense.
> > Although, I'm worried about a few things:
> >
> > 1) Like you pointed out, writing complex user jobs using Spark SQL
> > syntax can be harder for users who are moving from "Hudi syntax" to
> > "Spark syntax" for cross-table joins, merges, etc. using data frames.
> > One option is to look at whether there are ways to plug in custom
> > logical and physical plans in Spark; this way, although the
> > merge-on-SparkSQL functionality may not be as simple to use, it wouldn't
> > take away performance and feature set for starters. In the future, we
> > could think of having the entire query space be powered by Calcite,
> > like you mentioned.
> > 2) If we continue to use DataSource V1, is there any downside to this
> > from a performance and optimization perspective when executing plans?
> > I'm guessing not, but I haven't delved into the code to see if there's
> > anything different apart from the API and spec.
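The SparkSessionExtensions idea raised in the thread (injecting a custom parser so extended statements like MERGE INTO are recognized before the stock parser runs) follows an injection pattern that can be sketched engine-agnostically. This is plain Python, not the real Spark API: SessionBuilder, inject_parser, and hudi_merge_parser are invented stand-ins for illustration.

```python
class SessionBuilder:
    """Invented stand-in for a session builder with parser-extension hooks."""
    def __init__(self):
        self._parsers = []  # extension parsers, tried in order

    def inject_parser(self, parser):
        self._parsers.append(parser)
        return self

    def parse(self, statement):
        for parser in self._parsers:  # extensions get first crack
            result = parser(statement)
            if result is not None:
                return result
        return ("DEFAULT", statement)  # fall back to the stock parser

def hudi_merge_parser(statement):
    # Recognize the extended MERGE INTO statement; ignore everything else.
    if statement.lstrip().upper().startswith("MERGE INTO"):
        return ("HUDI_MERGE", statement)
    return None

builder = SessionBuilder().inject_parser(hudi_merge_parser)
print(builder.parse("MERGE INTO t USING s ON t.id = s.id")[0])  # HUDI_MERGE
print(builder.parse("SELECT 1")[0])                             # DEFAULT
```

The appeal of this route is that unrecognized statements simply fall through to the engine's own parser, so the extension only has to cover the new syntax.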
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Vinoth Chandar <vi...@apache.org>.
Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think.
Would love to have you involved.
>
>
>
> further transformations to the dataframe. Hudi supports keys,
>
>
> indexes,
>
>
>
> preCombining and custom partitioning that ensures file sizes etc. All
>
>
> this
>
>
>
> needs shuffling data, looking up/joining against other dataframes so
>
>
> forth.
>
>
>
> Today, the DataSource V1 API allows this kind of further
>
>
>
> partitions/transformations. But the V2 API is simply offers partition
>
>
> level
>
>
>
> iteration once the user calls df.write.format("hudi")
>
>
>
>
> One thought I had is to explore Apache Calcite and write an adapter
>
>
> for
>
>
>
> Hudi. This frees us from being very dependent on a particular
>
>
> engine's
>
>
>
> syntax support like Spark. Calcite is very popular by itself and
>
>
> supports
>
>
>
> most of the key words and (also more streaming friendly syntax). To
>
>
> be
>
>
>
> clear, we will still be using Spark/Flink underneath to perform the
>
>
> actual
>
>
>
> writing, just that the SQL grammar is provided by Calcite.
>
>
>
>
> To give a taste of how this will look like.
>
>
>
>
> A) If the user wants to mutate a Hudi table using SQL
>
>
>
>
> Instead of writing something like : spark.sql("UPDATE ....")
>
>
>
> users will write : hudiSparkSession.sql("UPDATE ....")
>
>
>
>
> B) To save a Spark data frame to a Hudi table
>
>
>
> we continue to use Spark DataSource V1
>
>
>
>
> The obvious challenge I see is the disconnect with the Spark
>
>
> DataFrame
>
>
>
> ecosystem. Users would write MERGE SQL statements by joining against
>
>
> other
>
>
>
> Spark DataFrames.
>
>
>
> If we want those expressed in calcite as well, we need to also invest
>
>
> in
>
>
>
> the full Query side support, which can increase the scope by a lot.
>
>
>
> Some amount of investigation needs to happen, but ideally we should
>
>
> be
>
>
>
> able to integrate with the sparkSQL catalog and reuse all the tables
>
>
> there.
>
>
>
>
> I am sure there are some gaps in my thinking. Just starting this
>
>
> thread,
>
>
>
> so we can discuss and others can chime in/correct me.
>
>
>
>
> thanks
>
>
>
> vinoth
>
>
>
>
>
>
>
>
>
>
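[Editor's note] The write-path behavior discussed in the quoted thread above (record keys plus a preCombine field deciding which version of a record survives before the upsert) can be modeled engine-independently. The sketch below is illustrative only - it is not Hudi's implementation, and the `uuid`/`ts` field names are hypothetical:

```python
# Illustrative model of Hudi-style upsert semantics: incoming records are
# deduplicated per record key, with the preCombine field breaking ties
# (largest value wins), before being merged into the existing table state.
# NOT Hudi's actual implementation; field names are hypothetical.
def pre_combine(records, key_field, precombine_field):
    """Keep one record per key: the one with the largest preCombine value."""
    winners = {}
    for rec in records:
        key = rec[key_field]
        if key not in winners or rec[precombine_field] > winners[key][precombine_field]:
            winners[key] = rec
    return winners

def upsert(table, incoming, key_field="uuid", precombine_field="ts"):
    """Merge deduplicated incoming records into the table state (a dict by key)."""
    merged = dict(table)
    merged.update(pre_combine(incoming, key_field, precombine_field))
    return merged

table = {"k1": {"uuid": "k1", "ts": 1, "val": "a"}}
batch = [
    {"uuid": "k1", "ts": 3, "val": "b"},  # newer version of k1 -> update
    {"uuid": "k1", "ts": 2, "val": "c"},  # older duplicate, dropped by preCombine
    {"uuid": "k2", "ts": 1, "val": "d"},  # new key -> insert
]
result = upsert(table, batch)
```

This is the kind of key-based shuffling and joining that, per the thread, the DataSource V2 API gives no room for.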
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Posted by wei li <lw...@gmail.com>.
First, I think it is necessary to improve Spark SQL support, because Hudi's main scenario is the data lake or warehouse, and Spark has strong ecosystem capabilities in this field.
Second, in the long run Hudi needs a more general SQL layer, so it is well worth embracing Calcite. A powerful data management and processing service could then be built on top of Hudi.
On 2020/12/22 08:30:37, Vinoth Chandar <vi...@apache.org> wrote:
> Hi,
>
> I think what we are landing on finally is:
>
> - Keep pushing for SparkSQL support using the Spark extensions route
> - The Calcite effort will be a separate/orthogonal approach, down the line
>
> Please feel free to correct me if I got this wrong.
>
> On Mon, Dec 21, 2020 at 3:30 AM pzwpzw <pe...@icloud.com.invalid> wrote:
>
> > Hi 受春柏, here is my point. We can use Calcite to build a common SQL
> > layer to process engine-independent SQL - for example, most of the DDL
> > and the Hoodie CLI commands - and also to provide a parser for the
> > common SQL extensions (e.g. MERGE INTO). Engine-specific syntax can be
> > handed to the respective engines to process. If the common SQL layer
> > can handle the input SQL, it handles it; otherwise it is routed to the
> > engine for processing. In the long term, the common layer will become
> > richer and more complete.
> > On Dec 21, 2020 at 4:38 PM, 受春柏 <sc...@126.com> wrote:
> >
> > Hi all,
> >
> > That's very good - Hudi SQL syntax could support Flink, Hive, and
> > other analysis components at the same time.
> > But there are some questions about SparkSQL: SparkSQL syntax conflicts
> > with Calcite syntax. Is our strategy user migration or syntax
> > compatibility?
> > In addition, will it also support write SQL?
> >
> > On 2020-12-19 02:10:16, "Nishith" <n3...@gmail.com> wrote:
> >
> > That's awesome. Looks like we have a consensus on Calcite. Look
> > forward to the RFC as well!
> >
> > -Nishith
> >
> > On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vi...@apache.org> wrote:
> >
> > Sounds good. Look forward to a RFC/DISCUSS thread.
> >
> > Thanks
> > Vinoth
> >
> > On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <da...@apache.org> wrote:
> >
> > Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite;
> > I would add support for SQL connectors for Hoodie Flink soon ~
> > Currently, I'm preparing a refactoring of the current Flink writer
> > code.
> >
> > On Fri, Dec 18, 2020 at 6:39 AM, Vinoth Chandar <vi...@apache.org> wrote:
> >
> > Thanks Kabeer for the note on gmail. Did not realize that. :)
> >
> > My desired use case is that users use the Hoodie CLI to execute these
> > SQLs. They can choose which engine to use via a CLI config option.
> >
> > Yes, that is also another attractive aspect of this route. We can
> > build out a common SQL layer and have it translate to the underlying
> > engine (sounds like Hive, huh).
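[Editor's note] The routing design pzwpzw describes above - a common SQL layer that handles engine-independent statements itself and forwards everything else to the underlying engine - can be sketched as a small dispatcher. The statement classification below is a toy assumption, not a real grammar:

```python
import re

# Toy sketch of a common SQL layer: engine-independent statements (DDL-style
# commands here) are handled by the common layer itself; everything else is
# routed to the configured engine (e.g. Spark or Flink). The regex-based
# classification is an illustrative assumption, not a real SQL parser.
COMMON_STATEMENTS = re.compile(r"^\s*(CREATE|DROP|ALTER|SHOW)\b", re.IGNORECASE)

def execute(sql, engine):
    """Return (handler, statement): the common layer if it can handle the
    statement, otherwise the engine it is routed to."""
    if COMMON_STATEMENTS.match(sql):
        return ("common-layer", sql.strip())
    return (engine, sql.strip())

routed_ddl = execute("CREATE TABLE t (id INT)", engine="spark")
routed_dml = execute("MERGE INTO t USING s ON t.id = s.id", engine="spark")
```

In a real implementation the common layer would be a Calcite parser and the engine branch would hand the statement to Spark or Flink, as discussed in the thread.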
Reply:Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Posted by 受春柏 <sc...@126.com>.
Yes, I think it should be OK.
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Posted by pzwpzw <pe...@icloud.com.INVALID>.
Yes, it looks good.
We are building the Spark SQL extensions to support Hudi in our internal version.
I am interested in participating in the extension of SparkSQL for Hudi.
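[Editor's note] The SparkSQL extension work mentioned here centers on statements like MERGE INTO. As a rough, engine-independent model of what MERGE INTO means - source rows that match a target row on the join key update it, unmatched source rows are inserted - consider this sketch (the function and field names are hypothetical, not any Hudi or Spark API):

```python
# Minimal model of MERGE INTO semantics: rows from `source` that match a
# target row on `key` update it (WHEN MATCHED THEN UPDATE); unmatched source
# rows are inserted (WHEN NOT MATCHED THEN INSERT). Illustrative only.
def merge_into(target, source, key):
    result = {row[key]: dict(row) for row in target}
    for row in source:
        k = row[key]
        if k in result:
            result[k].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            result[k] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return sorted(result.values(), key=lambda r: r[key])

target = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
source = [{"id": 2, "val": "B"}, {"id": 3, "val": "c"}]
merged = merge_into(target, source, key="id")
```

The SparkSessionExtensions route discussed in the thread would express these semantics as a parsed statement plus logical/physical plan rules rather than Python, but the statement's meaning is the same.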
2020年12月22日 下午4:30,Vinoth Chandar <vi...@apache.org> 写道:
Hi,
I think what we are landing on finally is.
- Keep pushing for SparkSQL support using Spark extensions route
- Calcite effort will be a separate/orthogonal approach, down the line
Please feel free to correct me, if I got this wrong.
On Mon, Dec 21, 2020 at 3:30 AM pzwpzw <pe...@icloud.com.invalid>
wrote:
Hi 受春柏 ,here is my point. We can use Calcite to build a common sql layer
to process engine independent SQL, for example most of the DDL、Hoodie CLI
command and also provide parser for the common SQL extensions(e.g. Merge
Into). The Engine-related syntax can be taught to the respective engines to
process. If the common sql layer can handle the input sql, it handle
it.Otherwise it is routed to the engine for processing. In long term, the
common layer will more and more rich and perfect.
2020年12月21日 下午4:38,受春柏 <sc...@126.com> 写道:
Hi,all
That's very good,Hudi SQL syntax can support Flink、hive and other analysis
components at the same time,
But there are some questions about SparkSQL. SparkSQL syntax is in
conflict with Calctite syntax.Is our strategy
user migration or syntax compatibility?
In addition ,will it also support write SQL?
在 2020-12-19 02:10:16,"Nishith" <n3...@gmail.com> 写道:
That’s awesome. Looks like we have a consensus on Calcite. Look forward to
the RFC as well!
-Nishith
On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vi...@apache.org> wrote:
Sounds good. Look forward to a RFC/DISCUSS thread.
Thanks
Vinoth
On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <da...@apache.org> wrote:
Yes, Apache Flink basically reuse the DQL syntax of Apache Calcite, i would
add support for SQL connectors of Hoodie Flink soon ~
Currently, i'm preparing a refactoring to the current Flink writer code.
Vinoth Chandar <vi...@apache.org> 于2020年12月18日周五 上午6:39写道:
Thanks Kabeer for the note on gmail. Did not realize that. :)
My desired use case is user use the Hoodie CLI to execute these SQLs.
They can choose what engine to use by a CLI config option.
Yes, that is also another attractive aspect of this route. We can build
out
a common SQL layer and have this translate to the underlying engine
(sounds
like Hive huh)
Longer term, if we really think we can more easily implement a full DML +
DDL + DQL, we can proceed with this.
As others pointed out, for Spark SQL, it might be good to try the Spark
extensions route, before we take this on more fully.
The other part where Calcite is great is, all the support for
windowing/streaming in its syntax.
Danny, I guess if we should be able to leverage that through a deeper
Flink/Hudi integration?
On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vi...@apache.org>
wrote:
I think Dongwook is investigating on the same lines. and it does seem
better to pursue this first, before trying other approaches.
On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pengzhiwei2015@icloud.com
.invalid>
wrote:
Yeah I agree with Nishith that an option way is to look at the
ways
to
plug in custom logical and physical plans in Spark. It can simplify
the
implementation and reuse the Spark SQL syntax. And also users
familiar
with
Spark SQL will be able to use HUDi's SQL features more quickly.
In fact, spark have provided the SparkSessionExtensions interface for
implement custom syntax extensions and SQL rewrite rule.
https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
.
We can use the SparkSessionExtensions to extended hoodie sql syntax
such
as MERGE INTO and DDL syntax.
2020年12月15日 下午3:27,Nishith <n3...@gmail.com> 写道:
Thanks for starting this thread Vinoth.
In general, definitely see the need for SQL style semantics on Hudi
tables. Apache Calcite is a great option to considering given
DatasourceV2
has the limitations that you described.
Additionally, even if Spark DatasourceV2 allowed for the flexibility,
the
same SQL semantics needs to be supported in other engines like Flink
to
provide the same experience to users - which in itself could also be
considerable amount of work.
So, if we’re able to generalize on the SQL story along Calcite, that
would
help reduce redundant work in some sense.
Although, I’m worried about a few things
1) Like you pointed out, writing complex user jobs using Spark SQL
syntax
can be harder for users who are moving from “Hudi syntax” to “Spark
syntax”
for cross table joins, merges etc using data frames. One option is to
look
at the if there are ways to plug in custom logical and physical plans
in
Spark, this way, although the merge on sparksql functionality may not
be
as
simple to use, but wouldn’t take away performance and feature set for
starters, in the future we could think of having the entire query
space
be
powered by calcite like you mentioned
2) If we continue to use DatasourceV1, is there any downside to this
from
a performance and optimization perspective when executing plan - I’m
guessing not but haven’t delved into the code to see if there’s
anything
different apart from the API and spec.
On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vi...@apache.org>
wrote:
Hello all,
Just bumping this thread again
thanks
vinoth
On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vi...@apache.org>
wrote:
Hello all,
One feature that keeps coming up is the ability to use UPDATE, MERGE
sql
syntax to support writing into Hudi tables. We have looked into the
Spark 3
DataSource V2 APIs as well and found several issues that hinder us in
implementing this via the Spark APIs
- As of this writing, the UPDATE/MERGE syntax is not really opened up
to
external datasources like Hudi. only DELETE is.
- DataSource V2 API offers no flexibility to perform any kind of
further transformations to the dataframe. Hudi supports keys,
indexes,
preCombining and custom partitioning that ensures file sizes etc. All
this
needs shuffling data, looking up/joining against other dataframes so
forth.
Today, the DataSource V1 API allows this kind of further
partitions/transformations. But the V2 API is simply offers partition
level
iteration once the user calls df.write.format("hudi")
One thought I had is to explore Apache Calcite and write an adapter
for
Hudi. This frees us from being very dependent on a particular
engine's
syntax support like Spark. Calcite is very popular by itself and
supports
most of the key words and (also more streaming friendly syntax). To
be
clear, we will still be using Spark/Flink underneath to perform the
actual
writing, just that the SQL grammar is provided by Calcite.
To give a taste of how this will look like.
A) If the user wants to mutate a Hudi table using SQL
Instead of writing something like : spark.sql("UPDATE ....")
users will write : hudiSparkSession.sql("UPDATE ....")
B) To save a Spark data frame to a Hudi table
we continue to use Spark DataSource V1
The obvious challenge I see is the disconnect with the Spark
DataFrame
ecosystem. Users would write MERGE SQL statements by joining against
other
Spark DataFrames.
If we want those expressed in calcite as well, we need to also invest
in
the full Query side support, which can increase the scope by a lot.
Some amount of investigation needs to happen, but ideally we should
be
able to integrate with the sparkSQL catalog and reuse all the tables
there.
I am sure there are some gaps in my thinking. Just starting this
thread,
so we can discuss and others can chime in/correct me.
thanks
vinoth
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Vinoth Chandar <vi...@apache.org>.
Hi,
I think what we are landing on finally is:
- Keep pushing for SparkSQL support using the Spark extensions route
- The Calcite effort will be a separate/orthogonal approach, down the line
Please feel free to correct me if I got this wrong.
On Mon, Dec 21, 2020 at 3:30 AM pzwpzw <pe...@icloud.com.invalid>
wrote:
> Hi 受春柏 ,here is my point. We can use Calcite to build a common sql layer
> to process engine independent SQL, for example most of the DDL、Hoodie CLI
> command and also provide parser for the common SQL extensions(e.g. Merge
> Into). The Engine-related syntax can be taught to the respective engines to
> process. If the common sql layer can handle the input sql, it handle
> it.Otherwise it is routed to the engine for processing. In long term, the
> common layer will more and more rich and perfect.
> 2020年12月21日 下午4:38,受春柏 <sc...@126.com> 写道:
>
> Hi,all
>
>
> That's very good,Hudi SQL syntax can support Flink、hive and other analysis
> components at the same time,
> But there are some questions about SparkSQL. SparkSQL syntax is in
> conflict with Calctite syntax.Is our strategy
> user migration or syntax compatibility?
> In addition ,will it also support write SQL?
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> 在 2020-12-19 02:10:16,"Nishith" <n3...@gmail.com> 写道:
>
> That’s awesome. Looks like we have a consensus on Calcite. Look forward to
> the RFC as well!
>
>
> -Nishith
>
>
> On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vi...@apache.org> wrote:
>
>
> Sounds good. Look forward to a RFC/DISCUSS thread.
>
>
> Thanks
>
> Vinoth
>
>
> On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <da...@apache.org> wrote:
>
>
> Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite; I would
> add support for SQL connectors for Hoodie on Flink soon.
> Currently, I'm preparing a refactoring of the current Flink writer code.
>
>
> On Fri, Dec 18, 2020 at 6:39 AM, Vinoth Chandar <vi...@apache.org> wrote:
>
>
> Thanks Kabeer for the note on gmail. Did not realize that. :)
>
>
> My desired use case is that users use the Hoodie CLI to execute these SQLs.
> They can choose which engine to use via a CLI config option.
>
> Yes, that is also another attractive aspect of this route. We can build out
> a common SQL layer and have it translate to the underlying engine (sounds
> like Hive, huh). Longer term, if we really think we can more easily
> implement full DML + DDL + DQL, we can proceed with this.
>
> As others pointed out, for Spark SQL, it might be good to try the Spark
> extensions route before we take this on more fully.
>
> The other part where Calcite is great is all the support for
> windowing/streaming in its syntax.
> Danny, I guess we should be able to leverage that through a deeper
> Flink/Hudi integration?
>
>
>
> On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vi...@apache.org> wrote:
>
>
> I think Dongwook is investigating along the same lines, and it does seem
> better to pursue this first, before trying other approaches.
> On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pengzhiwei2015@icloud.com.invalid> wrote:
>
>
> Yeah, I agree with Nishith that one option is to look at ways to plug in
> custom logical and physical plans in Spark. That can simplify the
> implementation and reuse the Spark SQL syntax. Also, users familiar with
> Spark SQL will be able to use Hudi's SQL features more quickly.
> In fact, Spark provides the SparkSessionExtensions interface for
> implementing custom syntax extensions and SQL rewrite rules:
>
> https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html
>
> We can use SparkSessionExtensions to extend Hudi SQL syntax,
> such as MERGE INTO and DDL syntax.
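The SparkSessionExtensions approach mentioned above works by injecting a custom parser that gets first crack at each statement before the built-in one. The real API is Scala (`injectParser` and friends); the toy model below only illustrates the hook pattern in stdlib Python — all class and function names here are made up for the example:

```python
from typing import Callable, List, Optional

# A parser hook returns a "plan" (here just a dict) if it claims the
# statement, or None to let later parsers / the built-in parser handle it.
Parser = Callable[[str], Optional[dict]]

class SessionExtensions:
    """Toy stand-in for SparkSessionExtensions: a registry of parser hooks."""
    def __init__(self) -> None:
        self.parsers: List[Parser] = []

    def inject_parser(self, parser: Parser) -> None:
        self.parsers.append(parser)

class Session:
    """Toy session that consults injected parsers before the built-in one."""
    def __init__(self, extensions: SessionExtensions) -> None:
        self.extensions = extensions

    def sql(self, text: str) -> dict:
        for parse in self.extensions.parsers:
            plan = parse(text)
            if plan is not None:   # extension claimed the statement
                return plan
        return {"op": "BUILTIN", "sql": text}

def hudi_merge_parser(text: str) -> Optional[dict]:
    """Hypothetical Hudi extension: claims MERGE INTO statements."""
    if text.strip().upper().startswith("MERGE INTO"):
        return {"op": "HUDI_MERGE_INTO", "sql": text}
    return None

ext = SessionExtensions()
ext.inject_parser(hudi_merge_parser)
session = Session(ext)
print(session.sql("MERGE INTO hudi_tbl USING src ON ...")["op"])  # HUDI_MERGE_INTO
```

In actual Spark, the extension class would be registered via the `spark.sql.extensions` configuration, and the injected parser would delegate anything it does not recognize back to Spark's own parser, much like the `None` fallback here.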
>
>
> On Dec 15, 2020, at 3:27 PM, Nishith <n3...@gmail.com> wrote:
>
>
> Thanks for starting this thread, Vinoth.
> In general, I definitely see the need for SQL-style semantics on Hudi
> tables. Apache Calcite is a great option to consider, given that
> DataSourceV2 has the limitations that you described.
>
> Additionally, even if Spark DataSourceV2 allowed for the flexibility, the
> same SQL semantics would need to be supported in other engines like Flink
> to provide the same experience to users, which in itself could also be a
> considerable amount of work.
> So, if we're able to generalize the SQL story along Calcite, that would
> help reduce redundant work in some sense.
> Although, I'm worried about a few things:
>
> 1) Like you pointed out, writing complex user jobs using Spark SQL syntax
> can be harder for users who are moving from "Hudi syntax" to "Spark syntax"
> for cross-table joins, merges, etc. using data frames. One option is to
> look at whether there are ways to plug in custom logical and physical plans
> in Spark. This way, although the MERGE on Spark SQL functionality may not
> be as simple to use, it wouldn't take away performance and feature set for
> starters; in the future we could think of having the entire query space be
> powered by Calcite like you mentioned.
> 2) If we continue to use DataSourceV1, is there any downside to this from
> a performance and optimization perspective when executing the plan? I'm
> guessing not, but I haven't delved into the code to see if there's anything
> different apart from the API and spec.
>
>
> On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vi...@apache.org> wrote:
>
> Hello all,
>
> Just bumping this thread again.
>
> thanks
> vinoth
>
Reply:Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Posted by 受春柏 <sc...@126.com>.
Hi pzwpzw,
I see what you mean. It is very necessary to implement a common layer for Hudi, and we are also planning to implement Spark SQL write capabilities for SQL-based ETL processing. The common layer and Spark SQL write support can combine to form Hudi's SQL capabilities.
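On the write side mentioned here, the core of a MERGE INTO against a Hudi-like table is a keyed upsert: on a key match the record is combined (Hudi's precombine field picks the winner, e.g. a timestamp), and unmatched source records are inserted. An illustrative stdlib-only sketch — field names are made up for the example, and real Hudi does this with shuffles and index lookups rather than an in-memory dict:

```python
from typing import Dict, List

def merge_into(target: Dict[str, dict], source: List[dict],
               key: str = "id", precombine: str = "ts") -> Dict[str, dict]:
    """Upsert source records into target keyed by `key`; on a key match,
    keep the record with the larger precombine value (Hudi-style)."""
    for rec in source:
        k = rec[key]
        existing = target.get(k)
        if existing is None or rec[precombine] >= existing[precombine]:
            target[k] = rec   # insert, or update with the newer record
    return target

target = {"a": {"id": "a", "ts": 1, "val": "old"}}
source = [{"id": "a", "ts": 2, "val": "new"},   # newer -> updates "a"
          {"id": "b", "ts": 1, "val": "ins"}]   # no match -> inserts "b"
merged = merge_into(target, source)
print(merged["a"]["val"], len(merged))  # new 2
```

Whichever layer parses the MERGE INTO statement (Calcite or a Spark extension), the plan it produces ultimately has to reduce to this kind of key-based, precombine-aware write on the engine side.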
At 2020-12-21 19:30:36, "pzwpzw" <pe...@icloud.com.INVALID> wrote:
Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer to process engine-independent SQL, for example most of the DDL and Hoodie CLI commands, and also provide a parser for the common SQL extensions (e.g. MERGE INTO). Engine-specific syntax can be delegated to the respective engines to process. If the common SQL layer can handle the input SQL, it handles it; otherwise it is routed to the engine for processing. In the long term, the common layer will become richer and more complete.
On Dec 21, 2020, at 4:38 PM, 受春柏 <sc...@126.com> wrote:
Hi all,
That's very good: Hudi SQL syntax could support Flink, Hive and other analysis components at the same time. But there are some questions about Spark SQL: Spark SQL syntax conflicts with Calcite syntax. Is our strategy user migration or syntax compatibility?
In addition, will it also support write SQL?
Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Posted by pzwpzw <pe...@icloud.com.INVALID>.
Hi 受春柏, here is my point. We can use Calcite to build a common SQL layer to process engine-independent SQL, for example most of the DDL and Hoodie CLI commands, and also provide a parser for the common SQL extensions (e.g. MERGE INTO). Engine-specific syntax can be delegated to the respective engines to process. If the common SQL layer can handle the input SQL, it handles it; otherwise it is routed to the engine for processing. In the long term, the common layer will become richer and more complete.
On Dec 21, 2020, at 4:38 PM, 受春柏 <sc...@126.com> wrote:
Hi all,
That's very good: Hudi SQL syntax could support Flink, Hive and other analysis components at the same time. But there are some questions about Spark SQL: Spark SQL syntax conflicts with Calcite syntax. Is our strategy user migration or syntax compatibility?
In addition, will it also support write SQL?
Reply:Re: [DISCUSS] SQL Support using Apache Calcite
Posted by 受春柏 <sc...@126.com>.
Hi all,
That's very good: Hudi SQL syntax could support Flink, Hive and other analysis components at the same time. But there are some questions about Spark SQL: Spark SQL syntax conflicts with Calcite syntax. Is our strategy user migration or syntax compatibility?
In addition, will it also support write SQL?
On 2020-12-19 at 02:10:16, "Nishith" <n3...@gmail.com> wrote:
>That’s awesome. Looks like we have a consensus on Calcite. Look forward to the RFC as well!
>
>-Nishith
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Nishith <n3...@gmail.com>.
That’s awesome. Looks like we have a consensus on Calcite. Look forward to the RFC as well!
-Nishith
> On Dec 18, 2020, at 9:03 AM, Vinoth Chandar <vi...@apache.org> wrote:
>
> Sounds good. Look forward to an RFC/DISCUSS thread.
>
> Thanks
> Vinoth
>>>>>
>>>>> - DataSource V2 API offers no flexibility to perform any kind of
>>>>>
>>>>> further transformations to the dataframe. Hudi supports keys,
>> indexes,
>>>>>
>>>>> preCombining and custom partitioning that ensures file sizes etc. All
>>>> this
>>>>>
>>>>> needs shuffling data, looking up/joining against other dataframes so
>>>> forth.
>>>>>
>>>>> Today, the DataSource V1 API allows this kind of further
>>>>>
>>>>> partitions/transformations. But the V2 API is simply offers partition
>>>> level
>>>>>
>>>>> iteration once the user calls df.write.format("hudi")
>>>>>
>>>>>
>>>>> One thought I had is to explore Apache Calcite and write an adapter
>> for
>>>>>
>>>>> Hudi. This frees us from being very dependent on a particular
>> engine's
>>>>>
>>>>> syntax support like Spark. Calcite is very popular by itself and
>>> supports
>>>>>
>>>>> most of the key words and (also more streaming friendly syntax). To
>> be
>>>>>
>>>>> clear, we will still be using Spark/Flink underneath to perform the
>>>> actual
>>>>>
>>>>> writing, just that the SQL grammar is provided by Calcite.
>>>>>
>>>>>
>>>>> To give a taste of how this will look like.
>>>>>
>>>>>
>>>>> A) If the user wants to mutate a Hudi table using SQL
>>>>>
>>>>>
>>>>> Instead of writing something like : spark.sql("UPDATE ....")
>>>>>
>>>>> users will write : hudiSparkSession.sql("UPDATE ....")
>>>>>
>>>>>
>>>>> B) To save a Spark data frame to a Hudi table
>>>>>
>>>>> we continue to use Spark DataSource V1
>>>>>
>>>>>
>>>>> The obvious challenge I see is the disconnect with the Spark
>> DataFrame
>>>>>
>>>>> ecosystem. Users would write MERGE SQL statements by joining against
>>>> other
>>>>>
>>>>> Spark DataFrames.
>>>>>
>>>>> If we want those expressed in calcite as well, we need to also invest
>>> in
>>>>>
>>>>> the full Query side support, which can increase the scope by a lot.
>>>>>
>>>>> Some amount of investigation needs to happen, but ideally we should
>> be
>>>>>
>>>>> able to integrate with the sparkSQL catalog and reuse all the tables
>>>> there.
>>>>>
>>>>>
>>>>> I am sure there are some gaps in my thinking. Just starting this
>>> thread,
>>>>>
>>>>> so we can discuss and others can chime in/correct me.
>>>>>
>>>>>
>>>>> thanks
>>>>>
>>>>> vinoth
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Vinoth Chandar <vi...@apache.org>.
Sounds good. Looking forward to an RFC/DISCUSS thread.
Thanks
Vinoth
On Thu, Dec 17, 2020 at 6:04 PM Danny Chan <da...@apache.org> wrote:
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Danny Chan <da...@apache.org>.
Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite; I would
add support for SQL connectors for Hudi on Flink soon.
Currently, I'm preparing a refactoring of the current Flink writer code.
Vinoth Chandar <vi...@apache.org> wrote on Fri, Dec 18, 2020 at 6:39 AM:
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Vinoth Chandar <vi...@apache.org>.
Thanks Kabeer for the note on gmail. Did not realize that. :)
>> My desired use case is user use the Hoodie CLI to execute these SQLs.
They can choose what engine to use by a CLI config option.
Yes, that is another attractive aspect of this route. We can build out
a common SQL layer and have it translate to the underlying engine (sounds
like Hive, huh).
Longer term, if we really think we can more easily implement a full DML +
DDL + DQL, we can proceed with this.
As others pointed out, for Spark SQL, it might be good to try the Spark
extensions route, before we take this on more fully.
The other part where Calcite is great is all the support for
windowing/streaming in its syntax.
Danny, I guess we should be able to leverage that through a deeper
Flink/Hudi integration?
On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar <vi...@apache.org> wrote:
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by Vinoth Chandar <vi...@apache.org>.
I think Dongwook is investigating along the same lines, and it does seem
better to pursue this first, before trying other approaches.
On Tue, Dec 15, 2020 at 1:38 AM pzwpzw <pe...@icloud.com.invalid>
wrote:
Re: [DISCUSS] SQL Support using Apache Calcite
Posted by pzwpzw <pe...@icloud.com.INVALID>.
Yeah, I agree with Nishith that one option is to look at ways to plug in custom logical and physical plans in Spark. It would simplify the implementation and reuse the Spark SQL syntax. Also, users familiar with Spark SQL would be able to pick up Hudi's SQL features more quickly.
In fact, Spark provides the SparkSessionExtensions interface for implementing custom syntax extensions and SQL rewrite rules: https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSessionExtensions.html. We can use SparkSessionExtensions to extend Hudi SQL syntax, such as MERGE INTO and DDL statements.
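A minimal sketch of how such an extension could be wired up. Note that `HoodieSparkSessionExtension` and `HoodieSqlParser` are hypothetical names used for illustration, not existing Hudi classes:

```scala
import org.apache.spark.sql.SparkSessionExtensions

// Hypothetical extension entry point; it would be registered via the
// Spark config: spark.sql.extensions=<package>.HoodieSparkSessionExtension
class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // Wrap the session's default parser so statements the custom grammar
    // recognizes (e.g. MERGE INTO a Hudi table) are handled here, while
    // everything else is delegated back to Spark's own parser.
    extensions.injectParser { (session, delegate) =>
      new HoodieSqlParser(session, delegate) // hypothetical parser class
    }
  }
}
```

The delegate pattern matters here: the custom parser only needs to cover the new Hudi statements and can fall back to Spark for the rest of the SQL surface.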
On Dec 15, 2020 at 3:27 PM, Nishith <n3...@gmail.com> wrote:
Thanks for starting this thread Vinoth.
In general, I definitely see the need for SQL-style semantics on Hudi tables. Apache Calcite is a great option to consider given DataSourceV2 has the limitations that you described.
Additionally, even if Spark DataSourceV2 allowed for the flexibility, the same SQL semantics would need to be supported in other engines like Flink to provide the same experience to users - which in itself could be a considerable amount of work.
So, if we're able to generalize the SQL story around Calcite, that would help reduce redundant work in some sense.
That said, I'm worried about a few things:
1) Like you pointed out, writing complex user jobs using Spark SQL syntax can be harder for users who are moving from "Hudi syntax" to "Spark syntax" for cross-table joins, merges, etc. using data frames. One option is to look at whether there are ways to plug in custom logical and physical plans in Spark. That way, although the MERGE-on-Spark-SQL functionality may not be as simple to use, it wouldn't take away performance or feature set for starters; in the future we could think of having the entire query space be powered by Calcite, like you mentioned.
2) If we continue to use DataSourceV1, is there any downside to this from a performance and optimization perspective when executing the plan? I'm guessing not, but I haven't delved into the code to see if there's anything different apart from the API and spec.
On Dec 14, 2020, at 11:06 PM, Vinoth Chandar <vi...@apache.org> wrote:
Hello all,
Just bumping this thread again
thanks
vinoth
On Thu, Dec 10, 2020 at 11:58 PM Vinoth Chandar <vi...@apache.org> wrote:
Hello all,
One feature that keeps coming up is the ability to use UPDATE and MERGE SQL
syntax to write into Hudi tables. We have looked into the Spark 3
DataSource V2 APIs as well and found several issues that hinder us in
implementing this via the Spark APIs:
- As of this writing, the UPDATE/MERGE syntax is not really opened up to
external datasources like Hudi; only DELETE is.
- The DataSource V2 API offers no flexibility to perform any kind of
further transformation on the dataframe. Hudi supports keys, indexes,
preCombining, and custom partitioning that ensures file sizes, etc. All of this
needs shuffling data, looking up/joining against other dataframes, and so forth.
Today, the DataSource V1 API allows this kind of further
repartitioning/transformation, but the V2 API simply offers partition-level
iteration once the user calls df.write.format("hudi").
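To make that point concrete, here is a toy, pure-Python sketch (not Hudi code; all names are illustrative) of the kind of pre-write work described above - preCombine de-duplication by key, then bucketing records to control file sizes - none of which fits V2's partition-level iteration model:

```python
# Toy sketch of the pre-write shuffles Hudi needs before any file is written.
# Purely illustrative; real Hudi does this over Spark/Flink datasets.

def precombine(records, key_field, ordering_field):
    """Keep one record per key: the one with the largest ordering value."""
    latest = {}
    for rec in records:
        k = rec[key_field]
        if k not in latest or rec[ordering_field] > latest[k][ordering_field]:
            latest[k] = rec
    return list(latest.values())

def bucket_by_key(records, key_field, num_buckets):
    """Stand-in for custom partitioning that controls file-group sizes."""
    buckets = [[] for _ in range(num_buckets)]
    for rec in records:
        buckets[hash(rec[key_field]) % num_buckets].append(rec)
    return buckets

incoming = [
    {"id": "a", "ts": 1, "val": "old"},
    {"id": "a", "ts": 2, "val": "new"},   # higher ts wins preCombine for "a"
    {"id": "b", "ts": 1, "val": "only"},
]
deduped = precombine(incoming, "id", "ts")
buckets = bucket_by_key(deduped, "id", 2)
```

The V1 write path lets a datasource run transformations like these before the physical write; V2's iterator-per-partition contract does not.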
One thought I had is to explore Apache Calcite and write an adapter for
Hudi. This frees us from being too dependent on a particular engine's
syntax support, such as Spark's. Calcite is very popular in its own right and
supports most of the key words (and also more streaming-friendly syntax). To be
clear, we would still use Spark/Flink underneath to perform the actual
writing; only the SQL grammar is provided by Calcite.
To give a taste of how this would look:
A) If the user wants to mutate a Hudi table using SQL:
Instead of writing something like spark.sql("UPDATE ...."),
users would write hudiSparkSession.sql("UPDATE ....").
B) To save a Spark DataFrame to a Hudi table,
we would continue to use Spark DataSource V1.
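As a toy illustration of option (A) - SQL parsed by a separate layer, then lowered onto the store - the sketch below hard-codes a single UPDATE shape in place of a real Calcite grammar. The class and method names are made up for illustration; nothing here is actual Hudi or Calcite API:

```python
import re

# Toy model of the proposed flow: a Hudi session accepts SQL text and turns
# it into a keyed mutation against the underlying table. A real adapter would
# parse via Calcite and execute via Spark/Flink; this regex is a stand-in.
class HudiSession:
    def __init__(self, table):
        self.table = table  # dict: record key -> row dict

    def sql(self, stmt):
        m = re.match(
            r"UPDATE \w+ SET (\w+) = '([^']*)' WHERE id = '([^']*)'", stmt)
        if not m:
            raise ValueError("unsupported statement in this toy")
        col, val, key = m.groups()
        if key in self.table:
            self.table[key][col] = val  # keyed upsert of the one column

session = HudiSession({"a": {"id": "a", "val": "old"}})
session.sql("UPDATE t SET val = 'new' WHERE id = 'a'")
```

The point is only the separation of concerns: grammar in one layer, keyed write path in another.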
The obvious challenge I see is the disconnect with the Spark DataFrame
ecosystem. Users would write MERGE SQL statements by joining against other
Spark DataFrames.
If we want those expressed in Calcite as well, we also need to invest in
full query-side support, which can increase the scope by a lot.
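For concreteness, the record-level semantics that such a MERGE statement would ultimately have to lower into can be sketched in pure Python (illustrative only; not Hudi or Calcite code):

```python
# Toy MERGE INTO semantics against a keyed table: matched rows are updated,
# unmatched source rows are inserted - i.e. WHEN MATCHED THEN UPDATE SET * /
# WHEN NOT MATCHED THEN INSERT *. A SQL layer would lower MERGE into this
# kind of keyed upsert.
def merge_into(target, source, key):
    """target/source: lists of row dicts; returns the merged table."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)   # matched: update in place
        else:
            merged[row[key]] = dict(row)   # not matched: insert
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": "a", "val": 1}, {"id": "b", "val": 2}]
source = [{"id": "b", "val": 20}, {"id": "c", "val": 3}]
result = merge_into(target, source, "id")
```

The hard part is not these semantics but resolving the USING side, which in practice is an arbitrary Spark DataFrame or catalog table.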
Some amount of investigation needs to happen, but ideally we should be
able to integrate with the Spark SQL catalog and reuse all the tables there.
I am sure there are some gaps in my thinking. Just starting this thread,
so we can discuss and others can chime in/correct me.
thanks
vinoth