Posted to dev@kyuubi.apache.org by kaifei yi <yi...@gmail.com> on 2022/08/16 09:42:46 UTC

Support Hive V2 DataSource in Kyuubi

Hi, Kyuubi community:

Users are increasingly asking for federated queries in a Lakehouse
architecture, and we probably need several data sources to meet this need.

In practice, some user services need to access other Hive warehouses for
federated queries. Apache Spark already supports access to Hive data
sources. However, in federated scenarios some capabilities may be
unavailable. For example, users may need to access several Hive warehouses,
possibly running different Hive versions, within a single job. This
requirement can be met by a Hive V2 data source.

Does the Kyuubi community have any idea how to include Hive V2 in the
feature list?
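
As a rough illustration of the kind of setup being asked for, the sketch
below builds the Spark configuration for two independently registered Hive
catalogs and a query that joins across them. Spark registers DataSource V2
catalogs under spark.sql.catalog.<name> and hands the
spark.sql.catalog.<name>.<key> entries to the plugin as options. Everything
else here is hypothetical and not taken from this thread: the catalog names
hive_a and hive_b, the class com.example.HiveTableCatalog, the metastore
addresses, the tables, and the assumption that the plugin reads
hive.metastore.uris from its options.

```python
def catalog_conf(name, impl, options):
    """Build the Spark conf entries that register one DSv2 catalog."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {prefix: impl}
    for key, value in options.items():
        # Per-catalog options live under the catalog's own prefix.
        conf[f"{prefix}.{key}"] = value
    return conf


# Hypothetical plugin class; a real Hive V2 catalog would supply its own.
HIVE_CATALOG_IMPL = "com.example.HiveTableCatalog"

conf = {}
conf.update(catalog_conf(
    "hive_a", HIVE_CATALOG_IMPL,
    {"hive.metastore.uris": "thrift://metastore-a:9083"}))
conf.update(catalog_conf(
    "hive_b", HIVE_CATALOG_IMPL,
    {"hive.metastore.uris": "thrift://metastore-b:9083"}))

# A federated query addresses each warehouse through its catalog name.
federated_query = """
SELECT o.id, c.name
FROM hive_a.sales.orders AS o
JOIN hive_b.crm.customers AS c ON o.customer_id = c.id
"""

for key in sorted(conf):
    print(f"{key}={conf[key]}")
```

Keeping each warehouse behind its own catalog name is what would let one
job address both metastores without changing the session-wide Hive settings.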

Re: Support Hive V2 DataSource in Kyuubi

Posted by Kent Yao <ya...@apache.org>.

Hi Yi,

Supporting multiple HMSs or Hive warehouses sounds interesting. I do
understand the need for accessing different Hive data sources within a
single query in real-world cases.

So far, the Kyuubi community has already supported several Spark DataSource
V2 implementations, such as TPC-DS, TPC-H, and Kudu, on the Kyuubi extension
layer. And I believe we will support more and more connectors with the
community's effort in the future.

The only thing that worries me is that the source code of Spark's built-in
Hive catalog is itself a mess. I don't know whether it will put a huge
maintenance burden on our community or not.

But I am +1 for it as it's worth trying.

BR

Kent


On Tue, Aug 16, 2022 at 17:57, zhaomin <zh...@163.com> wrote:

>
> I'm also interested in it.

Re: Support Hive V2 DataSource in Kyuubi

Posted by Cheng Pan <pa...@gmail.com>.

I see you have started reviewing #3260. Thank you for your participation.

Thanks,
Cheng Pan


On Aug 17, 2022 at 23:44:46, Heng Su <pe...@gmail.com> wrote:

> Hi, Pan
>
> Sorry for the late reply; it's been another busy day.
> It's my pleasure to help the Kyuubi community in any way.
> To be fair, I haven't dug deep into Kyuubi yet and am currently focused on
> another project.
> Since #3260 has been opened, I think we can follow that approach to deliver
> the functionality in a short time.
>
>
> Best regards, Heng Su

Re: Support Hive V2 DataSource in Kyuubi

Posted by Heng Su <pe...@gmail.com>.

Hi, Pan

Sorry for the late reply; it's been another busy day.
It's my pleasure to help the Kyuubi community in any way.
To be fair, I haven't dug deep into Kyuubi yet and am currently focused on another project.
Since #3260 has been opened, I think we can follow that approach to deliver the functionality in a short time.


Best regards, Heng Su



> On Aug 17, 2022, at 4:07 PM, Cheng Pan <pa...@gmail.com> wrote:
> 
> Thanks for sharing your experience.

Re: Support Hive V2 DataSource in Kyuubi

Posted by kaifei yi <yi...@gmail.com>.

Hi everyone,

There is an issue [1] and an initial implementation PR [2] for this
discussion. Please go to GitHub for further discussion and review.

[1] https://github.com/apache/incubator-kyuubi/issues/3259
[2] https://github.com/apache/incubator-kyuubi/pull/3260

Thanks

On Wed, Aug 17, 2022 at 16:07, Cheng Pan <pa...@gmail.com> wrote:

> Thanks for sharing your experience.

Re: Support Hive V2 DataSource in Kyuubi

Posted by Cheng Pan <pa...@gmail.com>.

Thanks for sharing your experience.

> kyuubi default set kyuubi.engine.single.spark.session=true

It's `false` by default.

> … provide concurrency sql execution in context isolation(I guess)

A Spark session is conceptually similar to a JDBC/RDBMS connection.

> This causes spark.newSession to be invoked for each transaction. Although
> the embedded sessionCatalog is shared across all Spark sessions (including
> the new one), in the DSv2 architecture the catalogManager (which holds all
> plugin catalogs) is created every time a SparkSession is constructed. So
> when concurrent queries fire, the more DSv2 catalogs we use, the more
> overhead (mainly in metaspace usage) the engine driver holds; in my test
> with 256m of metaspace, OOM occurs.

Yes, this is the big difference between the v1 catalog and v2. Maybe we can
introduce a cache mechanism to reduce the overhead, for example a Hive
client pool.
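
One way to read the cache suggestion is a process-wide cache that hands the
same client to every session whose catalog points at the same metastore. The
sketch below is a minimal model of that idea, not Spark or Kyuubi code: the
HiveClientCache class, the fake client factory, and the metastore addresses
are all hypothetical.

```python
import threading


class HiveClientCache:
    """Share one client per distinct metastore configuration."""

    def __init__(self, factory):
        self._factory = factory
        self._clients = {}
        self._lock = threading.Lock()

    def get(self, options):
        # Sort the options so equal configurations map to the same key.
        key = tuple(sorted(options.items()))
        with self._lock:
            if key not in self._clients:
                self._clients[key] = self._factory(dict(options))
            return self._clients[key]


created = []


def fake_client_factory(options):
    # Stand-in for the expensive step of building a real metastore client.
    client = object()
    created.append(client)
    return client


cache = HiveClientCache(fake_client_factory)
metastores = ["thrift://metastore-a:9083", "thrift://metastore-b:9083"]

# 50 simulated sessions, each touching both catalogs.
for _session in range(50):
    for uri in metastores:
        cache.get({"hive.metastore.uris": uri})

print(len(created))  # 2
```

With the cache keyed on configuration rather than on session, the number of
clients grows with the number of distinct metastores instead of with the
number of concurrent sessions.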

> … a viewfs or router-based federation must be configured in advance. I am
> not sure if we can configure the Hadoop conf separately for each Hive
> catalog.

Maybe we can learn something from Iceberg.
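
A minimal sketch of what a per-catalog Hadoop configuration could look like:
start from the global spark.hadoop.* entries and overlay entries given under
a per-catalog prefix. The spark.hadoop.* prefix is Spark's usual way to pass
Hadoop properties; the per-catalog prefix
spark.sql.catalog.<name>.hadoop.* is an assumption made for this example
(modelled on the convention Iceberg is understood to use), and the function
name, catalog names, and cluster addresses are hypothetical.

```python
def hadoop_conf_for(catalog, spark_conf):
    """Merge global Hadoop settings with one catalog's overrides."""
    global_prefix = "spark.hadoop."
    catalog_prefix = f"spark.sql.catalog.{catalog}.hadoop."

    # Global defaults shared by every catalog.
    merged = {
        key[len(global_prefix):]: value
        for key, value in spark_conf.items()
        if key.startswith(global_prefix)
    }
    # Catalog-specific entries win over the global ones.
    merged.update({
        key[len(catalog_prefix):]: value
        for key, value in spark_conf.items()
        if key.startswith(catalog_prefix)
    })
    return merged


spark_conf = {
    "spark.hadoop.fs.defaultFS": "hdfs://cluster-a",
    "spark.hadoop.dfs.replication": "3",
    "spark.sql.catalog.hive_b.hadoop.fs.defaultFS": "hdfs://cluster-b",
}

print(hadoop_conf_for("hive_a", spark_conf))
print(hadoop_conf_for("hive_b", spark_conf))
```

Here hive_a inherits the global file system, while hive_b keeps the global
replication setting but points at its own cluster.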

Since you already have a Hive DSv2 catalog implementation in good shape,
and more and more people are interested in this feature, would you like to
contribute it to the Kyuubi project?

Thanks,
Cheng Pan


On Aug 17, 2022 at 11:23:07, Heng Su <pe...@gmail.com> wrote:

> Hi, Cheng Pan
>
> Glad to join the session.

Re: Support Hive V2 DataSource in Kyuubi

Posted by Heng Su <pe...@gmail.com>.

Hi, Cheng Pan

Glad to join the session.

The git repo you pointed out is indeed used in our internal production ETL pipeline, though it is currently not combined with Kyuubi.

But I plan to refactor it in two respects:

1. With the release of Spark 3.3, most DSv2 functionality seems to be production ready [1], and some APIs have changed since 3.1, so upgrading to this version should be more stable.
2. We also have a strong desire to integrate Kyuubi as the Spark SQL query engine, although this work is currently only at the research stage.
   I have found some issues in integrating the hive-catalog extension with Kyuubi. For instance, Kyuubi sets `kyuubi.engine.single.spark.session`=true by default to provide concurrent SQL execution with context isolation (I guess).
   This causes spark.newSession to be invoked for each transaction. Although the embedded sessionCatalog is shared across all Spark sessions (including the new one), in the DSv2 architecture the catalogManager (which holds all plugin catalogs) is created every time a SparkSession is constructed. So when concurrent queries fire, the more DSv2 catalogs we use, the more overhead (mainly in metaspace usage) the engine driver holds; in my test with 256m of metaspace, OOM occurs.
   Another issue is that the hive-catalog currently assumes that all target Hadoop clusters can be reached by the Spark execution runtime. That is to say, a viewfs or router-based federation must be configured in advance. I am not sure if we can configure the Hadoop conf separately for each Hive catalog. Similarly, I just use the SQLConf in sessionState as the global SQLConf for all Hive catalogs; in some cases the default conf values may differ between catalogs.
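
The session overhead described in item 2 can be illustrated with a toy
model. This is not Spark code: the CatalogManager class below is only a
stand-in for a per-session object that loads each catalog plugin lazily and
keeps its own instances, as described above, and the catalog names and
session count are hypothetical.

```python
class CatalogManager:
    """Toy per-session manager: loads catalogs lazily, caches per session."""

    def __init__(self, load):
        self._load = load
        self._catalogs = {}

    def catalog(self, name):
        # Reuse within this session only; nothing is shared across sessions.
        if name not in self._catalogs:
            self._catalogs[name] = self._load(name)
        return self._catalogs[name]


loaded = []


def load_plugin(name):
    # Stand-in for instantiating a catalog plugin and its clients.
    loaded.append(name)
    return object()


catalog_names = ["hive_a", "hive_b", "hive_c"]

# One manager per session, as with a new session per connection.
sessions = [CatalogManager(load_plugin) for _ in range(20)]
for manager in sessions:
    for name in catalog_names:
        manager.catalog(name)
        manager.catalog(name)  # repeated access hits the session-local cache

print(len(loaded))  # 60
```

The count grows as sessions times catalogs (20 x 3 here), which matches the
observation that overhead rises with both concurrency and the number of
DSv2 catalogs in use.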

[1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA

> On Aug 16, 2022, at 6:23 PM, Cheng Pan <pa...@gmail.com> wrote:
> 
> Thanks for your idea.

Re: Support Hive V2 DataSource in Kyuubi

Posted by kaifei yi <yi...@gmail.com>.

Haha, I already have an implementation. If the vote passes, I will submit
a PR as the next step: a base implementation plus a list of specific
subtasks that we can work on together.

also @zhaomin


On Tue, Aug 16, 2022 at 18:23, Cheng Pan <pa...@gmail.com> wrote:

> Thanks for your idea.

Re: Support Hive V2 DataSource in Kyuubi

Posted by Cheng Pan <pa...@gmail.com>.

Thanks for your idea.

Whether Kyuubi supports this feature is up to the community. If anyone is
interested in it, feel free to open a PR; I'm happy to review.

In fact, I found that someone (also added as a recipient) has already done
(probably part of) the job [1]. I haven't tested it, but I would appreciate
a chance to collaborate.

[1] https://github.com/permanentstar/spark-sql-dsv2-extension

Thanks,
Cheng Pan


On Aug 16, 2022 at 17:57:35, zhaomin <zh...@163.com> wrote:

> I'm also interested in it.

Re: Support Hive V2 DataSource in Kyuubi

Posted by zhaomin <zh...@163.com>.

I'm also interested in it.



Best Regards,
Min Zhao



