Posted to user@spark.apache.org by ca...@free.fr on 2022/03/26 00:26:20 UTC

spark as data warehouse?

We have been using Hive to build our data warehouse.
Do you think Spark can be used for this purpose? It's even more real-time
than Hive.

Thanks.


Re: spark as data warehouse?

Posted by Cheng Pan <pa...@gmail.com>.
Sorry I missed the original channel, added it back.

-----

I don't know much about dbt, but if it supports Hive, it should support Kyuubi.
Basically, Kyuubi is a gateway between your client (e.g. beeline, the Hive
JDBC client) and a compute engine (e.g. Spark, Flink, Trino). I think the
most valuable things are:
1) Kyuubi reuses the Hive Thrift protocol, which means you can treat Kyuubi
as a HiveServer2 and keep using beeline or the Hive JDBC driver to
connect to Kyuubi and run SQL (in your compute engine's dialect). Ideally,
if a tool claims it supports Hive, then it supports Kyuubi. (There is a
small connection sketch after the links below.)
2) Kyuubi manages the compute engine lifecycle and share level, making
a good trade-off between isolation and resource consumption. [1]

PS: Kyuubi's support for Spark is very mature; you can find lots of
production use cases here [2]. Support for Flink & Trino is in beta.

[1] https://kyuubi.apache.org/docs/latest/deployment/engine_share_level.html
[2] https://github.com/apache/incubator-kyuubi/discussions/925
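
To make 1) concrete, here is a minimal sketch of connecting to Kyuubi with
the plain Hive JDBC driver, just as you would connect to a HiveServer2. The
host, port (10009 is Kyuubi's default frontend port) and user are
placeholders for your own deployment, and the hive-jdbc artifact needs to be
on the classpath:

import java.sql.DriverManager

object KyuubiJdbcSketch {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver (newer driver versions auto-register).
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Kyuubi speaks the Hive Thrift protocol, so the standard Hive JDBC
    // driver works unchanged. URL and user below are placeholder values.
    val url = "jdbc:hive2://kyuubi-host:10009/default"
    val conn = DriverManager.getConnection(url, "your_user", "")
    try {
      val stmt = conn.createStatement()
      // The SQL dialect is whatever the backing engine accepts,
      // e.g. Spark SQL if Kyuubi launched a Spark engine for you.
      val rs = stmt.executeQuery("SELECT 1")
      while (rs.next()) println(rs.getInt(1))
    } finally {
      conn.close()
    }
  }
}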

Thanks,
Cheng Pan

-------

Thanks, I'll check it out.
We have a use case where we want to use dbt as a data modelling tool.
Will it take dbt queries and create the resulting models?
I see it supports Trino, so I'm guessing yes.

I would love to contribute to it as well.

Thanks
Deepak

-------

Spark SQL can indeed take over your Hive workloads, and if you're
looking for an open source solution, Apache Kyuubi (Incubating) [1]
might help.

[1] https://kyuubi.apache.org/
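
As a small illustration of Spark SQL taking over Hive workloads, here is a
sketch of a Spark job that enables Hive support and queries an existing Hive
table directly. The table name is made up, and it assumes your Spark build
has Hive support and can reach your Hive metastore:

import org.apache.spark.sql.SparkSession

object HiveOnSparkSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() makes Spark SQL use the Hive metastore,
    // so existing Hive databases and tables are queryable as-is.
    val spark = SparkSession.builder()
      .appName("hive-on-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // "sales.orders" is a placeholder for one of your existing Hive tables.
    spark.sql("SELECT order_date, count(*) FROM sales.orders GROUP BY order_date")
      .show()

    spark.stop()
  }
}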

Thanks,
Cheng Pan



Re: spark as data warehouse?

Posted by Deepak Sharma <de...@gmail.com>.
It can be used as a warehouse, but then you have to keep long-running Spark
jobs.
This is possible using cached DataFrames or Datasets.
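
To make that concrete, here is a rough sketch of the "long-running job with
cached data" idea. The table and view names are invented, and in practice
you would expose the session through something like the Spark Thrift server
or Kyuubi rather than a bare loop:

import org.apache.spark.sql.SparkSession

object LongRunningWarehouseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("long-running-warehouse-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Cache a hot table in memory and register it for SQL access.
    // "warehouse.fact_sales" is a placeholder table name.
    val sales = spark.table("warehouse.fact_sales").cache()
    sales.createOrReplaceTempView("fact_sales_cached")
    sales.count() // force materialization of the cache

    // Keep the application (and thus the cache) alive so that later
    // queries hit memory instead of re-reading from storage.
    while (true) {
      spark.sql("SELECT count(*) FROM fact_sales_cached").show()
      Thread.sleep(60000)
    }
  }
}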

Thanks
Deepak

On Sat, 26 Mar 2022 at 5:56 AM, <ca...@free.fr> wrote:

> We have been using Hive to build our data warehouse.
> Do you think Spark can be used for this purpose? It's even more real-time
> than Hive.
>
> Thanks.
>
--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net