Posted to dev@spark.apache.org by Zhan Zhang <zh...@gmail.com> on 2014/11/21 23:51:23 UTC
How spark and hive integrate in long term?
Spark and Hive integration is currently a very nice feature, but I am wondering
what the long-term roadmap is for Spark's integration with Hive. Both projects
are undergoing rapid improvement and change. My current understanding is that
the Spark SQL Hive support relies on the Hive metastore and basic parser to
operate, and that the Thrift server intercepts Hive queries and executes them
with its own engine.
Every release of Hive requires a significant effort on the Spark side to
support it.
For the metastore part, we may be able to replace it with HCatalog. But given
the dependency of other components on Hive (e.g., the metastore and Thrift
server), HCatalog may not be able to help much.
Does anyone have any insights or ideas?
Thanks.
Zhan Zhang
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/How-spark-and-hive-integrate-in-long-term-tp9482.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: How spark and hive integrate in long term?
Posted by Patrick Wendell <pw...@gmail.com>.
There are two distinct topics when it comes to Hive integration. Part
of the 1.3 roadmap will likely be defining a clearer plan for Hive
integration as Hive releases future versions.
1. Ability to interact with Hive metastores from different versions
==> I.e., if a user has a metastore, can Spark SQL read the data? We
will need to solve this either by asking Hive for a stable metastore
Thrift API, or by adding sufficient features to the HCatalog API so we
can use that.
2. Compatibility with HQL over time as Hive adds new features.
==> This relates to how often we update our internal library
dependency on Hive and/or build support for new Hive features
internally.
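Patrick's second point is essentially what a version shim layer does: route each Hive operation to an adapter built against a specific Hive version. A minimal sketch of that dispatch pattern (hypothetical class and version names, not Spark's actual shim code):

```python
# Toy sketch of a version-shim layer: each supported Hive version gets
# its own adapter class, and a loader picks the right one at runtime
# based on the Hive version found on the classpath.

class Hive12Shim:
    def describe(self):
        return "adapter for Hive 0.12 APIs"

class Hive13Shim:
    def describe(self):
        return "adapter for Hive 0.13 APIs"

_SHIMS = {"0.12": Hive12Shim, "0.13": Hive13Shim}

def load_shim(hive_version: str):
    """Return the adapter matching the detected Hive version."""
    try:
        return _SHIMS[hive_version]()
    except KeyError:
        raise ValueError(f"unsupported Hive version: {hive_version}")

print(load_shim("0.13").describe())
```

The cost of this pattern is exactly what the thread is about: every new Hive release means writing and maintaining another adapter.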
Re: How spark and hive integrate in long term?
Posted by Zhan Zhang <zz...@hortonworks.com>.
Thanks Cheng for the insights.
Regarding HCatalog, I did some initial investigation too and agree with you: as of now, it does not seem to be a good solution. I will try to talk to the Hive folks to see whether there is any guarantee of downward compatibility for the Thrift protocol. By the way, I tried some basic functions with a hive-0.13 client connected to a hive-0.14 metastore, and they appear to be compatible.
Thanks.
Zhan Zhang
Re: How spark and hive integrate in long term?
Posted by Cheng Lian <li...@gmail.com>.
I should emphasize that this is still a quick and rough conclusion; I
will investigate in more detail after the 1.2.0 release. In any case,
we would really like to make Hive support in Spark SQL as smooth and
clean as possible for both developers and end users.
Re: How spark and hive integrate in long term?
Posted by Cheng Lian <li...@gmail.com>.
Hey Zhan,
This is a great question. We are also seeking a stable API/protocol
that works with multiple Hive versions (esp. 0.12+). SPARK-4114
<https://issues.apache.org/jira/browse/SPARK-4114> was opened for this.
I did some research into HCatalog recently, but I must confess that I'm
not an expert on it; I spent only a day exploring it. So please don't
hesitate to correct me if any of the conclusions below are wrong.
First, although the HCatalog API is more pleasant to work with, it is
unfortunately feature-incomplete. It provides only a subset of the most
commonly used operations. For example, |HCatCreateTableDesc| maps only a
subset of |CreateTableDesc|; properties like |storeAsSubDirectories|,
|skewedColNames| and |skewedColValues| are missing. It is also impossible
to alter table properties via the HCatalog API (Spark SQL uses this to
implement the |ANALYZE| command). The |hcat| CLI tool provides the
features missing from the HCatalog API via the raw Metastore API, and is
structurally similar to the old Hive CLI.
Second, the HCatalog API itself doesn't ensure compatibility; it's the
Thrift protocol that matters. HCatalog is built directly upon the raw
Metastore API and talks the same Metastore Thrift protocol. The problem
we encountered in Spark SQL is that we usually deploy Spark SQL Hive
support with an embedded (for testing) or local-mode Metastore, which
leaves us exposed to things like Metastore database schema changes.
If the Hive Metastore Thrift protocol is guaranteed to be downward
compatible, then hopefully we can switch to a remote-mode Metastore and
always depend on the most recent Hive APIs. I glanced at the Thrift
protocol version handling code in Hive, and it seems that downward
compatibility is not an issue. However, I didn't find any official
documentation on Thrift protocol compatibility.
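The downward compatibility described above comes down to client and server agreeing on the highest protocol version both sides understand, so that an older client can still talk to a newer server. A toy sketch of that negotiation (illustrative only, not the actual Hive Thrift handshake):

```python
def negotiate_version(client_versions, server_versions):
    """Pick the highest protocol version supported by both peers.

    Returns None when there is no common version, i.e. the peers
    are incompatible.
    """
    common = set(client_versions) & set(server_versions)
    return max(common) if common else None

# An older 0.13-era client talking to a newer 0.14-era server still
# works as long as the server keeps supporting the older versions.
old_client = [1, 2, 3]
new_server = [1, 2, 3, 4]
print(negotiate_version(old_client, new_server))  # highest common: 3
```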
That said, in the future we hopefully can depend only on the most
recent Hive releases and remove the Hive shim layer introduced in
branch 1.2. Users who run exactly the same version of Hive as Spark SQL
can use either a remote or a local/embedded Metastore, while users who
want to interact with existing legacy Hive clusters will have to set up
a remote Metastore and let the Thrift protocol handle compatibility.
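In the remote-Metastore deployment described above, Spark SQL talks Thrift to a standalone metastore service instead of embedding one. As a rough sketch, that typically means a hive-site.xml on the client's classpath along these lines (the host is a placeholder; 9083 is the conventional default metastore port):

```xml
<configuration>
  <!-- Point clients at a standalone metastore service over Thrift,
       instead of an embedded or local-mode metastore. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```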
— Cheng
Re: How spark and hive integrate in long term?
Posted by Ted Yu <yu...@gmail.com>.
bq. spark-0.12 also has some nice feature added
Minor correction: you meant Spark 1.2.0, I guess.
Cheers
Re: How spark and hive integrate in long term?
Posted by Zhan Zhang <zz...@hortonworks.com>.
Thanks, Dean, for the information.
Hive-on-Spark is nice. Spark SQL has the advantage of taking full advantage of Spark and allowing users to manipulate tables as RDDs through native Spark support.
When I tried to upgrade the current hive-0.13.1 support to hive-0.14.0, I found the Hive parser is no longer compatible. Meanwhile, the new features introduced in hive-0.14.0, e.g., ACID, are not there yet. In the meantime, spark-0.12 also has some nice features added that are supported by the Thrift server too, e.g., hive-0.13 support, table caching, etc.
Given that both projects keep adding features, it would be great if users could take advantage of both. Currently, Spark SQL gives us such benefits partially, but I am wondering how to maintain this integration in the long term.
Thanks.
Zhan Zhang
Re: How spark and hive integrate in long term?
Posted by Dean Wampler <de...@gmail.com>.
I can't comment on plans for Spark SQL's support for Hive, but several
companies are porting Hive itself onto Spark:
http://blog.cloudera.com/blog/2014/11/apache-hive-on-apache-spark-the-first-demo/
I'm not sure if they are leveraging the old Shark code base or not, but it
appears to be a fresh effort.
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com