Posted to user@hive.apache.org by Daniel Mateus Pires <dm...@gmail.com> on 2019/03/11 16:14:54 UTC

Running Hive on Spark

Hi there,

I would like to run Hive using Spark as the execution engine, and I'm pretty
confused by the setup.

For reference, I'm using AWS EMR.

First, I'm confused about the difference between running Hive with Spark as
its execution engine (sending queries to Hive through HiveServer2/Thrift)
and using the SparkThriftServer (which I thought was built on top of
HiveServer2). Is there somewhere I could read more about the differences?

I followed this doc:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
and after changing the execution engine from the EMR default (Tez) to
Spark, I can see the difference in the HiveServer2 web UI at port 10002,
where the steps now show "spark" as the execution engine.
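
(For reference, the switch itself is the standard Hive property from that
page; I set it per session, but it can equally go into hive-site.xml:)

set hive.execution.engine=spark;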

However, I've set the following config to get the Spark History Server to
display queries coming in through JDBC, and I can see the queries sent to
the SparkThriftServer (port 10001) but not those sent to HiveServer2 with
Spark as the execution engine (port 10000):

set spark.eventLog.enabled=true;
set spark.master=localhost:18080;
set spark.eventLog.dir=hdfs:///var/log/spark/apps;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
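
(A corrected sketch of those settings, assuming a YARN-based EMR cluster and
that I read the Spark docs right: spark.master should point at the cluster
manager rather than at the History Server UI port 18080, and the eventLog
settings are what make applications appear in the History Server. Not yet
verified end to end:)

set spark.master=yarn;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=hdfs:///var/log/spark/apps;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;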

Thanks!

Re: Running Hive on Spark

Posted by Rajesh Balamohan <rb...@apache.org>.
"Hive on Spark" uses Spark purely as execution engine. It would not get the
benefits of codegen and other optimizations of Spark.

If it is mainly for testing, the out-of-the-box (OOTB) parameters should work
without issues.

However, Tez has a significant edge over Hive on Spark.

Some of the areas where Hive on Spark needs to catch up are listed below
(a config sketch of the corresponding knobs follows the list):

* No support for auto-reduce parallelism.
* Dynamic partition pruning is not fully supported.
* Fetchers can start only when all mappers are complete. This can be a huge
pain point in a lot of cases.
* You have to specify CombinedInputFormat to tackle small files, but that
has issues with splitting.
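
(A rough sketch of the knobs being referenced, assuming a recent Hive
release; the property names come from the Hive documentation, not from this
thread:)

-- Tez adjusts reducer counts at runtime (auto-reduce parallelism)
set hive.tez.auto.reducer.parallelism=true;
-- Tez dynamic partition pruning (on by default)
set hive.tez.dynamic.partition.pruning=true;
-- The Hive-on-Spark counterpart exists but is narrower and off by default
set hive.spark.dynamic.partition.pruning=true;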

~Rajesh.B

Re: Running Hive on Spark

Posted by Daniel Mateus Pires <dm...@gmail.com>.
Hi Rajesh,

I'm trying to deepen my understanding of the various interactions and
setups for Hive + Spark.

My understanding so far is that running queries against the
SparkThriftServer uses the Spark SQL engine, whereas HiveServer2 with Hive
on Spark uses Hive's own planner and primitives and only uses Spark for the
actual computation.
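
(In practice both endpoints speak the HiveServer2/Thrift protocol, so the
client side looks the same and only the engine behind the port differs; a
quick beeline sketch, using the ports from the setup above and a
hypothetical table name:)

# HiveServer2 with hive.execution.engine=spark: Hive plans the query, Spark executes it
beeline -u jdbc:hive2://localhost:10000/default -e "select count(*) from my_table;"

# SparkThriftServer: Spark SQL plans and executes the query
beeline -u jdbc:hive2://localhost:10001/default -e "select count(*) from my_table;"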

I get your question about "why would I do that?", but my goal right now is
to understand "what does it mean if I do that".

Best regards
Daniel

Re: Running Hive on Spark

Posted by Rajesh Balamohan <rb...@apache.org>.
Not sure why you are using the SparkThriftServer; the out-of-the-box (OOTB)
HiveServer2 would be good enough for this.

Is there any specific reason for moving from Tez to Spark as the execution
engine?

~Rajesh.B
