Posted to issues@bigtop.apache.org by "Kengo Seki (Jira)" <ji...@apache.org> on 2022/03/01 04:57:00 UTC

[jira] [Commented] (BIGTOP-3641) Hive on Spark error

    [ https://issues.apache.org/jira/browse/BIGTOP-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499333#comment-17499333 ] 

Kengo Seki commented on BIGTOP-3641:
------------------------------------

As Leona mentioned, Hive currently does not support Spark 3.x, so some workaround for the version conflicts between their shared libraries is required to run Hive on Spark.
I found that the following steps enable running simple Hive queries on Spark, but I'm not sure whether all Hive/Spark functionality is available.

Deploy a three-node cluster using the Docker provisioner:

{code}
$ cd bigtop/provisioner/docker
$ ./docker-hadoop.sh -d -k hdfs,yarn,hbase,hive,spark -c 3
{code}
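
As a quick sanity check (not part of the original steps), the provisioner's {{-e}} option, used below to open a shell, should also run one-off commands on the master node to confirm HDFS and YARN came up:

{code}
$ ./docker-hadoop.sh -e 1 hdfs dfsadmin -report   # all three datanodes should be reported as live
$ ./docker-hadoop.sh -e 1 yarn node -list         # NodeManagers should be in RUNNING state
{code}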

Log in to the master node and upload the jar files under /usr/lib/spark/jars to HDFS, in accordance with step 4 of https://cwiki.apache.org//confluence/display/Hive/Hive+on+Spark:+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive.
Exclude the hive-* and kryo-* jars to avoid version incompatibilities:

{code}
$ ./docker-hadoop.sh -e 1 bash
# sudo -u hdfs hdfs dfs -mkdir /spark-jars
# sudo -u hdfs hdfs dfs -put /usr/lib/spark/jars/* /spark-jars
# sudo -u hdfs hdfs dfs -rm /spark-jars/hive-* /spark-jars/kryo-*
# vi /etc/hive/conf/hive-site.xml  # add the spark.yarn.jar property
{code}
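
For reference, the hive-site.xml addition looks roughly like this; the NameNode host and port are placeholders, and note that the wiki page linked above spells the property {{spark.yarn.jars}} for Spark 2.0 and later:

{code:xml}
<!-- Point Hive's Spark jobs at the jars uploaded to HDFS above.
     Replace <namenode-host> with your fs.defaultFS host (port 8020 assumed). -->
<property>
  <name>spark.yarn.jar</name>
  <value>hdfs://<namenode-host>:8020/spark-jars/*</value>
</property>
{code}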

Add some required classes to the hive-exec and kryo jars, copying them from Spark's newer libraries:

{code}
# jar xf /usr/lib/spark/jars/commons-lang3-3.9.jar org/apache/commons/lang3/JavaVersion.class
# jar uf /usr/lib/hive/lib/hive-exec-3.1.2.jar org/apache/commons/lang3/JavaVersion.class
# jar xf /usr/lib/spark/jars/kryo-shaded-4.0.2.jar com/esotericsoftware/kryo/serializers
# jar uf /usr/lib/hive/lib/kryo-shaded-3.0.3.jar com/esotericsoftware/kryo/serializers/ClosureSerializer*
# sudo -u hdfs hdfs dfs -put /usr/lib/hive/lib/kryo-shaded-3.0.3.jar /spark-jars
{code}
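
To double-check that the copied classes actually landed in the patched jars (an optional verification, not in the original steps), list the jar contents:

{code}
# jar tf /usr/lib/hive/lib/hive-exec-3.1.2.jar | grep JavaVersion
# jar tf /usr/lib/hive/lib/kryo-shaded-3.0.3.jar | grep ClosureSerializer
{code}

Each command should print the corresponding class entries; empty output means the {{jar uf}} update did not take effect.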

Run a Hive query with {{hive.execution.engine=spark}}:

{code}
# hive

...

hive> set hive.execution.engine=spark;
hive> create table test(id int, name string);
OK
Time taken: 1.179 seconds
hive> insert into test values (1, 'foo'), (2, 'bar');

...

Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      1          1        0        0       0  
Stage-1 ........         0      FINISHED      1          1        0        0       0  
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 5.08 s     
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 5.09 second(s)
Loading data to table default.test
OK
Time taken: 26.258 seconds
hive> select id, count(*) from test group by id;

...

Query Hive on Spark job[1] stages: [2, 3]
Spark job[1] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
--------------------------------------------------------------------------------------
Stage-2 ........         0      FINISHED      1          1        0        0       0  
Stage-3 ........         0      FINISHED      2          2        0        0       0  
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 4.04 s     
--------------------------------------------------------------------------------------
Spark job[1] finished successfully in 4.04 second(s)
OK
1	1
2	1
Time taken: 4.334 seconds, Fetched: 2 row(s)
{code}
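
To make Spark the default engine instead of setting it in every session, the same property can be put into hive-site.xml ({{hive.execution.engine}} is a standard Hive setting):

{code:xml}
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
{code}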

> Hive on Spark error
> -------------------
>
>                 Key: BIGTOP-3641
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-3641
>             Project: Bigtop
>          Issue Type: Bug
>          Components: hive, spark
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Andrew
>            Priority: Major
>
> Hi! I've tried to launch the Hadoop stack in Docker in 2 ways:
>  # successfully built _hdfs, yarn, mapreduce, hbase, hive, spark, zookeeper_ from the bigtop master branch (version 3.1.0) and launched Docker from the local repo via the provisioner with all these components
>  # same as the 1st approach, but with the bigtop repo (version 3.0.0)
> In both cases everything works fine, but Hive on Spark fails with an error:
> {code:java}
> hive> set hive.execution.engine=spark;
> hive> select id, count(*) from default.test group by id;
> Query ID = root_20220209133134_cf3aec7d-ee2e-4d38-b200-6d616020d4b6
> Total jobs = 1
> Launching Job 1 out of 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Job failed with java.lang.ClassNotFoundException: oot_20220209133134_cf3aec7d-ee2e-4d38-b200-6d616020d4b6:1
> FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed during runtime. Please check stacktrace for the root cause.{code}
>  
> From spark-shell everything works fine:
> {code:java}
> scala> sql("select id, count(*) from default.test group by id").show()
> +---+--------+                                                                  
> | id|count(1)|
> +---+--------+
> |  1|       1|
> |  2|       1|
> +---+--------+{code}
>  
> I've also tried to create an HDFS dir with the Spark libs and specify the config as was done in https://issues.apache.org/jira/browse/BIGTOP-3333 - it didn't help. Any ideas what is missing and how to fix it?
> P.S. Spark is used as spark-on-yarn


