Posted to issues@bigtop.apache.org by "Kengo Seki (Jira)" <ji...@apache.org> on 2022/03/01 04:57:00 UTC
[jira] [Commented] (BIGTOP-3641) Hive on Spark error
[ https://issues.apache.org/jira/browse/BIGTOP-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499333#comment-17499333 ]
Kengo Seki commented on BIGTOP-3641:
------------------------------------
As Leona mentioned, Hive currently does not support Spark 3.x, so some workaround for the version conflicts between their common libraries is required to run Hive on Spark.
I found that the following steps make it possible to run simple Hive queries on Spark, but I'm not sure whether all Hive/Spark functionality is available.
Deploy three-node cluster using docker provisioner:
{code}
$ cd bigtop/provisioner/docker
$ ./docker-hadoop.sh -d -k hdfs,yarn,hbase,hive,spark -c 3
{code}
Log in to the master node and upload the jar files under /usr/lib/spark/jars to HDFS, following step 4 of https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive.
Exclude the hive-* and kryo-* jars to avoid version incompatibilities:
{code}
$ ./docker-hadoop.sh -e 1 bash
# sudo -u hdfs hdfs dfs -mkdir /spark-jars
# sudo -u hdfs hdfs dfs -put /usr/lib/spark/jars/* /spark-jars
# sudo -u hdfs hdfs dfs -rm /spark-jars/hive-* /spark-jars/kryo-*
# vi /etc/hive/conf/hive-site.xml # add the spark.yarn.jar property
{code}
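The hive-site.xml change above is only hinted at; a minimal fragment might look like the following. Note this is a sketch: the comment above names the property {{spark.yarn.jar}}, while the Hive wiki page uses {{spark.yarn.jars}} (plural), and the exact HDFS URI depends on your NameNode address, so both the property name and the value here are assumptions:
{code:xml}
<!-- Assumed fragment for /etc/hive/conf/hive-site.xml; points Hive on Spark
     at the jars uploaded to the /spark-jars HDFS directory created above. -->
<property>
  <name>spark.yarn.jar</name>
  <value>hdfs:///spark-jars/*</value>
</property>
{code}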
Copy some classes required by Spark into the hive-exec and kryo jars:
{code}
# jar xf /usr/lib/spark/jars/commons-lang3-3.9.jar org/apache/commons/lang3/JavaVersion.class
# jar uf /usr/lib/hive/lib/hive-exec-3.1.2.jar org/apache/commons/lang3/JavaVersion.class
# jar xf /usr/lib/spark/jars/kryo-shaded-4.0.2.jar com/esotericsoftware/kryo/serializers
# jar uf /usr/lib/hive/lib/kryo-shaded-3.0.3.jar com/esotericsoftware/kryo/serializers/ClosureSerializer*
# sudo -u hdfs hdfs dfs -put /usr/lib/hive/lib/kryo-shaded-3.0.3.jar /spark-jars
{code}
Run a Hive query with {{hive.execution.engine=spark}}:
{code}
# hive
...
hive> set hive.execution.engine=spark;
hive> create table test(id int, name string);
OK
Time taken: 1.179 seconds
hive> insert into test values (1, 'foo'), (2, 'bar');
...
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-0 ........ 0 FINISHED 1 1 0 0 0
Stage-1 ........ 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 5.08 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 5.09 second(s)
Loading data to table default.test
OK
Time taken: 26.258 seconds
hive> select id, count(*) from test group by id;
...
Query Hive on Spark job[1] stages: [2, 3]
Spark job[1] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-2 ........ 0 FINISHED 1 1 0 0 0
Stage-3 ........ 0 FINISHED 2 2 0 0 0
--------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 4.04 s
--------------------------------------------------------------------------------------
Spark job[1] finished successfully in 4.04 second(s)
OK
1 1
2 1
Time taken: 4.334 seconds, Fetched: 2 row(s)
{code}
> Hive on Spark error
> -------------------
>
> Key: BIGTOP-3641
> URL: https://issues.apache.org/jira/browse/BIGTOP-3641
> Project: Bigtop
> Issue Type: Bug
> Components: hive, spark
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Andrew
> Priority: Major
>
> Hi! I've tried to launch Hadoop stack in docker in 2 ways:
> # successfully built _hdfs, yarn, mapreduce, hbase, hive, spark, zookeeper_ from the Bigtop master branch (version 3.1.0) and launched Docker from the local repo via the provisioner with all these components
> # same as the 1st approach, but with the Bigtop repo (version 3.0.0)
> In both cases everything works fine, but Hive on Spark fails with an error:
> {code:java}
> hive> set hive.execution.engine=spark;
> hive> select id, count(*) from default.test group by id;
> Query ID = root_20220209133134_cf3aec7d-ee2e-4d38-b200-6d616020d4b6
> Total jobs = 1
> Launching Job 1 out of 1
> In order to change the average load for a reducer (in bytes):
> set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
> set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
> set mapreduce.job.reduces=<number>
> Job failed with java.lang.ClassNotFoundException: oot_20220209133134_cf3aec7d-ee2e-4d38-b200-6d616020d4b6:1
> FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed during runtime. Please check stacktrace for the root cause.{code}
>
> From spark-shell everything works fine:
> {code:java}
> scala> sql("select id, count(*) from default.test group by id").show()
> +---+--------+
> | id|count(1)|
> +---+--------+
> | 1| 1|
> | 2| 1|
> +---+--------+{code}
>
> I've also tried to create an HDFS dir with the Spark libs and specify the config as was done in https://issues.apache.org/jira/browse/BIGTOP-3333 - it didn't help. Any ideas what is missing and how to fix it?
> P.S. Spark is deployed as spark-on-yarn
--
This message was sent by Atlassian Jira
(v8.20.1#820001)