Posted to issues@spark.apache.org by "Peng Cheng (Jira)" <ji...@apache.org> on 2022/01/25 01:35:00 UTC

[jira] [Created] (SPARK-38009) In start-thriftserver.sh arguments, "--hiveconf xxx" should have higher precedence over "--conf spark.hadoop.xxx", or any other hadoop configurations

Peng Cheng created SPARK-38009:
----------------------------------

             Summary: In start-thriftserver.sh arguments, "--hiveconf xxx" should have higher precedence over "--conf spark.hadoop.xxx", or any other hadoop configurations
                 Key: SPARK-38009
                 URL: https://issues.apache.org/jira/browse/SPARK-38009
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0, 2.4.8
         Environment: The experiment described below was run on Apache Spark 2.4.7 and 3.2.0.

 

OS: Ubuntu 20.04

Java: OpenJDK1.8.0

 
            Reporter: Peng Cheng


By convention, an Apache Hive server reads configuration options from several sources with different precedence. Options passed on the command line with "--hiveconf" should be overridden only by values set with the SET command (see [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration] for details). They should take precedence over Hadoop configuration and over any configuration file on the server (including, but not limited to, hive-site.xml and core-site.xml).

The Apache Spark Thrift server does not follow this convention. For example, if I start the server with conflicting values for "hive.server2.thrift.port":

 

```
./sbin/start-thriftserver.sh \
--conf spark.hadoop.hive.server2.thrift.port=10001 \
--hiveconf hive.server2.thrift.port=10002
```

 

"--conf"/port 10001 is preferred over "--hiveconf"/port 10002:

 

```

Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /home/xxx/spark-2.4.7-bin-hadoop2.7-scala2.12/conf/:/home/xxx/spark-2.4.7-bin-hadoop2.7-scala2.12/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --conf spark.hadoop.hive.server2.thrift.port=10001 --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server spark-internal --hiveconf hive.server2.thrift.port=10002
========================================
...
22/01/24 17:32:18 INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 10001 with 5...500 worker threads

```

 

Replacing the "--conf" line with an entry in core-site.xml makes no difference.
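
To spot this kind of clash before starting the server, the argument list can be scanned for keys that appear both as "--conf spark.hadoop.<key>" and as "--hiveconf <key>" with different values. The following is a minimal sketch only: `find_conflicts` is a hypothetical helper, not part of Spark, and it assumes every option is written as a separate "key=value" argument.

```python
# Hypothetical helper, not part of Spark: reports keys that are set both as
# "--conf spark.hadoop.<key>=..." and "--hiveconf <key>=..." with different values.
SPARK_HADOOP_PREFIX = "spark.hadoop."


def find_conflicts(argv):
    conf, hiveconf = {}, {}
    it = iter(argv)
    for arg in it:
        if arg not in ("--conf", "--hiveconf"):
            continue
        pair = next(it, None)
        if pair is None or "=" not in pair:
            continue  # malformed or missing "key=value"; ignored in this sketch
        key, value = pair.split("=", 1)
        if arg == "--hiveconf":
            hiveconf[key] = value
        elif key.startswith(SPARK_HADOOP_PREFIX):
            # Compare on the key with the "spark.hadoop." prefix removed.
            conf[key[len(SPARK_HADOOP_PREFIX):]] = value
    return {
        key: (conf[key], hiveconf[key])
        for key in conf
        if key in hiveconf and conf[key] != hiveconf[key]
    }


args = [
    "--conf", "spark.hadoop.hive.server2.thrift.port=10001",
    "--hiveconf", "hive.server2.thrift.port=10002",
]
for key, (conf_value, hiveconf_value) in find_conflicts(args).items():
    print(f"{key}: --conf gives {conf_value}, --hiveconf gives {hiveconf_value}")
```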

I doubt that this divergence from conventional Hive server behaviour is deliberate. I am therefore asking that the precedence of Hive configuration options be made the same as, or as close as possible to, that of an Apache Hive server of the same version. To my knowledge, it should be:

 

SET command > --hiveconf > hive-site.xml > hive-default.xml > --conf > core-site.xml > core-default.xml
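
The requested ordering can be illustrated as a lookup that walks the sources from highest to lowest precedence and returns the first value found. This is a sketch of the proposed behaviour only, not of Spark's or Hive's actual implementation; `PRECEDENCE` and `resolve` are hypothetical names, and the "--conf" entry is assumed to hold keys with the "spark.hadoop." prefix already removed.

```python
# Sources from highest to lowest precedence, as proposed in this report.
PRECEDENCE = [
    "SET command",
    "--hiveconf",
    "hive-site.xml",
    "hive-default.xml",
    "--conf",  # assumed: keys stored without the "spark.hadoop." prefix
    "core-site.xml",
    "core-default.xml",
]


def resolve(key, sources):
    """Return (value, source_name) from the highest-precedence source that defines key."""
    for name in PRECEDENCE:
        settings = sources.get(name, {})
        if key in settings:
            return settings[key], name
    return None, None


# The two conflicting options from the example in this report.
sources = {
    "--conf": {"hive.server2.thrift.port": "10001"},
    "--hiveconf": {"hive.server2.thrift.port": "10002"},
}
value, origin = resolve("hive.server2.thrift.port", sources)
print(value, origin)  # prints "10002 --hiveconf"
```

Under this ordering the server in the example would listen on port 10002 rather than 10001.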



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
