Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2018/12/20 15:38:41 UTC

[GitHub] HyukjinKwon opened a new pull request #23356: [SPARK-26422][R] Support to disable Hive support in SparkR even for Hadoop versions unsupported by Hive fork

URL: https://github.com/apache/spark/pull/23356
 
 
   ## What changes were proposed in this pull request?
   
    Currently, even if I explicitly disable Hive support in a SparkR session as below:
   
   ```r
   sparkSession <- sparkR.session("local[4]", "SparkR", Sys.getenv("SPARK_HOME"),
                                  enableHiveSupport = FALSE)
   ```
   
    the session creation still fails as follows when the Hadoop version is not supported by our Hive fork:
   
   ```
   java.lang.reflect.InvocationTargetException
   ...
   Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.1.3.1.0.0-78
   	at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
   	at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
   	at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
   	at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
   	... 43 more
   Error in handleErrors(returnStatus, conn) :
     java.lang.ExceptionInInitializerError
   	at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
   	at java.lang.Class.forName0(Native Method)
   	at java.lang.Class.forName(Class.java:348)
   	at org.apache.spark.util.Utils$.classForName(Utils.scala:193)
   	at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1116)
   	at org.apache.spark.sql.api.r.SQLUtils$.getOrCreateSparkSession(SQLUtils.scala:52)
   	at org.apache.spark.sql.api.r.SQLUtils.getOrCreateSparkSession(SQLUtils.scala)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   ```
   
    The root cause is that `SparkSession.hiveClassesArePresent` checks whether the Hive classes are loadable in order to tell if they are on the classpath; however, `org.apache.hadoop.hive.conf.HiveConf` contains a Hadoop version check in static initialization logic, which runs as soon as the class is loaded. On Hadoop 3 that check throws an `IllegalArgumentException` (surfacing as the `ExceptionInInitializerError` above), and that exception is not caught:
   
   https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L1113-L1121
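    
    For reference, the check at that link looks roughly like the sketch below (abbreviated, not a verbatim copy). The `catch` clause only handles the class-not-found cases, so the `ExceptionInInitializerError` raised by `HiveConf`'s static initializer escapes to the caller:
    
    ```scala
    // Sketch of SparkSession.hiveClassesArePresent (see the link above).
    // Loading HiveConf runs its static initializer, which validates the
    // Hadoop version and throws IllegalArgumentException on Hadoop 3.x.
    private[spark] def hiveClassesArePresent: Boolean = {
      try {
        Utils.classForName("org.apache.spark.sql.hive.HiveSessionStateBuilder")
        Utils.classForName("org.apache.hadoop.hive.conf.HiveConf")
        true
      } catch {
        // Only missing-class failures are handled here; the
        // ExceptionInInitializerError from HiveConf's version check is not.
        case _: ClassNotFoundException | _: NoClassDefFoundError => false
      }
    }
    ```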
   
    So, currently, if users have a Hive-enabled Spark build with a Hadoop version unsupported by our fork (namely Hadoop 3+), there's no way to use SparkR even though it could work.
   
    This PR proposes to change the order of the boolean comparison so that we don't execute `SparkSession.hiveClassesArePresent` when:
   
     1. `enableHiveSupport` is explicitly disabled
     2. `spark.sql.catalogImplementation` is `in-memory`
   
    so that we **only** check `SparkSession.hiveClassesArePresent` when Hive support is explicitly enabled, by short-circuiting the condition (see the sketch below).
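    
    Concretely, the idea is just to reorder the `&&` operands in `SQLUtils.getOrCreateSparkSession`, along the lines of the sketch below (`catalogImplementation` stands in for reading `spark.sql.catalogImplementation` from the conf; it is an illustrative name, not the exact code):
    
    ```scala
    // Before (sketch): hiveClassesArePresent was evaluated first, so
    // HiveConf's static initializer could fail even with Hive disabled:
    //   SparkSession.hiveClassesArePresent && enableHiveSupport &&
    //     catalogImplementation == "hive"
    
    // After: the cheap boolean checks come first. && short-circuits, so
    // hiveClassesArePresent (and hence loading HiveConf) only runs when
    // Hive support is explicitly requested.
    val useHive = enableHiveSupport && catalogImplementation == "hive" &&
      SparkSession.hiveClassesArePresent
    ```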
   
   ## How was this patch tested?
   
   It's difficult to write a test since we don't run tests against Hadoop 3 yet. See https://github.com/apache/spark/pull/21588. Manually tested.
   
