Posted to issues@spark.apache.org by "angerszhu (Jira)" <ji...@apache.org> on 2021/07/23 07:29:00 UTC

[jira] [Resolved] (SPARK-29018) Build spark thrift server on its own code based on protocol v11

     [ https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu resolved SPARK-29018.
-------------------------------
    Resolution: Won't Fix

> Build spark thrift server on its own code based on protocol v11
> ---------------------------------------------------------------
>
>                 Key: SPARK-29018
>                 URL: https://issues.apache.org/jira/browse/SPARK-29018
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>
> h2. Background
>     With the development of Spark and Hive, the current sql/hive-thriftserver module requires a lot of work to resolve code conflicts between the different built-in Hive versions. This is annoying, never-ending work under the current approach, and these issues have limited our ability and convenience to develop new features for Spark's thrift server.
>     We propose to implement a new thrift server and JDBC driver based on Hive's latest TCLIService.thrift protocol (v11). The new thrift server will have the following features:
>  # Build a new module, spark-service, as Spark's thrift server
>  # Avoid the heavy reflection and inherited code that the `hive-thriftserver` module needs
>  # Support all functions the current `sql/hive-thriftserver` supports
>  # Use code maintained entirely by Spark itself, with no dependency on Hive
>  # Support existing functionality in Spark's own way, no longer limited by Hive's code
>  # Support running with or without a Hive metastore
>  # Support user impersonation via multi-tenant split Hive authentication and DFS authentication
>  # Support session hooks with Spark's own code
>  # Add a new JDBC driver, spark-jdbc, with Spark's own connection URL `jdbc:spark:<host>:<port>/<db>`
>  # Support both hive-jdbc and spark-jdbc clients, so we can support most clients and BI platforms
> h2. How to start?
>      We can start this new thrift server with *sbin/start-spark-thriftserver.sh* and stop it with *sbin/stop-spark-thriftserver.sh*. We don't need HiveConf's configurations to determine the behavior of the spark thrift server: all needed configuration is implemented by Spark itself in `org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is only used to connect to the Hive metastore. We can put all the conf we need in *conf/spark-defaults.conf* or pass it with *--conf* in the startup command, as sketched below.
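> A minimal sketch of starting and stopping the proposed server, assuming the scripts forward arguments to spark-submit the way today's sbin/start-thriftserver.sh does; the conf values are only illustrative:
> {code}
> # Start the proposed thrift server, passing ordinary Spark configuration
> # via --conf (assumed invocation; values are placeholders).
> sbin/start-spark-thriftserver.sh \
>   --master yarn \
>   --conf spark.executor.memory=4g
>
> # Stop it again.
> sbin/stop-spark-thriftserver.sh
> {code}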
> h2. How to connect through jdbc?
>    Now we support both hive-jdbc and spark-jdbc; users can choose whichever they prefer.
> h3. spark-jdbc
>  # Use `SparkDriver` as the JDBC driver class
>  # The connection URL `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list` is mostly the same as Hive's, but with Spark's own URL prefix `jdbc:spark`
>  # For proxy users, SparkDriver should set the proxy conf `spark.sql.thriftserver.proxy.user=username` (see the sketch below)
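> A minimal sketch of connecting through the proposed spark-jdbc driver. The proposal only names the class `SparkDriver`, so the package below is an assumption, as is passing the proxy conf in the sess_var_list the way hive-jdbc passes its proxy conf:
> {code:scala}
> import java.sql.DriverManager
>
> // Register the proposed driver class. The package name is an assumption;
> // this proposal only names the class SparkDriver.
> Class.forName("org.apache.spark.sql.service.jdbc.SparkDriver")
>
> // Spark's own URL prefix; the proxy conf key comes from this proposal and
> // is assumed to ride in the sess_var_list, like hive.server2.proxy.user.
> val conn = DriverManager.getConnection(
>   "jdbc:spark://host1:10000/default;spark.sql.thriftserver.proxy.user=alice")
>
> val rs = conn.createStatement().executeQuery("SELECT 1")
> while (rs.next()) println(rs.getInt(1))
> conn.close()
> {code}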
> h3. hive-jdbc
>  # Use `HiveDriver` as the JDBC driver class
>  # The connection URL jdbc:hive2://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list stays as it is today
>  # For proxy users, HiveDriver should set the proxy conf hive.server2.proxy.user=username; the current server supports both configs
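> For comparison, the unchanged hive-jdbc path (a sketch; host, port, and user are placeholders):
> {code:scala}
> import java.sql.DriverManager
>
> // The standard Hive JDBC driver class.
> Class.forName("org.apache.hive.jdbc.HiveDriver")
>
> // hive.server2.proxy.user rides in the sess_var_list, as with HiveServer2.
> val conn = DriverManager.getConnection(
>   "jdbc:hive2://host1:10000/default;hive.server2.proxy.user=alice")
> {code}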
> h2. How is it done today, and what are the limits of current practice?
> h3. Current practice
> We have completed the two modules `spark-service` & `spark-jdbc`. They run well: we have ported the original UTs to these two modules, and they pass. For impersonation, we have written the code and tested it in our kerberized environment; it works well and is waiting for review. We will now raise PRs to the apache/spark master branch step by step.
> h3. Here are some known changes:
>  # Don't use any Hive code in the `spark-service` and `spark-jdbc` modules
>  # In the current service, the default rc file suffix `.hiverc` is replaced by `.sparkrc`
>  # When using SparkDriver as the JDBC driver class, the URL should use jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list
>  # When using SparkDriver as the JDBC driver class, the proxy conf should be `spark.sql.thriftserver.proxy.user=proxy_user_name`
>  # Support `hiveconf` and `hivevar` session conf through a hive-jdbc connection; the URL anatomy is sketched below
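> A sketch of how the three URL segments carry session vars, `hiveconf` entries, and `hivevar` entries (all keys and values here are placeholders):
> {code}
> jdbc:hive2://host1:10000/default;user=alice?hive.exec.parallel=true#myvar=1
>                                  \________/ \_____________________/ \_____/
>                                sess_var_list        conf_list      var_list
> {code}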
> h2. What are the risks?
>     This is a totally new module and won't change other modules' code, except for supporting impersonation. Apart from impersonation, we have added many UTs adapted from the original UTs (with the grammar adjusted to work without Hive), and all of them pass. For impersonation, I have tested it in our kerberized environment, but it still needs detailed review since it changes a lot.
> h2. How long will it take?
>        We have done all this work in our own repo; now we plan to merge our code into master step by step.
>  # Phase 1: PR to build the new module *spark-service* in folder *sql/service*
>  # Phase 2: PR with the thrift protocol and the generated thrift protocol Java code
>  # Phase 3: PR with all *spark-service* module code, a description of the design, and Unit Tests
>  # Phase 4: PR to build the new module *spark-jdbc* in folder *sql/jdbc*
>  # Phase 5: PR with all *spark-jdbc* module code and Unit Tests
>  # Phase 6: PR to support thrift server impersonation
>  # Phase 7: PR to build Spark's own beeline client, *spark-beeline*
>  # Phase 8: PR with Spark's own CLI client code to support *Spark SQL CLI*, in a module named *spark-cli*
> h3. Appendix A. Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account.
> Compared to the current `sql/hive-thriftserver`, the corresponding API changes are as below:
>  
>  # Add a new class org.apache.spark.sql.service.internal.ServiceConf, containing all needed configuration for the spark thrift server
>  # ServiceSessionXxx replaces the original HiveSessionXxx
>  # In ServiceSessionImpl, remove code Spark won't use
>  # In ServiceSessionImpl, set session conf directly on sqlConf, like [https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L67-L69] (see the sketch after this list)
>  # Remove SparkSQLSessionManager; move its logic into SessionManager
>  # Move all OperationManager logic into SparkSQLOperationManager and rename it to OperationManager
>  # Add SQLContext to ServiceSessionImpl as its own member; don't pass it through SparkSQLOperationManager, just get it via parentSession.getSqlContext(); session conf is set on this sqlContext's sqlConf
>  # Remove HiveServer2 since we don't need its logic
>  # Remove the logic around Hive impersonation, since it won't be useful in the spark thrift server, and remove the parameter delegationTokenStr in ServiceSessionImplWithUGI [https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L352-L353]; we will use a new way for Spark's impersonation
>  # Remove ThriftserverShimUtils, since we don't need it
>  # Remove SparkSQLCLIService; just use CLIService
>  # Remove ReflectionUtils and ReflectedCompositeService since we don't need inheritance and reflection
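> A minimal sketch of the session-conf wiring referenced above (items 4 and 7), assuming a hypothetical ServiceSessionImpl that owns its SQLContext; every name here except SQLContext.setConf is illustrative:
> {code:scala}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.SQLContext
>
> // Hypothetical sketch: each session owns a SQLContext, and the session's
> // conf map is applied straight onto it, mirroring what
> // SparkSQLSessionManager does today. No Hive classes involved.
> class ServiceSessionImpl(sessionConf: java.util.Map[String, String],
>                          sqlContext: SQLContext) {
>
>   // Apply every session conf entry directly to the session-scoped sqlConf.
>   def configureSession(): Unit = {
>     sessionConf.asScala.foreach { case (key, value) =>
>       sqlContext.setConf(key, value)
>     }
>   }
>
>   // Operations fetch the context from their parent session instead of
>   // having it passed through an operation manager.
>   def getSqlContext(): SQLContext = sqlContext
> }
> {code}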


