Posted to user@spark.apache.org by Ajay Chander <it...@gmail.com> on 2016/10/03 13:56:29 UTC
Spark_Jdbc_Hive
Hi Everyone,
First of all, let me explain what I am trying to do; I apologize for writing a lengthy mail.
1) Programmatically connect to a remote secured (Kerberized) Hadoop cluster (CDH 5.7) from my local machine.
- Once connected, I want to read the data from a remote Hive table into a Spark DataFrame.
- Once the data is loaded into my local DataFrame, I would like to apply some transformations and run some tests.
I know that we can use spark-shell from the edge node to do these things, but I am trying to find a way to do it from my IDE.
My Local Environment (Windows 7):
I am using the IntelliJ IDE, Maven as the build tool, and Java.
Things that I have got working:
- Since the cluster is secured with Kerberos, I had to use a keytab file to authenticate, like below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Point the JVM at the cluster's Kerberos config, then log in from the keytab
System.setProperty("java.security.krb5.conf", "C:\\Users\\Ajay\\Documents\\Kerberos\\krb5.conf");
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "Kerberos");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("ajay@INTERNAL.DOMAIN.COM",
        "C:\\Users\\Ajay\\Documents\\Kerberos\\rc4\\rc4.keytab");
- Now that I have authenticated myself to the cluster using the keytab, I first tried a pure JDBC call (not the Spark API) to see whether I could read the data. Yes, I was able to read the data successfully this way, like below:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

String driverName = "org.apache.hive.jdbc.HiveDriver";
Class.forName(driverName);
// Kerberized HiveServer2: pass the service principal and SASL QOP on the URL
String url = "jdbc:hive2://mydevcluster.domain.com:10000/test;principal=hive/_HOST@INTERNAL.DOMAIN.COM;saslQop=auth-conf";
Connection con = DriverManager.getConnection(url);
String query = "select * from test.test_data limit 10";
Statement stmt = con.createStatement();
System.out.println("Executing Query...");
ResultSet rs = stmt.executeQuery(query);
while (rs.next()) {
    String emp_name = rs.getString("emp_name");
    System.out.println("Employee Name: " + emp_name);
}
Here is the Hive JDBC driver in my pom.xml:
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>1.1.0</version>
</dependency>
- Now that I have made sure the JDBC connection to the secured cluster works fine, the next step is to use the Spark API to read the Hive table into a DataFrame. I use Spark 1.6. I tried the below:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.DataFrame;

// Trying to use a JDBC connection to Hive through Spark 1.6 and hive-jdbc 1.1.0
// (hc is the HiveContext created earlier, not shown here)
String JDBC_DB_URL = "jdbc:hive2://mydevcluster.domain.com:10000/test;principal=hive/_HOST@INTERNAL.DOMAIN.COM;saslQop=auth-conf";
Map<String, String> options = new HashMap<String, String>();
options.put("driver", "org.apache.hive.jdbc.HiveDriver");
options.put("url", JDBC_DB_URL);
options.put("dbtable", "test.test_data");
DataFrame jdbcDF = hc.read().format("jdbc").options(options).load();
jdbcDF.printSchema();
Now I came across the below error,
Exception in thread "main" java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveResultSetMetaData.isSigned(HiveResultSetMetaData.java:141)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at Dev_Cluster_Test.main(Dev_Cluster_Test.java:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Then I looked at the Spark code base below,
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L136
which refers to the hive-jdbc code base below.
https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveResultSetMetaData.java#L143
Spark's JDBCRDD.resolveTable asks the driver's ResultSetMetaData whether each column is signed, but HiveResultSetMetaData.isSigned just throws SQLException("Method not supported"). Thus the error.
Then I looked at the Spark 2.0.0 code below,
https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L139
which makes the same isSigned call, so it results in the same "Method not supported" error.
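To make the failure concrete, here is a minimal sketch in plain JDBC of what Spark does during schema resolution (my own Java paraphrase of the Scala code at the links above, reusing the Connection con from my JDBC test; not the actual Spark source):

import java.sql.ResultSetMetaData;

// Paraphrase of JDBCRDD.resolveTable: run a zero-row query against the table,
// then interrogate the driver's ResultSetMetaData for every column.
ResultSet schemaRs = con.createStatement()
        .executeQuery("SELECT * FROM test.test_data WHERE 1=0");
ResultSetMetaData md = schemaRs.getMetaData();
for (int i = 1; i <= md.getColumnCount(); i++) {
    String name = md.getColumnLabel(i);
    int type = md.getColumnType(i);
    boolean signed = md.isSigned(i); // HiveResultSetMetaData throws
                                     // SQLException("Method not supported") here
    System.out.println(name + " (type " + type + ", signed=" + signed + ")");
}

So jdbcDF never gets as far as printSchema(): schema resolution itself fails at the very first isSigned call.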
Can anyone please shed some light on this and tell me if I am missing anything here? I appreciate your time. Thank you.
Regards,
Ajay