Posted to user@spark.apache.org by Ajay Chander <it...@gmail.com> on 2016/10/03 13:56:29 UTC
Spark_Jdbc_Hive
Hi Everyone,
First of all, let me explain what I am trying to do; I apologize for writing a lengthy mail.
1) Programmatically connect to a remote secured (Kerberized) Hadoop cluster (CDH 5.7) from my local machine.
- Once connected, I want to read the data from a remote Hive table into a Spark DataFrame.
- Once the data is loaded into my local DataFrame, I would like to apply some transformations and run some tests.
I know that we can use spark-shell from the edge node to do these things, but I am trying to find a way to do it from my IDE.
My Local Environment (Windows 7):
I am using the IntelliJ IDE, Maven as the build tool, and Java.
Things that I have got working:
- Since the cluster is secured with Kerberos, I had to use a keytab file to authenticate, like below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Point the JVM at the cluster's Kerberos config, then log in from the keytab
System.setProperty("java.security.krb5.conf", "C:\\Users\\Ajay\\Documents\\Kerberos\\krb5.conf");
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "Kerberos");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("ajay@INTERNAL.DOMAIN.COM",
        "C:\\Users\\Ajay\\Documents\\Kerberos\\rc4\\rc4.keytab");
- Now that I have authenticated myself to the cluster using the keytab, I first tried a pure JDBC call (not the Spark API) to see whether I could read the data. Yes, I was able to read the data successfully this way, like below:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

String driverName = "org.apache.hive.jdbc.HiveDriver";
Class.forName(driverName);
// Kerberized HiveServer2: pass the service principal and SASL QOP on the URL
String url = "jdbc:hive2://mydevcluster.domain.com:10000/test;principal=hive/_HOST@INTERNAL.DOMAIN.COM;saslQop=auth-conf";
Connection con = DriverManager.getConnection(url);
String query = "select * from test.test_data limit 10";
Statement stmt = con.createStatement();
System.out.println("Executing Query...");
ResultSet rs = stmt.executeQuery(query);
while (rs.next()) {
    String emp_name = rs.getString("emp_name");
    System.out.println("Employee Name: " + emp_name);
}
Here is the Hive JDBC driver in my pom.xml:
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>1.1.0</version>
</dependency>
- Now that I have made sure the JDBC connection to the secured cluster works fine, the next step is to use the Spark API to read the Hive table into a DataFrame. I use Spark 1.6. I tried the below:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.DataFrame;

// Trying to use a JDBC connection to Hive through Spark 1.6 and hive-jdbc 1.1.0
// (hc is the HiveContext created earlier, not shown here)
String JDBC_DB_URL = "jdbc:hive2://mydevcluster.domain.com:10000/test;principal=hive/_HOST@INTERNAL.DOMAIN.COM;saslQop=auth-conf";
Map<String, String> options = new HashMap<String, String>();
options.put("driver", "org.apache.hive.jdbc.HiveDriver");
options.put("url", JDBC_DB_URL);
options.put("dbtable", "test.test_data");
DataFrame jdbcDF = hc.read().format("jdbc").options(options).load();
jdbcDF.printSchema();
Now I came across the below error,
Exception in thread "main" java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveResultSetMetaData.isSigned(HiveResultSetMetaData.java:141)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at Dev_Cluster_Test.main(Dev_Cluster_Test.java:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Then I looked at the Spark code base below,
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L136
which refers to the hive-jdbc code base below.
https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveResultSetMetaData.java#L143
Spark's JDBCRDD.resolveTable asks the driver's ResultSetMetaData whether each column is signed, but HiveResultSetMetaData.isSigned just throws SQLException("Method not supported"). Thus the error.
Then I looked at the Spark 2.0.0 code below,
https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L139
which makes the same isSigned call, so it results in the same "Method not supported" error.
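To make the failure concrete, here is a minimal sketch in plain JDBC of what Spark does during schema resolution (my own Java paraphrase of the Scala code at the links above, reusing the Connection con from my JDBC test; not the actual Spark source):

import java.sql.ResultSetMetaData;

// Paraphrase of JDBCRDD.resolveTable: run a zero-row query against the table,
// then interrogate the driver's ResultSetMetaData for every column.
ResultSet schemaRs = con.createStatement()
        .executeQuery("SELECT * FROM test.test_data WHERE 1=0");
ResultSetMetaData md = schemaRs.getMetaData();
for (int i = 1; i <= md.getColumnCount(); i++) {
    String name = md.getColumnLabel(i);
    int type = md.getColumnType(i);
    boolean signed = md.isSigned(i); // HiveResultSetMetaData throws
                                     // SQLException("Method not supported") here
    System.out.println(name + " (type " + type + ", signed=" + signed + ")");
}

So jdbcDF never gets as far as printSchema(): schema resolution itself fails at the very first isSigned call.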
Can anyone please shed some light on this and tell me if I am missing anything here? I appreciate your time. Thank you.
Regards,
Ajay