Posted to user@spark.apache.org by jererc <je...@gmail.com> on 2014/11/19 17:40:03 UTC

tableau spark sql cassandra

Hello!

I'm working on a POC where I'm trying to get data from Cassandra into
Tableau using Spark SQL.

Here is the stack:
- cassandra (v2.1)
- spark SQL (pre-built v1.1, hadoop v2.4)
- cassandra / spark sql connector
(https://github.com/datastax/spark-cassandra-connector)
- hive
- mysql
- hive / mysql connector
- hive / cassandra handler
(https://github.com/tuplejump/cash/tree/master/cassandra-handler)
- tableau
- tableau / spark sql connector

I get an exception in spark-sql (bin/spark-sql) when trying to query the
cassandra table (java.lang.InstantiationError:
org.apache.hadoop.mapreduce.JobContext); it looks like a missing Hadoop
dependency. Showing tables or describing them works fine.

Do you know how to solve this without Hadoop?
Is Hive a dependency of Spark SQL?

Best,
Jerome






RE: tableau spark sql cassandra

Posted by Mohammed Guller <mo...@glassbeam.com>.
Thanks, Jerome.

BTW, have you tried the CalliopeServer2 from tuplejump? I was able to quickly connect from beeline/Squirrel to my Cassandra cluster using CalliopeServer2, which extends the Spark SQL Thrift Server. It was very straightforward.
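
For anyone trying the same thing, the beeline side is essentially a
one-liner (a sketch: the host, the default port 10000, and the table
name are placeholders, not my actual setup):

beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> show tables;
0: jdbc:hive2://localhost:10000> select * from myhivetable limit 10;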

Next step is to connect from Tableau, but I can't find Tableau's Spark connector. Where did you download it from?

Mohammed

-----Original Message-----
From: jererc [mailto:jererc@gmail.com] 
Sent: Friday, November 21, 2014 5:27 AM
To: user@spark.incubator.apache.org
Subject: RE: tableau spark sql cassandra

Hi!

Sure, I'll post the info I grabbed once the Cassandra tables' values appear in Tableau.

Best,
Jerome





RE: tableau spark sql cassandra

Posted by jererc <je...@gmail.com>.
Hi!

Sure, I'll post the info I grabbed once the Cassandra tables' values appear
in Tableau.

Best,
Jerome





RE: tableau spark sql cassandra

Posted by Mohammed Guller <mo...@glassbeam.com>.
Hi Jerome,
This is cool. It would be great if you could share more details about how you finally got your setup to work. For example, what additional libraries/jars are you using? How are you configuring the Thrift Server to use the additional jars to communicate with Cassandra?

In addition, how are you mapping Hive tables to Cassandra CFs in beeline? It would be great if you could share an example beeline session right from the beginning.

Thanks.
Mohammed

From: Ashic Mahtab [mailto:ashic@live.com]
Sent: Thursday, November 20, 2014 10:15 AM
To: jererc; user@spark.incubator.apache.org
Subject: RE: tableau spark sql cassandra

Hi Jerome,
I've been trying to get this working as well...

Where are you specifying cassandra parameters (e.g. seed nodes, consistency levels, etc.)?

-Ashic.
> Date: Thu, 20 Nov 2014 10:34:58 -0700
> From: jererc@gmail.com
> To: user@spark.incubator.apache.org
> Subject: Re: tableau spark sql cassandra
>
> Well, after many attempts I can now successfully run the thrift server using
> root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master
> spark://10.194.30.2:7077 --hiveconf hive.server2.thrift.bind.host 0.0.0.0
> --hiveconf hive.server2.thrift.port 10000
>
> (the command was failing because of the --driver-class-path $CLASSPATH
> parameter which I guess was setting the spark.driver.extraClassPath) and I
> can get the cassandra data using beeline!
>
> However, the table's values are null in Tableau but this is another problem
> ;)
>
> Best,
> Jerome
>
>
>

RE: tableau spark sql cassandra

Posted by Ashic Mahtab <as...@live.com>.
Hi Jerome,
I've been trying to get this working as well...

Where are you specifying cassandra parameters (e.g. seed nodes, consistency levels, etc.)?

-Ashic.

> Date: Thu, 20 Nov 2014 10:34:58 -0700
> From: jererc@gmail.com
> To: user@spark.incubator.apache.org
> Subject: Re: tableau spark sql cassandra
> 
> Well, after many attempts I can now successfully run the thrift server using
> root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master
> spark://10.194.30.2:7077 --hiveconf hive.server2.thrift.bind.host 0.0.0.0
> --hiveconf hive.server2.thrift.port 10000
> 
> (the command was failing because of the --driver-class-path $CLASSPATH
> parameter which I guess was setting the spark.driver.extraClassPath) and I
> can get the cassandra data using beeline!
> 
> However, the table's values are null in Tableau but this is another problem
> ;)
> 
> Best,
> Jerome
> 
> 
> 

Re: tableau spark sql cassandra

Posted by jererc <je...@gmail.com>.
Well, after many attempts I can now successfully run the thrift server using
root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master
spark://10.194.30.2:7077 --hiveconf hive.server2.thrift.bind.host 0.0.0.0
--hiveconf hive.server2.thrift.port 10000

(the command was failing because of the --driver-class-path $CLASSPATH
parameter which I guess was setting the spark.driver.extraClassPath) and I
can get the cassandra data using beeline!
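
For the record, an alternative to SPARK_CLASSPATH that I have not fully
verified would be passing the extra jars explicitly with --jars
(start-thriftserver.sh forwards its options to spark-submit; the jar
paths here are the ones from my classpath listing elsewhere in this
thread):

./sbin/start-thriftserver.sh --master spark://10.194.30.2:7077 \
  --hiveconf hive.server2.thrift.bind.host 0.0.0.0 \
  --hiveconf hive.server2.thrift.port 10000 \
  --jars /home/jererc/spark/lib/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar,\
/home/jererc/spark/lib/hive-cassandra-1.2.9.jar,\
/home/jererc/spark/lib/cassandra-all-1.2.9.jar,\
/home/jererc/spark/lib/cassandra-thrift-1.2.9.jar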

However, the table's values are null in Tableau but this is another problem
;)

Best,
Jerome





Re: tableau spark sql cassandra

Posted by jererc <je...@gmail.com>.
I finally solved this problem.
org.apache.hadoop.mapreduce.JobContext is a class in Hadoop < 2.0 but an
interface in Hadoop >= 2.0, so code compiled against one fails with an
InstantiationError when run against the other; the Hive/Cassandra handler
appears to have been built against the Hadoop 1 API, so I have to use a
Spark build for Hadoop v1.
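
A quick way to tell which Hadoop line a prebuilt Spark is bound to is the
assembly jar name, which encodes the Hadoop version (here, after switching
to the Hadoop 1 build):

ls $SPARK_HOME/lib/spark-assembly-*.jar
# /home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar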

So spark-sql seems fine.
But the thrift server does not work with my config!

Here is my spark-env.sh:

#!/usr/bin/env bash
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export SPARK_HOME=/home/jererc/spark
export SPARK_MASTER_IP=10.194.30.2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=4
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=4g
export MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}
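# put every jar in $SPARK_HOME/lib on the classpath, colon-separated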
export CLASSPATH=$(echo ${SPARK_HOME}/lib/*.jar | sed 's/ /:/g'):$CLASSPATH
export SPARK_CLASSPATH=$CLASSPATH

Here is the output:

root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master
spark://10.194.30.2:7077 --driver-class-path $CLASSPATH --hiveconf
hive.server2.thrift.bind.host 0.0.0.0 --hiveconf hive.server2.thrift.port
10000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/home/jererc/spark/lib/spark-examples-1.1.0-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/11/20 14:55:35 INFO thriftserver.HiveThriftServer2: Starting SparkContext
14/11/20 14:55:35 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to
'/home/jererc/spark/lib/cassandra-all-1.2.9.jar:/home/jererc/spark/lib/cassandra-thrift-1.2.9.jar:/home/jererc/spark/lib/datanucleus-api-jdo-3.2.1.jar:/home/jererc/spark/lib/datanucleus-core-3.2.2.jar:/home/jererc/spark/lib/datanucleus-rdbms-3.2.1.jar:/home/jererc/spark/lib/hadoop-core-0.20.205.0.jar:/home/jererc/spark/lib/hive-cassandra-1.2.9.jar:/home/jererc/spark/lib/mysql-connector-java.jar:/home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/jererc/spark/lib/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar:/home/jererc/spark/lib/spark-examples-1.1.0-hadoop1.0.4.jar:').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath

14/11/20 14:55:35 WARN spark.SparkConf: Setting
'spark.executor.extraClassPath' to
'/home/jererc/spark/lib/cassandra-all-1.2.9.jar:/home/jererc/spark/lib/cassandra-thrift-1.2.9.jar:/home/jererc/spark/lib/datanucleus-api-jdo-3.2.1.jar:/home/jererc/spark/lib/datanucleus-core-3.2.2.jar:/home/jererc/spark/lib/datanucleus-rdbms-3.2.1.jar:/home/jererc/spark/lib/hadoop-core-0.20.205.0.jar:/home/jererc/spark/lib/hive-cassandra-1.2.9.jar:/home/jererc/spark/lib/mysql-connector-java.jar:/home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/jererc/spark/lib/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar:/home/jererc/spark/lib/spark-examples-1.1.0-hadoop1.0.4.jar:'
as a work-around.
Exception in thread "main" org.apache.spark.SparkException: Found both
spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
	at
org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:300)
	at
org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:298)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at
org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:298)
	at
org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:286)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:286)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:158)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:36)
	at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:57)
	at
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


And if I don't use SPARK_CLASSPATH then spark-sql does not work.
I tried ADD_JARS without much success.

What's the best way to set the CLASSPATH and the jars?






Re: tableau spark sql cassandra

Posted by jererc <je...@gmail.com>.
Hi!

The hive table is an external table, which I created like this:

CREATE EXTERNAL TABLE MyHiveTable
        ( id int, data string )
        STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
        TBLPROPERTIES ( "cassandra.host" = "10.194.30.2",
                        "cassandra.ks.name" = "test",
                        "cassandra.cf.name" = "mytable",
                        "cassandra.ks.repfactor" = "1",
                        "cassandra.ks.strategy" = "org.apache.cassandra.locator.SimpleStrategy" );


Here is the output from spark-sql for different commands:

spark-sql> show tables;
14/11/20 09:50:32 INFO parse.ParseDriver: Parsing command: show tables
14/11/20 09:50:32 INFO parse.ParseDriver: Parse Completed
14/11/20 09:50:32 INFO Configuration.deprecation: mapred.input.dir.recursive
is deprecated. Instead, use
mapreduce.input.fileinputformat.input.dir.recursive
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=Driver.run>
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=TimeToSubmit>
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=compile>
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=parse>
14/11/20 09:50:32 INFO parse.ParseDriver: Parsing command: show tables
14/11/20 09:50:32 INFO parse.ParseDriver: Parse Completed
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=parse start=1416473432290
end=1416473432290 duration=0>
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=semanticAnalyze>
14/11/20 09:50:32 INFO ql.Driver: Semantic Analysis Completed
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=semanticAnalyze
start=1416473432290 end=1416473432295 duration=5>
14/11/20 09:50:32 INFO exec.ListSinkOperator: Initializing Self 0 OP
14/11/20 09:50:32 INFO exec.ListSinkOperator: Operator 0 OP initialized
14/11/20 09:50:32 INFO exec.ListSinkOperator: Initialization Done 0 OP
14/11/20 09:50:32 INFO ql.Driver: Returning Hive schema:
Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from
deserializer)], properties:null)
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=compile
start=1416473432289 end=1416473432298 duration=9>
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=Driver.execute>
14/11/20 09:50:32 INFO ql.Driver: Starting command: show tables
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=TimeToSubmit
start=1416473432289 end=1416473432298 duration=9>
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=runTasks>
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=task.DDL.Stage-0>
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=task.DDL.Stage-0
start=1416473432298 end=1416473432314 duration=16>
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=runTasks
start=1416473432298 end=1416473432314 duration=16>
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=Driver.execute
start=1416473432298 end=1416473432314 duration=16>
OK
14/11/20 09:50:32 INFO ql.Driver: OK
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=releaseLocks>
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=releaseLocks
start=1416473432314 end=1416473432315 duration=1>
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=Driver.run
start=1416473432289 end=1416473432315 duration=26>
14/11/20 09:50:32 INFO mapred.FileInputFormat: Total input paths to process
: 1
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=releaseLocks>
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=releaseLocks
start=1416473432319 end=1416473432319 duration=0>
myhivetable
Time taken: 0.088 seconds
14/11/20 09:50:32 INFO CliDriver: Time taken: 0.088 seconds
14/11/20 09:50:32 INFO ql.Driver: <PERFLOG method=releaseLocks>
14/11/20 09:50:32 INFO ql.Driver: </PERFLOG method=releaseLocks
start=1416473432325 end=1416473432325 duration=0>
spark-sql> describe myhivetable;
14/11/20 09:50:35 INFO parse.ParseDriver: Parsing command: describe
myhivetable
14/11/20 09:50:35 INFO parse.ParseDriver: Parse Completed
id                  	int                 	from deserializer
data                	string              	from deserializer
Time taken: 0.226 seconds
14/11/20 09:50:35 INFO CliDriver: Time taken: 0.226 seconds
spark-sql> select * from myhivetable;
14/11/20 09:50:39 INFO parse.ParseDriver: Parsing command: select * from
myhivetable
14/11/20 09:50:39 INFO parse.ParseDriver: Parse Completed
14/11/20 09:50:39 INFO Configuration.deprecation: mapred.map.tasks is
deprecated. Instead, use mapreduce.job.maps
14/11/20 09:50:39 INFO storage.MemoryStore: ensureFreeSpace(420085) called
with curMem=0, maxMem=278302556
14/11/20 09:50:39 INFO storage.MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 410.2 KB, free 265.0 MB)
14/11/20 09:50:39 INFO storage.MemoryStore: ensureFreeSpace(30564) called
with curMem=420085, maxMem=278302556
14/11/20 09:50:39 INFO storage.MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 29.8 KB, free 265.0 MB)
14/11/20 09:50:39 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in
memory on 10.194.30.2:57707 (size: 29.8 KB, free: 265.4 MB)
14/11/20 09:50:39 INFO storage.BlockManagerMaster: Updated info of block
broadcast_0_piece0
14/11/20 09:50:39 ERROR thriftserver.SparkSQLDriver: Failed in [select *
from myhivetable]
java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
	at
org.apache.hadoop.hive.cassandra.input.cql.HiveCqlInputFormat.getSplits(HiveCqlInputFormat.java:166)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:774)
	at
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
	at
org.apache.hadoop.hive.cassandra.input.cql.HiveCqlInputFormat.getSplits(HiveCqlInputFormat.java:166)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:774)
	at
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

14/11/20 09:50:39 ERROR CliDriver: java.lang.InstantiationError:
org.apache.hadoop.mapreduce.JobContext
	at
org.apache.hadoop.hive.cassandra.input.cql.HiveCqlInputFormat.getSplits(HiveCqlInputFormat.java:166)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:774)
	at
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
	at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






Re: tableau spark sql cassandra

Posted by Michael Armbrust <mi...@databricks.com>.
The whole stacktrace/exception would be helpful.  Hive is an optional
dependency of Spark SQL, but you will need to include it if you are
planning to use the Thrift server to connect to Tableau.  You can enable it
by adding -Phive when you build Spark.
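
For example, with the Maven build for Spark 1.1 (a sketch based on the
build docs; pick the Hadoop profile and version that match your cluster,
as the default profile targets Hadoop 1):

mvn -Phive -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package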

You might also try asking on the cassandra mailing list as there could be
something wrong with your configuration there.

On Wed, Nov 19, 2014 at 8:40 AM, jererc <je...@gmail.com> wrote:

> Hello!
>
> I'm working on a POC where I'm trying to get data from
> Cassandra into Tableau using Spark SQL.
>
> Here is the stack:
> - cassandra (v2.1)
> - spark SQL (pre-built v1.1, hadoop v2.4)
> - cassandra / spark sql connector
> (https://github.com/datastax/spark-cassandra-connector)
> - hive
> - mysql
> - hive / mysql connector
> - hive / cassandra handler
> (https://github.com/tuplejump/cash/tree/master/cassandra-handler)
> - tableau
> - tableau / spark sql connector
>
> I get an exception in spark-sql (bin/spark-sql) when trying to query the
> cassandra table (java.lang.InstantiationError:
> org.apache.hadoop.mapreduce.JobContext); it looks like a missing Hadoop
> dependency. Showing tables or describing them works fine.
>
> Do you know how to solve this without Hadoop?
> Is Hive a dependency of Spark SQL?
>
> Best,
> Jerome
>
>
>
>