Posted to user@spark.apache.org by Philip Limbeck <ph...@gmail.com> on 2014/03/05 15:56:41 UTC

Problem with HBase external table on freshly created EMR cluster

Hi!

I created an EMR cluster with Spark and HBase according to
http://aws.amazon.com/articles/4926593393724923, passing the --hbase flag to
include HBase. Although Spark and Shark both work nicely with the provided
S3 examples, there is a problem with external tables pointing to the HBase
instance.

We create the following external table with Shark:

CREATE EXTERNAL TABLE oh (id STRING, name STRING, title STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.zookeeper.quorum" = "172.31.13.161",
  "hbase.zookeeper.property.clientPort" = "2181",
  "hbase.columns.mapping" = ":key,o:OH_Name,o:OH_Title")
TBLPROPERTIES ("hbase.table.name" = "objects")

The objects table exists and has all columns as defined in the DDL.
The Zookeeper for HBase is running on the specified hostname and port.

CREATE TABLE oh_cached AS SELECT * FROM oh fails with the following error:

org.apache.spark.SparkException: Job aborted: Task 11.0:0 failed more than 4 times
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)

The log files of the Spark workers are almost empty; however, the stage
information in the Spark web console reveals additional hints:

0 4 FAILED NODE_LOCAL ip-172-31-10-246.ec2.internal 2014/03/05 13:38:20
java.lang.IllegalStateException (java.lang.IllegalStateException: unread block data)

java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2420)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1380)
java.io.ObjectInputStream.skipCustomData(ObjectInputStream.java:1954)
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1848)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:199)
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)
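
To narrow this down, the same table can be read from a plain Spark job,
bypassing Shark entirely. Here is a minimal, untested sketch, assuming the
HBase client jars (including TableInputFormat from HBase's mapreduce
package) are on the classpath of both the driver and the workers:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.SparkContext

object HBaseReadCheck {
  def main(args: Array[String]) {
    // Use the cluster master URL instead of "local" to exercise the workers.
    val sc = new SparkContext("local", "hbase-read-check")

    // Same connection settings as the SERDEPROPERTIES in the DDL above.
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "172.31.13.161")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set(TableInputFormat.INPUT_TABLE, "objects")

    // Read the table as (row key, result) pairs and force a full scan.
    val rows = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println("row count: " + rows.count())
    sc.stop()
  }
}

If this fails with the same "unread block data" error, the HBase/Hadoop
jars shipped to the executors are the likely culprit rather than the DDL.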

Re: Problem with HBase external table on freshly created EMR cluster

Posted by Kanwaldeep <ka...@gmail.com>.
Seems like this could be a version mismatch issue between the HBase version
deployed and the jars being used. 

Here are the details of the versions we have set up.

We are running CDH 4.6.0 (which includes Hadoop 2.0.0), and Spark was
compiled against that version. Below is the environment variable we set
before compiling:
SPARK_HADOOP_VERSION=2.0.0+1554-cdh4.6.0

And the code being deployed uses the following Maven dependency:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>0.9.0-incubating</version>
</dependency>
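
If the mismatch is on the executor side, one way to rule it out is to ship
the exact client jars with the job, so the executors deserialize tasks with
the same classes the driver serialized them with. A rough sketch against
0.9.0-incubating; the master URL and jar paths are placeholders, not our
actual setup:

import org.apache.spark.{SparkConf, SparkContext}

object JarAlignedContext {
  def main(args: Array[String]) {
    // Ship the exact jars the job was built with, so the executors load
    // the same HBase/Hadoop client classes the driver serialized against.
    val conf = new SparkConf()
      .setMaster("spark://<master-host>:7077")  // placeholder master URL
      .setAppName("hbase-write-check")
      .setJars(Seq(
        "target/my-job-assembly.jar",           // hypothetical fat jar
        "/opt/hbase/hbase.jar"))                // must match the server version
    val sc = new SparkContext(conf)
    // ... job body ...
    sc.stop()
  }
}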

Thanks for your help.
Kanwal




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-HBase-external-table-on-freshly-created-EMR-cluster-tp2307p3004.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Problem with HBase external table on freshly created EMR cluster

Posted by Kanwaldeep <ka...@gmail.com>.
I'm getting the same error when writing data to an HBase cluster using
Spark Streaming.

Any suggestions on how to fix this?

2014-03-14 23:10:33,832 ERROR o.a.s.s.scheduler.JobScheduler -
Error running job streaming job 1394863830000 ms.0
org.apache.spark.SparkException: Job aborted: Task 9.0:0 failed 4 times
(most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028) ~[spark-core_2.10-0.9.0-incubating.jar:0.9.0-incubating]
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026) ~[spark-core_2.10-0.9.0-incubating.jar:0.9.0-incubating]
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library-2.10.2.jar:na]
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ~[scala-library-2.10.2.jar:na]
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026) ~[spark-core_2.10-0.9.0-incubating.jar:0.9.0-incubating]
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) ~[spark-core_2.10-0.9.0-incubating.jar:0.9.0-incubating]
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) ~[spark-core_2.10-0.9.0-incubating.jar:0.9.0-incubating]
	at scala.Option.foreach(Option.scala:236) ~[scala-library-2.10.2.jar:na]
	at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619) ~[spark-core_2.10-0.9.0-incubating.jar:0.9.0-incubating]
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207) ~[spark-core_2.10-0.9.0-incubating.jar:0.9.0-incubating]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) [akka-actor_2.10-2.2.3.jar:2.2.3]
	at akka.actor.ActorCell.invoke(ActorCell.scala:456) [akka-actor_2.10-2.2.3.jar:2.2.3]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) [akka-actor_2.10-2.2.3.jar:2.2.3]
	at akka.dispatch.Mailbox.run(Mailbox.scala:219) [akka-actor_2.10-2.2.3.jar:2.2.3]
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) [akka-actor_2.10-2.2.3.jar:2.2.3]
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library-2.10.2.jar:na]
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library-2.10.2.jar:na]
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.10.2.jar:na]
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.10.2.jar:na]
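
The write itself follows the usual foreachRDD/foreachPartition shape,
roughly like the sketch below against the HBase 0.94-era client API (table
and column names and the String-pair stream are placeholders, not our
actual schema). The table is opened per partition on the worker, so no
HBase client objects end up in the serialized task closure:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

object HBaseSink {
  // Write each batch to HBase, creating the table handle per partition on
  // the worker rather than capturing it in the closure from the driver.
  def save(events: DStream[(String, String)]) {
    events.foreachRDD { rdd =>
      rdd.foreachPartition { part =>
        val conf = HBaseConfiguration.create() // reads hbase-site.xml on the worker
        val table = new HTable(conf, "events") // placeholder table name
        part.foreach { case (rowKey, value) =>
          val put = new Put(Bytes.toBytes(rowKey))
          put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(value))
          table.put(put)
        }
        table.close()
      }
    }
  }
}

If even a skeleton like this hits "unread block data", that again points at
mismatched Spark/Hadoop/HBase jars between driver and executors rather than
at the write logic.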



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-HBase-external-table-on-freshly-created-EMR-cluster-tp2307p2710.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.