Posted to user@spark.apache.org by Gaurav Jain <ja...@student.ethz.ch> on 2014/06/10 14:06:40 UTC

Calling JavaPairRDD.first after calling JavaPairRDD.groupByKey results in NullPointerException

I am getting a strange null pointer exception when trying to list the first
entry of a JavaPairRDD after calling groupByKey on it. Following is my code:


JavaPairRDD<Tuple3<String, String, String>, List<String>> KeyToAppList =
        KeyToApp.distinct().groupByKey();
// System.out.println("First member of the key-val list: " + KeyToAppList.first());
// Above call to .first() causes a null pointer exception

JavaRDD<Integer> KeyToAppCount = KeyToAppList.map(
        new Function<Tuple2<Tuple3<String, String, String>, List<String>>, Integer>() {
            @Override
            public Integer call(Tuple2<Tuple3<String, String, String>,
                    List<String>> tupleOfTupAndList) throws Exception {
                List<String> apps = tupleOfTupAndList._2;
                Set<String> uniqueApps = new HashSet<String>(apps);
                return uniqueApps.size();
            }
        });

System.out.println("First member of the key-val list: " + KeyToAppCount.first());
// Above call to .first() prints the first element all right.
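For reference, the per-key logic inside the map function (deduplicating the grouped list of app names with a HashSet and returning the count) can be checked in isolation with plain Java, no Spark required. This is just a sketch of that one step; `countUniqueApps` is a hypothetical helper name, not part of the original code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueAppCount {
    // Mirrors the body of the Function passed to map(): build a HashSet
    // from the grouped values and return the number of distinct entries.
    static int countUniqueApps(List<String> apps) {
        Set<String> uniqueApps = new HashSet<String>(apps);
        return uniqueApps.size();
    }

    public static void main(String[] args) {
        List<String> apps = Arrays.asList("mail", "maps", "mail", "chat");
        System.out.println(countUniqueApps(apps)); // prints 3
    }
}
```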


The call to KeyToAppList.first() immediately after groupByKey throws a
NullPointerException. However, if I comment out that call and instead proceed
to apply the map function, the subsequent call to KeyToAppCount.first() raises
no exception. Why does the null pointer exception occur immediately after
groupByKey?

The null pointer exception is as follows:
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Exception while deserializing and fetching task: java.lang.NullPointerException
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Calling-JavaPairRDD-first-after-calling-JavaPairRDD-groupByKey-results-in-NullPointerException-tp7318.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.