Posted to user@spark.apache.org by Cosmin Radoi <co...@gmail.com> on 2014/03/03 02:37:13 UTC

flatten RDD[RDD[T]]

I'm trying to flatten an RDD of RDDs. The straightforward approach:

a: RDD[RDD[Int]]
a flatMap { _.collect }

throws a java.lang.NullPointerException at org.apache.spark.rdd.RDD.collect(RDD.scala:602)

In a more complex scenario I also got:
Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)

So I guess this may be related to the context not being available inside the map.

Are nested RDDs not supported?

Thanks,

Cosmin Radoi 


Re: flatten RDD[RDD[T]]

Posted by Josh Rosen <ro...@gmail.com>.
Nope, nested RDDs aren't supported:

https://groups.google.com/d/msg/spark-users/_Efj40upvx4/DbHCixW7W7kJ
https://groups.google.com/d/msg/spark-users/KC1UJEmUeg8/N_qkTJ3nnxMJ
https://groups.google.com/d/msg/spark-users/rkVPXAiCiBk/CORV5jyeZpAJ
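
For anyone hitting the same errors, here is a minimal sketch of two common workarounds. It is not taken from the linked threads. It assumes the inner RDDs can be held on the driver as an ordinary Scala Seq, or that the inner data can be a local collection rather than an RDD. The names FlattenSketch and flattenRdds and the local[2] master are made up for illustration. SparkContext.union and RDD.flatMap are the only Spark calls it relies on.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object FlattenSketch {
  // Hypothetical helper: the RDDs live in a driver-side Seq, not inside
  // another RDD, so no task ever has to touch the SparkContext.
  def flattenRdds[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]]): RDD[T] =
    sc.union(rdds)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, only so the sketch runs standalone.
    val conf = new SparkConf().setAppName("flatten-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Option 1: keep a Seq[RDD[Int]] on the driver and union it.
      val parts: Seq[RDD[Int]] = Seq(sc.parallelize(1 to 3), sc.parallelize(4 to 6))
      val flat: RDD[Int] = flattenRdds(sc, parts)
      println(flat.collect().sorted.mkString(", "))

      // Option 2: nest local collections, not RDDs, and flatten with flatMap.
      val nested: RDD[Seq[Int]] = sc.parallelize(Seq(Seq(1, 2), Seq(3)))
      val flat2: RDD[Int] = nested.flatMap(xs => xs)
      println(flat2.count())
    } finally {
      sc.stop()
    }
  }
}
```

Both options avoid calling an RDD method such as collect from inside a closure that runs on the executors, which is the pattern that fails in the original snippet.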

