Posted to user@spark.apache.org by Lukas Nalezenec <lu...@firma.seznam.cz> on 2014/04/17 16:39:23 UTC
strange StreamCorruptedException
Hi all,
I am running an algorithm similar to wordcount, and I am not sure why it
fails at the end; there are only ~200 distinct words, so the result of the
computation should be small.
I have a SIMR command line with Spark 0.8.1 and 50 workers, each with
~512 MB of RAM.
The dataset is a 100 GB tab-separated text HadoopRDD with ~6000 partitions.
My command line is:
dataset.map(x => x.split("\t"))
  .map(x => (x(2), x(3).toInt))
  .reduceByKey(_ + _)
  .collect
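(For context, the reduceByKey step is just a per-key sum of column 3, keyed by column 2. A minimal local sketch of the same aggregation in plain Scala, without Spark and with hypothetical sample rows, would be:

```scala
// Hypothetical rows in the same tab-separated layout:
// column index 2 holds the word, column index 3 holds a count.
val rows = Seq(
  "a\tb\tfoo\t1",
  "a\tb\tbar\t2",
  "a\tb\tfoo\t3"
)

val counts = rows
  .map(_.split("\t"))
  .map(x => (x(2), x(3).toInt))
  .groupBy(_._1)                                  // local stand-in for reduceByKey
  .map { case (k, vs) => (k, vs.map(_._2).sum) }  // sum counts per key

// counts == Map("foo" -> 4, "bar" -> 2)
```
)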
It throws this exception:
java.io.StreamCorruptedException: invalid type code: AC
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:101)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:26)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:53)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:95)
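(For what it's worth, "invalid type code: AC" generally means the reader hit a second Java serialization stream header, 0xAC 0xED, where an object type code was expected, e.g. when two ObjectOutputStreams have written to the same underlying stream. A minimal local reproduction, unrelated to my Spark job itself:

```scala
import java.io._

// Two ObjectOutputStreams on the same byte stream each write their own
// 0xAC 0xED stream header; the second header corrupts the stream.
val bos = new ByteArrayOutputStream()
val o1 = new ObjectOutputStream(bos); o1.writeObject("first");  o1.flush()
val o2 = new ObjectOutputStream(bos); o2.writeObject("second"); o2.flush()

val in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
val first = in.readObject()   // deserializes "first" fine
val failure =
  try { in.readObject(); None }   // next byte is 0xAC, not a valid type code
  catch { case e: StreamCorruptedException => Some(e.getMessage) }

// failure == Some("invalid type code: AC")
```
)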
What am I doing wrong?
Thanks!
Best Regards
Lukas