Posted to user@spark.apache.org by Lukas Nalezenec <lu...@firma.seznam.cz> on 2014/04/17 16:39:23 UTC

strange StreamCorruptedException

Hi all,

I am running an algorithm similar to word count and I am not sure why it
fails at the end. There are only ~200 distinct words, so the result of the
computation should be small.

I am running it from a SIMR command line with Spark 0.8.1 and 50 workers,
each with ~512 MB RAM.
The dataset is a 100 GB tab-separated text HadoopRDD with ~6000 partitions.

My command line is:
dataset.map(x => x.split("\t"))
  .map(x => (x(2), x(3).toInt))
  .reduceByKey(_ + _)
  .collect
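In full, the same pipeline with comments (a minimal sketch: the dataset
path is a placeholder, and sc is the SparkContext that the SIMR shell
provides):

import org.apache.spark.SparkContext._  // needed in Spark 0.8.x for reduceByKey on pair RDDs

// sc: SparkContext provided by the SIMR shell
val dataset = sc.textFile("hdfs:///path/to/dataset")  // hypothetical path; ~100 GB, ~6000 partitions

val result = dataset
  .map(x => x.split("\t"))        // split each tab-separated line into fields
  .map(x => (x(2), x(3).toInt))   // key = third field, value = fourth field parsed as Int
  .reduceByKey(_ + _)             // sum the values per key (this runs a shuffle)
  .collect()                      // bring the small (~200-key) result back to the driver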

It throws this exception:

java.io.StreamCorruptedException: invalid type code: AC
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
        at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:101)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
        at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:26)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
        at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:53)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:95)

What am I doing wrong?

Thanks!
Best regards,
Lukas