Posted to issues@spark.apache.org by "Sandy Ryza (JIRA)" <ji...@apache.org> on 2015/05/27 21:06:18 UTC

[jira] [Issue Comment Deleted] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer

     [ https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza updated SPARK-7896:
------------------------------
    Comment: was deleted

(was: ChainedBuffer splits data into smaller buffers.  The default size for these buffers is 4 MB.)
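
To illustrate what splitting data into fixed-size buffers involves, here is a minimal sketch of a chunked byte store. It is not Spark's ChainedBuffer: the class ChunkedBytes, its methods, and the constants in ChunkArithmetic are hypothetical names introduced only for this example. It maps a logical byte position to a chunk index plus an offset within that chunk, and it keeps the position as a Long so the mapping stays valid past Int.MaxValue.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration only; this is NOT Spark's ChainedBuffer.
final class ChunkedBytes(chunkSize: Int) {
  require(chunkSize > 0, "chunkSize must be positive")

  // Backing store: a growable list of equally sized chunks.
  private val chunks = ArrayBuffer.empty[Array[Byte]]

  def chunkCount: Int = chunks.length

  /** Copies `len` bytes of `src`, starting at `offset`, to logical position `pos`. */
  def write(pos: Long, src: Array[Byte], offset: Int, len: Int): Unit = {
    require(pos >= 0 && offset >= 0 && len >= 0 && offset + len <= src.length)
    var written = 0
    while (written < len) {
      val p = pos + written                    // Long arithmetic, no Int overflow
      val chunkIndex = (p / chunkSize).toInt   // which chunk holds position p
      val posInChunk = (p % chunkSize).toInt   // offset inside that chunk
      // Allocate chunks until the target chunk exists.
      while (chunks.length <= chunkIndex) chunks += new Array[Byte](chunkSize)
      val n = math.min(len - written, chunkSize - posInChunk)
      System.arraycopy(src, offset + written, chunks(chunkIndex), posInChunk, n)
      written += n
    }
  }

  /** Reads the byte stored at logical position `pos`. */
  def read(pos: Long): Byte =
    chunks((pos / chunkSize).toInt)((pos % chunkSize).toInt)
}

object ChunkArithmetic {
  // Assumes the "4 MB" in the comment above means 4 * 1024 * 1024 bytes.
  val AssumedChunkBytes: Long = 4L * 1024 * 1024
  // First byte position that falls into chunk index 512 under that assumption.
  val FirstPosInChunk512: Long = 512L * AssumedChunkBytes
}
```

Under that assumption, chunk index 512 (the index reported in the exception below) starts at byte 2,147,483,648, which is exactly Int.MaxValue + 1. This message does not state the root cause, so treat the coincidence as a hint rather than a diagnosis.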

> IndexOutOfBoundsException in ChainedBuffer
> ------------------------------------------
>
>                 Key: SPARK-7896
>                 URL: https://issues.apache.org/jira/browse/SPARK-7896
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Arun Ahuja
>            Assignee: Sandy Ryza
>            Priority: Blocker
>
> I've run into this on two tasks that use the same dataset.
> The dataset is a collection of strings where the most common string appears ~200M times and the next few appear ~50M times each.
> For this rdd: RDD[String], I can run rdd.map(x => (x, 1)).reduceByKey(_ + _) to get the counts (which is how I got the numbers above), but I hit the error on rdd.groupByKey().
> Also, I have a second RDD of strings, rdd2: RDD[String], and I cannot run rdd2.leftOuterJoin(rdd) without hitting this error:
> {code}
> 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): java.lang.IndexOutOfBoundsException: 512
>         at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>         at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
>         at org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
>         at org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
>         at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
>         at org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
>         at org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
>         at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
>         at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:70)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
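
The report notes that counting per key succeeds while grouping fails on the same skewed data. The sketch below uses plain Scala collections, not the Spark API, to show the difference in how much state each approach retains; the object CountVsGroup and both method names are hypothetical. In Spark, reduceByKey combines values per key before the shuffle and groupByKey does not, which fits the trace above failing while records are inserted into the shuffle buffer, but this sketch does not reproduce the bug.

```scala
// Hypothetical illustration with plain collections; not Spark code.
object CountVsGroup {
  // Counting keeps one running Long per distinct key,
  // no matter how many times that key repeats.
  def countByKey(words: Iterator[String]): Map[String, Long] =
    words.foldLeft(Map.empty[String, Long]) { (acc, w) =>
      acc.updated(w, acc.getOrElse(w, 0L) + 1L)
    }

  // Grouping retains every occurrence, so a key seen n times
  // ends up holding n values.
  def groupByKey(pairs: Iterator[(String, Int)]): Map[String, Vector[Int]] =
    pairs.foldLeft(Map.empty[String, Vector[Int]]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, Vector.empty[Int]) :+ v)
    }
}
```

For a key that appears ~200M times, the first method holds a single counter while the second holds ~200M entries for that key alone.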


