Posted to user@spark.apache.org by Joel Keller <jk...@miovision.com> on 2016/01/26 18:36:14 UTC

Scala closure exceeds ByteArrayOutputStream limit (~2gb)

Hello,

I am running RandomForest from MLlib on a data set with very
high-dimensional data (~50k dimensions).

I get the following stack trace:

16/01/22 21:52:48 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
    at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:624)
    at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
    at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
    at org.apache.spark.mllib.tree.RandomForest.trainClassifier(RandomForest.scala)
    at com.miovision.cv.spark.CifarTrainer.main(CifarTrainer.java:108)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)



I have determined that the problem is that when the ClosureCleaner checks
that a closure is serializable (ensureSerializable), it serializes the
closure into an underlying Java byte array (via ByteArrayOutputStream),
which is limited to about 2 GB because array sizes are indexed by a signed
32-bit int.
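
To make the limit concrete, here is a standalone sketch (plain Scala,
nothing Spark-specific; it needs a heap well above 2 GB to actually reach
the ceiling) that runs into the same hugeCapacity error:

import java.io.ByteArrayOutputStream

object ByteArrayLimitDemo {
  def main(args: Array[String]): Unit = {
    // ByteArrayOutputStream is backed by a single byte[]; Java arrays are
    // indexed by a signed 32-bit int, so the stream cannot grow past ~2 GB.
    val out = new ByteArrayOutputStream()
    val chunk = new Array[Byte](64 * 1024 * 1024) // write 64 MB at a time
    try {
      while (true) out.write(chunk, 0, chunk.length)
    } catch {
      // Once the requested capacity would exceed Integer.MAX_VALUE,
      // ByteArrayOutputStream.hugeCapacity throws OutOfMemoryError --
      // the same first frame as in the trace above.
      case e: OutOfMemoryError => println("Hit the array-size ceiling: " + e)
    }
  }
}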

I believe that the closure has grown very large due to the high number of
features (dimensions) and the statistics that must be collected for each of them.
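
As a rough illustration of the scale (every number below is an assumption
for the sake of the example, not something I measured, and I am not
claiming this is exactly what ends up in the closure):

// Back-of-the-envelope only; all values are assumptions for illustration.
val numFeatures    = 50000   // ~50k dimensions, as above
val numBins        = 32      // MLlib's default maxBins
val numClasses     = 10      // hypothetical class count
val bytesPerDouble = 8L

// Classification impurity statistics amount to roughly one double per
// (feature, bin, class) for every tree node being split:
val bytesPerNode = numFeatures.toLong * numBins * numClasses * bytesPerDouble
println(bytesPerNode / (1024 * 1024) + " MB per node")  // ~122 MB
// Around 17 nodes' worth of that in a single group already adds up to
// more than 2 GB, so it does not take much to hit the ceiling.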


Does anyone know of a way to make MLlib's RandomForest implementation limit
the size here so that serialized closures stay under 2 GB, or,
alternatively, is there a way to let Spark work with a closure this large?
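
For reference, the knobs I can see are on the Strategy-based overload of
trainClassifier. A rough sketch of the kind of tuning I mean (the parameter
values are placeholders, and I have not verified that this keeps the
closure under the limit):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini
import org.apache.spark.rdd.RDD

// Sketch only: shrink the per-iteration working set so that whatever the
// driver serializes per task stays well below the ~2 GB array limit.
def trainWithSmallerGroups(data: RDD[LabeledPoint], numClasses: Int) = {
  val strategy = new Strategy(
    algo = Algo.Classification,
    impurity = Gini,
    maxDepth = 10,            // placeholder depth
    numClasses = numClasses,
    maxBins = 16,             // fewer candidate splits per feature (default 32)
    maxMemoryInMB = 128)      // fewer nodes split per iteration (default 256)
  RandomForest.trainClassifier(
    data,
    strategy,
    numTrees = 50,                   // placeholder
    featureSubsetStrategy = "sqrt",  // each split considers ~sqrt(50k) features
    seed = 42)
}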


I am running this training on a very large cluster of very large machines,
so RAM is not the problem here.  The problem is Java's 32-bit limit on
array sizes.


Thanks,

Joel

Re: Scala closure exceeds ByteArrayOutputStream limit (~2gb)

Posted by Mungeol Heo <mu...@gmail.com>.
Hello, Joel.

Have you found a solution to the problem caused by Java's 32-bit limit on array sizes?

Thanks.
