Posted to user@mahout.apache.org by Xavier Rampino <xr...@senscritique.com> on 2015/05/13 11:21:02 UTC

spark-rowsimilarity java.lang.OutOfMemoryError: Java heap space

Hello,

I've tried spark-rowsimilarity with an out-of-the-box setup (downloaded the Mahout
distribution and Spark, and set up the PATH), and I stumble upon a Java
heap space error. My input file is ~100MB. It seems the various parameters
I tried won't change this. I run:

~/mahout-distribution-0.10.0/bin/mahout spark-rowsimilarity --input
~/query_result.tsv --output ~/work/result -sem 24g
-D:spark.executor.memory=24g

Do I just need to give it more memory, or is there another step I can take to
solve this?

Re: spark-rowsimilarity java.lang.OutOfMemoryError: Java heap space

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The way the code works is:
1) create a BiMap for every ID space in the client code (users and items). This is non-distributed code, typically run on the machine you launch from, although in yarn-cluster mode the actual machine may be different. In any case the heap used is associated with the driver itself, not distributed code.
2) the BiMap is broadcast (copied) to every worker. This instantiates it in memory shared by all executors on the worker, so there is only one copy per machine. Since it may be large, this is the best way to handle it.

#1 requires that you have enough memory in the driver to create the BiMap. This memory is allocated when the driver is launched and is available as heap. If you are not using YARN this is plain JVM memory, so use any of the usual methods of setting -Xmx4g (or however much you need), e.g. something like “export JAVA_OPTS=-Xmx4g”. You would have to have a giant BiMap to use that much memory. HashMap storage holds an index and a copy of every key/value pair, and a BiMap contains two HashMaps. If your ID strings are very long this increases the space required, so, index aside, the memory needed grows with the size of your ID strings; ints are used as the internal Mahout IDs.
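For the non-YARN case, here is one hedged way to raise the driver-side heap before launching. Which variable bin/mahout actually reads (JAVA_OPTS, MAHOUT_OPTS, or both) depends on the launcher script in your distribution, so check your copy; the 4g figure is just an example:

```shell
# Give the driver JVM a 4g heap; which variable bin/mahout honors depends on
# the distribution's launcher script, and exporting both is harmless.
export JAVA_OPTS="-Xmx4g"
export MAHOUT_OPTS="-Xmx4g"
# then launch as before, e.g.:
#   ~/mahout-distribution-0.10.0/bin/mahout spark-rowsimilarity \
#     --input ~/query_result.tsv --output ~/work/result
```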

If you are using spark-submit you can change executor memory there. You can also change it in the Spark conf files, or with the driver’s -D:spark.executor.memory=4g. These use different mechanisms to get the config changed, but all should work. Feel free to try a different method if you think -sem doesn’t.
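To make those mechanisms concrete — the conf-file edit is demonstrated against a scratch file so it doesn't touch a real install:

```shell
# 1) Mahout driver flag:  -sem 4g   (or the equivalent -D:spark.executor.memory=4g)
# 2) spark-submit flag:   spark-submit --executor-memory 4g ...
# 3) Spark conf file entry; shown here on a scratch file rather than the real
#    $SPARK_HOME/conf/spark-defaults.conf
conf=./spark-defaults.conf
echo "spark.executor.memory 4g" >> "$conf"
grep spark.executor.memory "$conf"
```

The conf-file route applies to every job launched against that install; the flags apply per run.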

Are you using yarn-client or yarn-cluster? Can you share your entire command line and console error log? The log line also states that you have 1.8g free, so we need to pinpoint which chunk of memory is being exhausted. Also, it would help if you could share a snippet of your data.

On May 18, 2015, at 6:10 AM, Xavier Rampino <xr...@senscritique.com> wrote:

I just did that but I ran into the same problem, I feel like -sem doesn't
work with my setup. For instance I have :

15/05/18 13:44:39 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
localhost:60596 in memory (size: 2.7 KB, free: *1761.1 MB*)

(Maybe it's not related though)

On Wed, May 13, 2015 at 7:27 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> There is a bug in mahout 0.10.0 that you can fix if you are able to build
> from source. Get the source tar for 0.10.0, not the current master.
> 
> Go to
> https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157
> 
> remove the line that says: interactions.collect()
> 
> See this Jira https://issues.apache.org/jira/browse/MAHOUT-1707
> 
> There is one other thing that can cause this and is fixed by increasing
> your client JVM heap space, but try the above first.
> 
> BTW, setting the executor memory twice is not necessary.
> 
> 
> On May 13, 2015, at 2:21 AM, Xavier Rampino <xr...@senscritique.com>
> wrote:
> 
> Hello,
> 
> I've tried spark-rowsimilarity with out-of-the-box setup (downloaded mahout
> distribution and spark, and set up the PATH), and I stumble upon a Java
> Heap space error. My input file is ~100MB. It seems the various parameters
> I tried to give won't change this. I do :
> 
> ~/mahout-distribution-0.10.0/bin/mahout spark-rowsimilarity --input
> ~/query_result.tsv --output ~/work/result -sem 24g
> -D:spark.executor.memory=24g
> 
> Do I just need to input more memory, or is there another step I can do to
> solve this ?
> 
> 


Re: spark-rowsimilarity java.lang.OutOfMemoryError: Java heap space

Posted by Xavier Rampino <xr...@senscritique.com>.
I just did that but I ran into the same problem, I feel like -sem doesn't
work with my setup. For instance I have :

15/05/18 13:44:39 INFO BlockManagerInfo: Removed broadcast_13_piece0 on
localhost:60596 in memory (size: 2.7 KB, free: *1761.1 MB*)

(Maybe it's not related though)

On Wed, May 13, 2015 at 7:27 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> There is a bug in mahout 0.10.0 that you can fix if you are able to build
> from source. Get the source tar for 0.10.0, not the current master.
>
> Go to
> https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157
>
> remove the line that says: interactions.collect()
>
> See this Jira https://issues.apache.org/jira/browse/MAHOUT-1707
>
> There is one other thing that can cause this and is fixed by increasing
> your client JVM heap space, but try the above first.
>
> BTW, setting the executor memory twice is not necessary.
>
>
> On May 13, 2015, at 2:21 AM, Xavier Rampino <xr...@senscritique.com>
> wrote:
>
> Hello,
>
> I've tried spark-rowsimilarity with out-of-the-box setup (downloaded mahout
> distribution and spark, and set up the PATH), and I stumble upon a Java
> Heap space error. My input file is ~100MB. It seems the various parameters
> I tried to give won't change this. I do :
>
> ~/mahout-distribution-0.10.0/bin/mahout spark-rowsimilarity --input
> ~/query_result.tsv --output ~/work/result -sem 24g
> -D:spark.executor.memory=24g
>
> Do I just need to input more memory, or is there another step I can do to
> solve this ?
>
>

Re: spark-rowsimilarity java.lang.OutOfMemoryError: Java heap space

Posted by Pat Ferrel <pa...@occamsmachete.com>.
There is a bug in mahout 0.10.0 that you can fix if you are able to build from source. Get the source tar for 0.10.0, not the current master.

Go to https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157

remove the line that says: interactions.collect()

See this Jira https://issues.apache.org/jira/browse/MAHOUT-1707
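To sketch the mechanics of that one-line fix: the file path is from the link above, but the stand-in content below is illustrative only, so verify the actual line against the 0.10.x source and MAHOUT-1707 before rebuilding (e.g. with mvn -DskipTests clean install from the source tree root):

```shell
# Demonstrate the edit on a stand-in copy of
# spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
cat > TextDelimitedReaderWriter.scala <<'EOF'
// stand-in for the real file; only the middle line matters here
val interactions = readInteractions(source)
interactions.collect()
writeTo(dest, interactions)
EOF
# delete the line containing the spurious collect()
sed -i '/interactions\.collect()/d' TextDelimitedReaderWriter.scala
cat TextDelimitedReaderWriter.scala
```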

There is one other thing that can cause this and is fixed by increasing your client JVM heap space, but try the above first.

BTW, setting the executor memory twice is not necessary.


On May 13, 2015, at 2:21 AM, Xavier Rampino <xr...@senscritique.com> wrote:

Hello,

I've tried spark-rowsimilarity with out-of-the-box setup (downloaded mahout
distribution and spark, and set up the PATH), and I stumble upon a Java
Heap space error. My input file is ~100MB. It seems the various parameters
I tried to give won't change this. I do :

~/mahout-distribution-0.10.0/bin/mahout spark-rowsimilarity --input
~/query_result.tsv --output ~/work/result -sem 24g
-D:spark.executor.memory=24g

Do I just need to input more memory, or is there another step I can do to
solve this ?