Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/11/19 22:04:46 UTC

Spark defaults

For historical reasons some code I stole to do the similarity stuff had the following spark settings:

    sparkConf.set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200") // TODO: should this be left to config or an option?

I’m not all that familiar with Kryo. Are these things better left to a -D:key=value type param? It seems like they shouldn’t be hard-coded unless tracking is universal. Any opinions?
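One middle ground (a sketch, not tested against the Mahout driver; `setIfMissing` is standard SparkConf API) would be to keep these values as overridable defaults rather than hard-coding them:

```scala
import org.apache.spark.SparkConf

// Sketch: keep the values above as Mahout-side defaults, but let anything
// the user supplies via --conf or system properties win. setIfMissing only
// writes a key when it has not already been set.
val sparkConf = new SparkConf()
sparkConf.setIfMissing("spark.kryo.referenceTracking", "false")
sparkConf.setIfMissing("spark.kryoserializer.buffer.mb", "200")
```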



Re: Spark defaults

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Makes sense, will leave them as defaults

On Nov 19, 2014, at 1:45 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

the buffer default for Mahout should be bigger than Spark's default.
It may seem like a poor decision, but the fact is that the optimizer merges
("blockifies") matrix row partitions within some partitions in a lazy way in
order to simplify (and perhaps even speed up) block-wise matrix
algorithms.

As a result, there are sometimes situations where Spark may decide to put
an entire block on the wire as a single blob. That implies that the entire
matrix partition may need to fit into the kryo buffer at times.


On Wed, Nov 19, 2014 at 1:04 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> For historical reasons some code I stole to do the similarity stuff had
> the following spark settings:
> 
>    sparkConf.set("spark.kryo.referenceTracking", "false")
>      .set("spark.kryoserializer.buffer.mb", "200") // TODO: should this be
> left to config or an option?
> 
> I’m not all that familiar with Kryo. Are these things better left to a
> -D:key=value type param? It seems like they shouldn’t be hard-coded unless
> tracking is universal. Any opinions?
> 
> 
> 


Re: Spark defaults

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
the buffer default for Mahout should be bigger than Spark's default.
It may seem like a poor decision, but the fact is that the optimizer merges
("blockifies") matrix row partitions within some partitions in a lazy way in
order to simplify (and perhaps even speed up) block-wise matrix
algorithms.

As a result, there are sometimes situations where Spark may decide to put
an entire block on the wire as a single blob. That implies that the entire
matrix partition may need to fit into the kryo buffer at times.
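For completeness, the same two settings can also be supplied at submit time rather than in code (the class and jar names below are placeholders; the --conf flags are standard Spark), which is what a -D-style override would boil down to:

```shell
# Placeholder class/jar names; the --conf keys are the ones discussed above.
spark-submit \
  --conf spark.kryo.referenceTracking=false \
  --conf spark.kryoserializer.buffer.mb=200 \
  --class org.example.MyJob \
  myjob.jar
```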


On Wed, Nov 19, 2014 at 1:04 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> For historical reasons some code I stole to do the similarity stuff had
> the following spark settings:
>
>     sparkConf.set("spark.kryo.referenceTracking", "false")
>       .set("spark.kryoserializer.buffer.mb", "200") // TODO: should this be
> left to config or an option?
>
> I’m not all that familiar with Kryo. Are these things better left to a
> -D:key=value type param? It seems like they shouldn’t be hard-coded unless
> tracking is universal. Any opinions?
>
>
>