Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/07/29 18:10:09 UTC

Kryo

I need to sort each vector inside an rdd.map. The last time I added a collection class, Guava’s HashBiMap, I had to add it to the MahoutKryoRegistrator.

This time, at first it wouldn't serialize when I used a Scala List[Vector.Element], and I can’t seem to add the Scala List to the MahoutKryoRegistrator because it doesn’t understand the class name. So I had to fall back to using Java’s ArrayList, which doesn’t require registering for some reason.

What are the rules for when, why, and what we need to register with the MahoutKryoRegistrator? Is there a problem with just registering the Scala collection library?

Re: Kryo

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
What I mean in (4), among other things, is that even the fanciest possible
bidirectional hash map is still a map, i.e. the simplest serialization
envelope for this data is a bag of (key, value) pairs. Java-serializing the
HashBiMap itself is probably the bulkiest thing to do here. A much faster way
is to serialize a Scala iterator over those tuples (a collection wrapper), in
a non-strict way. Obviously, it is also possible to map it to a strict Scala
collection and serialize that (probably shorter notation but bigger memory
overhead).
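As a rough illustration of the "bag of (key, value) pairs" idea above, here is a minimal Scala sketch. It uses a plain immutable Map as a stand-in for the Guava HashBiMap, and the names (BagOfPairs, dictionary, lazyPairs and so on) are made up for this example. It only shows the shape of the two envelopes and how both lookup directions can be rebuilt on the receiving side; it does not invoke Kryo.

```scala
object BagOfPairs {
  // Stand-in for the bidirectional dictionary (hypothetical data, not from the thread).
  val dictionary: Map[String, Int] = Map("apple" -> 0, "banana" -> 1, "cherry" -> 2)

  // Non-strict envelope: a fresh iterator over the (key, value) pairs on each call.
  def lazyPairs: Iterator[(String, Int)] = dictionary.iterator

  // Strict envelope: all pairs materialized up front (shorter to write, more memory).
  val strictPairs: Vector[(String, Int)] = dictionary.toVector

  // Receiving side: rebuild both lookup directions from the plain pairs.
  val forward: Map[String, Int] = strictPairs.toMap
  val inverse: Map[Int, String] = strictPairs.map(_.swap).toMap
}
```

Rebuilding the inverse map this way assumes the values are unique, which is the same constraint a bidirectional map enforces.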


On Tue, Jul 29, 2014 at 10:30 AM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:

> Here are a few useful facts, mixed with my opinion:
>
> (1) My take is that Mahout (at this point at least) doesn't need to support
> serialization of any specific classes other than Matrix and Vector, because
> anything else is not algebra.
>
> (2) Most native Scala types, including Scala collections, are already
> supported by Kryo by default.
>
> (3) We don't want to use Java collections in Scala code as a serialization
> envelope. Like, ever.
>
> (4) Clearly, a Spark application working with RDDs outside of Mahout's
> algebraic support may want to use a specific serialization envelope which
> is neither a matrix nor a standard Scala type/collection (not sure why it
> would -- but ok). In this case the real solution is to provide a way for the
> application to _decorate_ the default Mahout registrator, rather than hack
> the registrator itself.
>
>
>
> On Tue, Jul 29, 2014 at 10:18 AM, Pat Ferrel <pa...@occamsmachete.com>
> wrote:
>
>> This time it doesn’t seem to be related to registering class serializers.
>> Seems like the Scala collections work as well as the Java ones. It would
>> still be nice to know when we have to add to that list in
>> MahoutKryoRegistrator. When a job fails to serialize, the message is not
>> very helpful.
>>
>>
>> On Jul 29, 2014, at 9:10 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>>
>> I need to sort each vector inside an rdd.map. The last time I added
>> a collection class, Guava’s HashBiMap, I had to add it to the
>> MahoutKryoRegistrator.
>>
>> This time, at first it wouldn't serialize when I used a Scala
>> List[Vector.Element], and I can’t seem to add the Scala List
>> to the MahoutKryoRegistrator because it doesn’t understand the class name. So
>> I had to fall back to using Java’s ArrayList, which doesn’t require
>> registering for some reason.
>>
>> What are the rules for when, why, and what we need to register with the
>> MahoutKryoRegistrator? Is there a problem with just registering the Scala
>> collection library?
>>
>>
>

Re: Kryo

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Here are a few useful facts, mixed with my opinion:

(1) My take is that Mahout (at this point at least) doesn't need to support
serialization of any specific classes other than Matrix and Vector, because
anything else is not algebra.

(2) Most native Scala types, including Scala collections, are already
supported by Kryo by default.

(3) We don't want to use Java collections in Scala code as a serialization
envelope. Like, ever.

(4) Clearly, a Spark application working with RDDs outside of Mahout's
algebraic support may want to use a specific serialization envelope which
is neither a matrix nor a standard Scala type/collection (not sure why it
would -- but ok). In this case the real solution is to provide a way for the
application to _decorate_ the default Mahout registrator, rather than hack
the registrator itself.
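To make point (4) concrete, here is one way the "decorate" idea could look. This is only a sketch under assumptions: it relies on Spark's KryoRegistrator trait and on MahoutKryoRegistrator living in org.apache.mahout.sparkbindings.io with a no-argument constructor. MyEnvelope and MyAppKryoRegistrator are hypothetical names invented for the example, and it needs the Spark, Mahout and Kryo jars on the classpath.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application-specific envelope that is neither a matrix
// nor a standard Scala collection.
case class MyEnvelope(id: Int, label: String)

// Hypothetical application-side registrator: it delegates to Mahout's
// registrator first, then adds the application's own classes on top.
class MyAppKryoRegistrator extends KryoRegistrator {

  // Assumption: MahoutKryoRegistrator can be instantiated directly.
  private val mahoutRegistrator = new MahoutKryoRegistrator

  override def registerClasses(kryo: Kryo): Unit = {
    // Keep everything Mahout registers (Matrix, Vector, ...).
    mahoutRegistrator.registerClasses(kryo)
    // Add only what this application needs.
    kryo.register(classOf[MyEnvelope])
  }
}
```

The application would then point Spark's spark.kryo.registrator property at the fully qualified name of its own registrator instead of editing Mahout's. Whether Mahout's context setup allows that property to be overridden is not covered in this thread.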



On Tue, Jul 29, 2014 at 10:18 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> This time it doesn’t seem to be related to registering class serializers.
> Seems like the Scala collections work as well as the Java ones. It would
> still be nice to know when we have to add to that list in
> MahoutKryoRegistrator. When a job fails to serialize, the message is not
> very helpful.
>
>
> On Jul 29, 2014, at 9:10 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>
> I need to sort each vector inside an rdd.map. The last time I added a
> collection class, Guava’s HashBiMap, I had to add it to the
> MahoutKryoRegistrator.
>
> This time, at first it wouldn't serialize when I used a Scala
> List[Vector.Element], and I can’t seem to add the Scala List
> to the MahoutKryoRegistrator because it doesn’t understand the class name. So
> I had to fall back to using Java’s ArrayList, which doesn’t require
> registering for some reason.
>
> What are the rules for when, why, and what we need to register with the
> MahoutKryoRegistrator? Is there a problem with just registering the Scala
> collection library?
>
>

Re: Kryo

Posted by Pat Ferrel <pa...@occamsmachete.com>.
This time it doesn’t seem to be related to registering class serializers. It seems the Scala collections work as well as the Java ones. It would still be nice to know when we have to add to that list in MahoutKryoRegistrator. When a job fails to serialize, the message is not very helpful.


On Jul 29, 2014, at 9:10 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

I need to sort each vector inside an rdd.map. The last time I added a collection class, Guava’s HashBiMap, I had to add it to the MahoutKryoRegistrator.

This time, at first it wouldn't serialize when I used a Scala List[Vector.Element], and I can’t seem to add the Scala List to the MahoutKryoRegistrator because it doesn’t understand the class name. So I had to fall back to using Java’s ArrayList, which doesn’t require registering for some reason.

What are the rules for when, why, and what we need to register with the MahoutKryoRegistrator? Is there a problem with just registering the Scala collection library?