Posted to dev@giraph.apache.org by Pere Ferrera <fe...@gmail.com> on 2012/09/19 20:19:41 UTC

serialization / deserialization improvement suggestion

Hi to all,

I have been taking a look at Giraph's source code. I noticed the heavy
use of Writables in it and, even though I don't know many of the details
of the project, I think it would be a good idea to at least consider
using Pangool instead of the Java Hadoop API.

Pangool (http://pangool.net) is a low-level Java API on top of Hadoop that
aims to make several things easier, one of which is dealing with compound
types. Most of the others don't apply to Giraph since you are running
map-only jobs.

The most interesting part of it for Giraph is that you would be able to
have Vertices with plain Java classes (Integer, Float, ... or arbitrary
serializable Objects) without needing to worry about them being Writable.
This would reduce some of the code and complexity of the project, and it
would allow for more expressive code, decoupled from Hadoop, where user
functions (business logic) operate directly on Java types rather than on
Hadoop types.
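
To make the Writable side of this contrast concrete, here is a minimal
sketch of the boilerplate a compound type carries today. It is an
illustration under assumptions, not Giraph or Pangool code: WritableLike is
a hypothetical local stand-in that mirrors the two methods of Hadoop's
Writable interface (write(DataOutput) and readFields(DataInput)) so the
example compiles without Hadoop, and VertexValue with its two fields is
invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableSketch {
    // Serializes any WritableLike into a byte array.
    static byte[] toBytes(WritableLike w) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Rebuilds a VertexValue from bytes produced by toBytes.
    static VertexValue fromBytes(byte[] bytes) {
        VertexValue v = new VertexValue();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes))) {
            v.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return v;
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new VertexValue(7, 0.25f));
        VertexValue copy = fromBytes(bytes);
        System.out.println(bytes.length + " bytes, id=" + copy.id
            + ", rank=" + copy.rank);
    }
}

// Hypothetical stand-in for Hadoop's Writable contract, declared locally
// so the sketch has no Hadoop dependency.
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Invented compound value: every field must be written and read by hand,
// in the same order, and the class needs a no-argument constructor.
class VertexValue implements WritableLike {
    int id;
    float rank;

    VertexValue() {}

    VertexValue(int id, float rank) {
        this.id = id;
        this.rank = rank;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeFloat(rank);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        rank = in.readFloat();
    }
}
```

The write/readFields pair is the part user code would no longer have to
maintain if values were plain Java types; the trade-off is that the
hand-written encoding is very compact (an int plus a float here).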

Pangool has been designed for performance, so it should perform on the
same order as plain Hadoop (we ran a benchmark to show that). Pangool uses
Avro for persisting data. It is being used successfully in production in
some of our consulting projects (datasalt.com), so we contribute actively
to it.

So, if this could be interesting at all, I will be glad to submit a
proposal as a patch and contribute. It would be a win-win situation, since
Pangool would benefit a lot from being actively used by a serious
open-source project like Giraph. Of course, many details will need to be
discussed. Take this as a preliminary suggestion just to see how it
sounds. Feel free to raise any questions or concerns you may have.

Thanks,

Pere.

Re: serialization / deserialization improvement suggestion

Posted by Pere Ferrera <fe...@gmail.com>.
Hi Avery,

There is also the stateful Mapper/Reducer feature that could potentially be
nice for Giraph since users could write Serializable stuff in their
business logic that would be automatically available by being
serialized/deserialized with the DistributedCache underneath.

Regarding serialization, Pangool is very efficient in intermediate
serialization (the one used in shuffle/sort) and less efficient for
persisting (Avro). As a downside, for every user custom data type there is
an extra byte used (to identify its size).
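
As a rough illustration of where a per-value size byte comes from, the
sketch below frames a payload with a one-byte length prefix. This is an
assumption made for illustration only and is not Pangool's actual wire
format; the class and method names are hypothetical, and the sketch simply
rejects payloads longer than 255 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LengthPrefixSketch {
    // Frames a payload as [1-byte length][payload bytes].
    // Simplification: payloads over 255 bytes are rejected.
    static byte[] frame(byte[] payload) {
        if (payload.length > 255) {
            throw new IllegalArgumentException(
                "payload too large for a 1-byte length prefix");
        }
        byte[] framed = new byte[payload.length + 1];
        framed[0] = (byte) payload.length;
        System.arraycopy(payload, 0, framed, 1, payload.length);
        return framed;
    }

    // Reads the framed record that starts at the given offset.
    static byte[] unframe(byte[] framed, int offset) {
        int length = framed[offset] & 0xFF; // treat the byte as unsigned
        return Arrays.copyOfRange(framed, offset + 1, offset + 1 + length);
    }

    public static void main(String[] args) {
        byte[] payload = "v42".getBytes(StandardCharsets.UTF_8);
        byte[] framed = frame(payload);
        System.out.println(payload.length + " payload bytes -> "
            + framed.length + " framed bytes");
    }
}
```

The prefix lets a reader skip or copy a custom value without understanding
its contents, at the cost of one extra byte per value.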

I will also analyze the project further to get a better picture and
suggest something more specific if it makes sense.

On Wed, Sep 19, 2012 at 11:12 PM, Avery Ching <ac...@apache.org> wrote:

> Thanks for contacting us Pere.
>
> We use Writable for serialization/deserialization given its speed.  We
> are open to other APIs, but speed is an important concern (a lot of time is
> spent doing serialization/deserialization).  We don't use the actual
> Hadoop framework for much except for scheduling, so I'm not sure how we can
> take advantage of Pangool's interesting features.
>
> Avery

Re: serialization / deserialization improvement suggestion

Posted by Avery Ching <ac...@apache.org>.
Thanks for contacting us Pere.

We use Writable for serialization/deserialization given its speed.  We
are open to other APIs, but speed is an important concern (a lot of time 
is spent doing serialization/deserialization).  We don't use the actual 
Hadoop framework for much except for scheduling, so I'm not sure how we 
can take advantage of Pangool's interesting features.
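
One reason hand-written, Writable-style encoding is hard to beat is how
little it writes. The sketch below compares the encoded size of the same
two fields written field by field through DataOutputStream against
standard Java serialization through ObjectOutputStream. It is a size
illustration only, not a timing benchmark and not a measurement of Giraph
or Pangool (Pangool does not necessarily use Java serialization);
RankValue and the method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SizeComparisonSketch {
    // Invented value type with one int and one float field.
    static class RankValue implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        final float rank;

        RankValue(int id, float rank) {
            this.id = id;
            this.rank = rank;
        }
    }

    // Writable-style: write only the raw field values, no metadata.
    static int handWrittenSize(RankValue v) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(v.id);
            out.writeFloat(v.rank);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    // Java serialization: also writes a stream header and a class
    // descriptor alongside the field values.
    static int javaSerializationSize(RankValue v) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(v);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        RankValue v = new RankValue(7, 0.25f);
        System.out.println("hand-written: " + handWrittenSize(v) + " bytes");
        System.out.println("java.io.Serializable: "
            + javaSerializationSize(v) + " bytes");
    }
}
```

Any alternative API would need to stay close to the hand-written figure,
since these values are serialized and deserialized very frequently.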

Avery
