Posted to user@avro.apache.org by Marshall Bockrath-Vandegrift <ll...@gmail.com> on 2013/05/09 00:33:36 UTC

Hadoop serialization DatumReader/Writer

Hi all:

Is there a reason Avro’s Hadoop serialization classes don’t allow
configuration of the DatumReader and DatumWriter classes?

My use-case is that I’m implementing Clojure DatumReader and -Writer
classes which produce and consume Clojure’s data structures directly.
I’d like to then extend that to Hadoop MapReduce jobs which operate in
terms of Clojure data, with Avro handling all de/serialization directly
to/from that Clojure data.

Am I going about this backwards, or would a patch to allow
configuration of the Hadoop serialization DatumReader/Writers be
welcome?
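Concretely, what I have in mind is something like the following sketch.  To
be clear, everything here is made up for illustration -- the
`avro.serialization.datum.writer.class` key and the stand-in writer types are
hypothetical, not real Avro APIs -- but it shows the reflection-based
lookup I'd expect the serialization classes to do:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for Avro's DatumWriter; the real interface lives in
// org.apache.avro.io and is generic over the datum type.
interface SimpleDatumWriter {
    String write(Object datum);
}

// Hypothetical default (generic-data) implementation.
class GenericWriter implements SimpleDatumWriter {
    public String write(Object datum) { return "generic:" + datum; }
}

// Hypothetical Clojure-flavoured implementation of the kind described above.
class ClojureWriter implements SimpleDatumWriter {
    public String write(Object datum) { return "clojure:" + datum; }
}

public class ConfigurableSerializationSketch {
    // Hypothetical configuration key; Avro defines no such property.
    static final String WRITER_KEY = "avro.serialization.datum.writer.class";

    // What a configurable AvroSerialization could do: read a class name
    // from the job configuration, fall back to the generic writer, and
    // instantiate it reflectively (Hadoop does similar lookups with
    // ReflectionUtils.newInstance).
    static SimpleDatumWriter newWriter(Map<String, String> conf) throws Exception {
        String className = conf.getOrDefault(WRITER_KEY, GenericWriter.class.getName());
        return (SimpleDatumWriter) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> conf = new HashMap<>();
        System.out.println(newWriter(conf).write("x"));   // default writer

        conf.put(WRITER_KEY, ClojureWriter.class.getName());
        System.out.println(newWriter(conf).write("x"));   // configured writer
    }
}
```

The point being just that a single configuration property would be enough to
swap in a custom DatumWriter without touching the serialization classes
themselves.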

-Marshall


Re: Hadoop serialization DatumReader/Writer

Posted by Marshall Bockrath-Vandegrift <ll...@gmail.com>.
Scott Carey <sc...@apache.org> writes:

> Making the DatumReader/Writers configurable would be a welcome
> addition.

Excellent!

> Ideally, much more of what goes on there could be:
>  1. configuration driven
>  2. pre-computed to avoid repeated work during decoding/encoding
>
> We do some of both already.  The trick is to do #1 without impacting
> performance and #2 requires a bigger overhaul.

Which work in particular?  In my pass through the AvroSerialization
implementation so far, it looks like each MR task would create either
one or two Serializers/Deserializers (key and value), each of which in
turn would create one DatumWriter/DatumReader and Encoder/Decoder pair.
Or do De/Serializers get created multiple times per task?
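For reference, the lifecycle I'm assuming looks roughly like this toy sketch
(stand-in types, not the real AvroSerializer): the writer/encoder pair is
built once when the stream is opened, then reused for every record in the
task:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a Hadoop Serializer wrapping an Avro DatumWriter/Encoder pair.
class CountingSerializer {
    int writersCreated = 0;   // how many writer/encoder pairs were built
    List<String> out;

    // open() runs once per task/stream; this is where the single
    // DatumWriter and Encoder would be constructed.
    void open(List<String> sink) {
        writersCreated++;
        out = sink;
    }

    // serialize() runs once per record, reusing the same writer.
    void serialize(String record) {
        out.add(record);
    }
}

public class SerializerLifecycleSketch {
    public static void main(String[] args) {
        CountingSerializer keySerializer = new CountingSerializer();
        List<String> sink = new ArrayList<>();
        keySerializer.open(sink);             // once per task
        for (String rec : new String[]{"a", "b", "c"}) {
            keySerializer.serialize(rec);     // many times per task
        }
        System.out.println(keySerializer.writersCreated + " writer(s), "
                + sink.size() + " records");
    }
}
```

If that's right, then making the writer class configurable is a one-time
reflective lookup per task, with no per-record cost.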

> If you would like, a contribution including a Clojure related maven
> module or two that depends on the Java stuff would be a welcome
> addition and allow us to identify compatibility issues as we change
> the Java library over time.

That sounds like a great end-goal.  Right now at the company I work for
(Damballa) we've just started getting our toes wet with Avro.  Avro won
our serialization-format bake-off, but we haven't started actually using
it.  I just finished an initial pass at Avro-Clojure integration and we
have released it under an open source license:

    https://github.com/damballa/abracad

I would very much like to eventually get an iteration of it into Avro
proper, but I want to actually start using it and Avro first, so we can
hammer out any interface issues etc.

Anyway, I'll try to work up a patch to add some more configuration hooks
to the AvroSerialization.  Should I also create a ticket in the Avro
issue tracker?

-Marshall


Re: Hadoop serialization DatumReader/Writer

Posted by Scott Carey <sc...@apache.org>.
Making the DatumReader/Writers configurable would be a welcome addition.

Ideally, much more of what goes on there could be:
 1. configuration driven
 2. pre-computed to avoid repeated work during decoding/encoding

We do some of both already.  The trick is to do #1 without impacting
performance and #2 requires a bigger overhaul.

If you would like, a contribution including a Clojure related maven module
or two that depends on the Java stuff would be a welcome addition and
allow us to identify compatibility issues as we change the Java library
over time.


On 5/8/13 3:33 PM, "Marshall Bockrath-Vandegrift" <ll...@gmail.com>
wrote:

>Hi all:
>
>Is there a reason Avro's Hadoop serialization classes don't allow
>configuration of the DatumReader and DatumWriter classes?
>
>My use-case is that I'm implementing Clojure DatumReader and -Writer
>classes which produce and consume Clojure's data structures directly.
>I'd like to then extend that to Hadoop MapReduce jobs which operate in
>terms of Clojure data, with Avro handling all de/serialization directly
>to/from that Clojure data.
>
>Am I going about this backwards, or would a patch to allow
>configuration of the Hadoop serialization DatumReader/Writers be
>welcome?
>
>-Marshall
>