Posted to mapreduce-user@hadoop.apache.org by Vyacheslav Zholudev <vy...@gmail.com> on 2011/07/26 14:46:34 UTC

Multiple avro outputs from a reducer

Hi,

I'm using the avro format both for input and output, for a mapper and a
reducer. I would like to output multiple avro items with different schemata.
For sequence files I would use the MultipleOutputs class from the mapreduce
package.

I looked into the same class but from the old package "mapred" and realized
that I can pass an AvroOutputFormat.class parameter when adding another
output. However, I didn't manage to figure out how to provide an avro schema
for each output. Moreover, when writing to an output, I need to provide a key
and a value, but in the case of avro we usually just pass a specific avro
object. All of the above makes me think that the old MultipleOutputs API
wouldn't work with avro files. Am I right?

Any pointers on how to output multiple avro records in the same reducer are
appreciated.

P.S. Another thought was to create an avro schema of type union that will
contain all possible output schemata, but I would like to avoid that.

Thanks in advance!!!

-- 
Best,
Vyacheslav

Re: Multiple avro outputs from a reducer

Posted by Vyacheslav Zholudev <vy...@gmail.com>.
Hi all,

I tried to follow the suggestions, looked at how the Avro code works in mappers and reducers, and created a simple class for Avro multiple outputs. If you are interested in looking at or reviewing it, you can follow the link:
http://pastebin.com/HMPfgttg

Any suggestions and comments are highly appreciated.

Vyacheslav

On Jul 30, 2011, at 7:26 PM, Jason wrote:

> You can extend/customize MultipleOutputs and pass schema-related settings via properties prefixed with the MO (named output) name, just as is done with the format classes there.
> 
> Also, to send a dummy key or value, why not just use NullWritable? It's efficient, as it does not consume any space.
> 
> Sent from my iPhone
> 
> On Jul 26, 2011, at 5:46 AM, Vyacheslav Zholudev <vy...@gmail.com> wrote:
> 
>> Hi,
>> 
>> I'm using the avro format both for input and output, for a mapper and a reducer. I would like to output multiple avro items with different schemata. For sequence files I would use the MultipleOutputs class from the mapreduce package.
>> 
>> I looked into the same class but from the old package "mapred" and realized that I can pass an AvroOutputFormat.class parameter when adding another output. However, I didn't manage to figure out how to provide an avro schema for each output. Moreover, when writing to an output, I need to provide a key and a value, but in the case of avro we usually just pass a specific avro object. All of the above makes me think that the old MultipleOutputs API wouldn't work with avro files. Am I right?
>> 
>> Any pointers on how to output multiple avro records in the same reducer are appreciated.
>> 
>> P.S. Another thought was to create an avro schema of type union that will contain all possible output schemata, but I would like to avoid that.
>> 
>> Thanks in advance!!!
>> 
>> -- 
>> Best,
>> Vyacheslav


Re: Multiple avro outputs from a reducer

Posted by Vyacheslav Zholudev <vy...@gmail.com>.
Thanks, Jason. I will try that.

Vyacheslav

On 30 July 2011 19:26, Jason <ur...@gmail.com> wrote:

> You can extend/customize MultipleOutputs and pass schema-related settings
> via properties prefixed with the MO (named output) name, just as is done
> with the format classes there.
>
> Also, to send a dummy key or value, why not just use NullWritable? It's
> efficient, as it does not consume any space.
>
> Sent from my iPhone
>
> On Jul 26, 2011, at 5:46 AM, Vyacheslav Zholudev <
> vyacheslav.zholudev@gmail.com> wrote:
>
> > Hi,
> >
> > I'm using the avro format both for input and output, for a mapper and a
> > reducer. I would like to output multiple avro items with different
> > schemata. For sequence files I would use the MultipleOutputs class from
> > the mapreduce package.
> >
> > I looked into the same class but from the old package "mapred" and
> > realized that I can pass an AvroOutputFormat.class parameter when adding
> > another output. However, I didn't manage to figure out how to provide an
> > avro schema for each output. Moreover, when writing to an output, I need
> > to provide a key and a value, but in the case of avro we usually just
> > pass a specific avro object. All of the above makes me think that the old
> > MultipleOutputs API wouldn't work with avro files. Am I right?
> >
> > Any pointers on how to output multiple avro records in the same reducer
> > are appreciated.
> >
> > P.S. Another thought was to create an avro schema of type union that will
> > contain all possible output schemata, but I would like to avoid that.
> >
> > Thanks in advance!!!
> >
> > --
> > Best,
> > Vyacheslav
>



-- 
Best,
Vyacheslav Zholudev

Re: Multiple avro outputs from a reducer

Posted by Jason <ur...@gmail.com>.
You can extend/customize MultipleOutputs and pass schema-related settings via properties prefixed with the MO (named output) name, just as is done with the format classes there.

Also, to send a dummy key or value, why not just use NullWritable? It's efficient, as it does not consume any space.
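
The prefixed-property idea above could look roughly like the following minimal sketch. To keep it runnable without Hadoop or Avro on the classpath, a plain Map<String, String> stands in for the job configuration. The class name NamedOutputSchemas and the ".avro.schema" suffix are hypothetical. The "mo.namedOutput." prefix is modeled on the keys the old MultipleOutputs uses for format classes, and "avro.output.schema" is assumed to be the key the Avro output format reads, so check both against your versions.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch only: a plain Map stands in for Hadoop's JobConf so the example
 * compiles and runs without Hadoop or Avro on the classpath.
 */
final class NamedOutputSchemas {
    // Prefix modeled on the old MultipleOutputs' per-output keys (assumption).
    static final String PREFIX = "mo.namedOutput.";
    // Hypothetical suffix; neither Hadoop nor Avro defines it.
    static final String SCHEMA_SUFFIX = ".avro.schema";
    // Key the Avro output format is assumed to read the output schema from.
    static final String AVRO_OUTPUT_SCHEMA = "avro.output.schema";

    private NamedOutputSchemas() {}

    /** Builds the per-output property name, e.g. mo.namedOutput.users.avro.schema. */
    static String schemaKey(String namedOutput) {
        return PREFIX + namedOutput + SCHEMA_SUFFIX;
    }

    /** Job setup side: store the schema JSON for one named output. */
    static void setSchema(Map<String, String> conf, String namedOutput, String schemaJson) {
        conf.put(schemaKey(namedOutput), schemaJson);
    }

    /**
     * Task side: return a copy of the configuration in which the generic
     * output-schema key holds the schema of the given named output. The
     * original configuration is left untouched.
     */
    static Map<String, String> confFor(Map<String, String> conf, String namedOutput) {
        String schemaJson = conf.get(schemaKey(namedOutput));
        if (schemaJson == null) {
            throw new IllegalArgumentException(
                "No schema registered for named output: " + namedOutput);
        }
        Map<String, String> copy = new HashMap<>(conf);
        copy.put(AVRO_OUTPUT_SCHEMA, schemaJson);
        return copy;
    }
}
```

In a customized MultipleOutputs, the per-output copy would be what gets handed to the output format when the record writer for that named output is created, so each writer sees its own schema. Whether the Avro object goes in the key or the value slot (with NullWritable in the other) depends on what the output format expects.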

Sent from my iPhone

On Jul 26, 2011, at 5:46 AM, Vyacheslav Zholudev <vy...@gmail.com> wrote:

> Hi,
> 
> I'm using the avro format both for input and output, for a mapper and a reducer. I would like to output multiple avro items with different schemata. For sequence files I would use the MultipleOutputs class from the mapreduce package.
> 
> I looked into the same class but from the old package "mapred" and realized that I can pass an AvroOutputFormat.class parameter when adding another output. However, I didn't manage to figure out how to provide an avro schema for each output. Moreover, when writing to an output, I need to provide a key and a value, but in the case of avro we usually just pass a specific avro object. All of the above makes me think that the old MultipleOutputs API wouldn't work with avro files. Am I right?
> 
> Any pointers on how to output multiple avro records in the same reducer are appreciated.
> 
> P.S. Another thought was to create an avro schema of type union that will contain all possible output schemata, but I would like to avoid that.
> 
> Thanks in advance!!!
> 
> -- 
> Best,
> Vyacheslav