Posted to user@avro.apache.org by ey-chih chow <ey...@hotmail.com> on 2010/08/17 03:14:13 UTC

how to specify MultipleOutputs, MultipleInputs in using Avro mapred API

Hi,
I have a Map/Reduce job that requires multiple inputs and outputs.  One of the inputs will be processed by a mapper and a reducer that are subclasses of AvroMapper and AvroReducer respectively, and the reducer has multiple outputs.  I would appreciate it if anybody could let me know how to configure the job to do this.
Ey-Chih

Re: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API

Posted by Doug Cutting <cu...@apache.org>.

On 08/18/2010 10:49 AM, Harsh J wrote:
>> We hope to add more such tools for such conversion/ingest, e.g.:
>>
>> https://issues.apache.org/jira/browse/AVRO-458
> Offtopic, but is there any work being done on this already? I saw one
> of them tagged with 'GSOC', so wish to know before I sink something
> down.

No.  To my knowledge no one is actively working on these at present.

Doug

Re: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API

Posted by Harsh J <qw...@gmail.com>.

On Wed, Aug 18, 2010 at 11:07 PM, Doug Cutting <cu...@apache.org> wrote:
> On 08/18/2010 10:18 AM, ey-chih chow wrote:
>>
>> Thanks. But done this way, what advantage do we get from
>> Avro?
>
> The Avro MapReduce API is easiest to use when both inputs and outputs are
> Avro data.
>
> If inputs are not Avro data, but you want to use the rest of the Avro MR
> API, then you'd need to write an InputFormat that produces an AvroWrapper<T>
> where T is a type that Avro can serialize.
>
> Another alternative might be to first convert your inputs to be avro data
> files.  For example, one can use Avro's 'fromtext' tool to convert
> line-oriented files into equivalent compressed, splittable, Avro data files.
>  This could be done as log files are loaded into HDFS, since this tool
> accepts Hadoop paths as output.
>
> We hope to add more such tools for such conversion/ingest, e.g.:
>
> https://issues.apache.org/jira/browse/AVRO-458
Offtopic, but is there any work being done on this already? I saw one
of them tagged with 'GSOC', so I'd like to know before I sink any effort
into it.
>
> We also expect that systems like Flume will produce Avro data files.
>
> Doug
>



-- 
Harsh J
www.harshj.com

Re: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API

Posted by Doug Cutting <cu...@apache.org>.

On 08/18/2010 10:18 AM, ey-chih chow wrote:
> Thanks. But done this way, what advantage do we get from Avro?

The Avro MapReduce API is easiest to use when both inputs and outputs 
are Avro data.

If inputs are not Avro data, but you want to use the rest of the Avro MR 
API, then you'd need to write an InputFormat that produces an 
AvroWrapper<T> where T is a type that Avro can serialize.

Another alternative might be to first convert your inputs to be avro 
data files.  For example, one can use Avro's 'fromtext' tool to convert 
line-oriented files into equivalent compressed, splittable, Avro data 
files.  This could be done as log files are loaded into HDFS, since this 
tool accepts Hadoop paths as output.

We hope to add more such tools for such conversion/ingest, e.g.:

https://issues.apache.org/jira/browse/AVRO-458

We also expect that systems like Flume will produce Avro data files.

Doug

Re: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API

Posted by Harsh J <qw...@gmail.com>.

If you're asking about the advantages of using Avro for intermediate data,
this is what I've noticed so far:

Smaller intermediate outputs (Avro's serialization is beautiful).
Compression with its deflate codec isn't difficult at all either.

That, together with raw comparators, helps speed up the intermediate stages.
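
One reason the intermediate output tends to be small is the way Avro's binary encoding writes integers: the Avro specification encodes int and long values with variable-length zig-zag coding, so values of small magnitude take one or two bytes instead of a fixed four or eight. The sketch below is a standalone illustration of that encoding and is not Avro's own code; the class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;

class ZigZagVarintSketch {

    /** Encodes a long as zig-zag followed by a base-128 varint, as the Avro spec describes. */
    static byte[] encodeLong(long n) {
        // Zig-zag maps values of small magnitude to small unsigned values:
        // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Emit 7 bits per byte, low-order group first; a set high bit means more bytes follow.
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] samples = {0, -1, 1, 64, 1_000_000L};
        for (long n : samples) {
            System.out.println(n + " -> " + encodeLong(n).length + " byte(s)");
        }
    }
}
```

A fixed-width encoding would spend eight bytes on each of these longs regardless of magnitude, which adds up quickly across the key/value pairs a job shuffles.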

On Wed, Aug 18, 2010 at 10:48 PM, ey-chih chow <ey...@hotmail.com> wrote:
> Thanks.  But done this way, what advantage do we get from Avro?
> Ey-Chih



-- 
Harsh J
www.harshj.com

RE: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API

Posted by ey-chih chow <ey...@hotmail.com>.

Thanks.  But done this way, what advantage do we get from Avro?
Ey-Chih

> From: qwertymaniac@gmail.com
> Date: Wed, 18 Aug 2010 19:39:17 +0530
> Subject: Re: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API
> To: user@avro.apache.org
> 
> If I got your issue right, all you need to ensure is that both your
> mappers emit the same "type" of keys and values out. This can easily
> be done by implementing a custom Avro Mapper [which reads records from
> avro files, processes them and spews out legal K/V types instead of
> avro datums, such that they match your HBase mapper's collected
> outputs].
> 
> Your reducer shouldn't be bothered about avro/etc then.
> 
> * Note: You may also use avro as intermediate K/V format, but it might
> require some extra work to do so :)
> 
> 
> 
> -- 
> Harsh J
> www.harshj.com

Re: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API

Posted by Harsh J <qw...@gmail.com>.

If I got your issue right, all you need to ensure is that both your
mappers emit the same "type" of keys and values out. This can easily
be done by implementing a custom Avro Mapper [which reads records from
avro files, processes them and spews out legal K/V types instead of
avro datums, such that they match your HBase mapper's collected
outputs].

Your reducer shouldn't be bothered about avro/etc then.

* Note: You may also use avro as intermediate K/V format, but it might
require some extra work to do so :)
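
The idea above, that both mappers emit the same key and value types so the reducer can join them without caring where each record came from, can be sketched without any Hadoop, HBase, or Avro dependency. The following is a minimal in-memory simulation, not the Avro mapred API: the class, the Tagged value type, and the two mapper functions are hypothetical names invented for the illustration, and the tab-separated log line layout is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {

    /** Common intermediate value type: a source tag plus a payload. Both mappers emit it. */
    static final class Tagged {
        final String source;
        final String payload;
        Tagged(String source, String payload) { this.source = source; this.payload = payload; }
    }

    /** One intermediate key/value pair, as a mapper would emit it. */
    static final class Pair {
        final String key;
        final Tagged value;
        Pair(String key, Tagged value) { this.key = key; this.value = value; }
    }

    /** Hypothetical mapper for log records, assumed to be "userId<TAB>url" lines. */
    static Pair mapLogLine(String line) {
        String[] parts = line.split("\t", 2);
        return new Pair(parts[0], new Tagged("log", parts[1]));
    }

    /** Hypothetical mapper for table rows, given as a row key and one column value. */
    static Pair mapTableRow(String rowKey, String name) {
        return new Pair(rowKey, new Tagged("table", name));
    }

    /** Stand-in for the shuffle: group values by key, with keys in sorted order. */
    static Map<String, List<Tagged>> shuffle(List<Pair> pairs) {
        Map<String, List<Tagged>> grouped = new TreeMap<>();
        for (Pair p : pairs) {
            grouped.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
        }
        return grouped;
    }

    /** Reducer: pair every table value with every log value that shares the key. */
    static List<String> reduce(String key, List<Tagged> values) {
        List<String> names = new ArrayList<>();
        List<String> urls = new ArrayList<>();
        for (Tagged v : values) {
            if (v.source.equals("table")) names.add(v.payload);
            else urls.add(v.payload);
        }
        List<String> joined = new ArrayList<>();
        for (String name : names) {
            for (String url : urls) {
                joined.add(key + "," + name + "," + url);
            }
        }
        return joined;
    }

    /** Runs both mappers, the shuffle, and the reducer over small in-memory inputs. */
    static List<String> run(List<String> logLines, Map<String, String> tableRows) {
        List<Pair> emitted = new ArrayList<>();
        for (String line : logLines) emitted.add(mapLogLine(line));
        for (Map.Entry<String, String> row : tableRows.entrySet()) {
            emitted.add(mapTableRow(row.getKey(), row.getValue()));
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle(emitted).entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> table = new TreeMap<>();
        table.put("u1", "alice");
        table.put("u2", "bob");
        List<String> logs = Arrays.asList("u1\t/home", "u2\t/cart", "u1\t/search");
        for (String row : run(logs, table)) System.out.println(row);
    }
}
```

In a real job the framework performs the shuffle, and the source tag is what lets a single reducer tell the HBase rows apart from the log records.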

On Wed, Aug 18, 2010 at 6:45 PM, ey-chih chow <ey...@hotmail.com> wrote:
> Hi,
> Let me rephrase my question to see if anybody is interested in answering it.
>  In the new version, Avro 1.4.0, the class hierarchy of AvroMapper and
> AvroReducer has been changed: they now subclass Configured, rather than
> subclassing MapReduceBase and implementing the Mapper and Reducer
> interfaces respectively.  The configuration of Avro mapred jobs is also
> different from that of other mapred jobs.  Furthermore, text log files have
> to be imported into Avro format for Avro mapred jobs to process them.  If I
> have a mapred job that requires a reducer-side join of two inputs, one from
> HBase and the other from an imported log file in Avro format, how can I
> configure the two mappers to process the inputs from HBase and the log file
> respectively?  Also, how can I configure an Avro reducer to generate
> multiple outputs?  For multiple inputs and outputs, I have some example
> programs from Tom White's Hadoop book, but I simply don't know what changes
> I should make for the Avro case.
> Ey-Chih



-- 
Harsh J
www.harshj.com

RE: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API

Posted by ey-chih chow <ey...@hotmail.com>.

Hi,
Let me rephrase my question to see if anybody is interested in answering it.  In the new version, Avro 1.4.0, the class hierarchy of AvroMapper and AvroReducer has been changed: they now subclass Configured, rather than subclassing MapReduceBase and implementing the Mapper and Reducer interfaces respectively.  The configuration of Avro mapred jobs is also different from that of other mapred jobs.  Furthermore, text log files have to be imported into Avro format for Avro mapred jobs to process them.

If I have a mapred job that requires a reducer-side join of two inputs, one from HBase and the other from an imported log file in Avro format, how can I configure the two mappers to process the inputs from HBase and the log file respectively?  Also, how can I configure an Avro reducer to generate multiple outputs?  For multiple inputs and outputs, I have some example programs from Tom White's Hadoop book, but I simply don't know what changes I should make for the Avro case.
Ey-Chih

From: eychih@hotmail.com
To: user@avro.apache.org
Subject: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API
Date: Mon, 16 Aug 2010 18:22:24 -0700

Hi,
I have a Map/Reduce job that requires multiple inputs and outputs.  One of the inputs will be processed by a mapper and a reducer that are subclasses of AvroMapper and AvroReducer respectively, and the reducer has multiple outputs.  I would appreciate it if anybody could let me know how to configure the job to do this.
Ey-Chih
