Posted to user@avro.apache.org by snikhil0 <sn...@telenav.com> on 2012/06/08 20:49:09 UTC

Avro Map Reduce Question: GenericRecord, renaming reduce output

My problem:
I have an input file in Avro format, but its datums are shuffled (think
ids in mixed order).
I need to sort them by a field from the schema (id) and run a
mux-demux/shuffle-sort.

So my mapper: reads Avro datums (GenericRecord) and outputs key (id) and
value (GenericRecord).

My reducer: for each key (id), gets the list of values and writes just the
GenericRecords to a file (part-r-00000).
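
A toy, in-memory model of the intended flow may make the goal clearer. It is
only a sketch in plain Java with no Hadoop or Avro involved; the Datum class
and its fields are hypothetical stand-ins for the GenericRecord datums:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of map -> shuffle/sort -> reduce; not Hadoop code. */
final class ShuffleSortSketch {

    /** Hypothetical stand-in for one datum: an id plus an opaque payload. */
    static final class Datum {
        final long id;
        final String payload;

        Datum(long id, String payload) {
            this.id = id;
            this.payload = payload;
        }

        @Override
        public String toString() {
            return id + ":" + payload;
        }
    }

    /** "Map" and "shuffle": key each datum by id; a TreeMap keeps the keys sorted. */
    static Map<Long, List<Datum>> shuffle(List<Datum> input) {
        Map<Long, List<Datum>> grouped = new TreeMap<>();
        for (Datum d : input) {
            grouped.computeIfAbsent(d.id, k -> new ArrayList<>()).add(d);
        }
        return grouped;
    }

    /** "Reduce": emit only the values, in key order, dropping the keys. */
    static List<Datum> reduce(Map<Long, List<Datum>> grouped) {
        List<Datum> out = new ArrayList<>();
        for (List<Datum> values : grouped.values()) {
            out.addAll(values);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Datum> shuffled = Arrays.asList(
                new Datum(3, "c"), new Datum(1, "a"), new Datum(2, "b"), new Datum(1, "a2"));
        System.out.println(reduce(shuffle(shuffled)));
    }
}
```

The point of the sketch is that the reducer output contains values only, so
it should be readable with the same schema as the input.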

My expectation is that I can use the same input schema to read the output
file. But alas this is not working. 
In the part-r-00000 I have a 0<tab>Obj<Avroschema>....datums...... Why is
this?

Also, how can I rename the reduce output file to something other than
part-r-0000*?

Some snippets of code:
================
public void map(GenericData.Record datum,
			AvroCollector<Pair<LogKeyWritable, GenericData.Record>> collector,
			Reporter reporter)
			throws IOException
	{
		long tstamp = ((Long) datum.get("timestamp")).longValue();
		// toString() also works if Avro hands back a Utf8 instead of a String
		String keyPath = CollectorUtils.getKeyHour(tstamp,
				datum.get("appid").toString());

		LogKeyWritable key = new LogKeyWritable(keyPath, tstamp);
		Pair<LogKeyWritable, GenericData.Record> pair =
				new Pair<LogKeyWritable, GenericData.Record>(key, datum);
		collector.collect(pair);
	}


public void reduce(LogKeyWritable key, Iterable<GenericData.Record> values,
			AvroCollector<GenericData.Record> collector, Reporter reporter)
			throws IOException
	{

		for (GenericData.Record r : values)
		{
			collector.collect(r);
		}

	}

My job setup:
=========
AvroJob.setInputSchema(jobConf, AVRO_SCHEMA);
AvroJob.setOutputSchema(jobConf, AVRO_SCHEMA);

CAN SOMEONE PLEASE HELP!

Nikhil

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Avro-Map-Reduce-Question-GenericRecord-renaming-reduce-output-tp4025105.html
Sent from the Avro - Users mailing list archive at Nabble.com.

Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by snikhil0 <sn...@telenav.com>.
OK, this looks like a MapReduce API issue:

I went back to the old-style MapReduce API: now I get a good Avro header
but no datums. Sheesh! Can someone please help?

The main function:

	final static Schema IN_SCHEMA = LogshedCollectorUtils.getResourceSchema();
	final static Schema OUT_SCHEMA = LogshedCollectorUtils.getResourceSchema();
	final static ReflectData reflectData = ReflectData.get();
	final static Schema KEY_SCHEMA = reflectData.getSchema(LogKeyWritable.class);
	final static Schema MAP_OUT_SCHEMA = Pair.getPairSchema(KEY_SCHEMA, OUT_SCHEMA);

		Configuration conf = LogshedCollectorUtils.getLocalHadoopConfiguartion();
		JobConf jobConf = new JobConf(LogshedCollectorUtils.getLocalHadoopConfiguartion(),
				MuxDemuxJob.class);
		jobConf.setJobName("muxdemux");
		jobConf.setJarByClass(MuxDemuxJob.class);
		
		jobConf.setInputFormat(AvroInputFormat.class);
		jobConf.setOutputFormat(AvroOutputFormat.class);
		
		AvroJob.setInputSchema(jobConf, IN_SCHEMA);
		AvroJob.setMapOutputSchema(jobConf, MAP_OUT_SCHEMA);
		AvroJob.setOutputSchema(jobConf, OUT_SCHEMA);
		
		AvroJob.setMapperClass(jobConf, LogshedMapper.class);
		AvroJob.setReducerClass(jobConf, LogshedReducer.class);
		
		//Job job = new Job(jobConf, "muxdemux");
		
		FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
		Path outPath = new Path(args[1]);
		FileOutputFormat.setOutputPath(jobConf, outPath);

		JobClient.runJob(jobConf);
		return 0;

Nikhil

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Avro-Map-Reduce-Question-GenericRecord-renaming-reduce-output-tp4025105p4025126.html

Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by "Shirahatti, Nikhil" <sn...@telenav.com>.
Another thing: when I apply the AvroJob settings before job instantiation, I
get no reduce output file at all.

Nikhil
On 6/12/12 10:24 AM, "Shirahatti, Nikhil" <sn...@telenav.com> wrote:

>That's right. The JUnit test did not make any assertions about the output
>file. I've checked it in, so please try again. However, if you try to open
>the file in /logshed you'll probably see what I'm talking about.
>
>I also tried setting AvroJob before job instantiation, but I got the same
>error.
>
>Snippet:
>JobConf jobConf = new JobConf(LogshedCollectorUtils.getLocalHadoopConfiguartion());
>	
>		AvroJob.setInputSchema(jobConf, IN_SCHEMA);
>		AvroJob.setOutputSchema(jobConf, OUT_SCHEMA);
>
>		AvroJob.setMapperClass(jobConf, LogshedMapper.class);
>		AvroJob.setReducerClass(jobConf, LogshedReducer.class);
>		
>		Job job = new Job(jobConf, "muxdemux_job");
>
>		FileInputFormat.setInputPaths(job, new Path(args[0]));
>		Path outPath = new Path(args[1]);
>		FileOutputFormat.setOutputPath(job, outPath);
>		job.setJarByClass(MuxDemuxJob.class);
>
>
>
>Thanks,
>Nikhil
>
>
>On 6/12/12 10:05 AM, "Doug Cutting" <cu...@apache.org> wrote:
>
>>When I do 'git clone https://github.com/snikhil0/avro-mr.git; cd
>>avro-mr; ant test', I see:
>>
>>    [junit] Running
>>com.telenav.logshed.collector.muxdemux.MuxDemuxRunnableTest
>>    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 12.637
>>sec
>>
>>BUILD SUCCESSFUL
>>
>>Finally, Hiral suggested above that your problem is in
>>MuxDemuxJob.java, where you set properties on the JobConf after
>>creating the Job.  The AvroJob methods should instead be called before
>>the Job is constructed.
>>
>>Doug
>


Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by "Shirahatti, Nikhil" <sn...@telenav.com>.
That's right. The JUnit test did not make any assertions about the output
file. I've checked it in, so please try again. However, if you try to open
the file in /logshed you'll probably see what I'm talking about.

I also tried setting AvroJob before job instantiation, but I got the same
error.

Snippet:
JobConf jobConf = new JobConf(LogshedCollectorUtils.getLocalHadoopConfiguartion());
	
		AvroJob.setInputSchema(jobConf, IN_SCHEMA);
		AvroJob.setOutputSchema(jobConf, OUT_SCHEMA);

		AvroJob.setMapperClass(jobConf, LogshedMapper.class);
		AvroJob.setReducerClass(jobConf, LogshedReducer.class);
		
		Job job = new Job(jobConf, "muxdemux_job");

		FileInputFormat.setInputPaths(job, new Path(args[0]));
		Path outPath = new Path(args[1]);
		FileOutputFormat.setOutputPath(job, outPath);
		job.setJarByClass(MuxDemuxJob.class);



Thanks,
Nikhil


On 6/12/12 10:05 AM, "Doug Cutting" <cu...@apache.org> wrote:

>When I do 'git clone https://github.com/snikhil0/avro-mr.git; cd
>avro-mr; ant test', I see:
>
>    [junit] Running
>com.telenav.logshed.collector.muxdemux.MuxDemuxRunnableTest
>    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 12.637 sec
>
>BUILD SUCCESSFUL
>
>Finally, Hiral suggested above that your problem is in
>MuxDemuxJob.java, where you set properties on the JobConf after
>creating the Job.  The AvroJob methods should instead be called before
>the Job is constructed.
>
>Doug


Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by Doug Cutting <cu...@apache.org>.
When I do 'git clone https://github.com/snikhil0/avro-mr.git; cd
avro-mr; ant test', I see:

    [junit] Running com.telenav.logshed.collector.muxdemux.MuxDemuxRunnableTest
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 12.637 sec

BUILD SUCCESSFUL

Finally, Hiral suggested above that your problem is in
MuxDemuxJob.java, where you set properties on the JobConf after
creating the Job.  The AvroJob methods should instead be called before
the Job is constructed.

Doug

Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by "Shirahatti, Nikhil" <sn...@telenav.com>.
Sorry for the delay. I am still having the problem.

I added the Ant build file: run 'ant test' (https://github.com/snikhil0/avro-mr)

It creates the output file under: /logshed/test/<time-based>/part-r-00000

It's not completely kosher code: each time you invoke the test, delete your
previous output (/logshed/test/<time-based>) first.

Thanks,
Nikhil

On 6/8/12 4:05 PM, "Doug Cutting" <cu...@apache.org> wrote:

>There's no Ant or Maven build file.  What command line should one use
>to run the test?
>
>Doug
>
>On Fri, Jun 8, 2012 at 3:46 PM, Shirahatti, Nikhil <sn...@telenav.com>
>wrote:
>> Hello,
>>
>> The code is checked in here: https://github.com/snikhil0/avro-mr
>>
>> The test class is: MuxDemuxRunnableTest
>>
>>
>> Nikhil
>>
>> On 6/8/12 2:22 PM, "Doug Cutting" <cu...@apache.org> wrote:
>>
>>>On Fri, Jun 8, 2012 at 2:04 PM, Shirahatti, Nikhil <sn...@telenav.com>
>>>wrote:
>>>> Whereas the reduce output file: has the 0<tab> before the
>>>
>>>It sounds like something is writing to the file before AvroOutputFormat.
>>>
>>>Can you provide a complete example that illustrates this?  E.g., like
>>>those in the unit tests?
>>>
>>>http://svn.apache.org/repos/asf/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/
>>>
>>>Thanks,
>>>
>>>Doug
>>


Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by Doug Cutting <cu...@apache.org>.
There's no Ant or Maven build file.  What command line should one use
to run the test?

Doug

On Fri, Jun 8, 2012 at 3:46 PM, Shirahatti, Nikhil <sn...@telenav.com> wrote:
> Hello,
>
> The code is checked in here: https://github.com/snikhil0/avro-mr
>
> The test class is: MuxDemuxRunnableTest
>
>
> Nikhil
>
> On 6/8/12 2:22 PM, "Doug Cutting" <cu...@apache.org> wrote:
>
>>On Fri, Jun 8, 2012 at 2:04 PM, Shirahatti, Nikhil <sn...@telenav.com>
>>wrote:
>>> Whereas the reduce output file: has the 0<tab> before the
>>
>>It sounds like something is writing to the file before AvroOutputFormat.
>>
>>Can you provide a complete example that illustrates this?  E.g., like
>>those in the unit tests?
>>
>>http://svn.apache.org/repos/asf/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/
>>
>>Thanks,
>>
>>Doug
>

Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by tazan007 <ta...@gmail.com>.
It looks like the output format isn't being set correctly; the output looks
like it came from TextOutputFormat.

You need to set the properties on the Job, not on the JobConf you created.
When you create the Job and pass in the JobConf, a copy of the JobConf is
made, and the Job uses that copy. So when you set properties on your JobConf
after creating the Job, they are not reflected in the configuration of the
Job, since it made a copy.
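
The copy semantics can be illustrated with a toy model. This sketch uses
java.util.Properties and a made-up ToyJob class in place of Hadoop's JobConf
and Job, and the property names are hypothetical; it only shows why settings
applied after construction are invisible to the job:

```java
import java.util.Properties;

/** Toy model of "the Job copies the configuration it is given"; not Hadoop's classes. */
final class ConfCopySketch {

    /** Hypothetical stand-in for a Job that snapshots its configuration when constructed. */
    static final class ToyJob {
        private final Properties conf = new Properties();

        ToyJob(Properties original) {
            // Defensive copy: later changes to 'original' are not seen by this job.
            conf.putAll(original);
        }

        String get(String key, String defaultValue) {
            return conf.getProperty(key, defaultValue);
        }
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        jobConf.setProperty("output.format", "avro");      // set before construction: visible
        ToyJob job = new ToyJob(jobConf);
        jobConf.setProperty("output.schema", "my-schema"); // set after construction: lost

        System.out.println(job.get("output.format", "text"));    // prints "avro"
        System.out.println(job.get("output.schema", "<unset>")); // prints "<unset>"
    }
}
```

In this model a job that never sees its output settings falls back to its
defaults, which would be consistent with text-style output appearing in the
part file.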

-Hiral

On Fri, Jun 8, 2012 at 3:46 PM, Shirahatti, Nikhil <sn...@telenav.com> wrote:

> Hello,
>
> The code is checked in here: https://github.com/snikhil0/avro-mr
>
> The test class is: MuxDemuxRunnableTest
>
>
> Nikhil
>
> On 6/8/12 2:22 PM, "Doug Cutting" <cu...@apache.org> wrote:
>
> >On Fri, Jun 8, 2012 at 2:04 PM, Shirahatti, Nikhil <sn...@telenav.com>
> >wrote:
> >> Whereas the reduce output file: has the 0<tab> before the
> >
> >It sounds like something is writing to the file before AvroOutputFormat.
> >
> >Can you provide a complete example that illustrates this?  E.g., like
> >those in the unit tests?
> >
> >
> >http://svn.apache.org/repos/asf/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/
> >
> >Thanks,
> >
> >Doug
>
>

Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by "Shirahatti, Nikhil" <sn...@telenav.com>.
Hello,

The code is checked in here: https://github.com/snikhil0/avro-mr

The test class is: MuxDemuxRunnableTest


Nikhil

On 6/8/12 2:22 PM, "Doug Cutting" <cu...@apache.org> wrote:

>On Fri, Jun 8, 2012 at 2:04 PM, Shirahatti, Nikhil <sn...@telenav.com>
>wrote:
>> Whereas the reduce output file: has the 0<tab> before the
>
>It sounds like something is writing to the file before AvroOutputFormat.
>
>Can you provide a complete example that illustrates this?  E.g., like
>those in the unit tests?
>
>http://svn.apache.org/repos/asf/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/
>
>Thanks,
>
>Doug


Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by Doug Cutting <cu...@apache.org>.
On Fri, Jun 8, 2012 at 2:04 PM, Shirahatti, Nikhil <sn...@telenav.com> wrote:
> Whereas the reduce output file: has the 0<tab> before the

It sounds like something is writing to the file before AvroOutputFormat.

Can you provide a complete example that illustrates this?  E.g., like
those in the unit tests?

http://svn.apache.org/repos/asf/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/

Thanks,

Doug

Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by "Shirahatti, Nikhil" <sn...@telenav.com>.
The magic number check is failing: so the top of the file has some junk in
it?

if (!Arrays.equals(DataFileConstants.MAGIC, magic))
      throw new IOException("Not a data file.");



I checked the input file (verified by a read operation), which has the same
schema. It starts with:
Obj^A^B^Vavro.schema<E0>^D

Whereas the reduce output file has 0<tab> before the
Obj^A^B^Vavro.schema<E0>^D:
0       Obj^A^B^Vavro.schema<E0>^D


This was what I did not expect. Maybe my previous email was unclear.
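
For reference, the same check can be reproduced outside Avro with plain Java.
This is a minimal sketch, assuming the four-byte magic is 'O', 'b', 'j'
followed by the byte 1 (which matches the Obj^A seen in the input file); the
class and method names are made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Checks whether a file begins with the Avro container magic; illustrative only. */
final class AvroMagicSketch {

    // Assumed magic bytes: "Obj" followed by version byte 1 (shown as Obj^A above).
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** True if the given leading bytes start with the magic. */
    static boolean hasAvroMagic(byte[] head) {
        return head.length >= MAGIC.length
                && Arrays.equals(MAGIC, Arrays.copyOf(head, MAGIC.length));
    }

    /** Reads up to n bytes from the start of the file. */
    static byte[] readHead(Path file, int n) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[n];
            int off = 0;
            while (off < n) {
                int r = in.read(buf, off, n - off);
                if (r < 0) {
                    break; // file is shorter than n bytes
                }
                off += r;
            }
            return Arrays.copyOf(buf, off);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(hasAvroMagic(readHead(Paths.get(args[0]), MAGIC.length)));
        }
    }
}
```

Applied to the two headers above, the input file passes and the reduce output
fails, because its first bytes are '0' and a tab rather than 'O' and 'b'.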

Thanks,
Nikhil

On 6/8/12 1:35 PM, "Shirahatti, Nikhil" <sn...@telenav.com> wrote:

>The reason is: when I try to read the file using a GenericDatumReader, I get
>the error: Not a data file.
>
>
>Code snippet:
>--------------
>DatumReader<GenericData.Record> reader = new
>GenericDatumReader<Record>(AVRO_SCHEMA);
>
>String MUXDEMUX_FILE = outpath.concat("part-r-00000");
>		InputStream in = new BufferedInputStream(new
>FileInputStream(MUXDEMUX_FILE));
>		DataFileStream<GenericData.Record> records = new
>DataFileStream<GenericData.Record>(in,
>				reader);
>		for (GenericData.Record r : records)
>		{
>			System.out.println(r.toString());
>		}
>
>
>
>Nikhil
>
>On 6/8/12 12:17 PM, "Doug Cutting" <cu...@apache.org> wrote:
>
>>On Fri, Jun 8, 2012 at 11:49 AM, snikhil0 <sn...@telenav.com> wrote:
>>> My expectation is that I can use the same input schema to read the
>>>output
>>> file. But alas this is not working.
>>> In the part-r-00000 I have a 0<tab>Obj<Avroschema>....datums...... Why
>>>is
>>> this?
>>
>>That looks approximately like an Avro data file.  How is it not what you
>>expect?
>>
>>> Also how can rename the reduce output file to something other than
>>> part-r-0000*?
>>
>>That's the standard name for Hadoop mapreduce output files.  You could
>>override it in the OutputFormat, but most folks do not.  The name of
>>the directory these are in is normally used to identify the result
>>set.  The files within the directory are just fragments of that result
>>set.
>>
>>Doug
>


Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by "Shirahatti, Nikhil" <sn...@telenav.com>.
The reason is: when I try to read the file using a GenericDatumReader, I get
the error: Not a data file.


Code snippet:
--------------
DatumReader<GenericData.Record> reader =
		new GenericDatumReader<GenericData.Record>(AVRO_SCHEMA);

// concat() adds no separator, so outpath must already end with one
String MUXDEMUX_FILE = outpath.concat("part-r-00000");
InputStream in = new BufferedInputStream(new FileInputStream(MUXDEMUX_FILE));
DataFileStream<GenericData.Record> records =
		new DataFileStream<GenericData.Record>(in, reader);
for (GenericData.Record r : records)
{
	System.out.println(r.toString());
}
records.close();



Nikhil

On 6/8/12 12:17 PM, "Doug Cutting" <cu...@apache.org> wrote:

>On Fri, Jun 8, 2012 at 11:49 AM, snikhil0 <sn...@telenav.com> wrote:
>> My expectation is that I can use the same input schema to read the
>>output
>> file. But alas this is not working.
>> In the part-r-00000 I have a 0<tab>Obj<Avroschema>....datums...... Why
>>is
>> this?
>
>That looks approximately like an Avro data file.  How is it not what you
>expect?
>
>> Also how can rename the reduce output file to something other than
>> part-r-0000*?
>
>That's the standard name for Hadoop mapreduce output files.  You could
>override it in the OutputFormat, but most folks do not.  The name of
>the directory these are in is normally used to identify the result
>set.  The files within the directory are just fragments of that result
>set.
>
>Doug


Re: Avro Map Reduce Question: GenericRecord, renaming reduce output

Posted by Doug Cutting <cu...@apache.org>.
On Fri, Jun 8, 2012 at 11:49 AM, snikhil0 <sn...@telenav.com> wrote:
> My expectation is that I can use the same input schema to read the output
> file. But alas this is not working.
> In the part-r-00000 I have a 0<tab>Obj<Avroschema>....datums...... Why is
> this?

That looks approximately like an Avro data file.  How is it not what you expect?

> Also how can rename the reduce output file to something other than
> part-r-0000*?

That's the standard name for Hadoop mapreduce output files.  You could
override it in the OutputFormat, but most folks do not.  The name of
the directory these are in is normally used to identify the result
set.  The files within the directory are just fragments of that result
set.
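
If a different name is really wanted, one low-tech option is to rename the
part files after the job finishes rather than overriding the OutputFormat.
A minimal sketch, assuming the output directory is on the local filesystem
(for HDFS, Hadoop's FileSystem API would be needed instead); the class name,
the prefix argument, and the choice to append .avro are all illustrative
assumptions:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Renames part-* files in a local output directory after a job has finished. */
final class RenamePartsSketch {

    // Matches names such as part-00000, part-r-00000, and part-00000.avro.
    private static final Pattern PART = Pattern.compile("part-(?:[mr]-)?(\\d+)(?:\\.avro)?");

    /** Maps a part file name to "<prefix>-<number>.avro"; other names come back unchanged. */
    static String renamed(String fileName, String prefix) {
        Matcher m = PART.matcher(fileName);
        return m.matches() ? prefix + "-" + m.group(1) + ".avro" : fileName;
    }

    /** Renames every matching part file directly under outputDir. */
    static void renameAll(Path outputDir, String prefix) throws IOException {
        // Collect first so the directory is not modified while it is being listed.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path f : files) {
                parts.add(f);
            }
        }
        for (Path f : parts) {
            String oldName = f.getFileName().toString();
            String newName = renamed(oldName, prefix);
            if (!newName.equals(oldName)) {
                Files.move(f, f.resolveSibling(newName));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            renameAll(Paths.get(args[0]), args[1]);
        }
    }
}
```

Keeping the numeric suffix preserves the one-file-per-reducer structure, so
the directory still reads as a set of fragments of one result set.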

Doug