Posted to common-user@hadoop.apache.org by Andy Sautins <an...@returnpath.net> on 2009/10/01 18:10:53 UTC

Map/Reduce and sequence file metadata...

   Hi all. I'm struggling a bit to figure this out and wondering if anyone has any pointers.

   I'm using SequenceFiles as output from a MapReduce job (using SequenceFileOutputFormat) and then, in a follow-up MapReduce job, reading the results back in using SequenceFileInputFormat.  All seems to work fine.  What I haven't figured out is how to write the SequenceFile.Metadata in the SequenceFileOutputFormat and then read the metadata in SequenceFileInputFormat.  Is that possible to do using the new mapreduce.* API?

   I have two types of files I want to process in the Mapper.  Currently I'm using context.getInputSplit() and parsing the resulting fileSplit.getPath() to determine which file I'm processing.  It seems cleaner to use the SequenceFile.Metadata if I can.  Does that make sense or am I off in the weeds?

   Thanks

   Andy

RE: Map/Reduce and sequence file metadata...

Posted by Andy Sautins <an...@returnpath.net>.
  Thanks for the response Tom.  I'll probably try the approach of extending SequenceFileOutputFormat to write sequence file metadata.

  What I am getting from your response is that using the sequence file metadata doesn't seem that common, especially for sequence files generated as map/reduce output.  Sounds like using MultipleInputs and having the files in different locations is a more common way of addressing having different file types fed into the same job.  Does that sound right?

   Thanks again for the insight.

-----Original Message-----
From: Tom White [mailto:tom@cloudera.com] 
Sent: Friday, October 02, 2009 3:26 AM
To: common-user@hadoop.apache.org
Cc: core-user@hadoop.apache.org
Subject: Re: Map/Reduce and sequence file metadata...

On Thu, Oct 1, 2009 at 5:10 PM, Andy Sautins
<an...@returnpath.net> wrote:
>
>   Hi all. I'm struggling a bit to figure this out and wondering if anyone had any  pointers.
>
>   I'm using SequenceFiles as output from a MapReduce job ( using SequenceFileOutputFormat ) and then in a followup MapReduce job reading in the results using SequenceFileInputFormat.  All seems to work fine.  What I haven't figured out is how to write the SequenceFile.Metadata in the SequenceFileOutputFormat and then read the metadata in SequenceFileInputFormat.  Is that possible to do using the new mapreduce.* API?

By default no SequenceFile metadata is written by
SequenceFileOutputFormat. SequenceFile metadata is written at the
beginning of the file, so it needs to be passed in when the
SequenceFile is opened. One way of doing this would be to extend
SequenceFileOutputFormat and override the getSequenceWriter() method
to call the SequenceFile.createWriter() factory method that takes
metadata.
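[Editor's note: Tom's suggestion could be sketched roughly as below.  This is an illustrative sketch only, not tested code: the class name MetadataSequenceFileOutputFormat and the "source.type" key are made up, compression handling is omitted for brevity, and whether getSequenceWriter() is the protected hook to override depends on the Hadoop version (in older versions you may need to override getRecordWriter() instead and build the writer there).]

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical subclass that stamps metadata into the SequenceFile header.
public class MetadataSequenceFileOutputFormat<K, V>
    extends SequenceFileOutputFormat<K, V> {

  @Override
  protected SequenceFile.Writer getSequenceWriter(TaskAttemptContext context,
      Class<?> keyClass, Class<?> valueClass) throws IOException {
    Configuration conf = context.getConfiguration();
    Path file = getDefaultWorkFile(context, "");
    FileSystem fs = file.getFileSystem(conf);

    // Metadata must be supplied when the file is opened, because it is
    // written into the header at the beginning of the file.
    SequenceFile.Metadata metadata = new SequenceFile.Metadata();
    metadata.set(new Text("source.type"),
                 new Text(conf.get("source.type", "")));

    // Use the createWriter() overload that accepts Metadata.
    // Compression is left as NONE here to keep the sketch short.
    return SequenceFile.createWriter(fs, conf, file, keyClass, valueClass,
        SequenceFile.CompressionType.NONE, null, context, metadata);
  }
}
```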

>
>   I have two types of files I want to process in the Mapper.  Currently I'm using the  context.getInputSplit() and parsing the resulting fileSplit.getPath() to determine what file I'm processing.  It seems cleaner to use the SequenceFile.Metadata if I can.  Does that make sense or am I off in the weeds?

Another approach would be to use MultipleInputs which allows you to
use different mappers for different input paths. Could this help?
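[Editor's note: a rough sketch of the MultipleInputs idea follows.  The paths and the mapper classes (TypeAMapper, TypeBMapper) are hypothetical, and depending on the Hadoop version MultipleInputs lives in org.apache.hadoop.mapred.lib (old API) or org.apache.hadoop.mapreduce.lib.input (new API).]

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class MultipleInputsSetup {

  // Placeholder mappers, one per file type; key/value types are assumed.
  public static class TypeAMapper extends Mapper<Text, Text, Text, Text> { }
  public static class TypeBMapper extends Mapper<Text, Text, Text, Text> { }

  static void configure(Job job) {
    // Each input path gets its own mapper, so the mapper "knows" its file
    // type without inspecting the input split's path.
    MultipleInputs.addInputPath(job, new Path("/data/typeA"),
        SequenceFileInputFormat.class, TypeAMapper.class);
    MultipleInputs.addInputPath(job, new Path("/data/typeB"),
        SequenceFileInputFormat.class, TypeBMapper.class);
  }
}
```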

>
>   Thanks
>
>   Andy
>

Re: Map/Reduce and sequence file metadata...

Posted by Tom White <to...@cloudera.com>.
On Thu, Oct 1, 2009 at 5:10 PM, Andy Sautins
<an...@returnpath.net> wrote:
>
>   Hi all. I'm struggling a bit to figure this out and wondering if anyone had any  pointers.
>
>   I'm using SequenceFiles as output from a MapReduce job ( using SequenceFileOutputFormat ) and then in a followup MapReduce job reading in the results using SequenceFileInputFormat.  All seems to work fine.  What I haven't figured out is how to write the SequenceFile.Metadata in the SequenceFileOutputFormat and then read the metadata in SequenceFileInputFormat.  Is that possible to do using the new mapreduce.* API?

By default no SequenceFile metadata is written by
SequenceFileOutputFormat. SequenceFile metadata is written at the
beginning of the file, so it needs to be passed in when the
SequenceFile is opened. One way of doing this would be to extend
SequenceFileOutputFormat and override the getSequenceWriter() method
to call the SequenceFile.createWriter() factory method that takes
metadata.
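[Editor's note: on the reading side, SequenceFileInputFormat does not surface the metadata itself, but a mapper can open the split's file directly and call getMetadata() on a SequenceFile.Reader.  The sketch below is illustrative only; the "source.type" key is a made-up example and the key/value types are assumed.]

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MetadataAwareMapper<K, V> extends Mapper<K, V, K, V> {

  private Text sourceType;

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // The split tells us which file we are reading; the metadata lives in
    // that file's header, so opening a Reader just to fetch it is cheap.
    Path file = ((FileSplit) context.getInputSplit()).getPath();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(file.getFileSystem(conf), file, conf);
    try {
      sourceType = reader.getMetadata().get(new Text("source.type"));
    } finally {
      reader.close();
    }
  }
}
```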

>
>   I have two types of files I want to process in the Mapper.  Currently I'm using the  context.getInputSplit() and parsing the resulting fileSplit.getPath() to determine what file I'm processing.  It seems cleaner to use the SequenceFile.Metadata if I can.  Does that make sense or am I off in the weeds?

Another approach would be to use MultipleInputs which allows you to
use different mappers for different input paths. Could this help?

>
>   Thanks
>
>   Andy
>