Posted to mapreduce-user@hadoop.apache.org by Hari Sreekumar <hs...@clickable.com> on 2011/04/14 07:39:27 UTC

Using MultipleTextOutputFormat for map-only jobs

Hi,

I have a map-only MapReduce job where I want to derive the output filename
from the output key/value. I figured MultipleTextOutputFormat
(http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/MultipleTextOutputFormat.html)
is the best fit for my purpose, but I am unable to use it in map-only jobs.
It works if I add a reduce phase, but with map-only jobs the output is
written to the usual part-0000xx files. Also, is there no support for this
output format in v0.20.2? I mean, is it necessary to use the deprecated
classes if I want to use this?

Thanks,
Hari

Re: Using MultipleTextOutputFormat for map-only jobs

Posted by Geoffry Roberts <ge...@gmail.com>.
All,

I read this thread and noticed that the example code cited in it is based on
what I believe is the older, at one time deprecated,
org.apache.hadoop.mapred.lib.* package.

I am attempting to output to multiple files using the newer
org.apache.hadoop.mapreduce.lib.output.* package, but I am not getting the
results I expect.

Question: Is this newer package ready for prime time?

I looked at the source code and it appears to be OK.

I am specifying an output file name in my reduce method, but when I run the
job I get the part-r-0000* file names that Hadoop generates.

My reduce method is included below:
  // 'mos' is assumed to be a MultipleOutputs<Text, Text> field, created in
  // setup() and closed in cleanup().
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    int k = 0;
    for (Text value : values) {
      k++;
      String[] ss = value.toString().split(F.DELIMITER);
      // I want my output files to have the name of the key value.
      mos.write(new Text(ss[F.ID]), value, key.toString());
    }
  }
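For comparison, here is a minimal sketch of how the new-API MultipleOutputs
is usually wired up. This is an assumption-laden illustration, not code from
this thread: the class and field names are made up, and it assumes the 0.21
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs, whose
write(key, value, baseOutputPath) overload takes a base path relative to the
job's output directory.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Illustrative name; routes each record to a file named after its key.
public class KeyNamedReducer extends Reducer<Text, Text, Text, Text> {

  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context ctx) {
    // One MultipleOutputs instance per task attempt.
    mos = new MultipleOutputs<Text, Text>(ctx);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // Files come out as <key>-r-00000 under the job output directory.
      mos.write(key, value, key.toString());
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // Without close() the side streams are never flushed and records are lost.
    mos.close();
  }
}

If empty part-r-0000* files still show up alongside the named files, wrapping
the real output format with LazyOutputFormat (where the version in use
provides it) avoids creating them:

  LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);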

Thanks in advance.


-- 
Geoffry Roberts

Re: Using MultipleTextOutputFormat for map-only jobs

Posted by Hari Sreekumar <hs...@clickable.com>.
I changed jobConf.setMapOutputKeyClass(Text.class); to
jobConf.setMapOutputKeyClass(NullWritable.class);

Still no luck.

I also get this error in many mappers:

java.io.IOException: Failed to delete earlier output of task: attempt_201104041514_0069_m_000003_0
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:110)
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
	at org.apache.hadoop.mapred.Task.done(Task.java:691)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)



Re: Using MultipleTextOutputFormat for map-only jobs

Posted by Hari Sreekumar <hs...@clickable.com>.
Here's what I tried:

  static class MapperClass extends MapReduceBase implements
          Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
      output.collect(
              NullWritable.get(),
              value);
    }
  }

  static class SameFilenameOutputFormat extends
          MultipleTextOutputFormat<NullWritable, Text> {

    // Name each output file after the input file that produced it.
    @Override
    protected String getInputFileBasedOutputFileName(JobConf job, String name) {
      String infilepath = job.get("map.input.file");
      System.out.println("File path: " + infilepath);
      if (infilepath == null) {
        return name;
      }
      return new Path(infilepath).getName();
    }
  }


And the config I set in the run() method:
    JobConf jobConf = new JobConf(conf, this.getClass());

    jobConf.setMapperClass(MapperClass.class);
    jobConf.setNumReduceTasks(0);
    jobConf.setMapOutputKeyClass(Text.class);
    jobConf.setMapOutputValueClass(Text.class);
    jobConf.setOutputKeyClass(NullWritable.class);
    jobConf.setOutputValueClass(Text.class);
    jobConf.setOutputFormat(SameFilenameOutputFormat.class);

I do get output files with the same names as the input files, but I lose a
lot of records. I get this exception and many tasks fail:

2011-04-15 10:23:53,090 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized
2011-04-15 10:23:53,139 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2011-04-15 10:23:53,171 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2011-04-15 10:23:53,174 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2011-04-15 10:24:01,829 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201104041514_0068_m_000001_0 is done. And is in the process of commiting
2011-04-15 10:24:04,842 INFO org.apache.hadoop.mapred.TaskRunner: Task attempt_201104041514_0068_m_000001_0 is allowed to commit now
2011-04-15 10:24:05,405 WARN org.apache.hadoop.mapred.TaskRunner: Failure committing: java.io.IOException: Failed to save output of task: attempt_201104041514_0068_m_000001_0
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
	at org.apache.hadoop.mapred.Task.done(Task.java:691)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

2011-04-15 10:24:11,846 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Failed to save output of task: attempt_201104041514_0068_m_000001_0
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
	at org.apache.hadoop.mapred.Task.done(Task.java:691)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-04-15 10:24:11,863 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

I guess it has something to do with partitioning? Since the output file is
named after the input file, maybe two mappers that read different splits of
the same input file both try to commit a file with the same name, or the
mappers are otherwise unable to write to the same file simultaneously?

Thanks,
Hari


Re: Using MultipleTextOutputFormat for map-only jobs

Posted by Hari Sreekumar <hs...@clickable.com>.
That is exactly what I do when I have a reduce phase, and it works. But in
the case of map-only jobs, it doesn't. I'll try overriding the
getInputFileBasedOutputFileName() method.


Re: Using MultipleTextOutputFormat for map-only jobs

Posted by Harsh J <ha...@cloudera.com>.
Hello again Hari,

On Thu, Apr 14, 2011 at 5:10 PM, Hari Sreekumar
<hs...@clickable.com> wrote:
> Here is a part of the code I am using:
>     jobConf.setOutputFormat(MultipleTextOutputFormat.class);

You need to subclass the output format and override its naming behavior;
otherwise the abstract class's defaults take over and the standard name
('part') is always used. You can see a good, complete example at [1].

I'd still recommend using MultipleOutputs, for portability reasons. Its
javadocs explain well enough how to go about using it [2].

[1] - https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat
[2] - http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
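For illustration, a minimal subclass along those lines. This is a sketch
against the old org.apache.hadoop.mapred API; the class name
KeyNamedTextOutputFormat is made up, and it assumes Text keys and values:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Route each record to a file named after its key.
public class KeyNamedTextOutputFormat
    extends MultipleTextOutputFormat<Text, Text> {

  @Override
  protected String generateFileNameForKeyValue(Text key, Text value,
      String name) {
    // 'name' is the default leaf name (e.g. "part-00000"); returning
    // something else here is what produces the per-key files.
    return key.toString();
  }
}

Hooked up in the driver with:

  jobConf.setOutputFormat(KeyNamedTextOutputFormat.class);

One caveat: if two different tasks generate the same file name, their outputs
collide when the committer moves them into the job output directory, which
would be consistent with the "Failed to save output of task" errors quoted
earlier in this thread.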

-- 
Harsh J

Re: Using MultipleTextOutputFormat for map-only jobs

Posted by Hari Sreekumar <hs...@clickable.com>.
Here is a part of the code I am using:

static class MapperClass extends MapReduceBase implements
          Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
      output.collect(
              NullWritable.get(),
              value);
    }
  }

...
...
@Override
  public int run(String[] args) throws Exception {

    Configuration conf = new Configuration();

    Path[] inputPaths = new Path[args.length - 1];
    for (int i = 0; i < args.length - 1; ++i) {
      inputPaths[i] = new Path(args[i]);
    }

    String outputPath = args[args.length - 1].trim();

    JobConf jobConf = new JobConf(conf, this.getClass());

    jobConf.setMapperClass(MapperClass.class);
    jobConf.setNumReduceTasks(0);
    jobConf.setMapOutputKeyClass(NullWritable.class);
    jobConf.setMapOutputValueClass(Text.class);
    jobConf.setOutputKeyClass(NullWritable.class);
    jobConf.setOutputValueClass(Text.class);
    jobConf.setOutputFormat(MultipleTextOutputFormat.class);
    jobConf.setBoolean(
            "mapred.output.compress",
            true);
    jobConf.setClass(
            "mapred.output.compression.codec",
            GzipCodec.class,
            CompressionCodec.class);
    FileInputFormat.setInputPaths(
            jobConf,
            inputPaths);
    FileOutputFormat.setOutputPath(
            jobConf,
            new Path(outputPath));

    JobClient.runJob(jobConf);
    return 0;
  }
  public static void main(String[] args) throws Exception {
    int returnValue = ToolRunner.run(
            new MapReduceClass(),
            args);
    System.exit(returnValue);
  }

Thanks,
Hari


Re: Using MultipleTextOutputFormat for map-only jobs

Posted by Harsh J <ha...@cloudera.com>.
Hello Hari,

On Thu, Apr 14, 2011 at 11:09 AM, Hari Sreekumar
<hs...@clickable.com> wrote:
> Hi,
> I have a map-only mapreduce job where I want to deduce the output filename
> from the output key/value. I figured MultipleTextOutputFormat is the best
> fit for my purpose. But I am unable to use it in map-only jobs. I was able
> to run it if I add a reduce phase. But when I use map-only jobs, the file
> gets written to the usual part-0000xx files. Also, is there no support for
> this output format in v0.20.2? I mean, is it necessary to use the deprecated
> classes if I want to use this?
> Thanks,
> Hari

The MultipleOutputFormat class is not available for the new, unstable API;
its functionality has been replaced by the MultipleOutputs class, which works
very similarly. However, the new-API MultipleOutputs is not part of the
Apache Hadoop 0.20.2 release either [1].

Using the stable API is still recommended (it is no longer marked deprecated
in 0.20.3, and 0.21 also supports the old API).

That said, MultipleOutputFormat should still work for map-only jobs, as
described in two of its use cases [2]. Could you give us some details of your
code setup for using this?

[1] - It is available as part of 0.21.0, though, or in Cloudera's
Distribution including Apache Hadoop 0.20.2.
[2] - http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
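For reference, a minimal sketch of the old-API (org.apache.hadoop.mapred.lib)
MultipleOutputs usage in a map-only job. The names here ("textOut",
MultiOutMapper) are illustrative, not from this thread:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// In the driver, before submitting the job:
//   MultipleOutputs.addNamedOutput(jobConf, "textOut",
//       TextOutputFormat.class, Text.class, Text.class);

public class MultiOutMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf job) {
    mos = new MultipleOutputs(job);
  }

  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Records sent through the named collector land in textOut-m-*
    // files instead of the default part-* files.
    mos.getCollector("textOut", reporter).collect(new Text("line"), value);
  }

  @Override
  public void close() throws IOException {
    // Flushes and closes all the named output streams.
    mos.close();
  }
}

Note that MultipleOutputs builds file names from the named output (plus a
runtime part if addMultiNamedOutput is used), rather than computing them
freely from the key/value the way MultipleOutputFormat does.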

-- 
Harsh J