Posted to mapreduce-user@hadoop.apache.org by Geoffry Roberts <ge...@gmail.com> on 2011/05/03 00:54:36 UTC
Re: Using MultipleTextOutputFormat for map-only jobs

All,

I read this thread and noticed that the example code cited in it is based on
what I believe is the older, and at one time deprecated,
org.apache.hadoop.mapred.lib.* package.

I am attempting to output to multiple files, but I am using the
org.apache.hadoop.mapreduce.lib.output.*
package. I am not getting good results.

Question: Is this newer package ready for prime time?

I looked at the source code and it appears ok.

I am specifying an output file name in my reduce method, but when I run the
job I get the part-r-0000* file names that hadoop generates.

My reduce method is included below:
  protected void reduce(Text key, Iterable<Text> values, Reducer.Context ctx)
      throws IOException, InterruptedException {
    int k = 0;
    for (Text value : values) {
      k++;
      String[] ss = value.toString().split(F.DELIMITER);
      mos.write(new Text(ss[F.ID]), value, key.toString());
      // I want my output files to have the name of the key value.
    }
  }

Thanks in advance.
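
For reference, and assuming mos above is an
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs: as far as I can tell
from the source, the three-argument write(key, value, baseOutputPath) does not
produce a file named exactly after the key. It appends the task type and a
zero-padded partition number to baseOutputPath, and the job's regular
OutputFormat still creates its own part-r-0000* files next to them. The
plain-Java sketch below only imitates that naming rule so the expected names
are easy to see. The class and method names are made up for illustration and
are not part of Hadoop.

```java
import java.util.Locale;

// Hypothetical helper (not a Hadoop class): imitates how the new-API
// FileOutputFormat appears to build per-task file names, i.e.
// <baseOutputPath>-<m|r>-<five-digit partition>.
final class MultipleOutputsNaming {

    private MultipleOutputsNaming() {
    }

    // baseOutputPath: the third argument passed to mos.write(...)
    // isMapTask:      true for a map task, false for a reduce task
    // partition:      the task's partition number (0, 1, 2, ...)
    static String uniqueFile(String baseOutputPath, boolean isMapTask, int partition) {
        char taskType = isMapTask ? 'm' : 'r';
        return String.format(Locale.ROOT, "%s-%c-%05d", baseOutputPath, taskType, partition);
    }
}
```

So a key of "customer42" written from reducer 0 would be expected to land in
customer42-r-00000 rather than in a file called customer42. If no such files
appear at all, it may be worth checking that mos.close() is called from
cleanup(), as the MultipleOutputs javadoc example does.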

On 14 April 2011 22:11, Hari Sreekumar <hs...@clickable.com> wrote:

> I changed jobConf.setMapOutputKeyClass(Text.class); to
> jobConf.setMapOutputKeyClass(NullWritable.class);
>
> Still no luck..
>
> I also get this error in many mappers:
>
> java.io.IOException: Failed to delete earlier output of task: attempt_201104041514_0069_m_000003_0
> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:110)
> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
> 	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
> 	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
> 	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
> 	at org.apache.hadoop.mapred.Task.done(Task.java:691)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
>
> On Fri, Apr 15, 2011 at 10:37 AM, Hari Sreekumar <hsreekumar@clickable.com> wrote:
>
>> Here's what I tried:
>>
>>   static class MapperClass extends MapReduceBase implements
>>           Mapper<LongWritable, Text, NullWritable, Text> {
>>     @Override
>>     public void map(LongWritable key, Text value,
>>             OutputCollector<NullWritable, Text> output, Reporter reporter)
>>             throws IOException {
>>       output.collect(
>>               NullWritable.get(),
>>               value);
>>     }
>>   }
>>
>>   static class SameFilenameOutputFormat extends
>>           MultipleTextOutputFormat<NullWritable, Text> {
>>
>>     @Override
>>     protected String getInputFileBasedOutputFileName(JobConf job, String name) {
>>       String infilepath = job.get("map.input.file");
>>       System.out.println("File path: " + infilepath);
>>       if (infilepath == null) {
>>         return name;
>>       }
>>       return new Path(infilepath).getName();
>>     }
>>   }
>>
>> And the config I set in the run() method:
>>     JobConf jobConf = new JobConf(conf, this.getClass());
>>
>>     jobConf.setMapperClass(MapperClass.class);
>>     jobConf.setNumReduceTasks(0);
>>     jobConf.setMapOutputKeyClass(Text.class);
>>     jobConf.setMapOutputValueClass(Text.class);
>>     jobConf.setOutputKeyClass(NullWritable.class);
>>     jobConf.setOutputValueClass(Text.class);
>>     jobConf.setOutputFormat(SameFilenameOutputFormat.class);
>>
>> I do get output files with the same names as the input files, but I lose a
>> lot of records. I get this exception and many tasks fail:
>>
>> 2011-04-15 10:23:53,090 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized
>> 2011-04-15 10:23:53,139 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
>> 2011-04-15 10:23:53,171 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
>> 2011-04-15 10:23:53,174 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>> 2011-04-15 10:24:01,829 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201104041514_0068_m_000001_0 is done. And is in the process of commiting
>> 2011-04-15 10:24:04,842 INFO org.apache.hadoop.mapred.TaskRunner: Task attempt_201104041514_0068_m_000001_0 is allowed to commit now
>> 2011-04-15 10:24:05,405 WARN org.apache.hadoop.mapred.TaskRunner: Failure committing: java.io.IOException: Failed to save output of task: attempt_201104041514_0068_m_000001_0
>> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
>> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
>> 	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
>> 	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
>> 	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
>> 	at org.apache.hadoop.mapred.Task.done(Task.java:691)
>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
>> 	at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> 2011-04-15 10:24:11,846 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
>> java.io.IOException: Failed to save output of task: attempt_201104041514_0068_m_000001_0
>> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
>> 	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
>> 	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
>> 	at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
>> 	at org.apache.hadoop.mapred.Task.commit(Task.java:779)
>> 	at org.apache.hadoop.mapred.Task.done(Task.java:691)
>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
>> 	at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> 2011-04-15 10:24:11,863 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
>>
>> I guess it has something to do with partitioning? Maybe the mappers are
>> not simultaneously able to write to the same file or something of that sort?
>>
>> Thanks,
>> Hari
>>
>> On Thu, Apr 14, 2011 at 6:37 PM, Hari Sreekumar <hsreekumar@clickable.com> wrote:
>>
>>> That is exactly what I do when I have a reduce phase, and it works. But
>>> in case of map-only jobs, it doesn't work. I'll try overriding the
>>> getInputFileBasedOutputFileName() method.
>>>
>>>
>>> On Thu, Apr 14, 2011 at 5:19 PM, Harsh J <ha...@cloudera.com> wrote:
>>>
>>>> Hello again Hari,
>>>>
>>>> On Thu, Apr 14, 2011 at 5:10 PM, Hari Sreekumar
>>>> <hs...@clickable.com> wrote:
>>>> > Here is a part of the code I am using:
>>>> >     jobConf.setOutputFormat(MultipleTextOutputFormat.class);
>>>>
>>>> You need to subclass the OutputFormat and use it properly, else the
>>>> abstract class takes over with the default name always used (thus,
>>>> 'part'). You can see a good, complete example at [1].
>>>>
>>>> I'd still recommend using MultipleOutputs for better portability.
>>>> Its javadocs explain how to go about using it well enough [2].
>>>>
>>>> [1] -
>>>> https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat
>>>> [2] -
>>>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>>>
>>
>
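
On the "Failed to save output of task" errors quoted above: one likely cause,
not confirmed anywhere in this thread, is that the output name depends only on
the input file name. Any two map tasks that read splits of the same input file
then try to commit a file with the same name, and only one of them can win.
Below is a minimal plain-Java sketch of a naming rule that keeps the input
file name but stays unique per task. The class and method names are
hypothetical, and it assumes the name argument handed to
getInputFileBasedOutputFileName() is the task's default leaf name (something
like part-00003), which is my reading of the old API rather than something
stated in the thread.

```java
// Hypothetical helper (not a Hadoop class): builds an output name that keeps
// the input file's base name and appends the per-task default name, so two
// tasks reading the same input file never produce the same output name.
final class PerTaskOutputName {

    private PerTaskOutputName() {
    }

    // Returns the last path component, e.g. "a.log" for "hdfs://nn/in/a.log".
    static String baseName(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    // inputPath:   value of "map.input.file"; may be null
    // defaultName: the per-task default leaf name (assumed unique per task)
    static String outputName(String inputPath, String defaultName) {
        if (inputPath == null) {
            return defaultName; // same fallback as the quoted code
        }
        return baseName(inputPath) + "-" + defaultName;
    }
}
```

In the quoted SameFilenameOutputFormat this would correspond to returning
new Path(infilepath).getName() + "-" + name instead of the bare file name. The
cost is one output file per task instead of one per input file.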


-- 
Geoffry Roberts