Posted to mapreduce-user@hadoop.apache.org by Geoffry Roberts <ge...@gmail.com> on 2011/05/03 00:54:36 UTC
Re: Using MultipleTextOutputFormat for map-only jobs
All,
I read this thread and noticed the example code cited in it is based on what
I believe is the older, and at one time deprecated,
org.apache.hadoop.mapred.lib.* package.
I am attempting to output to multiple files, but I am using the
org.apache.hadoop.mapreduce.lib.output.*
package. I am not getting good results.
Question: Is this newer package ready for prime time?
I looked at the source code and it appears ok.
I am specifying an output file name in my reduce method, but when I run the
job I get the part-r-0000* file names that hadoop generates.
My reduce method is included below:
protected void reduce(Text key, Iterable<Text> values, Reducer.Context ctx)
        throws IOException, InterruptedException {
    int k = 0;
    for (Text value : values) {
        k++;
        String[] ss = value.toString().split(F.DELIMITER);
        mos.write(new Text(ss[F.ID]), value, key.toString());
        // I want my output files to have the name of the key value.
    }
}
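
For reference, here is a minimal sketch of the file names one would expect from a call like mos.write(key, value, baseOutputPath) in the newer package. It is plain Java and does not call Hadoop; the class and method names (UniqueFileNames, uniqueFile) are made up for illustration, and the naming pattern is an assumption based on my reading of FileOutputFormat.getUniqueFile in the source.

```java
// Illustrative only: models the "<base>-<m|r>-<5-digit partition>" pattern that
// FileOutputFormat.getUniqueFile appears to use. Hypothetical helper, not a Hadoop API.
final class UniqueFileNames {
    private UniqueFileNames() {}

    /**
     * @param baseName   the base output path passed to mos.write(...), e.g. key.toString()
     * @param reduceTask true for a reduce task, false for a map task
     * @param partition  the task's partition number
     * @return the leaf file name the task would be expected to write
     */
    static String uniqueFile(String baseName, boolean reduceTask, int partition) {
        char kind = reduceTask ? 'r' : 'm';
        // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
        return String.format(java.util.Locale.ROOT, "%s-%c-%05d", baseName, kind, partition);
    }
}
```

So a key of "alice" handled by reduce partition 0 would be expected to land in alice-r-00000, not in a file named just "alice". The job's own output format still creates its part-r-* files, even when they are empty, and the MultipleOutputs instance has to be closed in cleanup() for its files to be flushed. Annotating reduce with @Override is also worth doing: it makes the compiler verify that the method really overrides Reducer.reduce instead of overloading it, in which case the default identity reduce would run and write to part-r-*.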
Thanks in advance.
On 14 April 2011 22:11, Hari Sreekumar <hs...@clickable.com> wrote:
> I changed jobConf.setMapOutputKeyClass(Text.class); to
> jobConf.setMapOutputKeyClass(NullWritable.class);
>
> Still no luck..
>
> I also get this error in many mappers:
>
> java.io.IOException: Failed to delete earlier output of task: attempt_201104041514_0069_m_000003_0
> at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:110)
> at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
> at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
> at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
> at org.apache.hadoop.mapred.Task.commit(Task.java:779)
> at org.apache.hadoop.mapred.Task.done(Task.java:691)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
>
> On Fri, Apr 15, 2011 at 10:37 AM, Hari Sreekumar <hsreekumar@clickable.com
> > wrote:
>
>> Here's what I tried:
>>
>> static class MapperClass extends MapReduceBase implements
>>         Mapper<LongWritable, Text, NullWritable, Text> {
>>     @Override
>>     public void map(LongWritable key, Text value,
>>             OutputCollector<NullWritable, Text> output, Reporter reporter)
>>             throws IOException {
>>         output.collect(NullWritable.get(), value);
>>     }
>> }
>>
>> static class SameFilenameOutputFormat extends
>>         MultipleTextOutputFormat<NullWritable, Text> {
>>
>>     @Override
>>     protected String getInputFileBasedOutputFileName(JobConf job, String name) {
>>         String infilepath = job.get("map.input.file");
>>         System.out.println("File path: " + infilepath);
>>         if (infilepath == null) {
>>             return name;
>>         }
>>         return new Path(infilepath).getName();
>>     }
>> }
>>
>>
>> And the config I set in the run() method:
>> JobConf jobConf = new JobConf(conf, this.getClass());
>>
>> jobConf.setMapperClass(MapperClass.class);
>> jobConf.setNumReduceTasks(0);
>> jobConf.setMapOutputKeyClass(Text.class);
>> jobConf.setMapOutputValueClass(Text.class);
>> jobConf.setOutputKeyClass(NullWritable.class);
>> jobConf.setOutputValueClass(Text.class);
>> jobConf.setOutputFormat(SameFilenameOutputFormat.class);
>>
>> I do get output files with same names as input files, but I lose a lot of
>> records. I get this exception and many tasks fail:
>>
>> 2011-04-15 10:23:53,090 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=MAP, sessionId= - already initialized
>> 2011-04-15 10:23:53,139 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
>> 2011-04-15 10:23:53,171 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
>> 2011-04-15 10:23:53,174 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>> 2011-04-15 10:24:01,829 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201104041514_0068_m_000001_0 is done. And is in the process of commiting
>> 2011-04-15 10:24:04,842 INFO org.apache.hadoop.mapred.TaskRunner: Task attempt_201104041514_0068_m_000001_0 is allowed to commit now
>> 2011-04-15 10:24:05,405 WARN org.apache.hadoop.mapred.TaskRunner: Failure committing: java.io.IOException: Failed to save output of task: attempt_201104041514_0068_m_000001_0
>> at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
>> at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
>> at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
>> at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
>> at org.apache.hadoop.mapred.Task.commit(Task.java:779)
>> at org.apache.hadoop.mapred.Task.done(Task.java:691)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
>> at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> 2011-04-15 10:24:11,846 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
>> java.io.IOException: Failed to save output of task: attempt_201104041514_0068_m_000001_0
>> at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:114)
>> at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:126)
>> at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:86)
>> at org.apache.hadoop.mapred.OutputCommitter.commitTask(OutputCommitter.java:171)
>> at org.apache.hadoop.mapred.Task.commit(Task.java:779)
>> at org.apache.hadoop.mapred.Task.done(Task.java:691)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
>> at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> 2011-04-15 10:24:11,863 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
>>
>> I guess it has something to do with partitioning? Maybe the mappers are
>> not simultaneously able to write to the same file or something of that sort?
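
That guess fits the traces: if one input file is split across several map tasks, each task returns the same leaf name, so the commits collide in the output directory and later commits can replace earlier ones. A minimal sketch of one way around it, assuming the `name` argument is the task-unique default (something like part-00003): keep it as a suffix. This is plain Java with no Hadoop calls, and InputBasedNames and outputName are hypothetical names used only for illustration.

```java
// Illustrative only: builds an output name that is unique per task by combining
// the input file's leaf name with the task's default file name.
final class InputBasedNames {
    private InputBasedNames() {}

    /**
     * @param inputFilePath value of "map.input.file", or null if it is not set
     * @param taskFileName  the default per-task name (assumed to look like "part-00003")
     * @return "<input leaf>-<taskFileName>", or taskFileName when no leaf is available
     */
    static String outputName(String inputFilePath, String taskFileName) {
        if (inputFilePath == null || inputFilePath.isEmpty()) {
            return taskFileName;
        }
        // Take everything after the last '/' so hdfs:// URIs and plain paths both work.
        String leaf = inputFilePath.substring(inputFilePath.lastIndexOf('/') + 1);
        if (leaf.isEmpty()) {
            return taskFileName;
        }
        return leaf + "-" + taskFileName;
    }
}
```

With this, two splits of events.log would write events.log-part-00000 and events.log-part-00001 instead of both writing events.log. The cost is that one input file can map to several output files, which then need to be concatenated if a single file per input is required.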
>>
>> Thanks,
>> Hari
>>
>> On Thu, Apr 14, 2011 at 6:37 PM, Hari Sreekumar <hsreekumar@clickable.com
>> > wrote:
>>
>>> That is exactly what I do when I have a reduce phase, and it works. But
>>> in case of map-only jobs, it doesn't work. I'll try overriding the
>>> getInputFileBasedOutputFileName() method.
>>>
>>>
>>> On Thu, Apr 14, 2011 at 5:19 PM, Harsh J <ha...@cloudera.com> wrote:
>>>
>>>> Hello again Hari,
>>>>
>>>> On Thu, Apr 14, 2011 at 5:10 PM, Hari Sreekumar
>>>> <hs...@clickable.com> wrote:
>>>> > Here is a part of the code I am using:
>>>> > jobConf.setOutputFormat(MultipleTextOutputFormat.class);
>>>>
>>>> You need to subclass the OF and use it properly, else the abstract
>>>> class takes over with the default name always used (Thus, 'part'). You
>>>> can see a good, complete example at [1].
>>>>
>>>> I'd still recommend using MultipleOutputs for better portability.
>>>> Its javadocs explain how to go about using it well enough
>>>> [2].
>>>>
>>>> [1] -
>>>> https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat
>>>> [2] -
>>>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>>>
>>
>
--
Geoffry Roberts