Posted to mapreduce-user@hadoop.apache.org by "Berry, Matt" <mw...@amazon.com> on 2012/07/17 23:05:55 UTC

One output file per key with MultipleOutputs (r0.20.205.0)

I would like to create a hierarchy of output files based on the keys passed to the reducer: the first folder level is the first few digits of the key, the next level is the next few digits, and so on. I had written a very ugly hack that achieved this by passing a FileSystem object into the record writer, but it seems this is exactly the use case the MultipleOutputs API was designed to handle. I began implementing it based on examples I found, but I am getting stuck.
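
For concreteness, with eight-digit keys like 20120717 the tree I am after would look something like this (the keys and split points here are just an example):
----------------
output/
    2012/
        07/
            17/
                part-r-00000
----------------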

In my Tool I have the following:
----------------
  MultipleOutputs.addNamedOutput(job, "namedOutput", SlightlyModifiedTextOutputFormat.class, keyClass, valueClass);
----------------
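
For completeness, that call sits in an otherwise ordinary driver. A trimmed sketch of the surrounding setup (MyTool, MyReducer, and the argument handling are placeholders for my real classes):
----------------
    Job job = new Job(getConf(), "one-file-per-key");
    job.setJarByClass(MyTool.class);                 // placeholder driver class
    job.setReducerClass(MyReducer.class);            // the reducer shown below
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    MultipleOutputs.addNamedOutput(job, "namedOutput",
            SlightlyModifiedTextOutputFormat.class, Key.class, Value.class);
    return job.waitForCompletion(true) ? 0 : 1;
----------------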


In my Reducer I have the following:
----------------
    private MultipleOutputs<Key, Value> mo_context;

    public void setup(Context context) {
        mo_context = new MultipleOutputs<Key, Value>(context);
    }
    
    protected void reduce(Key key, Iterable<Value> values, Context context) 
        throws IOException, InterruptedException {
              
        for(Value value: values) {    
            //context.write(key, value);
            mo_context.write(key, value, key.toString()); // I can change key.toString() to include the folder tree if needed (see the helper sketch below the block)
            context.progress();
        }
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        if (mo_context != null) {
            mo_context.close();
        }
    }
----------------
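
For the folder tree itself, the change to key.toString() I have in mind is a small helper that splits the leading digits of the key into nested directories (the helper name and split points are hypothetical). MultipleOutputs treats '/' in the baseOutputPath as a directory separator, so the write call would become mo_context.write(key, value, basePathFor(key)):
----------------
    // Hypothetical helper: turns "20120717" into "2012/07/17/part",
    // which MultipleOutputs expands into nested output directories.
    private static String basePathFor(Key key) {
        String k = key.toString();
        return k.substring(0, 4) + "/"
             + k.substring(4, 6) + "/"
             + k.substring(6, 8) + "/part";
    }
----------------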

When I run it I receive the following stack trace just as reducing begins:
----------------
java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputName(Lorg/apache/hadoop/mapreduce/JobContext;Ljava/lang/String;)V
        at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:439)
        at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:408)
        at xxxxxx.xxxxxxxxxxx.xxxx.xxxxx.xxxxxxxxxxxxxxxxxReducer.reduce(xxxxxxxxxxxxxxxxxReducer.java:54)
        at xxxxxx.xxxxxxxxxxx.xxxx.xxxxx.xxxxxxxxxxxxxxxxxReducer.reduce(xxxxxxxxxxxxxxxxxReducer.java:27)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
----------------
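
My working theory is a version mismatch: MultipleOutputs was compiled against a FileOutputFormat that has setOutputName, but at runtime an older FileOutputFormat without it is being loaded. To see which jar each class actually comes from, I have been printing their code sources (a diagnostic sketch only):
----------------
    System.out.println(org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.class
            .getProtectionDomain().getCodeSource().getLocation());
    System.out.println(org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.class
            .getProtectionDomain().getCodeSource().getLocation());
----------------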

I must be setting this up incorrectly somehow. Does anyone have a solid example of using MultipleOutputs that shows the job setup, the reduce step, and possibly the output format, for a version around 0.20.205.0?

RE: One output file per key with MultipleOutputs (r0.20.205.0)

Posted by "Berry, Matt" <mw...@amazon.com>.
An additional detail: MultipleOutputs was ported into my version of 0.20.205.0 from 0.23 by someone in my organization (the port was not attributed, so I cannot ask who did it). So please read the references to 0.20.205.0 in my question as 0.23.
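
If it helps to narrow things down: my guess is the port brought over MultipleOutputs but not the matching addition to FileOutputFormat. In 0.23, FileOutputFormat gained a setOutputName helper roughly along these lines (quoted from memory, so treat it as an approximation):
----------------
    // Approximate shape of the 0.23 method that the backported
    // MultipleOutputs.getRecordWriter() calls:
    protected static void setOutputName(JobContext job, String name) {
        job.getConfiguration().set("mapreduce.output.basename", name);
    }
----------------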
