Posted to common-user@hadoop.apache.org by steven zhuang <zh...@gmail.com> on 2010/01/05 08:39:06 UTC

some questions about the InputSampler key types mismatch.

Hi there,

I am trying to make the word-count example produce totally ordered
output. After specifying the InputSampler and TotalOrderPartitioner in the
main function, I always get this IOException:

    "main" java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.Text
        at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1112)
        at org.apache.hadoop.mapred.lib.InputSampler.writePartitionFile(InputSampler.java:338)

After checking the source code, I found that InputSampler.writePartitionFile
samples keys from the job's InputFormat (in my code,
o.a.h.mapred.TextInputFormat), but creates the partition file with the map
output key class as its key type. That explains the key type mismatch:
<K, V> for TextInputFormat is <LongWritable, Text>, while the map output is
<Text, IntWritable>.

    final InputFormat<K,V> inf = (InputFormat<K,V>) job.getInputFormat();
    int numPartitions = job.getNumReduceTasks();
    K[] samples = sampler.getSample(inf, job);
    ......
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, job, dst,
        job.getMapOutputKeyClass(), NullWritable.class);
    NullWritable nullValue = NullWritable.get();
    ......
    writer.append(samples[k], nullValue);
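
The mismatch can be reproduced without Hadoop. The sketch below is only an
illustration: CheckedWriter and tryAppend are hypothetical stand-ins written
for this example, not Hadoop classes. Here java.lang.Long plays the role of
LongWritable (the sampled input key) and java.lang.String plays the role of
Text (the declared map output key class).

```java
import java.io.IOException;

// Hypothetical stand-in for a keyed file writer; NOT Hadoop's SequenceFile.Writer.
public class KeyClassCheckSketch {

    /** Minimal writer that remembers the key class it was created with. */
    public static final class CheckedWriter {
        private final Class<?> keyClass;
        private int appended = 0;

        public CheckedWriter(Class<?> keyClass) {
            this.keyClass = keyClass;
        }

        /** Rejects any key whose runtime class differs from the declared key class. */
        public void append(Object key) throws IOException {
            if (key.getClass() != keyClass) {
                throw new IOException("wrong key class: " + key.getClass().getName()
                        + " is not " + keyClass);
            }
            appended++;
        }

        public int count() {
            return appended;
        }
    }

    /** Returns null if the append succeeded, otherwise the exception message. */
    public static String tryAppend(CheckedWriter writer, Object key) {
        try {
            writer.append(key);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Writer declared with the "map output" key class (String stands in for Text).
        CheckedWriter writer = new CheckedWriter(String.class);
        // A sampled "input" key of a different class (Long stands in for LongWritable).
        System.out.println(tryAppend(writer, Long.valueOf(0L)));
        // A key of the declared class is accepted.
        System.out.println(tryAppend(writer, "hadoop"));
    }
}
```

The first call reports a wrong key class and the second succeeds, which is
the same situation as sampling LongWritable keys into a partition file
declared with Text keys.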

I think it would be more reasonable for the sampler to sample the mapper's
output rather than its input. Either way, writePartitionFile should make
sure that the class of the sampled keys matches the key class it declares
for the partition file.
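
To illustrate what sampling the mapper's output could look like, here is a
small Hadoop-free sketch. All names in it (mapOutputKeys, splitPoints,
partitionFor) are hypothetical, and the split-point selection is a
simplified scheme assumed for this example, not the exact logic of
InputSampler. It assumes a non-empty sample and at least one partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: sample the keys the mapper would emit, then pick split points.
public class SplitPointSketch {

    /** Word-count style map step: the output keys are the words of each line. */
    public static List<String> mapOutputKeys(List<String> lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    keys.add(word);
                }
            }
        }
        return keys;
    }

    /** Picks numPartitions - 1 split points at evenly spaced positions of the sorted samples. */
    public static String[] splitPoints(List<String> samples, int numPartitions) {
        String[] sorted = samples.toArray(new String[0]);
        Arrays.sort(sorted);
        String[] splits = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return splits;
    }

    /** Partition index for a key: the number of split points less than or equal to it. */
    public static int partitionFor(String key, String[] splits) {
        int p = 0;
        while (p < splits.length && key.compareTo(splits[p]) >= 0) {
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        List<String> keys = mapOutputKeys(Arrays.asList("b a", "d c", "f e"));
        String[] splits = splitPoints(keys, 3);
        System.out.println(Arrays.toString(splits));
        System.out.println(partitionFor("d", splits));
    }
}
```

Because the samples and the split points are both of the map output key
type, a partition file written from them would carry keys of the same class
it declares.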

Has anybody successfully done a total order sort?

-- 
       best wishes.
                            steven