
Posted to dev@nutch.apache.org by Ned Rockson <nr...@stanford.edu> on 2007/10/20 01:40:30 UTC

Out of order key while in reduce phase

I'm trying to perform a mapreduce of IntWritable/{URL,CrawlDatum} ->
URL/CrawlDatum, but I want the output to be sorted by the initial
IntWritable and the partitioner to partition by host.  I wrote a
mapreduce with an identity mapper, a partitioner that pulls out the
host from the url, and a reducer that outputs just url, crawldatum.
However, every time I run it, as soon as the reduce phase begins
(Reduce > Reduce) it gives me this error:

java.io.IOException: key out of order: http://web1.incl.ne.jp/ after
http://who2.com/
        at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:169)
        at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:155)
        at org.apache.hadoop.mapred.MapFileOutputFormat$1.write(MapFileOutputFormat.java:56)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:340)
        at org.apache.nutch.crawl.TimeSorter$FinalTimeSortMR.reduce(TimeSorter.java:96)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:355)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1707)


When I checked out the MapFile.Writer.append() method (the one
MapFileOutputFormat calls in the trace above), it says the keys must
be sorted, so I figured a quick change to
job.setOutputFormat(SequenceFileOutputFormat.class) would fix it, but
I still see the exact same error message.  Is this something others
have seen, or would this be a better fit for the hadoop-user mailing list?

Thanks,
Ned

Re: Out of order key while in reduce phase

Posted by Sagar Naik <sa...@visvo.com>.

Hey Ned,

SequenceFile : Support for flat files of binary key/value pairs.
SequenceFileOutputFormat : plain files with name as data-xxxxx
MapFile : A file-based map from keys to values
MapFileOutputFormat : A dir for each MapFile containing a "data" file and
an "index" file.

So my thought is that both files are still representations of a Map.
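
To illustrate the check behind the "key out of order" message, here is a minimal sketch. It is a toy stand-in written for this thread, not the Hadoop MapFile.Writer class; the class name SortedAppendSketch and its methods are hypothetical. It assumes the writer simply compares each appended key with the previous one, which is what the message in the stack trace suggests. Since the reducer input is sorted by the int, the URLs it emits can arrive in any order and trip a check like this.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a sorted-key writer; NOT the Hadoop MapFile.Writer class. */
public class SortedAppendSketch {
    private final List<String> keys = new ArrayList<>();
    private String lastKey = null;

    /** Appends a key, rejecting any key that sorts before the previous one. */
    public void append(String key) {
        if (lastKey != null && key.compareTo(lastKey) < 0) {
            throw new IllegalStateException(
                "key out of order: " + key + " after " + lastKey);
        }
        keys.add(key);
        lastKey = key;
    }

    /** Number of keys accepted so far. */
    public int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        SortedAppendSketch writer = new SortedAppendSketch();
        // The two URLs from the stack trace, in the order the reducer emitted them.
        writer.append("http://who2.com/");
        try {
            writer.append("http://web1.incl.ne.jp/");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```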

As far as map reduce is concerned, I think this solution might work for ya:
Create a new key.
KEY : URL and INT. The compare function should use the INT values,
and the custom partitioner will be able to partition on the basis of the
host of the URL.
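
The idea above can be sketched in plain Java, without the Hadoop classes. Everything named here (CompositeKeySketch, TimeUrlKey, partitionByHost) is hypothetical; in a real job the key would presumably implement Hadoop's WritableComparable and the partitioning logic would live in a Partitioner implementation. The URL tie-breaker in compareTo is an assumption added so that keys with equal ints still have a stable order.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CompositeKeySketch {

    /** Hypothetical composite key: ordered by the int, URL as tie-breaker. */
    static final class TimeUrlKey implements Comparable<TimeUrlKey> {
        final int time;
        final String url;

        TimeUrlKey(int time, String url) {
            this.time = time;
            this.url = url;
        }

        @Override
        public int compareTo(TimeUrlKey other) {
            int byTime = Integer.compare(time, other.time);
            return byTime != 0 ? byTime : url.compareTo(other.url);
        }
    }

    /** Picks a partition from the URL's host, so one host maps to one partition. */
    static int partitionByHost(TimeUrlKey key, int numPartitions) {
        // URI.create throws IllegalArgumentException on malformed URLs.
        String host = URI.create(key.url).getHost();
        int hash = (host == null ? key.url : host).hashCode();
        // Mask the sign bit so the result is never negative.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        List<TimeUrlKey> keys = new ArrayList<>();
        keys.add(new TimeUrlKey(2, "http://web1.incl.ne.jp/"));
        keys.add(new TimeUrlKey(1, "http://who2.com/"));
        Collections.sort(keys); // sorts by the int first
        for (TimeUrlKey k : keys) {
            System.out.println(k.time + " " + k.url
                + " -> partition " + partitionByHost(k, 4));
        }
    }
}
```

With a key like this, the reducer would emit the composite key itself rather than the bare URL, so the emitted keys stay in the order the framework sorted them.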





