Posted to dev@nutch.apache.org by Hamza Kaya <ha...@gmail.com> on 2006/05/24 11:13:29 UTC

Fetcher and MapReduce

Hi,

I'm trying to crawl approximately 500,000 URLs. After inject and generate I
started the fetchers with 6 map tasks and 3 reduce tasks. All the map tasks
completed successfully, while all the reduce tasks got an OutOfMemory
exception. The exception was thrown after the append phase, during the sort
phase. As far as I can tell, during a fetch each map task writes its output
to a temporary sequence file. During the reduce phase, each reducer copies
all of the map outputs to its local disk and appends them to a single
sequence file. The reducer then sorts this file, writes the sorted file to
its local disk, and finally opens a record writer to write the sorted data
into the segment, which lives in DFS. If this scenario is correct, then all
the reduce tasks do the same job: each tries to sort the whole map output,
and the winner of that race gets to write to DFS, so only one reducer
actually ends up writing to DFS. If that is the case, an OutOfMemory
exception is not surprising for 500,000+ URLs, since every reducer would be
trying to sort a file bigger than 1 GB. Any comments on this scenario are
welcome. And how can I avoid these exceptions? Thanks,

--
Hamza KAYA
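
The detail that matters here is how the map output is divided among the
reducers: each reduce task only receives the keys that hash into its
partition, so the full map output is not copied to every reducer. Below is a
minimal, plain-Java sketch of that assignment. It mirrors, as far as I know,
the logic of Hadoop's default HashPartitioner (Nutch may plug in its own
partitioner for fetch segments); the class name and sample URLs are purely
illustrative.

  // Sketch only: shows how a key is mapped to one of the reduce tasks.
  // No Hadoop dependency; the formula mirrors the default HashPartitioner:
  //   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
  public class PartitionSketch {
      static int partitionFor(String key, int numReduceTasks) {
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }

      public static void main(String[] args) {
          int numReduceTasks = 3;  // matches the job described above
          String[] sampleUrls = {  // made-up example keys
              "http://www.example.com/",
              "http://www.example.org/a",
              "http://www.example.net/b"
          };
          for (String url : sampleUrls) {
              System.out.println(url + " -> reduce task "
                      + partitionFor(url, numReduceTasks));
          }
      }
  }

With this kind of split, each reduce task copies, appends, and sorts only its
own slice of the map output and writes its own output file into the segment.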

Re: Fetcher and MapReduce

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,

so you have 3 boxes, since you run 3 reduce tasks?
What happens is that 3 splits of your data are sorted. In the end you will
get as many output files as you have reduce tasks.
The sorting itself happens in memory.
Check in hadoop-default.xml (it may be inside the hadoop jar) the properties
  <name>io.sort.factor</name>
and
  <name>io.sort.mb</name>
(an example override is sketched below).

HTH
Stefan
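
For example, those two properties can be overridden in hadoop-site.xml, which
takes precedence over hadoop-default.xml. The values below are only
illustrative; whether to raise or lower them depends on how much heap the
task JVMs are given:

  <property>
    <name>io.sort.factor</name>
    <value>25</value>
    <description>Number of streams to merge at once while sorting.
    (Illustrative value.)</description>
  </property>

  <property>
    <name>io.sort.mb</name>
    <value>100</value>
    <description>Buffer memory to use while sorting, in megabytes.
    (Illustrative value.)</description>
  </property>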

