Posted to user@nutch.apache.org by King Going <fa...@gmail.com> on 2011/10/20 08:36:32 UTC

Nutch Fetcher single Map output too large caused a very slow spill merge

I managed to set up Nutch to crawl a single site (only one host), so the Fetcher has
only one map task. I assume that's reasonable.
But after several hours of crawling, that single map task had generated about 50GB of
data, namely spill*.out files. Then Hadoop began to merge those spill*.out files
into intermediate.* files, and finally into file.out.
This merge process takes about 10 hours to finish; isn't that too long?
If that's a Hadoop problem, how can I avoid it?

some of my configs:
only one machine: 4 cores, 8GB RAM, 2 x 1TB hard disks in RAID 0

io.sort.factor=50
io.sort.mb=100
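
For reference, these two settings would normally live in mapred-site.xml (or be
set per job). A minimal sketch with the values above, assuming Hadoop 0.20/1.x
property names:

  <!-- mapred-site.xml -->
  <property>
    <name>io.sort.mb</name>
    <value>100</value>       <!-- in-memory sort buffer per map task, in MB -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>50</value>        <!-- number of spill files merged in one pass -->
  </property>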

Re: Nutch Fetcher single Map output too large caused a very slow spill merge

Posted by Markus Jelsma <ma...@openindex.io>.

On Thursday 20 October 2011 11:05:22 King Going wrote:
> Thank you for your reply.
> 
> > > I managed to set up Nutch to crawl a single site (only one host), so the
> > > Fetcher has only one map task. I assume that's reasonable.
> > > But after several hours of crawling, that single map task had generated
> > > about 50GB of data, namely spill*.out files. Then Hadoop began to merge
> > > those spill*.out files into intermediate.* files, and finally into file.out.
> > > This merge process takes about 10 hours to finish; isn't that too long?
> > > If that's a Hadoop problem, how can I avoid it?
> > 
> > io.sort.mb and io.sort.factor are much too low for such a large amount of
> > output to merge. I recommend starting with the values you use right now
> > and simply decreasing the total number of URLs to fetch per segment, or
> > using multiple machines. 50GB is about 15 million URLs, and that's way too
> > much for a single machine with only 8GB of RAM.
> 
> Yes, I do think that is pretty large. But the rest of the crawling cycle
> (generate, updatedb) also takes a lot of time (more than 4 hours). If I
> decrease the number of URLs per segment, I'll spend more time on generate
> and updatedb, and the overall network utilization will still be low, am I
> right? Any advice on network utilization?

You can generate multiple small segments using the -maxNumSegments switch.
Instead of one segment with 15 million URLs you could generate 20 or 30
segments in one go. It will make your life a lot easier.
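
A sketch of such a generate run (the crawldb and segments paths are
hypothetical, and the -topN switch, which caps the total number of selected
URLs, is only shown for illustration):

  bin/nutch generate crawl/crawldb crawl/segments -topN 15000000 -maxNumSegments 30

You then fetch, parse and update each of the resulting segments separately, so
a single fetch never has to spill and merge 50GB of map output.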

> 
> > > some of my configs:
> > > only one machine: 4 cores, 8GB RAM, 2 x 1TB hard disks in RAID 0
> > 
> > How many mappers are running on that same machine?
> 
> Only one mapper.
> 
> PS: I checked the speed of the map-side merge (Hadoop) process; it is only
> 1 MB/s. The merge process shouldn't be that slow: the CPU is mostly idle,
> and the RAID 0 array is much faster than 1 MB/s.

Keep in mind that merging is a memory-intensive operation. If not enough RAM
is allocated, more I/O will be used instead.
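
If you do want to give the sort and merge more room rather than shrink the
segments, the usual Hadoop 0.20/1.x knobs are the per-task heap and the sort
buffer. The values below are purely illustrative, not recommendations from
this thread:

  <!-- mapred-site.xml: illustrative values only -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>   <!-- per-task heap; io.sort.mb must fit inside it -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>512</value>         <!-- bigger buffer means fewer, larger spill files -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>100</value>         <!-- more streams merged per pass means fewer passes -->
  </property>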

Also, RAID is generally not considered useful for HDFS. Redundancy is supposed
to live at the cluster level, not inside a single machine, and RAID doesn't
help with data locality either. Use more nodes.
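
As a sketch, on a proper multi-node cluster the two disks would be given to
HDFS as separate JBOD directories instead of a RAID 0 array, and redundancy
comes from block replication (mount points are hypothetical; Hadoop 1.x
property names):

  <!-- hdfs-site.xml -->
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data</value>
    <!-- one data directory per physical disk -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>   <!-- copies kept on different nodes, not inside one machine -->
  </property>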

> 
> > > io.sort.factor=50
> > > io.sort.mb=100

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Nutch Fetcher single Map output too large caused a very slow spill merge

Posted by King Going <fa...@gmail.com>.
Thank you for your reply.

> > I managed to set up Nutch to crawl a single site (only one host), so the
> > Fetcher has only one map task. I assume that's reasonable.
> > But after several hours of crawling, that single map task had generated
> > about 50GB of data, namely spill*.out files. Then Hadoop began to merge
> > those spill*.out files into intermediate.* files, and finally into file.out.
> > This merge process takes about 10 hours to finish; isn't that too long?
> > If that's a Hadoop problem, how can I avoid it?
> >
>
> io.sort.mb and io.sort.factor are much too low for such a large amount of
> output to merge. I recommend starting with the values you use right now and
> simply decreasing the total number of URLs to fetch per segment, or using
> multiple machines. 50GB is about 15 million URLs, and that's way too much for
> a single machine with only 8GB of RAM.

Yes, I do think that is pretty large. But the rest of the crawling cycle
(generate, updatedb) also takes a lot of time (more than 4 hours). If I
decrease the number of URLs per segment, I'll spend more time on generate
and updatedb, and the overall network utilization will still be low, am I
right? Any advice on network utilization?

> > some of my configs:
> > only one machine: 4 cores, 8GB RAM, 2 x 1TB hard disks in RAID 0
>
> How many mappers are running on that same machine?

Only one mapper.

PS: I checked the speed of the map-side merge (Hadoop) process; it is only
1 MB/s. The merge process shouldn't be that slow: the CPU is mostly idle,
and the RAID 0 array is much faster than 1 MB/s.

> >
> > io.sort.factor=50
> > io.sort.mb=100

Re: Nutch Fetcher single Map output too large caused a very slow spill merge

Posted by Markus Jelsma <ma...@openindex.io>.
> I managed to set up Nutch to crawl a single site (only one host), so the
> Fetcher has only one map task. I assume that's reasonable.
> But after several hours of crawling, that single map task had generated
> about 50GB of data, namely spill*.out files. Then Hadoop began to merge
> those spill*.out files into intermediate.* files, and finally into file.out.
> This merge process takes about 10 hours to finish; isn't that too long?
> If that's a Hadoop problem, how can I avoid it?
>

io.sort.mb and io.sort.factor are much too low for such a large amount of
output to merge. I recommend starting with the values you use right now and
simply decreasing the total number of URLs to fetch per segment, or using
multiple machines. 50GB is about 15 million URLs, and that's way too much for
a single machine with only 8GB of RAM.
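
A sketch of limiting the segment size with the generate command's -topN switch
(the paths and the value are hypothetical):

  bin/nutch generate crawl/crawldb crawl/segments -topN 2000000

A smaller -topN keeps the Fetcher's map output small enough that the spill
merge stays manageable on a single machine.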
 
> some of my configs:
> only one machine: 4 cores, 8GB RAM, 2 x 1TB hard disks in RAID 0

How many mappers are running on that same machine?

> 
> io.sort.factor=50
> io.sort.mb=100