Posted to dev@nutch.apache.org by Florent Gluck <fl...@busytonight.com> on 2005/12/14 20:39:45 UTC

mapreduce fetcher doesn't fetch all urls

When doing a one-pass crawl, I noticed that when I inject more than
~16000 URLs, the fetcher only fetches a subset of the URLs initially
injected.
I use 1 master and 3 slaves with the following properties:
mapred.map.tasks = 30
mapred.reduce.tasks = 6
generate.max.per.host = -1
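
(These are overrides in conf/nutch-site.xml; a minimal sketch, assuming
the standard property format:)

  <property>
    <name>mapred.map.tasks</name>
    <value>30</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <!-- -1 disables the per-host cap when generating fetch lists -->
    <value>-1</value>
  </property>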

I tried injecting different amounts of URLs to find the threshold where
URLs start to go missing.  Here are the results of my tests so far:

#urls injected:
15000 and below: 100% fetched
16000: 15998 fetched (~100%)
25000: 21379 fetched (86%)
50000: 26565 fetched (53%)
100000: 22088 fetched (22%)

After seeing bug NUTCH-136, "mapreduce segment generator generates
50 % less than expected urls", I thought its fix might solve my problem.
I only applied the 2nd change mentioned in the description (the change
in Generator.java, line 48) since I didn't know how to set the
partitioner to a normal HashPartitioner.  The fix didn't make any
difference.

Then I started debugging the generator to see if all the URLs were
generated.  I confirmed they all were (did a check w/ 50k), so the
problem lies further down the pipeline.  I assume it's somewhere in the
fetcher, but I'm not sure where yet.  I'm going to keep investigating.

Has anyone encountered a similar issue?
I read messages from people crawling millions of pages and I wonder why
I seem to be the only one with this issue.  I'm apparently unable to
fetch more than ~30k pages even though I inject 1 million URLs.

Any help would be greatly appreciated.

Thanks,
--Flo

Re: mapreduce fetcher doesn't fetch all urls

Posted by Stefan Groschupf <sg...@media-style.com>.
Doug,

> I cannot reproduce this.
I was able to reproduce it on different systems several times.
The important thing is to use at least two boxes.
Create a crawldb with maybe 100,000 entries, generate a segment from it
without limitations, and count the entries in the freshly generated
segment.
I wrote my own tool for this using a sequence file reader; you will see
that the generated segment has around 50,000 entries, not 100,000.
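
Such a counter is only a few lines; a sketch (not my exact tool, and
the package/API names are approximate for the 0.8-dev tree, so adjust
the imports for your checkout):

  import org.apache.nutch.fs.NutchFileSystem;
  import org.apache.nutch.io.SequenceFile;
  import org.apache.nutch.io.Writable;

  // Counts the entries in one part file of a generated segment.
  public class SegmentEntryCounter {
    public static void main(String[] args) throws Exception {
      NutchFileSystem fs = NutchFileSystem.get();
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, args[0]);
      // instantiate whatever key/value classes the file declares
      Writable key = (Writable) reader.getKeyClass().newInstance();
      Writable value = (Writable) reader.getValueClass().newInstance();
      long count = 0;
      while (reader.next(key, value)) {
        count++;
      }
      reader.close();
      System.out.println(args[0] + ": " + count + " entries");
    }
  }

Run it over every part file under <segment>/crawl_generate and sum the
counts.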

The problem is somehow related to running on more than one box.

If you like, I can write a test that makes the problem reproducible,
but it may take some time since there is just too much in my queue.

Stefan


Re: mapreduce fetcher doesn't fetch all urls

Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> If you set up one thread per host, you have at most as many
> connections to one host as you have boxes. In my case that is not
> that many.

Anything more than one is not generally considered polite.
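
The point of partitioning by host is that every URL of a given host
lands in the same fetch task, which can then enforce the per-host
delay.  Schematically (simplified, not the actual PartitionUrlByHost
source):

  // All URLs sharing a host hash to the same partition, so exactly
  // one fetch task talks to each host and can rate-limit it.
  public int getPartition(UTF8 key, Writable value, int numPartitions) {
    String host;
    try {
      host = new java.net.URL(key.toString()).getHost();
    } catch (java.net.MalformedURLException e) {
      host = key.toString(); // unparsable URL: fall back to the raw key
    }
    return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

With a plain hash partitioner, URLs from one host scatter across tasks,
and several tasks may then hit the same server at once.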

> Also, it is a reproducible bug that the segment is always ~half the
> size you specify or expect based on your crawldb.
> See my mail posting.

I cannot reproduce this.  I just now ran a crawl with depth=5, topN=100,
and mapred.map.tasks=2, starting from a single URL.  Segments (after the
first two) contain over 80 pages, with a total of more than 300 pages
fetched.

Doug

Re: mapreduce fetcher doesn't fetch all urls

Posted by Stefan Groschupf <sg...@media-style.com>.
Doug,
> I don't recommend this change.  It makes your crawler impolite,  
> since multiple tasks may reference each host.  Perhaps you simply  
> need to increase http.max.delays?  What is this set to?

If you set up one thread per host, you have at most as many
connections to one host as you have boxes. In my case that is not
that many.
Also, it is a reproducible bug that the segment is always ~half the
size you specify or expect based on your crawldb.
See my mail posting.
I haven't had time to dig into the problem and pinpoint the exact bug;
the partitioner itself works, but somehow a combination of things fails.
Anyway, it is on my list, and until I discover the real problem, using
the hash partitioner for a few days is a fair workaround.

Stefan 

Re: mapreduce fetcher doesn't fetch all urls

Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
>> - "job.setPartitionerClass(PartitionUrlByHost.class);" in the generate
>> method
> 
> 
> Yes, this line is the one you need to change. The other stuff can
> stay as it is for now.

I don't recommend this change.  It makes your crawler impolite, since 
multiple tasks may reference each host.  Perhaps you simply need to 
increase http.max.delays?  What is this set to?
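
(Raising it is just an override in nutch-site.xml; the value below is
only an illustration:)

  <property>
    <name>http.max.delays</name>
    <!-- how many times a fetcher thread will wait for a busy host
         before giving up on a URL; illustrative value -->
    <value>100</value>
  </property>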

Doug

Re: mapreduce fetcher doesn't fetch all urls

Posted by Stefan Groschupf <sg...@media-style.com>.
http://issues.apache.org/jira/browse/NUTCH-135

On 15.12.2005 at 02:24, Florent Gluck wrote:

> Stefan,
>
> I searched the Nutch bug database for the fetcher fix you mentioned
> earlier, but I couldn't find it...
> What is the bug number? Or the svn revision of the fix? Also, which
> Nutch version was it for? (0.7, 0.7.1, 0.7.2)
>
> Thanks,
> --Flo
>
> Stefan Groschupf wrote:
>
>>> So, with your patch, did you see 100% of URLs *attempting* a fetch?
>>
>> 100% ! :-)
>
>
>


Re: mapreduce fetcher doesn't fetch all urls

Posted by Florent Gluck <fl...@busytonight.com>.
Stefan,

I searched the Nutch bug database for the fetcher fix you mentioned
earlier, but I couldn't find it...
What is the bug number? Or the svn revision of the fix? Also, which
Nutch version was it for? (0.7, 0.7.1, 0.7.2)

Thanks,
--Flo

Stefan Groschupf wrote:

>> So, with your patch, did you see 100% of URLs *attempting* a fetch?
>
> 100% ! :-)



Re: mapreduce fetcher doesn't fetch all urls

Posted by Florent Gluck <fl...@busytonight.com>.
AWESOME!! =:)

Stefan Groschupf wrote:

>> So, with your patch, did you see 100% of URLs *attempting* a fetch?
>
> 100% ! :-)



Re: mapreduce fetcher doesn't fetch all urls

Posted by Stefan Groschupf <sg...@media-style.com>.
> So, with your patch, did you see 100% of URLs *attempting* a fetch?
100% ! :-)

Re: mapreduce fetcher doesn't fetch all urls

Posted by Florent Gluck <fl...@busytonight.com>.
> Just apply my patch and try to compile; you will see what you need
> to change. It's just a few changes of new Properties() to
> ContentProperties(), and maybe the import of that class.

Cool, I'll have a look at your patch :)

>
>> It's much better than what I have right now.  However, it's still
>> not 100%, and fetching all the URLs would mean implementing some sort
>> of iterative process until they are all finally fetched.
>> Do you have an idea why we are still missing 10 to 20%?
>
>
> Well, since I started with DMOZ, those are URLs that no longer exist
> but are still listed in DMOZ. You also get some general errors:
> unable to parse, host down, etc.
> So a 10% error rate is not too bad; later on, when you have some
> hundred million pages, you will see the error rate drop below 5%.


My results don't exclude the URLs that failed to fetch: the percentages
count all fetch attempts (errors included), so they should be 100%
regardless of the error rate.
So, with your patch, did you see 100% of URLs *attempting* a fetch?

Thanks,
--Flo


Re: mapreduce fetcher doesn't fetch all urls

Posted by Stefan Groschupf <sg...@media-style.com>.
> - "job.setPartitionerClass(PartitionUrlByHost.class);" in the generate
> method

Yes, this line is the one you need to change. The other stuff can
stay as it is for now.
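
In the generate() job setup that means, schematically (only the
setPartitionerClass line is the real change; HashPartitioner is the
default hash partitioner from the mapred lib package):

  // before: keep all URLs of one host in one partition (polite fetching)
  // job.setPartitionerClass(PartitionUrlByHost.class);
  // workaround: plain hash partitioning of the URL key
  job.setPartitionerClass(HashPartitioner.class);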

> Do I only need to change the last line to use HashPartitioner.class,
> or do I need to modify the other two references as well?
>
>> Then also apply the case-insensitive content properties patch to 0.8.
>> You may need to change 3 other classes (e.g. the fetcher) since the
>> patch is for 0.7.
>
Just apply my patch and try to compile; you will see what you need
to change. It's just a few changes of new Properties() to
ContentProperties(), and maybe the import of that class.
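
The edits are all of this shape; a schematic before/after (the metadata
lines are illustrative, only the ContentProperties class comes from the
patch):

  // before: JDK Properties, where key lookups are case-sensitive
  //   Properties metadata = new Properties();
  // after: the patch's class with case-insensitive keys
  ContentProperties metadata = new ContentProperties();
  metadata.setProperty("Content-Type", "text/html");
  // "content-type" and "Content-Type" now resolve to the same entry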

> It's much better than what I have right now.  However, it's still
> not 100%, and fetching all the URLs would mean implementing some sort
> of iterative process until they are all finally fetched.
> Do you have an idea why we are still missing 10 to 20%?

Well, since I started with DMOZ, those are URLs that no longer exist
but are still listed in DMOZ. You also get some general errors:
unable to parse, host down, etc.
So a 10% error rate is not too bad; later on, when you have some
hundred million pages, you will see the error rate drop below 5%.

Stefan


Re: mapreduce fetcher doesn't fetch all urls

Posted by Florent Gluck <fl...@busytonight.com>.
Stefan,

Thanks for your input, I'm glad to see I'm not the only one :)

> Change the fetcher to the hash partitioner; see the job setup where
> the URL-by-host partitioner is currently used.

There are several references to PartitionUrlByHost in Generator.java:
- "private Partitioner hostPartitioner = new PartitionUrlByHost();" in
the member declarations
- in the "getPartition" method
- "job.setPartitionerClass(PartitionUrlByHost.class);" in the generate
method
Do I only need to change the last line to use HashPartitioner.class,
or do I need to modify the other two references as well?

> Then also apply the case-insensitive content properties patch to 0.8.
> You may need to change 3 other classes (e.g. the fetcher) since the
> patch is for 0.7.

I'm not sure I understand what I need to do... Do I need to modify 3
other classes?
Was 0.7 prone to this bug as well, and has it been fixed there?  So I'd
need to port the fix to 0.8?

> After that I was able to get at least an 80-90% success rate running
> a 2-million-page fetch. Actually, I only have the problem that the
> reduce tasks hang somehow, as discussed in the user list.

It's much better than what I have right now.  However, it's still
not 100%, and fetching all the URLs would mean implementing some sort
of iterative process until they are all finally fetched.
Do you have an idea why we are still missing 10 to 20%?

Thanks,
--Flo

>
> Stefan
>
>
> On 14.12.2005 at 20:39, Florent Gluck wrote:
>
>> [...]


Re: mapreduce fetcher doesn't fetch all urls

Posted by Stefan Groschupf <sg...@media-style.com>.
Flo,
I had the same problem!
Change the fetcher to the hash partitioner; see the job setup where
the URL-by-host partitioner is currently used.
Then also apply the case-insensitive content properties patch to 0.8.
You may need to change 3 other classes (e.g. the fetcher) since the
patch is for 0.7.
After that I was able to get at least an 80-90% success rate running
a 2-million-page fetch. Actually, I only have the problem that the
reduce tasks hang somehow, as discussed in the user list.

Stefan


On 14.12.2005 at 20:39, Florent Gluck wrote:

> [...]

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net