You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@nutch.apache.org by Ned Rockson <ne...@discoveryengine.com> on 2007/10/23 23:41:51 UTC

Update to URL ordering from Generator.java

So recently I switched to Fetcher2 over Fetcher for larger whole web 
fetches (50-100M at a time).  I found that the URLs generated are not 
optimal because they are simply randomized by a hash comparator.  In one 
crawl on 24 machines it took about 3 days to crawl 30M URLs.  In 
comparison with old benchmarks I had set with regular Fetcher.java this 
was at least 3 fold more time.

Anyway, I realized that the best situation for ordering can be 
approached by randomization, but in order to get optimal ordering, urls 
from the same host should be as far apart in the list as possible.  So I 
wrote a series of 2 map/reduces to optimize the ordering and for a list 
of 25M documents it takes about 10 minutes on our cluster.  Right now I 
have it in its own class, but I figured it can go in Generator.java and 
just add a flag in nutch-default.xml determining if the user wants to 
use it. 

So, should I submit the code by email?  Is there some way to change 
Generator.java or should I just submit the function in its own class?

Thanks,
Ned

[formerly nedrocks at gmail dot com]

Re: Update to URL ordering from Generator.java

Posted by Ken Krugler <kk...@transpac.com>.

>That's really interesting.  From the literature I have been reading, 
>I always though that the politeness policies required a 1s - 5s wait 
>between queries to the same host.  However, I always thought that 
>regardless of HTTP/1.0 or HTTP/1.1 this was the case and if not, do 
>we really want to add the logic of determining if this is a 
>"num-sites-per-minute-host" or a "num-sites-per-hour-host"

Actually it's a num-requests-per-xxx setting. Where typically 1 
request == 1 web page, if you exclude things like PDFs.

>or a "1-request-per-second-host"?  If we assume the former options 
>then we will definitely increase our throughput

The throughput would depend on what setting you use. E.g. if the 
setting was 20 pages/minute, this would be roughly the equivalent 
(leaving aside latency) of a 3 second delay between requests.

>but most likely get a lot of hate mail from webmasters for smaller 
>sites, and if we assume the latter than we are where we started.

For the small number - 5 or so - of ops people I surveyed, they _all_ 
cared about requests per time interval, not time delay between 
requests. And having gone through the experience of getting a bunch 
of pages crawled on our site, I understand why.

Nobody that I know of monitors individual request timing - it's all 
based on summed, time-bucketed info for an originating IP address, or 
block of addresses.

So my main point is that currently it feels like a mismatch between 
how Nutch tries to implement polite crawling, versus how an ops 
person would evaluate the politeness of a crawler.

A secondary point is that it should be possible to use keep-alive, 
and still be considered polite, if ops people really did care about 
requests/time interval.

Which could lead to significantly more efficient crawling. For 
example, see page 6 of "Scheduling Algorithms for Web Crawling", 
http://www.dcc.uchile.cl/~ccastill/papers/castillo04_scheduling_algorithms_web_crawling.pdf). 
In their tests, with a big pipe, the crawl was 3x faster...though it 
doesn't say how they constrained the per-domain crawling rate to 
remain polite.

-- Ken

PS -  Also note that the HTTP 1.1 response header can contain the 
server's max requests/connection value, so the site admin does have 
control over what they feel is a reasonable limit.

>On Oct 24, 2007, at 8:29 AM, Ken Krugler wrote:
>
>>>Ned Rockson wrote:
>>>>So recently I switched to Fetcher2 over Fetcher for larger whole 
>>>>web fetches (50-100M at a time).  I found that the URLs generated 
>>>>are not optimal because they are simply randomized by a hash 
>>>>comparator.  In one crawl on 24 machines it took about 3 days to 
>>>>crawl 30M URLs.  In comparison with old benchmarks I had set with 
>>>>regular Fetcher.java this was at least 3 fold more time.
>>>>
>>>>Anyway, I realized that the best situation for ordering can be 
>>>>approached by randomization, but in order to get optimal 
>>>>ordering, urls from the same host should be as far apart in the 
>>>>list as possible.  So I wrote a series of 2 map/reduces to 
>>>>optimize the ordering and for a list of 25M documents it takes 
>>>>about 10 minutes on our cluster.  Right now I have it in its own 
>>>>class, but I figured it can go in Generator.java and just add a 
>>>>flag in nutch-default.xml determining if the user wants to use it.
>>>>So, should I submit the code by email?  Is there some way to 
>>>>change Generator.java or should I just submit the function in its 
>>>>own class?
>>>>
>>>
>>>This sounds intriguing. Nutch mailing lists strip attachments - 
>>>you should create a JIRA issue, copy this description, and attach 
>>>a patch in unified diff format (svn diff will do).
>>
>>For what it's worth, a year ago I had a series of discussions with 
>>people who manage large web sites, w.r.t. acceptable crawl rates.
>>
>>This was in the context of whether using keep-alive made sense, to 
>>optimize crawl performance.
>>
>>For sites that mostly serve up static content, keep-alive was seen 
>>as a good thing, in that it reduced the load on the server 
>>constantly creating/destroying connections.
>>
>>For sites that build pages from multiple data sources, this was a 
>>minor issue - they were more concerned about total # of pages 
>>fetched per time interval, given the high cost of assembling the 
>>page.
>>
>>Which is the interesting point - nobody talked about the amount of 
>>time between requests. They all were interested in monitoring pages 
>>per time interval - typically an hour for smaller sites, and pages 
>>per minute for ones with heavier traffic.
>>
>>Based on the above, what seems like the most efficient approach to 
>>fetching would be to use a pages/minute/domain restriction, and use 
>>a kept-alive connection (w/HTTP 1.1) with no delay until that limit 
>>is reached, then switch to another set of URLs from a different 
>>domain.
>>
>>This assumes a maximum of one thread/domain, and that URLs are 
>>segmented such that each domain is handled inside of one task, such 
>>that the one thread/domain and pages/minute/domain restrictions can 
>>be enforced properly.
>>
>>-- Ken
>>--
>>Ken Krugler
>>Krugle, Inc.
>>+1 530-210-6378
>>"If you can't find it, you can't fix it"

-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Update to URL ordering from Generator.java

Posted by Ned Rockson <ne...@discoveryengine.com>.

That's really interesting.  From the literature I have been reading,  
I always though that the politeness policies required a 1s - 5s wait  
between queries to the same host.  However, I always thought that  
regardless of HTTP/1.0 or HTTP/1.1 this was the case and if not, do  
we really want to add the logic of determining if this is a "num- 
sites-per-minute-host" or a "num-sites-per-hour-host" or a "1-request- 
per-second-host"?  If we assume the former options then we will  
definitely increase our throughput but most likely get a lot of hate  
mail from webmasters for smaller sites, and if we assume the latter  
than we are where we started.

--Ned
(formerly nedrocks at gmail dot com)

On Oct 24, 2007, at 8:29 AM, Ken Krugler wrote:

>> Ned Rockson wrote:
>>> So recently I switched to Fetcher2 over Fetcher for larger whole  
>>> web fetches (50-100M at a time).  I found that the URLs generated  
>>> are not optimal because they are simply randomized by a hash  
>>> comparator.  In one crawl on 24 machines it took about 3 days to  
>>> crawl 30M URLs.  In comparison with old benchmarks I had set with  
>>> regular Fetcher.java this was at least 3 fold more time.
>>>
>>> Anyway, I realized that the best situation for ordering can be  
>>> approached by randomization, but in order to get optimal  
>>> ordering, urls from the same host should be as far apart in the  
>>> list as possible.  So I wrote a series of 2 map/reduces to  
>>> optimize the ordering and for a list of 25M documents it takes  
>>> about 10 minutes on our cluster.  Right now I have it in its own  
>>> class, but I figured it can go in Generator.java and just add a  
>>> flag in nutch-default.xml determining if the user wants to use it.
>>> So, should I submit the code by email?  Is there some way to  
>>> change Generator.java or should I just submit the function in its  
>>> own class?
>>>
>>
>> This sounds intriguing. Nutch mailing lists strip attachments -  
>> you should create a JIRA issue, copy this description, and attach  
>> a patch in unified diff format (svn diff will do).
>
> For what it's worth, a year ago I had a series of discussions with  
> people who manage large web sites, w.r.t. acceptable crawl rates.
>
> This was in the context of whether using keep-alive made sense, to  
> optimize crawl performance.
>
> For sites that mostly serve up static content, keep-alive was seen  
> as a good thing, in that it reduced the load on the server  
> constantly creating/destroying connections.
>
> For sites that build pages from multiple data sources, this was a  
> minor issue - they were more concerned about total # of pages  
> fetched per time interval, given the high cost of assembling the page.
>
> Which is the interesting point - nobody talked about the amount of  
> time between requests. They all were interested in monitoring pages  
> per time interval - typically an hour for smaller sites, and pages  
> per minute for ones with heavier traffic.
>
> Based on the above, what seems like the most efficient approach to  
> fetching would be to use a pages/minute/domain restriction, and use  
> a kept-alive connection (w/HTTP 1.1) with no delay until that limit  
> is reached, then switch to another set of URLs from a different  
> domain.
>
> This assumes a maximum of one thread/domain, and that URLs are  
> segmented such that each domain is handled inside of one task, such  
> that the one thread/domain and pages/minute/domain restrictions can  
> be enforced properly.
>
> -- Ken
> -- 
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"

Re: Update to URL ordering from Generator.java

Posted by Ken Krugler <kk...@transpac.com>.

>Ned Rockson wrote:
>>So recently I switched to Fetcher2 over Fetcher for larger whole 
>>web fetches (50-100M at a time).  I found that the URLs generated 
>>are not optimal because they are simply randomized by a hash 
>>comparator.  In one crawl on 24 machines it took about 3 days to 
>>crawl 30M URLs.  In comparison with old benchmarks I had set with 
>>regular Fetcher.java this was at least 3 fold more time.
>>
>>Anyway, I realized that the best situation for ordering can be 
>>approached by randomization, but in order to get optimal ordering, 
>>urls from the same host should be as far apart in the list as 
>>possible.  So I wrote a series of 2 map/reduces to optimize the 
>>ordering and for a list of 25M documents it takes about 10 minutes 
>>on our cluster.  Right now I have it in its own class, but I 
>>figured it can go in Generator.java and just add a flag in 
>>nutch-default.xml determining if the user wants to use it.
>>So, should I submit the code by email?  Is there some way to change 
>>Generator.java or should I just submit the function in its own 
>>class?
>>
>
>This sounds intriguing. Nutch mailing lists strip attachments - you 
>should create a JIRA issue, copy this description, and attach a 
>patch in unified diff format (svn diff will do).

For what it's worth, a year ago I had a series of discussions with 
people who manage large web sites, w.r.t. acceptable crawl rates.

This was in the context of whether using keep-alive made sense, to 
optimize crawl performance.

For sites that mostly serve up static content, keep-alive was seen as 
a good thing, in that it reduced the load on the server constantly 
creating/destroying connections.

For sites that build pages from multiple data sources, this was a 
minor issue - they were more concerned about total # of pages fetched 
per time interval, given the high cost of assembling the page.

Which is the interesting point - nobody talked about the amount of 
time between requests. They all were interested in monitoring pages 
per time interval - typically an hour for smaller sites, and pages 
per minute for ones with heavier traffic.

Based on the above, what seems like the most efficient approach to 
fetching would be to use a pages/minute/domain restriction, and use a 
kept-alive connection (w/HTTP 1.1) with no delay until that limit is 
reached, then switch to another set of URLs from a different domain.

This assumes a maximum of one thread/domain, and that URLs are 
segmented such that each domain is handled inside of one task, such 
that the one thread/domain and pages/minute/domain restrictions can 
be enforced properly.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Update to URL ordering from Generator.java

Posted by Ned Rockson <ne...@discoveryengine.com>.

Sweet, I'll look into how to go about doing that later today.

On Oct 24, 2007, at 6:06 AM, Andrzej Bialecki wrote:

> Ned Rockson wrote:
>> So recently I switched to Fetcher2 over Fetcher for larger whole  
>> web fetches (50-100M at a time).  I found that the URLs generated  
>> are not optimal because they are simply randomized by a hash  
>> comparator.  In one crawl on 24 machines it took about 3 days to  
>> crawl 30M URLs.  In comparison with old benchmarks I had set with  
>> regular Fetcher.java this was at least 3 fold more time.
>> Anyway, I realized that the best situation for ordering can be  
>> approached by randomization, but in order to get optimal ordering,  
>> urls from the same host should be as far apart in the list as  
>> possible.  So I wrote a series of 2 map/reduces to optimize the  
>> ordering and for a list of 25M documents it takes about 10 minutes  
>> on our cluster.  Right now I have it in its own class, but I  
>> figured it can go in Generator.java and just add a flag in nutch- 
>> default.xml determining if the user wants to use it.
>> So, should I submit the code by email?  Is there some way to  
>> change Generator.java or should I just submit the function in its  
>> own class?
>>
>
> This sounds intriguing. Nutch mailing lists strip attachments - you  
> should create a JIRA issue, copy this description, and attach a  
> patch in unified diff format (svn diff will do).
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

Re: Update to URL ordering from Generator.java

Posted by Andrzej Bialecki <ab...@getopt.org>.

Ned Rockson wrote:
> So recently I switched to Fetcher2 over Fetcher for larger whole web 
> fetches (50-100M at a time).  I found that the URLs generated are not 
> optimal because they are simply randomized by a hash comparator.  In one 
> crawl on 24 machines it took about 3 days to crawl 30M URLs.  In 
> comparison with old benchmarks I had set with regular Fetcher.java this 
> was at least 3 fold more time.
> 
> Anyway, I realized that the best situation for ordering can be 
> approached by randomization, but in order to get optimal ordering, urls 
> from the same host should be as far apart in the list as possible.  So I 
> wrote a series of 2 map/reduces to optimize the ordering and for a list 
> of 25M documents it takes about 10 minutes on our cluster.  Right now I 
> have it in its own class, but I figured it can go in Generator.java and 
> just add a flag in nutch-default.xml determining if the user wants to 
> use it.
> So, should I submit the code by email?  Is there some way to change 
> Generator.java or should I just submit the function in its own class?
>

This sounds intriguing. Nutch mailing lists strip attachments - you 
should create a JIRA issue, copy this description, and attach a patch in 
unified diff format (svn diff will do).

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com