Posted to user@nutch.apache.org by Semyon Semyonov <se...@mail.com> on 2017/11/03 14:13:43 UTC

Re: RE: Ways of limiting pages per host: generate.max.count, hostdb, scoring-depth

I managed to apply the patch from the issue, but I had to make a small modification to the code (it didn't work for the Nutch REST API; a patch is attached to the issue).

I used the patch with the following settings:
<property>
  <name>generate.max.count.expr</name>
  <value>if (fetched > 120) {return new("java.lang.Double", 0);} else {return conf.getDouble("generate.max.count", -1);}</value>
</property>

In other words, the intent is that once more than 120 pages of a host have been fetched, nothing further is generated for that host; otherwise the normal generate.max.count applies. That works, so I will stick with this approach. It adds one more step to the crawling process (updating the hostdb), but that seems like a necessary evil for the time being.
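
For reference, that extra step is just a HostDB update between rounds; something like this (the paths are illustrative for my setup, not defaults):

  bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb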
 

Sent: Monday, October 23, 2017 at 2:57 PM
From: "Markus Jelsma" <ma...@openindex.io>
To: "user@nutch.apache.org" <us...@nutch.apache.org>
Subject: RE: Ways of limiting pages per host: generate.max.count, hostdb, scoring-depth
How about NUTCH-2368's variable generate.max.count based on HostDB data? [1]

Regards,
Markus

[1] https://issues.apache.org/jira/browse/NUTCH-2368

-----Original message-----
> From:Semyon Semyonov <se...@mail.com>
> Sent: Monday 23rd October 2017 15:51
> To: user@nutch.apache.org
> Subject: Ways of limiting pages per host: generate.max.count, hostdb, scoring-depth
>
> Hi,
>
> I'm looking for the best way to restrict the number of pages crawled per host. I have a list of hosts to crawl, let's say M hosts, and I would like to limit the crawl of each host to MaxPages pages.
> External links are turned off for the crawling process.
>
> My own proposal can be found under 3).
>  
> 1) Using https://www.mail-archive.com/user@nutch.apache.org/msg10245.html
> We know the size of the cluster (number of nodes) and the size of the list (M).
> If we divide M by (number of nodes in the cluster * number of fetches per node), we get the total number of rounds needed to crawl the first level (K).
> Then we multiply this by the number of levels per website (N = 2, 3, 4, ...), depending on how deep we want to go into each website.
> Let's say crawling the whole list takes K = 500 rounds and we want to crawl each website up to the 4th level (N = 4); the total number of rounds is then K*N = 2000.
> Combining this with generate.max.count = MaxPages, we get at most MaxPages * N pages per host.
> Problem: the process has to run smoothly enough to guarantee that the full list is crawled within K rounds; any hiccup in the crawling process and/or the Hadoop cluster breaks that guarantee.
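>
> To make the budget concrete, a back-of-the-envelope sketch in Java (all numbers are made-up examples, not measurements):
>
>     public class RoundBudget {
>         public static void main(String[] args) {
>             long hosts = 1_000_000L;    // M: size of the host list (example)
>             int nodes = 20;             // nodes in the Hadoop cluster (example)
>             int fetchesPerNode = 100;   // hosts one node fetches per round (example)
>             int depth = 4;              // N: desired crawl depth
>             long roundsPerLevel = hosts / (nodes * fetchesPerNode);         // K = 500
>             System.out.println("total rounds: " + roundsPerLevel * depth);  // K*N = 2000
>         }
>     }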
>  
> 2) The second approach is to use the hostdb: https://www.mail-archive.com/user@nutch.apache.org/msg14330.html
> Problem: it requires additional computation to maintain the hostdb, plus a workaround with the blacklist.
>  
> 3) My own solution; it is a bit tricky.
> It uses the scoring-depth plugin extension and the generate.min.score config, as sketched below.
>  
> That plugin sets the weight of each linked page to ParentWeight / (number of linked pages). The initial weight is 1 by default.
>  
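> A minimal config sketch of the generation threshold (generate.min.score is a standard Nutch property; 0.25 matches the 1/4 example below):
>
> <property>
>   <name>generate.min.score</name>
>   <value>0.25</value>
> </property>
>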
> My idea is that we can estimate the maximum number of pages per host from such a score threshold.
> To illustrate, here are several trees that produce pages of weight 1/4 for a host (5, 5 and 7 pages respectively):
>  
>         1
>    /   / \     \
>   /   /   \     \ 
>  /   /     \     \
> 1/4 1/4    1/4   1/4
>         1
>        / \
>       /   \
>      /     \
>     1/2     1/2
>             / \
>           1/4 1/4
>     
>         1
>        / \
>       /   \
>      /     \
>     1/2     1/2
>    / \     / \
>   1/4 1/4 1/4 1/4
>
> The last tree gives the maximum number of pages for a weight of 1/4 (3 levels, each level summing to a total weight of 1). Total number of pages = 7.
> The idea is that the maximum number of links is reached with the deepest tree, and the deepest tree corresponds to the prime factorization of the final weight's denominator.
>  
> For example, for 1/4 we take the prime factors of 4 = 2 * 2; the total number of pages is 1 + 1*2 + 1*2*2 = 7.
> For a weight of 1/9: 1 + 1*3 + 1*3*3 = 13.
> For a weight of 1/48: 1 + 1*2 + 1*2*2 + 1*2*2*2 + 1*2*2*2*2 + 1*2*2*2*2*3 = 79.
>
> The calculator: http://www.calculator.net/factoring-calculator.html?cvar=18&x=77&y=22
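>
> Instead of the calculator, the bound can be computed directly. A small sketch of the reasoning above (not Nutch code; factors are taken in ascending order, as in the 1/48 example):
>
>     public class MaxPages {
>         /** Pages reachable before weights drop below 1/denominator,
>          *  following the deepest-tree argument above. */
>         static long maxPages(long denominator) {
>             long sum = 1;    // the root page, weight 1
>             long level = 1;  // number of pages on the current level
>             long n = denominator;
>             for (long f = 2; f * f <= n; f++) {
>                 while (n % f == 0) {  // branch by each prime factor in turn
>                     n /= f;
>                     level *= f;
>                     sum += level;
>                 }
>             }
>             if (n > 1) { sum += level * n; }  // last remaining prime factor
>             return sum;
>         }
>
>         public static void main(String[] args) {
>             System.out.println(maxPages(4));   // 7
>             System.out.println(maxPages(9));   // 13
>             System.out.println(maxPages(48));  // 79
>         }
>     }
>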
> Problem: the score can be affected by other scoring plugins.
>  
> Thanks.
>
> Semyon.
>