Posted to user@nutch.apache.org by Semyon Semyonov <se...@mail.com> on 2018/01/19 11:36:31 UTC

Re: Usage previous stage HostDb data for generate(fetched deltas)

I have proposed a solution for this (https://issues.apache.org/jira/browse/NUTCH-2481).

With this commit we can use delta statistics of the hostdb (its state before and after the update): the differences are calculated and saved in the metadata.

For example, to use fetched deltas in generate:

1) To calculate FetchedDelta in the hostdb update
<property>
  <name>hostdb.deltaExpression</name>
  <value>{return new ("javafx.util.Pair","FetchedDelta", currentHostDatum.fetched - previousHostDatum.fetched);}</value>
</property>

2) To use FetchedDelta in generate to not crawl the websites with FetchedDelta < 5

<property>
  <name>generate.max.count.expr</name>
  <value>if (fetched > 70 &amp;&amp; FetchedDelta &lt; 5) {return new("java.lang.Double", 0);} else {return conf.getDouble("generate.max.count", -1);}</value>
</property>
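
To make the intent of the two expressions easier to follow, here is a rough sketch of the same decision logic in plain Python. It is not Nutch or JEXL code; the function names and the plain-integer arguments are hypothetical and only stand in for the hostdb fields used above.

```python
def fetched_delta(previous_fetched, current_fetched):
    """Difference in fetched pages between two hostdb updates
    (mirrors currentHostDatum.fetched - previousHostDatum.fetched)."""
    return current_fetched - previous_fetched


def max_count(fetched, delta, default_max_count=-1.0):
    """Mirror of the generate.max.count.expr example.

    Returns 0.0 (generate nothing for this host) once more than 70 pages
    have been fetched and the last delta dropped below 5; otherwise falls
    back to the configured default, where -1 stands for "no limit".
    """
    if fetched > 70 and delta < 5:
        return 0.0
    return default_max_count


if __name__ == "__main__":
    delta = fetched_delta(previous_fetched=91, current_fetched=92)
    print(delta, max_count(92, delta))
```

The guard on fetched means a slow start does not stop the crawl of a host; the delta is only consulted once the host has passed 70 fetched pages.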

The commit still needs to be tested, though, so feel free to test or modify it.
 

Sent: Thursday, December 14, 2017 at 2:07 PM
From: "Semyon Semyonov" <se...@mail.com>
To: "usernutch.apache.org" <us...@nutch.apache.org>
Subject: Usage previous stage HostDb data for generate(fetched deltas)
Dear all,

I plan to extend the hostdb functionality so that a DB_FETCHED delta is available in the generate stage.

Let's say that for each website we keep generating while the number of fetched pages is < 150.
The problem is that for some websites this limit will (almost) never be reached, because of the site's structure.

For example:
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page
...etc.

I would like to add a delta condition for fetched that describes the speed of the process. Let's say we generate while number of fetched < 150 && delta_fetched > 1.
In this case the process should therefore stop at round 5, with a total of 92 fetched pages (1 + 10 + 80 + 1).
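
The stopping rule can be checked with a small simulation. This is only a sketch in plain Python, not Nutch code, and all names are hypothetical. One assumption is added that is not part of the rule as stated: the delta check is only applied once more than 70 pages have been fetched (the same kind of guard as fetched > 70 in the generate.max.count.expr example above). Taken literally, the rule would already stop after round 1, whose delta is also 1.

```python
def simulate(pages_per_round, max_fetched=150, min_delta=1, warmup_fetched=70):
    """Return (rounds_run, total_fetched) for a host.

    A round is generated while total < max_fetched and, once more than
    warmup_fetched pages have been fetched (an assumption of this sketch),
    only if the previous round's delta was greater than min_delta.
    """
    total = 0
    last_delta = None
    rounds_run = 0
    for pages in pages_per_round:
        if total >= max_fetched:
            break  # the fetched limit has been reached
        past_warmup = total > warmup_fetched
        if past_warmup and last_delta is not None and last_delta <= min_delta:
            break  # the crawl has slowed down too much for this host
        total += pages
        last_delta = pages
        rounds_run += 1
    return rounds_run, total


if __name__ == "__main__":
    # Rounds from the example: 1, 10, 80, 1, 1 pages
    print(simulate([1, 10, 80, 1, 1]))  # (4, 92)
```

Four rounds run and round 5 is skipped, which gives the total of 92 fetched pages.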

To do this, I plan to modify the updatehostdb function and add a delta variable for fetched to the hostdatum.

Do you think this is a good way to do it?

Semyon.