Posted to dev@nutch.apache.org by "Semyon Semyonov (JIRA)" <ji...@apache.org> on 2018/01/17 13:32:00 UTC

[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)

     [ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Semyon Semyonov updated NUTCH-2481:
-----------------------------------
    Component/s: generator

> HostDatum deltas(previous step statistics)
> ------------------------------------------
>
>                 Key: NUTCH-2481
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2481
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator, hostdb
>            Reporter: Semyon Semyonov
>            Priority: Major
>
> Allow the statistics of the previous step (deltas of fetched, unfetched, etc.) to be used in hostdb. The motivation is to use these statistics in generate with maxCount expressions.
> See the example below and two possible solutions.
> ??Let's say that for each website the condition is: generate while the number of fetched pages < 150.
> The problem is that for some websites this limit will (almost) never be reached, because of the site's structure.
> 1) Round 1: 1 page
> 2) Round 2: 10 pages
> 3) Round 3: 80 pages
> 4) Round 4: 1 page
> 5) Round 5: 1 page
> ...etc.
> I would like to add a delta condition on fetched that describes the speed of the process. Let's say: generate while number of fetched < 150 && delta_fetched > 1.
> In this case the process should stop at round 5, with the total number of fetched pages equal to 92.
> ??
> I see two possible solutions:
> 1. In the HostDatum class, store the last step's statistics in addition to the current statistics.
> class PagesStatistics
> {
>   protected int unfetched = 0;
>   protected int fetched = 0;
>   protected int notModified = 0;
>   protected int redirTemp = 0;
>   protected int redirPerm = 0;
>   protected int gone = 0;
> }
> Inside HostDatum:
> private PagesStatistics currentStatistics;
> private PagesStatistics previousStepStatistics;
> Both would be updated in UpdateHostDb. *The main problem is space: in generate, HostDatum is stored in a dictionary (in RAM).*
> 2. Include metadata flag(s) in HostDatum, stored as a field (Metadata.StopGenerate = true/false), and calculate the value of StopGenerate in UpdateHostDb.
> *The main advantage is space: only a flag is stored in the db. The main problem is the lack of flexibility in Generate.*
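The delta condition from the example can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not Nutch code: DeltaGenerateSketch, HostStats, recordRound, shouldGenerate and simulate are hypothetical names, and only the fetched counter of the proposed PagesStatistics class is modelled. One further assumption is needed to reproduce the total of 92: the delta check is skipped until a previous step actually exists, because otherwise round 1 (delta of 1) would already stop the process.

```java
// Hypothetical sketch of solution 1: keep current and previous-step counters
// per host and derive delta_fetched from them. Not part of Nutch.
class DeltaGenerateSketch {

    /** Per-host counters; only "fetched" is modelled here. */
    static class PagesStatistics {
        int fetched = 0;
    }

    /** Hypothetical stand-in for HostDatum holding two snapshots. */
    static class HostStats {
        PagesStatistics current = new PagesStatistics();
        PagesStatistics previousStep = new PagesStatistics();
        int roundsSeen = 0;

        /** What an update step would do: shift current to previous, then add. */
        void recordRound(int newlyFetched) {
            previousStep.fetched = current.fetched;
            current.fetched += newlyFetched;
            roundsSeen++;
        }

        int deltaFetched() {
            return current.fetched - previousStep.fetched;
        }

        /** generate while fetched < maxFetched && delta_fetched > minDelta. */
        boolean shouldGenerate(int maxFetched, int minDelta) {
            // Assumption: no delta check before a previous step exists.
            boolean deltaOk = roundsSeen < 2 || deltaFetched() > minDelta;
            return current.fetched < maxFetched && deltaOk;
        }
    }

    /** Returns the total fetched count at the point where generation stops. */
    static int simulate(int[] pagesPerRound, int maxFetched, int minDelta) {
        HostStats host = new HostStats();
        for (int pages : pagesPerRound) {
            host.recordRound(pages);
            if (!host.shouldGenerate(maxFetched, minDelta)) {
                break; // the next round is not generated
            }
        }
        return host.current.fetched;
    }

    public static void main(String[] args) {
        int[] rounds = {1, 10, 80, 1, 1};
        System.out.println(simulate(rounds, 150, 1)); // prints 92
    }
}
```

With the rounds from the example, the condition fails after round 4 (delta of 1), so round 5 is never generated and the total stays at 92. Solution 2 would evaluate the same boolean once in the update step and store only the result as a flag, which saves space but fixes the thresholds at update time.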



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)