Posted to dev@nutch.apache.org by "Semyon Semyonov (JIRA)" <ji...@apache.org> on 2018/01/17 13:32:00 UTC
[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Semyon Semyonov updated NUTCH-2481:
-----------------------------------
Component/s: generator
> HostDatum deltas(previous step statistics)
> ------------------------------------------
>
> Key: NUTCH-2481
> URL: https://issues.apache.org/jira/browse/NUTCH-2481
> Project: Nutch
> Issue Type: Improvement
> Components: generator, hostdb
> Reporter: Semyon Semyonov
> Priority: Major
>
> Allow the use of previous-step statistics (deltas of fetched, unfetched, etc.) in hostdb. The motivation is to use these statistics in generate with maxCount expressions.
> See the example below and two possible solutions.
> ??Let's say that for each website the condition is: generate while number of fetched < 150.
> The problem is that for some websites generation under this condition will (almost) never finish, because of the site's structure.
> 1) Round 1. 1 page
> 2) Round 2. 10 pages
> 3) Round 3. 80 pages
> 4) Round 4. 1 page
> 5) Round 5. 1 page
> ...etc.
> I would like to add a delta condition for fetched that describes the speed of the process. Let's say: generate while number of fetched < 150 && delta_fetched > 1.
> In this case the process should stop at round 5, with the total number of fetched equal to 92.
> ??
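
The condition above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `DeltaCondition`, `deltaFetched` and `shouldGenerate` are hypothetical names that do not exist in Nutch, and the delta is assumed to be the difference between the cumulative fetched count after the last round and the count after the round before it.

```java
/** Hypothetical helper, not part of Nutch: evaluates the condition from the example. */
final class DeltaCondition {
    static final int MAX_FETCHED = 150; // threshold taken from the example
    static final int MIN_DELTA = 1;     // delta_fetched must be greater than this

    /** Pages fetched during the last round, as a difference of cumulative counts. */
    static int deltaFetched(int fetchedNow, int fetchedBefore) {
        return fetchedNow - fetchedBefore;
    }

    /** True while the host should still be generated. */
    static boolean shouldGenerate(int fetchedNow, int fetchedBefore) {
        return fetchedNow < MAX_FETCHED
            && deltaFetched(fetchedNow, fetchedBefore) > MIN_DELTA;
    }

    public static void main(String[] args) {
        int[] perRound = {1, 10, 80, 1}; // pages fetched in rounds 1 to 4
        int previous = 0;
        int total = 0;
        for (int round = 1; round <= perRound.length; round++) {
            previous = total;
            total += perRound[round - 1];
            // Print the state the generator would see before the next round.
            System.out.printf("after round %d: fetched=%d delta=%d generate=%b%n",
                round, total, deltaFetched(total, previous),
                shouldGenerate(total, previous));
        }
    }
}
```

With the numbers from the example, the check holds after rounds 2 and 3 and becomes false after round 4 (fetched 92, delta 1), so round 5 is not generated. It is also false after round 1, where the delta is 1, so a real expression would presumably need to exempt the first round; the issue does not address this.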
> I see two possible solutions:
> 1. In the HostDatum class, apart from the current statistics, include the statistics of the last step.
> class PagesStatistics
> {
>     protected int unfetched = 0;
>     protected int fetched = 0;
>     protected int notModified = 0;
>     protected int redirTemp = 0;
>     protected int redirPerm = 0;
>     protected int gone = 0;
> }
> Inside HostDatum:
>     private PagesStatistics currentStatistics;
>     private PagesStatistics previousStepStatistics;
> And update both in UpdateHostDb. *The main problem is space: in generate, HostDatum is stored in a dictionary (in RAM).*
> 2. Include metadata flag(s) in HostDatum and store them as a field in HostDatum (Metadata.StopGenerate = true/false). Calculate the value of StopGenerate in UpdateHostDb.
> *The main advantage is space: we store only a flag in the db. The main problem is the lack of flexibility in Generate.*
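
To make the trade-off between the two options concrete, here is a minimal Java sketch. It is not the actual Nutch code: `HostOption1` and `HostOption2` are hypothetical stand-ins for HostDatum, a plain `Map` stands in for its metadata, and the `update` methods only approximate what an update step might do. The thresholds 150 and 1 are taken from the example in the issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Per-host counters, mirroring the fields proposed in the issue. */
class PagesStatistics {
    int unfetched, fetched, notModified, redirTemp, redirPerm, gone;
}

/** Option 1 (hypothetical stand-in for HostDatum): keep two snapshots. */
class HostOption1 {
    PagesStatistics currentStatistics = new PagesStatistics();
    PagesStatistics previousStepStatistics = new PagesStatistics();

    /** Roll the current snapshot into "previous", then store the new counts. */
    void update(PagesStatistics fresh) {
        previousStepStatistics = currentStatistics;
        currentStatistics = fresh;
    }

    /** Available at generate time, so any delta expression can be evaluated there. */
    int deltaFetched() {
        return currentStatistics.fetched - previousStepStatistics.fetched;
    }
}

/** Option 2 (hypothetical stand-in for HostDatum): keep one snapshot plus a flag. */
class HostOption2 {
    PagesStatistics currentStatistics = new PagesStatistics();
    final Map<String, String> metadata = new HashMap<>();

    /** The delta is computed once, during the update, and only the result is kept. */
    void update(PagesStatistics fresh) {
        int delta = fresh.fetched - currentStatistics.fetched;
        boolean keepGoing = fresh.fetched < 150 && delta > 1;
        metadata.put("StopGenerate", Boolean.toString(!keepGoing));
        currentStatistics = fresh;
    }

    /** All that generate can see: the condition itself is fixed at update time. */
    boolean stopGenerate() {
        return Boolean.parseBoolean(metadata.get("StopGenerate"));
    }
}
```

In this sketch option 1 carries a second set of six counters for every host, while option 2 carries a single metadata entry but cannot evaluate a different condition in generate without rerunning the update.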
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)