Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2014/02/14 12:02:19 UTC

[jira] [Commented] (NUTCH-1525) Generator to record external links even when db.ignore.external.links set to true

    [ https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901313#comment-13901313 ] 

Lewis John McGibbney commented on NUTCH-1525:
---------------------------------------------

[~sabio], thank you for the patch. I had completely forgotten about this issue.
Can we check whether Hadoop counters could be used as well as, or instead of, simple logging?
With counters it would be much easier to analyze how many external links we filter out.
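
To make the counter idea concrete, here is a minimal, self-contained sketch of the tallying logic only. It is not taken from the attached patch and does not use the Nutch or Hadoop APIs: the class name, method names and counter names are all hypothetical, and "external" is assumed to mean "host differs from the host of the source page". Inside a real Hadoop task the map update would presumably be replaced by something like context.getCounter(group, name).increment(1), but where exactly that hooks into Nutch is an assumption.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ExternalLinkCounter {

    // Hypothetical counter names, chosen for illustration only.
    static final String INTERNAL = "internal_links";
    static final String EXTERNAL = "external_links_ignored";
    static final String MALFORMED = "malformed_links";

    /** Returns the lower-cased host of a URL, or null if none can be extracted. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // not parseable as a URI
        }
    }

    /**
     * Tallies the outlinks of one page into named counters.
     * A plain map stands in for the Hadoop counter group here.
     */
    static Map<String, Long> count(String fromUrl, List<String> outlinks) {
        Map<String, Long> counters = new LinkedHashMap<>();
        counters.put(INTERNAL, 0L);
        counters.put(EXTERNAL, 0L);
        counters.put(MALFORMED, 0L);

        String fromHost = hostOf(fromUrl);
        for (String link : outlinks) {
            String host = hostOf(link);
            String name;
            if (host == null) {
                name = MALFORMED;
            } else if (host.equals(fromHost)) {
                name = INTERNAL;
            } else {
                name = EXTERNAL; // would be dropped when external links are ignored
            }
            counters.merge(name, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        Map<String, Long> counters = count(
            "http://nutch.apache.org/index.html",
            Arrays.asList(
                "http://nutch.apache.org/about.html",
                "https://issues.apache.org/jira/browse/NUTCH-1525",
                "http://example.com/"));
        counters.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Compared with a line per link in hadoop.log, a tally like this gives one number per job that can be read from the job summary, at the cost of not recording which links were dropped.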

> Generator to record external links even when  db.ignore.external.links set to true
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-1525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1525
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: nutch-logExternal.patch
>
>
> When fetching pages from specific domains we have various options, e.g. use urlfilters, or set the above property to true before injecting URLs into the webdb. However, with the former, complex regexes are known to slow down processing, and with the latter we disregard a number of URLs which could become useful in the future.
> Unfortunately there is no way to record the external links encountered for future processing, although the wiki suggests that a very small patch to the generator code allows these links to be logged to hadoop.log. Although this is better, a more robust storage mechanism would be preferred. This may tie in with custom counters we've already specified or may require new counters to be implemented.


