Posted to dev@nutch.apache.org by "Marcin Okraszewski (JIRA)" <ji...@apache.org> on 2007/09/20 22:17:50 UTC

[jira] Updated: (NUTCH-488) Avoid parsing unnecessary links and get a more relevant outlink list

     [ https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-488:
-------------------------------------

    Attachment: ignore_tags_v2.patch

Yet another patch. The differences from Emmanuel's patch are:
- there is a single property that controls which links are taken: simply a comma-separated list of tags to ignore.
- all tags can be blocked, including "a". Emmanuel's proposal does not allow turning off a, area, frame, iframe.
- the comma-separated property seems more flexible if a new tag is added later...
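
To make the idea concrete, here is a minimal sketch of how a comma-separated ignore list could be turned into the set of tags that are still followed. It is not taken from the patch: the class name, the method names and the list of link-bearing tags are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not the code from ignore_tags_v2.patch.
final class IgnoreTagsSketch {

    // Assumed list of tags a parser might extract links from (illustrative).
    static final Set<String> LINK_TAGS = new LinkedHashSet<>(Arrays.asList(
            "a", "area", "form", "frame", "iframe", "script", "link", "img"));

    // Split the property value on commas, trim whitespace, lower-case each
    // entry and drop empty entries. A null value means "ignore nothing".
    static Set<String> parseIgnoredTags(String propertyValue) {
        Set<String> ignored = new LinkedHashSet<>();
        if (propertyValue == null) {
            return ignored;
        }
        for (String raw : propertyValue.split(",")) {
            String tag = raw.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                ignored.add(tag);
            }
        }
        return ignored;
    }

    // Tags that are still followed once the ignore list has been applied.
    // Any tag can be switched off this way, including "a".
    static Set<String> activeTags(String propertyValue) {
        Set<String> active = new LinkedHashSet<>(LINK_TAGS);
        active.removeAll(parseIgnoredTags(propertyValue));
        return active;
    }

    public static void main(String[] args) {
        // prints [a, area, frame, iframe, link]
        System.out.println(activeTags("script, IMG,form"));
    }
}
```

With this shape, supporting a new tag only means adding it to the tag list; the property format itself does not change.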

The patch was done against trunk. I hope to see it in Nutch 1.0 :)

> Avoid parsing unnecessary links and get a more relevant outlink list
> --------------------------------------------------------------------
>
>                 Key: NUTCH-488
>                 URL: https://issues.apache.org/jira/browse/NUTCH-488
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: Windows, Java 1.5
>            Reporter: Emmanuel Joke
>         Attachments: DOMContentUtils.patch, ignore_tags_v2.patch, nutch-default.xml.patch
>
>
> The NekoHTML parser uses a method to extract all outlinks from the HTML page. It extracts them from the HTML content based on the list of parameters defined in the method setConf(). This list of links is then truncated to the maximum number of outlinks processed per page, as defined in nutch-default.xml (db.max.outlinks.per.page = 100 by default), and finally it is passed through all the configured URL filters.
> Unfortunately, the list of outlinks can contain more than 100 entries, in which case it is truncated and some relevant links may be removed.
> So I've added a few options to nutch-default.xml to enable/disable the extraction of links from specific HTML tags in this parser (SCRIPT, IMG, FORM, LINK).
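
The truncation problem described in the quoted report can be illustrated with a small, self-contained sketch. It is only a model of the behaviour, not Nutch code: the class and method names, the tag/URL pairs and the limit of 2 (standing in for db.max.outlinks.per.page) are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of outlink collection under a per-page limit.
final class OutlinkLimitSketch {

    // Each link is a {tagName, url} pair. Links from ignored tags are skipped
    // before they can use up the limit; collection stops at maxOutlinks.
    static List<String> collect(String[][] links, Set<String> ignoredTags, int maxOutlinks) {
        List<String> kept = new ArrayList<>();
        for (String[] link : links) {
            if (kept.size() >= maxOutlinks) {
                break;
            }
            if (ignoredTags.contains(link[0])) {
                continue;
            }
            kept.add(link[1]);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[][] page = {
            {"img", "http://example.com/logo.png"},
            {"img", "http://example.com/banner.png"},
            {"a", "http://example.com/article1.html"},
            {"a", "http://example.com/article2.html"},
        };
        // Nothing ignored: the two image links fill the limit and both
        // article links are lost.
        System.out.println(collect(page, new HashSet<String>(), 2));
        // "img" ignored: the limit is spent on the article links instead.
        System.out.println(collect(page, new HashSet<>(Arrays.asList("img")), 2));
    }
}
```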
