Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2013/01/12 20:30:12 UTC

[jira] [Resolved] (NUTCH-734) option to filter "a" tag text

     [ https://issues.apache.org/jira/browse/NUTCH-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-734.
----------------------------------------

    Resolution: Won't Fix

This is simply not required and dated. Plus I assume by referring to "a", we mean stop words. These are filtered during the IR process in (all?) modern indexing servers. 
                
> option to filter "a" tag text
> -----------------------------
>
>                 Key: NUTCH-734
>                 URL: https://issues.apache.org/jira/browse/NUTCH-734
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: ron
>
> Motivation:
> When fetching pages with "menu links", the menus (for example, search) appear on all pages of the site. Searching for the word "search" then returns all pages of the site, instead of just returning the search page.
> Change request:
> Add options to filter texts of "a" tags, or more generally add filters to avoid texts within specific tags.
> I have worked around this by changing DOMContentUtils.getTextHelper : 
>      if (nodeType == Node.TEXT_NODE && !(currentNode.getParentNode() != null && "a".equalsIgnoreCase(currentNode.getParentNode().getNodeName()))) 
> - Ron
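
The workaround quoted above can be sketched as a standalone example. This is only an illustration built on the standard org.w3c.dom API, not Nutch's actual DOMContentUtils code: the class name TagTextFilterSketch, the methods parse and extractText, and the skippedTags parameter are all hypothetical. It assumes well-formed input and Java 9 or later (for Set.of).

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: collects text from a DOM tree while skipping text
// nodes whose direct parent is one of the configured tags (e.g. "a").
class TagTextFilterSketch {

    // Parses a well-formed XML/XHTML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the space-separated text under root, excluding skipped tags.
    // skippedTags is expected to contain lower-case tag names.
    static String extractText(Node root, Set<String> skippedTags) {
        StringBuilder sb = new StringBuilder();
        collect(root, skippedTags, sb);
        return sb.toString().trim();
    }

    private static void collect(Node node, Set<String> skippedTags, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Same idea as the quoted condition: look at the direct parent only.
            Node parent = node.getParentNode();
            boolean skip = parent != null
                    && skippedTags.contains(parent.getNodeName().toLowerCase(Locale.ROOT));
            if (!skip) {
                String text = node.getNodeValue().trim();
                if (!text.isEmpty()) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(text);
                }
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), skippedTags, sb);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><body><a href=\"/search\">search</a><p>Welcome to the site</p></body></html>");
        System.out.println(extractText(doc, Set.of("a"))); // link text is dropped
        System.out.println(extractText(doc, Set.of()));    // link text is kept
    }
}
```

As with the quoted one-line check, only the immediate parent is inspected, so text nested deeper inside a link (for example inside a "b" element within an "a" element) would still be collected.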

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira