Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/11/07 21:13:50 UTC

[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540873 ] 

Doğacan Güney commented on NUTCH-574:
-------------------------------------

I respectfully disagree. IMHO, inlink anchor text is one of the most descriptive things about a page. If inlink anchor text is as noisy as you suggest, then we should work on eliminating that noise; I don't think that 'disabling' it is the answer. Some ideas:

* We may try reducing inlink text importance (by readjusting its boost). 

* We may ignore inlink anchor text if it is completely unrelated to the parse text, i.e., none of the anchor words actually appear on the page (I think Google does something similar to avoid Google bombs).

* We may ignore inlink text from untrusted sites/low-score sites.
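
The second and third ideas can be sketched roughly as follows. This is only an illustration in plain Java, not Nutch code: the class AnchorFilterSketch, the Inlink holder, the selectAnchors method, and the minSourceScore threshold are all hypothetical names, and the tokenizer is a naive stand-in for a real analyzer. Adjusting the boost (the first idea) is not shown.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnchorFilterSketch {

    /** Hypothetical holder: one inlink's anchor text and the score of the linking page. */
    public static final class Inlink {
        final String anchor;
        final float sourceScore;

        public Inlink(String anchor, float sourceScore) {
            this.anchor = anchor;
            this.sourceScore = sourceScore;
        }
    }

    /** Naive tokenizer (assumption): lowercase, split on anything that is not a letter or digit. */
    static Set<String> tokenize(String text) {
        Set<String> words = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /**
     * Keeps an anchor only if the linking page scores at least minSourceScore
     * and at least one anchor word also appears in the page's parse text.
     */
    public static List<String> selectAnchors(List<Inlink> inlinks, String parseText,
                                             float minSourceScore) {
        Set<String> pageWords = tokenize(parseText);
        List<String> kept = new ArrayList<>();
        for (Inlink in : inlinks) {
            if (in.sourceScore < minSourceScore) {
                continue; // untrusted / low-score source
            }
            Set<String> overlap = tokenize(in.anchor);
            overlap.retainAll(pageWords);
            if (!overlap.isEmpty()) {
                kept.add(in.anchor); // anchor shares at least one word with the page
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Inlink> inlinks = new ArrayList<>();
        inlinks.add(new Inlink("dallas hotels", 5.0f)); // unrelated to the page text
        inlinks.add(new Inlink("google search", 5.0f)); // related, trusted source
        inlinks.add(new Inlink("search engine", 0.1f)); // related, low-score source
        System.out.println(selectAnchors(inlinks, "Google Search I'm Feeling Lucky", 1.0f));
        // prints [google search]
    }
}
```

With a filter like this, the "dallas hotels" anchors pointing at google.com would be dropped before indexing, while descriptive anchors would still be kept. The threshold and the one-shared-word rule are arbitrary choices for the sketch.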


> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-574
>                 URL: https://issues.apache.org/jira/browse/NUTCH-574
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-574-1.patch
>
>
> Currently the basic indexing filter includes inbound anchor text for a given URL in the index.  This sometimes allows pages to show up in search results where they may not be relevant.  An example of this is a search for "dallas hotels" in our production index (www.visvo.com).  Google shows up first in this example although there is no text matching either "dallas" or "hotels" on the Google home page.  What is happening here is that there are inlinks into Google containing the words "dallas" and "hotels", which get included in the index for google.com, and because Google has a very high boost due to inlinks, it shows up first for these search terms.  I propose we add an option to allow/prevent inlink anchor text from being included in the index and set the default for this option to NOT include inbound link anchor text.
