Posted to dev@nutch.apache.org by "Dennis Kubes (JIRA)" <ji...@apache.org> on 2007/11/07 20:45:51 UTC

[jira] Created: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Including inlink anchor text in index can create irrelevant search results.
---------------------------------------------------------------------------

                 Key: NUTCH-574
                 URL: https://issues.apache.org/jira/browse/NUTCH-574
             Project: Nutch
          Issue Type: Bug
          Components: indexer
         Environment: All, basic indexing filter
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.0.0


Currently the basic indexing filter includes inbound anchor text for a given URL in the index.  This sometimes allows pages to show up in search results where they may not be relevant.  An example of this is a search for "dallas hotels" in our production index (www.visvo.com).  Google would show up first in this example although there is no text matching either dallas or hotels on the google home page.  What is happening here is there are inlinks into google with the words dallas and hotels which get included in the index for google.com and because google would have a very high boost due to inlinks, google shows up first for these search terms.  I propose we add an option to allow/prevent inlink anchor text from being included in the index and set the default for this option to NOT include inbound link anchor text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542196 ] 

Hudson commented on NUTCH-574:
------------------------------

Integrated in Nutch-Nightly #265 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/265/])




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542103 ] 

Doğacan Güney commented on NUTCH-574:
-------------------------------------

There are still some tabs there (for example, in src/plugin/build.xml).

+1 from me.




[jira] Resolved: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-574.
--------------------------------

    Resolution: Fixed

This has been committed.  Inbound anchor text indexing has moved from index-basic to the index-anchor plugin, and the nutch-default.xml file now loads index-anchor by default.  Thanks to all for comments and suggestions.




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540873 ] 

Doğacan Güney commented on NUTCH-574:
-------------------------------------

I respectfully disagree. IMHO, inlink anchor text is one of the most descriptive things about a page. If inlink anchor text has too much noise, as you suggest, then we must work on eliminating that noise; I don't think that 'disabling' it is the answer. Some ideas:

* We may try reducing inlink text importance (by readjusting its boost).

* We may ignore inlink anchor text if the anchor text and the parse text are completely unrelated, i.e. none of the words actually appear on the page (I think Google does something similar to avoid Google bombs).

* We may ignore inlink text from untrusted or low-score sites.
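The first and third ideas can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the Inlink type, the per-source score, the threshold and the damping factor are all hypothetical and are not part of the Nutch API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; none of these types are Nutch classes.
final class AnchorSourceFilter {

    /** An inbound link: its anchor text plus an assumed score for the linking page. */
    static final class Inlink {
        final String anchor;
        final double sourceScore;

        Inlink(String anchor, double sourceScore) {
            this.anchor = anchor;
            this.sourceScore = sourceScore;
        }
    }

    /** Keeps only anchors whose linking page scores at least minSourceScore. */
    static List<String> trustedAnchors(List<Inlink> inlinks, double minSourceScore) {
        List<String> kept = new ArrayList<>();
        for (Inlink inlink : inlinks) {
            if (inlink.sourceScore >= minSourceScore) {
                kept.add(inlink.anchor);
            }
        }
        return kept;
    }

    /** Scales the boost used for the anchor field by a damping factor (e.g. 0.5 halves it). */
    static double dampedBoost(double documentBoost, double damping) {
        return documentBoost * damping;
    }
}
```

With a threshold of 1.0, an anchor such as "dallas hotels" coming from a page scored 0.1 would be dropped, while anchors from higher-scored pages would still be indexed.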





[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541326 ] 

Enis Soztutar commented on NUTCH-574:
-------------------------------------

Why don't you just refactor the anchor indexing code into another plugin, say index-anchor, enabled by default? Then all you need to do is to not use that plugin but only index-basic, right? That way we can avoid adding to the never-ending list of configuration parameters *smile*.

> The current idea is to have three options. An always include, never include, and include if matches text on page.
In another issue, we can add a new plugin called index-anchor-matching that does its thing. Choosing from a list of plugins is the beauty of the plugin system after all.




[jira] Closed: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-574.
-------------------------------


>         Attachments: NUTCH-574-1.patch, NUTCH-574-2.patch, NUTCH-574-3.patch, NUTCH-574-4.patch



[jira] Updated: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-574:
-------------------------------

    Attachment: NUTCH-574-1.patch

Adds a config option to nutch-default.xml to prevent including inbound anchor text in the index.




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541631 ] 

Enis Soztutar commented on NUTCH-574:
-------------------------------------

> Is this the type of process you were talking about with selecting most frequent words?
Yes, something like that, though I have not thought it through entirely.

Andrzej, what we propose is actually a very challenging task, but it is better that we do not rule it out before careful analysis. All the commercial search engines based on nutch will definitely benefit from such an advanced feature. What Dennis proposes sounds good and can even be the default behavior once it is justified with enough statistical/scientific material.

Dennis, have you done some analysis on anchor text? If so, I think we had better get started *smile*.




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542135 ] 

Andrzej Bialecki  commented on NUTCH-574:
-----------------------------------------

+1 for the patch. -1 for the tabs ;)




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541377 ] 

Dennis Kubes commented on NUTCH-574:
------------------------------------

It may be a little complex, but we could do some type of scoring.  For instance, for every word give something like a + (1 * (frequency over all links / total number of links)), then sort highest to lowest, taking only the best links.  This way, if "car manufacturer" does show up consistently in links, it will be indexed for the page, and something like "dallas hotels", which shows up only once each for google, will not.  Is this the type of process you were talking about with selecting most frequent words?
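A rough Java sketch of that scoring idea follows. It is one possible reading, not the patch itself: the score here is simply the fraction of inlink anchors that contain a word (the constant term is left out), and the class name, the tokenization and the cutoff parameters are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch; not part of Nutch.
final class AnchorWordScorer {

    /** Maps each word to (number of anchors containing it) / (total number of anchors). */
    static Map<String, Double> scores(List<String> anchors) {
        Map<String, Integer> counts = new HashMap<>();
        for (String anchor : anchors) {
            // Count each word at most once per anchor.
            Set<String> words = new HashSet<>();
            for (String word : anchor.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
            for (String word : words) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            result.put(entry.getKey(), entry.getValue() / (double) anchors.size());
        }
        return result;
    }

    /** Returns up to limit words scoring at least minScore, highest score first. */
    static List<String> topWords(List<String> anchors, int limit, double minScore) {
        return scores(anchors).entrySet().stream()
                .filter(entry -> entry.getValue() >= minScore)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Double>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Words that appear in only one anchor out of many fall below any reasonable minScore and are not indexed, while words that recur across inlinks survive the cut.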




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541565 ] 

Doğacan Güney commented on NUTCH-574:
-------------------------------------

Dennis, it seems you forgot to add the java files for the index-anchor plugin :) Also, I am not sure why you are modifying build.xml...

Btw, it may be a good time to remove "throws IOException" from inlinks.getAnchors since no path in getAnchors throws an IOException.




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541653 ] 

Andrzej Bialecki  commented on NUTCH-574:
-----------------------------------------

I don't rule it out - I support the patch as is, i.e. separating the anchor indexing from index-basic. My point was that anchor text is a complicated issue, and how you use anchor depends on your requirements - in other words, I think it may be difficult to find a more advanced solution that would satisfy most users.

Some comments to the latest patch:

* I think it would be good to put a NOTE: in CHANGES.txt that reminds users who wish to keep the current behavior that they should make sure that their nutch-default / nutch-site.xml contain this plugin in plugin.includes.

* There are literal Tab characters in plugin/build.xml - they should be converted to spaces.

Other than that I think the patch can be applied as is, and we should continue the discussion :)
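For anyone wanting to verify their own configuration, here is a small Java sketch of such a check. It assumes that the plugin.includes value is a regular expression that must match a plugin id in full; the helper class and the two sample values are illustrative only and are not the shipped defaults.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of Nutch.
final class PluginIncludesCheck {

    /** Returns true if pluginId fully matches the given plugin.includes expression. */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Illustrative values, not the defaults shipped in nutch-default.xml.
        String withAnchor = "protocol-http|parse-(text|html)|index-(basic|anchor)";
        String basicOnly = "protocol-http|parse-(text|html)|index-basic";

        System.out.println(isIncluded(withAnchor, "index-anchor")); // anchor indexing kept
        System.out.println(isIncluded(basicOnly, "index-anchor"));  // anchor indexing dropped
    }
}
```

A site configuration written like the second value would silently stop indexing anchor text after this change, which is the case the CHANGES.txt note is meant to flag.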




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540889 ] 

Dennis Kubes commented on NUTCH-574:
------------------------------------

Ok, talked with Doğacan.  The current idea is to have three options: always include, never include, and include if the anchor matches text on the page.  The default option would be include-if-match, to preserve backwards compatibility while allowing indexes to return more relevant results.  So I am going to work up a new patch for this, and should have it posted sometime today or tomorrow.
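The three options could look roughly like the Java sketch below. The enum, the method names and the matching rule (keep an anchor if at least one of its words appears in the page text) are assumptions made for illustration; no such setting is described elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the three-option idea; not part of Nutch.
final class AnchorInclusion {

    enum Mode { ALWAYS, NEVER, IF_MATCHES }

    /** Returns the anchors that would be indexed for a page under the given mode. */
    static List<String> select(Mode mode, List<String> anchors, String pageText) {
        switch (mode) {
            case ALWAYS:
                return new ArrayList<>(anchors);
            case NEVER:
                return new ArrayList<>();
            default: // IF_MATCHES: keep anchors sharing at least one word with the page
                Set<String> pageWords = words(pageText);
                List<String> kept = new ArrayList<>();
                for (String anchor : anchors) {
                    if (!Collections.disjoint(words(anchor), pageWords)) {
                        kept.add(anchor);
                    }
                }
                return kept;
        }
    }

    /** Lower-cases the text and splits it on non-word characters (ASCII-oriented). */
    private static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Under IF_MATCHES, an anchor like "dallas hotels" pointing at a page that mentions neither word would be skipped, which is the case the original report complains about.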




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541582 ] 

Doğacan Güney commented on NUTCH-574:
-------------------------------------

+1 from me (though I still think that we can remove IOException:)




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541978 ] 

Dennis Kubes commented on NUTCH-574:
------------------------------------

Oh, and I agree that this is just the first basic patch.  I think we should definitely continue the discussion, possibly even having multiple different plugin implementations for indexing anchor text so users can pick and choose.  This is just a first step.




[jira] Updated: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-574:
-------------------------------

    Attachment: NUTCH-574-2.patch

Basic patch that refactors anchor indexing into its own plugin.




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541347 ] 

Dennis Kubes commented on NUTCH-574:
------------------------------------

I agree, refactoring the code to a plugin is a better solution.  Have started down that path.  One issue is how we are doing the matching.

The initial problem is that we are indexing words for pages that don't contain those words, but for words that are contained in the page we want the boost factor.  So as I see it there are two options.  

1) We can be strict and say if an inbound link contains *any* anchor text that is not currently in the page then we don't index the entire link.  
2) We can manipulate the text of the anchor, removing any words in the anchor text that do not appear in the page and in effect not indexing those words.  

I am leaning toward the second option of indexing all links but removing words.  I think it is likely that some of the words in a link will be on the page and some will not, and we want to include those that are and exclude those that are not.  Would like opinions on this.
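
To make the two options concrete, here is a minimal sketch in plain Java. It is not taken from any of the attached patches: the class AnchorTextFilter, its method names, and the simple lowercase/split tokenization are all assumptions made for illustration, and a real implementation would presumably reuse the analyzer that is applied to the page text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Nutch) illustrating options 1 and 2.
class AnchorTextFilter {

  // Lowercase the text and split on runs of non-letter, non-digit characters.
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (!t.isEmpty()) {
        tokens.add(t);
      }
    }
    return tokens;
  }

  // Option 1 (strict): keep the anchor only if every one of its words is on the page.
  static boolean keepWholeAnchor(String anchor, Set<String> pageWords) {
    return pageWords.containsAll(tokenize(anchor));
  }

  // Option 2 (lenient): keep only those anchor words that also appear on the page.
  static String stripUnknownWords(String anchor, Set<String> pageWords) {
    List<String> kept = new ArrayList<>();
    for (String t : tokenize(anchor)) {
      if (pageWords.contains(t)) {
        kept.add(t);
      }
    }
    return String.join(" ", kept);
  }
}
```

With a page whose words are {cheap, hotels, and, flights}, option 1 would drop the anchor "dallas hotels" entirely, while option 2 would index just "hotels".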




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540875 ] 

Dennis Kubes commented on NUTCH-574:
------------------------------------

I am ok with ignoring it if it doesn't appear in the page text.  My goal here is to eliminate irrelevant sites from showing in search results.  I will create a patch that handles this by checking the parse text and ignoring the anchor text if it doesn't appear in the page.  Sound good?




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541359 ] 

Enis Soztutar commented on NUTCH-574:
-------------------------------------

Honestly, I don't think that not indexing anchor words that do not appear in the web site text is a wise solution. What made Google so successful is indexing anchor text + PR, the classic example being that the page http://www.honda.com/ never mentions that Honda is a car manufacturer, but the anchor text does.   

That said, I think we should focus on finding a way to eliminate the noise on anchor text. At this point we take the first 10K links and discard the others, due to size constraints. But a better way would be to select the best ones, or select the most frequent words, etc. 
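
As a rough illustration of the "select the most frequent words" idea, the sketch below counts how many inbound anchors contain each word and returns the top N. The class and method names are hypothetical and this is not how Nutch currently selects anchors; it also makes no attempt to remove stopwords, which would need to be handled separately.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Nutch): ranks anchor words by how many
// inbound anchors mention them.
class AnchorWordRanker {

  static List<String> topWords(List<String> anchors, int n) {
    Map<String, Integer> counts = new HashMap<>();
    for (String anchor : anchors) {
      // Count each word at most once per anchor so one long anchor cannot dominate.
      Set<String> seen = new HashSet<>();
      for (String t : anchor.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
        if (!t.isEmpty() && seen.add(t)) {
          counts.merge(t, 1, Integer::sum);
        }
      }
    }
    // Sort by descending count, breaking ties alphabetically for a stable result.
    return counts.entrySet().stream()
        .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
            .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
        .limit(n)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}
```

Under this scheme a word that shows up in only a handful of the inlinks to a heavily linked page would fall below the cut-off, while a word used by most inlinks would survive.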







Re: [jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by Matt Kangas <ka...@gmail.com>.
+1 on making it into a plugin. Echoing Chris & Andrzej's points -- if  
Dennis wants to try a novel treatment of inlink text, why not give  
him a way to do so, so long as the current strategy remains the default?

With luck, experimentation will lead to a better default strategy  
over time.

--matt

On Nov 9, 2007, at 3:25 PM, Andrzej Bialecki (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/NUTCH-574? 
> page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
> tabpanel#action_12541428 ]
>
> Andrzej Bialecki  commented on NUTCH-574:
> -----------------------------------------
>
> +1 on making it into a plugin (e.g. index-anchors). -1 on  
> implementing any sort of filtering - as Enis pointed out, the issue  
> is complicated in itself, and additionally depends on the user  
> requirements. I propose the following: let's implement a basic  
> version (which is implemented now in the form of LinkDb.getAnchors 
> ()), and leave users the freedom to complicate away if they wish to  
> do so.
>
> Re: scoring - this is again tricky, because the top-N most frequent  
> words happen to be stopwords, and if that's the case you need to  
> know the language of the corpus in order to properly detect them  
> and remove from the top-N ... very messy.
>

--
Matt Kangas / kangas@gmail.com



[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541428 ] 

Andrzej Bialecki  commented on NUTCH-574:
-----------------------------------------

+1 on making it into a plugin (e.g. index-anchors). -1 on implementing any sort of filtering - as Enis pointed out, the issue is complicated in itself, and additionally depends on the user requirements. I propose the following: let's implement a basic version (which is implemented now in the form of LinkDb.getAnchors()), and leave users the freedom to complicate away if they wish to do so.

Re: scoring - this is again tricky, because the top-N most frequent words happen to be stopwords, and if that's the case you need to know the language of the corpus in order to properly detect them and remove from the top-N ... very messy.




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540877 ] 

Chris A. Mattmann commented on NUTCH-574:
-----------------------------------------

IMHO what Dennis suggests is fine so long as it's a configurable option that doesn't change the default behavior of the system. That is to say, if Dennis wants to make it something that you can turn on or off in the nutch-default.xml file, commit the default as off (i.e., the way Nutch behaves now), and then in his own local environment simply set it to "on" and maintain that conf file locally, then it's probably something that we should think about. It seems to support a use case that Dennis has, and we don't want to shut anyone's use case out if it can be supported with a configurable option.

My +1 for the patch so long as the default doesn't change Nutch's existing behavior.
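
A minimal sketch of such a switch follows. The property name "indexer.anchor.filter.enabled" is made up for illustration, and java.util.Properties stands in for Nutch's own configuration loading so that the example is self-contained. The point is only that a missing setting falls back to the existing behavior.

```java
import java.util.Properties;

// Hypothetical configuration helper (not part of Nutch).
class AnchorFilterConfig {

  // Assumed property name; the real name would be defined in nutch-default.xml.
  static final String FILTER_ENABLED_KEY = "indexer.anchor.filter.enabled";

  // Returns true only when the option is explicitly switched on.
  // An absent value means "off", i.e. anchor text is indexed exactly as before.
  static boolean filterEnabled(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(FILTER_ENABLED_KEY, "false"));
  }
}
```

A site that wants the new behavior would set the property to "true" in its local configuration, and everyone else would see no change.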




[jira] Updated: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-574:
-------------------------------

    Attachment: NUTCH-574-4.patch

Ok, tabs removed.  Changes.txt updated.  If everyone is good with this patch, I will commit tonight or tomorrow morning.




[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542712 ] 

Hudson commented on NUTCH-574:
------------------------------

Integrated in Nutch-Nightly #267 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/267/])

> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-574
>                 URL: https://issues.apache.org/jira/browse/NUTCH-574
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-574-1.patch, NUTCH-574-2.patch, NUTCH-574-3.patch, NUTCH-574-4.patch
>
>
> Currently the basic indexing filter includes inbound anchor text for a given URL in the index.  This sometimes allows pages to show up in search results where they may not be relevant.  An example of this is a search for "dallas hotels" in our production index (www.visvo.com).  Google would show up first in this example although there is no text matching either dallas or hotels on the google home page.  What is happening here is there are inlinks into google with the words dallas and hotels which get included in the index for google.com and because google would have a very high boost due to inlinks, google shows up first for these search terms.  I propose we add an option to allow/prevent inlink anchor text from being included in the index and set the default for this option to NOT include inbound link anchor text.



[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541508 ] 

Dennis Kubes commented on NUTCH-574:
------------------------------------

So I think what we are really saying is this.  It would be good to make this a plugin and we really don't know what would be the best way to score this right now, but it would be good to experiment with it and find out.  So I am going to make a generic plugin that turns indexing anchor text on and off.  I am also going to create a new extension point from this that will allow creating scoring algorithms for indexing anchor text.  That way we can play around.
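
The shape of that design might look roughly like the sketch below. All names here (AnchorTextSelector, KeepAllAnchors, AnchorIndexingPlugin) are hypothetical and do not come from the patch; the sketch uses plain Java interfaces rather than Nutch's plugin and extension-point machinery, and only shows an on/off switch combined with a replaceable strategy.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical extension point: decides which anchor text is indexed for a page.
interface AnchorTextSelector {
  List<String> select(String url, List<String> anchors, String pageText);
}

// Default strategy: index every inbound anchor unchanged.
class KeepAllAnchors implements AnchorTextSelector {
  @Override
  public List<String> select(String url, List<String> anchors, String pageText) {
    return anchors;
  }
}

// Hypothetical plugin core: an on/off switch plus a pluggable selector.
class AnchorIndexingPlugin {
  private final boolean enabled;
  private final AnchorTextSelector selector;

  AnchorIndexingPlugin(boolean enabled, AnchorTextSelector selector) {
    this.enabled = enabled;
    this.selector = selector;
  }

  // Returns the anchor strings that should be added to the index for this page.
  List<String> anchorsToIndex(String url, List<String> anchors, String pageText) {
    if (!enabled) {
      return Collections.emptyList();
    }
    return selector.select(url, anchors, pageText);
  }
}
```

Experimental strategies would then be alternative AnchorTextSelector implementations, leaving the indexing code itself untouched.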




[jira] Updated: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-574:
-------------------------------

    Attachment: NUTCH-574-3.patch

Dohhh, wrong patch :).  The build.xml changes are actually for another patch I am working on where we can have all plugins in a jar and they can be deployed and called from the jar instead.  Removed those and actually included the java files.  Always a good thing.

