Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2016/03/17 08:15:33 UTC

[jira] [Resolved] (NUTCH-2206) Provide example scoring.similarity.stopword.file

     [ https://issues.apache.org/jira/browse/NUTCH-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-2206.
-----------------------------------------
    Resolution: Fixed

Thank you [~sujenshah], the plugin and stop words are good.

> Provide example scoring.similarity.stopword.file
> ------------------------------------------------
>
>                 Key: NUTCH-2206
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2206
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, scoring
>    Affects Versions: 1.11
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>         Attachments: NUTCH-2206.patch, NUTCH-2206.patch
>
>
> The scoring-similarity plugin does not provide an example file for the property scoring.similarity.stopword.file.
> This is a problem for two reasons:
>  * A user does not know what the file is meant to look like, and
>  * We always check for this file and will [throw an exception if it is not found|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/DocumentVector.java#L79-L80]. The user may not notice this until much later.
> I suggest a simple fix: include the [standard English stop words taken from Lucene's StopAnalyzer|https://github.com/apache/lucene-solr/blob/3f38aba02ce37c6422875d8824ee034d42d635b9/solr/contrib/morphlines-core/src/test-files/solr/collection1/conf/lang/stopwords_en.txt]. The comments in the file will help people customize the list to whatever they require.
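
As a rough illustration of the file format and the fail-fast check described above, here is a minimal sketch in Java. It is not the plugin's actual code: the class name StopWordLoader, the one-word-per-line layout, and the use of '#' for comment lines are all assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch: loads a stop word list. */
public class StopWordLoader {

  /**
   * Reads one stop word per line. Blank lines and lines starting
   * with '#' are skipped (assumed format, not confirmed by the plugin).
   */
  public static Set<String> load(Reader source) throws IOException {
    Set<String> words = new LinkedHashSet<>();
    try (BufferedReader reader = new BufferedReader(source)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
          continue; // comment or blank line
        }
        words.add(trimmed.toLowerCase(Locale.ROOT));
      }
    }
    return words;
  }

  /** Fails early, with a clear message, when the file is missing. */
  public static Set<String> load(Path file) throws IOException {
    if (!Files.isReadable(file)) {
      throw new FileNotFoundException("Stop word file not found: " + file);
    }
    return load(Files.newBufferedReader(file, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    String sample = "# English stop words\na\nan\n\nThe\n";
    System.out.println(load(new StringReader(sample))); // prints [a, an, the]
  }
}
```

Checking for the file when the configuration is read, rather than when the first document is scored, would surface a missing file right away.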



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)