Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2009/10/20 13:58:00 UTC
[jira] Assigned: (LUCENE-1993) MoreLikeThis - allow to exclude
terms that appear in too many documents (patch included)
[ https://issues.apache.org/jira/browse/LUCENE-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless reassigned LUCENE-1993:
------------------------------------------
Assignee: Michael McCandless
> MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)
> ----------------------------------------------------------------------------------------
>
> Key: LUCENE-1993
> URL: https://issues.apache.org/jira/browse/LUCENE-1993
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Affects Versions: 2.9
> Reporter: Christian Steinert
> Assignee: Michael McCandless
> Attachments: MoreLikeThis.java.patch
>
> Original Estimate: 0.17h
> Remaining Estimate: 0.17h
>
> The MoreLikeThis class generates a likeness query based on a given document. Currently there is no way to keep words that appear in almost all documents out of the likeness query, which makes extensive stop word lists necessary.
> I therefore suggest allowing words to be excluded once they exceed a certain absolute document count or a certain percentage of documents. Depending on the corpus, words that appear in more than 50% or even 70% of documents can usually be considered insignificant for classifying a document.
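
The proposed rule can be illustrated with a minimal, self-contained sketch. This is not the attached patch: the class DocFreqFilter and both method names are hypothetical, and the document frequencies are made-up sample values held in a plain map. In a real index they would presumably come from IndexReader.docFreq(Term) and IndexReader.numDocs().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Lucene or the patch.
class DocFreqFilter {

    // Converts a percentage of the corpus into an absolute document count.
    static int maxDocFreqFromPercent(int percent, int numDocs) {
        return (int) ((long) percent * numDocs / 100);
    }

    // Keeps only the terms whose document frequency does not exceed maxDocFreq.
    static List<String> keepInformativeTerms(Map<String, Integer> docFreqs, int maxDocFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() <= maxDocFreq) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample data (assumed): term -> number of documents containing it.
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("the", 980);
        docFreqs.put("lucene", 120);
        docFreqs.put("and", 910);
        docFreqs.put("tokenizer", 15);

        int numDocs = 1000;
        int limit = maxDocFreqFromPercent(70, numDocs); // 70% of 1000 documents = 700

        System.out.println(keepInformativeTerms(docFreqs, limit)); // prints [lucene, tokenizer]
    }
}
```

Expressing the percentage as an absolute count first means a single comparison covers both variants of the suggestion (absolute limit and percentage limit).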
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.