Posted to dev@nutch.apache.org by "Jerome Charron (JIRA)" <ji...@apache.org> on 2006/05/06 00:56:34 UTC

[jira] Updated: (NUTCH-134) Summarizer doesn't select the best snippets

     [ http://issues.apache.org/jira/browse/NUTCH-134?page=all ]

Jerome Charron updated NUTCH-134:
---------------------------------

    Attachment: summarizer.060506.patch

Here is a patch that adds a summarizer extension point and two summarizer plugins: summarizer-basic (the current Nutch implementation) and summarizer-lucene (the Lucene highlighter implementation).
Please note that the Lucene plugin is a very crude implementation: the highlighter directly constructs a text representation of the summary, so we need to parse that text to build a Summary object! (Improvements are welcome.)

This is a first step toward resolving this issue.
If there are no objections, I will commit this patch in the next few days and then:
1. Fix the original issue reported by Andrzej in summarizer-basic.
2. Add a toString(Encoder, Formatter) method in Summarizer so that a Summary object can be encoded and formatted by different implementations (the same logic as in the Lucene Highlighter) - Andrzej, do you prefer this solution or a solution where Summary is Writable?
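
To make item 2 more concrete, here is a minimal sketch of what such a method could look like. Everything in it is an assumption for illustration only: the Encoder and Formatter interfaces, the simplified Summary class, and the HTML_ENCODER and BOLD instances are hypothetical and are not taken from the patch or from the Lucene API. The method is placed on the Summary stand-in here purely for simplicity, and the code uses modern Java syntax for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class SummaryFormatting {

    /** Hypothetical: escapes raw text for the output medium (HTML, plain text, ...). */
    public interface Encoder {
        String encode(String text);
    }

    /** Hypothetical: decorates an already-encoded query-term match. */
    public interface Formatter {
        String highlight(String encodedTerm);
    }

    /** Hypothetical, simplified Summary: a sequence of plain and highlighted fragments. */
    public static final class Summary {
        private final List<String> texts = new ArrayList<>();
        private final List<Boolean> highlighted = new ArrayList<>();

        public Summary add(String text, boolean isHighlight) {
            texts.add(text);
            highlighted.add(isHighlight);
            return this;
        }

        /** Encodes every fragment, then lets the formatter wrap the highlighted ones. */
        public String toString(Encoder encoder, Formatter formatter) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < texts.size(); i++) {
                String encoded = encoder.encode(texts.get(i));
                sb.append(highlighted.get(i) ? formatter.highlight(encoded) : encoded);
            }
            return sb.toString();
        }
    }

    /** Example encoder: escapes the characters that matter most in HTML text. */
    public static final Encoder HTML_ENCODER =
        text -> text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");

    /** Example formatter: wraps a match in bold tags. */
    public static final Formatter BOLD = term -> "<b>" + term + "</b>";
}
```

With this shape, the Summary object stays free of presentation details, and a caller picks the encoding and highlighting at render time, for example summary.toString(HTML_ENCODER, BOLD).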

PS: Chris, sorry but the major part of this patch was already done when you added your comment.

> Summarizer doesn't select the best snippets
> -------------------------------------------
>
>          Key: NUTCH-134
>          URL: http://issues.apache.org/jira/browse/NUTCH-134
>      Project: Nutch
>         Type: Bug

>   Components: searcher
>     Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev
>     Reporter: Andrzej Bialecki 
>  Attachments: summarizer.060506.patch
>
> Summarizer.java tries to select the best fragments from the input text, i.e. those where the frequency of query terms is highest. However, the logic in line 223 is flawed: the excerptSet.add() operation adds a new excerpt only if it is not already present, and that test is performed using the Comparator, which compares only numUniqueTokens. This means that if two or more excerpts score equally high, only the first of them is retained, and the other equally scoring excerpts are discarded in favor of other (possibly lower-scoring) excerpts.
> To fix this, the Set should be replaced with a List plus a sort operation. To keep the excerpts in their original relative order, the Excerpt class should be extended with an "int order" field, and the collected excerpts should be sorted by that field before they are added to the summary.
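
The problem and the proposed fix described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual Nutch code: the Excerpt class is a hypothetical stand-in with just the fields needed, the sorted set is assumed to behave like a java.util.TreeSet, the method names are made up, and modern Java syntax is used for brevity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class ExcerptSelection {

    /** Hypothetical stand-in for the Excerpt class; only the fields needed here. */
    public static final class Excerpt {
        public final String text;
        public final int numUniqueTokens; // score: distinct query terms in the excerpt
        public final int order;           // position in the source text (the proposed new field)

        public Excerpt(String text, int numUniqueTokens, int order) {
            this.text = text;
            this.numUniqueTokens = numUniqueTokens;
            this.order = order;
        }
    }

    /** Reproduces the reported problem: a sorted set keyed on score alone drops ties. */
    public static int sizeOfScoreKeyedSet(List<Excerpt> excerpts) {
        TreeSet<Excerpt> set =
            new TreeSet<>(Comparator.comparingInt((Excerpt e) -> e.numUniqueTokens));
        // add() treats excerpts with equal numUniqueTokens as duplicates and skips them.
        set.addAll(excerpts);
        return set.size();
    }

    /** Proposed fix: collect in a list, sort by score, keep the best, restore source order. */
    public static List<Excerpt> selectBest(List<Excerpt> excerpts, int max) {
        List<Excerpt> byScore = new ArrayList<>(excerpts);
        // List.sort is stable, so equally scoring excerpts keep their relative order.
        byScore.sort(Comparator.comparingInt((Excerpt e) -> e.numUniqueTokens).reversed());
        List<Excerpt> best = new ArrayList<>(byScore.subList(0, Math.min(max, byScore.size())));
        // Put the selected excerpts back into the order in which they appear in the text.
        best.sort(Comparator.comparingInt((Excerpt e) -> e.order));
        return best;
    }
}
```

For four excerpts with scores 1, 3, 3 and 2, the score-keyed set retains only three of them, because the second excerpt with score 3 is treated as a duplicate, whereas selectBest(excerpts, 2) returns both excerpts with score 3 in their original order.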

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira